Graduate Urban Economics Lecture 11, May 10, 2018

Programming in Empirical Work: Recommendations

Avoid Mistakes

  • Empirical work in economics is becoming more complex: big datasets, difficult merging procedures, complicated data structures (ex: relational databases vs. flat files), and complex analysis (structural estimation, imputations, variables created through machine learning)
  • Given all of this, it is very easy to make mistakes (programming errors)
  • Often you will do some analysis that leads to a surprising result, which you may then try to justify with an economic argument, only to later realize it was simply a coding error
  • This is a huge waste of time: it is worth investing time upfront to try to avoid this problem

Work like a professional programmer

Gentzkow and Shapiro wrote a very helpful document called "Code and Data for the Social Sciences: A Practitioner's Guide"

  • automation: tasks that are done multiple times should be automated
  • version control: use software that automatically keeps track of all changes to programs and documents rather than successively naming files (file_v1, file_v2)
  • professional programmers use Git for version control (ex: GitHub); I mostly rely on Dropbox
  • abstraction: use functions for repeated subroutines; never hard-code anything (ex: reg y x if duration==4) but instead define constants and variables up front (ex: reg y x if duration==${analysis_dur}); see the sketch below
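
A minimal R sketch of the same idea (the data frame and variable names here are illustrative, not from the slides): constants are defined once at the top, and a repeated subroutine becomes a function.

analysis_dur <- 4 #define the analysis constant once, up front
run_spec <- function(df, dur) { #repeated subroutine wrapped in a function
  lm(y ~ x, data = df[df$duration == dur, ]) #no magic numbers in the body
}
#fit <- run_spec(mydata, analysis_dur) #reuse with any duration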

More from Gentzkow and Shapiro

  • directories: use a modular file structure that allows for easy replication and working with co-authors; you want to be able to remember how you did everything if you leave the project for six months (a hypothetical layout is sketched below)
  • use "local" directories: this will allow you to work on multiple computers without issues

My Stata trick:

cap cd "C:\_SHARED" //NS laptop
cap cd "D:\_SHARED" //NS office computer
global projDir: pwd //stores whichever path succeeded in global ${projDir}
cd "$projDir\directory_analysis_misc_entrant_files" //This should be the directory of THIS do-file

My thoughts: working with complicated procedures

Write a test procedure

  • if you do complicated programming (structural estimation, complex data merging) you should first write a test case
  • the idea is you create a test case where you know what the correct result should be
  • anytime you make a major change to your code, you run this test case to make sure it still leads to what you expect
  • I also use this procedure for structural estimation: first simulate the DGP and then see if your estimation can uncover the parameters you chose (see the sketch below)
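
A minimal R sketch of the simulate-then-estimate check (the DGP and parameter values are made up for illustration):

set.seed(123) #make the test reproducible
n <- 10000
beta_true <- c(2, -0.5) #parameters we chose for the DGP
x <- rnorm(n)
y <- beta_true[1] + beta_true[2]*x + rnorm(n) #simulate the DGP
fit <- lm(y ~ x)
coef(fit) #estimates should be close to (2, -0.5)
#stopifnot(max(abs(coef(fit) - beta_true)) < 0.05) #automated check to rerun after code changes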

My thoughts: simple things

Know your data: I have caught countless errors by noticing that the data changed unexpectedly

  • check the observation count after every file change (merge, append, etc.); see the sketch after this list
  • memorize basic descriptive stats (mean, sd) of key variables
  • pay attention to coefficient magnitudes! these should make sense
  • cross validate: can you compare your results, or descriptive stats of your data, to any outside sources?
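
A minimal R sketch of an observation-count check (the data frame and variable names are hypothetical); an assertion that fails loudly beats a count you forget to read:

n_before <- nrow(firms) #record the count before merging
merged <- merge(firms, cities, by = "city_id", all.x = TRUE)
stopifnot(nrow(merged) == n_before) #a 1:m merge would silently add rows
summary(merged$employment) #eyeball mean/sd against the values you memorized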

Basic Mapping

Preliminaries: load in libraries

library(rgdal) #manipulating shape files (.shp)
library(ggplot2) #very advanced plotting/graphics library
library(sp) #spatial data utility
library(maptools) #mapping utility
library(rgeos) #additional geographic utility
library(raster) #raster data
library(spatstat) #spatial statistics
#library(ggmap) #interfacing with Google Maps (VPN required)
library(OpenStreetMap)

Reading in Map Data

#chinaMap<-getData('GADM',country="CHN",level=1) #download data from GADM; getData is part of package raster
ChinaRD0<-readRDS("data/CHN_adm0.rds") #Country outline; data folder is local
ChinaRD1<-readRDS("data/CHN_adm1.rds") #provinces
ChinaRD2<-readRDS("data/CHN_adm2.rds") #prefectures
ChinaRD3<-readRDS("data/CHN_adm3.rds") #counties
plot(ChinaRD0,main="Mainland China country outline") #Plot Mainland China 

Look at properties of data frame

ChinaRD1@data[1:5,1:7] #shows first 5 rows 7 cols of ChinaRD1; could also use: slot(ChinaRD1,"data")[1:10,]
##   OBJECTID ID_0 ISO NAME_0 ID_1    NAME_1 HASC_1
## 1        1   49 CHN  China    1     Anhui  CN.AH
## 2        2   49 CHN  China    2   Beijing  CN.BJ
## 3        3   49 CHN  China    3 Chongqing  CN.CQ
## 4        4   49 CHN  China    4    Fujian  CN.FJ
## 5        5   49 CHN  China    5     Gansu  CN.GS
names(ChinaRD3) #names of columns in data
##  [1] "OBJECTID"  "ID_0"      "ISO"       "NAME_0"    "ID_1"     
##  [6] "NAME_1"    "ID_2"      "NAME_2"    "ID_3"      "NAME_3"   
## [11] "CCN_3"     "CCA_3"     "TYPE_3"    "ENGTYPE_3" "NL_NAME_3"
## [16] "VARNAME_3"
shanghaiCities<-ChinaRD3[ChinaRD3$NAME_2=="Shanghai",] #Selects only cities in Shanghai municipality

Plot geographic subset of data with names

plot(shanghaiCities)
plot(shanghaiCities[shanghaiCities$NAME_3=="Shanghai",],border="blue",add=TRUE) #add=T draws on current plot
text(coordinates(shanghaiCities),
     labels=shanghaiCities$NAME_3,col="red",cex=0.7) #Adds names

Merge population data onto map

popData<-read.table("data/china_pop_9293.csv",header=T,sep=",",stringsAsFactors=FALSE) #examine: str(popData)
ChinaData2 <- slot(ChinaRD2, "data")
names(ChinaData2)[c(6,8)]<-c("province_name","city_name")
table(ChinaData2$province_name) #Show frequency table of prefectures by province
prefectureData<-merge(ChinaData2, popData, by=c("province_name", "city_name"),all.x=TRUE) #merging by two columns
prefectureData$city_name[is.na(prefectureData$pop92)] #lists unmatched cities; matching on names is not great
names(ChinaRD2)[c(6,8)]<-c("province_name","city_name") #SP package allows merging onto SpatialPolygonsDataFrame
popDataSP<-merge(ChinaRD2, popData, by=c("province_name", "city_name"),all.x=TRUE)
#rm(ChinaRD2) #Removes old object--not really necessary
summary(popDataSP) #shows summary statistics on new object

Plot Population Data with ggplot2 (choropleth plot)

gg_map_obj<-fortify(ChinaRD2, region= "OBJECTID")
china_map<-ggplot() + geom_map(data = slot(popDataSP, "data"),
    aes(map_id = OBJECTID, fill = pop92), map = gg_map_obj) + 
  expand_limits(x = gg_map_obj$long, y = gg_map_obj$lat) + ggtitle("Population in 1992, Selected Prefectures")
china_map

#ggsave("china_pop92.pdf", china_map, width = 10, height = 6)

Points Data: which prefecture contains province's centroid?

provinceCentroids<-coordinates(ChinaRD1) #gets coordinates
plot(ChinaRD1)
points(provinceCentroids,col="red") #adds points to current plot

#plot(ChinaRD3,border="blue",add=T) #draws borders of prefectures

Geographic Projections, Spatial Overlays, Write to CSV

spdf_pts <- SpatialPointsDataFrame(coords=provinceCentroids, data=data.frame(ChinaRD1$NAME_1))
proj4string(spdf_pts) <-proj4string(popDataSP) #Make sure projection is the same (should be, since both come from the same data source)
#proj4string(spdf_pts)<-CRS(" +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0") #could use this code to adjust
pref_pts<-over(spdf_pts,popDataSP)
pref_pts$x<-slot(spdf_pts,"coords")[,1]
pref_pts$y<-slot(spdf_pts,"coords")[,2]
#write.table(pref_pts, file = "centroids_in_cities.CSV", sep = ",", col.names=names(pref_pts),row.names=FALSE,na="") #write CSV

Plot Province Centroids with Prefecture Name

plot(ChinaRD1)
points(provinceCentroids,col="red") #adds points to current plot
text(pref_pts$x,pref_pts$y, labels=pref_pts$city_name,col="blue",cex=0.7) #Add names

Using Open Street Map Tiles

sufe_names="SUFE"
sufe_pts<-SpatialPointsDataFrame(coords=cbind(121.5153883,31.2948375), data=data.frame(sufe_names))
sufe_pts$lon<-coordinates(sufe_pts)[1,1]
sufe_pts$lat<-coordinates(sufe_pts)[1,2]
sufe_map <- invisible(openmap(c(sufe_pts$lat+0.03,sufe_pts$lon-0.03), c(sufe_pts$lat-0.03,sufe_pts$lon+0.03),
               minNumTiles=10,type="osm")) #invisible suppresses warning messages

SUFE Plot from Open Street Map

plot(sufe_map)

Geocoding

  • Geocoding is the process of converting street addresses to latitude and longitude

  • This can be done with a geodatabase or by calling a service (ex: Google, Baidu)

  • In R, there are multiple packages to do this, including geoChina and ggmap

  • You will need an API key, and the output will often come back as JSON; Google gives 2,500 free requests per day, Baidu seems to allow even more (see the sketch after this list)

  • Google will also allow network analysis (distance between two points by travel mode)

  • Note that Chinese maps use a shifted coordinate system (GCJ02) that may require additional work to use
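
A minimal sketch with ggmap (the geocode interface and key registration have changed across ggmap versions, and the address string is only illustrative):

library(ggmap)
#register_google(key = "YOUR_API_KEY") #newer ggmap versions require registering a key
#sufe <- geocode("777 Guoding Road, Yangpu, Shanghai", output = "latlon", source = "google")
#sufe$lon; sufe$lat #the service's JSON reply comes back parsed into a data frame
#coordinates from Chinese providers may be GCJ02-shifted rather than WGS84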

Working with Point Pattern Data

Bring in restaurants point data and plot

We use a projected coordinate system (NAD 1983 Long Island State Plane, in feet) to do simple distance calculations

rest_data<-read.table("data/menupages_Wb126.csv", header=T, sep=",",stringsAsFactors =FALSE)
nyc_tracts<-readOGR("data","manhattan_tracts") #Read shapefiles with readOGR  
ggplot() + geom_polygon(data = nyc_tracts, aes(x=long, y = lat, group = group), fill = NA, color = "black") + 
   geom_point(data=rest_data,aes(x=x_feet, y=y_feet), size=1, color = "blue") + ggtitle("Manhattan Restaurants")

Subset Plot: Chinese Restaurants

ggplot() + geom_polygon(data = nyc_tracts, aes(x=long, y = lat, group = group), fill = NA, color = "black") + 
   geom_point(data=rest_data[rest_data$cuisName1=="Chinese",],aes(x=x_feet, y=y_feet), size=1, color = "red") + ggtitle("Chinese Restaurants")

Convert to Point Pattern Object for Analysis

min_x<-min(rest_data$x_feet)
max_x<-max(rest_data$x_feet)
min_y<-min(rest_data$y_feet)
max_y<-max(rest_data$y_feet)
W <- owin(xrange=c(min_x,max_x),yrange=c(min_y,max_y)) #a window defined by limits of data
allrests.ppp<-ppp(rest_data$x_feet,rest_data$y_feet,window=W)
## Warning in ppp(rest_data$x_feet, rest_data$y_feet, window = W): data
## contain duplicated points
italian.ppp<-allrests.ppp[rest_data$cuisName1=="Italian",]
chinese.ppp<-allrests.ppp[rest_data$cuisName1=="Chinese",]
pizza.ppp<-allrests.ppp[rest_data$cuisName1=="Pizza",]

Look at count of Chinese restaurants within given radii

neighDist<-pairdist(chinese.ppp)
radii_set<-seq(200,3600,200)
chinese_mat<-array(,c(length(radii_set),2))
for (i in 1:length(radii_set)) {
  chinese_mat[i,1]<-radii_set[i]
  chinese_mat[i,2]<-mean(rowSums((neighDist<radii_set[i]), na.rm=TRUE))-1 #subtract 1 so each restaurant does not count itself (own distance is 0)
}
plot(chinese_mat,xlab="radius",ylab="Count",main="Count Chinese Restaurants within Radius")

Compare to pizza restaurants
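
The original slide showed this comparison as a figure; the computation mirrors the Chinese-restaurant loop above, e.g.:

pizzaDist<-pairdist(pizza.ppp)
pizza_mat<-array(,c(length(radii_set),2))
for (i in 1:length(radii_set)) {
  pizza_mat[i,1]<-radii_set[i]
  pizza_mat[i,2]<-mean(rowSums((pizzaDist<radii_set[i]), na.rm=TRUE))-1 #remove own restaurant
}
plot(pizza_mat,xlab="radius",ylab="Count",main="Count Pizza Restaurants within Radius")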

Now compare Chinese to simulated measure

numSims<-1000
sim_mat<-array(,c(length(chinese_mat[,1]),numSims))
for (i in 1:numSims) {
  sim.ppp<-allrests.ppp[sample(1:length(allrests.ppp$x),length(chinese.ppp$x),replace=F)]
  neighDist_sim<-pairdist(sim.ppp)
  for (j in 1:length(radii_set)) {
    sim_mat[j,i]<-mean(rowSums(neighDist_sim<radii_set[j], na.rm=TRUE))-1
  }
}
conf_mat<-array(,c(length(chinese_mat[,1]),2))
for (j in 1:length(radii_set)) {
  conf_mat[j,1]<-quantile(sim_mat[j,],probs=.025,na.rm=TRUE)
  conf_mat[j,2]<-quantile(sim_mat[j,],probs=.975,na.rm=TRUE)
}
chinese_mat<-cbind(chinese_mat,conf_mat)

Plot with Confidence Bands
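
The original slide showed the figure; a minimal sketch of such a plot, using the column order from the cbind above (radius, count, 2.5% band, 97.5% band):

plot(chinese_mat[,1],chinese_mat[,2],type="l",col="red",
     xlab="radius",ylab="Count",main="Chinese Restaurant Counts with Simulated 95% Bands")
lines(chinese_mat[,1],chinese_mat[,3],lty=2) #2.5% simulation quantile
lines(chinese_mat[,1],chinese_mat[,4],lty=2) #97.5% simulation quantile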

Compare to pizza

The same permutation comparison can be rerun with pizza.ppp in place of chinese.ppp.