Panel data in R for dummies

Since I’ve recently been in the business of making general rules, here’s another: the geniuses who can write brilliant R packages still write documentation like it was the 1990s. Anyone old enough to have lived through the horror of 1990s technical user manuals will know what I mean. Not for them the friendly Ikea-style quick start guide. These manuals usually started thus:

Chapter 1: Configuring the IDE/DMMA Bus Jumper.
Before connecting LPT or COMx peripherals to this device, set the IDE/DMMA Bus Jumper depending on the communication mode of your peripheral. Choosing this jumper setting incorrectly may result in unpredictable results which may damage this device.

So it is with the excellent panel data package for R, called plm. The documentation begins with the following sentence:

In plm the data argument may be an ordinary data.frame but, in this case, an argument called index has to be added to indicate the structure of the data. This can be NULL (the default value), it is then assumed that the first two columns contain the individual and the time index and that observations are ordered by individual and by time period

and continues in much the same vein.

Anyway, in the spirit of relieving my fellow researchers of ever having to read this stuff to get started with panel data regression in R, here is a much-condensed version of what I’ve read, which should be enough to get one started. Hurrah for me.

library(package=RPostgreSQL)
library(package=plm)
## Load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

## Open a connection
con <- dbConnect(drv, host="xxx",dbname="xxx",user="xxx",password="xxx")

## Read the whole contents of a table into a data frame
rs <- dbReadTable(con,"xxx_table_name")

## create a pdata.frame from the dataframe. This is a technicality allowing the plm package to work with the data.
paneldata <- pdata.frame(rs, index=c("category_variable_name","time_variable_name"))

## we can now 'do stuff' with the data in a panel sense: things like lagging/leading and calculating differences:
head(paneldata$year)
head(lag(paneldata$year, 1))    ## last period's value
head(lag(paneldata$year, 0:2))  ## the value plus lags of one and two periods
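
Since I mentioned leading and differencing as well, here is a minimal sketch of those two operations on the same paneldata object (same idea, opposite direction):

## leads and differences work the same way on a pdata.frame column
head(lead(paneldata$year, 1))   ## next period's value
head(diff(paneldata$year))      ## change from the previous period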

Wanted: people to exploit

The first rule of academia: never talk to anyone from the ‘real world’ about your research.

I was reminded of this rule yesterday when I spoke to my flatmate about what I’m trying to do with complexity science-style modelling and development. I told him of the need to stop seeing development as the search for a missing ingredient, and of how being a developed economy doesn’t imply that you know what it takes to become a developed economy any more than being healthy implies you know anything about medicine. In fact, the opposite could perhaps be argued: it’s through my illnesses that I’ve learned such physiology as I know, not through my wellnesses.

I told him of the prospect of investigating a more subtle model of society which recognised that institutions, policy, wealth and technology are all interacting systems, and that sudden and radical changes in state (such as that which occurred during the Industrial Revolution) are well understood in chaos theory but terribly predicted by linear regression. I gesticulated wildly and my cheeks grew pink with enthusiasm.

He made a point, though, which entirely deflated my new-found sense of purpose in studying why some countries are rich, safe and just and others are poor, dangerous and corrupt. Surely, he said, there can be no study of development which doesn’t account for the fact that we are where we are because our ancestors successfully stole labour and resources from parts of the world they colonised for four hundred years. In our quest to make developing countries ‘more like us’, are we going to prescribe four centuries of stealing labour and resources from us? He thought not.

The problem with using the past of our development to study the future of others’ development is that we want to improve their lot without affecting our lot in any way at all. That’s what makes development hard, and it goes beyond the usual ‘world resource constraint’ argument which says that the world doesn’t have enough stuff for everyone to live like an American. I’ve never really bought into that argument. This new argument is more pernicious: we’re not going to run out of stuff for developed countries to consume, we’re going to run out of people for developing countries to exploit.

Lagging panel data in R

I never thought I’d say this, but Stata rules the roost in at least one respect: lagging a column in panel data. What does that mean?

Imagine you’re looking at test scores, and you think this year’s test score depends on last year’s (a sensible assumption perhaps). Your dataset would look something like this:

Year Name Score Previous Score  Score Before That
2010 Bob  30    .               .
2011 Bob  59    30              .
2012 Bob  63    59              30

The previous scores are calculated by ‘lagging’ the data by one and two periods. (Note that the dot represents a missing value: Stata uses "." where R uses NA.)
To lag data in Stata you simply type L.Score, and to include lots of lags, L(0/3).Score will give you the score, last year’s score, the year before that AND the year before that one too. Amazing!

Now, as I search for a way to do this in R, imagine my horror on stumbling upon this syntactical nightmare. Being involved with computing gets you used to different syntax systems. For example, I know that the SQL-like WHERE Name='Rob' AND Home='London' looks something like AND(A2="Rob", B2="London") in Excel. But you get the feeling that, as computer languages begin to be able to claim some maturity of design, some kind of basic tenets of language syntax would start to emerge. (Note that, were Excel to be designed today, there’s no way the developers would opt for the AND(A,B) tomfoolery.)
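
(For what it’s worth, the same filter in R would be something like the line below, assuming a hypothetical data frame called df with Name and Home columns.)

## the same filter in R, on a hypothetical data frame called df
subset(df, Name == "Rob" & Home == "London")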
So, to lag in R, I have to do this:

plm(Score ~ lag(Score, 1:2), data = paneldata)

Try getting your head around that! Score for readability of code? Nought! Score for being able to remember how to do this next time I want to? Nought point nought!

Ah, the joys of community-developed software.
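
For the record, here is a minimal sketch of how I understand the lagging to work end to end, using the little Bob table from above (a sketch under my current understanding, not gospel):

library(plm)

## the toy score data from the table above
scores <- data.frame(Name  = c("Bob", "Bob", "Bob"),
                     Year  = c(2010, 2011, 2012),
                     Score = c(30, 59, 63))

## declare the panel structure: individual first, then time
pscores <- pdata.frame(scores, index = c("Name", "Year"))

## the score, last year's score and the year before that
lag(pscores$Score, 0:2)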

Learning about panel data in R

So far so simple in terms of reading data into the R environment (I’m using the wonderful RStudio by the way). Download a little library, run about 5 lines of code and, boom, my PostgreSQL view is available as a data frame in R. Easy peasy.

library(package=RPostgreSQL)
library(package=plm)
## Load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

## Open a connection
con <- dbConnect(drv, host="xxxxxxxx",dbname="xxxxxxx",user="xxxxxx",password="xxxxxx")

## Read the whole contents of a table into a data frame
rs <- dbReadTable(con,"rlvw_cntry_year_aid_affected")
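
A couple of lines I tend to add at this point, just to sanity-check what came back and tidy up the connection afterwards (illustrative, using the same con and rs objects):

## quick sanity check of the result
str(rs)
head(rs)

## tidy up the connection when finished
dbDisconnect(con)
dbUnloadDriver(drv)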

More tricky is the whole panel data regression part. Panel data has two dimensions: a time dimension (in my case the years from about 1960 to 2008) and an “individuals” dimension (in my case countries). So I have data on aid received for every country in every year of the set, which makes for a lot of observations overall.

The hard part is using both dimensions when running a regression. In principle, the maths is not complicated. And in the statistics package I’m used to, it’s straightforward (once you know how!): you tell the software which column represents your time dimension and which your individual dimension, and off you go. In R, I’m not sure yet how that stuff works, so it’s back to the reading board for me as I trawl through online tutorials and the like. I’ll report back once I’ve worked out how to do it.

Connecting R to PostgreSQL

The most stupidly named database management system is, sadly, also the system of choice for my research group, the powers that be having apparently no interest in the fact that you can’t pronounce the name, can’t remember how to write it properly and have no idea what it’s supposed to mean.

Anyway, PostgreSQL is what I’m using, and I’m sticking with it. It helps that it has the PostGIS extension and is free, but I’m yet to be really, really convinced that it’s better than Microsoft’s SQL Server (one of my main reasons being that the management software for SQL Server includes a mini GIS viewer for quickly checking the contents of your spatial queries. That’s ace. If only PostGIS had something similar.) (By the way, I never fail to be amused by the fact that, the way I pronounce it at least, PostGIS sounds like a bedroom cleanup operation.)

Another standard I’m now having to learn in this office is R, the real man’s version of Stata, the much loved/hated statistical software. Stata is like an embarrassing hangover from a previous era of software design: it’s incredibly expensive, tremendously ugly and fantastically unhelpful, leaving you with the old white screen and flashing cursor not seen since the days of the vi text editor. My first goes at R (particularly with RStudio) have been far more encouraging.

So here I go. I’m using the RPostgreSQL package and this tutorial, which is currently making life pretty easy. I’m going to be doing some pretty simple econometrics using R, pulling my information from our PostGIS database, and I’ll be blogging about my experiences.

Off we go!

ISO3 Country Codes (things we learn as researchers)

This year I have mostly been mapping country names from data sources found online to ISO3 country codes which are actually useful for something.

The continuing proliferation of data sources whose only country indicator is something like “Virgin Is. (UK)” or “Korea, People’s Rep Of” has staggered me over the past few years of being a development aid researcher. It makes the providers of the data look like amateurs and makes life much harder than necessary for the poor researchers trying to eke some meaning out of the profusion of numbers. (Mentioning no names. But the delightful Penn World Tables are a rare exception.)

I’ve created various little tricks for turning these into the far more useful ISO3 country codes, examples of which include GBR for Great Britain, CHN for the People’s Republic of You-know-who, and TCA for the wonderfully obscure Turks and Caicos Islands (variously represented in datasets with an ampersand, “Islands” inexplicably shortened to Is. etc. etc. ad nauseam). These techniques mostly involve fancy use of the mostly-magical and semi-mythological VLookup in Excel: the topic of more dinner-table conversations than any other mostly-undocumented software functionality since the “discovery” of BCC in the 90s.
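
In R the same trick is just a lookup table and a merge; here is a minimal sketch with a tiny, made-up lookup table (the real one is of course much longer, and all the names below are illustrative):

## a tiny illustrative lookup table mapping messy names to ISO3 codes
iso_lookup <- data.frame(
  messy_name = c("Virgin Is. (UK)", "Korea, People's Rep Of", "Turks & Caicos Is."),
  iso3       = c("VGB", "PRK", "TCA"),
  stringsAsFactors = FALSE
)

## some data with messy country names (illustrative)
aid <- data.frame(country = c("Korea, People's Rep Of", "Turks & Caicos Is."),
                  amount  = c(100, 42),
                  stringsAsFactors = FALSE)

## the VLookup equivalent: a left join on the messy name
merge(aid, iso_lookup, by.x = "country", by.y = "messy_name", all.x = TRUE)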

Anyway, to cut a long story short, I am about to embark on this operation once more as I launch myself into the vagaries of the EM-DAT natural disaster database. See you on the other side.

Rob

Directly query DB in QGIS

Sometimes a plugin comes along which changes your life. Fast SQL Layer from Pablo T. Carreira is just such a plugin.

It allows you to execute arbitrary SQL queries against your underlying database, meaning that views and queries can now live in your QGIS project where they belong, instead of cluttering up the poor unsuspecting PostGIS database that’s having to do all the hard work.

To install it (on Windows at least), add the QGIS Contributed repository to your QGIS repository list (http://pyqgis.org/repo/contributed) and you’ll be writing queries in no time at all. Amazing.

My first choropleth

I’ve managed to get some data onto my PostGIS database, and have got my first visualisation result.

Since the aid dataset is large (c. 1,000,000 records; source: aiddata.org, the dataset is called ‘aid data 2 short’), I’ve summed over donors and within years, to leave me with a total of aid received (or rather, promised, since the dataset records amounts pledged, not amounts donated) per recipient country per year. This set is a far more manageable size.
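
That kind of aggregation is a one-liner in R; a minimal sketch, with the object and column names here purely illustrative:

## sum commitments over donors, per recipient country and year
## (aiddata, commitment_amount etc. are illustrative names)
aid_totals <- aggregate(commitment_amount ~ recipient_iso3 + year,
                        data = aiddata, FUN = sum)
head(aid_totals)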

Having added population data from the UN Department of Economic and Social Affairs (World Population Prospects, the 2010 revision), I’m able to draw a choropleth map of the world with the colour of each country representing the amount of aid committed to that country in a given year (the example uses 2000 as a randomly chosen benchmark). It’s work which is very similar to this.

Nice, isn’t it?

my first choropleth of aid levels (QGIS)