Panel data in R for dummies

Since I’ve recently been in the business of making general rules, here’s another: the geniuses who can write brilliant R packages still write documentation like it was the 1990s. Anyone old enough to have lived through the horror of 1990s technical user manuals will know what I mean. Not for them the friendly Ikea-style quick start guide. These manuals usually started thus:

Chapter 1: Configuring the IDE/DMMA Bus Jumper.
Before connecting LPT or COMx peripherals to this device, set the IDE/DMMA Bus Jumper depending on the communication mode of your peripheral. Choosing this jumper setting incorrectly may result in unpredictable results which may damage this device.

So it is with the excellent panel data package for R, called plm. The documentation begins with the following sentence:

In plm the data argument may be an ordinary data.frame but, in this case, an argument called index has to be added to indicate the structure of the data. This can be NULL (the default value), it is then assumed that the first two columns contain the individual and the time index and that observations are ordered by individual and by time period

and continues in much the same vein.

Anyway, in the spirit of relieving my fellow researcher from ever having to read this stuff to get started with panel data regression in R, here is a much-condensed version of what I’ve read which will get one started. Hurrah for me.

library(package=RPostgreSQL)
library(package=plm)
## Load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

## Open a connection
con <- dbConnect(drv, host="xxx",dbname="xxx",user="xxx",password="xxx")

## Read the whole contents of a table into a dataframers
rs <- dbReadTable(con,"xxx_table_name")

## create a pdata.frame from the dataframe. This is a technicality allowing the plm package to work with the data.
paneldata <- pdata.frame(rs, index=c("category_variable_name","time_variable_name"))

## we can now 'do stuff' with the data in a panel sense: things like lagging/leading and calculating differences:
head(paneldata$year)
head(lag(paneldata$year),1)
head(lag(paneldata$year),0:2)
Advertisement

Lagging panel data in R

I never thought I’d say this, but Stata rules the roost at at least one thing: lagging a column in panel data. What does that mean?

Imagine you’re looking at test scores, and you think this year’s test score depends on last year’s (a sensible assumption perhaps). Your dataset would look something like this:

Year Name Score Previous Score  Score Before That
2010 Bob  30    .               .
2011 Bob  59    30              .
2012 Bob  63    59              30

The previous scores are calculated by ‘lagging’ the data by one and two periods (note that the dot represents a missing value. Stata has “.” and R has “NA”)
To lag data in Stata you simply type L.Score and for inclusion of lots of lags, L(0/3).Score will give you the score, last year’s score, the year before that AND the year before that one too. Amazing!

Now, as I search for a way to do this in R, imaging my horror on stumbling upon this syntactical nightmare. Being involved with computing gets you used to difference syntax systems. For example, I know that the SQL-like “WHERE Name=’Rob’ AND Home=’London'” looks like “AND(A$2$=’Rob’,A$3$=’London’) in Excel. But you get the feeling that, as computer languages beging to be able to claim some maturity of design, some kind of basic tenets of language syntax would begin to emerge. (Note that, were Excel to be designed today, there’s no way the developers would opt for the AND(A,B) tomfoolery.)
So, to lag in R, I have to do this:

plm(Name ~ Score, lag(Score=2,c(2,3)))

Try getting your head around that! Score for readability of code? Nought! Score for being able to remember how to do this next time I want to? Nought point nought!

Ah, the joys of community-developed software.