Lagging panel data in R

I never thought I’d say this, but Stata rules the roost at at least one thing: lagging a column in panel data. What does that mean?

Imagine you’re looking at test scores, and you think this year’s test score depends on last year’s (a sensible assumption perhaps). Your dataset would look something like this:

Year Name Score Previous Score  Score Before That
2010 Bob  30    .               .
2011 Bob  59    30              .
2012 Bob  63    59              30

The previous scores are calculated by ‘lagging’ the data by one and two periods (note that the dot represents a missing value. Stata has “.” and R has “NA”)
To lag data in Stata you simply type L.Score and for inclusion of lots of lags, L(0/3).Score will give you the score, last year’s score, the year before that AND the year before that one too. Amazing!

Now, as I search for a way to do this in R, imaging my horror on stumbling upon this syntactical nightmare. Being involved with computing gets you used to difference syntax systems. For example, I know that the SQL-like “WHERE Name=’Rob’ AND Home=’London'” looks like “AND(A$2$=’Rob’,A$3$=’London’) in Excel. But you get the feeling that, as computer languages beging to be able to claim some maturity of design, some kind of basic tenets of language syntax would begin to emerge. (Note that, were Excel to be designed today, there’s no way the developers would opt for the AND(A,B) tomfoolery.)
So, to lag in R, I have to do this:

plm(Name ~ Score, lag(Score=2,c(2,3)))

Try getting your head around that! Score for readability of code? Nought! Score for being able to remember how to do this next time I want to? Nought point nought!

Ah, the joys of community-developed software.


Connecting R to PostgreSQL

The most stupidly named database management system is, sadly, also the system of choice for my research group, the powers that be having apparently no interest in the fact the you can’t pronounce the name, can’t remember how to write it properly and have no idea what it’s supposed to mean.

Anyway, PostgreSQL is what I’m using, and I’m sticking with it. It helps that it has the PostGIS extension and is free, but I’m yet to be really really convinced that it’s better than Microsoft’s SQL Server (one of my main reasons being that the management software for SQL Server includes a mini GIS viewer for quickly checking the contents of your spatial queries. That’s ace. If only PostGIS had something similar. (By the way, I never fail to be amused by the fact that, the way I pronounce it at least, PostGIS sounds like a bedroom cleanup operation.)

Another standard I’m now having to learn in this office is R, the real man’s version of Stata the much loved/hated statistical software. Stata is like an embarrassing hangover from a previous era of software design: it’s incredibly expensive, tremendously ugly and fantastically unhelpful. Leaving you with the old white screen/flashing cursor not seen since the days of the VI text editor. My first goes at R (particularly with RStudio)

So here, I go. I’m using the RPostgreSQL package and this tutorial, which is currently making life pretty easy. I’m going to be doing some pretty simple econometrics using R, pulling my information from our PostGIS database, and I’ll be blogging about my experiences.

Off we go!