Lagging panel data in R

I never thought I’d say this, but Stata rules the roost in at least one respect: lagging a column in panel data. What does that mean?

Imagine you’re looking at test scores, and you think this year’s test score depends on last year’s (a sensible assumption perhaps). Your dataset would look something like this:

Year Name Score Previous Score  Score Before That
2010 Bob  30    .               .
2011 Bob  59    30              .
2012 Bob  63    59              30

The previous scores are calculated by ‘lagging’ the data by one and two periods (note that the dot represents a missing value: Stata has “.” and R has “NA”).
To lag data in Stata you simply type L.Score, and to include a whole run of lags, L(0/3).Score will give you the score, last year’s score, the year before that AND the year before that one too. Amazing!

Now, as I search for a way to do this in R, imagine my horror on stumbling upon this syntactical nightmare. Being involved with computing gets you used to different syntax systems. For example, I know that the SQL-like “WHERE Name='Rob' AND Home='London'” looks like “AND(A2="Rob", A3="London")” in Excel. But you get the feeling that, as computer languages begin to be able to claim some maturity of design, some kind of basic tenets of language syntax would begin to emerge. (Note that, were Excel to be designed today, there’s no way the developers would opt for the AND(A,B) tomfoolery.)
So, to lag in R, I have to do this:

plm(Name ~ Score, lag(Score=2,c(2,3)))

Try getting your head around that! Score for readability of code? Nought! Score for being able to remember how to do this next time I want to? Nought point nought!

Ah, the joys of community-developed software.
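For what it’s worth, here is a sketch of how this seems to work with the plm package: declare the data as a panel with pdata.frame(), after which lag() shifts within each individual. The little data frame below is made up for illustration, and the lag() being used is plm’s own method for panel series.

library(plm)

## A toy panel: one person observed over three years (values invented)
scores <- data.frame(Name  = rep("Bob", 3),
                     Year  = 2010:2012,
                     Score = c(30, 59, 63))

## Declare which columns are the individual and time dimensions
pscores <- pdata.frame(scores, index = c("Name", "Year"))

## lag() on a panel series shifts within each individual
pscores$Score_lag1 <- lag(pscores$Score, 1)   # last year's score
pscores$Score_lag2 <- lag(pscores$Score, 2)   # the year before that
pscores

Still not quite L(0/3).Score, but at least it has a fighting chance of being remembered next time.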


Learning about panel data in R

So far so simple in terms of reading data into the R environment (I’m using the wonderful RStudio by the way). Download a little library, run about 5 lines of code and, boom, my PostgreSQL view is available as a data frame in R. Easy peasy.

library(package=RPostgreSQL)
library(package=plm)
## Load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

## Open a connection
con <- dbConnect(drv, host="xxxxxxxx",dbname="xxxxxxx",user="xxxxxx",password="xxxxxx")

## Read the whole contents of a table into a data frame
rs <- dbReadTable(con,"rlvw_cntry_year_aid_affected")

More tricky is the whole panel data regression part. Panel data has two dimensions, a time dimension (in my case the years from about 1960 to 2008) and an “individuals” dimension, in my case countries. So I have aid received data for all countries for each year in the set, making a lot of observations overall.

The hard part is using both dimensions when running a regression. In principle, the maths is not complicated. And in the statistics package I’m used to using, it’s straightforward (once you know how!). You tell the software which column represents your time dimension, and which your individual dimension and off you go. In R, I’m not sure yet how that stuff works, so it’s back to the reading board for me as I trawl through online tutorials etc. I’ll report back once I’ve worked out how to do it.
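For anyone at the same stage, the plm package (loaded above) looks like it handles exactly this: you tell pdata.frame() which columns are the individual and time dimensions, and plm() takes it from there. A rough sketch, with guessed-at column names rather than the real ones from my view:

## Assuming the view behind rs has columns along the lines of "country",
## "year", "aid_received" and "population" -- these names are guesses.
panel <- pdata.frame(rs, index = c("country", "year"))

## A simple fixed-effects ("within") regression using both dimensions
fit <- plm(aid_received ~ population, data = panel, model = "within")
summary(fit)

The model argument ("pooling", "within", "random") looks like the place where the fixed-versus-random-effects decision gets made.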

Connecting R to PostgreSQL

The most stupidly named database management system is, sadly, also the system of choice for my research group, the powers that be having apparently no interest in the fact that you can’t pronounce the name, can’t remember how to write it properly and have no idea what it’s supposed to mean.

Anyway, PostgreSQL is what I’m using, and I’m sticking with it. It helps that it has the PostGIS extension and is free, but I’m yet to be really, really convinced that it’s better than Microsoft’s SQL Server (one of my main reasons being that the management software for SQL Server includes a mini GIS viewer for quickly checking the contents of your spatial queries. That’s ace. If only PostGIS had something similar.) By the way, I never fail to be amused by the fact that, the way I pronounce it at least, PostGIS sounds like a bedroom cleanup operation.

Another standard I’m now having to learn in this office is R, the real man’s version of Stata, the much loved/hated statistical software. Stata is like an embarrassing hangover from a previous era of software design: it’s incredibly expensive, tremendously ugly and fantastically unhelpful, leaving you with the old white screen and flashing cursor not seen since the days of the vi text editor. My first goes at R (particularly with RStudio) have been encouraging.

So here I go. I’m using the RPostgreSQL package and this tutorial, which is currently making life pretty easy. I’m going to be doing some pretty simple econometrics using R, pulling my information from our PostGIS database, and I’ll be blogging about my experiences.

Off we go!

ISO3 Country Codes (things we learn as researchers)

This year I have mostly been mapping country names from data sources found online to ISO3 country codes which are actually useful for something.

The continuing proliferation of data sources whose only country indicator is something like “Virgin Is. (UK)” or “Korea, People’s Rep Of” has staggered me over the past few years of being a development aid researcher. It makes the providers of the data look like amateurs and makes life much harder than necessary for the poor researchers trying to eke some meaning out of the profusion of numbers. (Mentioning no names. But the delightful Penn World Tables are a rare exception.)

I’ve created various little tricks for turning these into the far more useful ISO3 country codes, examples of which include GBR for Great Britain, CHN for the People’s Republic of You-know-who, and TCA for the wonderfully obscure Turks and Caicos Islands (variously represented in datasets with an ampersand, “Islands” inexplicably shortened to Is. etc. etc. ad nauseam). These techniques mostly involve fancy use of the mostly-magical and semi-mythological VLookup in Excel: the topic of more dinner-table conversations than any other mostly-undocumented software functionality since the “discovery” of BCC in the 90s.
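Translated out of Excel, the same trick is just a lookup table and a merge. A minimal sketch in R, where both the messy names and the lookup table are invented for illustration (in real life the table grows with every new dataset):

## A hand-built lookup from messy names to ISO3 codes (illustrative only)
iso3_lookup <- data.frame(raw_name = c("Virgin Is. (UK)",
                                       "Korea, People's Rep Of",
                                       "Turks & Caicos Is."),
                          iso3     = c("VGB", "PRK", "TCA"),
                          stringsAsFactors = FALSE)

## Some incoming data using the messy names
records <- data.frame(raw_name = c("Korea, People's Rep Of", "Virgin Is. (UK)"),
                      amount   = c(100, 25),
                      stringsAsFactors = FALSE)

## merge() is R's VLookup
merge(records, iso3_lookup, by = "raw_name", all.x = TRUE)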

Anyway, to cut a long story short, I am about to embark on this operation once more as I launch myself into the vagaries of the EM-DAT natural disaster database. See you on the other side.

Rob

Directly query DB in QGIS

Sometimes a plugin comes along which changes your life. Fast SQL Layer from Pablo T. Carreira is just such a plugin.

It allows you to execute arbitrary SQL queries against your underlying database, meaning that views and queries can now live in your QGIS project where they belong, instead of cluttering up the poor unsuspecting PostGIS database that’s having to do all the hard work.

To install it (on Windows at least), add the QGIS Contributed repository to your QGIS repository list (http://pyqgis.org/repo/contributed) and you’ll be writing queries in no time at all. Amazing.

My first choropleth

I’ve managed to get some data onto my PostGIS database, and have got my first visualisation result.

Since the aid dataset is large (c. 1,000,000 records; source: aiddata.org, the dataset is called ‘aid data 2 short’) I’ve summed over donors and within years, to leave me with the total aid received (or rather, promised, since the dataset has amounts pledged, not amounts donated) per recipient country per year. This set is a far more manageable size.
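If you do the summing on the R side, it is only a couple of lines. A toy sketch, with invented column names and numbers rather than the real aiddata fields:

## Toy pledge-level data: donor, recipient country, year, amount committed
pledges <- data.frame(donor     = c("USA", "GBR", "USA"),
                      recipient = c("TCA", "TCA", "BWA"),
                      year      = c(2000, 2000, 2000),
                      amount    = c(1.5, 2.0, 3.0))

## Sum over donors, within years: total committed per recipient per year
aid_totals <- aggregate(amount ~ recipient + year, data = pledges, FUN = sum)
aid_totals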

Having added population data from the UN Department for Economic and Social Affairs (World Population Prospects, the 2010 revision) I’m able to draw a choropleth map of the world with the colour of a country representing the amount of aid committed to that country in a given year (the example uses 2000 as a randomly chosen benchmark). It’s work which is very similar to this.

Nice, isn’t it?

[Figure: levels of aid choropleth – my first choropleth, made in QGIS]

Giving aid to non-existent countries

In the big, fat dataset I’m using to study what affects the global flow of development aid, I’ve come across a problem which researchers must have dealt with a million times from scratch, and never published the results of how they got around it.

The data set has donor/recipient pairs for over a million promised amounts of development aid, along with a date and a reason for giving the aid.

Sounds simple, but the problem comes from data relating to countries which, according to my separate dataset of country boundaries (the wonderful CShapes), didn’t exist at the time the aid was promised.

For example, aid was given by the World Bank to Botswana in 1965 even though that country only achieved independence from Great Britain in 1966. My world boundaries dataset reflects this accurately, giving a “start date” for Botswana of 1966. Before this date, the country was simply considered not to have existed, and a map drawn of all the countries as of 1965 just has a gap where Botswana should be.
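Spotting these cases is easy enough once the country start dates are to hand; it is what to do with them afterwards that is the open question. A toy sketch of the flagging step in R, with invented column names and values in the spirit of the CShapes attributes:

## Toy pledge data and toy CShapes-style country attributes (values invented)
pledges <- data.frame(recipient = c("BWA", "BWA"),
                      year      = c(1965, 1970),
                      amount    = c(2.0, 5.0))

borders <- data.frame(recipient  = "BWA",
                      start_year = 1966)

## Pledges dated before the recipient officially existed,
## like the 1965 commitment to Botswana
flagged <- merge(pledges, borders, all.x = TRUE)
subset(flagged, year < start_year)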

This is a pretty serious problem in the field of development aid, since many of the relevant countries are former colonies and aid was often given to those countries before they were independent from their colonial masters.

This problem must have been solved a million times before. But by whom?

Categorising data

In developing a data structure which will provide for the needs of many different research projects, there’s a difficult balance to be struck.
The structure needs to be generic enough to cope with the very different needs of all these projects (some are modelling cities across a couple of days, others modelling the whole world for several decades!) but still offer some kind of separation between the projects’ data, so a project can easily select just “its” data and not be bothered with hundreds of events which don’t concern it. I want project working to be straightforward and, crucially, collaboration to be as easy as changing a WHERE statement.
In the coming weeks I’m going to try and research ways of sensibly dividing up data which is all stored in one big table but which relates to different project areas (but may relate to more than one project area, or just be generic).

Watch this space. Thoughts are welcome.

A first attempt at a spatio-temporal constraint

-- Count existing shapes whose geometry overlaps the supplied shape and whose
-- validity period overlaps the supplied valid_from/valid_to interval.
CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_overlap(geometry, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE ST_Overlaps(sh1.shape, $1)
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE 'sql' IMMUTABLE
COST 100;

-- Count existing shapes that reuse the supplied source id during an
-- overlapping validity period.
CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_uid_existence(integer, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE $1 = id_from_source
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE 'sql' IMMUTABLE
COST 100;

-- No shape may overlap another shape in both space and time...
ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_concurrent_overlap_ck
CHECK(e_atlas.shape_concurrent_overlap(shape, valid_from, valid_to) < 1);

-- ...and no source id may be "alive" twice at the same time.
ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_primary_key_ck
CHECK(e_atlas.shape_concurrent_uid_existence(id_from_source,valid_from,valid_to) < 1);

Countries

I’m sure this has been blogged about many, many times, but it’s hard to believe just what a pain doing research using global country data over any significant time span can be. Obviously I was delighted to find the marvellously assembled CShapes, one man’s attempt to gather all the information about countries throughout history into one place. Countries come and countries go, and representing this seemingly simple fact requires three things which really don’t go together well in any of the database management systems around. These are:

  1. Space – what does this country look like from space? Where are its borders?
  2. Time – when did the country appear/disappear/start to look like this?
  3. A history of events to string all these spaces and times together.

So when East and West Germany became just Germany in 1990, what is the researcher attempting a time-series type analysis supposed to do with the attributes of each country? Let’s say we’re interested in how much aid a certain country gives under particular circumstances; we may be asking the question “does a natural disaster in recipient country x cause donor country a to give less money to recipient country y? And if so, how long does the effect take to work?”

To begin to answer this question we need to ‘do something’ with the information we have about how much East Germany and West Germany give once the year we’re looking at gets beyond 1990. “Just add it up,” you cry! “What could be simpler?” But the system needs to ‘know’ that these two countries became one. So we need to store a kind of ‘transformations’ register which keeps track of which shapes morphed into which other shapes and when.

But then things get complicated: how can I make sure that the data in the transformations register leads to a view of the world which makes sense? Standard database constraints are no longer enough. Countries are allowed to exist twice (something a standard uniqueness constraint won’t allow), just not at the same time. How is this represented in a database?

Well, the outstandingly named Richard T. Snodgrass (in his book Developing Time-Oriented Database Applications in SQL) would have me add something called a ‘check assertion’. This basically ensures that before any updates are committed to a table, certain rules are followed, one of which might be that a certain country cannot exist twice at the same time. My problem with all of this? PostgreSQL doesn’t support assertions, despite them being part of the SQL-92 standard. Why not?!

Apparently no database system does. And we poor adventurers surfing the edge of what can be done with information systems are left scratching our heads and writing blog entries instead…