Connecting R to PostgreSQL

The most stupidly named database management system is, sadly, also the system of choice for my research group, the powers that be having apparently no interest in the fact that you can’t pronounce the name, can’t remember how to write it properly and have no idea what it’s supposed to mean.

Anyway, PostgreSQL is what I’m using, and I’m sticking with it. It helps that it has the PostGIS extension and is free, but I’m yet to be really convinced that it’s better than Microsoft’s SQL Server (one of my main reasons being that the management software for SQL Server includes a mini GIS viewer for quickly checking the contents of your spatial queries. That’s ace. If only PostGIS had something similar.) By the way, I never fail to be amused by the fact that, the way I pronounce it at least, PostGIS sounds like a bedroom cleanup operation.

Another standard I’m now having to learn in this office is R, the real man’s version of Stata, the much loved/hated statistical software. Stata is like an embarrassing hangover from a previous era of software design: it’s incredibly expensive, tremendously ugly and fantastically unhelpful, leaving you with the old white screen/flashing cursor not seen since the days of the vi text editor. My first goes at R (particularly with RStudio) have been much more encouraging.

So here I go. I’m using the RPostgreSQL package and this tutorial, which is currently making life pretty easy. I’m going to be doing some fairly simple econometrics using R, pulling my information from our PostGIS database, and I’ll be blogging about my experiences.
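For anyone following along, here’s a minimal connection sketch using RPostgreSQL. All the connection details (database name, host, user, password) below are placeholders for your own setup, not my actual configuration.

library(RPostgreSQL)

# All connection details below are placeholders
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 dbname   = "aid_db",      # hypothetical database name
                 host     = "localhost",
                 port     = 5432,
                 user     = "myuser",
                 password = "mypassword")

dbListTables(con)   # quick sanity check: which tables can we see?

# Tidy up when finished
dbDisconnect(con)
dbUnloadDriver(drv)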

Off we go!

My first choropleth

I’ve managed to get some data onto my PostGIS database, and have got my first visualisation result.

Since the aid dataset is large (c. 1,000,000 records; source: aiddata.org, the dataset is called ‘aid data 2 short’), I’ve summed over donors and within years, to leave me with a total aid received (or rather, promised, since the dataset has amounts pledged, not amounts donated) per recipient country per year. This set is a far more manageable size.
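For the record, the aggregation boils down to something like this, reusing the RPostgreSQL connection from my earlier post. The table and column names here are my illustrative shorthand, not the real schema.

# Sum pledged amounts over donors, within recipient-year cells
aid_by_country_year <- dbGetQuery(con, "
  SELECT recipient, year, SUM(amount_pledged) AS total_aid
  FROM   aid_pledges
  GROUP  BY recipient, year;
")
head(aid_by_country_year)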

Having added population data from the UN Department for Economic and Social Affairs (World Population Prospects, the 2010 revision), I’m able to draw a choropleth map of the world with the colour of a country representing the amount of aid committed to that country in a given year (the example uses 2000 as a randomly chosen benchmark). It’s work which is very similar to this.

Nice, isn’t it?

[Figure: levels of aid choropleth. Caption: my first choropleth (QGIS)]
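The join behind the map looks roughly like this (again, table and column names are illustrative stand-ins for my schema):

# Total aid pledged per recipient in 2000, joined to UN population figures
aid_2000 <- dbGetQuery(con, "
  SELECT a.recipient,
         SUM(a.amount_pledged) AS total_aid,
         p.population
  FROM   aid_pledges a
  JOIN   un_population p
    ON   p.country = a.recipient
   AND   p.year    = a.year
  WHERE  a.year = 2000
  GROUP  BY a.recipient, p.population;
")

# One obvious use of the population figures: aid per head
aid_2000$aid_per_capita <- aid_2000$total_aid / aid_2000$population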

Giving aid to non-existent countries

In the big, fat dataset I’m using to study what affects the global flow of development aid, I’ve come across a problem which researchers must have dealt with from scratch a million times, without ever publishing how they got around it.

The data set has donor/recipient pairs for over a million promised amounts of development aid, along with a date and a reason for giving the aid.

Sounds simple, but the problem comes from data relating to countries which, according to my separate dataset of country boundaries (the wonderful CShapes), didn’t exist at the time the aid was promised.

For example, aid was given by the World Bank to Botswana in 1965, even though that country only achieved independence from Great Britain in 1966. My world boundaries dataset reflects this accurately, giving a “start date” for Botswana of 1966. Before this date, the country was simply considered not to have existed, and a map drawn of all the countries as of 1965 just has a gap where Botswana should be.
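At least finding the affected rows is mechanical: compare each pledge’s date with the recipient’s start date in the boundaries table. Something along these lines, where the table and column names (including the CShapes start-date column) are guesses and will need adjusting:

# Aid pledges dated before the recipient officially existed
orphan_aid <- dbGetQuery(con, "
  SELECT a.donor, a.recipient, a.pledge_date
  FROM   aid_pledges a
  JOIN   cshapes c
    ON   c.country_name = a.recipient
  WHERE  a.pledge_date < c.start_date;
")
nrow(orphan_aid)   # how big is the problem?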

This is a pretty serious problem in the field of development aid, since many of the relevant countries are former colonies and aid was often given to those countries before they were independent from their colonial masters.

This problem must have been solved a million times before. But by whom?

Categorising data

In developing a data structure which will provide for the needs of many different research projects, there’s a difficult balance to be struck.
The structure needs to be generic enough to cope with the very different needs of all these projects (some are modelling cities across a couple of days, others modelling the whole world over several decades!) but still offer some kind of separation between the projects’ data, so a project can easily select just “its” data and not be bothered with hundreds of events which don’t concern it. I want project working to be straightforward and, crucially, collaboration to be as easy as changing a WHERE statement.
In the coming weeks I’m going to try to research ways of sensibly dividing up data which is all stored in one big table but relates to different project areas (and may relate to more than one project area, or just be generic).
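One candidate I’ll be looking at is a simple tagging scheme: keep everything in the one big table and add a junction table assigning each event to any number of projects, so selecting a project’s data really is just a JOIN and a WHERE. A rough sketch, with made-up names:

# Junction table mapping events to projects (all names invented for illustration)
dbSendQuery(con, "
  CREATE TABLE event_project (
    event_id   integer REFERENCES events(id),
    project_id integer REFERENCES projects(id),
    PRIMARY KEY (event_id, project_id)
  );
")

# A project's data is then one join away
my_events <- dbGetQuery(con, "
  SELECT e.*
  FROM   events e
  JOIN   event_project ep ON ep.event_id = e.id
  WHERE  ep.project_id = 42;   -- hypothetical project id
")

Under this scheme, events tagged to no project would form the “generic” pool, and events tagged to several are shared between projects without being duplicated.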

Watch this space. Thoughts are welcome.