Giving aid to non-existent countries

In the big, fat dataset I’m using to study what affects the global flow of development aid, I’ve come across a problem which researchers must have dealt with a million times from scratch, and never published the results of how they got around it.

The data set has donor/recipient pairs for over a million promised amounts of development aid, along with a date and a reason for giving the aid.

Sounds simple, but the problem comes from data relating to countries which, according to my separate dataset of country boundaries (the wonderful CShapes) didn’t exist at the time the aid was promised.

For example, aid was given by the World Bank to Botswana in 1965 even though that country only achieved independence from Great Britain in 1966. My world boundaries dataset reflects this accurately giving a “start date” for Botswana of 1966. Before this date, the country was simple considered not to have existed and a map drawn of all the countries as of 1965 just has a gap where Botswana should be.

This is a pretty serious problem in the field of development aid, since many of the relevant countries are former colonies and aid was often given to those countries before they were independent from their colonial masters.

This problem must have been solved a million times before. But by whom?

Advertisement

Categorising data

In developing a data structure which will provide for the needs of many different research projects, there’s a difficult balance to be struck.
The structure needs to be generic enough to cope with the very different needs of all these projects (some are modelling cities across a couple of days, others modelling the whole world for several decades!) but still offer some kind of separation between the projects’ data so a project can easily select just “its” data and not be bothered with hundreds of events which don’t concern them. I want project working to be straight-forward and, crucially, collaboration to be as easy as changing a WHERE statement.
In the coming weeks I’m going to try and research ways of sensibly dividing up data which is all stored in one big table but which relates to different project areas (but may relate to more than one project area, or just be generic).

Watch this space. Thoughts are welcome.

A first attempt at a spatio-temporal constraint

CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_overlap(geometry, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE ST_Overlaps(sh1.shape, $1)
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE 'sql' IMMUTABLE
COST 100;

CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_uid_existence(integer, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE $1 = id_from_source
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE 'sql' IMMUTABLE
COST 100;

ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_concurrent_overlap_ck
CHECK(e_atlas.shape_concurrent_overlap(shape, valid_from, valid_to) < 1);

ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_primary_key_ck
CHECK(e_atlas.shape_concurrent_uid_existence(id_from_source,valid_from,valid_to) < 1);

Countries

I’m sure this has been blogged about many, many times, but it’s hard to believe just what a pain doing research using global country data over any significant time span can be. Obviously I was delighted to find the marvellously assembled CShapes, one man’s attempt to gather all the information about countries throughout history into one place. Countries come and counties go, and representing this seemingly simple fact requires three things which really don’t go together well in any of the database management systems around. These are:

  1. Space – what does this country look like from space? Where are it’s borders?
  2. Time – when did the country appear/disappear/start to look like this.
  3. A history of events to string all these spaces and times together.

So when East and West Germany become just Germany in 1989, what is the researcher attempting to do a time-series type analysis to do with the attributes of each country. Let’s say we’re interested in how much aid a certain country gives under particular circumstances; we may be asking the question “does a natural disaster in recipient country x cause a donor country a to give less money to recipient country y? And if so, how long does the effect take to work?”

To begin to answer this question we need to ‘do something’ with the information we have about how much East Germany and West Germany give once the year we’re looking at gets beyond 1989. “Just add it up,” you cry! “What could be simpler?” But the system needs to ‘know’ that these two countries became one. So we need to store a kind of ‘transformations’ register which keeps track of which shapes morphed into which other shapes and when.

But then things get complicated: how can I make sure that the data in the transformations register leads to a view of the world which makes sense? Standard database constraints areĀ  no longer enough. Countries are allowed to exist twice, unlike the standard database uniqueness constraints, just not at the same time. How is this represented in a database?

Well, the outstandingly named Richard T. Snodgrass (in his book Developing Time-Oriented Database Applications in SQL) would have me add something called a ‘check assertion’. This basically ensures that before any updates are commited to a table, certain rules are followed, one of which might be that a certain country cannot exist twice at the same time. My problem with all of this? PostGreSQL doesn’t support assertions, despite it being a part of the SQL92 standard. Why not?!

Apparently no database system does. And we poor adventurers surfing the edge of what can be done with information systems are left scratching our heads and writing blog entries instead…

All about databases

Oh my goodness, you cry in thrilled elation. A blog all about one researcher’s investigations into the vagueries and subtleties of database design?

Whilst I acknowledge that for a good proportion of the seven billion potential readers of this blog, the prospect of a blog about data is not going to compete with the universes of Twitter, YouTube and gathering fresh water (yes, I realise that not all of the seven billion spend their time idly wondering what to do with their computer-bound, office tarnished days) but for those happy few for whom this counts as a ‘hobby or interest’ on dating websites and LinkedIn profiles, a great day has dawned.

I’ll be filling these pages with interesting stuff I come across in my attempt to build the ultimate all-emcompassing “Global Intelligence System”, first reviews of which included the now-fabled “it’s not intelligent, and it’s certainly not global”. At least said critic acknowledged that it was a system, eh?!

Till next time, DBMS-fans.

Rob