Guinea-Bissau imported 500,000 litres of ‘Ice, Snow and Potable Water’ in 2003

So, like, trade data, yeah? It’s great, isn’t it? All the world’s trade in tangible products, itemised by zealous customs officials and browsable by the idle researcher, thumbing through a Who’s Who-style almanac of fascinating trade numbers, discovering little gems in the diamond stats here, unusual themes of the international art market there. A joy.

Or so you’d think. The reality is significantly more painful than my sepia-tinted vision of a dusty old office with copies of the UN’s famous COMTRADE database lying around like old Yellow Pages, waiting to be perused.

What you actually have to do if you want, in a quiet moment between world-saving academic discoveries, to know about the trade patterns of the world, is use this horrible-looking and overloaded website. Somehow, you’d think the world’s most glamorous and most-studied database would look more, well, bling. But there it is. It also runs incredibly slowly. Here’s an entertaining stat from the website itself:

The blistering speed demonstrated by the UN's flagship data product: COMTRADE returns a single row in just 20 seconds.

The database boasts close to two billion records. This means that if a densely typed book were produced with one trade record per line, the book would be around 2.2 kilometres thick (see the bottom of this post for the calculation). So that the UN doesn’t spend all its limited resources on server power, they’ve limited the queries you can submit to the database to those that would return fewer than 50,000 records. So you can’t just ask “how much stuff does the UK export to the rest of the world?” because, with around 6,300 product categories, 200-odd countries and around fifty years of data, you quickly hit that ceiling.

The maximum will be relaxed if you contribute to server costs: for a mere US$1,000 the limit is upped to 50 million records. This means that, in principle at least, you could download the entire database in just 35 queries. But how to put those queries together? We can select “all products” and “all years” and then a random bunch of countries, in the hope that the limit won’t be exceeded. But it’s impossible to know a priori how many countries will fit into a single 50-million-record query.

So I decided to do things the ‘brute force’ way: no single country exceeds the 50 million record limit (as far as I can tell), so by submitting queries country by country I should be able to avoid the ceiling safely. But this is still a tedious process for 200 countries: queries must be submitted via click-boxes on the website (which, as I’ve mentioned, runs painfully slowly) and then, once the query is ready, an email is sent and you go back to the website to download a file containing the data. This file must be named appropriately (by hand) and saved somewhere sensible before being uploaded to our data server. Keeping track of which countries you’ve submitted, which are ready, which you’ve downloaded and which you’ve uploaded to the database is a painful process.
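
Next time (and there will, alas, be a next time) I’ll at least keep the bookkeeping in the database rather than in my head. A minimal sketch, with hypothetical table and column names of my own invention:

-- Hypothetical bookkeeping table: one row per reporter country,
-- one timestamp per manual step described above.
CREATE TABLE comtrade_download_log (
    iso3          char(3) PRIMARY KEY,  -- reporter country
    submitted_at  timestamp,            -- query submitted on the COMTRADE website
    ready_at      timestamp,            -- notification email received
    downloaded_at timestamp,            -- file downloaded, renamed and filed
    uploaded_at   timestamp             -- loaded into our own data server
);

-- Which countries still need chasing, oldest requests first?
SELECT iso3
FROM comtrade_download_log
WHERE uploaded_at IS NULL
ORDER BY submitted_at NULLS FIRST;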

So you can imagine my horror (or, if you can’t, think blood draining from face, dry mouth, bulging eyes, exploding brain) upon discovering that, 75 countries into this long, boring process, I’ve been asking the server for the wrong pieces of information.

Instead of the dollar value of each transaction, I’ve ended up with the quantity of each product traded. This means I now know, for example, that the Bahamas traded nine live horses with the US, but not how much those horses were worth. I also know that Swaziland bought 10 kilos of used postage stamps from South Africa, but not how much it spent buying them.

For aggregation purposes, this information is utterly useless. What is the total export value of the Solomon Islands? Well, it’s 70 tons of “Ornamental fish, live”, plus 19 kg of edible offal, plus 317 tons of “Palm kernel or babassu oil, crude”. It’s just not going to work.

So it’s back to square one with the downloading of trade data from the UN website. If anyone knows of a better way of doing this, let this weary researcher know, quick, or there may be one fewer “Professional brainbox, unfrozen” exported from the UK in future editions of the data.

Here’s how the thickness of our imaginary book of trade data was calculated. Microsoft Word can squeeze 46 lines of fairly dense data onto an A4 page. My copy of Pemberton & Rau’s “Mathematics for Economists” is 4 cm thick and has 700 pages, or about 0.0057 cm per page. The COMTRADE database has 1.75 billion records, which means it’d need around 38 million pages, for a total book thickness of 217,391 cm, or roughly 2.2 kilometres.


Connecting R to PostgreSQL

The most stupidly named database management system is, sadly, also the system of choice for my research group, the powers that be having apparently no interest in the fact that you can’t pronounce the name, can’t remember how to write it properly and have no idea what it’s supposed to mean.

Anyway, PostgreSQL is what I’m using, and I’m sticking with it. It helps that it has the PostGIS extension and is free, but I’m yet to be really, truly convinced that it’s better than Microsoft’s SQL Server (one of my main reasons being that the management software for SQL Server includes a mini GIS viewer for quickly checking the contents of your spatial queries; that’s ace, and if only PostGIS had something similar). By the way, I never fail to be amused by the fact that, the way I pronounce it at least, PostGIS sounds like a bedroom cleanup operation.

Another standard I’m now having to learn in this office is R, the real man’s version of Stata, the much loved/hated statistical software. Stata is like an embarrassing hangover from a previous era of software design: it’s incredibly expensive, tremendously ugly and fantastically unhelpful, leaving you with the old white-screen-and-flashing-cursor experience not seen since the days of the vi text editor. My first goes at R (particularly with RStudio) have been much more encouraging.

So here I go. I’m using the RPostgreSQL package and this tutorial, which is currently making life pretty easy. I’m going to be doing some pretty simple econometrics using R, pulling my information from our PostGIS database, and I’ll be blogging about my experiences.

Off we go!

ISO3 Country Codes (things we learn as researchers)

This year I have mostly been mapping country names from data sources found online to ISO3 country codes, which are actually useful for something.

The continuing proliferation of data sources whose only country indicator is something like “Virgin Is. (UK)” or “Korea, People’s Rep Of” has staggered me over the past few years of being a development aid researcher. It makes the providers of the data look like amateurs and makes life much harder than necessary for the poor researchers trying to eke some meaning out of the profusion of numbers. (Mentioning no names. But the delightful Penn World Tables are a rare exception.)

I’ve created various little tricks for turning these into the far more useful ISO3 country codes, examples of which include GBR for Great Britain, CHN for the People’s Republic of You-know-who, and TCA for the wonderfully obscure Turks and Caicos Islands (variously represented in datasets with an ampersand, with “Islands” inexplicably shortened to “Is.”, etc. etc. ad nauseam). These techniques mostly involve fancy use of the mostly-magical and semi-mythological VLOOKUP in Excel: the topic of more dinner-table conversations than any other mostly-undocumented software functionality since the “discovery” of BCC in the 90s.
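
When the raw data is already sitting in the database, the same trick works as a lookup table and a join. A sketch of that idea in SQL, with invented table names (country_name_lookup and some_online_dataset are mine, not anything real):

-- Hand-maintained lookup: every raw spelling we've ever met, mapped once
-- to its ISO3 code.
CREATE TABLE country_name_lookup (
    raw_name text PRIMARY KEY,   -- the name exactly as the source spells it
    iso3     char(3) NOT NULL    -- e.g. GBR, CHN, TCA
);

INSERT INTO country_name_lookup (raw_name, iso3) VALUES
    ('Virgin Is. (UK)',         'VGB'),
    ('Korea, People''s Rep Of', 'PRK'),
    ('Turks & Caicos Is.',      'TCA');

-- The VLOOKUP equivalent: a LEFT JOIN, so any spelling we haven't met yet
-- comes back as NULL instead of silently disappearing.
SELECT s.*, l.iso3
FROM some_online_dataset s
LEFT JOIN country_name_lookup l ON l.raw_name = s.country_name;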

Anyway, to cut a long story short, I am about to embark on this operation once more as I launch myself into the vagaries of the EM-DAT natural disaster database. See you on the other side.

Rob

Directly query DB in QGIS

Sometimes a plugin comes along which changes your life. Fast SQL Layer from Pablo T. Carreira is just such a plugin.

It allows you to execute arbitrary SQL queries against your underlying database, meaning that views and queries can now live in your QGIS project where they belong, instead of cluttering up the poor unsuspecting PostGIS database that’s having to do all the hard work.
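
By way of example, this is the sort of throwaway query that can now live inside the QGIS project. It runs against the e_atlas.shapes table from my spatio-temporal constraint experiments elsewhere on this blog, so swap in your own schema and columns:

-- All the country shapes in force on 1 January 1990, pulled straight onto the
-- map canvas: no permanent view needed in the database.
SELECT id_from_source, shape
FROM e_atlas.shapes
WHERE valid_from <= '1990-01-01'
  AND valid_to   >  '1990-01-01';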

To install it (on Windows at least), add the QGIS Contributed repository to your QGIS repository list (http://pyqgis.org/repo/contributed) and you’ll be writing queries in no time at all. Amazing.

Giving aid to non-existent countries

In the big, fat dataset I’m using to study what affects the global flow of development aid, I’ve come across a problem which researchers must have dealt with a million times from scratch, and never published the results of how they got around it.

The data set has donor/recipient pairs for over a million promised amounts of development aid, along with a date and a reason for giving the aid.

Sounds simple, but the problem comes from data relating to countries which, according to my separate dataset of country boundaries (the wonderful CShapes), didn’t exist at the time the aid was promised.

For example, aid was given by the World Bank to Botswana in 1965 even though that country only achieved independence from Great Britain in 1966. My world boundaries dataset reflects this accurately, giving a “start date” for Botswana of 1966. Before this date, the country was simply considered not to have existed, and a map drawn of all the countries as of 1965 just has a gap where Botswana should be.

This is a pretty serious problem in the field of development aid, since many of the relevant countries are former colonies and aid was often given to those countries before they were independent from their colonial masters.
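
Finding the offending rows is at least easy; deciding what to do with them is the hard part. A sketch of the diagnostic query, where aid_commitments, recipient_id and commitment_date are hypothetical stand-ins for my real aid table, and e_atlas.shapes holds the CShapes-style boundaries:

-- Aid commitments whose recipient has no shape valid on the date the aid was
-- promised: i.e. according to the boundaries data, the country didn't exist.
SELECT a.*
FROM aid_commitments a
WHERE NOT EXISTS (
    SELECT 1
    FROM e_atlas.shapes s
    WHERE s.id_from_source = a.recipient_id
      AND s.valid_from <= a.commitment_date
      AND a.commitment_date < s.valid_to
);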

This problem must have been solved a million times before. But by whom?

Categorising data

In developing a data structure which will provide for the needs of many different research projects, there’s a difficult balance to be struck.

The structure needs to be generic enough to cope with the very different needs of all these projects (some are modelling cities across a couple of days, others modelling the whole world for several decades!) but still offer some kind of separation between the projects’ data, so that a project can easily select just “its” data and not be bothered with hundreds of events which don’t concern it. I want project working to be straightforward and, crucially, collaboration to be as easy as changing a WHERE statement.

In the coming weeks I’m going to try to research ways of sensibly dividing up data which is all stored in one big table but which relates to different project areas (and may relate to more than one project area, or just be generic).
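
One candidate, sketched here with invented names (the events table and its columns are hypothetical), is to keep everything in the one big table and hang project membership off it separately:

-- One big shared events table (e_atlas.events, with an integer id), plus a
-- tag table saying which project(s) each event belongs to. An event can
-- belong to several projects, or to none.
CREATE TABLE e_atlas.event_projects (
    event_id integer NOT NULL,   -- references e_atlas.events(id)
    project  text    NOT NULL,   -- e.g. 'city_model', 'global_aid'
    PRIMARY KEY (event_id, project)
);

-- Collaboration really is just a change of WHERE clause: each project works
-- through a view like this one.
CREATE VIEW e_atlas.global_aid_events AS
SELECT e.*
FROM e_atlas.events e
JOIN e_atlas.event_projects p ON p.event_id = e.id
WHERE p.project = 'global_aid';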

Watch this space. Thoughts are welcome.

A first attempt at a spatio-temporal constraint

-- Counts the shapes already stored in e_atlas.shapes whose geometry overlaps
-- the candidate geometry ($1) and whose validity period intersects the
-- candidate period [$2, $3). Note that it reads from the table even though it
-- is declared IMMUTABLE.
CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_overlap(geometry, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE ST_Overlaps(sh1.shape, $1)
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;

-- Counts the shapes that carry the same source identifier ($1) and whose
-- validity period intersects [$2, $3): the same country may appear more than
-- once, but not twice at the same time.
CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_uid_existence(integer, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE $1 = id_from_source
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;

-- No shape may spatially overlap another shape that is valid at the same time.
ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_concurrent_overlap_ck
CHECK(e_atlas.shape_concurrent_overlap(shape, valid_from, valid_to) < 1);

-- A 'temporal primary key': the same id_from_source may not exist twice
-- during overlapping validity periods.
ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_primary_key_ck
CHECK(e_atlas.shape_concurrent_uid_existence(id_from_source, valid_from, valid_to) < 1);

Countries

I’m sure this has been blogged about many, many times, but it’s hard to believe just what a pain doing research using global country data over any significant time span can be. Obviously I was delighted to find the marvellously assembled CShapes, one man’s attempt to gather all the information about countries throughout history into one place. Countries come and countries go, and representing this seemingly simple fact requires three things which really don’t go together well in any of the database management systems around. These are:

  1. Space – what does this country look like from space? Where are its borders?
  2. Time – when did the country appear, disappear, or start to look like this?
  3. A history of events to string all these spaces and times together.

So when East and West Germany became just Germany in 1990, what is the researcher attempting a time-series-type analysis to do with the attributes of each country? Let’s say we’re interested in how much aid a certain country gives under particular circumstances; we may be asking the question “does a natural disaster in recipient country x cause donor country a to give less money to recipient country y? And if so, how long does the effect take to work?”

To begin to answer this question we need to ‘do something’ with the information we have about how much East Germany and West Germany gave once the year we’re looking at gets beyond 1990. “Just add it up,” you cry! “What could be simpler?” But the system needs to ‘know’ that these two countries became one. So we need to store a kind of ‘transformations’ register which keeps track of which shapes morphed into which other shapes, and when.
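
For what it’s worth, the register I have in mind is nothing more exotic than a table of predecessor/successor pairs. A first sketch (the table and column names are made up, and the ids refer to whatever key the shapes table uses):

-- Hypothetical 'transformations' register: each row records that
-- predecessor_shape ceased to exist on transformation_date and that its
-- attributes should from then on be counted under successor_shape.
-- A merger (East + West Germany -> Germany) is two rows sharing a successor;
-- a split is several rows sharing a predecessor.
CREATE TABLE e_atlas.shape_transformations (
    predecessor_shape   integer NOT NULL,  -- id of the shape that disappears
    successor_shape     integer NOT NULL,  -- id of the shape that carries on
    transformation_date date    NOT NULL,  -- when the change took effect
    PRIMARY KEY (predecessor_shape, successor_shape, transformation_date)
);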

But then things get complicated: how can I make sure that the data in the transformations register leads to a view of the world which makes sense? Standard database constraints are no longer enough. Unlike with a standard uniqueness constraint, a country is allowed to exist twice, just not at the same time. How is this represented in a database?

Well, the outstandingly named Richard T. Snodgrass (in his book Developing Time-Oriented Database Applications in SQL) would have me add something called a ‘check assertion’. This basically ensures that before any updates are committed to a table, certain rules are followed, one of which might be that a certain country cannot exist twice at the same time. My problem with all of this? PostgreSQL doesn’t support assertions, despite their being part of the SQL-92 standard. Why not?!

Apparently no database system does. And we poor adventurers surfing the edge of what can be done with information systems are left scratching our heads and writing blog entries instead…

All about databases

Oh my goodness, you cry in thrilled elation. A blog all about one researcher’s investigations into the vagaries and subtleties of database design?

I acknowledge that for a good proportion of the seven billion potential readers of this blog, the prospect of a blog about data is not going to compete with the universes of Twitter, YouTube and gathering fresh water (yes, I realise that not all of the seven billion spend their time idly wondering what to do with their computer-bound, office-tarnished days). But for those happy few for whom this counts as a ‘hobby or interest’ on dating websites and LinkedIn profiles, a great day has dawned.

I’ll be filling these pages with interesting stuff I come across in my attempt to build the ultimate all-encompassing “Global Intelligence System”, first reviews of which included the now-fabled “it’s not intelligent, and it’s certainly not global”. At least said critic acknowledged that it was a system, eh?!

Till next time, DBMS-fans.

Rob