Guinea-Bissau imported 500,000 litres of ‘Ice, Snow and Potable Water’ in 2003

So, like, trade data, yeah? It’s great, isn’t it? All the world’s trade in tangible products, itemised by zealous customs officials, and browsable by the idle researcher, thumbing through a Who’s Who-style almanac of fascinating trade numbers, discovering little gems in the diamond stats here, unusual themes of the international art market there. A joy.

Or so you’d think. The reality is significantly more painful than my sepia-tinted vision of a dusty old office with copies of the UN’s famous COMTRADE database lying around like old Yellow Pages, waiting to be perused.

What you actually have to do if you want, in a quiet moment between world-saving academic discoveries, to know about the world’s trade patterns is use this horrible-looking and overloaded website. Somehow, you’d think the world’s most glamorous and most-studied database would look more, well, bling. But there it is. It also runs incredibly slowly. Here’s an entertaining stat from the website itself:

The blistering speed demonstrated by the UN's flagship data product: COMTRADE returns a single row in just 20 seconds.

The database boasts close to two billion records. This means that if a densely typed book were produced with one trade record per line, the book would be around 2,174 metres thick (see the bottom of this post for the calculation). So that the UN doesn’t spend all its limited resources on server power, the queries you can submit are limited to those that would return fewer than 50,000 records. So you can’t just ask: “how much stuff does the UK export to the rest of the world?” because, with around 6,300 product categories, 200-odd countries, and around fifty years of data, you hit that ceiling almost immediately (6,300 products times 200-odd partners is already over a million possible rows for a single year).

The maximum will be relaxed if you contribute to server costs: for a mere US$1,000 the limit is upped to 50 million records. This means that, in principle at least, you could download the entire database in just 35 queries (1.75 billion records divided by 50 million per query). But how to put those queries together? We could select “all products” and “all years” and then a random bunch of countries, in the hope that the limit won’t be exceeded. But it’s impossible to know a priori how many countries will fit into a single 50-million-record query.

So I decided to do things the ‘brute force’ way: no single country exceeds the 50 million record limit (as far as I can tell), so by submitting queries country by country I should be able to stay safely under the ceiling. But this is still a tedious process for 200 countries: queries must be submitted via click-boxes on the website (which, as I’ve mentioned, runs painfully slowly) and then, once the query is ready, an email arrives and you go back to the website to download a file containing the data. This file must be named appropriately (by hand) and saved somewhere sensible before being uploaded to our data server. Keeping track of which countries you’ve submitted, which are ready, which you’ve downloaded and which you’ve uploaded to the database is a painful process.

So you can imagine my horror (or, if you can’t, think blood draining from face, dry mouth, bulging eyes, exploding brain) upon discovering that, 75 countries into this long, boring process, I’ve been asking the server for the wrong pieces of information.

Instead of the dollar value of each transaction, I’ve ended up with the quantity of each product traded. This means I now know, for example, that the Bahamas traded nine live horses with the US, but not how much those horses were worth. I also know that Swaziland bought 10 kilos of used postage stamps from South Africa, but not how much it spent buying them.

For aggregation purposes, this information is utterly useless. What is the total export value of the Solomon Islands? Well, it’s 70 tons of “Ornamental fish, live”, plus 19 kg of edible offal, plus 317 tons of “Palm kernel or babassu oil, crude”. It’s just not going to work.
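To labour the point: with a value column, the aggregation I actually want is a one-liner, whereas quantities can only ever be summed within a unit. A sketch, assuming a hypothetical table comtrade(reporter, qty, qty_unit, value_usd) whose names bear no relation to the real download format:

-- With a value column, aggregation is trivial:
SELECT reporter, SUM(value_usd) AS total_exports_usd
FROM comtrade
GROUP BY reporter;

-- With quantities, the equivalent sum is meaningless: SUM(qty) would
-- happily add kilograms of offal to litres of palm oil, so the best you
-- can do is total within each unit.
SELECT reporter, qty_unit, SUM(qty) AS total_qty
FROM comtrade
GROUP BY reporter, qty_unit;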

So it’s back to square one with the downloading of trade data from the UN website. If anyone knows of a better way of doing this, let this weary researcher know quickly, or there may be one fewer “Professional brainbox, unfrozen” exported from the UK in future editions of the data.
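In the meantime, the one improvement I can make for the second attempt is to let the database do the bookkeeping for me. A minimal sketch; the table and its columns are entirely my own invention:

-- A hypothetical bookkeeping table for the country-by-country downloads.
CREATE TABLE comtrade_download_log (
    country_code    char(3) PRIMARY KEY,  -- ISO3 code of the reporting country
    submitted_on    date,                 -- when the query was submitted on the website
    ready_on        date,                 -- when the "your data is ready" email arrived
    downloaded_file text,                 -- the hand-named file the extract was saved to
    uploaded_on     date                  -- when it finally made it into our data server
);

-- Which countries still need chasing?
SELECT country_code FROM comtrade_download_log WHERE uploaded_on IS NULL;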

Here’s how the thickness of our imaginary book of trade data was calculated. Microsoft Word can squeeze 46 lines of fairly dense data onto an A4 page. My copy of Pemberton & Rau’s “Mathematics for Economists” is 4 cm thick and has 700 pages, or 0.0057 cm per page. The COMTRADE database has 1.75 billion records, which means it’d need around 38 million pages, for a total book thickness of 217,391 cm, or roughly 2,174 metres.
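And for the sceptical, the same back-of-the-envelope arithmetic can be checked straight from psql:

-- 1.75 billion records, 46 lines per page, 4 cm per 700 pages
SELECT 1750000000 / 46.0 AS pages,                         -- roughly 38 million pages
       1750000000 / 46.0 * (4 / 700.0) AS thickness_cm;    -- roughly 217,391 cm, i.e. about 2,174 m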


Connecting R to PostgreSQL

The most stupidly named database management system is, sadly, also the system of choice for my research group, the powers that be having apparently no interest in the fact that you can’t pronounce the name, can’t remember how to write it properly and have no idea what it’s supposed to mean.

Anyway, PostgreSQL is what I’m using, and I’m sticking with it. It helps that it has the PostGIS extension and is free, but I’m yet to be really, really convinced that it’s better than Microsoft’s SQL Server, one of my main reasons being that the management software for SQL Server includes a mini GIS viewer for quickly checking the contents of your spatial queries. That’s ace. If only PostGIS had something similar. (By the way, I never fail to be amused by the fact that, the way I pronounce it at least, PostGIS sounds like a bedroom cleanup operation.)

Another standard I’m now having to learn in this office is R, the real man’s version of Stata, the much loved/hated statistical software. Stata is like an embarrassing hangover from a previous era of software design: it’s incredibly expensive, tremendously ugly and fantastically unhelpful, leaving you with the old white screen/flashing cursor combination not seen since the days of the vi text editor. My first goes at R (particularly with RStudio) have been far more encouraging.

So here I go. I’m using the RPostgreSQL package and this tutorial, which between them are currently making life pretty easy. I’m going to be doing some pretty simple econometrics in R, pulling my data from our PostGIS database, and I’ll be blogging about my experiences.

Off we go!

ISO3 Country Codes (things we learn as researchers)

This year I have mostly been mapping country names from data sources found online to ISO3 country codes, which are actually useful for something.

The continuing proliferation of data sources whose only country indicator is something like “Virgin Is. (UK)” or “Korea, People’s Rep Of” has staggered me over the past few years of being a development aid researcher. It makes the providers of the data look like amateurs and makes life much harder than necessary for the poor researchers trying to eke some meaning out of the profusion of numbers. (Mentioning no names. But the delightful Penn World Tables are a rare exception.)

I’ve created various little tricks for turning these into the far more useful ISO3 country codes, examples of which include GBR for Great Britain, CHN for the People’s Republic of You-know-who, and TCA for the wonderfully obscure Turks and Caicos Islands (variously represented in datasets with an ampersand, “Islands” inexplicably shortened to Is. etc. etc. ad nauseam). These techniques mostly involve fancy use of the mostly-magical and semi-mythological VLookup in Excel: the topic of more dinner-table conversations than any other mostly-undocumented software functionality since the “discovery” of BCC in the 90s.
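Since everything ends up in PostgreSQL eventually, the same trick can be pulled off there instead of in Excel. A sketch, where the lookup table, the raw_data table and all their columns are made up for illustration:

-- A hand-maintained lookup from the messy names to ISO3 codes.
CREATE TABLE country_name_lookup (
    raw_name text PRIMARY KEY,   -- exactly as it appears in the source data
    iso3     char(3) NOT NULL    -- e.g. GBR, CHN, TCA
);

INSERT INTO country_name_lookup (raw_name, iso3) VALUES
    ('Virgin Is. (UK)',         'VGB'),
    ('Korea, People''s Rep Of', 'PRK'),
    ('Turks & Caicos Is.',      'TCA');

-- Attach the codes to the raw data; a LEFT JOIN keeps the rows that still
-- need a mapping, so they can be fished out and added to the lookup.
SELECT d.*, l.iso3
FROM raw_data d
LEFT JOIN country_name_lookup l ON d.country_name = l.raw_name;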

Anyway, to cut a long story short, I am about to embark on this operation once more as I launch myself into the vagaries of the EM-DAT natural disaster database. See you on the other side.

Rob

Directly query DB in QGIS

Sometimes a plugin comes along which changes your life. Fast SQL Layer from Pablo T. Carreira is just such a plugin.

It allows you to execute arbitrary SQL queries against your underlying database, meaning that views and queries can now live in your QGIS project where they belong, instead of cluttering up the poor unsuspecting PostGIS database that’s having to do all the hard work.
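To give a flavour of what I mean, a query like the one below can be typed straight into the plugin and drawn as a layer, rather than being saved as yet another view. It leans on my own e_atlas.shapes table (which makes another appearance further down this page), so treat the names as illustrative:

-- Countries as they existed on a given date, drawn directly as a QGIS layer
-- instead of living as a view inside the database.
SELECT id_from_source, shape
FROM e_atlas.shapes
WHERE valid_from <= '1965-01-01' AND valid_to > '1965-01-01';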

To install it (on Windows at least), add the QGIS Contributed repository to your QGIS repository list (http://pyqgis.org/repo/contributed) and you’ll be writing queries in no time at all. Amazing.

Giving aid to non-existent countries

In the big, fat dataset I’m using to study what affects the global flow of development aid, I’ve come across a problem which researchers must have dealt with from scratch a million times, without ever publishing how they got around it.

The data set has donor/recipient pairs for over a million promised amounts of development aid, along with a date and a reason for giving the aid.

Sounds simple, but the problem comes from data relating to countries which, according to my separate dataset of country boundaries (the wonderful CShapes), didn’t exist at the time the aid was promised.

For example, aid was given by the World Bank to Botswana in 1965, even though that country only achieved independence from Great Britain in 1966. My world boundaries dataset reflects this accurately, giving a “start date” for Botswana of 1966. Before this date, the country was simply considered not to have existed, and a map drawn of all the countries as of 1965 just has a gap where Botswana should be.

This is a pretty serious problem in the field of development aid, since many of the relevant countries are former colonies and aid was often given to those countries before they were independent from their colonial masters.

This problem must have been solved a million times before. But by whom?
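In the meantime, here’s the sort of query I’m using to flag the offending rows. The table and column names (aid_commitments, cshapes_boundaries and friends) are stand-ins for my own tables, so adjust to taste:

-- Aid commitments made to a recipient outside the recipient's CShapes
-- validity window, i.e. before it existed (or after it stopped existing).
SELECT a.recipient_iso3, a.commitment_date, a.amount_usd
FROM aid_commitments a
WHERE NOT EXISTS (
    SELECT 1
    FROM cshapes_boundaries c
    WHERE c.iso3 = a.recipient_iso3
      AND a.commitment_date BETWEEN c.valid_from AND c.valid_to
);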

Categorising data

In developing a data structure which will provide for the needs of many different research projects, there’s a difficult balance to be struck.
The structure needs to be generic enough to cope with the very different needs of all these projects (some are modelling cities across a couple of days, others modelling the whole world for several decades!) but still offer some kind of separation between the projects’ data, so a project can easily select just “its” data and not be bothered with hundreds of events which don’t concern it. I want project working to be straightforward and, crucially, collaboration to be as easy as changing a WHERE statement.
In the coming weeks I’m going to try to research ways of sensibly dividing up data which is all stored in one big table but which relates to different project areas (though it may relate to more than one project area, or just be generic).
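To make that concrete, here’s one sketch of the kind of structure I have in mind: a single events table, plus a junction table tagging each event with the project (or projects) it belongs to. All the names below are placeholders:

-- Every event, for every project, lives in one table.
CREATE TABLE events (
    event_id    serial PRIMARY KEY,
    happened_at timestamp NOT NULL,
    description text
);

CREATE TABLE projects (
    project_id serial PRIMARY KEY,
    name       text UNIQUE NOT NULL
);

-- An event can belong to several projects, or to none (generic data).
CREATE TABLE event_projects (
    event_id   integer REFERENCES events,
    project_id integer REFERENCES projects,
    PRIMARY KEY (event_id, project_id)
);

-- A project grabs "its" data, and collaboration really is just a different WHERE:
SELECT e.*
FROM events e
JOIN event_projects ep USING (event_id)
JOIN projects p USING (project_id)
WHERE p.name = 'world-1960s';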

Watch this space. Thoughts are welcome.

A first attempt at a spatio-temporal constraint

-- Counts existing shapes that spatially overlap the supplied geometry (in the
-- ST_Overlaps sense) while their validity window overlaps the supplied one
-- ($2 = valid_from, $3 = valid_to).
CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_overlap(geometry, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE ST_Overlaps(sh1.shape, $1)
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;

-- Counts existing shapes that already carry the supplied id_from_source
-- during a validity window that overlaps the supplied one.
CREATE OR REPLACE FUNCTION e_atlas.shape_concurrent_uid_existence(integer, timestamp, timestamp)
RETURNS integer AS
$BODY$
SELECT count(*)::integer FROM e_atlas.shapes sh1
WHERE $1 = sh1.id_from_source
AND $2 < sh1.valid_to
AND sh1.valid_from < $3;
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;

-- No two shapes may overlap in space while their validity periods overlap in time.
ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_concurrent_overlap_ck
CHECK(e_atlas.shape_concurrent_overlap(shape, valid_from, valid_to) < 1);

-- An id_from_source may only be reused for non-overlapping validity periods,
-- i.e. it is unique at any given point in time.
ALTER TABLE e_atlas.shapes ADD CONSTRAINT shapes_primary_key_ck
CHECK(e_atlas.shape_concurrent_uid_existence(id_from_source, valid_from, valid_to) < 1);