Big data, big ball-ache

Big data, yeah? It’s great, isn’t it? Doesn’t everyone just love to have loads of big data all over the place?

Got 30 million customers in the UK, have you? Each of those customers purchasing thousands of products a year, yeah? Screw it, let’s just store ALL that information in a massive database. It’s big data innit? It’s what people do now.

Well I’m sick of it. Regular readers will know that I’m currently in the process of trying to gather trade data from the UN. It’s of the format “we sold this much soap to this country in this year”. Sounds simple, right?

Well it is. But it’s also big. There are around 200 reporting countries, reporting trade with one another, in over 3,000 product categories across fifteen years. This makes the final database somewhere in the region of 150 million rows long. It’s big, and it’s slow, and it’s incredibly painful to deal with.

By way of an example, let me introduce you to a painful problem which has bedevilled me these past few days: due to some kind of weirdness with the import process, some countries’ data ended up with an equals sign at the start of their product codes. So instead of product code “101305” they had “=101305”. I can’t even remember now how this happened, but it’s to do with the fact that the data sets are so large that they could only be opened in certain pieces of software, one of which has obviously had this weird side-effect. The affected countries are Japan, Brazil, China, India, Russia and Mexico. So, some nice small countries then. This means that 20 million pieces of data need an equals sign removing. Sounds easy, right?
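On a per-row basis, it is. Here is a minimal sketch in Python, assuming the data can be streamed as CSV with a column I'm calling product_code. The column names and sample rows are made up for illustration, and this is not the process that was actually running.

```python
import csv
import io


def clean_code(code: str) -> str:
    """Drop a single leading '=' from a product code, if there is one."""
    return code[1:] if code.startswith("=") else code


def clean_rows(rows, code_field="product_code"):
    """Yield rows one at a time with the product code cleaned,
    so the whole data set never has to sit in memory at once."""
    for row in rows:
        row[code_field] = clean_code(row[code_field])
        yield row


# Tiny made-up sample standing in for the real export (hypothetical columns)
sample = io.StringIO(
    "reporter,product_code,year\n"
    "Japan,=101305,2005\n"
    "France,101305,2005\n"
)

cleaned = list(clean_rows(csv.DictReader(sample)))
print([row["product_code"] for row in cleaned])
```

Because clean_rows is a generator, each row is handled and handed on before the next one is read. Whether that would be any quicker than an update inside the database on a table this size is a separate question.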

The process to get rid of those equals signs started yesterday evening, and was still running this lunchtime, a full eighteen hours after it started.

This is not tenable. This is not big data. This is just a big ball-ache.

What the flip is going on with global trade?

Similarly to many branches of statistics-gathering, the world’s trade statistics bureaux lack, in their communication style, a certain panache. The writings of such agencies are characterised by a complete absence of zing, lightness-of-touch and joie de vivre. I’ve blogged before about horrific diagrams like that shown here, and how the whole enterprise of gathering information about global trade is inaccessible and unpleasant.

So it gave me an extra tickle to find a rare example of humour in a working paper from the National Bureau of Economic Research called “World Trade Flows: 1962-2000”.

In the paper, they present a number of databases of world trade flows from a series of years between 1962 and 2000. Blind or indifferent to the fears of the “Millennium Bug”, they use a two-digit code to represent the particular year. Let me recap: it’s a database of World Trade Flows at a given two-digit year we’ll generically call “??”.

The result can be seen here on p48 of the report. Fantastic stuff…
Feenstra et al 2005