In a former life, I was an English teacher—that’s the easy type: English to willing foreigners, not English to dastardly teenagers—and I used to say to my students that if a fellow language learner ever tells them that English is easy, they probably don’t know very much about English. It’s one of those deceptive learning experiences: seemingly easy at first, impossibly difficult once you get into it. It’s a mess, there are no easy guarantees about how things work, and each new problem needs to be approached with an almost completely open mind (see the famous if slightly strained example about the pronunciation of ghoti). The spelling is a historical (or ‘legacy’, to use a buzz term) clusterfudge, and the reality of the spoken language in no way reflects the language you learn in school. Nahwaddamean?
Data is a similar thing. When you’re learning your first data application, and it’s invariably an old-school flip-card system for remembering people’s addresses, you think “pfff, this is what those database contractors are getting 500 quid a day for?” But the more you find out about data, the more you are staggered that, despite its supposed ubiquity and the utter buzz-word nature of terms like “big data”, it’s still bastard hard to work with, and the tools you use to work with data look like they haven’t been worked on since 1993. There are still loads of black screens to deal with, and they still speak a dead language, SQL, the techie equivalent of Latin.
These are just the software difficulties, but there are far bigger dangers lurking in the world’s actual data sets. A data munger’s day is often filled up with doing things like telling his database that Ivory Coast and Cote D’Ivoire are actually the same country, or that Levy, Rob (Mr) is the same guy as plain old Rob Levy. Unless I’m missing some major development in how this stuff works, we’re still exactly where we were in 1984.
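That clean-up chore can be sketched in a few lines. This is a minimal, illustrative example only: the synonym map and name formats are made up for the post, not drawn from any real dataset, and real record linkage is far messier.

```python
import re

# Hand-built synonym map collapsing variant spellings onto one canonical
# form — the kind of table a data munger ends up maintaining by hand.
# (Entries here are illustrative, not exhaustive.)
CANONICAL_COUNTRY = {
    "cote d'ivoire": "Ivory Coast",
    "ivory coast": "Ivory Coast",
}

def canonical_country(name: str) -> str:
    """Return the canonical country name, falling back to the input."""
    return CANONICAL_COUNTRY.get(name.strip().lower(), name.strip())

def canonical_person(name: str) -> str:
    """Collapse 'Surname, Forename (Title)' into plain 'Forename Surname'."""
    # Strip a parenthesised title like '(Mr)'
    name = re.sub(r"\s*\([^)]*\)", "", name).strip()
    if "," in name:
        surname, forename = (part.strip() for part in name.split(",", 1))
        return f"{forename} {surname}"
    return name

print(canonical_country("Cote D'Ivoire"))  # Ivory Coast
print(canonical_person("Levy, Rob (Mr)"))  # Rob Levy
```

The depressing part is that the synonym map never ends: every new source file brings a new spelling, and someone has to add it by hand.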
But no matter how many times I have had to deal with a lack of standardisation among different data sets, I have never come across such a tangle of standards as is present in data on world trade. In order to compare trade figures across time and countries, products are categorised when the data is gathered, so we are able to say things like “imports of white goods are up”, or “we’re exporting more chemicals to Germany than we used to”. All very useful stuff, but it seems that basically every international organisation came up with a system for categorising export products some time in the distant past, and the poor researchers have just muddled through for years with conversion tables which attempt to approximately map one categorisation onto another. This is painful stuff at the best of times, but the sheer number of trade classifications boggles the mind.
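A conversion table of the sort described above is, at heart, just a many-to-many lookup. Here is a hedged sketch of the idea: the codes and the crosswalk rows are invented for illustration (real concordances, say SITC-to-NACE, run to thousands of rows and are only ever approximate).

```python
from collections import defaultdict

# (source_code, target_code) pairs from a hypothetical concordance file.
# One source code can map to several target codes — that ambiguity is
# exactly what makes cross-classification comparisons approximate.
CROSSWALK_ROWS = [
    ("0123", "15.12"),
    ("0123", "15.13"),  # one-to-many: the source category straddles two targets
    ("7841", "34.30"),
]

# Build the lookup: source code -> list of target codes.
crosswalk = defaultdict(list)
for src, dst in CROSSWALK_ROWS:
    crosswalk[src].append(dst)

def convert(code: str) -> list[str]:
    """Return every target-classification code a source code maps onto."""
    return crosswalk.get(code, [])

print(convert("0123"))  # ['15.12', '15.13'] — ambiguity is the norm
print(convert("9999"))  # [] — and plenty of codes simply have no match
```

Chaining several of these tables together (revision 1 to revision 2 to a different agency’s scheme entirely) compounds the approximation at every step, which is where the real pain starts.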
I’m working with a data source which produces the now deeply untrendy descriptions of economies called Input-Output Tables. Although these things are still used to calculate GDP, and to make statements about which industry is causing the latest ‘dip’ back into recession, there seems to have been almost no cited research on the subject since the mid-80s.
Anyway, the Input-Output models I’m looking at use a categorisation called NACE, which is used by the European Union, and which has gone through several revisions, making comparisons across time difficult. But UN trade data is categorised by something called, somewhat over-optimistically, the Standard International Trade Classification (SITC) which has also had many revisions of its own. Add to this the BEC, the CPC, the ISIC, the HS, PRODCOM and NAICS, and you’re in a data tangle of epic proportions.
Data, like the historical headache of English-language spelling, is hard to work with in the best of circumstances. But the needlessly complicated tangle of legacy trade data categorisations is the diarrhoea of the data world. In more ways than one.