Following the announcement of a new way of looking at how interest in development projects might have moved around Africa over time (see this blog post) I’ve produced a more subtle version with one particularly interesting twist:
One of the problems with comparing countries against each other is that some are huge and some are tiny. (Note that there are hundreds of problems with comparing countries with each other, and this is merely the easiest to get around!) By adding population data from the UN’s Department of Economic and Social Affairs we can try to ‘account for’ the fact that larger countries will inevitably generate more Googling interest.
How does this work? Well, we simply assume (for now at least) that the size of your population has a linear effect on your Google page count: this means that if country A has a population twice that of country B then, all things being equal, we expect country A to have twice as many Google pages about it. The page count numbers are then adjusted so that all the countries have the “same population”. So small countries have their page counts increased and large countries have their page counts decreased, according to just how big or small each country is. This is clearly a gross simplification, but it gives a better idea of the relative Google interest over time than not adjusting for population at all.
In an attempt to find out a little bit about how development projects have been distributed between countries and types of project (health, education etc.) I have created a kind of trawler for Google information.
You give the trawler three lists—of countries, years and project types—and the software builds search terms for every combination of each of these lists. So it might start with “Angola 2000 education”, then “Angola 2000 health” followed by “Angola 2001 education” and so on, for all the countries, years and project types. The search terms are then handed over to Google, one by one, and the result count—the bit which looks like “About 62,200 results (0.45 seconds)”—is stored along with the query which generated that number.
I’ve put all the results onto a kind of interactive map of Africa. Simply select the project type you’re interested in, hover over a country to see its page count, and click on the map to cycle through the years 2000 to 2012.
There is clearly a great deal to be said about the validity of using Google’s page count for anything at all, let alone what I grandly term ‘research’, but I’ve written one possible justification in a previous blog post. The short version is that even if the numbers are ‘wrong’ they should be broadly comparable; higher numbers should generally mean more documents found. Also, if you describe the results as being a measure of interest in a given topic by the English-speaking internet-using world then things seem far more reasonable. (Note that I’m not pretending to know anything about the number of projects happening in a given country in a given year.)
The result is reasonably interesting. South Africa tops the poll for pretty much every project type in every year, something which might be a factor of its large population (and/or its English speaking status). I’m going to make a population-adjusted version very soon.
I’ve been looking into whether Google’s “search result count” facility can help direct the focus of my research a little bit, by recording the result count for a series of search strings relating to different countries, in different years, and for different development project types (e.g. education, health etc.)
Although the result count is only an estimate (and famously a rather poor one) I think that comparing result counts against each other should be an at-least-reasonable heuristic. I assume that although they’re wrong, they’re not randomly so, and hence a higher result count for one of a set of structurally similar search terms should say something about the number of pages found.
The other assumption that lies behind using result counts in this way is that the amount of ‘stuff’ on the internet is a good measure of how much the English-speaking developed world is interested in a given topic. I assume that if “Kenya development project water” comes up with more results than “Kenya development project malaria” then more English-speaking internet users are ‘interested’ in water projects than in malaria projects. I then make the leap of faith that this implies these projects are happening more. Debatable? Most certainly. I’d be interested to try and defend this against a well-informed doubter. Comments below!
Since this assumption, if true, would be more accurate post internet-era, I’ve restricted my searching to the years 2000 to 2012. I’ve both included the year in the search term, and restricted the search results to only those pages from that year.
Methodological quibbles (or more) aside, I was impatient to start looking at the results of this “Google-harvest” and have analysed the numbers for a subset of African countries (namely Burkina Faso, Congo, Egypt, Eritrea, Gambia, Ghana, Guinea, Guinea-Bissau, Liberia, Mali, Mauritania, Morocco, Mozambique, Sierra Leone and Tunisia (I realise that any search results for Guinea-Bissau will show up in those for Guinea as well. I also realise that “Congo” is two countries. I’ll gloss over these details for now).
Although this represents only a small number of all the countries in Africa, the results are already worth commenting on. Here’s a graph showing how the total result count across all project types was divided between the project types.
The sharp-eyed amongst my readers will have spotted something odd about this graph. In 2006, there are huge spikes education projects (18.3% up to 20.8%) and agriculture projects (10% to 11.8%) at the apparent expense of water, AIDS and malaria projects. So the mystery is this: what happened in 2006 that led to a huge (but temporary) increase in interest in education and agriculture projects, at the expense of interest in health projects? And will this trend still be visible once all the results are in? Only time will tell….
p.s. apologies to G.F. Handel for the title of this post.
Interesting econometrics to follow, this is just the before-party.
Let me start this post with a couple of disclaimers:
1) This post is of no interest to anyone other than myself and possibly a few of my colleagues at UCL’s Centre for Advanced Spatial Analysis. It’s here as a memento of a couple of close-to-unbearable days in the office following an accidental blat of our project’s database. It will serve as a warning to us, and a guide in the unlikely event of us doing it again.
2) Persons and events in this blog post are fictional. Similarities to real persons or real events are entirely coincidental. That said, a few of the shouting, swearing episodes bear close resemblance to those seen round these parts recently…
It all started with an innocent bout of child-like curiosity: what does the new RTM of Windows 8 look like? How will it make our lives more exciting and more productive, as the expensive-looking YouTube ads have suggested?
We decided to run this little experiment on the server we use to host our project’s database, a natty little PostgreSQL install with PostGIS dealing with the spatial elements of our data.
To cut a long story short, we screwed up our Postgres installation due to a weirdness relating to 32-bit vs. 64-bit operating systems: our data refused to go into a 64-bit database, and Windows 8 seemingly refused to install a 32-bit instance of Postgres. We did however, have the contents of the data folder within the old Postgres installation so we felt like we were in reasonably good shape.
The next piece of good news was that the data was restorable onto a Windows 7 computer with 32-bit Postgres installed. A sigh of relief went round the office like a OAP Mexican wave. We naively took a pg_dump of the database and ran pg_restore in the hopes that it would all come out in the wash. After three days of The Matrix-style rapidly-scrolling DOS window the process was announced as complete, but we couldn’t find the resulting database anywhere. Closer inspection of the pg_restore documentation revealed a very important piece of information hidden away in the header text which no one ever reads (the emphasis is mine):
pg_restore can operate in two modes. If a database name is specified, pg_restore connects to that database and restores archive contents directly into the database. Otherwise, a script containing the SQL commands necessary to rebuild the database is created and written to a file or standard output.
In short, the command either does what you want it to do, or something completely useless. In our case, the three days had been spent writing SQL commands to standard output (for which read the DOS prompt, with a buffer of just 40 lines) which could theoretically be used to piece the database back together line by line.
So where to next? We attempted to rerun the pg_restore command with the relevant stuff to make the output go into the database rather than just to the screen, but were faced with screenfuls of errors such as ERROR: could not access file "$libdir/postgis-1.5": No such file or directory
Oh god, quoth we, it’s looking for a version of PostGIS which is not supported by a 64-bit installation of Postgres. Being the eager beavers we are, we immediately set about trying to build a version of PostGIS 1.5 in 64-bit bytecode, before throwing our hands in the air and going for coffee and doughnuts. Surely there must be a better way.
At this point in the story the deadlock began to be broken: why not upgrade the version of PostGIS on the 32-bit Postgres which actually worked so the pg_restore would be looking for the right library files? A Google search turned up this vital page about upgrading PostGIS to version 2.0. Unbelievably, the only way to do this is via what is ominously referred to as a “Hard Upgrade” (so could because it’s “hard” to work out how this could possibly be the officially recommended way of doing things).
Hard upgrading basically involves creating a special pg_dump with some important switches (-Fc -b -v) and running a Perl script included in PostGIS specifically for pg_restoring a PostGIS database. So, off to install Perl for Windows 8…
With Perl installed and the instructions for hard upgrading under our belts, things were moving forward. We were now up against this unhelpful error message (x1000 in the error log): ERROR: type "geometry" does not exist
What? Sounds worrying.
This was resolved when we realised that the new database we’d created to house the old data was not a PostGIS database, but merely a PostgreSQL one. To create a PostGIS database you need to do:
So, we had an empty PostGIS-enabled database, a pg_dump dump file created using the magic command: pg_dump -h localhost -p 5432 -U postgres -Fc -b -v -f "/somepath/olddb.backup" olddb
and we were ready to roll.
The dainty command perl postgis_restore.pl dumpfile.dump | psql -h localhost -p 5432 -U postgres db_name 2> errors.txt was all that was left, and the process began. More The Matrix-style scrolling DOS prompt (but this time, for around ten minutes, rather than the three days under the old regime) and we finally had our 64-bit, PostGIS 2.0 database up, running and ready to use.
Will we make the same mistakes again? Probably? Are we going to start backing up our database from now one? You bet….