Following the announcement of a new way of looking at how interest in development projects might have moved around Africa over time (see this blog post), I’ve produced a more subtle version with one particularly interesting twist:
One of the problems with comparing countries against each other is that some are huge and some are tiny. (Note that there are hundreds of problems with comparing countries with each other, and this is merely the easiest to get around!) By adding population data from the UN’s Department of Economic and Social Affairs we can try to ‘account for’ the fact that larger countries will inevitably generate more Googling interest.
How does this work? Well, we simply assume (for now at least) that the size of your population has a linear effect on your Google page count: this means that if country A has a population twice that of country B then, all things being equal, we expect country A to have twice as many Google pages about it. The page count numbers are then adjusted so that all the countries have the “same population”. So small countries have their page counts increased and large countries have their page counts decreased, according to just how big or small each country is. This is clearly a gross simplification, but it gives a better idea of the relative Google interest over time than not adjusting for population at all.
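The linear adjustment described above can be sketched in a few lines of Python. All the figures and the helper function here are made up for illustration; this is not the actual code behind the map:

```python
# Scale each country's Google page count to a common "reference" population,
# assuming page count grows linearly with population. Figures are invented.

def adjust_counts(page_counts, populations, reference_population):
    """Scale each country's raw page count to the reference population."""
    adjusted = {}
    for country, count in page_counts.items():
        # Linear assumption: halving the population should halve the count,
        # so multiply by (reference population / actual population).
        adjusted[country] = count * reference_population / populations[country]
    return adjusted

page_counts = {"A": 62200, "B": 31100}            # raw Google result counts
populations = {"A": 50_000_000, "B": 25_000_000}  # country A is twice as big

adjusted = adjust_counts(page_counts, populations, 50_000_000)
# Country B's count is doubled to match country A's population, so the two
# now look identical in "per-population" terms.
print(adjusted)  # {'A': 62200.0, 'B': 62200.0}
```

Under this scheme a country half the size has its count doubled, exactly the “same population” adjustment the paragraph describes.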
In an attempt to find out a little bit about how development projects have been distributed between countries and types of project (health, education etc.) I have created a kind of trawler for Google information.
You give the trawler three lists—of countries, years and project types—and the software builds search terms for every combination of each of these lists. So it might start with “Angola 2000 education”, then “Angola 2000 health” followed by “Angola 2001 education” and so on, for all the countries, years and project types. The search terms are then handed over to Google, one by one, and the result count—the bit which looks like “About 62,200 results (0.45 seconds)”—is stored along with the query which generated that number.
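The combination-building step can be sketched like this (the country, year and project-type lists are illustrative, and the real trawler’s internals aren’t shown in the post):

```python
# Build every combination of country, year and project type as a search
# term, in the same order described above: for each country, each year,
# then each project type.
from itertools import product

countries = ["Angola", "Benin"]
years = [2000, 2001]
project_types = ["education", "health"]

search_terms = [f"{country} {year} {ptype}"
                for country, year, ptype in product(countries, years, project_types)]

print(search_terms[0])   # Angola 2000 education
print(search_terms[1])   # Angola 2000 health
print(len(search_terms)) # 2 countries x 2 years x 2 types = 8
```

Each of these strings would then be handed to Google one by one, and the “About N results” figure stored against the query that produced it.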
I’ve put all the results onto a kind of interactive map of Africa. Simply select the project type you’re interested in, hover over a country to see its page count, and click on the map to cycle through the years 2000 to 2012.
There is clearly a great deal to be said about the validity of using Google’s page count for anything at all, let alone what I grandly term ‘research’, but I’ve written one possible justification in a previous blog post. The short version is that even if the numbers are ‘wrong’ they should be broadly comparable; higher numbers should generally mean more documents found. Also, if you describe the results as being a measure of interest in a given topic by the English-speaking internet-using world then things seem far more reasonable. (Note that I’m not pretending to know anything about the number of projects happening in a given country in a given year.)
The result is reasonably interesting. South Africa tops the poll for pretty much every project type in every year, something which might be a function of its large population (and/or its English-speaking status). I’m going to make a population-adjusted version very soon.
Let me start this post with a couple of disclaimers:
1) This post is of no interest to anyone other than myself and possibly a few of my colleagues at UCL’s Centre for Advanced Spatial Analysis. It’s here as a memento of a couple of close-to-unbearable days in the office following an accidental blat of our project’s database. It will serve as a warning to us, and a guide in the unlikely event of us doing it again.
2) Persons and events in this blog post are fictional. Similarities to real persons or real events are entirely coincidental. That said, a few of the shouting, swearing episodes bear close resemblance to those seen round these parts recently…
It all started with an innocent bout of child-like curiosity: what does the new RTM of Windows 8 look like? How will it make our lives more exciting and more productive, as the expensive-looking YouTube ads have suggested?
We decided to run this little experiment on the server we use to host our project’s database, a natty little PostgreSQL install with PostGIS dealing with the spatial elements of our data.
To cut a long story short, we screwed up our Postgres installation due to a weirdness relating to 32-bit vs. 64-bit operating systems: our data refused to go into a 64-bit database, and Windows 8 seemingly refused to install a 32-bit instance of Postgres. We did, however, have the contents of the data folder within the old Postgres installation, so we felt like we were in reasonably good shape.
The next piece of good news was that the data was restorable onto a Windows 7 computer with 32-bit Postgres installed. A sigh of relief went round the office like an OAP Mexican wave. We naively took a pg_dump of the database and ran pg_restore in the hopes that it would all come out in the wash. After three days of The Matrix-style rapidly-scrolling DOS window, the process was announced as complete, but we couldn’t find the resulting database anywhere. Closer inspection of the pg_restore documentation revealed a very important piece of information hidden away in the header text which no one ever reads (the emphasis is mine):
pg_restore can operate in two modes. If a database name is specified, pg_restore connects to that database and restores archive contents directly into the database. Otherwise, a script containing the SQL commands necessary to rebuild the database is created and written to a file or standard output.
In short, the command either does what you want it to do, or something completely useless. In our case, the three days had been spent writing SQL commands to standard output (for which read the DOS prompt, with a buffer of just 40 lines) which could theoretically be used to piece the database back together line by line.
So where to next? We attempted to rerun the pg_restore command, this time with the -d switch so the output would go into the database rather than just to the screen, but were faced with screenfuls of errors such as ERROR: could not access file "$libdir/postgis-1.5": No such file or directory
Oh god, quoth we, it’s looking for a version of PostGIS which is not supported by a 64-bit installation of Postgres. Being the eager beavers we are, we immediately set about trying to build a 64-bit version of PostGIS 1.5, before throwing our hands in the air and going for coffee and doughnuts. Surely there must be a better way.
At this point in the story the deadlock began to be broken: why not upgrade the version of PostGIS on the 32-bit Postgres which actually worked, so the pg_restore would be looking for the right library files? A Google search turned up this vital page about upgrading PostGIS to version 2.0. Unbelievably, the only way to do this is via what is ominously referred to as a “Hard Upgrade” (so called, presumably, because it’s “hard” to work out how this could possibly be the officially recommended way of doing things).
Hard upgrading basically involves creating a special pg_dump with some important switches (-Fc -b -v) and running a Perl script included in PostGIS specifically for pg_restoring a PostGIS database. So, off to install Perl for Windows 8…
With Perl installed and the instructions for hard upgrading under our belts, things were moving forward. We were now up against this unhelpful error message (x1000 in the error log): ERROR: type "geometry" does not exist
What? Sounds worrying.
This was resolved when we realised that the new database we’d created to house the old data was not a PostGIS database, but merely a PostgreSQL one. To create a PostGIS database you need to create an ordinary database and then enable the spatial extension inside it with CREATE EXTENSION postgis;
So, we had an empty PostGIS-enabled database, a pg_dump dump file created using the magic command: pg_dump -h localhost -p 5432 -U postgres -Fc -b -v -f "/somepath/olddb.backup" olddb
and we were ready to roll.
The dainty command perl postgis_restore.pl dumpfile.dump | psql -h localhost -p 5432 -U postgres db_name 2> errors.txt was all that was left, and the process began. More The Matrix-style scrolling DOS prompt (but this time, for around ten minutes, rather than the three days under the old regime) and we finally had our 64-bit, PostGIS 2.0 database up, running and ready to use.
Will we make the same mistakes again? Probably. Are we going to start backing up our database from now on? You bet….
After a long while playing with Google’s excellent Building Maker, a time-wasting toy for crowd-sourcing Google Earth’s 3D building library, I’ve completed my first attempt at rendering the offices of UCL CASA. The results are laughably poor, mainly because, in London’s crowded Bloomsbury, there’s just no open space from which to get a sensible vantage point of the building. The whole thing is covered in trees, and one of my walls features a huge part of another building which was obscuring the StreetView image I used to put the 3D model together.
Predictably, the best part of the image is the roof, since this was rendered using unobscured satellite imagery. See what you think of my first draft:
This weekend (20th and 21st Oct) my colleagues at CASA and I will be presenting at an event in central London called the Bloomsbury Festival. There we’ll be taking some of the fancy visual and interactive research work we do with space and networks and demonstrating it to the assembled excited masses.
I’ll also be running a workshop on joining geography and the digital world from 1-1.45 on Sunday afternoon. Come and find out about how agent-based modelling works, what the point of it is, and how it could be changing our ideas of town planning, policing strategies and development aid distribution.
So far so simple in terms of reading data into the R environment (I’m using the wonderful RStudio by the way). Download a little library, run about 5 lines of code and, boom, my PostgreSQL view is available as a data frame in R. Easy peasy.
## Load the RPostgreSQL library
library(RPostgreSQL)
## Load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")
## Open a connection
con <- dbConnect(drv, host="xxxxxxxx", dbname="xxxxxxx", user="xxxxxx", password="xxxxxx")
## Read the whole contents of a table into a data frame
rs <- dbReadTable(con, "rlvw_cntry_year_aid_affected")
More tricky is the whole panel data regression part. Panel data has two dimensions, a time dimension (in my case the years from about 1960 to 2008) and an “individuals” dimension, in my case countries. So I have aid received data for all countries for each year in the set, making a lot of observations overall.
The hard part is using both dimensions when running a regression. In principle, the maths is not complicated. And in the statistics package I’m used to using, it’s straightforward (once you know how!): you tell the software which column represents your time dimension and which your individual dimension, and off you go. In R, I’m not sure yet how that stuff works, so it’s back to the reading board for me as I trawl through online tutorials etc. I’ll report back once I’ve worked out how to do it.
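To make the “two dimensions” idea concrete in the meantime, here is a toy illustration (in Python, with invented numbers, since the R route is still to be worked out) of what a fixed-effects (“within”) panel estimator does under the hood: demean each variable within each country, then fit ordinary least squares on the demeaned values, which absorbs each country’s own baseline level.

```python
# A toy fixed-effects ("within") panel estimator: subtract each country's own
# mean from x and y, then fit a single OLS slope on the demeaned data.
# The data are invented purely to illustrate the mechanics.
from collections import defaultdict

# (country, year, x, y) observations; y = 2*x plus a country-specific level
data = [
    ("A", 2000, 1.0, 12.0), ("A", 2001, 2.0, 14.0), ("A", 2002, 3.0, 16.0),
    ("B", 2000, 1.0, 22.0), ("B", 2001, 2.0, 24.0), ("B", 2002, 3.0, 26.0),
]

# Per-country means of x and y (the "individual" dimension)
sums = defaultdict(lambda: [0.0, 0.0, 0])
for country, _, x, y in data:
    sums[country][0] += x
    sums[country][1] += y
    sums[country][2] += 1
means = {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

# Demean within each country, then compute the OLS slope on the residuals
num = den = 0.0
for country, _, x, y in data:
    mx, my = means[country]
    num += (x - mx) * (y - my)
    den += (x - mx) ** 2
slope = num / den

print(slope)  # 2.0 -- the country-level intercepts have been absorbed
```

In R this demeaning is what the plm package automates once you declare which columns are the individual and time indices, e.g. index = c("country", "year").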