I’ve been interested in collective intelligence and machine learning for a while now. These two related fields centre on using statistical tools on large sets of data to make measurements and predictions. So when the UK’s Guardian newspaper announced their “Data-store”, a collection of data sets open to the public, I felt it was time to apply some of what I’ve learned to the data they were offering.
I chose to apply hierarchical clustering to the data on world health. The idea of hierarchical clustering is to measure how similar data sets are, then repeatedly pair off the most similar ones to build a binary tree that reveals groups of similar data. I used the Pearson correlation to compare the data sets, and the result is drawn as a dendrogram, a way of showing the distances between the various clusters that emerge from the clustering algorithm.
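To make the idea concrete, here is a minimal sketch of the technique in Python. It is not the F# code from the project; the function names, the use of 1 minus the Pearson correlation as a distance, the averaging of merged vectors and the sample vectors are all assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / sqrt(var_x * var_y)

def distance(xs, ys):
    """Assumed distance: 0 for perfectly correlated vectors, 2 for opposite ones."""
    return 1.0 - pearson(xs, ys)

def hcluster(rows):
    """Build a binary tree from a dict of name -> vector.

    Leaves are names; internal nodes are (left, right, distance) tuples.
    """
    clusters = [(name, list(vec)) for name, vec in rows.items()]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (tree_a, vec_a), (tree_b, vec_b) = clusters[i], clusters[j]
        # Assumption: represent the merged cluster by the average of its two vectors.
        merged_vec = [(a + b) / 2 for a, b in zip(vec_a, vec_b)]
        merged = ((tree_a, tree_b, d), merged_vec)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

def leaves(tree):
    """List the names found under a node of the tree."""
    if isinstance(tree, tuple):
        left, right, _ = tree
        return leaves(left) + leaves(right)
    return [tree]

# Made-up vectors, not taken from the Guardian data.
sample = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 9],
    "C": [4, 3, 2, 1],
    "D": [9, 7, 2, 1],
}
tree = hcluster(sample)
print(leaves(tree[0]), leaves(tree[1]))
```

A dendrogram is then just a drawing of this tree, with the length of each branch reflecting the distance stored at the node where two clusters were joined.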
The code I’ve used is available on github.com; it’s packaged in an F# project called gdata.fsproj. For a direct link to the project click here. (There’s also a demonstration of hierarchical clustering with word counts from blogs, from the TechDays Paris 2009 talk.)
Anyway, I’m not going to dig too deeply into the code, at least for this post, so let’s have a look at the results. First I clustered by country using the following statistics to form my vectors:
Hospital beds per 1000
Nursing and Midwifery Personnel per 1000
One-year-olds Immunised with diphtheria tetanus toxoid and pertussis (DTP)
One-year-olds Immunised with hepatitis B
One-year-olds Immunised with Hib (Hib3) vaccine
Adolescent fertility rate (%)
Births attended by skilled health personnel (%)
Infant mortality rate (per 1 000 live births) both sexes
Maternal mortality ratio (per 100 000 live births)
Neonatal mortality rate (per 1 000 live births)
Life expectancy at birth (years) both sexes
Life expectancy at birth (years) female
Life expectancy at birth (years) male
Deaths among children under five years of age due to HIV/AIDS (%)
Per capita recorded alcohol consumption (litres of pure alcohol) among adults
Population with sustainable access to improved drinking water sources (%) total
Population with sustainable access to improved sanitation (%) total
The statistics were chosen mainly because they were the most complete; it is only possible to compare countries using this technique if all statistics are available. The resulting dendrogram can be seen below:
There are no great surprises in the stats. There appear to be two distinct clusters, one of poor countries towards the bottom of the diagram and one of richer countries towards the top, with the first-world countries located towards the top of this cluster (absolute position doesn’t matter much in the diagram; it’s more about who you’re close to). There are perhaps a few surprises: maybe we wouldn’t have expected to find Canada quite so close to the Ukraine, or the Czech Republic so close to Germany. It may be worth going back to the underlying statistics to find out why this is.
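The completeness requirement mentioned earlier (a country can only be compared if every chosen statistic is available for it) amounts to a filtering step before clustering. A minimal sketch in Python, assuming the data has been loaded into a dictionary of country to statistic values with `None` marking a gap; the function name, country names and figures are made up for illustration:

```python
def complete_rows(table, stats):
    """Keep only the countries that have a value for every statistic in stats.

    table: dict of country -> dict of statistic name -> value (None if missing).
    Returns a dict of country -> list of values, ordered as in stats.
    """
    kept = {}
    for country, values in table.items():
        if all(values.get(s) is not None for s in stats):
            kept[country] = [values[s] for s in stats]
    return kept

# Hypothetical figures, not taken from the Guardian data.
stats = ["beds", "life_expectancy"]
table = {
    "Country X": {"beds": 3.9, "life_expectancy": 79},
    "Country Y": {"beds": None, "life_expectancy": 61},
    "Country Z": {"life_expectancy": 70},
}
vectors = complete_rows(table, stats)
print(sorted(vectors))
```

Dropping incomplete countries keeps every vector the same length, at the cost of leaving some countries out of the dendrogram altogether.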
Perhaps a more interesting analysis is to reverse (transpose) the matrix so that we are now comparing which conditions are related to each other:
Again, the diagram does show some obvious relations. Male and female life expectancies were always going to be statistically similar to overall life expectancy, but it does appear that life expectancy is closely related to infant mortality rates. These in turn are closely correlated to births attended by medical professionals and to access to clean water and sanitation. While this is fairly logical, I think it’s good that we can show, statistically speaking at least, that access to clean water and sanitation is closely linked to better infant mortality rates and life expectancy.
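Reversing the matrix simply means swapping rows and columns, so that each statistic becomes a vector with one value per country. A minimal sketch in Python; the function name and the numbers are hypothetical:

```python
def transpose(row_names, col_names, matrix):
    """Swap rows and columns of a labelled matrix.

    matrix[i][j] is the value of statistic col_names[j] for country row_names[i].
    Returns the new row names, the new column names and the flipped matrix.
    """
    flipped = [list(column) for column in zip(*matrix)]
    return col_names, row_names, flipped

# Hypothetical values, not taken from the Guardian data.
countries = ["Country X", "Country Y", "Country Z"]
statistics = ["infant mortality", "life expectancy"]
by_country = [
    [5, 80],
    [60, 55],
    [20, 71],
]
names, columns, by_statistic = transpose(countries, statistics, by_country)
print(names, by_statistic)
```

Each row of the flipped matrix can then be compared with the others in exactly the same way the country vectors were, which is what produces a tree of related statistics rather than related countries.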
While these first steps in analysing the Guardian data perhaps didn’t turn up anything we didn’t already know, I feel they’ve shown that if you spend a bit of time working with publicly available data you can start to find interesting patterns. I shall definitely be looking at how I can further these experiments.
Feedback was imported from my old blog engine; it’s no longer possible to post feedback here.
re: Collective Intelligence and the Guardian Data-Store - Thibaut Barrère
nice post - great to see the Pearson algorithm at work.
It seems that the first link of your article (Collective Intelligence) doesn’t point to the right page. Thought you’d want to be warned.
re: Collective Intelligence and the Guardian Data-Store - Robert Pickering
Thanks! Corrected now.