Thursday, October 4, 2012

Use Python to extract date of birth from Wikipedia pages

Religion cybernetics and getting things done

Two days ago we were talking about astrology, and a friend mentioned that it would be interesting, from a "religion cybernetics" point of view, to see a list of physicists whose zodiac sign is Taurus. Religion cybernetics: I had heard this term many times and was still puzzled about what the heck it meant, but this time I had a second thought: how long would it take to actually get this done and produce such a list from Wikipedia?

Well, here you go: in the end I did this during my daily telecommuting and in my spare minutes yesterday and today. All in all, it took about 2-3 hours, which is not bad.

The export format

I decided at the outset to start from the List of physicists, visit each individual page, and extract the date of birth. I knew that Wikipedia can be downloaded, but I was too lazy to download the whole thing. On the other hand, I also knew that it is not nice to crawl the human-readable HTML version of the site, so I found a middle ground: the export format, which is basically the wiki content wrapped in some minimal XML metadata.
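Here is a minimal sketch of working with the export format, written for Python 3 using only the standard library. The function names are made up for illustration. It assumes the wiki markup sits in a `<text>` element of the export XML, and it matches the tag by local name because the XML namespace changes with the export schema version.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Special:Export returns the wiki markup of a page wrapped in XML.
EXPORT_BASE = "https://en.wikipedia.org/wiki/Special:Export/"


def export_url(title):
    """Build the Special:Export URL for a page title (hypothetical helper)."""
    return EXPORT_BASE + urllib.parse.quote(title.replace(" ", "_"))


def wikitext_from_export(xml_string):
    """Return the wiki markup stored in the first <text> element, or None."""
    root = ET.fromstring(xml_string)
    for element in root.iter():
        # Tags look like '{namespace}text'; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "text":
            return element.text or ""
    return None
```

To download a page, something like `urllib.request.urlopen(export_url(title)).read().decode("utf-8")` should do, although the server may expect a descriptive User-Agent header.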

The date of birth

After a short investigation, it turned out that the date of birth can usually be found in three different places.

1) Somewhere in the article text.
E.g. Johann Jakob Balmer's page starts with this: "Johann Jakob Balmer (May 1, 1825 – March 12, 1898)"
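A rough sketch of how the first case could be handled with a regular expression. It assumes the date is written as "Month D, YYYY" right after an opening parenthesis (optionally preceded by "born"), as in the Balmer example. Other date styles would need more patterns, and the names here are illustrative only.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# An opening parenthesis, an optional "born", then "Month D, YYYY".
LEAD_DATE = re.compile(r"\(\s*(?:born\s+)?([A-Z][a-z]+)\s+(\d{1,2}),\s*(\d{4})")


def birth_date_from_lead(text):
    """Return the first parenthesised 'Month D, YYYY' date, or None."""
    for match in LEAD_DATE.finditer(text):
        month_name, day, year = match.groups()
        if month_name in MONTHS:  # skip capitalised words that are not months
            return date(int(year), MONTHS[month_name], int(day))
    return None
```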

2) In the so-called infobox.
This is the box on the right side of the page. It can contain the dates in different formats; compare birth_date and death_date in this example:
"{{Infobox scientist 
|name = Johann Jakob Balmer 
|image = Balmer.jpeg 
|image_size = 220px 
|caption = 
|birth_date = May 1, 1825 
|birth_place = [[Lausen]], [[Switzerland]] 
|death_date = {{dda|1898|3|12|1825|5|1}} 
|death_place = [[Basel]], [[Switzerland]] 
|residence = 
|citizenship = 
|nationality = [[Switzerland]] 
|ethnicity = 
|field = [[Mathematics]] 
|work_institutions = 
|alma_mater = [[University of Basel]] 
|doctoral_advisor = 
|doctoral_students = 
|known_for = 
|author_abbrev_bot = 
|author_abbrev_zoo = 
|influences = 
|influenced = 
|prizes = 
|religion = 
|footnotes = 
|signature = }}"
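The two formats in this infobox could be handled along these lines. This is only a sketch: it assumes a date template such as `{{dda|...}}` lists its first date as year, month, day in its first three numeric arguments (true for the example above, where the death date comes first), and that plain values look like "Month D, YYYY". The helper names are invented for the example.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

TEXT_DATE = re.compile(r"([A-Z][a-z]+)\s+(\d{1,2}),\s*(\d{4})")
TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")


def infobox_field(wikitext, field):
    """Return the raw value of '|field = value' on its own line, or None."""
    pattern = re.compile(
        r"^\s*\|\s*" + re.escape(field) + r"\s*=[ \t]*(.*)$", re.MULTILINE)
    match = pattern.search(wikitext)
    return match.group(1).strip() if match else None


def parse_date_value(value):
    """Parse '{{template|Y|M|D|...}}' or 'Month D, YYYY' into a date."""
    template = TEMPLATE.search(value)
    if template:
        # Keep only the numeric arguments; the first three are Y, M, D.
        numbers = [int(part) for part in template.group(1).split("|")
                   if part.strip().isdigit()]
        if len(numbers) >= 3:
            return date(numbers[0], numbers[1], numbers[2])
    text = TEXT_DATE.search(value)
    if text and text.group(1) in MONTHS:
        return date(int(text.group(3)), MONTHS[text.group(1)], int(text.group(2)))
    return None
```

For the Balmer infobox, `parse_date_value(infobox_field(wikitext, "birth_date"))` takes the plain-text route, while the `death_date` value goes through the template route.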


3) Persondata
I learned that every biographical article has some metadata that is not visible on the human-readable page. Its purpose is exactly to help work like mine, that is, the automatic extraction of information. What I did not expect is that it can contain the dates in different formats as well.
"{{Persondata  
| NAME = Balmer, Johann 
| ALTERNATIVE NAMES = 
| SHORT DESCRIPTION = 
| DATE OF BIRTH = May 1, 1825 
| PLACE OF BIRTH = [[Lausen]], [[Switzerland]] 
| DATE OF DEATH = March 12, 1898 
| PLACE OF DEATH = [[Basel]] 
}}"
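A small sketch of pulling the Persondata fields into a dictionary. It assumes the layout shown above: one `| KEY = value` pair per line, upper-case keys, and the closing braces on a line of their own. The function name is hypothetical. Note that Persondata was later deprecated on English Wikipedia in favour of Wikidata, so this would only apply to older revisions or dumps.

```python
import re

# The whole template, up to the closing braces on their own line.
PERSONDATA_BLOCK = re.compile(r"\{\{Persondata(.*?)\n\s*\}\}", re.DOTALL)
# One '| KEY = value' line; the value may be empty.
PERSONDATA_FIELD = re.compile(r"^\s*\|\s*([A-Z ]+?)\s*=[ \t]*(.*)$", re.MULTILINE)


def parse_persondata(wikitext):
    """Return the Persondata fields as a dict, or {} if there is no block."""
    block = PERSONDATA_BLOCK.search(wikitext)
    if not block:
        return {}
    return {key.strip(): value.strip()
            for key, value in PERSONDATA_FIELD.findall(block.group(1))}
```

The value of `parse_persondata(wikitext).get("DATE OF BIRTH")` is still a free-form string, so it needs the same kind of date parsing as the infobox values.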

The cache

Since I knew I would run the whole process several times until I got everything right, I decided to save every page I downloaded, so that I would not have to download anything twice. I also saved the dictionaries I built to files, so that later I could load them and rerun one step of the process without rerunning all the preceding steps.
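The caching idea can be sketched like this. The file layout, the function names, and the use of pickle for the intermediate dictionaries are all assumptions made for the example. The `fetch` argument stands for whatever function actually downloads a page.

```python
import os
import pickle
import urllib.parse


def cached_fetch(title, fetch, cache_dir="cache"):
    """Return the page for 'title', calling fetch(title) only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Percent-encode the title so it is safe to use as a file name.
    path = os.path.join(cache_dir, urllib.parse.quote(title, safe="") + ".xml")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    content = fetch(title)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return content


def save_step(result, path):
    """Store the result of one processing step (for example a dict)."""
    with open(path, "wb") as handle:
        pickle.dump(result, handle)


def load_step(path):
    """Load a previously saved result so earlier steps need not be rerun."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Passing the download function in as a parameter keeps the cache independent of how the pages are retrieved, which also makes it easy to try out with a fake fetcher.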

Epilog

Just after I finished writing this up, I did one last Google search. It's a shame I only read this Stack Overflow thread afterwards; I should have started with it. There are several tools that I could have used:
- Pywikipediabot: a collection of Python scripts for automating work on Wikipedia articles
- How to extract Persondata from the SQL dump files
- Wikidata
- This is the worst one: forget the export format. There is an API that lets you retrieve data in JSON format. Next time, start here: http://www.mediawiki.org/wiki/API/Tutorial
- There is a wiki format parser module that will return the date of birth
I could rewrite this in a few lines of code. But I won't do it now, because it is DONE. Next time I grab info from Wikipedia, I will be smarter.
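For next time, a hedged sketch of what the API route might look like. It assumes the standard `action=query` request with `prop=revisions` and `rvprop=content`, and the legacy JSON layout in which the page text sits under the `"*"` key of each revision. Newer API options may return a different structure, so treat this as a starting point rather than a recipe. The function names are made up.

```python
import json
import urllib.parse

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def api_url(title):
    """Build a query URL asking for the current wiki markup of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "titles": title,
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def wikitext_from_api(response_text):
    """Pull the wiki markup out of a legacy-format JSON response, or None."""
    pages = json.loads(response_text).get("query", {}).get("pages", {})
    for page in pages.values():
        for revision in page.get("revisions", []):
            # Assumption: legacy format, content stored under the "*" key.
            return revision.get("*")
    return None
```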

The code:
