Fetching the data
Luckily for us, the team behind Stack Overflow provides most of the data behind the Stack Exchange universe, to which Stack Overflow belongs, under a CC Wiki license. At the time of writing, the latest data dump can be found at http://www.clearbits.net/torrents/2076-aug-2012. Most likely, this page will contain a pointer to an updated dump when you read it.
After downloading and extracting it, we have around 37 GB of data in XML format, split across the files listed in the following table:
| File | Size (MB) | Description |
|---|---|---|
| | 309 | Badges of users |
| | 3,225 | Comments on questions or answers |
| | 18,370 | Edit history |
| `posts.xml` | 12,272 | Questions and answers (this is what we need) |
| | 319 | General information about users |
| | 2,200 | Information on votes |
As the files are more or less self-contained, we can delete all of them except `posts.xml`; it contains all the questions and answers as individual `row` tags within the root tag `posts`. Refer...
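Because `posts.xml` is roughly 12 GB, loading it into memory in one go is unlikely to work on a typical machine. The following is a minimal sketch, not necessarily the approach used later, of how the `row` tags could be read one at a time with the standard library's `xml.etree.ElementTree.iterparse`. The helper name `iter_rows` and the attribute names in the sample (`Id`, `Title`) are our own assumptions for illustration; the real attributes should be checked against the actual file.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict, one at a time.

    `source` is a filename or a binary file object. The file is parsed
    incrementally, and already-processed rows are dropped from the tree,
    so memory use stays roughly constant regardless of file size.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element (<posts>).
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)
            # Discard the rows parsed so far to free their memory.
            root.clear()


# Tiny stand-in for posts.xml; the attribute names are hypothetical.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" Title="First question" />
  <row Id="2" Title="Second question" />
</posts>
"""

rows = list(iter_rows(io.BytesIO(SAMPLE)))
print(len(rows), rows[0]["Id"])  # prints: 2 1
```

For the real dump, we would pass the path of the extracted file instead, for example `iter_rows("posts.xml")`, and process each row inside the loop rather than collecting all of them in a list.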