Harnessing data from various sources
Information can be described as structured, unstructured, or sometimes a mix of the two—semi-structured.
In a very general sense, structured data is anything that can be parsed by an algorithm. Common examples include JSON, CSV, and XML. Given structured data, we can write code that dissects the underlying format and readily produces useful results. Because mining structured data is a deterministic process, we can automate the parsing, which in turn lets us gather more input to feed our data analysis algorithms.
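As a minimal sketch of what "parsed by an algorithm" means in practice, the following Python snippet reads the same two records from JSON, CSV, and XML using only the standard library. The sample records and the function names are hypothetical and chosen purely for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample records: the same two rows in three structured formats.
JSON_TEXT = '[{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]'
CSV_TEXT = "name,age\nAda,36\nAlan,41\n"
XML_TEXT = (
    "<people>"
    "<person><name>Ada</name><age>36</age></person>"
    "<person><name>Alan</name><age>41</age></person>"
    "</people>"
)


def parse_json(text):
    """Return a list of (name, age) tuples from a JSON array of objects."""
    return [(row["name"], int(row["age"])) for row in json.loads(text)]


def parse_csv(text):
    """Return a list of (name, age) tuples from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["name"], int(row["age"])) for row in reader]


def parse_xml(text):
    """Return a list of (name, age) tuples from <person> elements."""
    root = ET.fromstring(text)
    return [
        (person.findtext("name"), int(person.findtext("age")))
        for person in root.findall("person")
    ]


if __name__ == "__main__":
    samples = [(parse_json, JSON_TEXT), (parse_csv, CSV_TEXT), (parse_xml, XML_TEXT)]
    for parser, text in samples:
        print(parser.__name__, parser(text))
```

Because each format follows fixed rules, all three parsers recover identical records without any guesswork, which is what makes the process easy to automate.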
Unstructured data is everything else. It is data not defined in a specified manner. Written languages such as English are often regarded as unstructured because of the difficulty in parsing a data model out of a natural sentence.
In our search for good data, we will often find a mix of structured and unstructured text. This is called semi-structured text.
This recipe will primarily focus on obtaining structured and semi-structured data from the following sources.
Tip
Unlike most recipes in this book, this recipe is a survey of data sources rather than a code walkthrough. The best way to read this book is to skip around to the recipes that interest you.
How to do it...
We will browse through the links provided in the following sections to build up a list of sources that offer interesting data in usable formats. This list is by no means exhaustive.
Some of these sources have an Application Programming Interface (API) that allows more sophisticated access to interesting data. An API specifies how a program may interact with a service: which requests it can make, and the format in which data is returned.
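To make the idea of an API more concrete, here is a rough Python sketch of the two halves of a typical web API interaction: encoding a request and decoding the response. The endpoint, the parameter names, and the shape of the response are all hypothetical and do not describe any of the services listed in this recipe; always consult the provider's documentation for the real details. No network call is made, and a canned response stands in for the server's reply.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/articles"


def build_request_url(query, api_key, page=0):
    """Encode the request parameters into a URL (parameter names are assumed)."""
    params = {"q": query, "page": page, "api-key": api_key}
    return BASE_URL + "?" + urlencode(params)


def extract_headlines(response_text):
    """Pull the headline strings out of a JSON response body (assumed layout)."""
    payload = json.loads(response_text)
    return [doc["headline"] for doc in payload["results"]]


# A canned response standing in for what such a service might return.
SAMPLE_RESPONSE = (
    '{"results": [{"headline": "First story"}, {"headline": "Second story"}]}'
)

if __name__ == "__main__":
    print(build_request_url("open data", "YOUR_KEY"))
    print(extract_headlines(SAMPLE_RESPONSE))
```

In real use, the URL would be fetched over HTTP and the response body passed to a function like extract_headlines; most providers also require registering for an API key first.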
News
The New York Times has some of the most polished API documentation available, covering everything from real-estate data to article search results. This documentation can be found at http://developer.nytimes.com.
The Guardian also supports a massive datastore with over a million articles at http://www.theguardian.com/data.
USA TODAY provides some interesting resources on book, movie, and music reviews. The technical documentation can be found at http://developer.usatoday.com.
The BBC features some interesting API endpoints, including information on BBC programs and music, documented at http://www.bbc.co.uk/developer/technology/apis.html.
Private
Facebook, Twitter, Instagram, Foursquare, Tumblr, SoundCloud, Meetup, and many other social networking sites support APIs to access some degree of social information.
For specific APIs such as weather or sports, Mashape is a centralized search engine that helps narrow down the search to some lesser-known sources. Mashape is located at https://www.mashape.com/.
Many public data sources can be explored and visualized using the Google Public Data search located at http://www.google.com/publicdata.
For a list of all countries with names in various data formats, refer to the repository located at https://github.com/umpirsky/country-list.
Academic
Some data sources are hosted openly by universities around the world for research purposes.
To analyze health care data, turn to the Institute for Health Metrics and Evaluation (IHME) at the University of Washington, which collects rigorous and comparable measurements of the world's most important health problems. Navigate to http://www.healthdata.org for more information.
The MNIST database of handwritten digits, maintained by researchers from NYU, Google Labs, and Microsoft Research, provides training and test sets of size-normalized, centered images of handwritten digits. Download the data from http://yann.lecun.com/exdb/mnist.
Nonprofits
Human Development Reports publishes annual updates ranging from international data about adult literacy to the number of people owning personal computers. It describes its data as drawn from a variety of public international sources and as representing the most current statistics available for those indicators. More information is available at http://hdr.undp.org/en/statistics.
The World Bank is the source for poverty and world development data. It regards itself as a free source that enables open access to data about development in countries around the globe. Find more information at http://data.worldbank.org/.
The World Health Organization provides data and analyses for monitoring the global health situation. See more information at http://www.who.int/research/en.
UNICEF also releases interesting statistics. The UNICEF database contains statistical tables for child mortality, diseases, water sanitation, and more vitals.
According to its website, UNICEF plays a central role in monitoring the situation of children and women by assisting countries in collecting and analyzing data, helping them develop methodologies and indicators, maintaining global databases, and disseminating and publishing data. Find the resources at http://www.unicef.org/statistics.
The United Nations hosts interesting publicly available political statistics at http://www.un.org/en/databases.
The United States government
If we feel the urge to discover patterns in the United States (U.S.) government like Nicolas Cage did in the feature film National Treasure (2004), then Data.gov is our go-to source. It is the U.S. government's active effort to provide useful data, and it is described as a place to increase "public access to high-value, machine-readable datasets generated by the executive branch of the Federal Government". Find more information at http://www.data.gov.
The United States Census Bureau releases population counts, housing statistics, area measurements, and more. These can be found at http://www.census.gov.