Information can be described as structured, unstructured, or sometimes a mix of the two—semi-structured.
Unstructured data is everything else. It is data not defined in a specified manner. Written languages such as English are often regarded as unstructured because of the difficulty in parsing a data model out of a natural sentence.
In our search for good data, we will often find a mix of structured and unstructured text. This is called semi-structured text.
This recipe will primarily focus on obtaining structured and semi-structured data from the following sources.
We will browse through the links provided in the following sections to build up a list of sources to harness interesting data in usable formats. However, this list is not at all exhaustive.
Some of these sources have an Application Programming Interface (API) that allows more sophisticated access to interesting data. An API specifies the interactions and defines how data is communicated.
Facebook, Twitter, Instagram, Foursquare, Tumblr, SoundCloud, Meetup, and many other social networking sites support APIs to access some degree of social information.
For specific APIs such as weather or sports, Mashape is a centralized search engine to narrow down the search to some lesser-known sources. Mashape is located at https://www.mashape.com/
Most data sources can be visualized using the Google Public Data search located at http://www.google.com/publicdata.
For a list of all countries with names in various data formats, refer to the repository located at https://github.com/umpirsky/country-list.
Some data sources are hosted openly by universities around the world for research purposes.
To analyze health care data, the University of Washington has published Institute for Health Metrics and Evaluation (IHME) to collect rigorous and comparable measurement of the world's most important health problems. Navigate to http://www.healthdata.org for more information.
The MNIST database of handwritten digits from NYU, Google Labs, and Microsoft Research is a training set of normalized and centered samples for handwritten digits. Download the data from http://yann.lecun.com/exdb/mnist.
Human Development Reports publishes annual updates ranging from international data about adult literacy to the number of people owning personal computers. It describes itself as having a variety of public international sources and represents the most current statistics available for those indicators. More information is available at http://hdr.undp.org/en/statistics.
The World Bank is the source for poverty and world development data. It regards itself as a free source that enables open access to data about development in countries around the globe. Find more information at http://data.worldbank.org/.
The World Health Organization provides data and analyses for monitoring the global health situation. See more information at http://www.who.int/research/en.
UNICEF also releases interesting statistics, as the quote from their website suggests:
"The UNICEF database contains statistical tables for child mortality, diseases, water sanitation, and more vitals. UNICEF claims to play a central role in monitoring the situation of children and women—assisting countries in collecting and analyzing data, helping them develop methodologies and indicators, maintaining global databases, disseminating and publishing data. Find the resources at http://www.unicef.org/statistics."
The United Nations hosts interesting publicly available political statistics at http://www.un.org/en/databases.
The United States government
If we crave the urge to discover patterns in the United States (U.S.) government like Nicholas Cage did in the feature film National Treasure (2004), then http://www.data.gov/ is our go-to source. It's the U.S. government's active effort to provide useful data. It is described as a place to increase "public access to high-value, machine-readable datasets generated by the executive branch of the Federal Government". Find more information at http://www.data.gov.
The United States Census Bureau releases population counts, housing statistics, area measurements, and more. These can be found at http://www.census.gov.