Haskell Data Analysis Cookbook

Explore intuitive data analysis techniques and powerful machine learning methods using over 130 practical recipes

Product type: Paperback
Published in: Jun 2014
ISBN-13: 9781783286331
Length: 334 pages
Edition: 1st Edition
Author: Nishant Shukla

Harnessing data from various sources

Information can be described as structured, unstructured, or sometimes a mix of the two—semi-structured.

In a very general sense, structured data is anything that can be parsed by an algorithm. Common examples include JSON, CSV, and XML. If given structured data, we can design a piece of code to dissect the underlying format and easily produce useful results. As mining structured data is a deterministic process, it allows us to automate the parsing. This in effect lets us gather more input to feed our data analysis algorithms.
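
As a rough illustration of what "a piece of code to dissect the underlying format" might look like, the following minimal sketch splits simple CSV text into rows and fields using only the Haskell Prelude. The names `splitOn` and `parseSimpleCsv` are hypothetical helpers invented for this example, and the sketch assumes there are no quoted fields or embedded commas; real-world CSV would call for a dedicated parsing library.

```haskell
-- | Split a string on a delimiter character (hypothetical helper).
-- Empty fields are kept, so "a,,b" yields ["a", "", "b"].
splitOn :: Char -> String -> [String]
splitOn delim s = case break (== delim) s of
  (field, [])       -> [field]                      -- no delimiter left
  (field, _ : rest) -> field : splitOn delim rest   -- drop delimiter, recurse

-- | Parse simple CSV text into a list of rows, each a list of fields.
-- Assumption: no quoted fields, no embedded commas or newlines.
parseSimpleCsv :: String -> [[String]]
parseSimpleCsv = map (splitOn ',') . lines
```

Loaded into GHCi, `parseSimpleCsv "name,age\nAlice,30\n"` evaluates to `[["name","age"],["Alice","30"]]`. Because the format is fixed, the same function works unchanged on any input that follows it, which is what makes the parsing easy to automate.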

Unstructured data is everything else. It is data not defined in a specified manner. Written languages such as English are often regarded as unstructured because of the difficulty in parsing a data model out of a natural sentence.

In our search for good data, we will often find a mix of structured and unstructured text. This is called semi-structured text.

This recipe will primarily focus on obtaining structured and semi-structured data from the following sources.

Tip

Unlike most recipes in this book, this recipe is a survey of data sources rather than a coding walkthrough. The best way to read this book is by skipping around to the recipes that interest you.

How to do it...

We will browse through the links provided in the following sections to build up a list of sources that offer interesting data in usable formats. This list is by no means exhaustive.

Some of these sources have an Application Programming Interface (API) that allows more sophisticated access to interesting data. An API specifies the interactions and defines how data is communicated.

News

The New York Times has some of the most polished API documentation available, covering anything from real-estate data to article search results. This documentation can be found at http://developer.nytimes.com.

The Guardian also supports a massive datastore with over a million articles at http://www.theguardian.com/data.

USA TODAY provides some interesting resources on books, movies, and music reviews. The technical documentation can be found at http://developer.usatoday.com.

The BBC features some interesting API endpoints, including information on BBC programs and music, documented at http://www.bbc.co.uk/developer/technology/apis.html.

Private

Facebook, Twitter, Instagram, Foursquare, Tumblr, SoundCloud, Meetup, and many other social networking sites support APIs to access some degree of social information.

For specific kinds of APIs, such as weather or sports, Mashape is a centralized search engine that helps narrow the search down to lesser-known sources. Mashape is located at https://www.mashape.com/.

Most data sources can be visualized using the Google Public Data search located at http://www.google.com/publicdata.

For a list of all countries with names in various data formats, refer to the repository located at https://github.com/umpirsky/country-list.

Academic

Some data sources are hosted openly by universities around the world for research purposes.

To analyze health care data, the University of Washington is home to the Institute for Health Metrics and Evaluation (IHME), which collects rigorous and comparable measurements of the world's most important health problems. Navigate to http://www.healthdata.org for more information.

The MNIST database of handwritten digits from NYU, Google Labs, and Microsoft Research is a collection of size-normalized and centered images of handwritten digits, split into training and test sets. Download the data from http://yann.lecun.com/exdb/mnist.

Nonprofits

Human Development Reports publishes annual updates ranging from international data about adult literacy to the number of people owning personal computers. It describes its data as drawn from a variety of public international sources and as representing the most current statistics available for those indicators. More information is available at http://hdr.undp.org/en/statistics.

The World Bank is the source for poverty and world development data. It regards itself as a free source that enables open access to data about development in countries around the globe. Find more information at http://data.worldbank.org/.

The World Health Organization provides data and analyses for monitoring the global health situation. See more information at http://www.who.int/research/en.

UNICEF also releases interesting statistics. The UNICEF database contains statistical tables for child mortality, diseases, water sanitation, and other vital statistics.

UNICEF describes itself as playing a central role in monitoring the situation of children and women: assisting countries in collecting and analyzing data, helping them develop methodologies and indicators, maintaining global databases, and disseminating and publishing data. Find the resources at http://www.unicef.org/statistics.

The United Nations hosts interesting publicly available political statistics at http://www.un.org/en/databases.

The United States government

If we feel the urge to discover patterns in the United States (U.S.) government like Nicolas Cage did in the feature film National Treasure (2004), then http://www.data.gov/ is our go-to source. It is the U.S. government's active effort to provide useful data, and it is described as a place to increase "public access to high-value, machine-readable datasets generated by the executive branch of the Federal Government".

The United States Census Bureau releases population counts, housing statistics, area measurements, and more. These can be found at http://www.census.gov.
