In the past, getting data for machine learning projects was a challenge. Today, there is a rich set of public data sources well suited to machine learning.
Identifying data sources for practical machine learning
Getting ready
In addition to the university and government sources, there are many other open sources of data that can be used to learn and code your own examples and projects. We will list the data sources and show you how to best obtain and download data for each chapter.
How to do it...
The following is a list of open data sources worth exploring if you would like to develop applications in this field:
- UCI machine learning repository: This is an extensive library with search functionality. At the time of writing, there were more than 350 datasets. You can click on the https://archive.ics.uci.edu/ml/index.html link to see all the datasets or look for a specific set using a simple search (Ctrl + F).
- Kaggle datasets: You need to create an account, but you can then download any dataset, whether for learning or for competing in machine learning competitions. The https://www.kaggle.com/competitions link provides details for exploring and learning more about Kaggle and the inner workings of machine learning competitions.
- MLdata.org: A public site open to all with a repository of datasets for machine learning enthusiasts.
- Google Trends: You can find statistics on search volume (as a proportion of total search) for any given term since 2004 on http://www.google.com/trends/explore.
- The CIA World Factbook: The https://www.cia.gov/library/publications/the-world-factbook/ link provides information on the history, population, economy, government, infrastructure, and military of 267 countries.
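Many of these sources publish their data as plain delimited text files, so a first look at a dataset can be as simple as downloading the file and counting its rows and columns. The following is a minimal sketch using only the Python standard library. The function names `fetch_rows` and `summarize` are made up for this illustration, and the URL in the usage comment is a placeholder, not a real dataset link. The sketch also assumes that the file is UTF-8 encoded.

```python
import csv
import io
import urllib.request


def fetch_rows(url, delimiter=","):
    """Download a delimited text file and return its non-empty rows.

    Each row is a list of strings. Assumes the file is UTF-8 encoded.
    """
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    # csv.reader yields an empty list for a blank line; skip those.
    return [row for row in reader if row]


def summarize(rows):
    """Return (number of rows, number of columns in the first row)."""
    n_cols = len(rows[0]) if rows else 0
    return len(rows), n_cols


# Usage (placeholder URL; substitute the direct link to a data file
# from one of the sources listed above):
#
#   rows = fetch_rows("https://example.org/path/to/dataset.csv")
#   print(summarize(rows))
```

Some sources, such as Kaggle, require you to log in before downloading, so a plain URL fetch like this one may not work for them. In that case, download the file manually and pass a local `file://` URL instead.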
See also
Other sources for machine learning data:
- SMS spam data: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
- Financial dataset from Lending Club https://www.lendingclub.com/info/download-data.action
- Research data from Yahoo http://webscope.sandbox.yahoo.com/index.php
- Amazon AWS public dataset http://aws.amazon.com/public-data-sets/
- Labeled visual data from ImageNet http://www.image-net.org
- Census datasets http://www.census.gov
- Compiled YouTube dataset http://netsg.cs.sfu.ca/youtubedata/
- Collected rating data from the MovieLens site http://grouplens.org/datasets/movielens/
- Enron dataset available to the public http://www.cs.cmu.edu/~enron/
- Dataset for the classic book The Elements of Statistical Learning http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html
- IMDB movie dataset http://www.imdb.com/interfaces
- Million Song dataset http://labrosa.ee.columbia.edu/millionsong/
- Dataset for speech and audio http://labrosa.ee.columbia.edu/projects/
- Face recognition data http://www.face-rec.org/databases/
- Social science data http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies
- Bulk datasets from Cornell University http://arxiv.org/help/bulk_data_s3
- Project Gutenberg datasets http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
- Datasets from World Bank http://data.worldbank.org
- Lexical database from WordNet http://wordnet.princeton.edu
- Collision data from NYPD http://nypd.openscrape.com/#/
- Dataset for congressional roll calls and others http://voteview.com/dwnl.htm
- Large graph datasets from Stanford http://snap.stanford.edu/data/index.html
- Rich set of data from datahub https://datahub.io/dataset
- Yelp's academic dataset https://www.yelp.com/academic_dataset
- Source of data from GitHub https://github.com/caesar0301/awesome-public-datasets
- Dataset archives from Reddit https://www.reddit.com/r/datasets/
There are also some specialized datasets (for example, Spanish-language data, gene expression data, and IMF data) that might be of interest to you:
- Datasets from Colombia (in Spanish): http://www.datos.gov.co/frm/buscador/frmBuscador.aspx
- Dataset from cancer studies http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
- Research data from Pew http://www.pewinternet.org/datasets/
- Data from the State of Illinois, USA https://data.illinois.gov
- Data from freebase.com http://www.freebase.com
- Datasets from the UN and its associated agencies http://data.un.org
- International Monetary Fund datasets http://www.imf.org/external/data.htm
- UK government data https://data.gov.uk
- Open data from Estonia http://pub.stat.ee/px-web.2001/Dialog/statfile1.asp
- Many machine learning libraries in R bundle datasets that can be exported as CSV https://www.r-project.org
- Gene expression datasets http://www.ncbi.nlm.nih.gov/geo/