Gathering the data
Legal considerations aside, there is no real limit on the kind of content you can store in a dataset: tabular data, images, or text. If it fits within the size requirements, you can store it. This includes data harvested from other sources; tweets collected by hashtag or topic are among the popular datasets at the time of writing:
Figure 2.6: Tweets are among the most popular datasets
Discussion of the different frameworks for harvesting data from social media (Twitter, Reddit, and so on) is outside the scope of this book.
Andrew Maranhão
https://www.kaggle.com/andrewmvd
We spoke to Andrew Maranhão (aka Larxel), Datasets Grandmaster (number 1 in Datasets at the time of writing) and Senior Data Scientist at the Hospital Albert Einstein in São Paulo, about his rise to Datasets success, his tips for creating datasets, and his general experiences on Kaggle.
What’s your favourite kind of competition...