Data science is the discipline of extracting actionable knowledge from data of various forms. The name data science emerged quite recently: it was coined by DJ Patil and Jeff Hammerbacher and popularized in the 2012 article Data Scientist: The Sexiest Job of the 21st Century. But the discipline itself had existed for quite a while before that, under other names such as data mining or predictive analytics. Data science, like its predecessors, is built on statistics and machine learning algorithms for knowledge extraction and model building.
The science part of the term data science is no coincidence: if we look up science, its definition can be summarized as the systematic organization of knowledge in terms of testable explanations and predictions. This is exactly what data scientists do. By extracting patterns from available data, they can make predictions about future unseen data, and they make sure the predictions are validated beforehand.
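As a rough illustration of this idea, the following minimal sketch extracts a pattern (a straight line) from part of a data set and then validates its predictions on points it has not seen. The numbers are made up for illustration, and the helper names `fit_line` and `mean_abs_error` are hypothetical, not taken from any library.

```python
# Toy data, invented for this sketch: y is roughly 2 * x + 1.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8),
        (4, 9.1), (5, 11.0), (6, 12.9), (7, 15.2)]

# Hold out the last two points to play the role of "future unseen data".
train, test = data[:6], data[6:]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs_error(model, points):
    """Average absolute difference between predictions and actual values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)


# Learn the pattern from the training part only.
model = fit_line(train)

# Validate on the held-out part before trusting the model.
test_error = mean_abs_error(model, test)
print("slope=%.2f intercept=%.2f held-out error=%.2f" % (model[0], model[1], test_error))
```

A small error on the held-out points suggests that the learned pattern generalizes beyond the data it was fitted on. A large one would be a signal to revisit the model before using it.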
Nowadays, data science is used across many fields, including (but not limited to):
- Banking: Risk management (for example, credit scoring), fraud detection, trading
- Insurance: Claims management (for example, accelerating claim approval), risk and loss estimation, and fraud detection
- Health care: Predicting diseases (such as strokes, diabetes, cancer) and relapses
- Retail and e-commerce: Market basket analysis (identifying products that go well together), recommendation engines, product categorization, and personalized search
This book covers the following practical use cases:
- Predicting whether a URL is likely to appear on the first page of a search engine
- Predicting how fast an operation will be completed given the hardware specifications
- Ranking text documents for a search engine
- Checking whether there is a cat or a dog in a picture
- Recommending friends in a social network
- Processing large-scale textual data on a cluster of computers
In all these cases, we will use data science to learn from data and use the learned knowledge to solve a particular business problem.
We will also use a running example throughout the book: building a search engine. We will use it to illustrate many data science concepts, such as supervised machine learning, dimensionality reduction, text mining, and learning-to-rank models.