What is data science?
Before we go any further, let's look at some basic definitions that we will use throughout this book. The great/awful thing about this field is that it is so young that these definitions can differ from textbook to newspaper to whitepaper.
Basic terminology
The definitions that follow are general enough to be used in daily conversations, and they serve the purpose of this book, which is an introduction to the principles of data science.
Let's start by defining what data is. This might seem like a silly first definition to look at, but it is very important. Whenever we use the word "data," we refer to a collection of information in either an organized or unorganized format. These formats have the following qualities:
- Organized data: This refers to data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.
- Unorganized data: This is the type of data that is in a free form, usually text or raw audio/signals that must be parsed further to become organized.
Whenever you open Excel (or any other spreadsheet program), you are looking at a blank row/column structure waiting for organized data. These programs don't do well with unorganized data. For the most part, we will deal with organized data as it is the easiest to glean insights from, but we will not shy away from looking at raw text and methods of processing unorganized forms of data.
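To make the distinction concrete, here is a minimal Python sketch that turns a little unorganized text into organized rows. The order notes, the column names, and the `parse_note` helper are all made up for illustration, and real free-form text is rarely regular enough for a single pattern like this one.

```python
import re

# Unorganized data: free-form text (hypothetical example notes).
raw_notes = [
    "Order 1001: 3 widgets shipped to Boston",
    "Order 1002: 12 gadgets shipped to Denver",
]

# Assumed pattern for this made-up note format.
pattern = re.compile(r"Order (\d+): (\d+) (\w+) shipped to (\w+)")


def parse_note(note):
    """Turn one free-form note into a row (a dict of column name -> value)."""
    match = pattern.fullmatch(note)
    if match is None:
        return None  # this note could not be organized with the pattern
    order_id, quantity, item, city = match.groups()
    return {
        "order_id": int(order_id),
        "quantity": int(quantity),
        "item": item,
        "city": city,
    }


# Organized data: each row is one observation, each key is a characteristic.
rows = [parse_note(note) for note in raw_notes]
for row in rows:
    print(row)
```

Each dictionary plays the role of a spreadsheet row, and its keys play the role of column headers. Text that does not fit the pattern comes back as `None`, which is a reminder that parsing unorganized data usually takes more work than this.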
Data science is the art and science of acquiring knowledge through data.
What a small definition for such a big topic, and rightfully so! Data science covers so many things that it would take pages to list them all (I should know: I tried and was told to edit it down).
Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:
- Make decisions
- Predict the future
- Understand the past/present
- Create new industries/products
This book is all about the methods of data science, including how to process data, gather insights, and use those insights to make informed decisions and predictions.
Data science is about using data in order to gain new insights that you would otherwise have missed.
As an example, using data science, clinics can identify patients who are likely to miss an appointment. Knowing this in advance lets providers offer those slots to other patients, which can help improve margins.
Data science won't replace the human brain; it complements it, working alongside it. Data science should not be thought of as an end-all solution to our data woes; it is merely an opinion. It is a very informed opinion, but an opinion nonetheless. It deserves a seat at the table.
Why data science?
In this Data Age, it's clear that we have a surplus of data. But why should that necessitate an entirely new vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it impossible for a human to parse it in a reasonable time frame. Data is collected in various forms and from different sources, and often comes in a very unorganized format.
Data can be missing, incomplete, or just flat-out wrong. Oftentimes, we will have data on very different scales, which makes it tough to compare. Say that we are looking at data in relation to pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. Years differ by a few dozen at most, while mileage can differ by tens of thousands, so the raw numbers cannot be compared directly. Once we clean our data (which we will spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to make explicit the practices and procedures used to discover and apply these relationships in the data.
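As a rough illustration of the scale problem, the sketch below rescales both car characteristics to the range 0 to 1 using min-max scaling. This is one common technique, chosen here as an assumption rather than as the method this book settles on, and the three car records are invented for the example.

```python
# Hypothetical used-car records: (year made, miles on the car).
cars = [
    (2008, 120_000),
    (2015, 45_000),
    (2019, 8_000),
]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values are equal; nothing to spread out
    return [(v - low) / (high - low) for v in values]


years = [year for year, _ in cars]
miles = [mile for _, mile in cars]

scaled_years = min_max_scale(years)
scaled_miles = min_max_scale(miles)

# After scaling, both characteristics live on the same 0-to-1 scale.
for car, y, m in zip(cars, scaled_years, scaled_miles):
    print(car, round(y, 2), round(m, 2))
```

Once both columns share a scale, a difference of 0.5 means "half of the observed range" for year and mileage alike, so the two characteristics can be compared side by side.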
Earlier, we looked at data science from a more historical perspective, but let's take a minute to discuss its role in business today using a very simple example.
Example – xyz123 Technologies
Ben Runkle, the CEO of xyz123 Technologies, is trying to solve a huge problem. The company is consistently losing long-time customers. He does not know why they are leaving, but he must do something fast. He is convinced that in order to reduce his churn, he must create new products and features, and consolidate existing technologies. To be safe, he calls in his chief data scientist, Dr. Hughan. However, she is not convinced that new products and features alone will save the company. Instead, she turns to the transcripts of recent customer service tickets. She reads through the most recent transcripts, finds something surprising, and shows Ben:
- ".... Not sure how to export this; are you?"
- "Where is the button that makes a new list?"
- "Wait, do you even know where the slider is?"
- "If I can't figure this out today, it's a real problem..."
It is clear that customers are having problems with the existing UI/UX and are not upset because of a lack of features. Runkle and Hughan organize a mass UI/UX overhaul, and their sales have never been better.
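A reading like Dr. Hughan's could be roughed out in code by tallying how many tickets mention usability trouble versus missing features. The sketch below is only a toy: the two phrase lists and the `mentions_any` helper are hypothetical, and the usability phrases were hand-picked after reading these same four tickets, so the tally shows the mechanics rather than proving anything.

```python
# The four ticket excerpts quoted above.
tickets = [
    ".... Not sure how to export this; are you?",
    "Where is the button that makes a new list?",
    "Wait, do you even know where the slider is?",
    "If I can't figure this out today, it's a real problem...",
]

# Hypothetical phrase lists chosen for illustration, not a validated taxonomy.
usability_phrases = ["not sure how", "where is", "where the", "figure this out"]
feature_phrases = ["feature request", "wish it could", "please add"]


def mentions_any(text, phrases):
    """Return True if the text contains any of the phrases, ignoring case."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


# Count the tickets that match each group of phrases.
usability_count = sum(mentions_any(t, usability_phrases) for t in tickets)
feature_count = sum(mentions_any(t, feature_phrases) for t in tickets)

print(f"usability-related tickets: {usability_count} of {len(tickets)}")
print(f"feature-related tickets: {feature_count} of {len(tickets)}")
```

On a real ticket archive, the phrase lists would need to be chosen before looking at the tickets being counted, or replaced with a more careful text-processing method.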
Of course, the science used in the last example was minimal, but it makes a point. We tend to call people like Runkle drivers. Today's common go-with-your-gut CEO wants to make all decisions quickly and iterate over solutions until something works. Dr. Hughan is much more analytical. She wants to solve the problem just as much as Runkle, but she turns to user-generated data instead of her gut feeling for answers. Data science is about applying the skills of the analytical mind and using them as a driver would.
Both of these mentalities have their place in today's enterprises; however, it is Hughan's way of thinking that dominates the ideas of data science: using data generated by the company as a source of information, rather than just picking a solution and going with it.