Data mining, a buzzword in the 1990 is the predecessor of data science (the science of data). One of the methodologies popular in the data mining community is called cross industry standard process for data mining (CRISP DM). CRISP DM was created in 1996, and is still used today. I am not endorsing CRISP DM, however I like its general framework. The CRISP DM consists of the following phases, which are not mutually exclusive and can occur in parallel:
- Business understanding: This phase is often taken care of by specialized domain experts. Usually we have a business person formulate a business problem, such as selling more units of a certain product.
- Data understanding: This is also a phase, which may require input from domain experts, however, often a technical specialist needs to get involved more than in the business understanding phase. The domain expert may be proficient with spreadsheet programs, but have trouble with complicated data. In this book, I usually call this phase exploration.
- Data preparation: This is also a phase where a domain expert with only Excel know-how may not be able to help you. This is the phase where we create our training and test datasets. In this book I usually call this phase preprocessing.
- Modeling: This is the phase, which most people associate with machine learning. In this phase we formulate a model, and fit our data.
- Evaluation: In this phase, we evaluate our model, and our data to check whether we were able to solve our business problem.
- Deployment: This phase usually involves setting up the system in a production environment (it is considered good practice to have a separate production system). Typically this is done by a specialized team.
When we learn, we require high quality learning material. We can't learn from gibberish, so we automatically ignore anything that doesn't make sense. A machine learning system isn't able to recognize gibberish, so we need to help it by cleaning the input data. It is often claimed that cleaning the data forms a large part of machine learning. Sometimes cleaning is already done for us, but you shouldn't count on it. To decide how to clean the data, we need to be familiar with the data. There are some projects, which try to automatically explore the data, and do something intelligent, like producing a report. For now, unfortunately, we don't have a solid solution, so you need to do some manual work.
We can do two things, which are not mutually exclusive: first scan the data and second visualize the data. This also depends on the type of data we are dealing with; whether we have a grid of numbers, images, audio, text, or something else. At the end, a grid of numbers is the most convenient form, and we will always work towards having numerical features. I will pretend that we have a table of numbers in the rest of this section.
We want to know if features miss values, how the values are distributed, and what type of features we have. Values can approximately follow a normal distribution, a binomial distribution, a Poisson distribution, or another distribution altogether. Features can be binary: either yes or no, positive or negative, and so on. They can also be categorical: pertaining to a category, for instance continents (Africa, Asia, Europe, Latin America, North America, and so on). Categorical variables can also be ordered—for instance high, medium, and low. Features can also be quantitative, for example temperature in degrees or price in dollars.
Feature engineering is the process of creating or improving features. It's more of a dark art than a science. Features are often created based on common sense, domain knowledge, or prior experience. There are certain common techniques for feature creation, however there is no guarantee that creating new features will improve your results. We are sometimes able to use the clusters found by unsupervised learning as extra features. Deep neural networks are often able to create features automatically.