Creating a Data Science Pipeline
"Pipeline" is a commonly used term in data science. It refers to a predefined list of steps performed in a fixed sequence, one after another. The more clearly each step is defined, the better the quality of the results obtained. OSEMN is one of the most common data science pipelines and can be used to approach almost any kind of data science problem. The acronym is pronounced "awesome."
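To make the idea of "steps performed in sequence" concrete, here is a minimal sketch in plain Python. The function names (`load`, `clean`, `summarize`, `run_pipeline`) and the sample records are hypothetical and chosen only for illustration; a real pipeline would typically read from files or databases and use a library such as pandas.

```python
def load():
    # Hypothetical stand-in for reading data from a file, database, or API.
    return [
        {"Name": " Ann ", "age": 34},
        {"Name": "Bob", "age": None},
    ]


def clean(rows):
    # Rename the "Name" column to "name", trim whitespace,
    # and impute missing ages with the mean of the known ages.
    known = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(known) / len(known)
    return [
        {
            "name": r["Name"].strip(),
            "age": r["age"] if r["age"] is not None else mean_age,
        }
        for r in rows
    ]


def summarize(rows):
    # Compute a few simple descriptive statistics.
    ages = [r["age"] for r in rows]
    return {"rows": len(rows), "min_age": min(ages), "max_age": max(ages)}


def run_pipeline(source, steps):
    # Call the source once, then feed each step's output into the next step.
    data = source()
    for step in steps:
        data = step(data)
    return data


summary = run_pipeline(load, [clean, summarize])
print(summary)
```

Because every step takes the previous step's output as its input, the order of the list passed to `run_pipeline` is the pipeline: adding, removing, or reordering a stage means editing that list only.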
The following figure provides an overview of the typical sequence of actions a data analyst would follow to create a data science pipeline:
Figure 7.12: The OSEMN pipeline
Let's understand the steps in the OSEMN pipeline in a little more detail:
- Obtaining the data, which can come from any source and can be structured, unstructured, or semi-structured.
- Scrubbing the data, which means getting your hands dirty and cleaning it, for example by renaming columns and imputing missing values.
- Exploring the data to find out...