Understanding data ingestion
Tasks in the early stages of the data pipeline (i.e., data ingestion and data storage) usually fall to a machine learning or data engineer rather than the data scientist. However, a data scientist should still understand, at a high level, what happens during these stages.
In the simplest terms, data ingestion involves developing automated processes to collect the data used in data science models. Often, organizations and businesses already have processes in place to collect basic information about their activities, such as tracking website usage or customer purchase transactions. Sometimes, however, answering a particular organizational or business question requires collecting new data. The goal is to automate collection so that the data eventually used in a model is consistent, reliable, and as free of bias as the organization can make it.
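To make the idea of an automated, repeatable collection step more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name ingest_purchases, the two-column purchase schema (customer_id, amount), and the choice of CSV as the source and SQLite as the store are all hypothetical and do not describe any particular organization's pipeline.

```python
import csv
import io
import sqlite3

# Hypothetical schema: every purchase row must supply both fields.
REQUIRED_FIELDS = ("customer_id", "amount")


def ingest_purchases(source, conn):
    """Load purchase rows from a CSV source into SQLite.

    Returns a (loaded, rejected) tuple of row counts.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (customer_id TEXT, amount REAL)"
    )
    loaded, rejected = 0, 0
    for row in csv.DictReader(source):
        # The same checks run on every load, which keeps stored data consistent.
        try:
            if not all(row.get(field) for field in REQUIRED_FIELDS):
                raise ValueError("missing required field")
            record = (row["customer_id"], float(row["amount"]))
        except ValueError:
            # Count bad rows instead of silently dropping or storing them.
            rejected += 1
            continue
        conn.execute("INSERT INTO purchases VALUES (?, ?)", record)
        loaded += 1
    conn.commit()
    return loaded, rejected


# Example run on in-memory data: two valid rows, one missing amount, one non-numeric.
raw = io.StringIO("customer_id,amount\nc1,19.99\nc2,\nc3,abc\nc4,5\n")
connection = sqlite3.connect(":memory:")
print(ingest_purchases(raw, connection))  # (2, 2)
```

Reporting the rejected count, rather than discarding bad rows quietly, is one simple way to see how reliable a source is over time. A real pipeline would typically also schedule the job and log the reasons for each rejection.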
Data ingestion...