Collecting data
Data collection is the process of gathering the source data and storing it in a central, secure location. In the data collection phase, we try to answer the following questions:
- What is the nature of the problem and do we have the right data for it?
- Where is the data and do we have access to the data?
- What can we do to ingest all the data into one central repository?
- How do we safeguard the central data repository?
These questions are crucial in any ML project because, in a real business, data is typically spread across many heterogeneous source systems, and bringing all of it together into a single dataset can be a major challenge.
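To make the consolidation problem concrete, here is a minimal sketch in Python of bringing two differently shaped sources into one dataset. Everything in it is an assumption made for illustration: the two sources (a CSV export and a JSON dump), their field names, the sample values, and the helper names `read_sales`, `read_web`, and `consolidate` are hypothetical and do not come from any particular system.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export from a sales system.
SALES_CSV = """customer_id,amount
1,19.99
2,5.00
"""

# Hypothetical source 2: a JSON dump from a web application.
# Note the different field names and the amount stored as a string.
WEB_JSON = '[{"customerId": 3, "total": "42.50"}]'


def read_sales(text):
    """Parse the CSV source into records that follow a shared schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "customer_id": int(row["customer_id"]),
            "amount": float(row["amount"]),
            "source": "sales",
        }
        for row in reader
    ]


def read_web(text):
    """Parse the JSON source and map its fields onto the same schema."""
    return [
        {
            "customer_id": int(item["customerId"]),
            "amount": float(item["total"]),
            "source": "web",
        }
        for item in json.loads(text)
    ]


def consolidate():
    """Combine both sources into one list of uniformly shaped records."""
    return read_sales(SALES_CSV) + read_web(WEB_JSON)


if __name__ == "__main__":
    for record in consolidate():
        print(record)
```

Each reader hides the quirks of its own source (field names, value types) behind a common record shape, so the combined dataset can be treated uniformly. The `source` field is kept so that every record can still be traced back to the system it came from.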
A common data collection and consolidation process called Extract, Transform, and Load (ETL) has the following steps:
- Extract: Pull the data from the various sources to a single location.
- Transform: During data extraction and consolidation, we may need to change the data format, modify some data...