Data discovery
Data discovery is an important phase in the wrangling pipeline because it helps users understand the data and guides how the subsequent steps should be carried out. For example, if the user examines the data and finds that certain columns have missing values, the data cleansing step should fix those values; missing columns can be added by joining the data with other data sources or by deriving them from the raw data. Essentially, this step gives users an idea of the completeness, usefulness, and relevance of the dataset.
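As a rough illustration of the missing-value check described above, the following minimal sketch counts empty cells per column in a small CSV extract. The sample data, the column names, the `missing_value_report` helper, and the list of tokens treated as missing are all assumptions made for this example; they do not come from any particular dataset or tool.

```python
import csv
import io

# Hypothetical sample standing in for a small extract of the raw data.
SAMPLE_CSV = """order_id,customer,amount
1,alice,10.50
2,,7.25
3,bob,
4,,3.00
"""


def missing_value_report(text, missing_tokens=("", "NA", "null")):
    """Return a {column: number of missing cells} mapping for CSV text.

    A cell counts as missing when it is absent or when its stripped value
    is one of `missing_tokens` (an assumed convention; adjust as needed).
    """
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            value = row.get(name)
            if value is None or value.strip() in missing_tokens:
                counts[name] += 1
    return counts


report = missing_value_report(SAMPLE_CSV)
print(report)
```

A report like this points to the columns that the cleansing step needs to address, for example by filling defaults or by joining with another source.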
There are multiple ways to perform data discovery, including downloading small files to a local machine and exploring them in Excel. We will look at ways to explore the raw data stored in a data lake. Some of the common steps performed during the data discovery phase are as follows:
- Identifying the source data structure/format and its associated properties
- Visualizing the data distribution of the dataset
- Validating...