Data wrangling and profiling
Testing efforts often focus on the quality of the software being developed. In a data factory solution, however, the data is the product, and it must be tested for contract compliance. These tests require profiling the data and assessing whether its shape is fit for purpose. This assessment, called data profiling, takes place as data is wrangled during its ingestion into the data factory. Data wrangling is a mix of deterministic and stochastic processes:
- Transformations (to coerce data into a normalized processable form)
- Redactions (to remove detected errors)
- Reductions (to remove restatements)
- Data masking/obfuscation (for security and privacy)
- Summarizations (facilitating downstream analytics)
- Alt-data expansions (to deterministically expand data using tables from known reference sources)
- Metadata enrichments (to classify and provide semantic context)
- Semantic alignments (to enable data shape fitment...
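
To make the idea of profiling data for contract compliance concrete, here is a minimal sketch in Python. It is an illustration only, not a prescribed implementation: the `profile` and `check_contract` helpers, the contract layout (`type`, `max_null_rate`), and the sample records are all hypothetical names and values chosen for this example.

```python
from collections import Counter


def profile(records, columns):
    """Return per-column null rate, distinct count, and observed type names."""
    total = len(records)
    report = {}
    for col in columns:
        values = [r.get(col) for r in records]
        present = [v for v in values if v is not None]
        report[col] = {
            "null_rate": (total - len(present)) / total if total else 0.0,
            "distinct": len(set(present)),
            "types": Counter(type(v).__name__ for v in present),
        }
    return report


def check_contract(report, contract):
    """Return a list of human-readable contract violations."""
    violations = []
    for col, rules in contract.items():
        stats = report.get(col)
        if stats is None:
            violations.append(f"{col}: column not profiled")
            continue
        if stats["null_rate"] > rules["max_null_rate"]:
            violations.append(
                f"{col}: null rate {stats['null_rate']:.2f} "
                f"exceeds {rules['max_null_rate']:.2f}"
            )
        unexpected = set(stats["types"]) - {rules["type"]}
        if unexpected:
            violations.append(f"{col}: unexpected types {sorted(unexpected)}")
    return violations


# Hypothetical ingested records and a hypothetical data contract.
records = [
    {"id": 1, "price": 10.5},
    {"id": 2, "price": None},
    {"id": 3, "price": "n/a"},
]
contract = {
    "id": {"type": "int", "max_null_rate": 0.0},
    "price": {"type": "float", "max_null_rate": 0.5},
}

report = profile(records, ["id", "price"])
violations = check_contract(report, contract)
print(violations)
```

In this sketch the profile describes the shape of the data (how complete each column is, how many distinct values it holds, which types appear), and the contract check decides whether that shape is fit for purpose. A real data factory would likely profile many more properties, but the separation between measuring and judging is the point being shown.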