Once the dataset objects have been created, they need to be transformed based on the model's requirements. The following diagram shows the flow of dataset transformation:
Some of the most important transformations are as follows:
- Data rearrangements: These select a portion of the data instead of the entire dataset, which is useful for running experiments on a subset.
- Data cleanups: These are extremely important. A cleanup can be as simple as converting a date format, such as from YYYY/MM/DD to MM-DD-YYYY, or removing records that have missing values or incorrect numbers. Another example of data cleansing is removing stop words from text files for an NLP module.
- Data standardization and normalization: These are crucial for data where one or more features come from various sources and have different units and scales...
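
The three kinds of transformation listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field layout (date, height, weight), and the helper names `take_subset`, `clean`, and `standardize` are all hypothetical and do not come from any particular library or from the dataset objects discussed here.

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical records: (date as YYYY/MM/DD, height in cm, weight in kg)
raw_records = [
    ("2021/03/15", 170.0, 65.0),
    ("2021/03/16", 182.0, None),  # missing weight
    ("2021/03/17", 158.0, 52.0),
    ("2021/03/18", 176.0, 80.0),
    ("2021/03/19", 164.0, 59.0),
]

def take_subset(records, n):
    """Data rearrangement: keep only the first n records."""
    return records[:n]

def clean(records):
    """Data cleanup: drop rows with missing values and reformat dates."""
    cleaned = []
    for date_str, height, weight in records:
        if height is None or weight is None:
            continue  # skip incomplete rows
        # Convert YYYY/MM/DD to MM-DD-YYYY
        date = datetime.strptime(date_str, "%Y/%m/%d").strftime("%m-%d-%Y")
        cleaned.append((date, height, weight))
    return cleaned

def standardize(values):
    """Standardization: rescale values to zero mean and unit variance.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

subset = take_subset(raw_records, 4)
cleaned = clean(subset)
# Height and weight use different units, so each column is rescaled separately.
heights = standardize([h for _, h, _ in cleaned])
weights = standardize([w for _, _, w in cleaned])
```

After these steps, the height and weight columns are on a comparable scale even though they started in different units. In a real pipeline, the same ideas would typically be expressed with the transformation methods of whichever dataset library is in use.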