Non-tabular data
This is an elementary project. The concepts it relies on are covered in Chapter 1, Fundamentals of Data Collection, Cleaning, and Preprocessing, and Chapter 2, Essential Statistics for Data Assessment.
The university dataset in the UCI Machine Learning Repository is stored in a non-tabular format: https://archive.ics.uci.edu/ml/datasets/University. Examine its format and perform the following tasks:
- Inspect the data file visually and write down the patterns you notice, such as how each record begins and how each attribute is laid out. Then check whether these patterns are regular enough to extract the data from specific lines.
- Write a function that will systematically read the data file and store the data contained within in a pandas DataFrame.
- The data description mentions both missing data and duplicate data. Identify the missing values and remove the duplicate records.
- Classify the features into numerical features and categorical features.
- Apply min-max normalization to all the numerical...
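As a starting point, the workflow described in the tasks above can be sketched on a small made-up sample. The record layout in SAMPLE (parenthesized records, one attribute per line, and ? as the missing-value marker) is an assumption made for illustration, so check it against the actual file before reusing the patterns. The names SAMPLE, RECORD_START, ATTRIBUTE, and parse_records are hypothetical helpers, not part of pandas or of the dataset.

```python
import re

import pandas as pd

# Hypothetical sample: the real file's layout and values may differ.
SAMPLE = """\
(def-instance alpha
   (state newyork)
   (no-of-students 5000)
   (tuition 7000))
(def-instance beta
   (state ohio)
   (no-of-students 12000)
   (tuition ?))
(def-instance alpha
   (state newyork)
   (no-of-students 5000)
   (tuition 7000))
(def-instance gamma
   (state texas)
   (no-of-students 19000)
   (tuition 3000))
"""

# Assumed patterns: a record opens with "(def-instance NAME" and each
# attribute sits on its own line as "(key value)".
RECORD_START = re.compile(r"^\(def-instance\s+(\S+)")
ATTRIBUTE = re.compile(r"^\s*\((\S+)\s+([^()]+?)\)+\s*$")


def parse_records(text):
    """Read the text line by line and return one DataFrame row per record."""
    records, current = [], None
    for line in text.splitlines():
        start = RECORD_START.match(line)
        if start:
            current = {"name": start.group(1)}
            records.append(current)
            continue
        attr = ATTRIBUTE.match(line)
        if attr and current is not None:
            value = attr.group(2).strip()
            # Assumption: "?" marks a missing value.
            current[attr.group(1)] = None if value == "?" else value
    return pd.DataFrame(records)


df = parse_records(SAMPLE)

# Count missing values per column, then drop exact duplicate rows.
missing_per_column = df.isna().sum()
df = df.drop_duplicates().reset_index(drop=True)

# Treat a column as numerical if every non-missing value parses as a number.
numeric_cols, categorical_cols = [], []
for col in df.columns:
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().sum() == df[col].notna().sum():
        df[col] = converted
        numeric_cols.append(col)
    else:
        categorical_cols.append(col)

# Min-max normalization: (x - min) / (max - min), ignoring missing values.
# A constant column would divide by zero and produce NaN.
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()
df[numeric_cols] = (df[numeric_cols] - col_min) / (col_max - col_min)

print(df)
```

The sketch only removes rows that are exact duplicates. If the real data contains near-duplicates, such as the same university recorded with slightly different values, a different rule is needed, for example deduplicating on the name column alone.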