After loading the data, the second step is to carry out exploratory data analysis (EDA), which lets us get to know the data we will be working with. Some of the questions we try to answer are:
- What kind of data do we actually have, and how should we treat different types?
- What is the distribution of the variables?
- Are there outliers in the data, and how can we treat them?
- Are any transformations required? For example, some models work better with (or require) normally distributed variables, so we might apply a technique such as a log transformation to bring a right-skewed variable closer to normal.
- Does the distribution vary per group (for example, gender or education level)?
- Do we have cases of missing data? How frequent are these, and in which variables?
- Is there a linear relationship between some variables (correlation)?
- Can we create new features using the existing set of variables? An example...
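
Several of the questions above can be answered with a few lines of pandas. The following is a minimal sketch, not a complete EDA workflow. The toy DataFrame, its column names (`age`, `income`, `education`), the 1.5 × IQR outlier rule, and the use of `np.log1p` for the log transformation are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the columns and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 62, 45],
    "income": [30_000, 42_000, 58_000, np.nan, 51_000, 39_000, 400_000, 61_000],
    "education": ["HS", "BSc", "MSc", "BSc", "HS", "BSc", "MSc", "MSc"],
})

# What kind of data do we have? Inspect the column types.
print(df.dtypes)

# What is the distribution of the numeric variables?
print(df.describe())

# Do we have missing data, and in which variables?
missing_counts = df.isna().sum()
print(missing_counts)

# Are there outliers? Flag values outside 1.5 * IQR (an assumed rule of thumb).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
print(df[outlier_mask])

# Is a transformation required? log1p compresses the long right tail.
df["log_income"] = np.log1p(df["income"])

# Does the distribution vary per group?
group_stats = df.groupby("education")["income"].agg(["mean", "median", "count"])
print(group_stats)

# Is there a linear relationship between variables? Pearson correlation.
corr = df[["age", "income"]].corr()
print(corr)
```

Comparisons against missing values evaluate to `False`, so the row with a missing `income` is not flagged as an outlier here. Whether to drop, cap, or keep flagged observations depends on the data and the model.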