Exploratory Data Analysis (EDA) is about taking a dataset and extracting the most important information from it, so that we can get an idea of what the data looks like. It has two main parts: summarization and visualization.
The summarization step is very helpful for understanding data. For numerical variables, in this step we calculate the most important sample statistics:
- The extremes (the minimal and the maximal values)
- The mean value, or the sample average
- The standard deviation, which describes the spread of the data
Often we consider other statistics, such as the median and the quartiles (25% and 75%).
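The statistics listed above can be sketched in plain Java using only the standard library. This is a minimal illustration, not a definitive implementation: the class name `SummaryStats` and its method names are hypothetical, the standard deviation uses the sample formula (dividing by n - 1), and the quantile uses linear interpolation between neighboring sorted values, which is only one of several common conventions. The methods assume a non-empty array (at least two values for the standard deviation).

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of any library.
final class SummaryStats {

    // Smallest value in the array (assumes a non-empty array).
    static double min(double[] x) {
        return Arrays.stream(x).min().getAsDouble();
    }

    // Largest value in the array (assumes a non-empty array).
    static double max(double[] x) {
        return Arrays.stream(x).max().getAsDouble();
    }

    // Sample average.
    static double mean(double[] x) {
        return Arrays.stream(x).average().getAsDouble();
    }

    // Sample standard deviation: divides by (n - 1), so it needs n >= 2.
    static double std(double[] x) {
        double m = mean(x);
        double sumSquares = 0.0;
        for (double v : x) {
            sumSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumSquares / (x.length - 1));
    }

    // Quantile for p in [0, 1], linearly interpolated between the two
    // nearest sorted values. p = 0.5 gives the median; 0.25 and 0.75
    // give the quartiles.
    static double quantile(double[] x, double p) {
        double[] sorted = x.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }
}
```

For example, for the values 2, 4, 4, 4, 5, 5, 7, 9, `SummaryStats.mean` returns 5.0 and `SummaryStats.quantile(data, 0.5)` returns the median, 4.5. Other libraries may use a different quantile convention and report slightly different quartiles for the same data.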
As we have already seen in the previous chapter, Java offers a great set of tools for data preparation. The same set of tools can be used for EDA, and especially for creating summaries.