Exploratory data analysis in Java
Exploratory data analysis (EDA) is about taking a dataset and extracting the most important information from it, so that we get an idea of what the data looks like. This includes two main parts:
The summarization step is very helpful for understanding data. For numerical variables, in this step we calculate the most important sample statistics:
- The extremes (the minimum and maximum values)
- The mean value, or the sample average
- The standard deviation, which describes the spread of the data
Often we also consider other statistics, such as the median and the quartiles (the 25th and 75th percentiles).
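The statistics listed above can be sketched with the plain JDK alone. This is a minimal illustration, not the approach the chapter necessarily takes: the class name `SummaryStats`, the helper methods `sampleStdDev` and `percentile`, and the sample values are all made up for this example. The standard deviation uses the sample formula (dividing by n - 1), and the percentiles use linear interpolation between the closest ranks, which is only one of several common conventions, so other tools may report slightly different quartiles.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

public class SummaryStats {

    // Sample standard deviation (divides by n - 1).
    // Returns NaN when there are fewer than two values.
    static double sampleStdDev(double[] data) {
        if (data.length < 2) {
            return Double.NaN;
        }
        double mean = Arrays.stream(data).average().getAsDouble();
        double sumSquares = 0.0;
        for (double x : data) {
            sumSquares += (x - mean) * (x - mean);
        }
        return Math.sqrt(sumSquares / (data.length - 1));
    }

    // Percentile for p in [0, 1], using linear interpolation between
    // the two closest ranks. Assumes a non-empty array.
    static double percentile(double[] data, double p) {
        double[] sorted = data.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        double pos = p * (sorted.length - 1);
        int lower = (int) Math.floor(pos);
        int upper = (int) Math.ceil(pos);
        double fraction = pos - lower;
        return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
    }

    public static void main(String[] args) {
        // Hypothetical values, used only to exercise the methods
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9};

        // Min, max, and mean come directly from the JDK
        DoubleSummaryStatistics stats = Arrays.stream(values).summaryStatistics();

        System.out.printf("min=%.2f, max=%.2f%n", stats.getMin(), stats.getMax());
        System.out.printf("mean=%.2f, std=%.2f%n", stats.getAverage(), sampleStdDev(values));
        System.out.printf("25%%=%.2f, median=%.2f, 75%%=%.2f%n",
                percentile(values, 0.25), percentile(values, 0.5), percentile(values, 0.75));
    }
}
```

`DoubleSummaryStatistics` covers the extremes and the mean in a single pass over the data; the spread and the quantiles need the extra helpers because the JDK does not provide them out of the box.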
As we have already seen in the previous chapter, Java offers a great set of tools for data preparation. The same set of tools can be used for EDA, and especially for creating summaries.
Search engine datasets
In this chapter, we will use our running example: building a search engine. In Chapter 2, Data Processing Toolbox, we extracted some data from HTML pages...