In the previous chapter, we discussed best practices for approaching data science problems. We looked at CRISP-DM, a methodology for data mining projects, in which one of the first steps is data preprocessing. In this chapter, we will take a closer look at how to do this in Java.
Specifically, we will cover the following topics:
- Standard Java library
- Extensions to the standard library
- Reading data from different sources such as text, HTML, JSON, and databases
- DataFrames for manipulating tabular data
Finally, we will put everything together to prepare the data for the search engine.
By the end of this chapter, you will be able to process data so that it can be used for machine learning and further analysis.