In this recipe, we explore reading, parsing, and preparing a CSV file for a typical ML program. A comma-separated values (CSV) file stores tabular data (numbers and text) as plain text. Each row is a data record, and the first row is usually a header row that holds each field's identifier (more commonly called the column name). Each record consists of one or more fields, separated by commas.
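To make the header-and-record layout concrete, here is a minimal sketch in plain Python (not Spark) that parses a small in-memory CSV string. The column names follow the MovieLens ratings layout. The rows themselves and the `parse_ratings` helper are made up for illustration and are not part of the recipe.

```python
import csv
import io

# Hypothetical sample: the header mirrors the MovieLens ratings columns,
# but the data rows below are invented for illustration only.
SAMPLE = """userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
2,10,4.0,835355493
"""


def parse_ratings(text):
    """Parse CSV text with a header row into a list of typed records."""
    # DictReader consumes the first row as the field names (the header row).
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        # Every field arrives as a string, so convert each one explicitly.
        records.append({
            "userId": int(row["userId"]),
            "movieId": int(row["movieId"]),
            "rating": float(row["rating"]),
            "timestamp": int(row["timestamp"]),
        })
    return records


records = parse_ratings(SAMPLE)
for record in records:
    print(record)
```

Each resulting dictionary corresponds to one data record, keyed by the column names taken from the header row. The explicit type conversion is the "preparing" part: a CSV file carries no type information, so the consumer decides how each field is interpreted.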
Two methods of ingesting and preparing a CSV file for processing in Spark
How to do it...
- The sample CSV data file contains movie ratings. It can be retrieved at http://files.grouplens.org/datasets/movielens/ml-latest-small.zip.
- Once...