Two methods of ingesting and preparing a CSV file for processing in Spark
In this recipe, we explore reading, parsing, and preparing a CSV file for a typical ML program. A comma-separated values (CSV) file normally stores tabular data (numbers and text) in a plain text file. In a typical CSV file, each row is a data record, and most of the time, the first row is called the header row, which stores the fields' identifiers (more commonly referred to as column names). Each record consists of one or more fields, separated by commas.
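To make the header-plus-records layout concrete, here is a minimal sketch using Python's standard `csv` module on a two-line sample in the same shape as the ratings data used later in this recipe (the sample string itself is illustrative, not read from disk):

```python
import csv
import io

# A tiny in-memory sample in the ratings.csv layout:
# a header row naming the fields, then one record per line.
sample = "userId,movieId,rating,timestamp\n1,16,4,1217897793\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)   # the header row holds the column names
record = next(reader)   # each subsequent row is one data record

print(header)  # ['userId', 'movieId', 'rating', 'timestamp']
print(record)  # ['1', '16', '4', '1217897793']
```

Note that a plain CSV parse yields strings; converting fields to numeric types is part of the preparation step covered below.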
How to do it...
- The sample CSV data file contains movie ratings from the MovieLens dataset. The file can be retrieved at http://files.grouplens.org/datasets/movielens/ml-latest-small.zip.
- Once the file is extracted, we will use the ratings.csv file for our CSV program to load the data into Spark. The CSV file will look like the following:
| userId | movieId | rating | timestamp |
|--------|---------|--------|-----------|
| 1 | 16 | 4 | 1217897793 |
| 1 | 24 | 1.5 | 1217895807 |
| 1 | 32 | 4 | 1217896246 |
| 1 | 47 | 4 | 1217896556 |
| 1 | 50 | 4 | 1217896523 |
| 1 | 110 | 4 | 1217896150 |
| 1 | 150 | 3 | 1217895940 |
| 1 | 161 | 4 | 1217897864 |
| 1 | 165 | 3 | 1217897135... |