Two methods of ingesting and preparing a CSV file for processing in Spark
In this recipe, we explore reading, parsing, and preparing a CSV file for a typical ML program. A comma-separated values (CSV) file normally stores tabular data (numbers and text) in a plain text file. In a typical CSV file, each row is a data record, and most of the time, the first row is called the header row, which stores the fields' identifiers (more commonly referred to as column names). Each record consists of one or more fields, separated by commas.
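To make the header-plus-records layout concrete, here is a minimal sketch using Python's standard `csv` module on a two-line sample in the same shape as the ratings data used later in this recipe (the sample string itself is illustrative, not read from disk):

```python
import csv
import io

# A tiny in-memory sample in the ratings.csv layout:
# a header row naming the fields, then one record per line.
sample = "userId,movieId,rating,timestamp\n1,16,4,1217897793\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)   # the header row holds the column names
record = next(reader)   # each subsequent row is one data record

print(header)  # ['userId', 'movieId', 'rating', 'timestamp']
print(record)  # ['1', '16', '4', '1217897793']
```

Note that a plain CSV parse yields strings; converting fields to numeric types is part of the preparation step covered below.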
How to do it...
- The sample CSV data file contains movie ratings from the MovieLens dataset. The file can be retrieved at http://files.grouplens.org/datasets/movielens/ml-latest-small.zip.
- Once the file is extracted, we will use the ratings.csv file for our CSV program to load the data into Spark. The CSV file will look like the following:
| userId | movieId | rating | timestamp |
|--------|---------|--------|-----------|
| 1 | 16 | 4 | 1217897793 |
| 1 | 24 | 1.5 | 1217895807 |
| 1 | 32 | 4 | 1217896246 |
| 1 | 47 | 4 | 1217896556 |
| 1 | 50 | 4 | 1217896523 |
| 1 | 110 | 4 | 1217896150 |
| 1 | 150 | 3 | 1217895940 |
| 1 | 161 | 4 | 1217897864 |
| 1 | 165 | 3 | 1217897135... |