Chapter 2. First Steps in Data Analysis
Let's take our first steps in data analysis. Apache Spark ships with a prebuilt module called Spark SQL, which is used for structured data processing. Using this module, we can execute SQL queries on our underlying data. Spark lets you read data from a variety of data sources: text, CSV, or Parquet files on HDFS, as well as Hive tables and HBase tables. For simple data analysis tasks, whether you are exploring your datasets for the first time or preparing a report with simple statistics for your end users, this module is tremendously useful.
In this chapter, we will work with two datasets. The first is a simple dataset, and the second is a more complex, real-world dataset from an e-commerce store.
In this chapter, we will cover the following topics:
- Basic statistical analytic approaches using Spark SQL
- Building association rules using the Apriori algorithm...