Using DataFrames with SparkR
The following steps walk through further DataFrame operations in SparkR by analyzing a New York flights dataset:
As a first step, let's download the flights data and copy it to HDFS:
[cloudera@quickstart ~]$ wget https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv --no-check-certificate
[cloudera@quickstart ~]$ hadoop fs -put nycflights13.csv flights.csv
Start the SparkR shell and create a DataFrame using the CSV data source. When install.packages asks for a CRAN mirror, choose an HTTP mirror near you:
[cloudera@quickstart ~]$ cd spark-2.0.0-bin-hadoop2.7/
[cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/sparkR
> install.packages("magrittr", dependencies = TRUE)
> library(magrittr)
> flights <- read.df("flights.csv", source="csv", header="true", inferschema="true")
> flights
SparkDataFrame[year:int, month:int, day:int, dep_time:int, dep_delay:int, arr_time:int, arr_delay:int, carrier:string, tailnum:string, flight:int, origin:string...
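Before going further, it can help to check that the DataFrame loaded as expected. The following is a minimal sketch, not part of the original steps. It assumes that flights.csv was copied to your HDFS home directory as shown earlier and that you are on SparkR 2.0 or later. The 60-minute threshold and the selected columns are illustrative choices only; the column names come from the schema printed above.

```r
# A sketch under stated assumptions: SparkR 2.x, flights.csv in the HDFS home directory.
library(SparkR)
library(magrittr)

# Inside the sparkR shell a session already exists; this call simply returns it.
sparkR.session()

# Re-create the DataFrame as in the step above.
flights <- read.df("flights.csv", source = "csv",
                   header = "true", inferSchema = "true")

# Show the column names and the types inferred from the CSV.
printSchema(flights)

# Bring the first five rows back to the driver as a local R data.frame.
head(flights, 5)

# Count the rows (this triggers a Spark job over the whole file).
count(flights)

# Illustrative pipeline: flights that departed more than 60 minutes late
# (the threshold is a hypothetical choice, not from the original text).
flights %>%
  filter(flights$dep_delay > 60) %>%
  select("carrier", "origin", "dep_delay") %>%
  head(5)
```

The pipe operator from magrittr passes the result of each call as the first argument of the next, so the last expression reads top to bottom instead of as nested calls. Note that head returns an ordinary R data.frame, while filter and select return SparkDataFrames that are evaluated only when an action such as head or count runs.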