Munging time series data
Time series data is a sequence of values, each linked to a timestamp. In this section, we use Cloudera's spark-ts package for analyzing time-series data.
Note
Refer to the Cloudera Engineering Blog post, A New Library for Analyzing Time-Series Data with Apache Spark, for more details on time-series data and its processing using spark-ts. The library it describes is hosted at: https://github.com/sryza/spark-timeseries.
The spark-ts package can be downloaded and built using the instructions available at: https://github.com/sryza/spark-timeseries.
We will attempt to accomplish the following objectives in the sub-sections that follow:
- Pre-processing of the time-series Dataset
- Processing date fields
- Persisting and loading data
- Defining a date-time index
- Using the TimeSeriesRDD object
- Handling missing time-series data
- Computing basic statistics
For this section, include the spark-ts.jar file when starting the Spark shell, as shown:
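A minimal sketch of such an invocation, assuming the jar sits in the current directory and that spark-shell is on your PATH. The jar location is an assumption; adjust it to wherever you built or downloaded the file.

```shell
# Start the Spark shell with the spark-ts jar on the driver and executor classpaths.
# ./spark-ts.jar is an assumed location; replace it with the actual path to your jar.
spark-shell --jars ./spark-ts.jar
```

The --jars option accepts a comma-separated list, so any additional jars the session needs can be appended to the same argument.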
We download Datasets containing pricing and volume data for six stocks over a one year...