The following steps need to be carried out to complete this recipe:
- Firstly, import the required libraries. We will be using pyspark.sql, numpy, and pandas for data manipulation, and matplotlib and seaborn for visualization:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import pandas as pd
import numpy as np
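# Fix the NumPy random seed so that any random operations later in the recipe are reproducible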
np.random.seed(1385)
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
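This recipe targets a Databricks notebook (hence the /FileStore/tables path and the import wizard used below), where a SparkSession is already available as spark. If you are running the code outside Databricks, a minimal sketch for creating a session yourself would look like the following (the application name here is only an illustrative choice):
from pyspark.sql import SparkSession
# In Databricks the `spark` object already exists; elsewhere, create or reuse a session
spark = SparkSession.builder.appName("rul_fd001_recipe").getOrCreate()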
- Next, we're going to import the data and apply a schema to it so that each column is read with the correct data type. To do this, we upload the data file through the import wizard and then define our schema:
file_location = "/FileStore/tables/train_FD001.txt"
file_type = "csv"
from pyspark.sql.types import *
schema = StructType([
StructField("engine_id", IntegerType()),
StructField("cycle", IntegerType()),
StructField("setting1", DoubleType()),
StructField("setting2",...