Variance measures how far the data spreads around its mean. In the code that follows, we use Koalas, a distributed clone of pandas, for basic data engineering tasks such as determining variance. The code then uses the standard deviation (the square root of the variance) over a rolling window to reveal data spikes:
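import databricks.koalas as ks
import pandas as pd
# pump_data and minuite are assumed to have been loaded earlier
df = ks.DataFrame(pump_data)
print("variance: " + str(df.var()))
# Parse the timestamps and make them the index; set_index returns a new DataFrame
minuite['time'] = pd.to_datetime(minuite['time'])
minuite = minuite.set_index('time')
# Replace each sample with the standard deviation of the trailing 600 rows
minuite['sample'] = minuite['sample'].rolling(window=600, center=False).std()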
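To make the rolling-window idea concrete, here is a minimal sketch in plain pandas. The readings, the window size of 3, and the threshold of 5.0 are all made up for illustration and are not taken from the pump dataset; a real threshold would have to be tuned to the sensor:

```python
import pandas as pd


def rolling_std(readings: pd.Series, window: int) -> pd.Series:
    """Standard deviation over a trailing window of `window` rows.

    The first window - 1 entries are NaN because the window is not yet full.
    """
    return readings.rolling(window=window, center=False).std()


# Hypothetical sensor readings: a steady signal with one spike at position 6.
samples = pd.Series([10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 50.0, 10.0, 10.0, 10.0])

spread = rolling_std(samples, window=3)

# Flag every window whose spread exceeds an arbitrary illustrative threshold.
SPIKE_THRESHOLD = 5.0
spike_windows = spread[spread > SPIKE_THRESHOLD].index.tolist()
print(spike_windows)
```

Every window that contains the spike is flagged, so a single bad reading shows up in as many consecutive positions as the window is wide. A larger window smooths out noise but keeps a spike visible for longer.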
Duty cycles are used on IoT product lines before enough data is collected for machine learning. They are often simple measures, such as whether the device is too hot or is vibrating too much.
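A simple rule-based check of this kind might look like the following sketch. The `Reading` class, the field names, and both limits are hypothetical and chosen only for illustration; real limits would come from the device's specification:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """One hypothetical sensor sample."""
    temperature_c: float
    vibration_mm_s: float


# Illustrative limits only, not taken from any real device.
MAX_TEMPERATURE_C = 80.0
MAX_VIBRATION_MM_S = 7.0


def check_reading(reading: Reading) -> List[str]:
    """Return a list of alerts for every limit the reading exceeds."""
    alerts = []
    if reading.temperature_c > MAX_TEMPERATURE_C:
        alerts.append("too hot")
    if reading.vibration_mm_s > MAX_VIBRATION_MM_S:
        alerts.append("too much vibration")
    return alerts


print(check_reading(Reading(temperature_c=85.0, vibration_mm_s=2.0)))
```

Rules like these need no training data, which is why they can ship with a device from day one and be replaced or supplemented by a learned model later.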
We can also look at extreme values, such as the maximum, to check whether the sensor is producing plausible readings. The following code shows the maximum reading of our dataset:
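# collect() returns a list of Rows; [0][0] extracts the value. Avoid naming the result max, which shadows the built-in.
max_reading = DF.agg({"averageRating": "max"}).collect()[0][0]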
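The same idea can be sketched in plain pandas by comparing the lowest and highest readings against the range the sensor is expected to report. The column name, the sample values, and the expected range below are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical temperature readings; 250.0 stands in for a faulty sample.
readings = pd.DataFrame({"temperature": [21.5, 22.0, 21.8, 250.0, 22.1]})

# Assumed operating range of the sensor, for illustration only.
EXPECTED_MIN = -40.0
EXPECTED_MAX = 125.0

lowest = readings["temperature"].min()
highest = readings["temperature"].max()

# True only when every reading falls inside the expected range.
in_range = bool(EXPECTED_MIN <= lowest and highest <= EXPECTED_MAX)
print(lowest, highest, in_range)
```

A maximum outside the expected range does not say which reading is wrong, only that at least one is, so it works best as a cheap first check before a closer look at the data.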