Processing very large amounts of data is often impossible on a single computer. That is where distributed systems such as Apache Spark come in. Spark, which originated at UC Berkeley's AMPLab and is now commercially backed by Databricks, allows you to parallelize large workloads across many computers.
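To make the idea concrete, here is a minimal PySpark sketch, assuming only that pyspark is installed; the numbers and the squaring step are purely illustrative. Spark splits the collection into partitions, each executor processes its partitions in parallel, and the partial results are combined into the final answer.

```python
# A minimal sketch of parallelizing work with PySpark; the workload
# here is trivial, but the same model scales to cluster-sized data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-example").getOrCreate()

# Distribute a local collection across the cluster's executors.
numbers = spark.sparkContext.parallelize(range(1_000_000))

# Each executor squares and sums its own partitions in parallel;
# Spark then combines the partial sums on the driver.
total = numbers.map(lambda x: x * x).sum()
print(total)
```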
One of Spark's earliest applications was the Netflix Prize, a competition that offered $1 million to the team that built the best recommendation engine. Spark uses distributed computing to wrangle large and complex datasets, and its ecosystem includes distributed equivalents of familiar Python libraries, such as Koalas, which provides the pandas API on top of Spark. Spark also supports analytics and feature engineering that require large amounts of compute and memory, such as graph theory problems. Finally, Spark has two modes: a batch mode for training on large datasets and a streaming mode for scoring data in near real time.
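As a sketch of how familiar the Koalas API feels, consider the following; the telemetry.csv file and its device_id and temperature columns are hypothetical stand-ins for real IoT data, and a running Spark environment is assumed.

```python
# A minimal Koalas sketch; the file and column names are
# hypothetical illustrations, not from the text.
import databricks.koalas as ks

# Looks like pandas, but the DataFrame is partitioned across the cluster.
df = ks.read_csv("telemetry.csv")

# Familiar pandas-style operations execute as distributed Spark jobs.
print(df["temperature"].describe())
print(df.groupby("device_id")["temperature"].mean().head())
```

Batch jobs like this one read a static dataset, while the streaming mode instead consumes records as they arrive via Spark Structured Streaming (spark.readStream). In Spark 3.2 and later, the same pandas-style API also ships inside PySpark itself as pyspark.pandas.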
IoT data tends to be large and imbalanced. A device may have 10 years of data showing it running in normal conditions and only a few records showing that it needs repair.
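One common way to cope with this imbalance is stratified sampling: keep every rare failure record while downsampling the plentiful normal ones. The sketch below uses PySpark's sampleBy; the data path, the failure label column, and the sampling fractions are hypothetical choices for illustration.

```python
# A sketch of rebalancing an imbalanced IoT dataset by downsampling
# the majority class; the path, label column, and fractions are
# hypothetical, not values from the text.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rebalance-example").getOrCreate()

df = spark.read.parquet("/data/device_history/")  # hypothetical path

# Keep all of the rare failure records (label 1) but only 1% of the
# plentiful normal records (label 0).
balanced = df.stat.sampleBy("failure", fractions={0: 0.01, 1: 1.0}, seed=42)

# Inspect the resulting class distribution.
balanced.groupBy("failure").count().show()
```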