SPARKing new ideas on Jupyter
Apache Spark is an open-source analytics engine for distributed processing across large clusters, built to take advantage of the parallelism and fault tolerance that such an architecture provides. It is also, in my opinion, the most simultaneously loved and hated piece of software since the invention of JavaScript! The love comes from the workflows it enables, but it is notoriously fragile and difficult to use properly. If you aren't familiar with Spark, it is commonly used with Scala, Java, Python, and/or R, and it can also run distributed SQL queries. Because Python is easy to pick up and quick to write, data scientists often use Jupyter Notebooks with Python to rapidly create and test models for analysis.
This style of workflow is excellent for quickly iterating on ideas and proving feasibility and correctness. However, engineers and data scientists often find themselves beholden to the fact that, frankly...