Computing sequencing statistics using Spark
If you need parallel computing, Spark is an alternative to Dask. Its abstraction level is slightly higher: you get less granular control over the computation, but the code is more declarative. Spark is also somewhat language-agnostic (it is actually Java/Scala-based, with bindings for other languages such as Python). Here, we will compute some very basic statistics over the Parquet dataset that we generated in the previous recipe.
Getting ready
Preparing for this recipe can be quite tricky. First, we will have to start a Spark server. At the time of writing this book, the conda packages for accessing Spark were quite immature. We will still use conda here, but we will not install any Spark packages from conda. Follow these steps to prepare the environment:
- Make sure that you have Java 8 installed. Be careful with the Java version: an older version will not work, and a newer one may also be problematic.
- Download Spark (https://spark.apache.org/downloads.html). This code was tested...