Generating TPC-DS data
We will stick to our tradition of finding tools within the Databricks ecosystem. The folks at Databricks have created an open-source project called spark-sql-perf (https://github.com/databricks/spark-sql-perf) that has everything we need to generate the TPC-DS data. Let’s begin.
Building the spark-sql-perf library
The first thing we must do is compile the spark-sql-perf library. I will be using the IntelliJ IDEA Integrated Development Environment (IDE). You can use any IDE of your choice or even compile the library from your terminal.
If you do not want to compile the library, you can head over to the GitHub repository for this book, download the JAR file, and move to the next step. However, note that the JAR file has been built for Spark 3.2.1 and Scala 2.12.10, so it is only suitable for an environment that runs matching versions.
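Before relying on the prebuilt JAR, it can help to compare the versions it was built for with the versions in your own environment. The following is a minimal sketch rather than part of spark-sql-perf: the helper names binary_version and jar_matches_cluster are hypothetical, and treating a matching major.minor pair as "compatible" is a simplifying assumption. Scala libraries are binary compatible within a 2.12.x line, while requiring the same Spark minor version is simply a conservative choice made here.

```python
# Hypothetical helpers for illustration; not part of spark-sql-perf or Spark.

# Versions the prebuilt JAR targets, as stated in the text.
JAR_SPARK_VERSION = "3.2.1"
JAR_SCALA_VERSION = "2.12.10"


def binary_version(version: str) -> str:
    """Return the 'major.minor' prefix of a dotted version string."""
    parts = version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"unrecognized version string: {version!r}")
    return ".".join(parts[:2])


def jar_matches_cluster(cluster_spark: str, cluster_scala: str) -> bool:
    """Check whether the cluster's Spark and Scala versions share the
    JAR's major.minor versions (an assumed, conservative rule)."""
    return (
        binary_version(cluster_spark) == binary_version(JAR_SPARK_VERSION)
        and binary_version(cluster_scala) == binary_version(JAR_SCALA_VERSION)
    )
```

For example, jar_matches_cluster("3.2.1", "2.12.10") returns True, while jar_matches_cluster("3.3.0", "2.12.15") returns False, in which case compiling the library yourself is the safer route. The two arguments are whatever versions your cluster reports.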
Note
Ensure that you install the necessary JDK and Scala versions onto your machine before commencing these steps. If you are using an IDE, it should guide you through...