There are different categories of questions that you will find in the exam. They can be broadly divided into theoretical and code questions. We will look at both categories and their respective subcategories in this section.
Theoretical questions
Theoretical questions test your conceptual understanding of particular topics. They can be subdivided further into different categories. Let's look at some of these categories, along with example questions taken from previous exams that fall into them.
Explanation questions
Explanation questions ask you to define and explain something, including how it works and what it does. Let's look at an example.
Which of the following describes a worker node?
- Worker nodes are the nodes of a cluster that perform computations.
- Worker nodes are synonymous with executors.
- Worker nodes always have a one-to-one relationship with executors.
- Worker nodes are the most granular level of execution in the Spark execution hierarchy.
- Worker nodes are the coarsest level of execution in the Spark execution hierarchy.
Connection questions
Connection questions ask you to describe how different things relate to, or differ from, each other. Let's look at an example to demonstrate this.
Which of the following describes the relationship between worker nodes and executors?
- An executor is a Java Virtual Machine (JVM) running on a worker node.
- A worker node is a JVM running on an executor.
- There are always more worker nodes than executors.
- There are always the same number of executors and worker nodes.
- Executors and worker nodes are not related.
Scenario questions
Scenario questions ask how things behave under different conditions – for example, “If ______ occurs, then ______ happens.” This category also includes questions that ask you to identify an incorrect statement about a scenario. Let’s look at an example to demonstrate this.
If Spark is running in cluster mode, which of the following statements about nodes is incorrect?
- There is a single worker node that contains the Spark driver and the executors.
- The Spark driver runs in its own non-worker node without any executors.
- Each executor is a running JVM inside a worker node.
- There is always more than one node.
- There might be more executors than total nodes or more total nodes than executors.
Categorization questions
Categorization questions ask you to identify the categories that something belongs to. Let’s look at an example to demonstrate this.
Which of the following statements accurately describes stages?
- Tasks within a stage can be simultaneously executed by multiple machines.
- Various stages within a job can run concurrently.
- Stages comprise one or more jobs.
- Stages temporarily store transactions before committing them through actions.
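To make this concrete, here is a minimal sketch, with a made-up app name and illustrative data, of how narrow and wide transformations map onto stages: the filter stays inside one stage whose tasks run in parallel, while the groupBy forces a shuffle that starts a new stage.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stages-sketch").getOrCreate()

# Illustrative data: a million rows with a derived bucket column
df = spark.range(1000000).withColumn("bucket", col("id") % 10)

# filter() is a narrow transformation: it runs within the current
# stage, and tasks in that stage execute in parallel across machines
filtered = df.filter(col("id") > 100)

# groupBy().count() needs a shuffle, so Spark closes the current
# stage and opens a new one; the show() action triggers the job
filtered.groupBy("bucket").count().show()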
Configuration questions
Configuration questions ask you to explain how things behave under different cluster configurations. Let’s look at an example to demonstrate this.
Which of the following statements accurately describes Spark’s cluster execution mode?
- Cluster mode runs executor processes on gateway nodes.
- Cluster mode involves the driver being hosted on a gateway machine.
- In cluster mode, the Spark driver and the cluster manager are not co-located.
- The driver in cluster mode is located on a worker node.
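For orientation, here is a minimal spark-submit sketch, assuming a YARN cluster; my_app.py and the resource sizes are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-memory 2g \
  my_app.py

With --deploy-mode cluster, the driver is launched on a node inside the cluster; with --deploy-mode client, it stays on the gateway machine that ran spark-submit. That distinction is exactly what the question above probes.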
Next, we’ll look at the code-based questions and their subcategories.
Code-based questions
The next category is code-based questions, which cover a large share of the Spark API material on the exam. In these questions, you are given a code snippet and asked questions about it. Code-based questions can be subdivided further into different categories. Let’s look at some of these categories, along with example questions taken from previous exams that fall into these different subcategories.
Function identification questions
Function identification questions ask you to determine which function performs a given operation. It is important to know the different functions available in Spark for data manipulation, along with their syntax. Let’s look at an example to demonstrate this.
Which of the following code blocks returns a copy of the df DataFrame, where the column salary has been renamed employeeSalary?
- df.withColumn(["salary", "employeeSalary"])
- df.withColumnRenamed("salary").alias("employeeSalary")
- df.withColumnRenamed("salary", "employeeSalary")
- df.withColumn("salary", "employeeSalary")
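The function under test here is withColumnRenamed(existing, new), which takes the current column name followed by the new one and returns a new DataFrame. A minimal sketch with made-up example data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rename-sketch").getOrCreate()

# Illustrative data, not taken from the exam question
df = spark.createDataFrame([(1, 50000), (2, 60000)], ["employeeId", "salary"])

# Correct pattern: withColumnRenamed(existingName, newName)
renamed = df.withColumnRenamed("salary", "employeeSalary")
renamed.printSchema()  # salary now appears as employeeSalary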
Fill-in-the-blank questions
Fill-in-the-blank questions ask you to complete a code block by filling in the blanks. Let’s look at an example to demonstrate this.
The following code block should return a DataFrame with the employeeId, salary, bonus, and department columns from the transactionsDf DataFrame. Choose the answer that correctly fills the blanks to accomplish this.

transactionsDf.__1__(__2__)

- __1__ = drop, __2__ = "employeeId", "salary", "bonus", "department"
- __1__ = filter, __2__ = "employeeId, salary, bonus, department"
- __1__ = select, __2__ = ["employeeId", "salary", "bonus", "department"]
- __1__ = select, __2__ = col(["employeeId", "salary", "bonus", "department"])
Order-lines-of-code questions
Order-lines-of-code questions ask you to arrange lines of code in the correct order to execute an operation. Let’s look at an example to demonstrate this.
Which of the following code blocks creates a DataFrame that shows the mean of the salary column of the salaryDf DataFrame based on the department and state columns, where age is greater than 35?

i. salaryDf.filter(col("age") > 35)
ii. .filter(col("employeeID"))
iii. .filter(col("employeeID").isNotNull())
iv. .groupBy("department")
v. .groupBy("department", "state")
vi. .agg(avg("salary").alias("mean_salary"))
vii. .agg(average("salary").alias("mean_salary"))
- i, ii, v, vi
- i, iii, v, vi
- i, iii, vi, vii
- i, ii, iv, vi
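To see how such a chain fits together, here is a minimal sketch, with made-up data standing in for salaryDf, of a filter-then-groupBy-then-agg pipeline like the one the question describes. Note that avg() is the PySpark aggregation function; average() does not exist in pyspark.sql.functions, so any ordering that includes line vii cannot run.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.appName("agg-sketch").getOrCreate()

# Illustrative stand-in for salaryDf
salaryDf = spark.createDataFrame(
    [(1, 40, "sales", "NY", 50000), (2, 30, "sales", "NY", 60000)],
    ["employeeID", "age", "department", "state", "salary"],
)

# The order matters: the row-level filter must come before the
# grouping, and the aggregation closes the chain
result = (
    salaryDf.filter(col("age") > 35)
    .groupBy("department", "state")
    .agg(avg("salary").alias("mean_salary"))
)
result.show()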