How to manipulate data on rows and columns
In this section, we will learn how to perform various data manipulation operations on the rows and columns of Spark DataFrames.
We will start by looking at how we can select columns in a Spark DataFrame.
Selecting columns
We can use column functions for data manipulation at the column level in a Spark DataFrame. To select a column in a DataFrame, we use the select() function like so:
from pyspark.sql import Column
data_df.select(data_df.col_3).show()
As a result, you will see only one column of the DataFrame with its data:
+-------------+
| col_3 |
+-------------+
|string_test_1|
|string_test_2|
|string_test_3|
+-------------+
The important thing to note here is that the resulting single-column DataFrame is a new DataFrame. Recall from Chapter 3 that RDDs are immutable. Because the underlying data structure for DataFrames is RDDs, DataFrames are also immutable....