DataFrame immutability and persistence
DataFrames, like RDDs, are immutable. When you define a transformation on a DataFrame, this always creates a new DataFrame. The original DataFrame cannot be modified in place (this is notably different to pandas DataFrames, for instance).
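The following is a minimal sketch of this behaviour in the Spark shell, assuming a SparkSession named spark and a hypothetical people.json input file; the column names are illustrative only.

```scala
val people = spark.read.json("people.json")

// withColumn does not modify `people`; it returns a new DataFrame.
val withAge = people.withColumn("ageNextYear", people("age") + 1)

people.columns.contains("ageNextYear")   // false: the original is untouched
withAge.columns.contains("ageNextYear")  // true: the new DataFrame has the column
```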
Operations on DataFrames can be grouped into two categories: transformations, which result in the creation of a new DataFrame, and actions, which usually return a Scala type or have a side-effect. Methods like filter or withColumn are transformations, while methods like show or head are actions.
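As a sketch of the distinction, using the hypothetical people DataFrame from above: the transformations below return new DataFrames, while the actions return plain Scala values or produce output.

```scala
import org.apache.spark.sql.functions.lit

// Transformations: each returns a new DataFrame and does not touch the data yet.
val adults = people.filter(people("age") >= 18)
val flagged = adults.withColumn("isAdult", lit(true))

// Actions: these force computation and return Scala values or print output.
flagged.show(5)            // prints the first five rows (side-effect)
val first = flagged.head   // returns a Row
val total = flagged.count() // returns a Long
```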
Transformations are lazy, much like transformations on RDDs. When you generate a new DataFrame by transforming an existing one, Spark builds an execution plan for creating the new DataFrame, but the data itself is not transformed immediately. You can access the execution plan with the queryExecution method.
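A brief sketch of inspecting the plan, continuing with the hypothetical adults DataFrame from the previous example:

```scala
// queryExecution exposes the logical and physical plans Spark has built.
println(adults.queryExecution)

// explain() prints the physical plan; explain(true) also prints the
// parsed, analyzed, and optimized logical plans.
adults.explain(true)
```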
When you call an action on a DataFrame, Spark processes it much as it would an action on a regular RDD: it implicitly...