Reading and writing data
When we manipulate data with Spark, one of the most fundamental tasks is reading data from and writing data to disk. Remember, Spark is an in-memory framework: all operations take place in the memory of the machine or cluster. Once those operations are complete, we'll usually want to write the results to disk. Similarly, before we can manipulate any data, we'll typically need to read it from disk.
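As a quick illustration, here is a minimal PySpark sketch of that read-transform-write cycle. The file paths, column name, and filter condition are hypothetical, assumed only for the example:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for reading and writing data
spark = SparkSession.builder.appName("read-write-example").getOrCreate()

# Read data from disk into an in-memory DataFrame
# ("data/sales.csv" is a hypothetical path)
df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)

# Manipulate the data in memory (hypothetical column name)
filtered = df.filter(df["amount"] > 100)

# Write the results back to disk
filtered.write.mode("overwrite").parquet("data/sales_filtered")
```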
Spark supports several formats for reading and writing data files. In this chapter, we will discuss the following formats:
- Comma Separated Values (CSV)
- Parquet
- Optimized Row Columnar (ORC)
Note that these are not the only formats Spark supports; they are simply a popular subset. Spark also handles many others, such as Avro, text, JDBC, and Delta.
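All of these formats are accessed through the same DataFrameReader and DataFrameWriter interface; only the format name and the format-specific options change. A brief sketch, again with hypothetical paths:

```python
# Reading: the format() method selects the data source
csv_df = spark.read.format("csv").option("header", "true").load("data/input.csv")
parquet_df = spark.read.format("parquet").load("data/input.parquet")
orc_df = spark.read.format("orc").load("data/input.orc")

# Writing: the same pattern applies on the writer side
csv_df.write.format("orc").mode("overwrite").save("data/output.orc")
```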
In the next section, we will...