Summary
In this chapter, we learned about common input file formats for big data, such as CSV and JSON. We also covered the popular big data file formats Parquet, Avro, and ORC, and examined the key decision points for choosing among them. We then explored converting CSV and JSON data into each of these formats using Spark and Scala, reinforcing each conversion with a hands-on exercise.
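As a quick refresher, the sketch below shows the general shape of such a conversion in Spark and Scala: read a CSV file into a DataFrame and write it back out as Parquet, ORC, and Avro. The object name and file paths are placeholders, and writing Avro assumes the external spark-avro module is available on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object FormatConversion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FormatConversion")
      .getOrCreate()

    // Read a CSV file with a header row, letting Spark infer column types.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/input.csv")

    // Write the same DataFrame out in each target format.
    df.write.mode("overwrite").parquet("data/output/parquet")
    df.write.mode("overwrite").orc("data/output/orc")

    // Avro requires the external spark-avro package, e.g. submitted with
    // --packages org.apache.spark:spark-avro_2.12:<spark-version>.
    df.write.mode("overwrite").format("avro").save("data/output/avro")

    spark.stop()
  }
}
```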
At the end of the chapter, we looked at a real-world business problem and, applying the selection criteria covered here, concluded which file format was the most suitable.
In the next chapter, we will take an in-depth look at Spark, a vital piece of big data infrastructure. This will lay a strong foundation for the concepts ahead and guide us through building our first Spark pipeline.