Learning to differentiate CSV and Parquet
Most data scientists are more familiar with CSV files than with Parquet files. When they start using Databricks and Spark, they naturally tend to keep working with CSV. Switching to Parquet might seem daunting at first, but it pays off substantially in the long run.
Let's first discuss the advantages and disadvantages of CSV and Parquet files:
Advantages of CSV files:
- CSV is one of the most widely used file formats among data scientists and other data users.
- CSV files are human-readable because the data is stored as plain text rather than in a binary encoding. They are also easy to edit.
- CSV files are simple to parse, and almost any text editor can open them.
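To make the points above concrete, here is a minimal sketch that uses only Python's standard `csv` module. The sample rows and column names (`id`, `city`, `temp_c`) are hypothetical and exist only for illustration. It writes a small CSV to an in-memory buffer, prints the raw text to show that the stored form is readable, and parses it back.

```python
import csv
import io

# Hypothetical sample data, used only for illustration.
rows = [
    {"id": 1, "city": "Paris", "temp_c": 21.5},
    {"id": 2, "city": "Oslo", "temp_c": 14.0},
]

# Write the rows as CSV into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["id", "city", "temp_c"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)

# The stored form is plain text that any editor could display.
raw_text = buffer.getvalue()
print(raw_text)

# Parsing the text back takes one call to the standard library reader.
parsed = list(csv.DictReader(io.StringIO(raw_text)))
print(parsed[0])
```

Note that the parsed values all come back as strings, because a CSV file carries no type information of its own.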
Advantages of Parquet files:
- Parquet files are compressed with codecs such as Snappy or gzip, so they typically consume less space than the equivalent CSV files.
- Because Parquet is a columnar storage format, a query can read only the columns it needs, which makes reading and querying data very efficient.
- The file...