Data Ingestion and Data Extraction with Apache Spark
Apache Spark is a powerful distributed computing framework for large-scale data processing. One of the most common tasks when working with data is loading it from a variety of sources and writing it out in different formats. In this hands-on chapter, you will learn how to read and write data files with Apache Spark using Python.
In this chapter, we’re going to cover the following recipes:
- Reading CSV data with Apache Spark
- Reading JSON data with Apache Spark
- Reading Parquet data with Apache Spark
- Parsing XML data with Apache Spark
- Working with nested data structures in Apache Spark
- Processing text data in Apache Spark
- Writing data with Apache Spark
By the end of this chapter, you will have learned how to read, write, parse, and manipulate data in CSV, JSON, Parquet, and XML formats. You will also have learned how to analyze text data with natural language processing (NLP)...