Working with different data formats
Apache Spark supports a wide variety of file formats, either natively or through libraries written in Java or other programming languages. Compressed file formats, as well as Hadoop's file formats, are well integrated with Spark. Some of the file formats widely used with Spark are as follows:
Plain and specially formatted text
Plain text can be read in Spark by calling the textFile() function on SparkContext. However, for specially formatted text, such as files whose fields are separated by white space, tabs, or tildes (~), users need to iterate over each line of the text using the map() function and then split it on the specific separator character, such as the tilde (~) in the case of tilde-separated files.
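As a quick illustration, the following is a minimal sketch of a plain-text read; the application name and the file path data/notes.txt are assumptions made for this example:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PlainTextRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PlainTextRead").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // textFile() produces one RDD element per line of the input file
        JavaRDD<String> lines = sc.textFile("data/notes.txt");  // hypothetical path
        System.out.println(lines.count());                      // an action to force evaluation
        sc.close();
    }
}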
Consider that we have a tilde-separated file consisting of data about people in the following format:
name~age~occupation
Let's load this file as an RDD of Person objects, as follows:
Person POJO:
import java.io.Serializable;

public class Person implements Serializable {
    private String name;
    private Integer age;
    private String occupation;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
    public String getOccupation() { return occupation; }
    public void setOccupation(String occupation) { this.occupation = occupation; }
}
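With the POJO in place, loading the file is a textFile() call followed by a map() that splits each line on the tilde. The sketch below reuses the JavaSparkContext sc from the earlier example; the path people.txt is an illustrative assumption:

// Split each tilde-separated line into its fields and build a Person from them
JavaRDD<Person> people = sc.textFile("people.txt")
        .map(line -> {
            String[] fields = line.split("~");      // name~age~occupation
            Person p = new Person();
            p.setName(fields[0]);
            p.setAge(Integer.parseInt(fields[1]));
            p.setOccupation(fields[2]);
            return p;
        });

Because Person implements Serializable, instances of it can be shipped between executors whenever a subsequent shuffle or action requires it.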