Reading and writing files
Data mostly lives in files on the filesystem: semi-structured text files, structured delimited files, or more sophisticated formats such as Avro and Parquet. Log files, SQL exports, JSON, XML, and virtually any other type of file can be processed with Scalding.
Scalding can read and write many file formats, the most common of which are:
- The TextLine format reads and writes raw text files. It returns tuples with two fields, named 'offset and 'line by default, which are inherited from Hadoop's text input format. After reading a text file, we usually apply a schema to the data by parsing each line with regular expressions, as shown in the first sketch after this list.
- Delimited files such as Tab-Separated Values (TSV), Comma-Separated Values (CSV), and One-Separated Values (OSV), the latter using the Ctrl-A (\1) character as the delimiter and commonly used in Pig and Hive, are already structured text files and are therefore easier to work with; see the second sketch after this list.
- Advanced serialization formats such as Avro, Parquet, Thrift, and Protocol Buffers offer their own capabilities. Avro, for example, is a data serialization framework that stores the schema together with the data in the same file.
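To illustrate the TextLine case, here is a minimal sketch of a Scalding job that reads raw text and applies a schema with a regular expression. The job name, input and output paths, log pattern, and the 'date, 'level, and 'message field names are hypothetical; TextLine, flatMap, project, and Tsv are part of Scalding's fields-based API.

```scala
import com.twitter.scalding._

class ParseLogJob(args: Args) extends Job(args) {
  // Hypothetical pattern for lines such as: "2014-05-01 10:00:00 INFO Job started"
  val logLine = """^(\S+ \S+)\s+(\S+)\s+(.*)$""".r

  TextLine(args("input"))                        // emits tuples of ('offset, 'line)
    .flatMap('line -> ('date, 'level, 'message)) { line: String =>
      // Keep only the lines that match the pattern; drop the rest
      logLine.findFirstMatchIn(line)
        .map(m => (m.group(1), m.group(2), m.group(3)))
        .toList
    }
    .project('date, 'level, 'message)            // discard 'offset and 'line
    .write(Tsv(args("output")))
}
```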
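For delimited files, a schema can be attached directly to the source, so no parsing step is needed. The sketch below assumes a hypothetical tab-separated sales export with four columns; the job and field names are made up for the example, while Tsv, Csv, groupBy, sum, and size are standard fields-based API calls.

```scala
import com.twitter.scalding._

class SalesReportJob(args: Args) extends Job(args) {
  // Hypothetical input: a tab-separated file with four columns
  Tsv(args("input"), ('date, 'product, 'quantity, 'price))
    .groupBy('product) { group =>
      group.sum[Double]('price -> 'totalRevenue)   // total revenue per product
           .size('transactions)                    // number of rows per product
    }
    .write(Csv(args("output")))
}
```

The same pipeline would work unchanged with Csv or Osv as the input source, since only the delimiter differs between these sources.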