Data file optimization
Data file optimization covers performance improvements to the data files in terms of file format, compression, and storage.
File format
Hive supports the TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats. The three ways to specify the file format are as follows:
CREATE TABLE ... STORED AS <File_Format>
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT <File_Format>
SET hive.default.fileformat=<File_Format> --default fileformat for table
Here, <File_Format> is TEXTFILE, SEQUENCEFILE, RCFILE, ORC, or PARQUET.
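As a minimal sketch of the three options, the statements below use hypothetical table names (demo_orc, demo_default) and hypothetical columns that do not come from the original text. Note that, as far as the Hive DDL documentation describes it, ALTER TABLE ... SET FILEFORMAT changes only the table metadata and does not rewrite data files that already exist, so it is shown here on an empty table.

```sql
-- Option 1: declare the file format when the table is created.
-- demo_orc and its columns are hypothetical names for illustration.
CREATE TABLE demo_orc (name STRING, salary INT)
STORED AS ORC;

-- Option 2: change the declared format of an existing table.
-- This updates metadata only; existing files are not converted,
-- so it is applied here while the table is still empty.
ALTER TABLE demo_orc SET FILEFORMAT PARQUET;

-- Option 3: change the session default, which applies to tables
-- created afterwards without an explicit STORED AS clause.
SET hive.default.fileformat=ORC;
CREATE TABLE demo_default (name STRING, salary INT);

-- Inspect the storage information to confirm the resulting format.
DESCRIBE FORMATTED demo_default;
```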
We can load a text file directly into a table stored in the TEXTFILE format. To load data into a table with another file format, we first need to load the data into a TEXTFILE format table. Then, we use INSERT OVERWRITE TABLE <target_file_format_table> SELECT * FROM <text_format_source_table> to convert the data and insert it in the expected file format.
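The two-step load described above can be sketched as follows. The table names (employee_text, employee_orc), the columns, the comma delimiter, and the local path /tmp/employee.csv are all assumptions made for illustration; substitute your own schema and file location.

```sql
-- Step 1: a staging table in TEXTFILE format that matches the raw file.
-- Names, columns, and the delimiter are hypothetical.
CREATE TABLE employee_text (name STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the raw text file as is; LOAD DATA moves or copies the file
-- without converting its format. The path is an assumed example.
LOAD DATA LOCAL INPATH '/tmp/employee.csv'
OVERWRITE INTO TABLE employee_text;

-- Step 2: the target table in the desired format (ORC here).
CREATE TABLE employee_orc (name STRING, salary INT)
STORED AS ORC;

-- Reading from the text table and writing to the ORC table
-- performs the actual format conversion.
INSERT OVERWRITE TABLE employee_orc
SELECT * FROM employee_text;
```

Once the target table is populated and verified, the staging text table can be dropped if it is no longer needed.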
The file formats supported by Hive and their optimizations are as follows...