Chapter 4. Collecting Hadoop Distributed File System Data
The Hadoop Distributed File System (HDFS) is the primary source of evidence in a Hadoop forensic investigation. Whether Hadoop data is used in Hive, HBase, or a custom Java application, the data is stored in HDFS. This means the forensic evidence can be collected from HDFS. Investigators can take one of two collection approaches: collecting HDFS data from the host operating system, or collecting it directly from Hadoop.
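As a minimal sketch of the second approach, the following Java code uses Hadoop's FileSystem API to copy a directory out of HDFS onto local evidence media. The NameNode URI and the source and target paths are illustrative assumptions, not values from an actual investigation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: copy a directory out of HDFS to a local evidence drive.
    // The NameNode URI and paths are placeholders for illustration only.
    public class HdfsCollector {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path source = new Path("/user/hive/warehouse");            // HDFS path to collect
                Path target = new Path("file:///mnt/evidence/hdfs_copy");  // local evidence target

                // copyToLocalFile(delSrc, src, dst, useRawLocalFileSystem)
                // useRawLocalFileSystem = true avoids writing local .crc sidecar files
                fs.copyToLocalFile(false, source, target, true);
            }
        }
    }

Setting the final argument to true keeps local checksum sidecar files from being written next to the collected copies, which keeps the evidence directory limited to the files actually pulled from HDFS.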
The advantage of collecting from HDFS is that investigators can collect much more data than they can from the data analysis or application layer. Some potentially relevant data can only be collected through HDFS, including HDFS metadata, configuration files, user files that were never imported into an application, custom scripts, and other information. In some forensic investigations, this otherwise ancillary data can be crucial for determining how the system operated and how it was used.
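One way such metadata might be captured during collection is sketched below: a recursive walk over HDFS that records the path, size, owner, permissions, and modification time of every entry. The NameNode URI and the starting path are again assumptions made for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: inventory HDFS contents and basic metadata for each entry.
    public class HdfsMetadataInventory {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder URI

            try (FileSystem fs = FileSystem.get(conf)) {
                walk(fs, new Path("/"));
            }
        }

        // Recursively list each file and directory with its metadata.
        private static void walk(FileSystem fs, Path dir) throws Exception {
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.printf("%s\t%d\t%s\t%s\t%d%n",
                        status.getPath(),
                        status.getLen(),
                        status.getOwner(),
                        status.getPermission(),
                        status.getModificationTime());
                if (status.isDirectory()) {
                    walk(fs, status.getPath());
                }
            }
        }
    }

The resulting listing can be preserved alongside the collected files as a record of what existed in HDFS at collection time.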
Collecting evidence from HDFS can be more...