It is sometimes necessary to combine multiple data sources, for reasons that include the following:
- The source data is split across many files that share the same schema (table and field names) but differ slightly in their number of rows. A common reason is storage: several smaller files are easier to maintain than one large file.
- The data is partitioned, meaning one field is used to break it apart so that reading from and writing to the source is faster. For example, data in Hive/HDFS is commonly partitioned by a single date value, so you can easily identify when it was processed and quickly extract the data for a specific day.
- Historical data is stored in a different technology than more current data. For example, the engineering team changed the technology used to manage the source data and decided not to import historical data beyond a specific date. ...
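
The first two cases, many files with one shared schema and files split by a date value, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method: the file names, the `order_id` and `amount` fields, the added `source_file` column, and the `combine_csv_files` helper are all hypothetical, and the sample files are written to a temporary folder so the example runs on its own.

```python
import csv
import tempfile
from pathlib import Path


def combine_csv_files(paths):
    """Stack rows from CSV files that share one header into a single list."""
    combined = []
    expected_fields = None
    for path in sorted(paths):
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            if expected_fields is None:
                # The first file defines the schema every other file must match.
                expected_fields = reader.fieldnames
            elif reader.fieldnames != expected_fields:
                raise ValueError(f"{path} has a different schema: {reader.fieldnames}")
            for row in reader:
                # Record which file each row came from (hypothetical extra column).
                row["source_file"] = Path(path).name
                combined.append(row)
    return combined


# Hypothetical sample data: one small file per day, same fields, different row counts.
with tempfile.TemporaryDirectory() as folder:
    samples = {
        "sales_2019-01-01.csv": "order_id,amount\n1,10.50\n2,7.25\n",
        "sales_2019-01-02.csv": "order_id,amount\n3,3.00\n",
    }
    for name, text in samples.items():
        Path(folder, name).write_text(text)
    rows = combine_csv_files(Path(folder).glob("sales_*.csv"))

print(len(rows))  # 3
```

Checking the header of every file before appending its rows means a file with a different schema raises an error instead of silently producing misaligned data, and keeping the file name on each row preserves the date the partition represented.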