Quality assurance
With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the "front door". It's perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification from the outset: for example, checks for file integrity (parity or checksum verification), completeness, type correctness, expected field counts, overdue files, pre-population of security fields, denormalization, and so on. A sketch of a few such checks follows.
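For illustration, here is a minimal sketch of some front-door checks in Python. The expected schema, the MD5 manifest checksum, and all names here are assumptions for the example, not a prescribed implementation:

```python
import hashlib
import time
from pathlib import Path

# Hypothetical expected schema: field name -> type.
EXPECTED_FIELDS = {"id": int, "timestamp": str, "amount": float}

def checksum_ok(path: Path, expected_md5: str) -> bool:
    """File integrity: compare against a checksum supplied with
    the delivery (e.g., in an accompanying manifest)."""
    return hashlib.md5(path.read_bytes()).hexdigest() == expected_md5

def record_ok(record: dict) -> bool:
    """Field counting and type checking for a single record."""
    if set(record) != set(EXPECTED_FIELDS):
        return False  # missing or unexpected fields
    return all(isinstance(record[name], typ)
               for name, typ in EXPECTED_FIELDS.items())

def overdue(path: Path, deadline_epoch_s: float) -> bool:
    """Overdue-file detection: the file has not arrived by its deadline."""
    return not path.exists() and time.time() > deadline_epoch_s
```

Checks like these are cheap per record, which matters once data volumes grow; heavier semantic validation can be deferred to the retrospective scans mentioned above.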
You should take care that your up-front checks do not take too long. Depending on the intensity of the checks and the volume of the data, you may find there is not enough time to complete all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate whether your validation workload can keep pace with the rate of incoming data.
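To make that timing constraint concrete, a rough back-of-the-envelope check might compare measured validation time against the interval between deliveries. The `validate` callable, the safety factor, and the interval are all illustrative assumptions:

```python
import time

def fits_in_window(validate, dataset, arrival_interval_s: float,
                   safety_factor: float = 0.8) -> bool:
    """Return True if validating this dataset finished within a
    safe fraction of the interval between deliveries."""
    start = time.monotonic()
    validate(dataset)
    elapsed = time.monotonic() - start
    # Leave headroom (safety_factor) for other cluster workloads.
    return elapsed <= safety_factor * arrival_interval_s
```

If the elapsed time regularly approaches the arrival interval, either trim the up-front checks or move some of them into the retrospective scanning pass.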