Chapter 1: Data Management – Introduction and Concepts
People, organizations, devices, and software applications generate a vast amount of data, and the volume is growing rapidly. Estimates vary significantly by source, but approximately 60% to 80% of the data gathered by organizations is thought to be dark data: data that is collected, processed, and stored for long periods, typically for compliance reasons, but not used for any other purpose, such as analytics or direct monetization. In most cases, storing and securing this data is more expensive than the value extracted from it.
In today’s digital economy, organizations are striving to be data-driven by basing their strategic business decisions on intelligence obtained from data gathered from various sources. Until recently, organizations thought of data purely in the context of transactions and locked it away in heavily siloed databases built for transaction processing, which were not suitable for open-ended analysis. This changed as data processing techniques advanced and the cost of processing and analyzing data dropped. Organizations are now adopting data-driven approaches for key business decisions.
In this chapter, we will cover the following topics:
- Types of data processing – OLTP and OLAP
- Data warehouses and data marts
- Data lakes
- Data lakehouse
- Data mesh
- Apache Spark on the AWS cloud
- AWS Glue
- Querying data using AWS
The topics in this chapter introduce different data management techniques and the tools and services offered by the AWS cloud. These concepts will help you understand the design approaches you can take to build effective data integration and management setups that suit your use cases when using AWS Glue.