Business aspect
Business requirements are the starting point for your application and also for choosing your database system. There are four stages in the life cycle of data in its business application that help determine the choice of database system:
- Data ingestion
- Storage
- Process
- Visualize
The following diagram represents the attributes in the four stages of data and the categories of questions in each stage in the life cycle of your data:
Figure 1.1 – Representation of the four stages of data and the categories of questions in each stage
Let’s look at some of these attributes in detail. Some of them are in the business attributes category, while others are technical.
Ingestion
This is the first stage in the data life cycle and it is all about acquiring (bringing in) data from different sources in one place into your system. In this stage, the questions that arise are bucketed into three categories:
- What type of data are you bringing in?
- What is the purpose of this data?
- What is the structure of your data?
Let’s take a look at each in detail.
Types of data
There are broadly three types of data we will be dealing with that highly influence the choice of database and storage.
Application data
This is the kind of data that is generated or downloaded as part of the application’s content and can contain transactional data that is generated by users and applications – for example, online retail applications, log data from applications, event data, and clickstream data. Let’s take a look at a specific example – consider a banking application in which user A transfers money from their account to user B’s account. In this case, the user data, such as the account ID, name, bank details, the recipient’s name, and transaction date, constitute the application data.
Live stream and real-time stream data
This data comes from real-time sources such as streaming data, which comes in continuously from data sources such as sensor data. These can also be event data responses and can be very frequent compared to batch data processing. It refers to data that is immediately available and not delayed by a system or process. The term real-time stream refers to streams of real-time data that are gathered and stored or processed as they come in. This includes monitoring data such as CPU utilization, memory consumption, Internet of Things (IoT) devices data such as humidity and pressure, and automated real-time environmental temperature monitoring data.
Batch data
This is data that comes in as bulk at scheduled intervals and could be event-triggered. For example, batch data is transactional data that comes in from applications after a transaction and is stored for use in later stages of the data life cycle. This can include data extracted from one application for use in another at a later point, data migration use cases, and file uploads for processing later. Such applications may not be designed for real-time operations on the data.
The purpose of data
The specific use case and the nature of implementing applications using the data being ingested is a critical factor in determining the choice and design of the database. There may be cases where the type and ingestion mode of data fall into a different choice of database design, whereas its functional use case would imply a different purpose. For example, you could have data streamed in from live events or housekeeping data coming in real-time from transactions, but the specific use case you are designing for might only involve visualization, analytical, or ML functionalities. So, make sure you understand what purpose you are solving with the data that is being ingested in a specific mode and type.
The structure of data
The structure of the data is a crucial factor in deciding the choice and design of a database. There are three widely recognized categories:
- Structured
- Semi-structured
- Unstructured
Let’s briefly explore these three categories.
Structured data
This type of data is typically composed of rows and columns; rows are entities or records and columns are attributes. Structured data is organized in such a way that you can be sure that the data structure will be consistent for the most part throughout the life cycle of that data, except for the possible addition or removal of some attributes altogether. This kind of data is mostly transactional or analytical.
Semi-structured data
Semi-structured data does not follow a fixed tabular format – that is, a column-row structure. Instead, it stores schema attributes along with data. The attributes for semi-structured data could vary for each record. The major differentiating factor for each kind of semi-structured data is the way they are accessed.
Unstructured data
Unstructured data includes images, audio files, and so on. Unstructured data does not have a definite schema or data model. The amount of unstructured data is much larger than that of structured data. So, the methods by which we store such data are more important than ever. Here are some examples of unstructured data:
- Text
- Audio
- Video
- Images
- Other binary large objects (BLOBs)
Now that we have had a sneak peek into the structure of data, be sure to include functional and design questions based on these categories while designing your database and application model.