Categorizing the data to identify its types and best solutions for your storage is an important process for a data solution, and not just for evaluating whether it is structured, unstructured, or semi-structured. In this section, you will learn about the characteristics of different types of data.
Characteristics of relational and non-relational databases
Relational databases are the most traditional and used database format, as they have an easy-to-understand design and a simple tabular data model like other simple platforms such as Excel spreadsheets. Relational databases have predefined schemas, which are the structures of their tables, containing columns, the data type of each column, and other parameters such as primary and secondary keys used in relationships.
However, relational databases with these rigid schemas can pose challenges, as presented in the following example.
Your CRM system has a database structure with a CUSTOMER
table, where you intend to store customer data: CUSTOMER_ID
, CUSTOMER_NAME
, ADDRESS
, MOBILE_PHONE
, and ZIP_CODE
. To do this, you start by creating a CUSTOMER
table with five fields:
Figure 1.7 – Example of a CUSTOMER table in a relational database
However, after setting up this table, you realize that you have clients that have more than one address and zip code, and even more than one mobile phone number. How can you solve this issue?
To face problems like this one, we can use normalization one more time. Normalization is done when there is a need to split a table (CUSTOMER
, in this example) into more child tables that are correlated to the initial table.
Therefore, we can change the CUSTOMER
table as follows:
Figure 1.8 – A relationship model in a transactional database
Non-relational databases allow you to store data in its original format without having a predefined schema as in relational databases. The most common non-relational storage format is document storage, where each record in the database is an independent file. The benefit is that each file can have different and unique attributes.
On the other hand, the files being independent can present a challenge: data duplication.
Going back to our CUSTOMER
entity example in a relational database, when two or more customers live at one address, the database records that relationship, and the normalized database only keeps one address record. But in a non-relational database, if two customers live at the same address, this address will be presented in the records of the first customer and the second customer as well, independently.
Let’s now analyze how this storage could be structured in a relational database, using the concept of normalization:
Figure 1.9 – Example of data structured into tables
The preceding figure exemplifies the data stored within the relational model tables with the CUSTOMER
, CUSTOMER_ADDRESS
, and ADDRESS
entities to understand the structure of a normalized table.
Now let’s analyze the same data in a CUSTOMER
table, but in the format of a non-relational database:
## JSON FILE - CUSTOMER ##
{
"CUSTOMER_ID": "0001",
" CUSTOMER_NAME":
{
"FIRST_NAME": " MARK",
"LAST_NAME": " HUGGS"
},
"ADDRESS":
{
"STREET": "1200, Harper Str"
}
}
## JSON FILE – CUSTOMER2 ##
{
"CUSTOMER_ID": "0002",
" CUSTOMER_NAME":
{
"FIRST_NAME": " KRISTI",
"LAST_NAME": " LAMP"
},
"ADDRESS":
{
"STREET": "1200, Harper Str"
}
}
In the preceding example, we can see two records in a CUSTOMER
table, with each record being a JSON document structured with the attributes of each customer.
Thus, we can observe that the same data can be stored in relational and non-relational structures.
Therefore, to decide between a relational or non-relational data storage solution, you must evaluate the behavior of the application or the user that will use that database, the relationships between the entities, and possible normalization processes.
Both relational and non-relational databases should be used primarily for transactional workloads. In the upcoming sections, we will understand the differences between these transactional workloads and analytical workloads.
A transactional workload
Relational and non-relational databases can be used as solutions for transactional workloads, which are the databases used to perform basic data storage operations: create, read, update, and delete (CRUD). Transactional operations must be done in sequence, with a transaction control that only confirms the conclusion of this transaction (a process called a commit) when the entire operation is successfully executed. If this does not occur, the transaction is canceled, and all processes are not performed, thus generating a process called rollback.
An important idea to help understand the difference between relational and non-relational databases is ACID, present in most database technologies. These properties are as follows:
- Atomicity: This is the property that controls the transaction and defines whether it was successfully performed completely to commit or must be canceled by performing a rollback. Database technology should ensure atomicity.
- Consistency: For a running transaction, it is important to evaluate consistency between the database state before receiving the data and the database state after receiving the data. For example, in a bank transfer, when funds are added to an account, those funds must have a source. Therefore, it is important to know this source and whether the fund’s source exit process has already been performed before confirming the inclusion in this new account.
- Isolation: This property evaluates whether there are multiple executions of transactions similar to the current one and if so, it keeps the database in the same state. It then evaluates whether the execution of transactions was sequential. In the bank transfer example, if multiple transactions are sent simultaneously, it checks whether the amounts have already left the source for all transactions, or you need to review one by one, transaction per transaction.
- Durability: This is responsible for evaluating whether a transaction remains in the committed database even if there is a failure during the process, such as a power outage or latency at the time of recording the record.
ACID properties are not unique to transactional databases; they are also found in analytic databases. At this point, the most important thing is to understand that these settings exist, and you can adjust them as per the requirements of your data solution use case.
Since we are talking about databases, let’s understand an acronym that is widely used to represent database software: DBMS.
Database management systems
Database management systems (DBMSs), which are database software, have ACID properties within their architecture, and in addition to performing these controls, they need to manage several complex situations. For example, if multiple users or systems try to access or modify database records, the database systems need to isolate transactions, perform all necessary validations quickly, and maintain the consistency of the data stored after the transaction is committed. For this, some DBMS technologies work with temporary transaction locks, so that actions are done sequentially. This lock is done during the process of an action executing in that record; for example, in an edit of a field in a table, the lock ends as soon as the commit is executed, confirming that transaction.
Some DBMSs are called distributed databases. These databases have their architecture distributed in different storage and processing locations, which can be on-premises in the company’s data center or a different data center in the cloud. Distributed database solutions are widely used to maintain consistency in databases that will serve applications in different geographic locations, but this consistency doesn’t need to be synchronous. For example, a mobile game can be played in the United States and Brazil, and the database of this game has some entities (categories, game modes, and so on) that must be shared among all players. But the transactions from the United States player do not necessarily need to appear to the player in Brazil in a real-time way; this transactional data will be synchronized from the United States to Brazil, but in an asynchronous process. Let’s understand this process next.
Eventual consistency
All transactions in distributed databases take longer to process than in undistributed databases because it is necessary to replicate the data across all nodes in this distributed system. So, to maintain an adequate replication speed, the distributed databases only synchronize the data that is needed. This is the concept of eventual consistency, which configures ACID to perform replication between the distributed nodes asynchronously, after the confirmation of the transaction on the main node of the database is created. This technique can lead to temporary inconsistencies between database nodes. Ideally, the application connected to a distributed database does not require a guarantee of data ordering. It means that the data relating to this eventual consistency may appear to users with an eventual delay as well. Distributed databases are widely used by social media platforms, for news feeds, likes, and shares, among other features.
Let’s use the following figure to understand the behavior of a database with eventual consistency:
Figure 1.10 – Diagram of an eventual consistency database
The preceding diagram shows behavior that we can observe when querying information in a database with eventual consistency. Instead of fetching the ball in a sequential way, the hero who retrieved it made the query of the ball in a future frame, generating a momentary duplication of the ball. In the end, only one ball was retrieved, after the sync was done.
This is an analogy for an eventual consistency database, where queries do not need to be made on entities that are already synchronized between all replicas of the database, and sometimes, this momentary duplication happens until the asynchronous process data update is complete.
In addition to transactional, relational, or non-relational databases, we also have another data workload, the analytical workload, which we will address in the next section.
An analytical workload
The second category of data solutions is the analytical workloads. These analytical solutions are based on high-volume data processing platforms, optimized for querying and exploring, and not for CRUD transactions or with ACID properties. In analytical databases, we aggregate various data sources, such as more than one transactional database, as well as logs, files, images, videos, and everything that can generate information for a business analyst.
This raw data is processed and aggregated, thus generating summaries, trends, and predictions that can support decision-making.
An analytical workload can be based on a specific time or a sequence of dated events. In these workloads, it’s common to evaluate only the data that is relevant to the analysis. For example, if you have a sales system with a transactional database (source) with several tables recording all sales, products, categories, and customers, among others, it is important to evaluate which of these tables can be used for the analytical database (destination) and then perform the data connections.
To create an analytical database, it is necessary to perform data ingestion, a process of copying data from sources to the analytical base. For this, a technique called extract, transform, and load (ETL), or the more recent extract, load, and transform (ELT), is used. The following figure demonstrates this process with an example of a transactional database as the data source and the analytical database as the destination:
Figure 1.11 – Data flow between a transactional database and an analytical database
In the preceding diagram, we can see that transactional databases are storages of information systems that automate business processes. Analytical databases act on simple and advanced data analysis, using, for example, statistical models with the application of machine learning, a branch of artificial intelligence. The data ingestion process is an important process for assembling an analytical database that meets the data solution. In the next section, we will understand what data ingestion is and the different types of this ingestion.