Let's go straight to the point. To group or classify the data out there, three major groups have been defined: unstructured, semi-structured, and structured.
To give you a simple example, we can compare these types of data to vegetation. I've used three pictures, one for each data type; you will find them in the following paragraphs. Think of data as trees, leaves, and branches: unstructured data is like a wild, uncultivated forest or jungle, where nature creates a beautiful chaos. We can imagine semi-structured data as a forest with paths, where walking is still a bit difficult, but not as difficult as in the wild forest. The last one is structured data, which is represented by a very well-organized field that is easy to walk through.
So, let's take a look at each of these data types in more detail.
As the name suggests, unstructured data is data with no predefined form. It can be a combination of images, text, emails, and video files, and it creates value only once it has been processed, analyzed, and organized.
Figure 1.4 – Unstructured data is like an uncultivated forest or jungle
Some of the main characteristics of the unstructured data type are as follows:
- It does not have any particular format or sequence.
- It does not follow any rules or semantics.
- It does not have an easily identifiable structure.
- It cannot be stored in a spreadsheet-like form (that is, based on rows and columns).
- It isn't directly usable or understandable by a program.
So, basically speaking, anything that has neither a database-like form nor descriptive tags belongs to the unstructured data type.
Important note
Gartner estimates that unstructured data constitutes 80% of the world's enterprise data.
Figure 1.5 – Semi-structured data is similar to a forest with paths
The semi-structured type, in contrast, is data that can be processed with the help of metadata tagging, which helps us extract useful information. With this type of data, it is still difficult to determine the meaning of the content, and it is even more challenging to store it in rows and columns as in a standard database. So, even when metadata is available, it is not always possible to automate data analysis. To give you an example, take a look at the following email structure:
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name><Name>
Body:<Graphics, Images, Links, Text, etc.>
An email with those tags is considered semi-structured data: similar entities are grouped in a hierarchy. For each group, there could be few or many properties, and those properties may or may not be the same. If you read the email structure again, you can immediately see that the tags give us some metadata. Still, it is almost impossible to organize the data of the body tag, for example, because it will almost certainly have no fixed format. So, let's take a look at the most common features of the semi-structured data type:
- Attributes within the same group may not be the same.
- Similar entities will be grouped.
- It doesn't conform to a data model, but it contains tags and metadata.
- It cannot be stored in a spreadsheet-like form, that is, based on rows and columns.
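To make the email example concrete, here is a minimal sketch in Python that uses the standard library `email` module. The message itself, including the names and addresses, is made up for illustration. It shows that the header tags can be read by name as metadata, while the body comes back as one block of free text:

```python
from email import message_from_string

# Hypothetical message, made up for illustration; it mirrors the email tags shown earlier.
raw_message = """\
To: Alice <alice@example.com>
From: Bob <bob@example.com>
Subject: Site visit photos
Cc: Carol <carol@example.com>, Dave <dave@example.com>

Hi Alice, the photos are attached. See also https://example.com/album
"""

message = message_from_string(raw_message)

# The header tags are metadata: each one can be looked up by name.
headers = {name: message[name] for name in ("To", "From", "Subject", "Cc")}

# The body is free text: the parser returns it as a single string with no fields.
body = message.get_payload()

print(headers["Subject"])
print(type(body).__name__)
```

The tags give a program something to hold on to, but anything useful inside the body still has to be found by further processing.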
There are lots of challenges here, too, when it comes to managing, storing, and analyzing semi-structured data. The computer science community seems to be moving toward a single standard format for storing it. All you need to know, for now, is that there is a hardware- and software-independent format called XML (Extensible Markup Language), an open standard written in plain text. XML plays roughly the same role for semi-structured data that the Industry Foundation Classes (IFC) play for BIM models!
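As a rough illustration of what XML looks like in practice, the following Python sketch parses a tiny document with the standard library `xml.etree.ElementTree` module. The element and attribute names are invented for this example and do not follow IFC or any other real schema. Notice that the two `wall` entries are grouped under the same tag, yet they do not carry the same attributes:

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for illustration (not an IFC or BIM schema).
xml_text = """
<elements>
  <wall id="W1" material="concrete" height="3.0"/>
  <wall id="W2" material="brick"/>
  <door id="D1" width="0.9" fire_rating="EI30"/>
</elements>
"""

root = ET.fromstring(xml_text)

# Similar entities share a tag, but their attributes may differ.
for element in root:
    print(element.tag, sorted(element.attrib))

# Collect the attribute names of each wall to compare them.
walls = root.findall("wall")
wall_attribute_sets = [set(wall.attrib) for wall in walls]
```

The tags and attributes act as metadata, but nothing forces every element of a group to have the same fields.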
The third data type is structured data: systematically organized data, typically held in a database, that companies can use for direct analysis and processing. It consists of information that has been both transformed and formatted into a well-defined data model. Without going too much into the technical details, remember that this type of data is mapped into predesigned fields that can later be read and queried in a relational database. Of the three types, this is the easiest to search and analyze. Even though the relational model minimizes data redundancy, you still need to be careful: structured data is more interdependent, and for this reason, it can sometimes be less flexible.
Figure 1.6 – Structured data looks like a well-organized field
So, some of the most important features of the structured data type are as follows:
- It conforms to a data model.
- Similar entities are grouped.
- Attributes within the same group are the same.
- Data resides in fixed fields within a record.
- The definition and meaning of the data are explicitly known.
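The features above can be sketched with a small relational table. This is only an illustration: it uses Python's built-in `sqlite3` module with an in-memory database, and the `rooms` table, its columns, and its values are assumptions made up for the example:

```python
import sqlite3

# In-memory database; the table and its contents are hypothetical.
connection = sqlite3.connect(":memory:")

# The data model is declared up front: every record has the same fixed fields.
connection.execute(
    "CREATE TABLE rooms ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, area_m2 REAL NOT NULL)"
)
rows = [(1, "Kitchen", 12.5), (2, "Living room", 24.0), (3, "Bathroom", 6.5)]
connection.executemany("INSERT INTO rooms VALUES (?, ?, ?)", rows)

# Because the meaning of each field is known, the data can be analyzed directly.
total_area = connection.execute("SELECT SUM(area_m2) FROM rooms").fetchone()[0]
large_rooms = [
    name
    for (name,) in connection.execute(
        "SELECT name FROM rooms WHERE area_m2 > 10 ORDER BY name"
    )
]

print(total_area, large_rooms)
connection.close()
```

The trade-off mentioned earlier shows up here as well: adding a new kind of property means changing the table definition, not just one record.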
At this point, I would like you to understand that we will have to carry out different tasks to transform our raw data into structured information, whether we are dealing with unstructured, semi-structured, or structured data.
As you probably already understand, data is becoming a fundamental tool for knowing and understanding the world around us. Simply put, we can think of data as a way to "approach" problems and to "solve" them in the end. And at this point, I would like to introduce you to the Data, Information, Knowledge, Wisdom (DIKW) pyramid, also known as the DIKW model. The pyramid helps us by describing how raw data can be transformed into information, then knowledge, then wisdom. Take a look at the following image. As we move up to the top of the pyramid, we look for patterns, and by imposing structure, organization, classification, and categorization, we turn data without any particular meaning into knowledge, and finally wisdom.
Figure 1.7 – DIKW pyramid
To better fix the concept in your head, I would like to give you a simple yet practical example of the DIKW pyramid by talking about the weather! Imagine that it is raining; this is an objective fact with no particular meaning on its own, so we can associate it with data. Then, if I tell you that it started to rain at 7 p.m. and the temperature dropped by 2 degrees, that is information. Continuing on that path, if we can explain why this is happening, for example by saying that the temperature dropped because of low pressure in the atmosphere, we are talking about knowledge. And in the end, if we get a complete picture of things such as temperature gradients, air currents, evaporation, and so on, we can statistically make predictions about what will happen in the future. That's wisdom!
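The first step of the pyramid, from data to information, can be sketched in a few lines of Python. The hourly readings below are made up for illustration and simply mirror the weather example; the variable names are assumptions, too. The later steps (knowledge and wisdom) need explanation and prediction, which this sketch does not attempt:

```python
# Hypothetical hourly readings, made up for illustration:
# (hour of day, is it raining?, temperature in degrees)
readings = [
    (17, False, 21.0),
    (18, False, 21.0),
    (19, True, 20.0),
    (20, True, 19.0),
]

# Data: isolated facts, such as "it is raining" at a given hour.
# Information: the same facts put in context (when the rain began,
# and how much the temperature changed since then).
rain_start_hour = next(hour for hour, raining, _ in readings if raining)
temperature_before = next(
    temp for hour, _, temp in readings if hour == rain_start_hour - 1
)
temperature_drop = temperature_before - readings[-1][2]

print(
    f"Rain started at {rain_start_hour}:00; "
    f"temperature dropped by {temperature_drop} degrees"
)
```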
Although we didn't go into the technical details, I would like you to remember that businesses and organizations all around the world use processes such as the one described here, the DIKW pyramid, when it comes to organizing their data.
Here, we've learned some fundamental concepts, such as types of digital data and their differences. We've also learned about the DIKW pyramid. Next, we will talk about how much data we produce every 60 seconds!