As we have uncovered with the 3Vs, data comes in all shapes and sizes, so let's break down some key data types and better understand why they are important. To begin, let's classify data in general terms of unstructured, semi-structured, and structured.
Unstructured data
The concept behind unstructured data, which is textual in nature, has been around since the 1990s and includes the following examples: the body of an email message, tweets, books, health records, and images. A simple example of unstructured data would be an email message body that is classified as free text. Free text may have some obvious structure that a human can identify such as free space to break up paragraphs, dates, and phone numbers, but having a computer identify those elements would require programming to classify any data elements as such. What makes free text challenging for data analysis is its inconsistent nature, especially when trying to work with multiple examples.
When working with unstructured data, there will be inconsistencies because of the nature of free text including misspellings, the different classification of dates, and so on. Always have a peer review of the workflow or code used to curate the data.
Semi-structured data
Next, we have semi-structured data, which is similar to unstructured, however, the key difference is the addition of tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files, as shown in the following code:
{
"First_Name": "John",
"Last_Name": "Doe",
"Age": 42,
"Home_Address": {
"Address_1": "123 Main Street",
"Address_2": [],
"City": "New York",
"State": "NY",
"Zip_Code": "10021"
},
"Phone_Number": [
{
"Type": "cell",
"Number": "212-555-1212"
},
{
"Type": "home",
"Number": "212 555-4567"
}
],
"Children": [],
"Spouse": "yes"
}
This JSON formatted code allows for free text elements such as a street address, a phone number, and age, but now has tags created to identify those fields and values, which is a concept called key-value pairs. This key-value pair concept allows for the classification of data with a structure for analysis such as filtering, but still has the flexibility to change the elements as necessary to support the unstructured/free text. The biggest advantage of semi-structured data is the flexibility to change the underlining schema of how the data is stored. The schema is a foundational concept of traditional database systems that defines how the data must be persisted (that is, stored on disk).
The disadvantage to semi-structured data is that you may still find inconsistencies with data values depending on how the data was captured. Ideally, the burden on consistency is moved to the User Interface (UI), which would have coded standards and business rules such as required fields to increase the quality but, as a data analyst who practices KYD, you should validate that during the project.
Structured data
Finally, we have structured data, which is the most common type found in databases and data created from applications (apps or software) and code. The biggest benefit with structured data is consistency and relatively high quality between each record, especially when stored in the same database table. The conformity of data and structure is the foundation for analysis, which allows both the producers and consumers of structured data to come to the same results. The topic of databases, or Database Management Systems (DBMS) and Relational Database Management Systems(RDMS) is vast and will not be covered here, but having some understanding will help you to become a better data analyst.
The following diagram is a basic Entity-Relationship (ER) diagram of three tables that would be found in a database:
In this example, each entity would represent physical tables stored in the database, named car, part, and car_part_bridge. The relationship between the car and part is defined by the table called car_part_bridge, which can be classified by multiple names such as bridge, junction, mapping, or link table. The name of each field in the table would be on the left such as part_id, name, or description found in the part table.
The pk label next to the car_id and part_idfield names helps to identify the primary keys for each table. This allows for one field to uniquely identify each record found in the table. If aprimary keyin one table exists in another table, it would be called aforeign key, which is the foundation of how the relationship between the tables is defined and ultimately joined together.
Finally, the text aligned on the right side next to the field name labeled as int or text is the data type for each field. We will cover that concept next and you should now feel comfortable with the concepts for identifying and classifying data.
Common data types
Data types are a well-known concept in programming languages and is found in many different technologies. I have simplified the definition as, the details of the data that is stored and its intended usage. A data type will also create consistency for each data value as it's stored on disk or memory.
Data types will vary depending on the software and/or database used to create the structure. Hence, we won't be covering all the different types across all of the different coding languages but let's walk through a few examples:
Common data type
|
Common short name
|
Sample value
|
Example usage
|
Integers
|
int
|
1235
|
Counting occurrences, summing values, or the average of values such as sum (hits)
|
Booleans
|
bit
|
TRUE
|
Conditional testing such as if sales > 1,000, true else false
|
Geospatial
|
float or spatial
|
40.229290, -74.936707
|
Geo analytics based on latitude and longitude
|
Characters/string
|
char
|
A
|
Tagging, binning, or grouping data
|
Floating-point numbers
|
float or double
|
2.1234
|
Sales, cost analysis, or stock price
|
Alphanumeric strings
|
blob or varchar
|
United States
|
Tagging, binning, encoding, or grouping data
|
Time
|
time, timestamp, date
|
8/19/2000
|
Time-series analysis or year-over-year comparison
|
Technologies change and legacy systems will offer opportunities to see data types that may not be common. The best advice when dealing with new data types is to validate the source systems that are created by speaking to an SME (Subject Matter Expert) or system administrator, or to ask for documentation that includes the active version used to persist the data.
In the preceding table, I've created a summary of some common data types. Getting comfortable understanding the differences between data types is important because it determines what type of analysis can be performed on each data value. Numeric data types such as integer (int), floating-point numbers (float), ordoubleare used for mathematical calculations of values such as the sum of sales, count of apples, or the average price of a stock. Ideally, the source system of the record should enforce the data type but there can be and usually are exceptions.
As you evolve your data analysis skills, helping to resolve data type issues or offer suggestions to improve them will make the quality and accuracy of reporting better throughout the organization.
String data types that are defined in the preceding table as characters (char) and alphanumeric strings (varchar or blob) can be represented as text such as a word or full sentence. Time is a special data type that can be represented and stored in multiple ways such as 12 PM EST or a date such as 08/19/2000. Consider geospatial coordinates such as latitude and longitude, which can be stored in multiple data types depending on the source system.
The goal of this chapter is to introduce you to the concept of data types and future chapters will give direct, hands-on experience of working with them. The reason why data types are important is to avoid incomplete or inaccurate information when presenting facts and insights from analysis. Invalid or inconsistent data types also restrict the ability to create accurate charts or data visualizations. Finally, good data analysis is about having confidence and trust that your conclusions are complete with defined data types that support your analysis.