The World of Data
Start with a simple question: what is data? Data is the recorded description or measurements of something in the real world. For example, a list of heights is data; that is, height is a measure of the distance between a person's head and their feet. The data is used to describe a unit of observation. In the case of these heights, a person is a unit of observation.
As you can imagine, there is a lot of data you can gather to describe a person—including their age, weight, and smoking preferences. One or more of these measurements used to describe a specific unit of observation is called a data point, and each measurement in a data point is called a variable (often referred to as a feature). When you have several data points together, you have a dataset. For example, you may have Person A, who is a 45-year-old smoker, and Person B, who is a 24-year-old non-smoker. Here, age is a variable. The age of Person A is one measurement and the age of Person B is another. 45 and 24 are the values of measurement. A compilation of data points with measurements such as ages, weights, and smoking trends of various people is called a dataset.
Types of Data
Data can be broken down into three main categories: structured, semi-structured, and unstructured.
Structured data has an atomic definition for all the variables, such as the data type, value range, and meaning for values. In many cases, even the order of variables is clearly defined and strictly enforced. For example, the record of a student in a school registration card contains an identification number, name, and date of birth, each with a clear meaning and stored in order.
Unstructured data, on the other hand, does not have a definition as clear as structured data, and thus is harder to extract and parse. It may be some binary blob that comes from electronic devices, such as video and audio files. It may also be a collection of natural input tokens (words, emojis), such as social network posts and human speech.
Semi-structured data usually does not have a pre-defined format and meaning, but each of its measurement values is tagged with the definition of that measurement. For example, all houses have an address. But some may have a basement, or a garage, or both. It is also possible that owners may add upgrades that cannot be expected at the time when this house's information is recorded. All components in this data have clear definitions, but it is difficult to come up with a pre-defined list for all the possible variables, especially for the variables that may come up in the future. Thus, this house data is semi-structured.