Big data is a relatively new term that has been gathering steam over the past few years. It refers to datasets that are too large to be stored in a traditional database system or processed by traditional data-processing pipelines. This data can be structured, semi-structured, or unstructured, and datasets in this category typically scale to terabytes or petabytes. Big data usually involves one or more of the following:
- Velocity: Data arrives at unprecedented speed and must be dealt with in a timely manner. Examples include online systems, sensors, social media, and web clickstreams.
- Volume: Organizations collect data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. This can amount to terabytes or even petabytes. In the past, storing it would have been a problem, but new technologies have eased the burden.
- Variety: Data comes in all sorts of formats, ranging from structured data stored in traditional databases to unstructured data (blobs) such as images, audio files, and text files.
These are known as the 3Vs of big data.
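To make the variety dimension concrete, the following is a minimal sketch (with made-up records; all field names are hypothetical) of how structured, semi-structured, and unstructured data differ in how readily their fields can be accessed:

```python
import csv
import io
import json
import re

# Structured: a CSV record has a fixed schema and parses directly into fields.
structured = "order_id,amount\n1001,29.99\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: JSON carries its own flexible, self-describing schema.
semi = json.loads('{"order_id": 1001, "notes": {"gift": true}}')

# Unstructured: free text has no schema; fields must be extracted by parsing.
unstructured = "Customer 1001 paid $29.99 on 2021-03-05."
amount = re.search(r"\$(\d+\.\d{2})", unstructured).group(1)

print(rows[0]["amount"], semi["order_id"], amount)
```

The further down this spectrum the data sits, the more work a processing pipeline must do before the data can be queried or aggregated.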
In addition to these, we tend to associate another term with big data:
- Complexity: Today's data comes from multiple sources, which makes it difficult to link, match, cleanse, and transform across systems. It is nevertheless necessary to connect and correlate relationships, hierarchies, and multiple data linkages, or your data can quickly spiral out of control. Data must also be able to traverse multiple data centers, clouds, and geographical zones.
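The link-match-cleanse problem can be illustrated with a toy example (all records and field names here are hypothetical): the same customer appears in two systems, and the records only match after the join key is cleansed:

```python
# Two sources describing the same customer with inconsistently formatted keys.
crm = [{"email": " Alice@Example.com ", "name": "Alice"}]
billing = [{"email": "alice@example.com", "balance": 42.0}]

def cleanse(email: str) -> str:
    # Normalize the key so the same customer matches across systems.
    return email.strip().lower()

# Match: index one source by the cleansed key, then link records from the other.
index = {cleanse(r["email"]): r for r in billing}
linked = [
    {**c, **index[cleanse(c["email"])]}
    for c in crm
    if cleanse(c["email"]) in index
]
print(linked)
```

Without the cleansing step, the raw keys `" Alice@Example.com "` and `"alice@example.com"` would never match, and the link between the two systems would silently be lost; at big-data scale this normalization must happen across every source being correlated.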