Introducing stream clustering
Clustering is the task of separating a set of observations (tuples) into groups, or clusters, so that records within a cluster are similar to one another and records in different clusters are dissimilar. Several approaches to clustering exist for data at rest. With streaming data, however, records keep arriving at some rate, and we don't have the luxury of accessing the data randomly or making multiple passes over it. Many data stream clustering algorithms therefore use a two-phase scheme: an online component processes the incoming points and produces summary statistics, and an offline component uses that summary data to generate the clusters.
This online/offline two-stage processing is the framework most commonly adopted by stream clustering algorithms.
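As a rough illustration of the two-phase idea, the following Python sketch makes a single pass over a stream, folding each point into a nearby running summary (the online part), and then clusters only the summaries (the offline part). It is a simplified sketch under stated assumptions, not any particular published algorithm: the names Summary, online_phase, and offline_phase, the fixed radius threshold, and the weighted k-means step are all hypothetical choices made for this example.

```python
import math


class Summary:
    """Hypothetical running summary: point count and per-dimension sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        # Update the statistics in place; the point itself is not stored.
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]


def online_phase(stream, radius):
    """Single pass: fold each point into the nearest summary or open a new one."""
    summaries = []
    for point in stream:
        nearest = min(
            summaries,
            key=lambda s: math.dist(s.centroid(), point),
            default=None,
        )
        # Assumption: a fixed radius decides whether a point joins a summary.
        if nearest is not None and math.dist(nearest.centroid(), point) <= radius:
            nearest.absorb(point)
        else:
            summaries.append(Summary(point))
    return summaries


def offline_phase(summaries, k, iterations=10):
    """Weighted k-means over summary centroids; raw points are never revisited."""
    points = [(s.centroid(), s.n) for s in summaries]
    # Naive initialisation: the first k summaries (fewer if not enough exist).
    centers = [p for p, _ in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p, w in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(centers[i], p))
            groups[idx].append((p, w))
        for i, group in enumerate(groups):
            if group:
                total = sum(w for _, w in group)
                dims = len(group[0][0])
                centers[i] = [
                    sum(p[d] * w for p, w in group) / total for d in range(dims)
                ]
    return centers


# Usage: any iterable works as the stream, since it is consumed exactly once.
stream = iter([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.1)])
summaries = online_phase(stream, radius=1.0)
centers = offline_phase(summaries, k=2)
```

The point of the split is that the offline step works on a handful of summaries rather than on the stream itself, so it can be rerun on demand without a second pass over the data.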
Before we go on to explain the online/offline two-stage process, let us quickly look at micro-clusters.
Micro-clusters are created by a single...