Mining Massive Datasets - Clustering


🔹 1. What is Clustering?

  • Clustering is the grouping of similar data points based on a distance measure.

  • Goal: Points within a cluster should be close; points in different clusters should be far apart.

  • Clustering is essential in domains with large, high-dimensional, or non-Euclidean data.


🔹 2. Types of Clustering Strategies

A. Hierarchical Clustering

  • Starts with each point as its own cluster.

  • Merges clusters iteratively based on similarity.

  • Can be agglomerative (bottom-up) or divisive (top-down).

  • Common in small datasets or datasets with meaningful nested structure.

B. Point Assignment Clustering

  • Each point is assigned to the most appropriate cluster.

  • Often starts with initial “seeds” and iteratively refines assignments.

  • Common algorithms: k-means, BFR, CURE.


🔹 3. Distance Measures and High-Dimensional Effects

  • Clustering depends on a distance metric: Euclidean, Manhattan, cosine, Jaccard, etc.

  • Curse of Dimensionality:

    • In high dimensions, pairwise distances concentrate: almost all pairs of points end up at nearly the same distance from one another.

    • Almost all pairs of random vectors are nearly orthogonal, so angle- and distance-based measures lose discriminating power; both effects make clusters hard to separate, as the sketch below illustrates.
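
The following NumPy sketch (illustrative only; the point counts and dimensions are arbitrary) shows the measures above and the distance-concentration effect:

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def manhattan(x, y):
    return np.abs(x - y).sum()

def cosine_distance(x, y):
    return 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_distance(a, b):              # a, b are Python sets
    return 1 - len(a & b) / len(a | b)

# Distance concentration: as the dimension d grows, the farthest and
# nearest pairwise distances among random points become nearly equal.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    pts = rng.random((100, d))
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    upper = dists[np.triu_indices(100, k=1)]
    print(f"d={d:4d}  max/min pairwise distance = {upper.max() / upper.min():.2f}")
```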


🔹 4. Hierarchical Clustering Concepts

  • Clusters can be represented using centroids in Euclidean space or clustroids in non-Euclidean space.

  • Strategies for merging clusters:

    • Minimum distance between points

    • Average distance

    • Centroid distance

    • Diameter or radius-based criteria

  • Stopping rules:

    • Predefined number of clusters

    • Cluster compactness (radius/diameter)

    • A merge that would cause a sudden jump in cluster diameter or dispersion (a sign that two genuinely separate clusters are being combined)
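
As a concrete illustration, here is a minimal NumPy sketch of agglomerative clustering using the centroid-distance merge rule and a fixed-k stopping rule (the O(n³) search is fine for small data only):

```python
import numpy as np

def agglomerative(points, k):
    """Naive bottom-up clustering: repeatedly merge the two clusters
    whose centroids are closest, stopping at k clusters."""
    clusters = [[i] for i in range(len(points))]          # member indices
    centroids = [points[i].astype(float) for i in range(len(points))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):                    # closest pair
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters.pop(j)                    # merge j into i
        centroids.pop(j)
        centroids[i] = points[clusters[i]].mean(axis=0)   # new centroid
    return clusters

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
print(agglomerative(pts, k=2))    # -> two clusters of three points each
```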


🔹 5. Clustering in Non-Euclidean Spaces

  • Use clustroids instead of centroids: a clustroid is an actual member of the cluster, typically the point that minimizes the sum of (squared) distances to the other members.

  • Distance between clusters can be based on:

    • Minimum, average, or maximum pairwise distances

    • Radius or density estimations using clustroids
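
A small sketch of the clustroid idea, using Jaccard distance over sets; the `rowsum` here is the sum of squared distances from a candidate to the other members (the same quantity GRGPF tracks, see section 9):

```python
def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

def clustroid(cluster):
    """Pick the member with the smallest sum of squared distances
    (its 'rowsum') to all other members of the cluster."""
    def rowsum(p):
        return sum(jaccard_distance(p, q) ** 2 for q in cluster)
    return min(cluster, key=rowsum)

docs = [{"data", "mining"}, {"data", "mining", "massive"},
        {"massive", "data"}, {"cats", "dogs"}]
print(clustroid(docs))   # the set closest, on average, to all the others
```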


🔹 6. K-Means Algorithm

  • Assigns points to clusters based on proximity to centroids.

  • Iteratively refines cluster centroids.

  • Initialization can greatly affect results.

  • A common heuristic for selecting k: plot average diameter (or radius) against k and pick the value where the curve stops falling rapidly (the “elbow”).
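
A bare-bones sketch of Lloyd's algorithm (random initialization from the data; real implementations add smarter seeding such as k-means++):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every point to its nearest centroid
        labels = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
        # recompute centroids (keep the old one if a cluster goes empty)
        new = np.array([points[labels == c].mean(axis=0) if (labels == c).any()
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0, 5, 10)])
labels, cents = kmeans(pts, k=3)
print(cents.round(1))                     # roughly (0,0), (5,5), (10,10)
```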


🔹 7. BFR Algorithm (Bradley, Fayyad, Reina)

  • Handles very large datasets in Euclidean space.

  • Assumes clusters are normally distributed about their centroids, with the axes of each distribution aligned to the coordinate axes (dimensions are independent).

  • Summarizes each cluster by N (point count), SUM (per-dimension sum), and SUMSQ (per-dimension sum of squares); these suffice to recover the centroid and variance.

  • Points too far from every centroid are held in a retained set; groups of nearby retained points are summarized as mini-clusters (the compressed set).

  • Uses a simplified Mahalanobis distance (each dimension normalized by its standard deviation) to decide whether a point can be assigned to a cluster, as sketched below.
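
A sketch of the BFR bookkeeping under its independence assumption: from (N, SUM, SUMSQ) we recover the centroid SUM/N and the per-dimension variance SUMSQ/N − (SUM/N)², and the simplified Mahalanobis distance follows:

```python
import numpy as np

class BFRSummary:
    """Discard-set summary used by BFR: only N, SUM, SUMSQ are kept."""
    def __init__(self, point):
        self.n, self.sum, self.sumsq = 1, point.copy(), point ** 2

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):                   # per-dimension variance
        return self.sumsq / self.n - (self.sum / self.n) ** 2

    def mahalanobis(self, point):
        # simplified Mahalanobis: dimensions assumed independent, so
        # normalize each one by its own standard deviation
        return np.sqrt((((point - self.centroid()) ** 2)
                        / self.variance()).sum())

s = BFRSummary(np.array([1.0, 2.0]))
s.add(np.array([2.0, 3.0])); s.add(np.array([1.5, 2.5]))
print(s.centroid(), s.mahalanobis(np.array([1.4, 2.4])))
# a common acceptance rule: distance below a few standard deviations,
# e.g. 3 * sqrt(d) for d dimensions
```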


🔹 8. CURE Algorithm (Clustering Using REpresentatives)

  • Designed for non-spherical, oddly shaped clusters in Euclidean space.

  • Uses representative points (not just centroids).

  • Representative points are moved a fixed fraction of the distance (e.g., 20%) toward the cluster centroid to reduce sensitivity to outliers.

  • Two clusters are merged when a pair of their representative points is sufficiently close.
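
An illustrative sketch of how representative points might be picked and shrunk; the choice of 4 representatives and a 20% shrink factor are conventional example values, not fixed by the algorithm:

```python
import numpy as np

def representatives(cluster, n_rep=4, shrink=0.2):
    """Pick n_rep scattered points, then move each a fraction `shrink`
    of the way toward the centroid (dampens the effect of outliers)."""
    centroid = cluster.mean(axis=0)
    # first representative: the point farthest from the centroid
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < n_rep:
        # next: the point farthest from all representatives chosen so far
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d)])
    return [r + shrink * (centroid - r) for r in reps]

# a ring-shaped cluster, where a single centroid would be misleading
ring = np.array([[np.cos(t), np.sin(t)] for t in np.linspace(0, 2 * np.pi, 20)])
for r in representatives(ring):
    print(r.round(2))
```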


🔹 9. GRGPF Algorithm

  • Designed for non-Euclidean and out-of-core (disk-based) data.

  • Uses a tree-based structure whose nodes hold cluster summaries: the clustroid, its rowsum (sum of squared distances to the other members), rowsums of points near the clustroid, and the cluster size.

  • Assigns new points by descending the tree to the nearest cluster.

  • Supports dynamic splitting/merging of clusters to maintain scalability and accuracy.
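
A small sketch of the rowsum bookkeeping GRGPF relies on (the `edit_distance_stub` helper is a hypothetical stand-in for whatever non-Euclidean metric the data uses): rowsums can be updated incrementally as points arrive, and the clustroid is simply the member with the smallest rowsum:

```python
def add_point(cluster, rowsums, new_point, dist):
    """Incrementally maintain rowsums as a point joins the cluster.
    rowsums[i] holds sum of dist(p_i, p_j)^2 over the other members."""
    new_rowsum = 0.0
    for i, p in enumerate(cluster):
        d2 = dist(p, new_point) ** 2
        rowsums[i] += d2              # each old point gains one term
        new_rowsum += d2
    cluster.append(new_point)
    rowsums.append(new_rowsum)
    # the clustroid is the member with the smallest rowsum
    return cluster[min(range(len(cluster)), key=rowsums.__getitem__)]

def edit_distance_stub(a, b):         # crude stand-in for a real metric
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

cluster, rowsums = [], []
for w in ["data", "date", "gate", "zebra"]:
    print("clustroid:", add_point(cluster, rowsums, w, edit_distance_stub))
```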