Mining Massive Datasets - Clustering


🔹 1. What is Clustering?

  • Clustering is the grouping of similar data points based on a distance measure.

  • Goal: Points within a cluster should be close; points in different clusters should be far apart.

  • Clustering is essential in domains with large, high-dimensional, or non-Euclidean data.


🔹 2. Types of Clustering Strategies

A. Hierarchical Clustering

  • Starts with each point as its own cluster.

  • Merges clusters iteratively based on similarity.

  • Can be agglomerative (bottom-up) or divisive (top-down).

  • Common in small datasets or datasets with meaningful nested structure.

B. Point Assignment Clustering

  • Each point is assigned to the most appropriate cluster.

  • Often starts with initial “seeds” and iteratively refines assignments.

  • Common algorithms: k-means, BFR, CURE.


🔹 3. Distance Measures and High-Dimensional Effects

  • Clustering depends on a distance metric: Euclidean, Manhattan, cosine, Jaccard, etc.

  • Curse of Dimensionality:

    • In high dimensions, pairwise distances concentrate: almost all pairs of points end up at nearly the same distance from one another.

    • Almost all pairs of random vectors are nearly orthogonal, so angle- and distance-based measures lose discriminating power; both effects make clusters hard to separate, as the sketch below illustrates.
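
The following NumPy sketch (illustrative only; the point counts and dimensions are arbitrary) shows the measures above and the distance-concentration effect:

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def manhattan(x, y):
    return np.abs(x - y).sum()

def cosine_distance(x, y):
    return 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_distance(a, b):              # a, b are Python sets
    return 1 - len(a & b) / len(a | b)

# Distance concentration: as the dimension d grows, the farthest and
# nearest pairwise distances among random points become nearly equal.
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    pts = rng.random((100, d))
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    upper = dists[np.triu_indices(100, k=1)]
    print(f"d={d:4d}  max/min pairwise distance = {upper.max() / upper.min():.2f}")
```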


🔹 4. Hierarchical Clustering Concepts

  • Clusters can be represented using centroids in Euclidean space or clustroids in non-Euclidean space.

  • Strategies for merging clusters:

    • Minimum distance between points

    • Average distance

    • Centroid distance

    • Diameter or radius-based criteria

  • Stopping rules:

    • Predefined number of clusters

    • Cluster compactness (radius/diameter)

    • A merge that would cause a sudden jump in cluster diameter or dispersion (a sign that two genuinely separate clusters are being combined)
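
As a concrete illustration, here is a minimal NumPy sketch of agglomerative clustering using the centroid-distance merge rule and a fixed-k stopping rule (the O(n³) search is fine for small data only):

```python
import numpy as np

def agglomerative(points, k):
    """Naive bottom-up clustering: repeatedly merge the two clusters
    whose centroids are closest, stopping at k clusters."""
    clusters = [[i] for i in range(len(points))]          # member indices
    centroids = [points[i].astype(float) for i in range(len(points))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):                    # closest pair
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters.pop(j)                    # merge j into i
        centroids.pop(j)
        centroids[i] = points[clusters[i]].mean(axis=0)   # new centroid
    return clusters

pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
print(agglomerative(pts, k=2))    # -> two clusters of three points each
```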


🔹 5. Clustering in Non-Euclidean Spaces

  • Use clustroids instead of centroids: a clustroid is an actual member of the cluster, typically the point that minimizes the sum of (squared) distances to the other members.

  • Distance between clusters can be based on:

    • Minimum, average, or maximum pairwise distances

    • Radius or density estimations using clustroids
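
A small sketch of the clustroid idea, using Jaccard distance over sets; the `rowsum` here is the sum of squared distances from a candidate to the other members (the same quantity GRGPF tracks, see section 9):

```python
def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

def clustroid(cluster):
    """Pick the member with the smallest sum of squared distances
    (its 'rowsum') to all other members of the cluster."""
    def rowsum(p):
        return sum(jaccard_distance(p, q) ** 2 for q in cluster)
    return min(cluster, key=rowsum)

docs = [{"data", "mining"}, {"data", "mining", "massive"},
        {"massive", "data"}, {"cats", "dogs"}]
print(clustroid(docs))   # the set closest, on average, to all the others
```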


🔹 6. K-Means Algorithm

  • Assigns points to clusters based on proximity to centroids.

  • Iteratively refines cluster centroids.

  • Initialization can greatly affect results.

  • A common heuristic for selecting k: plot average diameter (or radius) against k and pick the value where the curve stops falling rapidly (the “elbow”).
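
A bare-bones sketch of Lloyd's algorithm (random initialization from the data; real implementations add smarter seeding such as k-means++):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every point to its nearest centroid
        labels = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
        # recompute centroids (keep the old one if a cluster goes empty)
        new = np.array([points[labels == c].mean(axis=0) if (labels == c).any()
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0, 5, 10)])
labels, cents = kmeans(pts, k=3)
print(cents.round(1))                     # roughly (0,0), (5,5), (10,10)
```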


🔹 7. BFR Algorithm (Bradley, Fayyad, Reina)

  • Handles very large datasets in Euclidean space.

  • Assumes clusters are normally distributed about their centroids, with the axes of each distribution aligned to the coordinate axes (dimensions are independent).

  • Summarizes each cluster by N (point count), SUM (per-dimension sum), and SUMSQ (per-dimension sum of squares); these suffice to recover the centroid and variance.

  • Points too far from every centroid are held in a retained set; groups of nearby retained points are summarized as mini-clusters (the compressed set).

  • Uses a simplified Mahalanobis distance (each dimension normalized by its standard deviation) to decide whether a point can be assigned to a cluster, as sketched below.
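
A sketch of the BFR bookkeeping under its independence assumption: from (N, SUM, SUMSQ) we recover the centroid SUM/N and the per-dimension variance SUMSQ/N − (SUM/N)², and the simplified Mahalanobis distance follows:

```python
import numpy as np

class BFRSummary:
    """Discard-set summary used by BFR: only N, SUM, SUMSQ are kept."""
    def __init__(self, point):
        self.n, self.sum, self.sumsq = 1, point.copy(), point ** 2

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):                   # per-dimension variance
        return self.sumsq / self.n - (self.sum / self.n) ** 2

    def mahalanobis(self, point):
        # simplified Mahalanobis: dimensions assumed independent, so
        # normalize each one by its own standard deviation
        return np.sqrt((((point - self.centroid()) ** 2)
                        / self.variance()).sum())

s = BFRSummary(np.array([1.0, 2.0]))
s.add(np.array([2.0, 3.0])); s.add(np.array([1.5, 2.5]))
print(s.centroid(), s.mahalanobis(np.array([1.4, 2.4])))
# a common acceptance rule: distance below a few standard deviations,
# e.g. 3 * sqrt(d) for d dimensions
```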


🔹 8. CURE Algorithm (Clustering Using REpresentatives)

  • Designed for non-spherical, oddly shaped clusters in Euclidean space.

  • Uses representative points (not just centroids).

  • Representative points are moved a fixed fraction of the distance (e.g., 20%) toward the cluster centroid to reduce sensitivity to outliers.

  • Two clusters are merged when a pair of their representative points is sufficiently close.
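
An illustrative sketch of how representative points might be picked and shrunk; the choice of 4 representatives and a 20% shrink factor are conventional example values, not fixed by the algorithm:

```python
import numpy as np

def representatives(cluster, n_rep=4, shrink=0.2):
    """Pick n_rep scattered points, then move each a fraction `shrink`
    of the way toward the centroid (dampens the effect of outliers)."""
    centroid = cluster.mean(axis=0)
    # first representative: the point farthest from the centroid
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < n_rep:
        # next: the point farthest from all representatives chosen so far
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d)])
    return [r + shrink * (centroid - r) for r in reps]

# a ring-shaped cluster, where a single centroid would be misleading
ring = np.array([[np.cos(t), np.sin(t)] for t in np.linspace(0, 2 * np.pi, 20)])
for r in representatives(ring):
    print(r.round(2))
```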


🔹 9. GRGPF Algorithm

  • Designed for non-Euclidean and out-of-core (disk-based) data.

  • Uses a tree-based structure whose nodes hold cluster summaries: the clustroid, its rowsum (sum of squared distances to the other members), rowsums of points near the clustroid, and the cluster size.

  • Assigns new points by descending the tree to the nearest cluster.

  • Supports dynamic splitting/merging of clusters to maintain scalability and accuracy.
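
A small sketch of the rowsum bookkeeping GRGPF relies on (the `edit_distance_stub` helper is a hypothetical stand-in for whatever non-Euclidean metric the data uses): rowsums can be updated incrementally as points arrive, and the clustroid is simply the member with the smallest rowsum:

```python
def add_point(cluster, rowsums, new_point, dist):
    """Incrementally maintain rowsums as a point joins the cluster.
    rowsums[i] holds sum of dist(p_i, p_j)^2 over the other members."""
    new_rowsum = 0.0
    for i, p in enumerate(cluster):
        d2 = dist(p, new_point) ** 2
        rowsums[i] += d2              # each old point gains one term
        new_rowsum += d2
    cluster.append(new_point)
    rowsums.append(new_rowsum)
    # the clustroid is the member with the smallest rowsum
    return cluster[min(range(len(cluster)), key=rowsums.__getitem__)]

def edit_distance_stub(a, b):         # crude stand-in for a real metric
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

cluster, rowsums = [], []
for w in ["data", "date", "gate", "zebra"]:
    print("clustroid:", add_point(cluster, rowsums, w, edit_distance_stub))
```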