🔹 1. What is Clustering?
- Clustering is the grouping of similar data points based on a distance measure.
- Goal: points within a cluster should be close together; points in different clusters should be far apart.
- Clustering is essential in domains with large, high-dimensional, or non-Euclidean data.
🔹 2. Types of Clustering Strategies
A. Hierarchical Clustering
- Agglomerative (bottom-up): starts with each point as its own cluster and iteratively merges the most similar clusters.
- Divisive (top-down): starts with all points in one cluster and recursively splits it.
- Common for small datasets or data with meaningful nested structure.
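For a concrete starting point, here is a minimal agglomerative run on made-up 2-D data using SciPy's scipy.cluster.hierarchy (the blobs and parameter choices are illustrative only):

```python
# A minimal agglomerative example using SciPy (toy data, illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose blobs in 2-D.
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="average")                  # average-link merging
labels = fcluster(Z, t=2, criterion="maxclust")   # stop at 2 clusters
print(labels)
```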
B. Point Assignment Clustering
- Each point is assigned to the most appropriate cluster.
- Often starts with initial "seeds" and iteratively refines assignments.
- Common algorithms: k-means, BFR, CURE.
🔹 3. Distance Measures and High-Dimensional Effects
- Clustering depends on a distance measure: Euclidean, Manhattan, cosine, Jaccard, etc.
- Curse of Dimensionality:
  - In high dimensions, all points tend to become equidistant from one another.
  - Random vectors are nearly orthogonal, which complicates clustering.
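The concentration effect is easy to observe empirically. The sketch below (random Gaussian data; the sample sizes are arbitrary) prints how the max/min pairwise-distance ratio shrinks toward 1 and how a random pair of vectors approaches orthogonality as the dimension grows:

```python
# Illustrative demo of distance concentration and near-orthogonality
# in high dimensions, using random Gaussian points.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
for d in (2, 10, 100, 1000):
    X = rng.normal(size=(200, d))
    dist = pdist(X)                      # all pairwise Euclidean distances
    u, v = rng.normal(size=d), rng.normal(size=d)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    print(f"d={d:5d}  max/min distance ratio = {dist.max()/dist.min():6.2f}  "
          f"|cos(angle)| = {abs(cos):.3f}")
```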
🔹 4. Hierarchical Clustering Concepts
- Clusters can be represented by centroids in Euclidean space or by clustroids in non-Euclidean space.
- Strategies for deciding which clusters to merge:
  - Minimum distance between points in the two clusters
  - Average pairwise distance
  - Distance between centroids
  - Diameter- or radius-based criteria
- Stopping rules:
  - A predefined number of clusters is reached
  - Cluster compactness (radius/diameter) exceeds a threshold
  - A merge causes a sudden increase in cluster size or dispersion
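A minimal from-scratch sketch of agglomerative merging under the centroid-distance rule, assuming Euclidean data and a target cluster count k (the naive O(n³) search is fine for small inputs):

```python
# Sketch of agglomerative clustering with the centroid-distance merge rule.
# Naive O(n^3) loop; assumes Euclidean data and a target cluster count k.
import numpy as np

def agglomerate(X, k):
    clusters = [[i] for i in range(len(X))]          # start: one point each
    while len(clusters) > k:
        best, pair = np.inf, None
        # Find the pair of clusters whose centroids are closest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = X[clusters[i]].mean(axis=0)
                cj = X[clusters[j]].mean(axis=0)
                d = np.linalg.norm(ci - cj)
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)               # merge the pair
    return clusters
```

Instead of stopping at a fixed k, the loop could also halt when `best` jumps sharply between iterations, matching the "sudden increase in dispersion" rule above.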
🔹 5. Clustering in Non-Euclidean Spaces
- Use clustroids (real points in the cluster) instead of centroids.
- Distance between clusters can be based on:
  - Minimum, average, or maximum pairwise distances
  - Radius or density estimates computed from the clustroids
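Because a non-Euclidean space offers no "average point", a common choice of clustroid is the member minimizing the sum of squared distances to the rest (its rowsum). A sketch, assuming an arbitrary user-supplied distance function:

```python
# Sketch: pick the clustroid of a cluster under an arbitrary distance
# function `dist` (e.g., edit distance, Jaccard distance).
def clustroid(points, dist):
    # ROWSUM(p) = sum of squared distances from p to the other members;
    # the clustroid minimizes it.
    def rowsum(p):
        return sum(dist(p, q) ** 2 for q in points if q is not p)
    return min(points, key=rowsum)

# Toy usage with Jaccard distance on sets (data made up for the example).
def jaccard_dist(a, b):
    return 1 - len(a & b) / len(a | b)

sets = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {9, 10}]
print(clustroid(sets, jaccard_dist))   # {1, 2, 3, 4} is the most central
```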
🔹 6. K-Means Algorithm
- Assigns each point to the cluster with the nearest centroid.
- Iteratively recomputes centroids and reassigns points.
- Initialization can greatly affect results.
- Heuristics for selecting k include monitoring how the average cluster diameter changes as k grows.
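A minimal Lloyd-style k-means sketch with random seeding (real systems typically initialize more carefully, e.g., with well-scattered seeds):

```python
# Minimal Lloyd's k-means sketch: random seeds, then alternate
# assignment and centroid-update steps until the centroids stabilize.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # initial seeds
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster goes empty).
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Running this for increasing k and tracking the average cluster diameter gives the selection heuristic mentioned above: the diameter falls rapidly until k matches the true cluster count, then levels off.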
🔹 7. BFR Algorithm (Bradley, Fayyad, Reina)
- Handles very large datasets in Euclidean space.
- Assumes clusters are normally distributed and axis-aligned.
- Summarizes each cluster by N (point count), SUM (per-dimension sum), and SUMSQ (per-dimension sum of squares).
- Points too far from any centroid are temporarily retained or merged into mini-clusters.
- Uses the Mahalanobis distance for cluster assignment.
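A sketch of the per-cluster summary and the axis-aligned Mahalanobis distance it supports; the class name and the variance guard are my own for illustration. Mean and variance are recoverable from the summary alone, so absorbed points need not be stored:

```python
# Sketch of a BFR cluster summary: N, SUM, SUMSQ per dimension.
import numpy as np

class BFRSummary:
    def __init__(self, point):
        self.n = 1
        self.sum = np.array(point, dtype=float)
        self.sumsq = self.sum ** 2

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += np.asarray(point, dtype=float) ** 2

    def mahalanobis(self, point):
        mean = self.sum / self.n
        var = self.sumsq / self.n - mean ** 2     # per-dimension variance
        var = np.maximum(var, 1e-12)              # guard against zero variance
        return np.sqrt((((point - mean) ** 2) / var).sum())
```

A point is absorbed into the summary if its Mahalanobis distance falls below a threshold (a few standard deviations); otherwise it goes to the retained set.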
🔹 8. CURE Algorithm (Clustering Using REpresentatives)
- Designed for non-spherical, oddly shaped clusters in Euclidean space.
- Represents each cluster by several scattered representative points rather than a single centroid.
- Representative points are moved a fixed fraction of the way toward the cluster centroid to reduce sensitivity to outliers.
- Clusters are merged when their representative points are sufficiently close.
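A sketch of CURE's representative-point construction: pick well-scattered points, then shrink each a fixed fraction alpha toward the centroid (the values of m and alpha here are illustrative; the function assumes at least m points):

```python
# Sketch of CURE's representative-point construction: choose well-
# scattered points, then shrink each toward the centroid by `alpha`.
import numpy as np

def cure_representatives(points, m=4, alpha=0.2):
    centroid = points.mean(axis=0)
    # First representative: the point farthest from the centroid.
    reps = [points[np.linalg.norm(points - centroid, axis=1).argmax()]]
    while len(reps) < m:
        # Next: the point farthest from all representatives chosen so far.
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[d.argmax()])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)          # shrink toward centroid
```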
🔹 9. GRGPF Algorithm
- Designed for non-Euclidean and out-of-core (disk-resident) data.
- Uses a tree-based structure with cluster summaries at the nodes (clustroid, rowsums, etc.).
- Assigns each new point by descending the tree to the nearest cluster.
- Dynamically splits and merges clusters to maintain scalability and accuracy.
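A sketch of GRGPF's incremental bookkeeping when a point joins a cluster, using the high-dimensional approximation ROWSUM(p) ≈ ROWSUM(c) + N·d(p, c)², where c is the clustroid. The helper's exact shape is my own; in the algorithm these features live in the tree nodes:

```python
# Sketch: insert point p into a cluster whose members' rowsums are
# tracked. The new point's rowsum is estimated from the clustroid alone.
def insert_point(members, rowsums, clustroid_idx, p, dist):
    n = len(members)
    c = members[clustroid_idx]
    # Estimate ROWSUM(p) before the cluster's features change.
    new_rowsum = rowsums[clustroid_idx] + n * dist(p, c) ** 2
    # Existing rowsums each gain the squared distance to the new point.
    for i, q in enumerate(members):
        rowsums[i] += dist(q, p) ** 2
    members.append(p)
    rowsums.append(new_rowsum)
    # The member with the smallest rowsum is the (possibly new) clustroid.
    return min(range(len(members)), key=rowsums.__getitem__)
```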