Introductory Overview
- Introduces data mining and its relation to other disciplines (statistics, machine learning, computer science).
- Warns about misuse of data mining through Bonferroni’s Principle.
- Introduces key concepts like TF.IDF, hash functions, and data summarization that underpin many mining techniques.
1.1 What is Data Mining?
- Data mining evolves with terminology: once “data mining,” then “big data,” now “data science.”
- Core idea remains: using computation to uncover patterns and solve problems in varied fields.
1.1.1 Modeling
- Modeling often involves learning patterns from data (machine learning), but can also mean direct algorithm design.
- Example: detecting phishing emails via weighted word models.
1.1.2 Statistical Modeling
- Statisticians view data mining as building a model of the data (e.g., a distribution like a Gaussian).
- Example: estimating parameters (mean, standard deviation) to model data statistically.
1.1.3 Machine Learning
- Some equate data mining with machine learning.
- ML is effective when the structure of the data is unknown (e.g., movie recommendations).
- However, ML can underperform when simple, domain-specific algorithms suffice (e.g., resume detection).
- Criticism: ML models (especially deep learning) can be accurate but not explainable.
1.1.4 Computational Approaches to Modeling
- Computer scientists view data mining as an algorithmic problem.
- Focus is on querying, summarizing, or extracting prominent features rather than building statistical models.
1.1.5 Summarization
- Summarizing large data sets makes them manageable.
- Example: PageRank distills the Web’s link structure into a single relevance score per page.
- Clustering reduces data by grouping similar points.
1.1.6 Feature Extraction
- Extract key patterns from data:
  - Frequent Itemsets: items often bought together.
  - Similar Items: find related users/items for recommendation (collaborative filtering).
1.2 Statistical Limits on Data Mining
- Mining for rare events (e.g., terrorism) can lead to false positives.
- Example: TIA (Total Information Awareness) aimed to mine massive datasets for security threats but faced feasibility and privacy concerns.
1.2.2 Bonferroni’s Principle
- If you search a large dataset in too many ways, you will find patterns that are statistically meaningless.
- Suggests computing how often a pattern would appear by chance in order to assess its significance.
1.2.3 Example of Bonferroni’s Principle
- Illustrates how mining massive data (e.g., flagging pairs of people who stayed at the same hotel on multiple days, as a proxy for detecting conspirators) produces many false positives.
- Even if only about 10 pairs are genuinely suspicious, roughly 250,000 innocent pairs would be flagged purely by chance.
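The 250,000 figure can be reproduced with back-of-the-envelope arithmetic. The specific inputs below (a billion people, 1,000 days of records, a 1% chance of any person being at some hotel on a given day, 100,000 hotels) are the standard assumptions used in this classic example, stated here as assumptions rather than facts from the data.

```python
# Assumed figures for the hotel-co-location example:
people = 10**9    # people under surveillance
days = 1000       # days of hotel records
p_hotel = 0.01    # chance a given person is at some hotel on a given day
hotels = 10**5    # number of hotels

# Probability two specific people are at the SAME hotel on a given day:
p_meet = p_hotel * p_hotel / hotels    # 1e-9

# Probability they meet on each of two specific days:
p_meet_twice = p_meet ** 2             # 1e-18

# Approximate pair counts (n choose 2 ~ n^2 / 2):
people_pairs = people**2 / 2           # 5e17
day_pairs = days**2 / 2                # 5e5

expected_false_events = people_pairs * day_pairs * p_meet_twice
print(expected_false_events)           # 250000.0
```

The punchline of Bonferroni’s Principle: the number of coincidental "suspicious" pairs dwarfs any plausible number of real ones.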
1.3 Things Useful to Know
- Introduces concepts that support data mining but aren’t mining techniques themselves.
1.3.1 TF.IDF
- Measures word importance in documents.
- High when a term is frequent in one document but rare across the rest of the collection.
- Useful for topic classification.
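A small sketch of the computation, using the usual definitions: TF is a term's count normalized by the count of the most frequent term in that document, and IDF is log2(N / n_i), where n_i is the number of documents containing the term. The toy documents are made up for illustration.

```python
from math import log2
from collections import Counter

# Toy corpus (invented documents for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum computing uses qubits".split(),
]

def tf_idf(term, doc, docs):
    """TF = count of term / count of the document's most frequent term;
    IDF = log2(N / n_i) with n_i = number of docs containing the term."""
    counts = Counter(doc)
    tf = counts[term] / max(counts.values())
    n_i = sum(1 for d in docs if term in d)
    idf = log2(len(docs) / n_i)
    return tf * idf

# A common word scores lower than a distinctive one:
print(tf_idf("the", docs[0], docs))
print(tf_idf("qubits", docs[2], docs))
```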
1.3.2 Hash Functions
- Map data items to buckets for fast access.
- A uniform distribution across buckets is key for performance.
- Choosing a prime number of buckets helps prevent biased mappings.
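A minimal sketch of hashing strings into buckets. The polynomial rolling hash and the prime bucket count 97 are illustrative choices, not a specific library's API.

```python
B = 97  # a prime bucket count, to avoid systematic collisions

def bucket(key: str, num_buckets: int = B) -> int:
    """Map a key to one of num_buckets buckets using a simple
    polynomial rolling hash (illustrative, not a standard API)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % num_buckets
    return h

for k in ["apple", "banana", "cherry", "date"]:
    print(k, "->", bucket(k))
```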
1.3.3 Indexes
- Speed up data retrieval based on key values.
- Often implemented using hash tables.
- Especially crucial when querying large datasets.
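A minimal sketch of a hash-based index: records (the fields and values below are invented) are grouped by one field so lookups on that field avoid scanning the whole data set.

```python
from collections import defaultdict

# Hypothetical records: (name, city, age).
records = [
    ("Sally", "Palo Alto", 30),
    ("Bob", "Chicago", 45),
    ("Sue", "Palo Alto", 28),
]

# Build a hash-based index on the "city" field: key -> matching records.
index = defaultdict(list)
for rec in records:
    index[rec[1]].append(rec)

# Retrieval by key no longer scans every record.
print(index["Palo Alto"])
```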
1.3.4 Secondary Storage
- Disk I/O is orders of magnitude slower than memory access.
- Algorithms should minimize disk access and maximize in-memory processing.
1.3.5 Natural Logarithm (e)
- Useful approximations such as (1 + a)^b ≈ e^(ab) when a is small.
- Appears in various statistical and algorithmic contexts.
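A quick numerical check of the approximation (1 + a)^b ≈ e^(ab) for small a, with a = 0.001 and b = 1000 chosen for illustration:

```python
import math

a, b = 0.001, 1000
exact = (1 + a) ** b        # (1.001)^1000
approx = math.exp(a * b)    # e^1 = e

print(exact, approx)        # the two values agree to within about 0.05%
```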
1.3.6 Power Laws
- Many real-world distributions (web page links, book sales) follow power laws: log y = b + a log x, i.e., y = c·x^a.
- Often linked to the Matthew Effect (“the rich get richer”).
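A short check that a power law y = c·x^a is a straight line on a log-log plot: generating exact power-law data (c and a below are arbitrary illustrative values) and recovering the exponent as the slope between two points in log space.

```python
import math

c, a = 1000.0, -2.0                # illustrative constants
xs = [1, 10, 100, 1000]
ys = [c * x**a for x in xs]

# Slope of log y vs. log x recovers the exponent a.
slope = (math.log(ys[-1]) - math.log(ys[0])) / (math.log(xs[-1]) - math.log(xs[0]))
print(slope)
```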
1.4 Outline of the Book
- Overview of subsequent chapters:
  - Ch. 2: Parallel computation (e.g., MapReduce)
  - Ch. 3: Similarity search (e.g., minhashing)
  - Ch. 4: Stream processing
  - Ch. 5: PageRank
  - Ch. 6: Market-basket analysis
  - Ch. 7: Clustering
  - Ch. 8: Online advertising
  - Ch. 9: Recommendation systems
  - Ch. 10: Social networks
  - Ch. 11: Dimensionality reduction
  - Ch. 12: Large-scale machine learning
1.5 Summary of Chapter 1
- Reinforces key ideas: modeling, Bonferroni’s Principle, TF.IDF, hashing, disk-based algorithms, and power laws.