Mining of Massive Datasets - Chapter 9 Summary

 

🔹 Introduction to Recommendation Systems

  • Definition: Systems that predict user responses to various items or content.

  • Examples:

    • News recommendation (e.g., articles tailored to reading habits).

    • Product recommendation (e.g., Amazon suggesting items based on past purchases).

  • Two Main Types:

    • Content-Based Systems: Focus on item properties (e.g., recommending cowboy movies if the user watches many in that genre).

    • Collaborative Filtering Systems: Recommend items based on similarity between users or between items, measured from ratings or behavior.


🔹 9.1 A Model for Recommendation Systems

  • Utility Matrix:

    • Rows: Users

    • Columns: Items

    • Values: User ratings (e.g., 1-5 stars)

    • Most entries are unknown (sparse matrix)

  • Long Tail Effect:

    • Online systems can offer a vast range of items, including niche ones, unlike physical stores.

    • Example: “Touching the Void” became a bestseller years after publication, once Amazon began recommending it to buyers of the newer, similar book “Into Thin Air”.

  • Populating the Matrix:

    • Explicit ratings: Collected directly from users.

    • Implicit feedback: Derived from behavior (clicks, purchases, views).
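
A minimal sketch of the utility matrix above, assuming a dictionary-of-dictionaries layout that stores only the known entries. The user names, item names, and ratings are made up for illustration, and `lookup` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical explicit ratings: (user, item, stars). All values are made up.
ratings = [
    ("alice", "HP1", 4),
    ("alice", "TW", 5),
    ("bob", "HP1", 5),
    ("carol", "SW1", 2),
]

# Store only the known entries: utility[user][item] = rating.
utility = defaultdict(dict)
for user, item, stars in ratings:
    utility[user][item] = stars

users = sorted(utility)
items = sorted({item for _, item, _ in ratings})

# Fraction of the full users-by-items grid that is actually filled in.
known = sum(len(row) for row in utility.values())
density = known / (len(users) * len(items))


def lookup(user, item):
    """Return the rating, or None when the entry is unknown."""
    return utility.get(user, {}).get(item)
```

Here only 4 of the 9 possible entries are known; real utility matrices are far sparser, which is why unknown entries are left out rather than stored as zeros.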


🔹 9.2 Content-Based Recommendations

  • Item Profiles:

    • Attributes like genre, director, cast (for movies), or specs for products.

    • Profiles are used to match users to items similar to ones they have liked.

  • Feature Extraction:

    • From structured metadata or textual data (e.g., using TF-IDF for documents).

    • Can use tags or user-generated annotations (e.g., del.icio.us, collaborative tagging games).

  • User Profiles:

    • Created based on the items a user has liked.

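    • Often represented as a vector over the same features as the item profiles, e.g., the average of the profiles of liked items (weighted by rating when ratings are available).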

  • Document & Image Features:

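    • Documents: use the words with the highest TF-IDF scores as features.

    • Images: rely on tags, since useful features are hard to extract from the pixels themselves.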

  • Classification Approach:

    • Use classifiers (e.g., decision trees) trained on a user’s past behavior to predict preferences.
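
One way to picture the content-based pipeline above, as a sketch under stated assumptions: item profiles are short feature vectors, the user profile is the plain average of liked item profiles, and unseen items are ranked by cosine similarity to that profile. The feature names, items, and helper functions are hypothetical.

```python
import math

# Hypothetical item profiles over the features [western, comedy, stars_actor_x].
item_profiles = {
    "movie_a": [1.0, 0.0, 1.0],
    "movie_b": [1.0, 0.0, 0.0],
    "movie_c": [0.0, 1.0, 0.0],
    "movie_d": [1.0, 1.0, 0.0],
}


def user_profile(liked):
    """Average the profile vectors of the items the user liked."""
    dims = len(next(iter(item_profiles.values())))
    return [sum(item_profiles[i][d] for i in liked) / len(liked) for d in range(dims)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(liked):
    """Rank the items the user has not liked yet by similarity to the profile."""
    profile = user_profile(liked)
    unseen = [i for i in item_profiles if i not in liked]
    return sorted(unseen, key=lambda i: cosine(profile, item_profiles[i]), reverse=True)
```

For a user who liked `movie_a` and `movie_b`, the profile leans toward westerns, so `movie_d` ranks ahead of `movie_c`. A classifier trained on the same feature vectors is the alternative the notes mention.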


🔹 9.3 Collaborative Filtering

  • Similarity Measures:

    • Compare user or item vectors using:

      • Jaccard similarity (binary data)

      • Cosine similarity (rating values)

  • Approaches:

    • User-User Similarity: Recommend items liked by similar users.

    • Item-Item Similarity: Recommend similar items to ones the user has liked.

  • Clustering:

    • Cluster users or items to reduce sparsity and improve similarity calculations.

    • Can be applied in alternating steps: cluster items and replace their columns with per-cluster average ratings, then cluster users on the revised matrix, and repeat as needed.
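
The two similarity measures above can be compared on a small example. This is a sketch assuming each user is stored as a dictionary of known ratings; the three users and their ratings are illustrative, and unknown ratings are treated as 0 in the cosine computation.

```python
import math

# Illustrative known ratings per user; missing items are simply absent.
A = {"HP1": 4, "TW": 5, "SW1": 1}
B = {"HP1": 5, "HP2": 5, "HP3": 4}
C = {"TW": 2, "SW1": 4, "SW2": 5}


def jaccard(x, y):
    """Overlap of the sets of rated items; the rating values are ignored."""
    a, b = set(x), set(y)
    return len(a & b) / len(a | b)


def cosine(x, y):
    """Cosine of the rating vectors, with unknown ratings counted as 0."""
    dot = sum(x[i] * y[i] for i in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y)
```

Jaccard puts A closer to C (0.5 versus 0.2) because they rated more of the same items, even though their ratings disagree. Cosine puts A slightly closer to B (about 0.380 versus 0.322), since it takes the rating values into account.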


🔹 9.4 Dimensionality Reduction: UV Decomposition

  • Matrix Factorization:

    • Approximate the n-by-m utility matrix M by the product of an n-by-d matrix U and a d-by-m matrix V, with d small, so that UV matches M as closely as possible on the known entries.

  • Root Mean Square Error (RMSE):

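    • Measures prediction accuracy: the square root of the average squared difference between UV and the utility matrix, taken over the known entries only.

    • The entries of U and V are adjusted, one at a time or by gradient descent, to minimize the RMSE.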

  • Optimization Tricks:

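    • Run the optimization from several random starting points, since any single run can stall in a local minimum; keep the best result or average the predictions of several decompositions.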
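
A minimal sketch of UV decomposition, assuming stochastic gradient descent as the optimizer (one of several possible choices) and a tiny illustrative matrix. The function names, learning rate, and epoch count are assumptions rather than values from the chapter. Each run starts near sqrt(mean/d), so UV begins close to the average known rating, and `best_of` keeps the best of several random starts.

```python
import math
import random

# Small utility matrix; None marks an unknown rating. Values are illustrative.
M = [
    [5, 2, 4, 4, 3],
    [3, 1, 2, 4, 1],
    [2, None, 3, 1, 4],
    [2, 5, 4, 3, 5],
    [4, 4, 5, 4, None],
]
known = [(i, j, M[i][j]) for i in range(len(M)) for j in range(len(M[0]))
         if M[i][j] is not None]


def predict(U, V, i, j):
    """Entry (i, j) of the product UV."""
    return sum(U[i][k] * V[k][j] for k in range(len(V)))


def rmse(U, V):
    """Root-mean-square error over the known entries only."""
    return math.sqrt(sum((r - predict(U, V, i, j)) ** 2 for i, j, r in known) / len(known))


def init(d, rng):
    """Start near sqrt(mean/d), with a small random perturbation per element."""
    base = math.sqrt(sum(r for _, _, r in known) / len(known) / d)
    U = [[base + rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in M]
    V = [[base + rng.uniform(-0.1, 0.1) for _ in M[0]] for _ in range(d)]
    return U, V


def train(d=2, epochs=500, lr=0.01, seed=0):
    """Return (U, V, starting RMSE, final RMSE) for one random start."""
    rng = random.Random(seed)
    U, V = init(d, rng)
    start = rmse(U, V)
    for _ in range(epochs):
        for i, j, r in known:
            err = r - predict(U, V, i, j)
            for k in range(d):
                u, v = U[i][k], V[k][j]
                U[i][k] += lr * err * v  # step along the negative gradient
                V[k][j] += lr * err * u
    return U, V, start, rmse(U, V)


def best_of(seeds, **kwargs):
    """Run several random starts and keep the one with the lowest final RMSE."""
    return min((train(seed=s, **kwargs) for s in seeds), key=lambda run: run[3])
```

Calling `train()` returns the starting and final RMSE so the improvement can be checked; `predict(U, V, 2, 1)` then gives an estimate for one of the unknown entries.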


🔹 9.5 The Netflix Challenge

  • Overview:

    • $1M prize for the first algorithm to beat Netflix’s CineMatch by 10%, measured by RMSE.

  • Dataset:

    • About 100 million ratings from roughly half a million users on about 17,000 movies.

  • Winning Strategy:

    • Blended the predictions of many independently developed algorithms.

    • Time of rating proved useful: some movies are rated higher by viewers who rate right after watching, while others are better appreciated by those who rate after a delay.

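    • External metadata such as IMDB genres gave little improvement, possibly because the learning algorithms had already discovered that information from the ratings, and because matching Netflix titles to IMDB entries (entity resolution) is hard to do exactly.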


🔹 9.6 Summary Highlights

  • Key Concepts:

    • Utility Matrix: Central to recommendation modeling.

    • Content-Based & Collaborative Filtering: Core architectures.

    • Feature Engineering: Crucial for content-based methods.

    • Matrix Factorization & RMSE: UV decomposition predicts the missing ratings, with RMSE as the error measure.

    • Hybrid Models: Combine methods for better performance.

    • Clustering: Aids in overcoming sparsity in user-item interactions.


These concepts and relationships can be structured into a knowledge graph built around:

  • Entities: Users, Items, Features, Ratings, Clusters, Algorithms

  • Relations: likes, rates, similar_to, clustered_with, uses_algorithm, optimized_by, influenced_by (e.g., time, features)