Data Mining and Frequent Itemset Mining Essay

Submitted By QWERT02468
Words: 1189
Pages: 5

Data Mining:
Concepts and Techniques
(3rd ed.)

— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Adapted for CSE 347-447, Lecture 5b, Spring 2015

1

Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods n Basic Concepts n Frequent Itemset Mining Methods n Which Patterns Are Interesting?—Pattern
Evaluation Methods n Summary
2

What Is Frequent Pattern Analysis? u Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

u

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

u

u

Motivation: Finding inherent regularities in data u What products were often purchased together?— Beer and diapers?!

u

What are the subsequent purchases after buying a PC?

u

What kinds of DNA are sensitive to this new drug?

u

Can we automatically classify web documents?

Applications u Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
3

Why Is Frequent Pattern Mining Important? n n

Frequent patterns: An intrinsic and important property of datasets Foundation for many essential data mining tasks n n n n n n n n

Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications

4

Basic Concepts: Frequent Patterns
Tid

Items bought

10

Beer, Nuts, Diaper

20

Beer, Coffee, Diaper

30

Beer, Diaper, Eggs

40

Nuts, Eggs, Milk

50

Nuts, Coffee, Diaper, Eggs, Milk
Customer
buys both

Customer buys diaper

n

n n n

n

Customer buys beer

itemset: A set of one or more items k-itemset X = {x1, …, xk}
(absolute) support, or, support count of X: Frequency or occurrence of an itemset X
(relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X’s support is no less than a minsup threshold 5

Basic Concepts: Association Rules
Tid

Items bought

10

Beer, Nuts, Diaper

20

Beer, Coffee, Diaper

30

Beer, Diaper, Eggs

40
50

Nuts, Eggs, Milk

n

Nuts, Coffee, Diaper, Eggs, Milk
Customer
buys both

Customer buys beer

Customer buys diaper

Find all the rules X à Y with minimum support and confidence n support, s, probability that a transaction contains X and Y n confidence, c, conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,
{Beer, Diaper}:3 n Association rules: (many more!) n Beer à Diaper (60%, 100%) n Diaper à Beer (60%, 75%)
6

Closed Patterns and Max-Patterns n n n n

A long pattern contains a combinatorial number of sub100
100
patterns, e.g., {a1, …, a100} contains ( 1 ) + ( 2 ) + … +
100 – 1 = 1.27*1030 sub-patterns!
(100
100) = 2
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ‫ כ‬X, with the same support as X
(proposed by Pasquier, et al. @ ICDT’99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ‫ כ‬X (proposed by
Bayardo @ SIGMOD’98)

7

Closed Patterns and Max-Patterns n n

n

Exercise: Suppose a DB contains only two transactions n <a1, …, a100>, <a1, …, a50>

n

Let min_sup = 1

What is the set of closed itemsets? n {a1, …, a100}: 1

n

{a1, …, a50}: 2

What is the set of max-patterns? n n

{a1, …, a100}: 1

What is