Association Analysis
The business problem
Many companies accumulate large quantities of customer purchasing data from their day-to-day operations. This data is an opportunity to understand customer purchasing preferences.
- Problem: we do not know which products customers purchase, which products are purchased together, and which regularities link products and customers.
- Goals:
- We want to understand customer purchasing preferences
- Why? To take specific actions that improve our services and products (such as placement, recommendations, etc.)
Technique to solve the business problem
We will use association analysis:
- It is a technique for detecting and analysing relationships in recorded transactions of individuals, groups and objects.
- One of the best-known applications is market basket analysis, which aims to understand the relationships between purchased products.
- We will use the Apriori algorithm
Main Concepts
- Association rules, described by lhs => rhs:
- Lhs is short for left-hand side and refers to the left element of the rule (the antecedent). It is an itemset.
- Rhs is short for right-hand side and refers to the right element of the rule (the consequent). It is an itemset.
- Support: This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears.
- Confidence: This says how likely item Y is to be purchased when item X is purchased, expressed as {X => Y}. It is measured by the proportion of transactions containing item X in which item Y also appears.
- Lift: This says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is.
$$ \mathrm{lift}(lhs \Rightarrow rhs) = \frac{\mathrm{support}(lhs \cup rhs)}{\mathrm{support}(lhs) \cdot \mathrm{support}(rhs)} $$
Lift is usually the most informative measure to start the analysis with, because unlike confidence it accounts for how popular the rhs is on its own.
- If $$lift<1$$, lhs presence decreases the probability of rhs presence in the transaction.
- If $$lift=1$$, lhs and rhs are independent.
- If $$lift>1$$, lhs presence increases the probability of rhs presence in the transaction.
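The three measures can be checked by hand on a small example. The sketch below uses base R only; the five transactions and the helper names `support`, `confidence` and `lift` are invented for illustration and are not part of any package.

```r
# Toy transactions (hypothetical data, for illustration only)
transactions <- list(
  c("bread", "butter"),
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("milk"),
  c("bread", "milk")
)

# Support: proportion of transactions that contain every item of the itemset
support <- function(itemset, txns) {
  mean(vapply(txns, function(t) all(itemset %in% t), logical(1)))
}

# Confidence of lhs => rhs: support(lhs U rhs) / support(lhs)
confidence <- function(lhs, rhs, txns) {
  support(c(lhs, rhs), txns) / support(lhs, txns)
}

# Lift of lhs => rhs: support(lhs U rhs) / (support(lhs) * support(rhs))
lift <- function(lhs, rhs, txns) {
  support(c(lhs, rhs), txns) / (support(lhs, txns) * support(rhs, txns))
}

print(support("butter", transactions))             # 3 of 5 transactions
print(confidence("butter", "bread", transactions)) # every butter basket has bread
print(lift("butter", "bread", transactions))       # above 1: positive association
print(lift("milk", "bread", transactions))         # below 1: negative association
```

In this toy data, butter appears in 3 of 5 baskets and always together with bread, so the rule {butter} => {bread} has confidence 1 and a lift of 0.6 / (0.6 * 0.8) = 1.25.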
Implementation Process
- [BU/DU] Determine whether Association Analysis fits in our business
- [DP] Identification and understanding of sources and metadata
- [DP] Extract, clean and load data
- [DP] Purchasing Transaction identification
- [M] Choose the right algorithm (Apriori, Eclat, FP-growth, etc.)
- [M] Choose starting values for lift, support and confidence
- [M] Every product/service/department will have a NPS*
- [E] Analyze results and adjust parameters
- [D] Present and explain the results
Benefits
This technique provides the following benefits:
- Understand purchasing patterns. Example: Who is buying what? Which products sell best by region, channel, account, etc.? What is the customer profile?
- Affinity Analysis. Frequent items analysis.
- Propensity Analysis. Example: Who will buy what? What is the profile of those customers?
Use cases
This technique is used in different use cases:
- Online: automated recommendation system, discounts, promotion, placement, cross-selling, etc.
- Offline: product placement, supplier management, store design, product catalog design, bundles, discounts, promotions, etc.
- Customer Retention: it can be used to create compelling arguments (using, for example, product bundles) to retain customers.
- Web usage mining, Intrusion detection, Continuous production, Bioinformatics, etc.
How to implement this algorithm using R
We have several R libraries related to Association Analysis:
- arules: library to find frequent itemsets and association rules using the Apriori and Eclat algorithms
- arulesNBMiner: Java-based implementation for mining NB-frequent itemsets and NB-precise rules
- arulesSequences: handling of sequence data and the cSPADE algorithm for mining frequent sequences
- arulesViz: visualization library for arules
The apriori principle can reduce the number of itemsets we need to examine. Put simply, the apriori principle states that if an itemset is infrequent, then all its supersets must also be infrequent, so they never need to be counted. The algorithm still has limitations:
- Computationally expensive. Even though the apriori algorithm reduces the number of candidate itemsets to consider, this number can still be huge when store inventories are large or when the support threshold is low.
- Spurious associations. Analysis of large inventories involves more itemset configurations, and the support threshold might have to be lowered to detect certain associations. However, lowering the support threshold might also increase the number of spurious associations detected.
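The pruning effect of the apriori principle can be sketched in a few lines of base R. This is only an illustration of the idea, not the implementation used by arules; the transactions and the `min_support` value are invented.

```r
# Toy transactions (hypothetical data); "jam" is a rare item
transactions <- list(
  c("bread", "butter"),
  c("bread", "butter", "jam"),
  c("bread", "butter", "milk"),
  c("milk"),
  c("bread", "milk")
)
min_support <- 0.4

# Support of each single item
items <- sort(unique(unlist(transactions)))
item_support <- vapply(
  items,
  function(i) mean(vapply(transactions, function(t) i %in% t, logical(1))),
  numeric(1)
)

# Keep only the items that are frequent on their own
frequent_items <- items[item_support >= min_support]

# Without pruning, every pair of items is a candidate
all_pairs <- combn(items, 2, simplify = FALSE)

# With pruning, pairs are built from frequent items only: any pair that
# contains an infrequent item cannot be frequent itself
candidate_pairs <- combn(frequent_items, 2, simplify = FALSE)

print(length(all_pairs))       # candidates without pruning
print(length(candidate_pairs)) # candidates after pruning
```

Here a single infrequent item removes half of the candidate pairs. With thousands of products the saving is much larger, although the number of remaining candidates can still be high.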
# Install packages
install.packages("arules")
install.packages("arulesViz")
# Load packages
library("arulesViz")
library("arules")
# Load data
edata <- read.csv("data/chapter6.csv", stringsAsFactors = FALSE)
# Review data
summary(edata)
# Keep only unique rows (drop exact duplicate records)
ldata <- unique(edata)
# Split the products by purchase id: one vector of products per purchase
i <- split(ldata$producto, ldata$id_compra)
# Transform the list into a transactions object
txn <- as(i, "transactions")
# Apply apriori algorithm
basket_rules <- apriori(txn, parameter = list(supp = 0.005, conf = 0.01, target = "rules"))
# Review outcome
inspect(basket_rules)
# Plot outcome as scatterplot
plot(basket_rules)
# Plot outcome as a graph
plot(basket_rules,
     method = "graph",
     measure = "confidence",
     shading = "lift", control = list(type = "items"))
# Refine the analysis with stricter support and confidence thresholds
basket_rules_refined <- apriori(txn, parameter = list(supp = 0.05, conf = 0.2, target = "rules"))
inspect(basket_rules_refined)
plot(basket_rules_refined)
plot(basket_rules_refined,
     method = "graph",
     measure = "confidence",
     shading = "lift", control = list(type = "items"))
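To try the same steps without the original CSV file, the sketch below runs them on a small invented dataset and then ranks the rules by lift. The data frame `purchases` and its values are hypothetical. The column names `id_compra` and `producto` simply mirror the script above, and the thresholds are chosen only to suit such a tiny example. It assumes the arules package is installed.

```r
library(arules)

# Hypothetical purchase lines: one row per product in a purchase
purchases <- data.frame(
  id_compra = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
  producto  = c("bread", "butter", "bread", "butter",
                "bread", "butter", "milk", "milk", "bread", "milk"),
  stringsAsFactors = FALSE
)

# One vector of products per purchase, then coerce to transactions
baskets <- split(purchases$producto, purchases$id_compra)
txn <- as(baskets, "transactions")

# Mine rules; minlen = 2 skips rules with an empty left-hand side
rules <- apriori(
  txn,
  parameter = list(supp = 0.3, conf = 0.6, minlen = 2, target = "rules"),
  control = list(verbose = FALSE)
)

# Keep only rules with a positive association and rank them by lift
strong_rules <- sort(subset(rules, subset = lift > 1), by = "lift")
inspect(strong_rules)
```

Sorting by lift before inspecting puts the rules whose lhs most increases the probability of the rhs at the top, which is often a convenient starting point before looking at support and confidence.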