Cohort Analysis
- Problem: we don't know how customers' behavior changes over time
- Goals:
  - Understand how customers' behavior evolves
  - Understand behavior through groups that evolve over time, not through individuals
- Why? To take specific actions on groups and to detect similar temporal patterns within a group
Technique to solve the business problem
Cohort analysis is an observational, analytic, and longitudinal study. It compares how a particular measure (a KPI) evolves across groups. The individuals in each study group are selected because they share a particular characteristic.
Main Concepts
A cohort is a group of people used in a study who have something in common (such as age or social class).
The term dates back to ancient times: a cohort was a military unit, one of the ten divisions of a Roman legion.
Cohort analysis is a traditional tool in epidemiology. When the technique is applied in other industries, in most cases:
- Metrics are easier to capture and analyse
  - Direct: number of customers, revenue, cost
  - Derived: retention
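As a rough illustration of the difference between a direct and a derived metric, the sketch below computes retention as the share of each cohort's initial customers that is still active in later periods. The `active` matrix, its cohort labels, and all of its numbers are hypothetical and exist only for this example.

```r
# Hypothetical counts of active customers: one row per cohort (signup month),
# one column per period since signup. NA marks periods not yet observed.
active <- matrix(
  c(100, 60, 45, 40,
    120, 66, 54, NA,
     80, 48, NA, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(cohort = c("Jan", "Feb", "Mar"),
                  period = c("M0", "M1", "M2", "M3"))
)

# Direct metric: number of customers in each cohort at signup (period M0)
cohort_size <- active[, "M0"]

# Derived metric: retention (%) = active customers / cohort size * 100,
# computed row by row so each cohort is divided by its own size
retention <- round(sweep(active, 1, cohort_size, "/") * 100, 1)
print(retention)
```

Reading the result row by row shows how one cohort decays over time; reading it column by column compares cohorts at the same age.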
Implementation Process
- [BU] Determine business questions/needs, measure to study and cohorts of interest
- [DU] Data Sourcing, Cleaning & Exploration
- [DP] Create cohorts, extract data according to cohorts
- [M] Calculate the measure
- [E] Analyze results and adjust parameters
- [D] Present and explain the results
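The [DP] and [M] steps above can be sketched in a few lines of base R. This is a minimal sketch under stated assumptions: the `orders` data frame is a hypothetical purchase log, and defining a cohort as the month of a customer's first purchase is only one possible choice.

```r
# Hypothetical transaction log: one row per purchase
orders <- data.frame(
  customer = c("a", "a", "b", "b", "c", "c", "d"),
  date = as.Date(c("2024-01-05", "2024-02-10", "2024-01-20", "2024-03-02",
                   "2024-02-03", "2024-03-15", "2024-03-09"))
)

# [DP] Create cohorts: a customer's cohort is the month of the first purchase
orders$month <- format(orders$date, "%Y-%m")
first_month <- tapply(orders$month, orders$customer, min)
orders$cohort <- as.character(first_month[orders$customer])

# [M] Calculate the measure: distinct active customers per cohort and month
active <- unique(orders[, c("customer", "cohort", "month")])
cohort_table <- table(cohort = active$cohort, month = active$month)
print(cohort_table)
```

Each row of `cohort_table` is one cohort and each column one calendar month, which is the shape the later steps (evaluation and presentation) usually work from.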
Benefits
This technique provides the following benefits:
- Understand the customer lifecycle/journey: length, value, situation, etc.
- Identify patterns
- Behavioral/Psychographic analysis
Use cases
This technique is used in different use cases:
- Examine where cashflow is coming from and understand the health of your business
- Easily see how much monthly or quarterly revenue is driven from newer and older cohorts
- Study customer retention patterns to see if they are getting better or worse
- Compare cohorts of users from different segments
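To make the revenue use case concrete, the sketch below computes the share of each month's revenue that comes from each cohort. The `revenue` matrix and its figures are hypothetical, invented for illustration only.

```r
# Hypothetical monthly revenue by signup cohort
# (rows: cohort, columns: calendar month)
revenue <- matrix(
  c(500, 400, 300,
      0, 300, 250,
      0,   0, 450),
  nrow = 3, byrow = TRUE,
  dimnames = list(cohort = c("Jan", "Feb", "Mar"),
                  month  = c("Jan", "Feb", "Mar"))
)

# Share (%) of each month's revenue contributed by each cohort
# (margin = 2 normalises every column to sum to 1)
share <- round(prop.table(revenue, margin = 2) * 100, 1)
print(share)
```

In this made-up data the newest cohort contributes the largest share of March revenue, which is the kind of pattern this use case is meant to surface.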
How to implement this technique using R
# Install packages
# install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("readxl")
# install.packages("reshape2")
# Load packages
library(ggplot2)
library(readxl)
library(dplyr)
library(reshape2)
# Load data into a dataframe
df <- read_excel("data/s8.xlsx")
df
# How many people completed the MOOC? (value in row 49 of the data)
Finished <- df$Finished[49]
Finished
# Question: Is this a good result?
# Completion ratio (%) for the same row
Ratio <- df$Finished[49] / df$Started[49] * 100
Ratio
# Ratio Evolution
RatioEvolution <- data.frame(day = df$Day, ratio = df$Finished / df$Started * 100)
RatioEvolution
# Ratio Evolution Graph
g1 <- ggplot(RatioEvolution, aes(x = day, y = ratio)) +
  geom_line() + ggtitle("Completion Ratio Evolution") +
  ylab("Ratio") + xlab("Period") +
  theme(plot.title = element_text(color = "#666666", face = "bold", size = 20, hjust = 0)) +
  theme(axis.title = element_text(color = "#666666", face = "bold", size = 14))
g1
# Ratio Evolution Graph vs MOOC objectives
g1 + geom_vline(xintercept=35, colour="red") + geom_hline(yintercept=20, colour="red")
# How do we investigate the evolution?
# Keep every column whose name contains "Finished" (treated as cohorts below)
df_finished <- dplyr::select(df, contains("Finished"))
df_finished <- data.frame(day = df$Day, df_finished)
# Reshape to long format: one row per day and cohort
df_finished.chart <- melt(df_finished, id.vars = "day")
colnames(df_finished.chart) <- c('Day', 'Cohort', 'Students')
# Let's create a graph: one line per cohort across days
p <- ggplot(df_finished.chart, aes(x = Day, y = Students, group = Cohort, colour = Cohort))
p + geom_line() + ggtitle('Students Completion per day and cohort')
# Question: What do we observe?
# Let's create another graph: one line per day across cohorts
p1 <- ggplot(df_finished.chart, aes(x = Cohort, y = Students, group = Day, colour = Day))
p1 + geom_line() + ggtitle('Students Completion per day and cohort')
# Let's do the same for the completion ratio
# Note: apply() also rescales the day column, so it is restored on the next line
df_finished_ratio <- as.data.frame(apply(df_finished, 2, function(x) x / df$Started * 100))
df_finished_ratio$day <- df_finished$day
df_finished_ratio.chart <- melt(df_finished_ratio, id.vars = "day")
colnames(df_finished_ratio.chart) <- c('Day', 'Cohort', 'Ratio')
# Let's create a graph: one line per cohort across days
p2 <- ggplot(df_finished_ratio.chart, aes(x = Day, y = Ratio, group = Cohort, colour = Cohort))
p2 + geom_line() + ggtitle('Completion Ratio per day and cohort')
# Let's create another graph: one line per day across cohorts
p3 <- ggplot(df_finished_ratio.chart, aes(x = Cohort, y = Ratio, group = Day, colour = Day))
p3 + geom_line() + ggtitle('Completion Ratio per day and cohort')
# Question: What do we observe?