Cohort Analysis

  • Problem: we don't know how customers' behavior changes over time
  • Goals:
    • Understand how customers' behavior evolves
    • Understand behavior through groups that evolve over time, rather than through individuals
  • Why? To take specific actions on groups and to detect similar temporal patterns across groups

Technique to solve the business problem

Cohort analysis is an observational, analytic, and longitudinal study. It compares how a particular measure (a KPI) evolves across groups. The individuals in each study group are selected based on the presence of a particular characteristic.

Main Concepts

A cohort is a group of people in a study who have something in common (such as age or social class).

The term originates in ancient times: a cohort was a military unit, one of ten divisions in a Roman legion.

Cohort analysis is a traditional tool in epidemiology. When the technique is applied in other industries, most of the time:

  • Metrics are easier to capture and analyse:
    • Direct: number of customers, revenue, cost
    • Derived: retention
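
As a small illustration of the difference between a direct and a derived metric, retention can be computed from the number of active customers per period. This is only a sketch: the counts are made up for the example and the variable names are hypothetical.

```r
# Hypothetical counts of active customers from a single cohort,
# observed in periods 0 (acquisition) to 4. Illustrative values only.
active <- c(200, 150, 120, 110, 100)

# Derived metric: share of the original cohort still active in each period (%)
retention <- active / active[1] * 100
names(retention) <- paste0("period_", seq_along(active) - 1)

print(retention)
```

The direct metric (`active`) is what gets captured; the derived metric (`retention`) is what makes cohorts of different sizes comparable.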

Implementation Process

  • [BU] Determine business questions/needs, measure to study and cohorts of interest
  • [DU] Data Sourcing, Cleaning & Exploration
  • [DP] Create cohorts, extract data according to cohorts
  • [M] Calculate the measure
  • [E] Analyze results and adjust parameters
  • [D] Present and explain the results
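
The [DP] and [M] steps above can be sketched in base R as follows. This is a minimal example under stated assumptions: the `orders` table, its column names, and its values are invented for illustration, and the cohort is defined as the month of each customer's first purchase.

```r
# Hypothetical transaction log: one row per customer purchase.
# Column names and values are assumptions made for this example.
orders <- data.frame(
  customer = c("a", "a", "b", "b", "c", "c", "d"),
  month    = c(1,   2,   1,   3,   2,   3,   2)
)

# [DP] Create cohorts: each customer belongs to the month of first purchase
first_month <- tapply(orders$month, orders$customer, min)
orders$cohort <- as.vector(first_month[as.character(orders$customer)])

# Periods elapsed since the customer joined the cohort
orders$period <- orders$month - orders$cohort

# [M] Calculate the measure: distinct active customers per cohort and period
active <- unique(orders[, c("customer", "cohort", "period")])
cohort_counts <- table(cohort = active$cohort, period = active$period)

# Retention (%): divide each row by the cohort size observed in period 0
retention <- round(100 * cohort_counts / cohort_counts[, "0"], 1)
print(retention)
```

Each row of `retention` is one cohort and each column is the number of periods since acquisition, which is the layout usually read in step [E].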


This technique provides the following benefits:

  • Understand the customer lifecycle/journey: length, value, situation, etc.
  • Identify patterns
  • Behavioral/Psychographic analysis

Use cases

This technique is applied in several use cases:

  • Examine where cash flow is coming from and understand the health of your business
  • Easily see how much monthly or quarterly revenue comes from newer and older cohorts
  • Study customer retention patterns to see if they are getting better or worse
  • Compare cohorts of users from different segments
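
As an illustration of the revenue use case, the share of each quarter's revenue contributed by each acquisition cohort can be tabulated in base R. This is a sketch only: the `revenue` table, its column names, and its amounts are hypothetical.

```r
# Hypothetical revenue records; all values are illustrative only.
revenue <- data.frame(
  cohort  = c("2023-Q1", "2023-Q1", "2023-Q1", "2023-Q2", "2023-Q2", "2023-Q3"),
  quarter = c("2023-Q1", "2023-Q2", "2023-Q3", "2023-Q2", "2023-Q3", "2023-Q3"),
  amount  = c(1000, 800, 600, 1200, 900, 1500)
)

# Revenue per quarter (rows) broken down by acquisition cohort (columns)
by_cohort <- xtabs(amount ~ quarter + cohort, data = revenue)

# Share of each quarter's revenue contributed by each cohort (%)
share <- round(100 * prop.table(by_cohort, margin = 1), 1)
print(share)
```

Reading a row of `share` from left to right shows how much of that quarter's revenue still comes from older cohorts versus the newest one.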

How to implement this technique using R

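# Install packages (run once, if they are not already installed)
# install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("readxl")
# install.packages("reshape")

# Load packages
library(ggplot2)  # plotting
library(dplyr)    # select(), contains()
library(readxl)   # read_excel()
library(reshape)  # melt()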

# Load data into a dataframe
df <- read_excel("data/s8.xlsx")

# How many people completed the MOOC?
Finished <- df$Finished[49]

# Question: Is this a good result?
# Completion ratio (%): finished students relative to started students
Ratio <- df$Finished[49] / df$Started[49] * 100

# Ratio Evolution
RatioEvolution <- data.frame(day = df$Day, ratio=df$Finished / df$Started *100)

# Ratio Evolution Graph
g1 <- ggplot(RatioEvolution, aes(x = day, y = ratio)) +
  geom_line() + ggtitle("Completion Ratio Evolution") +
  ylab("Ratio") + xlab("Period") +
  theme(plot.title = element_text(color="#666666", face="bold", size=20, hjust=0)) +
  theme(axis.title = element_text(color="#666666", face="bold", size=14))

# Ratio Evolution Graph vs MOOC objectives
g1 + geom_vline(xintercept=35, colour="red") + geom_hline(yintercept=20, colour="red")

# How do we investigate the evolution? Keep the "Finished" columns and reshape them to long format
df_finished <- dplyr::select(df, contains("Finished"))
df_finished <- data.frame(day = df$Day, df_finished)
df_finished.chart <- melt(df_finished, id.vars = "day")
colnames(df_finished.chart) <- c('Day', 'Cohort', 'Students')

# Let's create a graph
p <- ggplot(df_finished.chart, aes(x=Day, y=Students, group=Cohort, colour=Cohort))
p + geom_line() + ggtitle('Student Completion per day and cohort')

# Question: What do we observe?

# Let's create another graph
p1 <- ggplot(df_finished.chart, aes(x=Cohort, y=Students, group=Day, colour=Day))
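p1 + geom_line() + ggtitle('Student Completion per day and cohort')

# Let's do the same for the completion ratio
# Divide every column by the number of started students to get a percentage
df_finished_ratio <- data.frame(apply(df_finished, 2, function(x) x / df$Started * 100))
# The day column was also divided above, so it is restored on the next line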
df_finished_ratio$day <- df_finished$day
df_finished_ratio.chart <- melt(df_finished_ratio, id.vars = "day")
colnames(df_finished_ratio.chart) <- c('Day', 'Cohort', 'Ratio')

# Let's create a graph
p2 <- ggplot(df_finished_ratio.chart, aes(x=Day, y=Ratio, group=Cohort, colour=Cohort))
p2 + geom_line() + ggtitle('Completion Ratio per day and cohort')

# Let's create another graph
p3 <- ggplot(df_finished_ratio.chart, aes(x=Cohort, y=Ratio, group=Day, colour=Day))
p3 + geom_line() + ggtitle('Completion Ratio per day and cohort')

# Question: What do we observe?

