cohort_generator
- repeatradar.cohort_generator.generate_cohort_data(data: DataFrame, date_column: str, user_column: str, value_column: str | None = None, aggregation_function: Literal['sum', 'mean', 'count', 'median', 'min', 'max', 'nunique'] = 'sum', cohort_period: Literal['D', 'W', 'M', 'Q', 'Y'] = 'M', period_duration: int | Literal['D', 'W', 'M', 'Q', 'Y'] = 30, output_format: Literal['long', 'pivot'] = 'pivot', calculate_retention_rate: bool = False) -> DataFrame
Create cohort analysis data in a specified format with optimized performance.
Supports both user retention analysis (optionally expressed as retention rates) and transaction value analysis. This function groups users into cohorts based on their acquisition period and tracks their activity or value in subsequent periods.
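The grouping described above can be illustrated with plain pandas. This is a minimal sketch of the general cohort technique, not the library's implementation. The toy data, the column names, and the choice of calendar months for both cohorts and periods are assumptions made for illustration (the function's own default for period_duration is 30 days).

```python
import pandas as pd

# Hypothetical toy transactions; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "purchase_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-03-02", "2024-02-15"]
    ),
})

# Acquisition cohort: the calendar month of each user's first transaction.
first_purchase = df.groupby("customer_id")["purchase_date"].transform("min")
df["cohort"] = first_purchase.dt.to_period("M")

# Period number: whole calendar months elapsed since the acquisition month.
activity_month = df["purchase_date"].dt.to_period("M")
df["period_number"] = (
    (activity_month.dt.year - df["cohort"].dt.year) * 12
    + (activity_month.dt.month - df["cohort"].dt.month)
)

# Unique active users per cohort and period (the "user count" analysis).
cohort_counts = (
    df.groupby(["cohort", "period_number"])["customer_id"]
    .nunique()
    .reset_index(name="active_users")
)
print(cohort_counts)
```

Each user belongs to exactly one cohort, fixed by the first transaction, while later transactions only determine which period columns that user contributes to.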
- Parameters:
data (pd.DataFrame) – The input data containing transaction information
date_column (str) – Column name containing the datetime information
user_column (str) – Column name containing the user/customer ID
value_column (Optional[str], optional) – Column name containing values to aggregate (e.g., transaction amount). If None, the function counts unique users (traditional cohort analysis). Defaults to None
aggregation_function (Literal['sum', 'mean', 'count', 'median', 'min', 'max', 'nunique'], optional) – Function to apply when aggregating values. Only used when value_column is provided. ‘nunique’ counts the number of unique values in each group. Defaults to ‘sum’
cohort_period (Literal['D', 'W', 'M', 'Q', 'Y'], optional) – Period used to define cohort acquisition periods. Defaults to ‘M’
period_duration (Union[int, Literal['D', 'W', 'M', 'Q', 'Y']], optional) – Duration of analysis periods. Can be a number of days (int) or a period string: ‘D’=daily, ‘W’=weekly, ‘M’=monthly, ‘Q’=quarterly, ‘Y’=yearly. Defaults to 30
output_format (Literal['long', 'pivot'], optional) – Format of the output data: long format or pivot table. Defaults to ‘pivot’
calculate_retention_rate (bool, optional) – If True, calculates the retention rate as a percentage of the period 0 value. Only applicable when value_column is None (user count analysis). Defaults to False
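To make the retention-rate definition concrete, the following sketch divides each cohort's counts by its period 0 count and scales to percent. It is an illustration in plain pandas with made-up numbers. How the library handles rounding or missing periods internally is not stated here and is not assumed.

```python
import pandas as pd

# Hypothetical pivoted user counts: rows are cohorts, columns are period numbers.
# The NaN marks a period that the later cohort has not reached yet.
counts = pd.DataFrame(
    {0: [100.0, 80.0], 1: [40.0, 20.0], 2: [25.0, float("nan")]},
    index=["2024-01", "2024-02"],
)

# Retention rate: each row divided by its own period-0 value, as a percentage.
retention = counts.div(counts[0], axis=0) * 100
print(retention)
```

Period 0 is always 100 by construction, which is why the rate is only meaningful for user counts and not for aggregated values.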
- Raises:
ValueError – If required columns are not found in data or invalid parameters are provided
TypeError – If date_column is not of datetime type
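Because a non-datetime date column raises TypeError, it can help to convert and check the column before calling the function. The sketch below uses standard pandas calls. The data and column names are hypothetical, and the exact check the library performs is an assumption.

```python
import pandas as pd

# Hypothetical input where dates arrive as strings (e.g. read from CSV).
raw = pd.DataFrame({
    "customer_id": [1, 2],
    "purchase_date": ["2024-01-05", "2024-02-10"],
})

# Before conversion the column is not a datetime dtype.
was_datetime = pd.api.types.is_datetime64_any_dtype(raw["purchase_date"])

# Convert explicitly, then confirm the dtype before running the cohort analysis.
raw["purchase_date"] = pd.to_datetime(raw["purchase_date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(raw["purchase_date"])
print(was_datetime, is_datetime)
```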
- Returns:
Either a long-format DataFrame with columns [cohort_period, period_number, metric_value] or a pivoted DataFrame in triangle format with cohorts as rows and periods as columns. If calculate_retention_rate=True, values represent percentage retention rates
- Return type:
pd.DataFrame
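The relationship between the two output formats can be sketched with a plain pandas pivot. The long-format column names follow the Returns description above. The values are made up, and this is not necessarily how the library builds its pivot.

```python
import pandas as pd

# Hypothetical long-format result: one row per (cohort, period) pair.
long_df = pd.DataFrame({
    "cohort_period": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "period_number": [0, 1, 2, 0, 1],
    "metric_value": [100, 40, 25, 80, 20],
})

# Pivoting gives the triangle layout: cohorts as rows, periods as columns.
# Periods a cohort has not reached yet appear as NaN.
pivot_df = long_df.pivot(
    index="cohort_period", columns="period_number", values="metric_value"
)
print(pivot_df)
```

The long format tends to be easier to filter or plot, while the pivot is the familiar cohort table for reading across periods.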
Examples:
# Basic user retention analysis
>>> user_cohorts = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id'
... )

# User retention with retention rates
>>> retention_rates = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id',
...     calculate_retention_rate=True
... )

# Revenue cohort analysis with weekly periods
>>> revenue_cohorts = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id',
...     value_column='purchase_amount',
...     period_duration='W',
...     aggregation_function='sum'
... )

# Count unique products per cohort period
>>> unique_products = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id',
...     value_column='product_id',
...     aggregation_function='nunique'
... )