cohort_generator

repeatradar.cohort_generator.generate_cohort_data(data: DataFrame, date_column: str, user_column: str, value_column: str | None = None, aggregation_function: Literal['sum', 'mean', 'count', 'median', 'min', 'max', 'nunique'] = 'sum', cohort_period: Literal['D', 'W', 'M', 'Q', 'Y'] = 'M', period_duration: int | Literal['D', 'W', 'M', 'Q', 'Y'] = 30, output_format: Literal['long', 'pivot'] = 'pivot', calculate_retention_rate: bool = False) → DataFrame

Create cohort analysis data in a specified format with optimized performance.

Supports both user retention analysis (optionally expressed as retention rates) and transaction value analysis. The function groups users into cohorts based on their acquisition period and tracks their activity or value in subsequent periods.
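The grouping described above can be sketched in plain Python. This is only an illustration of the general cohort technique, not the library's implementation: the sample events are hypothetical, and it assumes that a user's cohort is the month of their first activity and that period numbers are counted in 30-day blocks from that first activity date.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (user_id, activity_date)
events = [
    ("u1", date(2024, 1, 5)),
    ("u1", date(2024, 2, 10)),
    ("u2", date(2024, 1, 20)),
    ("u2", date(2024, 1, 25)),
    ("u3", date(2024, 2, 3)),
    ("u3", date(2024, 3, 10)),
]

PERIOD_DAYS = 30  # mirrors the documented default period_duration=30

# 1. Each user's acquisition date is their earliest activity.
first_seen = {}
for user, day in events:
    if user not in first_seen or day < first_seen[user]:
        first_seen[user] = day

# 2. Bucket each event by (cohort month, period number) and collect users.
users_by_cell = defaultdict(set)
for user, day in events:
    start = first_seen[user]
    cohort = (start.year, start.month)  # monthly cohort, like cohort_period='M'
    period = (day - start).days // PERIOD_DAYS  # assumed period numbering
    users_by_cell[(cohort, period)].add(user)

# 3. Count unique users per (cohort, period) cell.
cohort_counts = {cell: len(users) for cell, users in sorted(users_by_cell.items())}
for (cohort, period), n in cohort_counts.items():
    print(cohort, period, n)
```

Collecting users in a set per cell means that repeat purchases by the same user within one period are counted once, which is the behaviour documented for value_column=None.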

Parameters:
  • data (pd.DataFrame) – The input data containing transaction information

  • date_column (str) – Column name containing the datetime information

  • user_column (str) – Column name containing the user/customer ID

  • value_column (Optional[str], optional) – Column name containing values to aggregate (e.g., transaction amount). If None, the function counts unique users (traditional cohort analysis). Defaults to None

  • aggregation_function (Literal['sum', 'mean', 'count', 'median', 'min', 'max', 'nunique'], optional) – Function to apply when aggregating values. Only used when value_column is provided. ‘nunique’ counts the number of unique values in each group. Defaults to ‘sum’

  • cohort_period (Literal['D', 'W', 'M', 'Q', 'Y'], optional) – Period used to group users into cohorts, i.e. the granularity of the acquisition period. Defaults to ‘M’

  • period_duration (Union[int, Literal['D', 'W', 'M', 'Q', 'Y']], optional) – Duration of each analysis period, given either as a number of days (int) or as a period string: ‘D’=daily, ‘W’=weekly, ‘M’=monthly, ‘Q’=quarterly, ‘Y’=yearly. Defaults to 30 (days)

  • output_format (Literal['long', 'pivot'], optional) – Format of the output data: long format or pivot table. Defaults to ‘pivot’

  • calculate_retention_rate (bool, optional) – If True, expresses each value as a percentage of the cohort's period 0 value (the retention rate). Only applicable when value_column is None (user count analysis). Defaults to False
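To make the aggregation_function options concrete, the sketch below maps each documented name to a plain-Python equivalent and applies it to the values of a single (cohort, period) cell. The AGGREGATIONS mapping and the sample values are hypothetical illustrations. The library presumably delegates to pandas aggregations, which also handle missing values and may differ in that respect.

```python
import statistics

# Plain-Python equivalents of the documented aggregation_function names.
# This mapping is illustrative and is not taken from the library's source.
AGGREGATIONS = {
    "sum": sum,
    "mean": statistics.mean,
    "count": len,
    "median": statistics.median,
    "min": min,
    "max": max,
    "nunique": lambda values: len(set(values)),
}

# Hypothetical purchase amounts that fall into one (cohort, period) cell.
cell_values = [20.0, 35.0, 20.0, 45.0]

# Apply every aggregation to the same cell to compare the results.
results = {name: fn(cell_values) for name, fn in AGGREGATIONS.items()}
for name, value in results.items():
    print(f"{name}: {value}")
```

Note how ‘count’ and ‘nunique’ differ on the same cell: the duplicated amount 20.0 is counted twice by ‘count’ but only once by ‘nunique’.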

Raises:
  • ValueError – If required columns are not found in data or invalid parameters are provided

  • TypeError – If date_column is not of datetime type

Returns:

Either a long-format DataFrame with columns [cohort_period, period_number, metric_value] or a pivoted DataFrame in triangle format with cohorts as rows and periods as columns. If calculate_retention_rate=True, the values are percentage retention rates relative to period 0.
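The relationship between the two output shapes, and the retention-rate calculation, can be illustrated with plain dictionaries. The rows below are hypothetical, and the sketch assumes that rates are expressed on a 0 to 100 scale. It shows the general technique and is not the library's code.

```python
# Hypothetical long-format rows: (cohort_period, period_number, metric_value)
long_rows = [
    ("2024-01", 0, 200),
    ("2024-01", 1, 80),
    ("2024-01", 2, 50),
    ("2024-02", 0, 150),
    ("2024-02", 1, 60),
]

# Pivot: one row per cohort, one column per period number.
# Newer cohorts have fewer observed periods, which gives the triangle shape.
pivot = {}
for cohort, period, value in long_rows:
    pivot.setdefault(cohort, {})[period] = value

# Retention rate: each cell as a percentage of the cohort's period 0 value.
retention = {
    cohort: {p: round(100 * v / cells[0], 1) for p, v in cells.items()}
    for cohort, cells in pivot.items()
}

for cohort, cells in retention.items():
    print(cohort, cells)
```

Period 0 is always 100.0 in this scheme, because every cohort is compared against its own starting size.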

Return type:

pd.DataFrame

Examples:

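>>> from repeatradar.cohort_generator import generate_cohort_data

# Basic user retention analysis
# (df is assumed to be a DataFrame whose 'purchase_date' column has a datetime dtype)
>>> user_cohorts = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id'
... )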

# User retention with retention rates
>>> retention_rates = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id',
...     calculate_retention_rate=True
... )

# Revenue cohort analysis with weekly periods
>>> revenue_cohorts = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id',
...     value_column='purchase_amount',
...     period_duration='W',
...     aggregation_function='sum'
... )

# Count unique products per cohort period
>>> unique_products = generate_cohort_data(
...     data=df,
...     date_column='purchase_date',
...     user_column='customer_id',
...     value_column='product_id',
...     aggregation_function='nunique'
... )