\n",
@@ -148,34 +235,146 @@
""
],
"text/plain": [
- " frequency recency T monetary_value\n",
- "customer_id \n",
- "1 2 30.43 38.86 22.35\n",
- "2 1 1.71 38.86 11.77\n",
- "6 7 29.43 38.86 73.74\n",
- "7 1 5.00 38.86 11.77\n",
- "9 2 35.71 38.86 25.55"
+ " frequency recency T monetary_value\n",
+ "0 2 30.43 38.86 22.35\n",
+ "1 1 1.71 38.86 11.77\n",
+ "5 7 29.43 38.86 73.74\n",
+ "6 1 5.00 38.86 11.77\n",
+ "8 2 35.71 38.86 25.55"
]
},
- "execution_count": 4,
+ "execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "summary_with_money_value = load_cdnow_summary_data_with_monetary_value()\n",
- "summary_with_money_value.head()\n",
- "returning_customers_summary = summary_with_money_value[summary_with_money_value['frequency']>0]\n",
+ "returning_customers_summary = summary_with_money_value.query(\"frequency > 0\")\n",
"\n",
"returning_customers_summary.head()"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "16374e35",
+ "metadata": {},
+ "source": [
+ "## Model Specification\n",
+ "\n",
+ "Here we briefly describe the assumptions and the parametrization of the Gamma-Gamma model from the paper above.\n",
+ "\n",
+ "The model of spend per transaction is based on the following three general assumptions:\n",
+ "\n",
+ "- The monetary value of a customer’s given transaction varies randomly around their average transaction value.\n",
+ "- Average transaction values vary across customers but do not vary over time for any given individual.\n",
+ "- The distribution of average transaction values across customers is independent of the transaction process.\n",
+ " \n",
+ "For a customer with $x$ transactions, let $z_1, z_2, \\ldots, z_x$ denote the value of each transaction. The customer’s observed average transaction value is given by\n",
+ "\n",
+ "$$\n",
+ "\\bar{z} = \\frac{1}{x} \\sum_{i=1}^{x} z_i\n",
+ "$$\n",
+ "\n",
+ "Now let's describe the parametrization: \n",
+ "\n",
+ "1. We assume that $z_i \\sim \\text{Gamma}(p, ν)$, with $E(Z_i| p, ν) = \\xi = p/ν$.\n",
+ "\n",
+ "   - Given the convolution property of the gamma distribution, it follows that the total spend across $x$ transactions is distributed $\\text{Gamma}(px, ν)$.\n",
+ "\n",
+ "   - Given the scaling property of the gamma distribution, it follows that $\\bar{z} \\sim \\text{Gamma}(px, νx)$.\n",
+ "\n",
+ "2. We assume $ν \\sim \\text{Gamma}(q, \\gamma)$.\n",
+ "\n",
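+ "Combining the two assumptions gives closed forms for the expected average transaction value. What follows is a short derivation sketch, assuming that $\\text{Gamma}(a, b)$ denotes shape $a$ and rate $b$ (consistent with the scaling property used above) and that $q > 1$ so that the means exist.\n",
+ "\n",
+ "Since $\\bar{z} \\mid ν \\sim \\text{Gamma}(px, νx)$ and $ν \\sim \\text{Gamma}(q, \\gamma)$, the prior on $ν$ is conjugate and\n",
+ "\n",
+ "$$\n",
+ "ν \\mid x, \\bar{z} \\sim \\text{Gamma}(px + q, \\gamma + x\\bar{z})\n",
+ "$$\n",
+ "\n",
+ "Using $E(1/ν) = b/(a - 1)$ for $ν \\sim \\text{Gamma}(a, b)$, the population mean of $\\xi = p/ν$ and its conditional expectation are\n",
+ "\n",
+ "$$\n",
+ "E(\\xi) = \\frac{p\\gamma}{q - 1}, \\qquad E(\\xi \\mid x, \\bar{z}) = \\frac{p(\\gamma + x\\bar{z})}{px + q - 1} = \\frac{q - 1}{px + q - 1} \\cdot \\frac{p\\gamma}{q - 1} + \\frac{px}{px + q - 1} \\cdot \\bar{z}\n",
+ "$$\n",
+ "\n",
+ "The conditional expectation is therefore a weighted average of the population mean and the customer's observed average, with more weight placed on $\\bar{z}$ as $x$ grows.\n",
+ "\n",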
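+ "We are interested in estimating the parameters $p$, $q$ and $\\gamma$ (the latter is named `v` in the code below). The customer-level rate $ν$ is a latent variable that is integrated out."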
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8696dc0a",
+ "metadata": {},
+ "source": [
+ "```{note}\n",
+ "The Gamma-Gamma model assumes that there is no relationship between the monetary value and the purchase frequency. We can check this assumption by calculating the correlation between the average spend and the frequency of purchases.\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "c413718e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " monetary_value frequency\n",
+ "monetary_value 1.000000 0.113884\n",
+ "frequency 0.113884 1.000000"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "returning_customers_summary[[\"monetary_value\", \"frequency\"]].corr()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ea4bea0a",
+ "metadata": {},
+ "source": [
+ "The value of this correlation is close to $0.11$, which in practice is considered low enough to proceed with the model."
+ ]
+ },
{
"cell_type": "markdown",
"id": "df93d769",
"metadata": {},
"source": [
- "## Lifetimes implementation"
+ "## Lifetimes Implementation\n",
+ "\n",
+ "First, we fit the model using the `lifetimes` package."
]
},
{
@@ -202,9 +401,11 @@
}
],
"source": [
- "ggf = GammaGammaFitter(penalizer_coef = 0)\n",
- "ggf.fit(returning_customers_summary['frequency'],\n",
- " returning_customers_summary['monetary_value'])"
+ "ggf = GammaGammaFitter()\n",
+ "ggf.fit(\n",
+ " returning_customers_summary[\"frequency\"],\n",
+ " returning_customers_summary[\"monetary_value\"],\n",
+ ")"
]
},
{
@@ -282,6 +483,14 @@
"ggf.summary"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "cace007a",
+ "metadata": {},
+ "source": [
+ "Once the model is fitted, we can use the following method to compute the conditional expectation of the average profit per transaction for a group of one or more customers."
+ ]
+ },
{
"cell_type": "code",
"execution_count": 7,
@@ -291,17 +500,16 @@
{
"data": {
"text/plain": [
- "customer_id\n",
- "1 24.658616\n",
- "2 18.911480\n",
- "3 35.171002\n",
- "4 35.171002\n",
- "5 35.171002\n",
- "6 71.462851\n",
- "7 18.911480\n",
- "8 35.171002\n",
- "9 27.282408\n",
- "10 35.171002\n",
+ "0 24.658616\n",
+ "1 18.911480\n",
+ "2 35.171002\n",
+ "3 35.171002\n",
+ "4 35.171002\n",
+ "5 71.462851\n",
+ "6 18.911480\n",
+ "7 35.171002\n",
+ "8 27.282408\n",
+ "9 35.171002\n",
"dtype: float64"
]
},
@@ -312,8 +520,7 @@
],
"source": [
"avg_profit = ggf.conditional_expected_average_profit(\n",
- " summary_with_money_value['frequency'],\n",
- " summary_with_money_value['monetary_value']\n",
+ " summary_with_money_value[\"frequency\"], summary_with_money_value[\"monetary_value\"]\n",
")\n",
"avg_profit.head(10)"
]
@@ -327,7 +534,7 @@
{
"data": {
"text/plain": [
- "35.25295817604995"
+ "35.252958176049916"
]
},
"execution_count": 8,
@@ -346,7 +553,7 @@
"id": "a2z_ZcC74wPI"
},
"source": [
- "## PyMC Marketing implementation"
+ "## PyMC Marketing Implementation"
]
},
{
@@ -354,7 +561,7 @@
"id": "d153908d",
"metadata": {},
"source": [
- "We can use the pre-built PyMC Marketing implementation of the Gamma-Gamma model, which also provides nice ploting and prediction methods"
+ "We can use the pre-built PyMC Marketing implementation of the Gamma-Gamma model, which also provides nice plotting and prediction methods:"
]
},
{
@@ -364,11 +571,21 @@
"metadata": {},
"outputs": [],
"source": [
- "dataset = pd.DataFrame({\n",
- " \"customer_id\": returning_customers_summary.index,\n",
- " \"mean_transaction_value\": returning_customers_summary[\"monetary_value\"],\n",
- " \"frequency\": returning_customers_summary[\"frequency\"],\n",
- "})"
+ "dataset = pd.DataFrame(\n",
+ " {\n",
+ " \"customer_id\": returning_customers_summary.index,\n",
+ " \"mean_transaction_value\": returning_customers_summary[\"monetary_value\"],\n",
+ " \"frequency\": returning_customers_summary[\"frequency\"],\n",
+ " }\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5e27387",
+ "metadata": {},
+ "source": [
+ "We can *build* the model so that we can see the model specification:"
]
},
{
@@ -378,18 +595,6 @@
"metadata": {
"id": "eoQmmIrj43NV"
},
- "outputs": [],
- "source": [
- "model = clv.GammaGammaModel(\n",
- " data = dataset\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "id": "fdbcb6ae",
- "metadata": {},
"outputs": [
{
"data": {
@@ -401,90 +606,103 @@
"likelihood ~ Potential(f(q, p, v))"
]
},
- "execution_count": 11,
+ "execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
+ "model = clv.GammaGammaModel(data=dataset)\n",
"model.build_model()\n",
"model"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "1826502a",
+ "metadata": {},
+ "source": [
+ "```{note}\n",
+ "It is not necessary to build the model before fitting it. We can fit the model directly.\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "21608cf4",
+ "metadata": {},
+ "source": [
+ "### Using MAP\n",
+ "\n",
+ "To begin with, let's use a numerical optimizer (`L-BFGS-B`) from `scipy.optimize` to find the maximum a posteriori (MAP) estimate of the parameters."
+ ]
+ },
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": null,
"id": "e39da004",
"metadata": {},
+ "outputs": [],
+ "source": [
+ "idata_map = model.fit(fit_method=\"map\").posterior.to_dataframe()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "b8f11643",
+ "metadata": {},
"outputs": [
{
"data": {
"text/html": [
- "\n",
- "\n"
- ],
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "text/html": [
"\n",
- "
\n",
- " "
- ],
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 7 seconds.\n",
- "The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details\n",
- "/Users/michalraczycki/Documents/pymc-marketing/pymc_marketing/clv/models/basic.py:119: UserWarning: The effect of Potentials on other parameters is ignored during posterior predictive sampling. This is likely to lead to invalid or biased predictive samples.\n",
- " idata.extend(pm.sample_posterior_predictive(idata))\n"
- ]
- },
{
"data": {
"text/html": [
@@ -577,8 +766,8 @@
"