A Regression Project
Note: the business problem is fictitious, although both the company and the data are real.
The in-depth Python code explanation is available in this Jupyter Notebook.
Rossmann is one of the largest drug store chains in Europe, with operations in Germany, Poland, Hungary, the Czech Republic, Turkey, Albania, Kosovo and Spain. Their sales can be influenced by promotions, competition, school and state holidays, seasonality, locality, etc.
This Data Science project is focused on solving one problem: predicting the sales of each Rossmann store over the next six weeks, so that the best resource allocation for each store's renovation can be planned.
The data was collected from Kaggle. This dataset contains historical sales data for 1,115 Rossmann stores. The initial features descriptions are available below:
Feature | Definition |
---|---|
Id | an Id that represents a (Store, Date) pair within the dataset. |
Store | a unique Id for each store. |
Sales | the turnover for any given day. |
DayOfWeek | day of the week on which the sale was made (e.g. DayOfWeek=1 -> Monday, DayOfWeek=2 -> Tuesday, etc.). |
Date | date on which the sale was made. |
Customers | the number of customers on a given day. |
Open | an indicator for whether the store was open: 0 = closed, 1 = open. |
StateHoliday | indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None. |
SchoolHoliday | indicates if the (Store, Date) was affected by the closure of public schools. |
StoreType | differentiates between 4 different store models: a, b, c, d. |
Assortment | describes an assortment level: a = basic, b = extra, c = extended. |
CompetitionDistance | distance in meters to the nearest competitor store. |
CompetitionOpenSince(Month/Year) | gives the approximate year and month when the nearest competitor store was opened. |
Promo | indicates whether a store is running a promo on that day. |
Promo2 | Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating. |
Promo2Since(Year/Week) | describes the year and calendar week when the store started participating in Promo2. |
PromoInterval | describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store. |
- The Customers column was dropped, because there is no information about the number of customers six weeks into the future at prediction time.
- The NaN's in CompetitionDistance were replaced by 3 times the maximum CompetitionDistance in the dataset, since the observations with NaN's are likely stores whose nearest competitor is too far away to matter, i.e., there is effectively no competition.
- Some new features were created in order to better describe the problem (a minimal code sketch of the NaN filling and the cyclical encoding follows the table below):
New Feature | Definition |
---|---|
day/week_of_year/year_week/month/year | date parts (day, week of year, year-week, month and year) extracted from the 'date' column. |
day/day_of_week/week_of_year/month(sin/cos) | sin/cos component of each period, to capture their cyclical behavior. |
competition_time_month | number of months since the nearest competitor opened. |
promo_time_week | number of weeks since the (Promo2) promotion became active for the store. |
state_holiday(christmas/easter_holiday/public_holiday/regular_day) | indicates whether the sale was made on Christmas, Easter, a public holiday or a regular day. |
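As an illustration, here is a minimal sketch of how the CompetitionDistance filling and the cyclical encoding could be done with Pandas/NumPy. It assumes `df` is the merged sales + store dataframe with columns already renamed to snake_case; the exact steps in the notebook may differ.

```python
import numpy as np
import pandas as pd

# df is assumed to be the merged sales + store dataframe, already renamed to snake_case
# (columns such as 'competition_distance', 'date', 'day_of_week')

# fill missing competition distances with 3x the maximum observed distance,
# encoding "the nearest competitor is too far away to matter"
max_distance = df['competition_distance'].max()
df['competition_distance'] = df['competition_distance'].fillna(3 * max_distance)

# extract date parts from the 'date' column
df['date'] = pd.to_datetime(df['date'])
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)

# cyclical (sin/cos) encoding, so that e.g. December and January end up close together
df['month_sin'] = np.sin(df['month'] * (2.0 * np.pi / 12))
df['month_cos'] = np.cos(df['month'] * (2.0 * np.pi / 12))
df['day_of_week_sin'] = np.sin(df['day_of_week'] * (2.0 * np.pi / 7))
df['day_of_week_cos'] = np.cos(df['day_of_week'] * (2.0 * np.pi / 7))
```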
To predict the sales values for each store (six weeks in advance), a Machine Learning model was applied. To achieve that, the following steps were performed:

- Understanding the Business Problem: understanding why Rossmann's CEO requested this task and planning the solution.
- Collecting Data: collecting Rossmann store and sales data from Kaggle.
- Data Cleaning: renaming columns, changing data types and filling NaN's.
- Feature Engineering: creating new features from the original ones, so that they could be used in the ML model.
- Exploratory Data Analysis (EDA): exploring the data in order to gain business knowledge, look for useful business insights and find important features for the ML model. The top business insights found are available in Section 5.
- Data Preparation: applying Normalization and Rescaling techniques to the data, as well as Encoding methods and Response Variable Transformation.
- Feature Selection: selecting the best features to use in the ML model by applying the Boruta algorithm (a minimal sketch follows this list).
- Machine Learning Modeling: training Regression algorithms with time series cross-validation. The best model was selected for improvement via Hyperparameter Tuning. More information in Section 6.
- Model Evaluation: evaluating the model using four metrics: MAE, MAPE, RMSE and R2.
- Financial Results: translating the ML model's statistical performance into financial and business performance.
- Model Deployment (Telegram Bot): implementing a Telegram Bot that returns the sales prediction for any available store number. This is the project's Data Science product, and it can be accessed from anywhere. More information in Section 7.
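As referenced in the Feature Selection step, below is a minimal sketch of how Boruta can be applied with a Random Forest estimator using the `boruta` package. `X_train`/`y_train` are assumed to be the already prepared (encoded, rescaled) training data; the exact estimator settings are illustrative.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# X_train / y_train are assumed to be the prepared training features and target
rf = RandomForestRegressor(n_jobs=-1)

# BorutaPy compares each real feature against randomly shuffled "shadow" copies
# and keeps only the features that consistently beat them
boruta = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta.fit(np.array(X_train), np.array(y_train).ravel())

selected_columns = X_train.columns[boruta.support_].tolist()
print(selected_columns)
```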
- Python 3.9.12, Pandas, Matplotlib, Seaborn and Sklearn.
- Jupyter Notebook and VSCode.
- Flask and Python APIs.
- Render Cloud and Telegram Bot.
- Git and GitHub.
- Exploratory Data Analysis (EDA).
- Techniques for Feature Selection.
- Regression Algorithms (Linear and Lasso Regression; Random Forest, XGBoost and LGBM Regressors).
- Cross-Validation Methods, Hyperparameter Optimization and Algorithms Performance Metrics (RMSE, MAE, MAPE, R2).
This was the most fundamental part of the project, since the ML modeling step is where the sales predictions for each store are produced. Six models were trained using time series cross-validation (a minimal sketch of this setup follows the list below):
- Average Model (used as a baseline model)
- Linear Regression
- Lasso Regression (Regularized Linear Regression)
- Random Forest Regressor
- XGBoost Regressor
- Light GBM Regressor
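A minimal sketch of how the time series cross-validation could be set up for one of these candidates is shown below. It uses scikit-learn's `TimeSeriesSplit` as an illustration; `X`, `y` are assumed to be the prepared features and target sorted by date, and the notebook may split the folds differently.

```python
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from lightgbm import LGBMRegressor

# X, y are assumed to be the prepared features and target, sorted by date
tscv = TimeSeriesSplit(n_splits=5)

model = LGBMRegressor(n_estimators=300, random_state=42)
scores = cross_validate(
    model, X, y, cv=tscv,
    scoring=['neg_mean_absolute_error', 'neg_root_mean_squared_error'],
)

mae = -scores['test_neg_mean_absolute_error']
rmse = -scores['test_neg_root_mean_squared_error']
print(f"MAE:  {mae.mean():.2f} +/- {mae.std():.2f}")
print(f"RMSE: {rmse.mean():.2f} +/- {rmse.std():.2f}")
```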
The initial performance of all six algorithms is displayed below (sorted by RMSE):
Model Name | MAE | MAPE | RMSE | R2 |
---|---|---|---|---|
LGBM Regressor | 833.23 +/- 121.0 | 0.1178 +/- 0.0093 | 1197.68 +/- 176.39 | 0.8467 +/- 0.0237 |
Random Forest Regressor | 838.84 +/- 219.91 | 0.1162 +/- 0.0233 | 1257.62 +/- 321.14 | 0.8409 +/- 0.0527 |
XGBoost Regressor | 900.29 +/- 152.53 | 0.1273 +/- 0.0155 | 1293.45 +/- 214.31 | 0.8322 +/- 0.0334 |
Average Model (Baseline) | 1354.8 | 0.2064 | 1835.14 | 0.6366 |
Linear Regression | 2081.72 +/- 295.57 | 0.3026 +/- 0.0166 | 2953.15 +/- 468.22 | 0.1353 +/- 0.0721 |
Lasso Regression | 2116.42 +/- 341.46 | 0.292 +/- 0.0118 | 3058.12 +/- 504.18 | 0.0742 +/- 0.0834 |
Both Linear Regression and Lasso Regression performed worse than the simple Average Model. This points to nonlinear behavior in the dataset, hence the use of more complex models such as Random Forest, XGBoost and Light GBM.
The LGBM model was chosen for Hyperparameter Tuning, since it has the lowest RMSE. Even looking at other metrics, such as MAPE (on which Random Forest performs best), LGBM would still be the better choice, because it is much faster to train and tune.
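A minimal sketch of what the Random Search over LGBM hyperparameters could look like is shown below; the parameter grid and number of iterations are illustrative assumptions, not the exact values used in the notebook.

```python
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from lightgbm import LGBMRegressor

# illustrative search space; the real grid used in the notebook may differ
param_distributions = {
    'n_estimators': [500, 1000, 1500, 2000],
    'learning_rate': [0.01, 0.03, 0.05, 0.1],
    'num_leaves': [31, 63, 127, 255],
    'max_depth': [5, 7, 9, -1],
    'min_child_samples': [20, 50, 100],
    'subsample': [0.7, 0.85, 1.0],
    'colsample_bytree': [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    LGBMRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=TimeSeriesSplit(n_splits=5),
    scoring='neg_root_mean_squared_error',
    random_state=42,
)
search.fit(X, y)  # X, y assumed to be the prepared training data
print(search.best_params_)
```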
After tuning LGBM's hyperparameters using Random Search, the model's performance improved:
Model Name | MAE | MAPE | RMSE | R2 |
---|---|---|---|---|
LGBM Regressor | 617.54000 | 0.08940 | 921.52000 | 0.90840 |
Metric | Definition |
---|---|
MAE | Mean Absolute Error |
MAPE | Mean Absolute Percentage Error |
RMSE | Root Mean Squared Error |
R2 | Coefficient of Determination |
R2 basically shows how well the sales are being predicted by the model; alongside RMSE, it isn't the best metric to translate into financial performance, despite being key for checking statistical performance.
Both MAE and MAPE are really useful in explaining the model's business performance: MAE shows how much, on average, the prediction deviates from the real value, while MAPE expresses that average deviation as a percentage.
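For reference, these four metrics can be computed with scikit-learn as sketched below; `y_true`/`y_pred` are assumed to be the real and predicted sales on the validation set, back-transformed to the original scale if the response variable was transformed.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# y_true / y_pred: real and predicted sales on the original scale (assumed)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.2f} | MAPE: {mape:.4f} | RMSE: {rmse:.2f} | R2: {r2:.4f}")
```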
Below are two tables with brief financial results given by the LGBM model; the complete financial results are explained in the next section.
A couple of useful metrics to evaluate the financial performance of this solution (LGBM model) are MAE and MAPE. Below is a table with these metrics for a few stores (a sketch of how the scenarios can be derived follows the table):
Store | Predictions (€) | Worst Scenario (€) | Best Scenario (€) | MAE (€) | MAPE |
---|---|---|---|---|---|
1 | 161,274.69 | 160,988.99 | 161,560.39 | 285.69937 | 0.06472 |
2 | 175,549.76 | 175,192.06 | 175,907.47 | 357.70668 | 0.07338 |
3 | 259,576.04 | 259,077.04 | 260,075.03 | 498.99756 | 0.07215 |
... | ... | ... | ... | ... | ... |
1113 | 238,353.17 | 237,841.04 | 238,865.30 | 512.12600 | 0.07852 |
1114 | 769,997.75 | 767,598.11 | 772,397.39 | 2399.63754 | 0.10164 |
1115 | 254,766.52 | 254,227.68 | 255,305.36 | 538.83848 | 0.07576 |
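A hedged sketch of how these per-store scenarios can be derived: the worst and best scenarios are assumed here to be the predicted six-week total minus/plus the store's MAE, which is consistent with the numbers in the table but is an assumption about the notebook's exact logic. Column names below are illustrative.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# df_val is assumed to hold validation rows with columns 'store', 'sales' (real)
# and 'predictions' (model output on the original scale)
def store_scenarios(df_val: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for store, g in df_val.groupby('store'):
        mae = mean_absolute_error(g['sales'], g['predictions'])
        mape = mean_absolute_percentage_error(g['sales'], g['predictions'])
        total = g['predictions'].sum()
        rows.append({
            'store': store,
            'predictions': total,
            'worst_scenario': total - mae,  # total shifted down by the average error
            'best_scenario': total + mae,   # total shifted up by the average error
            'MAE': mae,
            'MAPE': mape,
        })
    return pd.DataFrame(rows)
```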
According to this model, the sales sum for all stores over the next six weeks is:
Scenario (€) | Total Sales of the Next 6 Weeks (€) |
---|---|
Prediction | 283,786,860.62 |
Worst Scenario | 283,094,186.26 |
Best Scenario | 284,479,534.97 |
As previously mentioned, the complete financial results can be consulted via the Telegram Bot. The idea behind this is to make any store's sales prediction easily accessible, since it can be checked from anywhere and from any device, as long as an internet connection is available. The bot returns the sales prediction for the next six weeks for any available store; all you have to do is send it the store number in the format "/store_number" (e.g. /12, /23, /41, etc.). If the store number doesn't exist, the message "Store not available" is returned, and if you send text that isn't a number, the bot will ask you to enter a valid store id.
Because the deployment was made on a free cloud service (Render), the bot may take a few minutes to respond to the first request; subsequent requests should be answered almost instantly.
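To give an idea of the bot's flow, below is a heavily simplified sketch of the Telegram webhook handler. The token, prediction API URL, JSON fields and helper logic are illustrative assumptions, not the project's exact code; in the real deployment a separate Flask API loads the trained model and returns the predictions.

```python
import requests
from flask import Flask, Response, request

TOKEN = 'YOUR_TELEGRAM_BOT_TOKEN'                          # placeholder
PREDICTION_API = 'https://example.com/rossmann/predict'    # illustrative URL

app = Flask(__name__)

def send_message(chat_id, text):
    # Telegram Bot API endpoint for sending messages
    url = f'https://api.telegram.org/bot{TOKEN}/sendMessage'
    requests.post(url, json={'chat_id': chat_id, 'text': text})

@app.route('/', methods=['POST'])
def webhook():
    message = request.get_json()['message']
    chat_id = message['chat']['id']
    store_id = message['text'].replace('/', '').strip()

    if not store_id.isdigit():
        send_message(chat_id, 'Please enter a valid store id')
    else:
        # call the prediction API (illustrative); it would return the 6-week total
        resp = requests.get(PREDICTION_API, params={'store': store_id})
        if resp.status_code == 200 and resp.json():
            prediction = resp.json()['prediction']
            send_message(chat_id, f'Store {store_id} will sell '
                                  f'€{prediction:,.2f} in the next 6 weeks')
        else:
            send_message(chat_id, 'Store not available')
    return Response('Ok', status=200)
```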
In this project the main objective was accomplished:
A model that provides good sales predictions for each store over the next six weeks was successfully trained and deployed via a Telegram Bot, fulfilling the CEO's requirement: it is now possible to determine the best resource allocation for each store's renovation. In addition, five interesting and useful insights were found through Exploratory Data Analysis (EDA), which can be put to use by Rossmann's CEO.
Going forward, this solution could be improved with a few strategies:

- Using ARIMA to predict the number of customers over the next six weeks, so that the customers column could be added back to the final model.
- Tuning the regression algorithm further, for instance by applying Bayesian Optimization.
- Trying other regression algorithms to predict the sales for each store.
- Using different models for the stores whose sales are more difficult to predict (higher MAE and MAPE).