A Regression Project
Note: the business problem is fictitious, although both the company and the data are real.
The in-depth Python code explanation is available in this Jupyter Notebook.
Rossmann is one of the largest drug store chains in Europe, with operations in Germany, Poland, Hungary, the Czech Republic, Turkey, Albania, Kosovo and Spain. Their sales can be influenced by promotions, competition, school and state holidays, seasonality, locality, etc.
This Data Science project is focused on solving one problem: predicting the sales of each Rossmann store over the next six weeks, so that the best resource allocation for each store's renovation can be planned.
The data was collected from Kaggle. This dataset contains historical sales data for 1,115 Rossmann stores. The initial features descriptions are available below:
Feature | Definition |
---|---|
Id | an Id that represents a (Store, Date) pair within the dataset. |
Store | a unique Id for each store. |
Sales | the turnover for any given day. |
DayOfWeek | day of the week on which the sale was made (e.g. DayOfWeek=1 -> Monday, DayOfWeek=2 -> Tuesday, etc.). |
Date | date on which the sale was made. |
Customers | the number of customers on a given day. |
Open | an indicator for whether the store was open: 0 = closed, 1 = open. |
StateHoliday | indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None. |
SchoolHoliday | indicates if the (Store, Date) was affected by the closure of public schools. |
StoreType | differentiates between 4 different store models: a, b, c, d. |
Assortment | describes an assortment level: a = basic, b = extra, c = extended. |
CompetitionDistance | distance in meters to the nearest competitor store. |
CompetitionOpenSince(Month/Year) | gives the approximate year and month when the nearest competitor store was opened. |
Promo | indicates whether a store is running a promo on that day. |
Promo2 | Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating. |
Promo2Since(Year/Week) | describes the year and calendar week when the store started participating in Promo2. |
PromoInterval | describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store. |
- The Customers column was dropped, because there is no information about the number of customers six weeks into the future at prediction time.
- The NaN's in CompetitionDistance were replaced by 3 times the maximum CompetitionDistance in the dataset, since the observations with NaN's are likely stores whose nearest competitor is too far away to matter, i.e., there is effectively no competition.
- Some new features were created in order to better describe the problem (a minimal code sketch of the NaN filling and the cyclical encoding follows the table below):
New Feature | Definition |
---|---|
day/week_of_year/year_week/month/year | date parts (day, week of year, year-week, month and year) extracted from the 'date' column. |
day/day_of_week/week_of_year/month(sin/cos) | sin/cos component of each period, to capture their cyclical behavior. |
competition_time_month | number of months since the nearest competitor opened. |
promo_time_week | number of weeks since the (Promo2) promotion became active for the store. |
state_holiday(christmas/easter_holiday/public_holiday/regular_day) | indicates whether the sale was made on Christmas, Easter, a public holiday or a regular day. |
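As an illustration, here is a minimal sketch of how the CompetitionDistance filling and the cyclical encoding could be done with Pandas/NumPy. It assumes `df` is the merged sales + store dataframe with columns already renamed to snake_case; the exact steps in the notebook may differ.

```python
import numpy as np
import pandas as pd

# df is assumed to be the merged sales + store dataframe, already renamed to snake_case
# (columns such as 'competition_distance', 'date', 'day_of_week')

# fill missing competition distances with 3x the maximum observed distance,
# encoding "the nearest competitor is too far away to matter"
max_distance = df['competition_distance'].max()
df['competition_distance'] = df['competition_distance'].fillna(3 * max_distance)

# extract date parts from the 'date' column
df['date'] = pd.to_datetime(df['date'])
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)

# cyclical (sin/cos) encoding, so that e.g. December and January end up close together
df['month_sin'] = np.sin(df['month'] * (2.0 * np.pi / 12))
df['month_cos'] = np.cos(df['month'] * (2.0 * np.pi / 12))
df['day_of_week_sin'] = np.sin(df['day_of_week'] * (2.0 * np.pi / 7))
df['day_of_week_cos'] = np.cos(df['day_of_week'] * (2.0 * np.pi / 7))
```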
To predict the sales values for each store (six weeks in advance), a Machine Learning model was applied. To achieve that, the following steps were performed:

- Understanding the Business Problem: understanding why Rossmann's CEO requested this task and planning the solution.
- Collecting Data: collecting Rossmann store and sales data from Kaggle.
- Data Cleaning: renaming columns, changing data types and filling NaN's.
- Feature Engineering: creating new features from the original ones, so that they could be used in the ML model.
- Exploratory Data Analysis (EDA): exploring the data in order to gain business knowledge, look for useful business insights and find important features for the ML model. The top business insights found are available in Section 5.
- Data Preparation: applying Normalization and Rescaling techniques to the data, as well as Encoding methods and Response Variable Transformation.
- Feature Selection: selecting the best features to use in the ML model by applying the Boruta algorithm (a minimal sketch follows this list).
- Machine Learning Modeling: training Regression algorithms with time series cross-validation. The best model was selected for improvement via Hyperparameter Tuning. More information in Section 6.
- Model Evaluation: evaluating the model using four metrics: MAE, MAPE, RMSE and R2.
- Financial Results: translating the ML model's statistical performance into financial and business performance.
- Model Deployment (Telegram Bot): implementing a Telegram Bot that returns the sales prediction for any available store number. This is the project's Data Science product, and it can be accessed from anywhere. More information in Section 7.
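As referenced in the Feature Selection step, below is a minimal sketch of how Boruta can be applied with a Random Forest estimator using the `boruta` package. `X_train`/`y_train` are assumed to be the already prepared (encoded, rescaled) training data; the exact estimator settings are illustrative.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

# X_train / y_train are assumed to be the prepared training features and target
rf = RandomForestRegressor(n_jobs=-1)

# BorutaPy compares each real feature against randomly shuffled "shadow" copies
# and keeps only the features that consistently beat them
boruta = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta.fit(np.array(X_train), np.array(y_train).ravel())

selected_columns = X_train.columns[boruta.support_].tolist()
print(selected_columns)
```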
- Python 3.9.12, Pandas, Matplotlib, Seaborn and Sklearn.
- Jupyter Notebook and VSCode.
- Flask and Python APIs.
- Render Cloud and Telegram Bot.
- Git and GitHub.
- Exploratory Data Analysis (EDA).
- Techniques for Feature Selection.
- Regression Algorithms (Linear and Lasso Regression; Random Forest, XGBoost and LGBM Regressors).
- Cross-Validation Methods, Hyperparameter Optimization and Algorithms Performance Metrics (RMSE, MAE, MAPE, R2).
This was the most fundamental part of the project, since the ML modeling step is where the sales predictions for each store are produced. Six models were trained using time series cross-validation (a minimal sketch of this setup follows the list below):
- Average Model (used as a baseline model)
- Linear Regression
- Lasso Regression (Regularized Linear Regression)
- Random Forest Regressor
- XGBoost Regressor
- Light GBM Regressor
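A minimal sketch of how the time series cross-validation could be set up for one of these candidates is shown below. It uses scikit-learn's `TimeSeriesSplit` as an illustration; `X`, `y` are assumed to be the prepared features and target sorted by date, and the notebook may split the folds differently.

```python
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from lightgbm import LGBMRegressor

# X, y are assumed to be the prepared features and target, sorted by date
tscv = TimeSeriesSplit(n_splits=5)

model = LGBMRegressor(n_estimators=300, random_state=42)
scores = cross_validate(
    model, X, y, cv=tscv,
    scoring=['neg_mean_absolute_error', 'neg_root_mean_squared_error'],
)

mae = -scores['test_neg_mean_absolute_error']
rmse = -scores['test_neg_root_mean_squared_error']
print(f"MAE:  {mae.mean():.2f} +/- {mae.std():.2f}")
print(f"RMSE: {rmse.mean():.2f} +/- {rmse.std():.2f}")
```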
The initial performance of all six algorithms is displayed below (sorted by RMSE):
Model Name | MAE | MAPE | RMSE | R2 |
---|---|---|---|---|
LGBM Regressor | 833.23 +/- 121.0 | 0.1178 +/- 0.0093 | 1197.68 +/- 176.39 | 0.8467 +/- 0.0237 |
Random Forest Regressor | 838.84 +/- 219.91 | 0.1162 +/- 0.0233 | 1257.62 +/- 321.14 | 0.8409 +/- 0.0527 |
XGBoost Regressor | 900.29 +/- 152.53 | 0.1273 +/- 0.0155 | 1293.45 +/- 214.31 | 0.8322 +/- 0.0334 |
Average Model (Baseline) | 1354.8 | 0.2064 | 1835.14 | 0.6366 |
Linear Regression | 2081.72 +/- 295.57 | 0.3026 +/- 0.0166 | 2953.15 +/- 468.22 | 0.1353 +/- 0.0721 |
Lasso Regression | 2116.42 +/- 341.46 | 0.292 +/- 0.0118 | 3058.12 +/- 504.18 | 0.0742 +/- 0.0834 |
Both Linear Regression and Lasso Regression performed worse than the simple Average Model. This points to nonlinear behavior in the dataset, hence the use of more complex models such as Random Forest, XGBoost and Light GBM.
The LGBM model was chosen for Hyperparameter Tuning, since it has the lowest RMSE. Even looking at other metrics, such as MAPE (on which Random Forest performs best), LGBM would still be the better choice, because it is much faster to train and tune.
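A minimal sketch of what the Random Search over LGBM hyperparameters could look like is shown below; the parameter grid and number of iterations are illustrative assumptions, not the exact values used in the notebook.

```python
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from lightgbm import LGBMRegressor

# illustrative search space; the real grid used in the notebook may differ
param_distributions = {
    'n_estimators': [500, 1000, 1500, 2000],
    'learning_rate': [0.01, 0.03, 0.05, 0.1],
    'num_leaves': [31, 63, 127, 255],
    'max_depth': [5, 7, 9, -1],
    'min_child_samples': [20, 50, 100],
    'subsample': [0.7, 0.85, 1.0],
    'colsample_bytree': [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    LGBMRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=TimeSeriesSplit(n_splits=5),
    scoring='neg_root_mean_squared_error',
    random_state=42,
)
search.fit(X, y)  # X, y assumed to be the prepared training data
print(search.best_params_)
```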
After tuning LGBM's hyperparameters using Random Search, the model's performance improved:
Model Name | MAE | MAPE | RMSE | R2 |
---|---|---|---|---|
LGBM Regressor | 617.54000 | 0.08940 | 921.52000 | 0.90840 |
Metric | Definition |
---|---|
MAE | Mean Absolute Error |
MAPE | Mean Absolute Percentage Error |
RMSE | Root Mean Squared Error |
R2 | Coefficient of Determination |
R2 basically shows how well the sales are being predicted by the model; alongside RMSE, it isn't the best metric to translate into financial performance, despite being key for checking statistical performance.
Both MAE and MAPE are really useful in explaining the model's business performance: MAE shows how much, on average, the prediction deviates from the real value, while MAPE expresses that average deviation as a percentage.
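For reference, these four metrics can be computed with scikit-learn as sketched below; `y_true`/`y_pred` are assumed to be the real and predicted sales on the validation set, back-transformed to the original scale if the response variable was transformed.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# y_true / y_pred: real and predicted sales on the original scale (assumed)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.2f} | MAPE: {mape:.4f} | RMSE: {rmse:.2f} | R2: {r2:.4f}")
```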
Below are two tables with brief financial results given by the LGBM model; the complete financial results are explained in the next section.
A couple of useful metrics to evaluate the financial performance of this solution (LGBM model) are MAE and MAPE. Below is a table with these metrics for a few stores (a sketch of how the scenarios can be derived follows the table):
Store | Predictions (€) | Worst Scenario (€) | Best Scenario (€) | MAE (€) | MAPE |
---|---|---|---|---|---|
1 | 161,274.69 | 160,988.99 | 161,560.39 | 285.69937 | 0.06472 |
2 | 175,549.76 | 175,192.06 | 175,907.47 | 357.70668 | 0.07338 |
3 | 259,576.04 | 259,077.04 | 260,075.03 | 498.99756 | 0.07215 |
... | ... | ... | ... | ... | ... |
1113 | 238,353.17 | 237,841.04 | 238,865.30 | 512.12600 | 0.07852 |
1114 | 769,997.75 | 767,598.11 | 772,397.39 | 2399.63754 | 0.10164 |
1115 | 254,766.52 | 254,227.68 | 255,305.36 | 538.83848 | 0.07576 |
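A hedged sketch of how these per-store scenarios can be derived: the worst and best scenarios are assumed here to be the predicted six-week total minus/plus the store's MAE, which is consistent with the numbers in the table but is an assumption about the notebook's exact logic. Column names below are illustrative.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# df_val is assumed to hold validation rows with columns 'store', 'sales' (real)
# and 'predictions' (model output on the original scale)
def store_scenarios(df_val: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for store, g in df_val.groupby('store'):
        mae = mean_absolute_error(g['sales'], g['predictions'])
        mape = mean_absolute_percentage_error(g['sales'], g['predictions'])
        total = g['predictions'].sum()
        rows.append({
            'store': store,
            'predictions': total,
            'worst_scenario': total - mae,  # total shifted down by the average error
            'best_scenario': total + mae,   # total shifted up by the average error
            'MAE': mae,
            'MAPE': mape,
        })
    return pd.DataFrame(rows)
```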
According to this model, the sales sum for all stores over the next six weeks is:
Scenario (€) | Total Sales of the Next 6 Weeks (€) |
---|---|
Prediction | 283,786,860.62 |
Worst Scenario | 283,094,186.26 |
Best Scenario | 284,479,534.97 |
As previously mentioned, the complete financial results can be consulted via the Telegram Bot. The idea behind this is to make any store's sales prediction easily accessible, since it can be checked from anywhere and from any device, as long as an internet connection is available. The bot returns the sales prediction for the next six weeks for any available store; all you have to do is send it the store number in the format "/store_number" (e.g. /12, /23, /41, etc.). If the store number doesn't exist, the message "Store not available" is returned, and if you send text that isn't a number, the bot will ask you to enter a valid store id.
Because the deployment was made on a free cloud service (Render), the bot may take a few minutes to respond to the first request; subsequent requests should be answered almost instantly.
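To give an idea of the bot's flow, below is a heavily simplified sketch of the Telegram webhook handler. The token, prediction API URL, JSON fields and helper logic are illustrative assumptions, not the project's exact code; in the real deployment a separate Flask API loads the trained model and returns the predictions.

```python
import requests
from flask import Flask, Response, request

TOKEN = 'YOUR_TELEGRAM_BOT_TOKEN'                          # placeholder
PREDICTION_API = 'https://example.com/rossmann/predict'    # illustrative URL

app = Flask(__name__)

def send_message(chat_id, text):
    # Telegram Bot API endpoint for sending messages
    url = f'https://api.telegram.org/bot{TOKEN}/sendMessage'
    requests.post(url, json={'chat_id': chat_id, 'text': text})

@app.route('/', methods=['POST'])
def webhook():
    message = request.get_json()['message']
    chat_id = message['chat']['id']
    store_id = message['text'].replace('/', '').strip()

    if not store_id.isdigit():
        send_message(chat_id, 'Please enter a valid store id')
    else:
        # call the prediction API (illustrative); it would return the 6-week total
        resp = requests.get(PREDICTION_API, params={'store': store_id})
        if resp.status_code == 200 and resp.json():
            prediction = resp.json()['prediction']
            send_message(chat_id, f'Store {store_id} will sell '
                                  f'€{prediction:,.2f} in the next 6 weeks')
        else:
            send_message(chat_id, 'Store not available')
    return Response('Ok', status=200)
```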
In this project the main objective was accomplished:
A model that provides good sales predictions for each store over the next six weeks was successfully trained and deployed via a Telegram Bot, fulfilling the CEO's requirement: it is now possible to determine the best resource allocation for each store's renovation. In addition, five interesting and useful insights were found through Exploratory Data Analysis (EDA), which can be put to use by Rossmann's CEO.
Going forward, this solution could be improved with a few strategies:

- Using ARIMA to predict the number of customers over the next six weeks, so that the customers column could be added back to the final model.
- Tuning the regression algorithm further, for instance by applying Bayesian Optimization.
- Trying other regression algorithms to predict the sales for each store.
- Using different models for the stores whose sales are more difficult to predict (higher MAE and MAPE).