NYCDSA ML Lab I Repository
This lab walks you through the complete workflow of a machine learning project: from data wrangling and exploratory data analysis (EDA) to model training and model evaluation/comparison.
You will work with your machine learning project teammates for this lab, and your team needs to decide whether to use R or Python as the main programming language. Each team member needs to work on his/her own submission.
We will use GitHub for team collaboration. There is a TL;DR of how programmers work together on GitHub, or we can break it down into the following steps:
- The team leader creates a public GitHub repository under his/her account first.
- All the other team members fork the repo, so each of you will have a COPY of the repo under your own account.
- Git clone YOUR OWN repo; otherwise you won't be able to push later.
- Create a subfolder under your name and finish your code. Push the changes to GitHub.
- Go to the GitHub page of YOUR OWN repository and click the "Pull Request" tab. You can find the details here.
- Submit the pull request so it shows up under the team leader's repository.
- Pair review each other's code before merging it into the master branch.
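The per-member part of the workflow above can be dry-run locally. This is only a sketch: the repo name, folder name, and identity below are placeholders, and `git init` stands in for cloning YOUR OWN fork's URL.

```shell
set -e
# Local stand-in for `git clone <your-fork-url>` (placeholder path).
rm -rf /tmp/ml-lab-demo && mkdir -p /tmp/ml-lab-demo && cd /tmp/ml-lab-demo
git init -q .
git config user.email "you@example.com"   # placeholder identity
git config user.name "Your Name"
mkdir -p your_name                        # subfolder under your own name
echo "# My submission" > your_name/README.md
git add your_name
git commit -q -m "Add lab submission under your_name/"
# Next, in the real workflow: `git push origin master`,
# then open the "Pull Request" tab on your fork's GitHub page.
git log --oneline
```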
Homework: To understand fork, pull request, and branch better, review this video at 1.25x speed.
- The data comes from a global e-retailer and includes orders from 2012 to 2015. Import the Orders dataset and do some basic EDA.
- For problems 1 to 3, we mainly focus on data cleaning and data visualization. You can use any packages you are familiar with to create plots, and provide brief interpretations of your findings.
- Check "Profit" and "Sales" in the dataset and convert these two columns to numeric type.
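In Python, one way to do this conversion is with pandas. This is a minimal sketch that assumes the two columns are stored as strings with a currency symbol and thousands separators (e.g. `"$2,309.65"`); the toy frame below stands in for the real Orders dataset.

```python
import pandas as pd

# Toy frame standing in for the Orders dataset (values assumed to look like "$1,234.56").
orders = pd.DataFrame({
    "Profit": ["$62.15", "-$288.77", "$919.97"],
    "Sales":  ["$261.96", "$731.94", "$2,309.65"],
})

# Strip the currency symbol and thousands separators, then convert to numeric.
for col in ["Profit", "Sales"]:
    orders[col] = pd.to_numeric(orders[col].str.replace(r"[$,]", "", regex=True))

print(orders.dtypes)
```

In R you would do the analogous `gsub("[$,]", "", x)` followed by `as.numeric()`.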
- Retailers that depend on seasonal shoppers have a particularly challenging job when it comes to inventory management. Your manager is making plans for next year's inventory, and he wants you to answer the following questions:
  - Is there any seasonal trend of inventory in the company?
  - Is the seasonal trend the same for different categories?
- Hint: Each order has an attribute called `Quantity` that indicates the number of products in the order. If an order contains more than one product, there will be multiple observations of the same order.
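A starting point for the seasonal analysis is to aggregate `Quantity` by month, overall and per category. This sketch assumes column names `Order.Date`, `Category`, and `Quantity`, and uses toy data in place of the real Orders dataset.

```python
import pandas as pd

# Toy data standing in for Orders (column names assumed from the lab).
orders = pd.DataFrame({
    "Order.Date": ["2014-01-05", "2014-01-20", "2014-07-04", "2015-01-11"],
    "Category":   ["Furniture", "Furniture", "Technology", "Furniture"],
    "Quantity":   [3, 2, 5, 4],
})
orders["Order.Date"] = pd.to_datetime(orders["Order.Date"])
orders["Month"] = orders["Order.Date"].dt.month

# Total quantity ordered per month, overall and broken down by category.
monthly = orders.groupby("Month")["Quantity"].sum()
monthly_by_cat = orders.groupby(["Category", "Month"])["Quantity"].sum()
print(monthly)
print(monthly_by_cat)
```

Plotting `monthly` (e.g. as a bar chart) and one line per category from `monthly_by_cat` makes the seasonal pattern, if any, visible.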
- Your manager asked you to give a brief report (plots + interpretations) on returned orders:
  - How much profit did we lose due to returns each year?
  - How many customers returned more than once? More than 5 times?
  - Which regions are more likely to return orders?
  - Which categories (sub-categories) of products are more likely to be returned?
- Hint: Merge the Returns dataframe with the Orders dataframe using `Order.ID`.
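The merge plus the first report question can be sketched as follows. This assumes a `Returns` table keyed by `Order.ID` and an `Orders` table with `Order.Date` and `Profit`; the tiny frames below are stand-ins for the real data.

```python
import pandas as pd

# Toy stand-ins for Orders and Returns (column names assumed from the lab).
orders = pd.DataFrame({
    "Order.ID":   ["A1", "A2", "A3", "A4"],
    "Order.Date": pd.to_datetime(["2014-03-01", "2014-06-01",
                                  "2015-02-01", "2015-09-01"]),
    "Profit":     [50.0, 120.0, -30.0, 200.0],
})
returns = pd.DataFrame({"Order.ID": ["A2", "A4"]})

# Left-merge so every order is kept; `indicator` marks which rows matched Returns.
merged = orders.merge(returns, on="Order.ID", how="left", indicator=True)
merged["Returned"] = merged["_merge"] == "both"

# Profit lost to returns, per year.
merged["Year"] = merged["Order.Date"].dt.year
lost = merged.loc[merged["Returned"]].groupby("Year")["Profit"].sum()
print(lost)
```

The same `merged` frame can then be grouped by region or (sub-)category to answer the remaining questions.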
Now your manager has a basic understanding of why customers returned orders. Next, he wants you to use machine learning to predict which orders are most likely to be returned. In this part, you will generate several features based on our previous findings and your manager's requirements.
- First of all, we need to generate a categorical variable which indicates whether an order has been returned or not.
- Hint: the returned orders’ IDs are contained in the dataset “returns”
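Following the hint, the label can be built with a membership test against the returned IDs. A minimal sketch, assuming both tables have an `Order.ID` column (toy data below):

```python
import pandas as pd

# Toy stand-ins (column names assumed from the lab instructions).
orders = pd.DataFrame({"Order.ID": ["A1", "A2", "A3"]})
returns = pd.DataFrame({"Order.ID": ["A2"]})

# 1 if the order's ID appears in the Returns dataset, else 0.
orders["Returned"] = orders["Order.ID"].isin(returns["Order.ID"]).astype(int)
print(orders)
```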
- Your manager believes that how long it took the order to ship would affect whether the customer would return it or not.
- He wants you to generate a feature which can measure how long it takes the company to process each order.
- Hint: Process.Time = Ship.Date - Order.Date
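The hint translates directly into a datetime subtraction. A sketch assuming `Ship.Date` and `Order.Date` columns (toy values below), with the result expressed in whole days:

```python
import pandas as pd

# Toy stand-ins (column names assumed from the lab).
orders = pd.DataFrame({
    "Order.Date": pd.to_datetime(["2014-03-01", "2014-06-10"]),
    "Ship.Date":  pd.to_datetime(["2014-03-05", "2014-06-11"]),
})

# Process.Time = Ship.Date - Order.Date, in days.
orders["Process.Time"] = (orders["Ship.Date"] - orders["Order.Date"]).dt.days
print(orders["Process.Time"])
```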
- If a product has been returned before, it may be returned again.
- Let us generate a feature that indicates how many times the product has been returned before.
- If it never got returned, we just impute using 0.
- Hint: Group by different Product.ID
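One way to compute "returns before this order" is a grouped cumulative sum that excludes the current row. This sketch assumes the frame already has the `Returned` flag and an `Order.Date` column, and uses toy data in place of the real one.

```python
import pandas as pd

# Toy stand-in: one row per order line, already flagged with `Returned`.
df = pd.DataFrame({
    "Product.ID": ["P1", "P1", "P1", "P2"],
    "Order.Date": pd.to_datetime(["2014-01-01", "2014-05-01",
                                  "2014-09-01", "2014-02-01"]),
    "Returned":   [1, 0, 1, 0],
})

# For each product, count how many of its EARLIER orders were returned:
# cumulative returns up to and including this row, minus the current row.
df = df.sort_values(["Product.ID", "Order.Date"])
df["Prior.Returns"] = df.groupby("Product.ID")["Returned"].cumsum() - df["Returned"]
print(df)
```

Products that were never returned get 0 automatically, matching the imputation rule above.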
- You can use any binary classification method you have learned so far.
- Use 80/20 training and test splits to build your model.
- Double check the column types before you fit the model.
- Only include useful features, i.e., all the `ID` columns should be excluded from your training set.
- Note that fewer than 5% of the orders have been returned, so you should consider using the `createDataPartition` function from the `caret` package, which does a stratified random split of the data. Scikit-learn also has a `StratifiedKFold` class that does a similar thing.
- Don't forget to `set.seed()` before the split to make your results reproducible.
- Note: We are not looking for the best-tuned model in this lab, so don't spend too much time on grid search. Focus on model evaluation and the business use case of each model.
- What is the best metric to evaluate your model? Is accuracy good for this case?
- Now you have multiple models, which one would you pick?
- Can you get any clue from the confusion matrix? What is the meaning of precision and recall in this case? Which one do you care the most? How will your model help the manager make decisions?
- Note: The last question is open-ended. Your answer could be completely different depending on your understanding of this business problem.
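To make the confusion-matrix questions concrete, here is a small worked example on hypothetical predictions (the labels below are made up, where 1 = "will be returned"):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and model predictions for 10 orders.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # of orders we flag, how many actually return
recall = recall_score(y_true, y_pred)        # of actual returns, how many we catch
print(tn, fp, fn, tp, precision, recall)
```

With a highly imbalanced label, a model predicting "never returned" would already score ~95% accuracy, which is why precision and recall are the more informative metrics here.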
- Is there anything wrong with the new feature we generated? How should we fix it?
- Hint: For the real test set, we do not know whether an order will be returned or not.