Higher School of Economics, Yandex and Sberbank along with Harbour.Space University are proud to announce an olympiad created by and for data analysts.
Human behaviour isn’t governed by the rules of logic. It tends to defy even the shrewdest predictions, so successfully forecasting the future desires of just a small fraction of users would be a major achievement. Given the browsing history of a large number of users, your task is to select a small sample group (5% of users) and recommend five product categories for each person. At least one of these picks must be something that doesn’t interest the user right now, but will interest them during the next week.
The task is to choose exactly 53,979 users (user_id, 5% of all users in the dataset) and for each of them select five third-level product categories (id3) that they have not viewed in the last three weeks and that will be of interest to them in the next seven days. The resulting score is based on the number of users for whom at least one product category is correctly nominated. Accurate predictions of two or more categories for one user will not improve your score.
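The scoring rule above can be sketched as a small function. This is only an illustration, not the organizers' checker: the function name, the dictionary layout, and the decision to ignore categories the user viewed in the last three weeks are all assumptions.

```python
def score(predictions, future_views, recent_views):
    """Count users with at least one correct, previously unseen category.

    predictions:  dict user_id -> iterable of five id3 values
    future_views: dict user_id -> set of id3 viewed in the next seven days
    recent_views: dict user_id -> set of id3 viewed in the last three weeks
    (all three structures are hypothetical, chosen for illustration)
    """
    hits = 0
    for user, categories in predictions.items():
        # Assumption: a category viewed in the last three weeks never counts.
        new_interest = future_views.get(user, set()) - recent_views.get(user, set())
        # One or several correct categories both add exactly one point.
        if new_interest & set(categories):
            hits += 1
    return hits


predictions = {
    1: [10, 11, 12, 13, 14],  # two correct picks, still one point
    2: [20, 21, 22, 23, 24],  # no correct pick
    3: [30, 31, 32, 33, 34],  # category 30 was viewed recently, so no point
}
future_views = {1: {10, 11}, 2: {99}, 3: {30}}
recent_views = {3: {30}}
total = score(predictions, future_views, recent_views)
```

Because a second correct category for the same user adds nothing, it may pay off to spread the five picks over different plausible interests rather than five near-duplicates.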
Input format
You will be working with Yandex.Market search logs. Each row in the data corresponds to a "view" event: a particular user viewed an item that belongs to a particular category. The data is stored in a .csv file with the following fields:
- user_id: individual shopper identifier
- date: the day when the user’s interest in a particular product was recorded; from 1 to 54
- id1: first (highest) level category identifier, e.g. “Home appliances”
- id2: second (middle) level category identifier, e.g. “Kitchen appliances”
- id3: third (lowest) level category identifier, e.g. “Refrigerators”
The data can be downloaded using this link.
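As a rough illustration of working with the fields listed above, the sketch below reads view events and collects, for every user, the categories viewed in the last three weeks. It assumes that the file has a header row with these column names, that all identifiers are integers, and that "the last three weeks" means days 34 to 54; none of this is confirmed by the task statement, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_views(handle):
    """Yield (user_id, date, id3) tuples from an open CSV file with a header row."""
    for row in csv.DictReader(handle):
        yield int(row["user_id"]), int(row["date"]), int(row["id3"])


def recent_categories(views, last_day=54, window=21):
    """Map each user to the set of id3 values viewed in the last `window` days."""
    first_day = last_day - window + 1  # day 34 with the default values
    recent = defaultdict(set)
    for user, day, id3 in views:
        if first_day <= day <= last_day:
            recent[user].add(id3)
    return dict(recent)


# A tiny in-memory sample standing in for the real file.
sample = io.StringIO(
    "user_id,date,id1,id2,id3\n"
    "1,10,0,5,100\n"
    "1,40,0,5,101\n"
    "2,54,1,7,200\n"
)
recent = recent_categories(read_views(sample))
```

For the real data, `read_views` would be given a file opened with `open(path, newline="")` instead of the in-memory sample.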
Please upload your predictions into the system in .csv format. The file should consist of 53,979 + 1 rows (one row per selected user plus a header row) and contain the columns user_id, id3_1, id3_2, id3_3, id3_4, id3_5.
A sample submission can be found here.
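A submission file of this shape could be written as in the following sketch. The function name and the validation checks are assumptions added for illustration; the official sample submission remains the authoritative reference for the exact format.

```python
import csv
import io

COLUMNS = ["user_id", "id3_1", "id3_2", "id3_3", "id3_4", "id3_5"]


def write_submission(handle, predictions, expected_users=53979):
    """Write a header row followed by one row of five categories per user.

    predictions: dict user_id -> sequence of five id3 values (hypothetical layout)
    """
    if len(predictions) != expected_users:
        raise ValueError(
            "expected %d users, got %d" % (expected_users, len(predictions))
        )
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(COLUMNS)
    for user_id, categories in predictions.items():
        categories = list(categories)
        if len(categories) != 5:
            raise ValueError("user %s needs exactly five categories" % user_id)
        writer.writerow([user_id] + categories)


# Small demonstration with a single user instead of 53,979.
buffer = io.StringIO()
write_submission(buffer, {7: [1, 2, 3, 4, 5]}, expected_users=1)
content = buffer.getvalue()
```

Checking the row count before writing catches the most likely formatting mistake, a file with too few or too many users, before the upload.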
In this task you need to create a program that solves the first task for 5,000 users. Upload your program to the contest platform as a .zip file.
You need to submit a .zip archive that includes a Makefile with the targets "build" and "run", which will be executed one after another in a container. The log produced during the "build" phase will be visible on the submission page, so it is possible to debug the installation. During the "run" phase your code should process the data stored in ./train.csv.zip and is expected to produce a ./submission.csv file with predictions. The submission file should consist of 5,000 + 1 rows and contain the columns user_id, id3_1, id3_2, id3_3, id3_4, id3_5.
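The "run" phase could invoke a script along the following lines. It is a deliberately naive popularity baseline written for Python 3, not a recommended solution: it picks the most active users and gives each the globally most viewed categories that they did not view in the last three weeks. The function name, the assumption that the archive holds a single CSV file with a header row, and the defaults for last_day and window are all illustrative.

```python
import csv
import io
import zipfile
from collections import Counter, defaultdict
from itertools import islice


def run(train_zip="./train.csv.zip", out_csv="./submission.csv",
        n_users=5000, last_day=54, window=21):
    seen = defaultdict(set)  # user_id -> id3 viewed within the last `window` days
    activity = Counter()     # user_id -> number of view events
    popularity = Counter()   # id3 -> number of view events

    with zipfile.ZipFile(train_zip) as archive:
        member = archive.namelist()[0]  # assumption: one CSV file in the archive
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                user, id3 = row["user_id"], row["id3"]
                activity[user] += 1
                popularity[id3] += 1
                if int(row["date"]) > last_day - window:
                    seen[user].add(id3)

    # Categories ordered from most to least viewed overall.
    ranked = [id3 for id3, _ in popularity.most_common()]

    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["user_id", "id3_1", "id3_2", "id3_3", "id3_4", "id3_5"])
        for user, _ in activity.most_common(n_users):
            unseen = (c for c in ranked if c not in seen[user])
            # Note: fewer than five picks are possible on very small datasets.
            picks = list(islice(unseen, 5))
            writer.writerow([user] + picks)
```

Calling `run()` with no arguments uses the paths named in the task statement, so the Makefile's "run" target would only need to start the script that makes this call. The data is streamed row by row so that the whole log never has to fit in memory at once.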
The container has Python 2.7/3.6 installed with the following major libraries:
- numpy
- scipy
- pandas
- scikit-learn
- matplotlib
- joblib
- tqdm
- xgboost
- lightgbm
- catboost