Kaggle Competition: TalkingData AdTracking Fraud Detection Challenge


Authors: Kevin Liao

Objective

This repository is mainly about my learning experience in a Kaggle competition. It consists of Python scripts (for feature engineering, model training, and model selection) and Jupyter notebooks (for EDA). I hope beginners can find something useful in this repo. I expect to recycle some of its tools myself as well.

Competition Website: TalkingData

File structure

The main directories of this repository are:

  • data, which stores the original data sets train.csv, test.csv, and test_supplement.csv
  • scripts, which holds the meat of the competition: feature engineering and model training/prediction
  • models, which stores trained models (trained objects)
  • eda_nb, which stores Jupyter notebooks and HTML exports of the EDA process and output
  • insample_iterations, which is responsible for in-sample model selection, tuning, and evaluation
  • images, which stores the graphic output from EDA
  • reference, which contains other top Kagglers' scripts and tutorials

The complete file-structure for the project is as follows:

TalkingData/
    README.md
    LICENSE
    requirements.txt
    data/
        README.md
        train.csv
        test.csv
        test_supplement.csv
        train_v1.hdf
        test_v1.hdf
        train_v2.hdf
        test_v2.hdf
        train_v3.hdf
        test_v3.hdf
    scripts/
        feature_eng-v1.py
        train_xgb-v1.py
        feature_eng-v2.py
        train_lightgbm-v2.py
        feature_eng-v3.py
        train_lightgbm-v3.py
    models/
        model_lgbm.txt
    eda_nb/
        basic_EDA.ipynb
        basic_EDA.html
        better_EDA.ipynb
        better_EDA.html
        SHAP_toy_example.ipynb
        SHAP_toy_example.html
        BayesOpt_toy_example.ipynb
        BayesOpt_toy_example.html
        Boruta_algo_toy_example.ipynb
        Boruta_algo_toy_example.html
    insample_iterations/
        README.md
        data/
            train_raw.hdf
            test_raw.hdf
            train.hdf
            test.hdf
        scripts/
            dump_in_sample_data.py
            feature_engineering.py
            feature_univariate_selection.py
            feature_forward_selection.py
            feature_backward_selection.py
            feature_permutation_selection.py
            train_model.py
    images/
        ...too many random plots
    reference/
        ...good stuff

Some thoughts

Big thanks to the sponsor TalkingData and to Kaggle for providing such an interesting competition. Congratulations to the top teams, and my appreciation for the kernel contributions from @Pranav Pandya and @anttip. This was my first Kaggle competition and I can't tell you how much fun it was to be part of it. I had a full-time job and knew I could only commit my weekend free time to the competition. As a newbie on Kaggle, I did not anticipate a good LB score at all going into the competition. Just one week before the final submission deadline, I was pumped to find myself solo ranked in the top 3% on the public LB. However, that didn't last long, and my final submission ranked in the top 15% on the private LB, which I think is a reasonable result for me. Overall, I think this was one of the most competitive competitions, and it is very hard to get into the top 5% without a team.

My results are shown below. I won't share too much about my strategy because it is not a winning strategy anyway, and most of my work is taken from public kernels. However, I will share what I have learned and what makes a winning strategy.

My Model and LB score (AUC-ROC)

model definition can be found in scripts/train_lightgbm-v3.py

feature engineering can be found in scripts/feature_eng-v3.py

  • LGBM model with 42 features (36 numerical, 6 categorical).
model    | public score | private score | final rank
model V3 | 0.9806721    | 0.9811112     | 586th (top 15%)

What I have learned from Kaggle competition winners

We have to understand the game before sinking time into it.

In this competition, the data set is huge but we only have six features. This means that 1) we need to spend a lot of time on feature engineering, and 2) each feature engineering and model validation cycle takes a long time (because the data is huge). Unless we have a good team, allocating our time well is crucial in this particular competition. A suggested breakdown, following the 6th place solution, is:

  • 80% feature engineering
  • 10% making local validation as fast as possible
  • 5% hyper parameter tuning
  • 5% ensembling

Establishing a high-speed research cycle is the key to winning

This competition is about training a model on past historical data and predicting future fraudulent clicks, which is a severely imbalanced classification problem. For an imbalanced classification problem with a temporal split, traditional five-fold cross-validation may not be a good strategy (or you have to be really careful about the sampling ratio, the timing, and future-information leakage).

  1. Basic strategy: a good research framework for this kind of problem, following the 6th place solution:

    • Understand that the training data starts on day 7 and ends on day 9. The testing data is day 10, at hours 4, 5, 9, 10, 13, and 14.

    • Introduce an in-sample hold-out bright line: enforce a cut between day 8 and day 9 for the in-sample research cycle.

    • Train on day <= 8, and validate on both day 9, hour 4 (mirrors the public LB) and day 9, hours 5, 9, 10, 13, 14 (mirrors the private LB); a minimal sketch of this split appears after this list.

    • For the out-of-sample (public LB) iteration, retrain on all data using 1.2 times the number of trees found by early stopping in the in-sample validation.

  2. Advanced strategy: a fast-runtime, memory-light iteration scheme, following the 1st place solution:

    • Understand that 99.85% of the examples in the data are negative, and that dropping tons of negative examples DOES NOT deteriorate out-of-sample performance.

    • Use negative down-sampling: keep all positive examples (i.e., is_attributed == 1) and down-sample the negative examples used for training so that their count equals the number of positives. This discards about 99.8% of the negative examples.

    • Use sample bagging: bag five predictors trained on five sampled datasets created from different random seeds (also shown in the sketch after this list).

    • This technique allows us to use hundreds of features while keeping LGBM training time under 30 minutes.

    • Alternatively, use memory tricks in NumPy.

  3. Good principle: keep your in-sample hold-out score aligned with the LB score:

    • Do not rely solely on either the public LB score or the in-sample hold-out score; if you do, you will eventually overfit to one of them.

    • Discard features that widen the gap between the public LB score and the in-sample hold-out score, even if they improve your in-sample hold-out score.
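
The following is a minimal sketch, written for this README rather than taken from the repo's scripts, of the day-based hold-out split (basic strategy) combined with negative down-sampling and seed bagging (advanced strategy). Column names follow the competition's train.csv; the feature list and LightGBM parameters are illustrative assumptions.

    # Hold-out split by day, plus negative down-sampling with seed bagging.
    # NOTE: feature list and parameters are placeholders, not tuned settings.
    import numpy as np
    import pandas as pd
    import lightgbm as lgb

    df = pd.read_csv('data/train.csv', parse_dates=['click_time'])
    df['day'] = df['click_time'].dt.day
    df['hour'] = df['click_time'].dt.hour

    features = ['ip', 'app', 'device', 'os', 'channel', 'hour']   # toy feature set

    # Bright line between day 8 and day 9: train on day <= 8, validate on day 9.
    train = df[df['day'] <= 8]
    valid_pub = df[(df['day'] == 9) & (df['hour'] == 4)]                      # mirrors public LB
    valid_prv = df[(df['day'] == 9) & df['hour'].isin([5, 9, 10, 13, 14])]    # mirrors private LB

    def down_sample(frame, seed):
        """Keep all positives; sample an equal number of negatives."""
        pos = frame[frame['is_attributed'] == 1]
        neg = frame[frame['is_attributed'] == 0].sample(n=len(pos), random_state=seed)
        return pd.concat([pos, neg]).sample(frac=1.0, random_state=seed)

    # Sample bagging: one model per random seed, predictions averaged at the end.
    preds = []
    for seed in range(5):
        sampled = down_sample(train, seed)
        booster = lgb.train(
            {'objective': 'binary', 'metric': 'auc', 'learning_rate': 0.1, 'seed': seed},
            lgb.Dataset(sampled[features], label=sampled['is_attributed']),
            valid_sets=[lgb.Dataset(valid_pub[features], label=valid_pub['is_attributed'])],
            num_boost_round=2000,
            callbacks=[lgb.early_stopping(stopping_rounds=50)],
        )
        preds.append(booster.predict(valid_prv[features], num_iteration=booster.best_iteration))

    bagged = np.mean(preds, axis=0)   # hold-out prediction mirroring the private LB

Since AUC-ROC is a ranking metric, the calibration shift introduced by down-sampling is not a problem for this competition's score.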

Feature engineering is the winning secret sauce

We have five original categorical features and one timestamp feature in the data set. Unless you have some crazy NN model with proper data preprocessing (3rd place solution), you definitely need some magic features to separate yourself from the crowd. If you have no idea how to engineer new features, please see this good feature engineering guide.

Here are some general ideas taken from top winners:

  • drop original worse-than-noise features [ip, maybe device]

  • encode the timestamp into day and hour

  • user concept: (ip, device, os) triplets (app is the product concept)

  • (requires brute force) aggregates over various feature groups: click-series-based feature sets, where each set consists of 31 (= 2^5 - 1) features, one per non-empty subset of the five categorical columns (1st place solution); see the sketch after this list

    • count features, unique-count features, cumcount features
    • time delta to the previous value, delta to the next value
    • mean and variance with respect to hour
    • standard target encoding
    • Weight of Evidence (WoE) target encoding
  • ratio features

    • ratio of the number of clicks per (ip, app) to the number of clicks per app
    • nunique_counts_ratio
    • top_counts_ratio
  • magic additions:

    • feature extraction (topic models): categorical feature embeddings via LDA/NMF/LSA (1st place solution)

    • matrix factorization: truncated SVD from sklearn and FM-like embeddings

  • data leakage in the test set
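
To make the aggregate and ratio ideas above concrete, here is a minimal pandas sketch; it is my own illustration rather than code from any winning solution, and the grouping keys and feature names are arbitrary examples.

    # Illustrative aggregate, ratio, and time-delta features over a few groups.
    # Column names follow train.csv; the chosen groups are arbitrary examples.
    import pandas as pd

    df = pd.read_csv('data/train.csv', parse_dates=['click_time'])
    df = df.sort_values('click_time')   # time deltas assume chronological order
    df['hour'] = df['click_time'].dt.hour

    # Count features: clicks per (ip, app) and clicks per app.
    df['ip_app_count'] = df.groupby(['ip', 'app'])['app'].transform('count')
    df['app_count'] = df.groupby('app')['app'].transform('count')

    # Ratio feature: clicks per (ip, app) relative to clicks per app.
    df['ip_app_to_app_ratio'] = df['ip_app_count'] / df['app_count']

    # Unique-count feature: distinct apps seen for each ip.
    df['ip_nunique_app'] = df.groupby('ip')['app'].transform('nunique')

    # Cumcount feature: running click index within each (ip, device, os) "user".
    df['user_cumcount'] = df.groupby(['ip', 'device', 'os']).cumcount()

    # Time deltas: gap to the previous and next click of the same "user".
    g = df.groupby(['ip', 'device', 'os'])['click_time']
    df['prev_click_delta'] = (df['click_time'] - g.shift(1)).dt.total_seconds()
    df['next_click_delta'] = (g.shift(-1) - df['click_time']).dt.total_seconds()

    # Mean and variance of hour per ip.
    df['ip_hour_mean'] = df.groupby('ip')['hour'].transform('mean')
    df['ip_hour_var'] = df.groupby('ip')['hour'].transform('var')

A brute-force version of this loops over every non-empty subset of {ip, app, device, os, channel} as the grouping key, which is where the 31 feature groups per aggregate type come from.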

Appropriate models for categorical features with large data
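
Gradient-boosted trees such as LightGBM can consume high-cardinality categorical features directly, without one-hot encoding a huge data set. The snippet below is a hedged illustration; the parameter values are assumptions, not this repo's tuned configuration.

    # Minimal illustration of passing categorical features directly to LightGBM.
    # Parameters are placeholders, not this repo's tuned settings.
    import lightgbm as lgb
    import pandas as pd

    df = pd.read_csv('data/train.csv', parse_dates=['click_time'])
    cat_cols = ['app', 'device', 'os', 'channel']      # ip is often dropped as noise
    df[cat_cols] = df[cat_cols].astype('category')

    dtrain = lgb.Dataset(df[cat_cols], label=df['is_attributed'],
                         categorical_feature=cat_cols)
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.1,
        'num_leaves': 31,
        'scale_pos_weight': 200,   # rough guess for the heavy class imbalance
    }
    booster = lgb.train(params, dtrain, num_boost_round=500)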

Extra slight boost from ensembling

  • most people ensemble their predictions based on the LB score

  • good practice in blending: average the logits of the predictions (i.e., the raw scores); a sketch follows below

  • restacking barely helps in this competition
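
A minimal sketch of logit-average blending (my own illustration; the submission file names are placeholders, and each CSV is assumed to follow the competition's click_id / is_attributed submission format):

    # Blend two submissions by averaging in logit space rather than probability space.
    import numpy as np
    import pandas as pd
    from scipy.special import logit, expit

    subs = [pd.read_csv(f) for f in ('sub_lgbm.csv', 'sub_nn.csv')]   # hypothetical files
    probs = np.stack([s['is_attributed'].clip(1e-6, 1 - 1e-6) for s in subs])

    blend = subs[0][['click_id']].copy()
    blend['is_attributed'] = expit(logit(probs).mean(axis=0))   # average logits, map back
    blend.to_csv('sub_blend.csv', index=False)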

Some baseline benchmarks from my observations (meant as rough ranking estimates)

  1. To be in the top 30%, use LightGBM alone, trained (without too much tuning) on some good features from public kernels.

  2. To be in the top 20%, use LightGBM alone, trained (with some proper tuning) on at least the top 40 features from public kernels (these must include time-delta, count, and unique-count aggregates over various feature groups).

  3. To be in the top 10%, you must have a beast machine (I am talking about at least 128 GB of RAM) and train models on a minimum of 100 proven-to-be-useful features, or use NN models based on 20+ aggregate-level features.

  4. To be in the top 5%, all of the above plus feature extraction (categorical feature embeddings) or FM-like algorithms.

  5. To be in the top 1%: this is really hard. Not sure how to do it.

Reference

[1] IP address encoding issues

[2] EDA by @Pranav Pandya

[3] FM_FTRL by @anttip

[4] Practical Lessons from Predicting Clicks on Ads at Facebook

[5] Ad Click Prediction: a View from the Trenches