Machine Learning

This is a repo for different machine learning tasks, most of them come from Kaggle competitions.

I wrote everything in Python3 using Jupyter Notebooks in an Anaconda3 environment.

Examples contain Code for both structured and unstructured data and are mostly for showcasing the code because the data I used was stored locally.

Notebooks that are somewhat cleaned

Whale Identification (one-shot learning, siamese network)

Protein Classification (Multilabel classification, resnet50)

NLP example

structured data example and feature engineering

plotting shenanigans

Style Transfer (for fun :D )

Most interesting competitions so far

Unstructured Data

Image

Predicting Molecular Properties

Seperate repo

Solution writeup

Kaggle competition: https://www.kaggle.com/c/champs-scalar-coupling

In this competition, you will develop an algorithm that can predict the magnetic interaction between two atoms in a molecule (i.e., the scalar coupling constant).

Dataformat	Metric	Prediction
structured data (graph based)	log mean average error	regression

Started working on this competition using lightgbm and then used a modified implementation of a message passing neural network.

Notes

A lot of additional data that is not usable directly because it's not contained in the test set.

Domain knowledge about atom interaction in molecules was really important (to a certain degree).

Most of the features were calculated using rdkit and openbabel.

local validation for message passing neural network

Per coupling type:

1JHC: -1.371
2JHC: -2.229
3JHC: -1.975
1JHN: -1.538
2JHN: -2.504
3JHN: -2.517
2JHH: -2.501
3JHH: -2.383

average local log mae: -2.12

Placement

top 2%

leaderboard	score	placement
public	-2.37190	43/2757
private	-2.36477	42/2757

Protein Classification Challenge

In this competition, you will develop models capable of classifying mixed patterns of proteins in microscope images. The Human Protein Atlas will use these models to build a tool integrated with their smart-microscopy system to identify a protein's location(s) from a high-throughput image.

Dataformat	Metric	Prediction
4 channel image	macro F1 Score	multi label classification

Placement

118/2172: top 5%

Focalloss worked way better than binary cross entropy

Started with resnet34 using fastai for multilabel-classification resnet50 worked even better (by about 0.05 macro F1-score)

Sartorius - Cell Instance Segmentation

In this competition, you’ll detect and delineate distinct objects of interest in biological images depicting neuronal cell types commonly used in the study of neurological disorders. More specifically, you'll use phase contrast microscopy images to train and test your model for instance segmentation of neuronal cells. Successful models will do this with a high level of accuracy.

Placement

224/1505: top 15%

Humpback Whale Identification

In this competition, you’re challenged to build an algorithm to identify individual whales in images. You’ll analyze Happywhale’s database of over 25,000 images, gathered from research institutions and public contributors. By contributing, you’ll help to open rich fields of understanding for marine mammal population dynamics around the globe.

Dataformat	Metric	Prediction
3 channel image	Mean Average Precision @ 5	single label classification (@ 5)

Placement

555/2131: top 26%

The greatest challenge for this competition was the lack of images for each label of humpback whale (1-20 different images) So I tried different kinds of one-shot learning algorithms like siamese networks with LAP matching of positive and negative examples.

In the end it turned out metric learning and siamese networks were indeed good approaches to the problem but time was running short.

Text

Jigsaw Unintended Bias in Toxicity Classification:

In this competition, you're challenged to build a model that recognizes toxicity and minimizes this type of unintended bias with respect to mentions of identities. You'll be using a dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias.

Dataformat	Metric	Prediction
text	generalized mean of bias AUCs	classification

First NLP competition I have joined and I still feel like I have to learn a lot in this space.

Used GloVe combined with a lstm + word embedding neural network.

Score

0.93568, 718/2646 placement

Ship Detection Challenge:

In this competition, you are required to locate ships in images, and put an aligned bounding box segment around the ships you locate. Many images do not contain ships, and those that do may contain multiple ships. Ships within and across images may differ in size (sometimes significantly) and be located in open sea, at docks, marinas, etc.

Dataformat	Metric	Prediction
3 channel image	F2 Score	binary segmentation

Score

public 0.70823, 208/884 placement

private 0.82704, 524/884 placement

Used fastai with resnet34 for image segmentation.

Big dropoff on private test set because I tried to select a part of the train set to reduce computing time but my selection method was lacking. Definitely will keep this mistake in mind for the future

Structured Data

Store Item Demand Forecasting Challenge

You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.

Dataformat	Metric	Prediction
time series data	symmetric mean absolute percentage error	regression

Score

Because of a leak the competition was reset in the last few weeks and I did not have the time to submit again.

Predicting Molecular Properties

In this competition, you will develop an algorithm that can predict the magnetic interaction between two atoms in a molecule (i.e., the scalar coupling constant).

Dataformat	Metric	Prediction
structured data (graph based)	log mean average error	regression

A lot of additional data that is not usable directly because it's not contained in the test set. Also domain knowledge about atom interaction in molecules seems really important.

Solved using lightgbm and a message passing neural network

Codebase

seperate github repo

Score

top 2%

leaderboard	score	placement
public	-2.37190	43/2757
private	-2.36477	42/2757

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
structured-data		structured-data
unstructured-data		unstructured-data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Machine Learning

Notebooks that are somewhat cleaned

Most interesting competitions so far

Unstructured Data

Image

Predicting Molecular Properties

Notes

local validation for message passing neural network

Per coupling type:

Placement

Protein Classification Challenge

Placement

Sartorius - Cell Instance Segmentation

Placement

Humpback Whale Identification

Placement

Text

Jigsaw Unintended Bias in Toxicity Classification:

Score

Ship Detection Challenge:

Score

Structured Data

Store Item Demand Forecasting Challenge

Score

Predicting Molecular Properties

Codebase

Score

this is my best finish so far

About

Releases

Packages

Languages

License

CozyDoomer/Machine-Learning

Folders and files

Latest commit

History

Repository files navigation

Machine Learning

Notebooks that are somewhat cleaned

Most interesting competitions so far

Unstructured Data

Image

Predicting Molecular Properties

Notes

local validation for message passing neural network

Per coupling type:

Placement

Protein Classification Challenge

Placement

Sartorius - Cell Instance Segmentation

Placement

Humpback Whale Identification

Placement

Text

Jigsaw Unintended Bias in Toxicity Classification:

Score

Ship Detection Challenge:

Score

Structured Data

Store Item Demand Forecasting Challenge

Score

Predicting Molecular Properties

Codebase

Score

this is my best finish so far

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages