Distributed Parallel Algorithm to predict the presence or absence of the Red-winged Blackbird in each birding session, with 83% accuracy
- Labelled data is split 80/20, one part for building the model and the other for validation, keeping only the required attributes.
- An ensemble algorithm builds a configurable number of Random Tree models; the final prediction is made by voting.
- Each Random Tree is built in parallel on a separate worker machine, with parameters tuned according to the suggestions of recent studies and our own experimental findings.
- Predictions run in parallel and scale with the number of records to be predicted (see the map-only sketch after this list).
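A minimal sketch of how the map-only voting prediction could look as a Hadoop mapper; the class name, the `ensemble.num.trees` property, the model file names, and the CSV parsing are illustrative assumptions, not the project's exact code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import weka.classifiers.Classifier;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;

public class PredictionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final List<Classifier> trees = new ArrayList<>();
    private Instances header; // attribute structure the trees were trained on

    @Override
    protected void setup(Context context) throws IOException {
        int n = context.getConfiguration().getInt("ensemble.num.trees", 10);
        try {
            // Model files are assumed to have been shipped to every worker
            // via job.addCacheFile(...) in the driver.
            for (int i = 0; i < n; i++) {
                trees.add((Classifier) SerializationHelper.read("tree-" + i + ".model"));
            }
            header = (Instances) SerializationHelper.read("header.ser");
        } catch (Exception e) {
            throw new IOException("could not load ensemble models", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            Instance record = parse(value.toString());
            int[] votes = new int[2]; // class indices assumed {absent, present}
            for (Classifier tree : trees) {
                votes[(int) tree.classifyInstance(record)]++;
            }
            String label = votes[1] > votes[0] ? "present" : "absent";
            context.write(new Text(value + "," + label), NullWritable.get());
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    // Builds a Weka instance from one comma-separated, numerically coded record.
    private Instance parse(String line) {
        String[] fields = line.split(",");
        double[] vals = new double[header.numAttributes()];
        for (int i = 0; i < fields.length && i < vals.length; i++) {
            vals[i] = Double.parseDouble(fields[i]);
        }
        Instance inst = new DenseInstance(1.0, vals);
        inst.setDataset(header);
        return inst;
    }
}
```

Because there is no reduce phase, each record is read and classified exactly once, which is what keeps the prediction job's I/O minimal.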
Three MapReduce jobs are implemented to achieve the following tasks:
- Job 1: preprocessing of the labelled data
- Job 1: building the model
- Job 2: validating the model and computing its accuracy
- Job 3: predictions
- Random shuffling of the data
- Preprocessing, data split, and modelling combined in a single job
- A partitioner is used for load distribution (see the partitioner sketch after this list)
- Attributes selected for modelling:
  - Latitude and longitude
  - Housing density at the sampling site
  - Elevation at the sampling site
  - Water bodies near or at the sampling site
- The data was also filtered to keep only distinct records, removing redundant observations from members of the same observing group
- Minimal I/O: map-only job
- Hadoop counters are used to compute the model's accuracy (see the counter sketch after this list)
- Scalable solution: can use up to “n” machines
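For the load-distribution bullet above, a minimal sketch of what such a partitioner could look like, assuming the Job 1 mapper keys each training record by the id of the tree it is destined for (the class name and key layout are assumptions):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads the N tree ids evenly over the reduce tasks so that every worker
// machine trains roughly the same number of trees on similar amounts of data.
public class TreePartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable treeId, Text record, int numPartitions) {
        return treeId.get() % numPartitions;
    }
}
```

The driver would register it with `job.setPartitionerClass(TreePartitioner.class)`; with one reducer per tree, the modulo becomes an identity mapping.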
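And for the accuracy bullet, a sketch of how Hadoop counters can tally correct predictions during validation (Job 2); the `Validation` enum and the assumed record layout are illustrative:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidationMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    public enum Validation { CORRECT, TOTAL }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Each record is assumed to end with ",actualLabel,predictedLabel",
        // the prediction having been appended by the ensemble vote.
        String[] f = value.toString().split(",");
        context.getCounter(Validation.TOTAL).increment(1);
        if (f[f.length - 1].equals(f[f.length - 2])) {
            context.getCounter(Validation.CORRECT).increment(1);
        }
    }
}
```

After `job.waitForCompletion(true)`, the driver can read the tallies back and print the accuracy:

```java
long correct = job.getCounters().findCounter(ValidationMapper.Validation.CORRECT).getValue();
long total   = job.getCounters().findCounter(ValidationMapper.Validation.TOTAL).getValue();
System.out.printf("Accuracy: %.1f%%%n", 100.0 * correct / total);
```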
Why a forest of Random Trees? Random Trees proved to be the right choice for unstable and skewed data like ours. Using a forest of Random Trees let us exploit the distributed parallel nature of MapReduce to train the models in parallel on the worker machines. Also, the accuracy we obtained using a forest of Random Trees with bootstrapping/bagging was the highest of all the models we explored. The approach achieves a lower test error solely through variance reduction, since each Random Tree is trained on a random bootstrap sample of the data and a random subset of the attributes.
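A minimal sketch of that bagging step, assuming `full` holds the training instances available to one worker; the helper name and seeding scheme are illustrative, and the random attribute subset per split is handled by the Random Tree itself (see the parameter sketch further down):

```java
import java.util.Random;
import weka.core.Instances;

// Draws a bootstrap sample: same size as the input, sampled with replacement,
// so each tree sees a different view of the data and their errors decorrelate.
public static Instances bootstrapSample(Instances full, long seed) {
    Random rng = new Random(seed);
    Instances bag = new Instances(full, full.numInstances()); // empty copy, same header
    for (int i = 0; i < full.numInstances(); i++) {
        bag.add(full.instance(rng.nextInt(full.numInstances())));
    }
    return bag;
}
```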
Deciding ‘N’, the number of models? After trying many values of ‘N’, we selected 10, as the run was efficient and predictions reached an accuracy of 83%.
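A sketch of how ‘N’ could stay configurable in the driver, assuming one reduce task builds one tree; the class and property names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BuildModelDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // N defaults to 10, the value that worked best for us.
        int n = conf.getInt("ensemble.num.trees", 10);
        Job job = Job.getInstance(conf, "build-random-trees");
        job.setNumReduceTasks(n); // one reduce task trains one Random Tree
        // ... input/output paths, mapper/reducer/partitioner classes ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```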
Parameters explored for Random Trees? After researching and experimenting, we found the following parameters to suit Random Trees best: minimum size of each node of a tree = 1; number of attributes randomly selected for building each tree = square root of the total number of attributes (√25 = 5); depth of tree = 0 (unlimited).
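Those settings map directly onto Weka's `RandomTree`; a sketch of building one tree with them (the `buildTree` helper and its arguments are illustrative):

```java
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;

// Trains one tree on its bootstrap sample with the tuned parameters above.
public static RandomTree buildTree(Instances bag, int seed) throws Exception {
    RandomTree tree = new RandomTree();
    tree.setMinNum(1.0);                                  // minimum size of each node = 1
    tree.setKValue((int) Math.sqrt(bag.numAttributes())); // sqrt(25) = 5 attributes per split
    tree.setMaxDepth(0);                                  // 0 = unbounded depth
    tree.setSeed(seed);                                   // vary per worker for diverse trees
    tree.buildClassifier(bag);
    return tree;
}
```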
Accuracy of the models we explored:
- Single Naive Bayes Weka model: 67%
- Single pre-defined Mboost Weka model: 74%
- Single Random Tree on 1/10th of the training data: 75%
- n Random Trees, each on 1/n of the 80% training split: 81% (n = 10), 78% (n = 100)
- 10 Random Trees with bagging/bootstrapping of the 80% training split: 83.7%