discomll

Disco Machine Learning Library (discomll) is a Python package for machine learning with the MapReduce paradigm. It runs on the Disco framework for distributed computing. discomll is suited to the analysis of large datasets and offers classification, regression, and clustering algorithms.

Algorithms

Classification algorithms

naive Bayes - discrete and continuous features,
linear SVM - continuous features, binary target,
logistic regression - continuous features, binary target,
forest of distributed decision trees - discrete and continuous features,
distributed random forest - discrete and continuous features,
distributed weighted forest (experimental) - discrete and continuous features,
distributed weighted forest rand (experimental) - discrete and continuous features.
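To illustrate how a classifier from the list above fits the MapReduce paradigm, here is a minimal sketch of naive Bayes for discrete features. It is not discomll's implementation or API: `map_counts`, `reduce_counts`, `predict`, and the toy data are hypothetical names and values chosen for illustration, and the partitions are plain Python lists rather than Disco inputs.

```python
from collections import Counter
from math import log

def map_counts(partition):
    """Map step: count labels and (feature index, value, label) triples in one partition."""
    label_counts, feature_counts = Counter(), Counter()
    for features, label in partition:
        label_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    return label_counts, feature_counts

def reduce_counts(mapped):
    """Reduce step: counts are additive, so the partial results are summed."""
    label_counts, feature_counts = Counter(), Counter()
    for part_labels, part_features in mapped:
        label_counts.update(part_labels)
        feature_counts.update(part_features)
    return label_counts, feature_counts

def predict(model, features, n_values=2):
    """Return the label with the highest log-posterior, using Laplace smoothing.

    n_values is the assumed number of distinct values per feature.
    """
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = log(count / total)
        for i, value in enumerate(features):
            score += log((feature_counts[(i, value, label)] + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Two partitions standing in for data chunks held by different workers (toy data).
partitions = [
    [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")],
    [(("sunny", "mild"), "no"), (("rainy", "hot"), "yes")],
]
model = reduce_counts(map_counts(p) for p in partitions)
prediction = predict(model, ("sunny", "hot"))
```

Because the model consists only of counts, each worker needs a single pass over its own partition, and the reduce step does not depend on how the data was split.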

Clustering algorithms

k-means - continuous features.
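As a rough illustration of k-means in MapReduce form, the sketch below has each partition report per-cluster coordinate sums and point counts, and a reduce step that turns them into new centers. The function names, the fixed iteration count, and the toy points are assumptions made for this example; discomll's own implementation may differ.

```python
def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))

def map_partition(partition, centers):
    """Map step: per-cluster coordinate sums and point counts for one partition."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in partition:
        k = nearest(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def reduce_centers(mapped, centers):
    """Reduce step: add the partial sums, then divide to get the new centers."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for part_sums, part_counts in mapped:
        for k in range(len(centers)):
            counts[k] += part_counts[k]
            for d in range(dim):
                sums[k][d] += part_sums[k][d]
    # A cluster that received no points keeps its previous center.
    return [[s / counts[k] for s in sums[k]] if counts[k] else list(centers[k])
            for k in range(len(centers))]

def kmeans(partitions, centers, iterations=10):
    """Run a fixed number of map/reduce rounds."""
    for _ in range(iterations):
        mapped = [map_partition(p, centers) for p in partitions]
        centers = reduce_centers(mapped, centers)
    return centers

# Toy data: two partitions, two initial centers.
partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
centers = kmeans(partitions, [(1.0, 1.0), (9.0, 9.0)])
```

Only the sums and counts travel between the map and reduce steps, so the amount of data exchanged per round depends on the number of clusters, not on the number of points.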

Regression algorithms

linear regression - continuous features, continuous target,
locally weighted linear regression - continuous features, continuous target.
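Linear regression is a natural fit for MapReduce because the least-squares solution depends only on sums over the data. The sketch below shows this for a single continuous feature with an intercept; `map_stats`, `reduce_fit`, and the toy points are hypothetical and are not taken from discomll.

```python
def map_stats(partition):
    """Map step: sufficient statistics (n, sum x, sum y, sum xy, sum x^2) for one partition."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in partition:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return n, sx, sy, sxy, sxx

def reduce_fit(mapped):
    """Reduce step: sum the statistics and solve the normal equations."""
    n, sx, sy, sxy, sxx = (sum(values) for values in zip(*mapped))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Toy data lying exactly on y = 2x + 1, split across two partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = reduce_fit([map_stats(p) for p in partitions])
```

With several features the same idea applies: each partition contributes its part of the matrices X^T X and X^T y, and the reduce step solves the resulting linear system once.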

Utilities

evaluation of the accuracy,
distribution views,
model views.

Features of discomll

discomll works with the following data sources:

datasets on the Disco Distributed File System,
text or gzipped datasets accessible via a file server.

discomll enables multiple settings for a dataset:

multiple data sources,
feature selection,
feature type specification,
parsing of data,
handling of missing values.
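The dataset settings listed above can be pictured as parameters of a line parser. The following sketch is an assumption about what such a parser could look like, not discomll's actual interface: `parse_line`, its parameter names, the "c"/"d" type codes, and the missing-value symbols are all hypothetical.

```python
def parse_line(line, feature_indices, feature_types, target_index,
               delimiter=",", missing_values=("", "?", "NA")):
    """Return (features, target) for one text line, or None if the line is unusable.

    feature_indices selects the columns to use (feature selection).
    feature_types gives "c" (continuous) or "d" (discrete) per selected column.
    A missing feature value is returned as None.
    """
    fields = [field.strip() for field in line.rstrip("\n").split(delimiter)]
    needed = max(list(feature_indices) + [target_index])
    if needed >= len(fields) or fields[target_index] in missing_values:
        return None  # too few columns, or no target value
    features = []
    for index, kind in zip(feature_indices, feature_types):
        raw = fields[index]
        if raw in missing_values:
            features.append(None)
        elif kind == "c":
            features.append(float(raw))
        else:
            features.append(raw)
    return features, fields[target_index]

# Columns 0 and 1 are continuous, column 2 is discrete, column 3 is the target.
row = parse_line("5.1,?,red,yes\n", [0, 1, 2], ["c", "c", "d"], 3)
```

Here `row` holds the parsed features with `None` in place of the missing second value, together with the target `"yes"`.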

Installing

Prerequisites

Disco 0.5.4,
numpy should be installed on all worker nodes,
orange and scikit-learn are used in unit tests.

pip install discomll

Performance analysis

In the performance analysis, we compare the speed and accuracy of discomll algorithms with scikit-learn and KNIME. We measure the speedups of discomll algorithms with 1, 3, 6, and 9 Disco workers.
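For readers unfamiliar with the term, speedup is conventionally the running time with one worker divided by the running time with n workers. The helper below is a small sketch of that definition; the function names are hypothetical, and the timings in the usage lines are placeholders, not measurements from the analysis.

```python
def speedup(time_one_worker, time_n_workers):
    """Speedup: how many times faster the run is than the single-worker run."""
    return time_one_worker / time_n_workers

def efficiency(time_one_worker, time_n_workers, n_workers):
    """Parallel efficiency: speedup divided by the number of workers (1.0 is ideal)."""
    return speedup(time_one_worker, time_n_workers) / n_workers

# Placeholder timings in seconds, for illustration only.
s = speedup(120.0, 60.0)
e = efficiency(120.0, 60.0, 3)
```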

Performance analysis 2

In the second performance analysis, we compare the accuracy of distributed ensemble algorithms with scikit-learn algorithms. We train the model on the whole dataset with the distributed algorithms and on a subset with the single-core algorithms. We show that distributed ensembles achieve accuracy similar to that of single-core algorithms.
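One common way to build a distributed ensemble is to train a model on every data partition and combine the predictions by majority vote. The sketch below illustrates that idea only; it is an assumption about the general approach, not a description of discomll's ensembles. A one-feature threshold rule (`train_stump`) stands in for a decision tree, and all names and data are hypothetical.

```python
from collections import Counter

def train_stump(partition):
    """Fit a one-feature threshold rule on one partition (a stand-in for a decision tree)."""
    best = None
    for threshold in sorted({x for x, _ in partition}):
        left = Counter(label for x, label in partition if x <= threshold)
        right = Counter(label for x, label in partition if x > threshold)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0] if right else left_label
        errors = sum(1 for x, label in partition
                     if label != (left_label if x <= threshold else right_label))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def predict_majority(models, x):
    """Combine the per-partition models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three partitions standing in for data held by three workers (toy data).
partitions = [
    [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")],
    [(1.5, "a"), (3.0, "a"), (7.0, "b")],
    [(2.5, "a"), (6.0, "b"), (9.5, "b")],
]
models = [train_stump(p) for p in partitions]
label = predict_majority(models, 5.0)
```

Each model sees only its own partition, so training needs no communication between workers; only the fitted models are collected for prediction.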

Try it now

You can try discomll algorithms on the ClowdFlows platform. ClowdFlows is an open-source, cloud-based platform for composing, executing, and sharing interactive machine learning and data mining workflows. For instructions, see the User Guide.

Public workflows:

Release notes

version 0.1.4.2 (Released 18/oct/2015)

model view bug fixes for ensembles,
ensembles missing values support.

version 0.1.4.1 (Released 17/oct/2015)

model view fixed for ensembles,
bug fixes in examples and tests.

version 0.1.4 (Released 11/oct/2015)

distributed weighted forest Rand was added; the algorithm is similar to distributed weighted forest, but it uses randomly selected medoids,
improvements of algorithms, especially ensembles,
performance analysis 2.
