Disco Machine Learning Library (discomll) is a Python package for machine learning with the MapReduce paradigm. It works with the Disco framework for distributed computing. discomll is suited to the analysis of large datasets and offers classification, regression and clustering algorithms.
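To illustrate the MapReduce paradigm mentioned above, here is a minimal single-process sketch in plain Python. It is not discomll code and does not use Disco: the function names `map_partition` and `reduce_counts` and the toy data are assumptions made for illustration. Each partition is counted independently (map), and the partial counts are then merged (reduce), which is the kind of statistic a naive Bayes classifier with discrete features needs.

```python
from collections import Counter
from functools import reduce


def map_partition(rows):
    """Map step: count class labels and (label, feature index, value) triples
    in a single data partition."""
    counts = Counter()
    for features, label in rows:
        counts[("class", label)] += 1
        for index, value in enumerate(features):
            counts[(label, index, value)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge the partial counts of two partitions."""
    return left + right


# Two hypothetical partitions, as if they were stored on different worker nodes.
partitions = [
    [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")],
    [(("sunny", "mild"), "yes"), (("rainy", "hot"), "no")],
]

# In a real cluster the map calls would run in parallel on the workers.
total = reduce(reduce_counts, map(map_partition, partitions), Counter())
```

Because the counts are additive, the result does not depend on how the rows are split into partitions.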
Classification algorithms
- naive Bayes - discrete and continuous features,
- linear SVM - continuous features, binary target,
- logistic regression - continuous features, binary target,
- forest of distributed decision trees - discrete and continuous features,
- distributed random forest - discrete and continuous features,
- distributed weighted forest (experimental) - discrete and continuous features,
- distributed weighted forest rand (experimental) - discrete and continuous features.
Clustering algorithms
- k-means - continuous features.
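As a rough illustration of how k-means fits the MapReduce model, the sketch below runs one iteration in plain Python. It is an assumption-laden example, not the discomll implementation: `map_partition`, `reduce_partials`, the toy points and the starting centers are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_partition(points, centers):
    """Map step: assign each point to its nearest center and keep
    per-center coordinate sums and counts for this partition."""
    partial = {}
    for point in points:
        nearest = min(range(len(centers)),
                      key=lambda c: squared_distance(point, centers[c]))
        total, count = partial.get(nearest, ([0.0] * len(point), 0))
        partial[nearest] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial


def reduce_partials(partials, centers):
    """Reduce step: merge the partial sums and compute the new centers.
    A center that received no points keeps its old position."""
    merged = {}
    for partial in partials:
        for c, (total, count) in partial.items():
            m_total, m_count = merged.get(c, ([0.0] * len(total), 0))
            merged[c] = ([a + b for a, b in zip(m_total, total)], m_count + count)
    new_centers = []
    for c in range(len(centers)):
        if c in merged:
            total, count = merged[c]
            new_centers.append([t / count for t in total])
        else:
            new_centers.append(list(centers[c]))
    return new_centers


# Hypothetical partitions and starting centers.
partitions = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
centers = [(1, 1), (9, 9)]

partials = [map_partition(part, centers) for part in partitions]
new_centers = reduce_partials(partials, centers)
```

Repeating the map and reduce steps with the updated centers until they stop moving gives the usual k-means loop.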
Regression algorithms
- linear regression - continuous features, continuous target,
- locally weighted linear regression - continuous features, continuous target.
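Linear regression can be computed from sums that are accumulated per partition and then added together. The sketch below shows this for a single feature in plain Python; it is a simplified illustration under that assumption, not the discomll implementation, and the names `map_partition`, `reduce_stats` and `solve` are hypothetical.

```python
from functools import reduce


def map_partition(rows):
    """Map step: accumulate the sufficient statistics of one partition."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def reduce_stats(left, right):
    """Reduce step: add the statistics of two partitions element-wise."""
    return tuple(a + b for a, b in zip(left, right))


def solve(stats):
    """Solve the least-squares normal equations for y = intercept + slope * x."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical partitions sampled from y = 2x + 1.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]

stats = reduce(reduce_stats, map(map_partition, partitions))
intercept, slope = solve(stats)
```

Only five numbers per partition travel to the reduce step, regardless of how many rows each worker processes.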
Utilities
- evaluation of the accuracy,
- distribution views,
- model views.
discomll works with the following data sources:
- datasets on the Disco Distributed File System,
- text or gzipped datasets accessible via a file server.
discomll supports several settings for a dataset:
- multiple data sources,
- feature selection,
- feature type specification,
- parsing of data,
- handling of missing values.
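The settings listed above can be pictured with a small parsing sketch in plain Python. This is not the discomll dataset API: the function `parse_line`, its parameters and the policy of skipping rows with missing values are assumptions chosen for illustration.

```python
def parse_line(line, x_indices, y_index, delimiter=",", missing_values=("?", "")):
    """Parse one text line into (features, target).

    x_indices      -- columns used as features (feature selection)
    y_index        -- column holding the target
    delimiter      -- field separator (parsing of data)
    missing_values -- symbols treated as missing; such rows are skipped here
    """
    fields = [field.strip() for field in line.rstrip("\n").split(delimiter)]
    used = list(x_indices) + [y_index]
    if any(fields[i] in missing_values for i in used):
        return None  # assumed policy: drop rows with a missing used value
    features = [float(fields[i]) for i in x_indices]  # continuous features
    return features, fields[y_index]


# Hypothetical rows: an id, three measurements and a class label.
complete = parse_line("7,1.5,?,2.0,yes", x_indices=[1, 3], y_index=4)
incomplete = parse_line("8,?,0.1,2.0,no", x_indices=[1, 3], y_index=4)
```

In the first row the missing value sits in a column that is not selected, so the row is kept; in the second row it sits in a selected feature, so the row is skipped.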
Prerequisites
- Disco 0.5.4,
- numpy should be installed on all worker nodes,
- orange and scikit-learn are used in unit tests.
pip install discomll
In the first performance analysis, we compare the speed and accuracy of discomll algorithms with scikit-learn and Knime. We measure the speedups of discomll algorithms with 1, 3, 6 and 9 Disco workers.
In the second performance analysis, we compare the accuracy of distributed ensemble algorithms with scikit-learn algorithms. We train the model on the whole dataset with distributed algorithms and on a subset with single-core algorithms. We show that distributed ensembles achieve accuracy similar to that of single-core algorithms.
You can try discomll algorithms on the ClowdFlows platform. ClowdFlows is an open-source, cloud-based platform for composition, execution, and sharing of interactive machine learning and data mining workflows. For instructions, see the User Guide.
Public workflows:
- naive Bayes - lymphography dataset,
- naive Bayes - segmentation dataset,
- logistic regression - sonar dataset,
- logistic regression - ionosphere dataset,
- linear SVM - sonar dataset,
- linear SVM - ionosphere dataset,
- forest of distributed decision trees - lymphography dataset,
- forest of distributed decision trees - segmentation dataset,
- distributed random forest - lymphography dataset,
- distributed random forest - segmentation dataset,
- distributed weighted forest rand - lymphography dataset,
- distributed weighted forest rand - segmentation dataset,
- k-means - linear dataset,
- k-means - segmentation dataset,
- linear regression - linear dataset,
- linear regression - fraction dataset.
- model view bug fixes for ensembles,
- ensembles missing values support.
- model view fixed for ensembles,
- bug fixes in examples and tests.
- distributed weighted forest rand was added. The algorithm is similar to distributed weighted forest, but it uses randomly selected medoids.
- improvements of algorithms, especially ensembles,
- performance analysis 2.