Skip to content
/ maplearn Public

Parallelization of various popular machine learning tools in R for use in Hadoop

Notifications You must be signed in to change notification settings

oeo4b/maplearn

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 

Repository files navigation

"maplearn" is a library of classifier/regression tools to be used with the Hadoop Streaming utility in Apache Hadoop.
Note: All scripts take space-delimited files as input. 

linear
Description: Simple linear regression uses ordinary least squares estimates to fit parameters. 
Usage: Reducer returns keys {A, b} where  => (X'X)^-1 * X' * y = A^-1 * b = parameter estimates

naivebayes
Description: Naive bayes classifier assumes a non-singular, linearly-independent design matrix and discrete response/predictor variables. 
Usage: Reducer returns a-priori (such as - P(y = yk)) and conditional (such as - P(xi = a | y = yk)) probabilities. Predictions can be made by applying Baye's law and the naive assumptions made, P(Y) = P(Y = yk) * P(X = xi | Y = yk)....P(X = xn | Y = yk). Maximization over the the posterior probabilities reveals the most likely outcome. 

kmeans
Description: The k-means clustering algorithm iteratively minimizes the sum of squares for K cluster centroids. Parallelization is achieved by independently clustering each subgroup and using the partial sums to calculate the weighted averages.
Usage: Reducer returns centroid features for each K cluster. Starting centers must be stored in the space-delimited 'R/kmeans/centers' file - the number of clusters are determined by the dimensions of the matrix. 

About

Parallelization of various popular machine learning tools in R for use in Hadoop

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published