GitHub - oeo4b/maplearn: Parallelization of various popular machine learning tools in R for use in Hadoop

"maplearn" is a library of classifier/regression tools to be used with the Hadoop Streaming utility in Apache Hadoop.
Note: All scripts take space-delimited files as input. 

linear
Description: Simple linear regression uses ordinary least squares estimates to fit parameters. 
Usage: Reducer returns keys {A, b}, where A = X'X and b = X'y, so that (X'X)^-1 * X' * y = A^-1 * b gives the parameter estimates.
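
The reason A and b parallelize well is that both are sums over rows, so each chunk of the input can compute its own partial A and b and the reducer only has to add them up. Below is a minimal sketch of that idea in Python, not the repository's R scripts. The function names (partial_sums, combine, solve) and the column layout (predictors first, response last, with an explicit intercept column of ones) are assumptions made for illustration.

```python
def partial_sums(rows):
    """Map step: accumulate A = X'X and b = X'y for one chunk of rows.
    Assumed layout: predictors first, response last."""
    p = len(rows[0]) - 1
    A = [[0.0] * p for _ in range(p)]
    b = [0.0] * p
    for row in rows:
        x, y = row[:-1], row[-1]
        for i in range(p):
            b[i] += x[i] * y
            for j in range(p):
                A[i][j] += x[i] * x[j]
    return A, b


def combine(parts):
    """Reduce step: element-wise sum of the per-chunk (A, b) pairs."""
    parts = list(parts)
    p = len(parts[0][1])
    A = [[sum(part[0][i][j] for part in parts) for j in range(p)]
         for i in range(p)]
    b = [sum(part[1][i] for part in parts) for i in range(p)]
    return A, b


def solve(A, b):
    """Solve A * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (M[i][n] - s) / M[i][i]
    return beta


# Two chunks of rows: intercept column, x, then y = 1 + 2x (made-up data).
chunk1 = [[1.0, 0.0, 1.0], [1.0, 1.0, 3.0]]
chunk2 = [[1.0, 2.0, 5.0], [1.0, 3.0, 7.0]]
A, b = combine([partial_sums(chunk1), partial_sums(chunk2)])
beta = solve(A, b)  # parameter estimates A^-1 * b
```

Solving the linear system directly avoids forming A^-1 explicitly, which is usually the more stable choice; the result is the same parameter vector.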

naivebayes
Description: The naive Bayes classifier assumes discrete response/predictor variables and predictors that are conditionally independent given the class.
Usage: Reducer returns prior probabilities (such as P(y = yk)) and conditional probabilities (such as P(xi = a | y = yk)). Predictions can be made by applying Bayes' law together with the naive independence assumption: P(Y = yk | X) is proportional to P(Y = yk) * P(x1 = a1 | Y = yk) * ... * P(xn = an | Y = yk). Maximizing over the posterior probabilities gives the most likely outcome.
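
Both kinds of probability are ratios of counts, so each chunk can count independently and the reducer can merge the counts before dividing. The following Python sketch illustrates this; it is not the repository's R code. The function names, the row layout (predictors first, class label last) and the toy data are assumptions, and no smoothing is applied, so an unseen value/class pair gets probability zero.

```python
from collections import Counter


def count_chunk(rows):
    """Map step: count classes and (feature index, value, class) triples.
    Assumed layout: discrete predictors first, class label last."""
    class_counts, cond_counts = Counter(), Counter()
    for row in rows:
        y = row[-1]
        class_counts[y] += 1
        for i, a in enumerate(row[:-1]):
            cond_counts[(i, a, y)] += 1
    return class_counts, cond_counts


def probabilities(parts):
    """Reduce step: merge counts, then form P(y) and P(xi = a | y)."""
    class_counts, cond_counts = Counter(), Counter()
    for cc, xc in parts:
        class_counts.update(cc)
        cond_counts.update(xc)
    n = sum(class_counts.values())
    prior = {y: c / n for y, c in class_counts.items()}
    cond = {k: c / class_counts[k[2]] for k, c in cond_counts.items()}
    return prior, cond


def predict(x, prior, cond):
    """Return the class maximizing P(y) * prod_i P(xi | y)."""
    def score(y):
        s = prior[y]
        for i, a in enumerate(x):
            s *= cond.get((i, a, y), 0.0)  # unseen pair: probability 0
        return s
    return max(prior, key=score)


# Made-up data split across two chunks.
chunk1 = [["sunny", "hot", "no"], ["rainy", "cool", "yes"]]
chunk2 = [["sunny", "cool", "yes"], ["rainy", "cool", "yes"],
          ["sunny", "hot", "no"]]
prior, cond = probabilities([count_chunk(chunk1), count_chunk(chunk2)])
label = predict(["rainy", "cool"], prior, cond)
```

The normalizing constant P(X) is the same for every class, so it can be dropped when only the most likely class is needed.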

kmeans
Description: The k-means clustering algorithm iteratively minimizes the within-cluster sum of squares around K cluster centroids. Parallelization is achieved by clustering each subgroup of the data independently and combining the partial sums into weighted averages for the centroids.
Usage: Reducer returns the centroid features for each of the K clusters. Starting centers must be stored in the space-delimited 'R/kmeans/centers' file; the number of clusters is determined by the dimensions of that matrix.
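
One way to read the partial-sum scheme described above is sketched below in Python for a single iteration. This is an interpretation for illustration, not the repository's R implementation. The function names, the in-memory centers list (standing in for the contents of the centers file) and the toy points are all assumptions.

```python
def assign_chunk(points, centers):
    """Map step: for one chunk, sum the points assigned to each nearest
    center and count how many points each center received."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        dists = [sum((a - c) ** 2 for a, c in zip(p, ctr)) for ctr in centers]
        k = dists.index(min(dists))  # index of the nearest center
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return sums, counts


def update_centers(parts, centers):
    """Reduce step: combine partial sums and counts into new centroids.
    A cluster with no assigned points keeps its previous center."""
    parts = list(parts)
    new_centers = []
    for k, old in enumerate(centers):
        total = sum(counts[k] for _, counts in parts)
        if total == 0:
            new_centers.append(list(old))
            continue
        new_centers.append([sum(sums[k][d] for sums, _ in parts) / total
                            for d in range(len(old))])
    return new_centers


# Hypothetical starting centers (one row per cluster) and two data chunks.
centers = [[0.0, 0.0], [10.0, 10.0]]
chunk1 = [[1.0, 1.0], [9.0, 9.0]]
chunk2 = [[3.0, 1.0], [11.0, 9.0], [10.0, 12.0]]
parts = [assign_chunk(chunk1, centers), assign_chunk(chunk2, centers)]
new_centers = update_centers(parts, centers)
```

Dividing the summed coordinates by the summed counts is the same as averaging the per-chunk means weighted by their counts. A full run would repeat the two steps, feeding new_centers back in, until the centers stop moving.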