-
Notifications
You must be signed in to change notification settings - Fork 1
oeo4b/maplearn
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
"maplearn" is a library of classifier/regression tools to be used with the Hadoop Streaming utility in Apache Hadoop. Note: All scripts take space-delimited files as input. linear Description: Simple linear regression uses ordinary least squares estimates to fit parameters. Usage: Reducer returns keys {A, b} where => (X'X)^-1 * X' * y = A^-1 * b = parameter estimates naivebayes Description: Naive bayes classifier assumes a non-singular, linearly-independent design matrix and discrete response/predictor variables. Usage: Reducer returns a-priori (such as - P(y = yk)) and conditional (such as - P(xi = a | y = yk)) probabilities. Predictions can be made by applying Baye's law and the naive assumptions made, P(Y) = P(Y = yk) * P(X = xi | Y = yk)....P(X = xn | Y = yk). Maximization over the the posterior probabilities reveals the most likely outcome. kmeans Description: The k-means clustering algorithm iteratively minimizes the sum of squares for K cluster centroids. Parallelization is achieved by independently clustering each subgroup and using the partial sums to calculate the weighted averages. Usage: Reducer returns centroid features for each K cluster. Starting centers must be stored in the space-delimited 'R/kmeans/centers' file - the number of clusters are determined by the dimensions of the matrix.
About
Parallelization of various popular machine learning tools in R for use in Hadoop
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published