-
Notifications
You must be signed in to change notification settings - Fork 12
Home
To be updated
distributedWekaSpark Options/command line parameters:
-task (task descriptor) possible values: buildHeaders , buildClassifier , buildClassifierEvaluation, buildFoldBasedClassifier , buildFoldBasedClassifierEvaluation , buildClusterer , findAssociationRules default:No (Exception)
-dataset-type (dataset type to use) possible values: Instances , ArrayInstance , ArrayString default: ArrayString
-caching (Storage Level to use) possible values : All Spark supported ex: MEMORY_ONLY default: Cashing Strategy Selection Algorithm
-compress (compress serialized rdds to save memory) possible values: y (or none) default: none (do not compress)
-kryo (use kryo serializer) possible values : y (or none) default: none (do not use kryo)
-caching-fraction (executor caching fraction) possible values: 0.1-1.0 default: 0.6
-overhead (dataset overhead of java objects (how much larger is the Object than the raw data)) possible values: any double default: 5.0
-hdfs-dataset-path (path to the dataset on HDFS) possible values: hdfs://(host):(port)/user/username/dataset.csv default:No (Exception)
-hdfs-names-path: (path to a names files on HDFS, names must be in a comma delimited format) possible values : hdfs://(host):(port)/user/username/names.txt default: Will try to compute att0,att1 etc
-hdfs-headers-path (path to pre-built headers for the dataset) possible values: hdfs://(host):(port)/user/username/someheaders.arff default: Will try to compute headers
-hdfs-classifier-path (path to trained classifier) possible values : hdfs://(host):(port)/user/username/someclassfier.model default: Will try to build classifier
-hdfs-output-path (path to an HDFS folder where generated models will be saved) possible values: hdfs://(host):(port)/user/username/somefolder/ default: No (will not save)
-num-partitions (number of partitions) possible values: Any Integer default: Spark default
-num-random-chunks (number of randomized/stratified partitions) possible values: any Integer default: No stratification/ randomisation
-num-of-attributes (number of attributes in the dataset) possible values: any Integer default: will try to compute from the dataset if no headers provided
-class-index (index of the class attribute) possible values: any Integer [0,num-of-attributes-1] default: num-of-attributes-1
-num-folds (number of folds) possible values: Any Integer default:1
-names (attribute names) possible values: a comma delimited list of names default:Will produce a list att0,att1 etc
-classifier (the name and package path of the classifier to use) possible values: weka.classifier.bayes.NaiveBayes (any weka core) default: NaiveBayes
-meta (meta learner to use) possible values: weka.classifiers.meta.Bagging (any weka core) default:None (WekaClassifierReduceTask default)
-rule-learner (association rule learner to use) posible values: weka.associations.Apriori and weka.associations.FPGrowth only default: FPGrowth
-clusterer (clusterer to use) possible value : weka.clusterer.Canopy only. default: Canopy
-num-clusters (number of clusters to find) possible values: any Integer default: 0 (the task will try to set a number of the clusterer supports auto-config)
-parser-options (options for the csv parser) possible values: "-N first-last etc.. " default: weka defaults based on the tasks DO NOT FORGET " " when grouping parameters
-weka-options (options for weka algorithms (base tasks and core algorithms)) possible values: " -depth 3 etc " default: weka defaults based on the tasks DO NOT FORGET " " when grouping parameters
Association rules
bin/spark-submit --master spark://ec2-...compute-1.amazonaws.com:7077 \
--class uk.ac.manchester.ariskk.distributedWekaSpark.main.distributedWekaSpark \
--executor-memory 28G \
/root/spark/distributedWekaSpark-0.1.0-SNAPSHOT.jar \
-task findAssociationRules -dataset-type ArrayInstance \
-hdfs-dataset-path hdfs://ec2-.....compute-1.amazonaws.com:9000/somedataset.csv -num-of-attributes 300 \
-parser-options " -N first-last " -weka-options "-T 0 -N 10 -C 0.87 -M 0.10 " \
-gen-header y -caching MEMORY_ONLY -compress y
Classifier Training
bin/spark-submit --master spark://ec2-.....compute-1.amazonaws.com:7077 \
--class uk.ac.manchester.ariskk.distributedWekaSpark.main.distributedWekaSpark \
--executor-memory 13G \
/root/spark/distributedWekaSpark-0.1.0-SNAPSHOT.jar \
-task buildClassifier -dataset-type ArrayString \
-hdfs-dataset-path hdfs://ec2-......amazonaws.com:9000/somedataset.csv -num-of-partitions 800 -num-of-attributes 29 -class-index 28 \
-classifier weka.classifiers.functions.LinearRegression -parser-options " -no-summary-stats " \
-caching MEMORY_ONLY