Author of the original version of PARC autoML: Raj Minhas (PARC). Contributors: Dan Davies (PARC).
This version is being tested with Python 3.5.
It is being tested on 64-bit Windows and 64-bit Ubuntu Linux. autoML.py should also work on 64-bit macOS
and on 32-bit Windows, Linux, and macOS. ARX, a Java program, has been modified to work with autoML for de-identifying data with privacy concerns. Unless the -s or --privatize_data flag is set to "none", Java must be installed for autoML.py to work.
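Because ARX is invoked through Java, it can help to confirm that a `java` executable is on the PATH before running with privatization enabled. A minimal standalone check (not part of autoML itself):

```python
import shutil

def java_available():
    """Return True if a `java` executable is on the PATH.

    autoML needs Java unless -s/--privatize_data is set to "none".
    """
    return shutil.which("java") is not None

print(java_available())
```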
The required Python packages include NumPy, SciPy, Cython, Pylab, and py-earth. Installing these
packages can be tricky on some systems. Here is a process that has usually worked on Linux
(doing these steps in order is important):
NumPy: sudo apt-get install python3-numpy
SciPy: sudo apt-get install python3-scipy
cython: sudo pip3 install cython
Pylab: sudo pip3 install matplotlib (pylab is part of matplotlib)
Py-earth: go into the py-earth directory of your clone of this GitHub repository and execute 'sudo python3 setup.py install'
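After installing, a quick way to verify that the packages are importable (a standalone sketch; `pyearth` is the module name the py-earth package installs, and matplotlib provides pylab):

```python
import importlib

def check_packages(names):
    """Map each package name to True if it can be imported, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

# These correspond to the installation steps above.
print(check_packages(["numpy", "scipy", "matplotlib", "pyearth"]))
```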
The above steps are currently being tested on macOS.
For Windows, I am using conda/Anaconda. Do the installations in the order above.
The following packages get automatically installed if you are using Anaconda, but if not you may also need:
sudo pip3 install Pillow
sudo pip3 install cov-core
sudo pip3 install nose2
sudo apt-get install python3-tk
See Python_Windows_List_of_Packages.txt for the list of packages in a working Windows environment. (Not all packages are needed for autoML, but those that are have their version listed.)
This is an extension of the autoML library built by Raj Minhas at PARC.
autoML has features added for de-identifying data with privacy concerns.
A sample can be run by executing
python autoML.py -i INPUT_FILE (if run without the -i argument, it runs on the time-series data in data.csv)
```
usage: autoML.py [-h] [-m MODEL_TYPE] [-i INPUT_FILE] [-file2 SECONDARY_FILE]
                 [-w SECONDARY_WEIGHTS] [-d DISTANCEFN] [-sprs SPARSITY]
                 [-r RADIUS] [-t MAX_TIME] [-n MAX_ITERATIONS] [-pca N_PCA]
                 [-e N_EXPERTS] [-s {none,manual,auto,manual_both,auto_both}]
                 [-f HIERARCHY_FOLDER]

A simple interface to autoML (lite, multiData)

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_TYPE, --model_type MODEL_TYPE
                        choose classification (default), regression, or
                        clustering
  -i INPUT_FILE, --input_file INPUT_FILE
                        primary input file to be analyzed (default=data.csv)
  -file2 SECONDARY_FILE, --secondary_file SECONDARY_FILE
                        optional secondary input file, triggers multi-dataset
                        analysis (default=None)
  -w SECONDARY_WEIGHTS, --secondary_weights SECONDARY_WEIGHTS
                        weights for features in the secondary file
                        (default=False)
  -d DISTANCEFN, --distanceFn DISTANCEFN
                        choose L_1Norm(n), euclidean(n), L_infinityNorm(n),
                        distanceOnEarth(n), L_1Norm_cat(n), or
                        L_infinityNorm_cat(n), where n=1,2,3,... is the chosen
                        dimension for calculating distances
                        (default=L_1Norm(1))
  -sprs SPARSITY, --sparsity SPARSITY
                        sparsity threshold for including records in the
                        secondary input file
  -r RADIUS, --radius RADIUS
                        radius for cutoff of the distance function (default=1)
  -t MAX_TIME, --max_time MAX_TIME
                        maximum time in seconds for training all models
                        (default=1440)
  -n MAX_ITERATIONS, --max_iterations MAX_ITERATIONS
                        max iterations for each individual model fit (default
                        is 10 for clustering, 100 for classification, and 100
                        for regression)
  -pca N_PCA, --n_pca N_PCA
                        number of PCA components (default=2)
  -e N_EXPERTS, --n_experts N_EXPERTS
                        number of experts for ensemble scoring (default=5)
  -s {none,manual,auto,manual_both,auto_both}, --privatize_data {none,manual,auto,manual_both,auto_both}
                        choose none, manual, or auto for privatization of the
                        data using ARX. For manual, an ARX window will launch.
                        For privatization of both primary and secondary
                        datasets, choose manual_both or auto_both
                        (default=manual)
  -f HIERARCHY_FOLDER, --hierarchy_folder HIERARCHY_FOLDER
                        folder containing hierarchy files for sensitive data,
                        if provided by the user (default=hierarchy)
```
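The flag handling above could be reproduced with argparse roughly as follows. This is a hypothetical sketch covering only a few of the flags, not the actual autoML.py parser:

```python
import argparse

def build_parser():
    """Sketch of a parser for a subset of the flags listed above."""
    p = argparse.ArgumentParser(
        description="A simple interface to autoML (lite, multiData)")
    p.add_argument("-m", "--model_type", default="classification",
                   help="classification (default), regression, or clustering")
    p.add_argument("-i", "--input_file", default="data.csv",
                   help="primary input file to be analyzed")
    p.add_argument("-r", "--radius", type=float, default=1.0,
                   help="radius for cutoff of the distance function")
    p.add_argument("-s", "--privatize_data", default="manual",
                   choices=["none", "manual", "auto", "manual_both", "auto_both"])
    return p

args = build_parser().parse_args(["-m", "regression", "-r", "2.5"])
print(args.model_type, args.radius, args.privatize_data)
```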
It currently handles three types of models: classification, regression, and clustering. If the model type is classification or regression, the last column of the input data is assumed to be the dependent variable. There is an option to add a second dataset and a distance function; the distance function is used to assign elements of the second dataset to each row in the first dataset. A cutoff radius, with a default value of 1, is used for the selection; the -r option can be used to scale the distance function differently.
A sample run could use the following command (the quotes around distanceOnEarth(2) keep the shell from interpreting the parentheses):
python3 autoML.py -m classification -i datasets/accident-landmarks-dataset/primary.data.csv -file2 datasets/accident-landmarks-dataset/secondary.data.csv -sprs 0.1 -r 1000 -d "distanceOnEarth(2)" -s none -t 300
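The assignment step described above can be sketched as follows. This is an illustrative reimplementation under assumed semantics (L_1 distance over the first n coordinates, inclusive radius cutoff), not autoML's actual code:

```python
def l1_norm(a, b, n=1):
    """L_1 distance over the first n coordinates (mirrors L_1Norm(n))."""
    return sum(abs(x - y) for x, y in zip(a[:n], b[:n]))

def assign_secondary(primary, secondary, dist=l1_norm, radius=1.0, n=1):
    """For each primary row, list the indices of secondary rows within radius."""
    return [
        [j for j, s in enumerate(secondary) if dist(p, s, n) <= radius]
        for p in primary
    ]

primary = [(0.0, 0.0), (10.0, 10.0)]
secondary = [(0.5, 0.0), (9.8, 10.0), (100.0, 0.0)]
print(assign_secondary(primary, secondary, radius=1.0, n=2))  # [[0], [1]]
```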
example datasets/ - this subdirectory contains examples of how data from different datasets has been assembled into files autoML can use.