DaCe_AutoOpt_DL: Optimizing Data-Centric Applications - A Machine Learning Approach to Optimizations in DaCe
This git repository contains the code, documentation, and all relevant files for the Master's thesis Optimizing Data-Centric Applications - A Machine Learning Approach to Optimizations in DaCe. In this thesis, we provide a cost model implementation for the DaCe parallel programming framework alongside a Beam search implementation. For the data generation, we utilize the MLIR-Forge implementation by Berke Ates et al. Our cost model is based on the Tiramisu cost model architecture, adjusted to work within the DaCe framework. Our Beam search algorithm can be configured to use either real-time runtime measurements or cost model predictions.
To install and run the project, follow the steps below:
```shell
git clone --recurse-submodules https://spclgitlab.ethz.ch/dofilip/dace_autoopt_dl.git
```
- `data`: Contains plotting data points, plots, as well as a small dataset collection for testing.
- `notes`: Text files for notes.
- `papers`: Annotated papers used within the thesis.
- `src`: All source files for this thesis.
  - `daisytuner_evaluation`: Benchmark files for the Daisytuner evaluation.
  - `dataset_generation`: Script files to generate a set of base SDFGs with MLIR-Forge.
  - `model`: Contains source files for the cost model implementation.
  - `paper_examples`: Code examples used within the thesis writing.
  - `pass_application`: Implementation of the transformation pass and benchmarking infrastructure.
  - `plotting`: Scripts to plot various data points from our cost model analysis.
  - `scripts`: Contains auxiliary scripts for the data point generation.
  - `search_space_exploration`: Beam search implementation files.
  - `workload_analysis`: SDFG structure analysis scripts.
More information about the repository setup can be found in the paragraphs below.
If you want to generate a custom dataset for training, you have to set up MLIR-Forge as follows:

```shell
cd MLIR-Smith
```

Then follow the steps in the MLIR-Forge repository to build all necessary components.
To train the model, run `python training.py` in `src/model`. Make sure the training data has been properly initialized first, as described in the next step.
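The actual model architecture lives in `src/model/training.py`; conceptually, the training step fits a cost model that maps graph feature vectors to runtimes. As a minimal, purely illustrative sketch (the features and the linear model here are hypothetical stand-ins, not the thesis implementation):

```python
def train_cost_model(samples, lr=0.01, epochs=500):
    """Fit a linear cost model w.x + b to (features, runtime) pairs
    using plain per-sample gradient descent on the squared error."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Synthetic data: runtime grows linearly with toy "map size" and
# "memlet volume" features (made-up quantities for illustration).
data = [((m, v), 2.0 * m + 0.5 * v) for m in range(1, 5) for v in range(1, 5)]
w, b = train_cost_model(data)
```

The real model replaces this linear map with the Tiramisu-style network described in the thesis; only the fit-features-to-runtime framing carries over.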
To generate a training dataset based on some SDFG directory, run `generate_test_dataset.py` in the `src/scripts` directory. This generates the base graphs and their transformed graphs in `src/model/train_graphs/`. By running `python data_loader.py` in `src/model`, the training data will be initialized and stored in `src/model/train_data`.
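The concrete logic lives in `generate_test_dataset.py`; as an illustrative sketch only (the JSON file layout and transformation names below are hypothetical, not the thesis format), the generation step can be thought of as enumerating transformation subsets for every base graph:

```python
import itertools
import json
import pathlib

def generate_dataset(base_dir, out_dir, transformations):
    """For every base graph, emit one variant per subset of the given
    transformations (placeholder: we only record which passes ran)."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for graph in sorted(pathlib.Path(base_dir).glob("*.json")):
        base = json.loads(graph.read_text())
        for r in range(len(transformations) + 1):
            for subset in itertools.combinations(transformations, r):
                variant = dict(base, applied=list(subset))
                name = f"{graph.stem}_{'_'.join(subset) or 'base'}.json"
                (out / name).write_text(json.dumps(variant))
                count += 1
    return count
```

With two base graphs and two transformations this yields eight files: each graph produces the empty subset, each single transformation, and the pair.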
To generate a training dataset based on random programs provided by MLIR-Forge, first ensure a proper installation of MLIR-Forge. Next, generate base SDFGs using `src/dataset_generation/dataset_generation.sh`. You can parameterize MLIR-Forge with the `gen_config` file (see more information below). Lastly, run `generate_gen_dataset.py` in the `src/scripts` directory.
Training our cost model requires a large dataset of SDFGs. However, randomly generated SDFGs do not resemble the workloads that scientific applications exhibit. For this reason, we carry out a structure analysis on NPBench SDFGs and, in our SDFG generation pipeline, set the generation parameters of MLIR-Forge according to that analysis.
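The core idea of the structure analysis can be sketched in a few lines: count how often each node kind occurs across the benchmark graphs and turn the counts into relative frequencies that can seed a generator's probability parameters. The node kinds and the toy graph below are illustrative, not the actual analysis output:

```python
from collections import Counter

def analyze_structure(sdfg_nodes):
    """Count node kinds in a (toy) SDFG node list and return their
    relative frequencies, usable as generation probabilities."""
    counts = Counter(kind for kind, _ in sdfg_nodes)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}

# Hypothetical miniature SDFG: (node kind, node label) pairs.
toy_sdfg = [("map", "outer"), ("map", "inner"), ("tasklet", "mul"),
            ("tasklet", "add"), ("access", "A"), ("access", "B"),
            ("access", "C"), ("tasklet", "fma")]
freqs = analyze_structure(toy_sdfg)
```

The real scripts additionally inspect properties such as graph shape and nesting; this sketch only shows the counting step.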
The `workload_analysis` directory holds two files, `analyze_sdfg.py` and `analyze_python.py`, which may be used to analyze the NPBench programs in SDFG and Python representation, respectively. The SDFGs for all NPBench programs are generated with `npbench_to_sdfg.py` and subsequently stored in `npbench_sdfgs/`. Running `python analyze_sdfg.py` carries out a workload analysis of the `npbench_sdfgs/` directory and prints the results to `SDFG_analysis.txt`.
The `src/search_space_exploration` directory contains all relevant files for conducting the Beam search. Run

```shell
python beam_search.py +beam.sdfg_filename=FILENAME +beam.batch_number=BATCH_NR
```

to conduct a Beam search on an SDFG file in a specific batch (use batch number 0). Please note that you need to set up the `search_graphs/base_graphs` directory prior to the Beam search. This directory must contain the SDFGs that you want to run the Beam search on, and the supplied file path should lead to the SDFG within this directory.
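For orientation, the algorithm behind `beam_search.py` keeps only the most promising candidates at each step, scored either by measured runtime or by the cost model. A minimal, generic sketch of beam search follows; the integer states, `expand`, and `cost` functions are toy stand-ins, not the thesis API:

```python
import heapq

def beam_search(initial, expand, cost, beam_width=3, depth=4):
    """Generic beam search: at every depth, keep only the `beam_width`
    cheapest successors and track the best state seen overall."""
    beam = [initial]
    best = initial
    for _ in range(depth):
        candidates = [succ for state in beam for succ in expand(state)]
        if not candidates:
            break
        # nsmallest returns candidates sorted ascending by cost.
        beam = heapq.nsmallest(beam_width, candidates, key=cost)
        if cost(beam[0]) < cost(best):
            best = beam[0]
    return best

# Toy search: walk the integers toward the minimum of (x - 5)^2.
best_state = beam_search(0, lambda x: [x + 1, x - 1, x + 2],
                         lambda x: (x - 5) ** 2, beam_width=3, depth=6)
```

In the thesis setting, a state is an SDFG with a sequence of applied transformations, `expand` applies the available transformation passes, and `cost` is either a runtime measurement or a cost model prediction.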
Author: Filip Dobrosavljevic

Advisors: Andrei Ivanov, Lukas Gianinazzi, and Afif Boudaoud

Supervisor: Prof. Dr. Torsten Hoefler
Acknowledgements: Thanks to Lukas Truemper for providing parts of the benchmarking infrastructure and a dataset of SDFGs for our cost model training.