Skip to content

Commit

Permalink
[ARTIFACTS TPDS] initial prepare
Browse files Browse the repository at this point in the history
  • Loading branch information
DavideConficconi committed Oct 31, 2022
1 parent 0914d9e commit 15ec6e3
Show file tree
Hide file tree
Showing 8 changed files with 249 additions and 0 deletions.
13 changes: 13 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,19 @@ Contributors: D'Arnese Eleonora, Conficconi Davide, Del Sozzo Emanuele, Fusco Lu

If you find this repository useful, please use the following citation(s):

```
@article{faber2022,
title={Faber: a Hardware/Soft-ware Toolchain for Image Registration},
author={D'Arnese, Eleonora and Conficconi, Davide and Del Sozzo, Emanuele and Fusco, Luigi and Sciuto, Donatella and Santambrogio, Marco D},
journal={IEEE Transactions on Parallel and Distributed Systems},
year="2022",
publisher = "IEEE Computer Society",
address = "Los Alamitos, CA, USA",
pages = "To Appear"
}
```

```
@inproceedings{iron2021,
author = {Conficconi, Davide and D'Arnese, Eleonora and Del Sozzo, Emanuele and Sciuto, Donatella and Santambrogio, Marco D},
Expand Down
121 changes: 121 additions & 0 deletions artifacts_tpds22_scripts/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# Artifacts for IEEE Transaction on Parallel and Distributed Systems (TPDS) Open Initiative

Paper Title: Faber a Hardware/Software Toolchain for Image Registration
Authors: Eleonora D'Arnese, Davide Conficconi, Emanuele Del Sozzo, Luigi Fusco, Donatella Sciuto, Marco D. Santambrogio.
Affiliation: Politecnico di Milano

Paper Main Contributions:
* The first open-source HW/SW toolchain to automatically create custom Image Registration (IRG) pipelines exploiting FPGA-based accelerators.
* Three levels of customization hyperparameters to support users in building IRG pipelines
* A design automation methodology for non-FPGA experts to exploit default HW configurations as off-the-shelf SW
* A latency and resource model to guide HW expert users during the customization of the HW accelerators

Faber achieves up to 54xin speedup and 177x in energy efficiency improvements over State of the Art

## Artifacts' Objectives

With this repo all the Faber's manuscript results can be reproduced:
* HW generation
* Single accelerator testing
* Resource prediction and actual usage extraction
* IRG application execution
* Accuracy Extraction
* Latency prediction
* State of the Art comparison

We exclude the biomedical dataset since it is open and available, as well as Matlab and SimpleITK applications, and to respect their intellectual property.
Please Artifact Evaluators contact us for more details or for ready to use setup.

## Testing Environment <a name="testing_env"></a>
1. We tested the hardware code generation on two different machines based on Ubuntu 18.04/20.4 and Centos OS 7.6 respectively.
2. We used Xilinx Vitis Unified Platform and Vivado HLx toolchains 2019.2.
3. We used Python 3 with `argparse` `numpy` `math` packets on the generation machine.
4. a) On the host machines, or hardware design machines, we used Pynq 2.5 on the Zynq based platforms (Pynq-Z2, Ultra96, Zcu104), where we employ `cv2`, `numpy`, `pandas`, `multiprocessing`, `statistics`, `argparse`, `pydicom`, and `scipy` packetes. For pure SW deployment the user will also need `torch` and `kornia` packets.
4. b) We tested the Alveo u200 on a machine with CentOS 7.6, i7-4770 CPU @ 3.40GHz, and 16 GB of RAM, and we installed Pynq 2.5.1 following the [instructions by the Pynq team](https://pynq.readthedocs.io/en/v2.5.1/getting_started/alveo_getting_started.html) with the same packets as point 4a.
5. [Optional] Possible issues with locale: export LANG="en_US.utf8".

## Artifact Installation and Deployment Process
1. Make sure to have installed Vitis and Vivado 2019.2, the Alveo u200 and the U96 devices, as well as Python 3 and its packages.
2. Follow the [PYNQ's team instruction to setup your devices](https://pynq.readthedocs.io/en/v2.5.1/) (Note: the FPGAs can be on a completly different place than the building system)
2. Clone the repo `https://github.com/necst/faber_fpga.git -b artifacts_tpds22`
3. Prepare the environment e.g., `source </my/path/to/xilinx/tools/Vitis/settings64.sh` and `source /opt/xilinx/xrt/setup.sh`
4. Start with one or alle the following reproduction steps
5. All the on-board executions refer to [Testing a HW Design](#testing_designs) or [Deployment Example and example of Dataset Structuring](#deploymentexample)

## Artifacts Reproduction
Follows all the possible manuscript reproduction instructions.

### Table 2: Performance analysis of the MI similarity metric with and without the HW transformation at 200MHz.
Expected time: consider 16 builds (12 Vitis 4-6 hours each; 4 Vivado 2-3 hours each) it is extremely machine dependent, and then the time to collect the results, Worst case 16\*5\*2 minutes..
1. Prepare the environment e.g., `source </my/path/to/xilinx/tools/Vitis/settings64.sh` and `source /opt/xilinx/xrt/setup.sh`
2. build the accelerators (i.e., `bash metric_w-wo_transform.bash`)
3. check in `build/ultra96_v2` and `build/alveo_u200` the presence of a CSV with the resources of such builds
4. Deploy your solution according to [Testing a HW Design](#testing_designs)
5. To compute the performance values extract the run-time values and then apply for Powell's method `=(EXETIME*1000)/((246*512*512)/10^6)/(3*246)` for 1+1 `=(EXETIME*1000)/((246*512*512)/10^6)/(100*246)` since they employ generally different iterations (i.e., 3 for Powell's and 100 for 1+1).
6. For the Energy Efficiency the ZCU104 and the Alveo U200 have an example PYNQ-based code in `faber_fpga/src/sw/example_measurements` to collect the overall power consumption, while for the U96 we instrumented the plug of the device.

### Figure 6: Resource and FPS Scaling for defaults and image size scaling
Expected time: consider six builds (3 Vitis 4-6 hours each; 3 Vivado 2-3 hours each) it is extremely machine dependent, and then the time to collect the results (5/10 minutes).
1. Prepare the environment e.g., `source </my/path/to/xilinx/tools/Vitis/settings64.sh` and `source /opt/xilinx/xrt/setup.sh`
2. build the accelerators (i.e., `bash fps.bash`)
3. check in `build/ultra96_v2` and `build/alveo_u200` the presence of a CSV with the resources of such builds
4. Deploy your solution according to [Testing a HW Design](#testing_designs)

### Figure 7, Figure 9: Execution Time, Model accuracy
Expected time: consider 8 builds (4 Vitis 4-6 hours each; 4 Vivado 2-3 hours each) it is extremely machine dependent, and then the time to collect the results , Worst case 8\*5\*2 minutes..
1. Prepare the environment e.g., `source </my/path/to/xilinx/tools/Vitis/settings64.sh` and `source /opt/xilinx/xrt/setup.sh`
2. build the accelerators (i.e., `bash build_top_bits.bash`)
3. check in `build/ultra96_v2` and `build/alveo_u200` the presence of a CSV with the resources of such builds
4. Deploy your solution according to [Testing a HW Design](#testing_designs)
5. To compute the performance values extract the run-time values and then apply for Powell's method `=(EXETIME*1000)/((246*512*512)/10^6)/(3*246)` for 1+1 `=(EXETIME*1000)/((246*512*512)/10^6)/(100*246)` since they employ generally different iterations (i.e., 3 for Powell's and 100 for 1+1).
6. For the Energy Efficiency the ZCU104 and the Alveo U200 have an example PYNQ-based code in `faber_fpga/src/sw/example_measurements` to collect the overall power consumption, while for the U96 we instrumented the plug of the device.
7. For the model evaluation `bash evaluate_model.bash` and check the results in that folder (i.e., files `*.log`)


### Table 4: State of the Art comparison table
Expected time: consider 9 builds (4 Vitis 4-6 hours each; 5 Vivado 2-3 hours each) it is extremely machine dependent, and then the time to collect the results for each build, Worst case 9\*5\*2 minutes.
1. Prepare the environment e.g., `source </my/path/to/xilinx/tools/Vitis/settings64.sh` and `source /opt/xilinx/xrt/setup.sh`
2. build the accelerators (i.e., `bash build_soa.bash` or if done the previous step `cd ..; make hw_gen PE=1 CORE_NR=3 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=zcu104 METRIC="mse" TRANSFORM="wax"; cd -`)
3. check in `build/ultra96_v2` and `build/alveo_u200` the presence of a CSV with the resources of such builds
4. Deploy your solution according to [Testing a HW Design](#testing_designs)
5. To compute the performance values extract the run-time values and then apply for Powell's method `=(EXETIME*1000)/((246*512*512)/10^6)/(3*246)` for 1+1 `=(EXETIME*1000)/((246*512*512)/10^6)/(100*246)` since they employ generally different iterations (i.e., 3 for Powell's and 100 for 1+1).
6. For the Energy Efficiency the ZCU104 and the Alveo U200 have an example PYNQ-based code in `faber_fpga/src/sw/example_measurements` to collect the overall power consumption, while for the U96 we instrumented the plug of the device.

### Table 3: Accuracy Evaluation
Expected time: consider 4 builds (4 Vitis 4-6 hours each or 4 Vivado 2-3 hours each) it is extremely machine dependent, and then the time to collect the results for each build, Worst case 4\*5\*2 minutes.

1. Prepare the environment e.g., `source </my/path/to/xilinx/tools/Vitis/settings64.sh` and `source /opt/xilinx/xrt/setup.sh`
2. Build and execute on whatever kind of platform the accelerators for every similarity metric without the transform (e.g., for the ALVEO with 1 PE `bash accuracy_eval.bash`, otherwise the one of the previous build)
3. check in `build/ultra96_v2` and `build/alveo_u200` the presence of a CSV with the resources of such builds
4. Deploy your solution according to [Testing a HW Design](#testing_designs)
5. Execute the `src/sw/res_extraction.py` to extract the single image accuracy and then compute the average (e.g., `python3 res_extreaction.py -f 0 -rg <path/to/gold/images/folder> -rt <where/to/find/registered/images> -l <optimizer_label> -rp <where/to/store/results>`).


### Testing a HW Design <a name="testing_designs"></a>

1. Complete at least one design in the previous section, and prepare the HW design for deployment (i.e., `make resyn_extr_zynq_ultra96_v2 ` or `make resyn_extr_vts_alveo_u200`, done by all the bash for the artifacts).
2. `make pysw` creates a deploy folder for the Python code.
3. `make deploybitstr` or `make deployxclbin` `BRD_IP=<target_ip> BRD_USR=<user_name_on_remote_host> BRD_DIR=<path_to_copy>` copy onto the deploy folders the needed files.
4. connect to the remote device, i.e., via ssh `ssh <user_name_on_remote_host>@<target_ip>`.
5. [Optional] install all needed Python packages as above, or the pynq package on the Alveo host machine.
6. Navigate to the `<path/where/deployed>/sw_py`.
7.
* 7a) Launch the script `python_tester_launcher.sh <path/where/deployed>/bitstream_ultra96` (or where you transferred the folder of the .bit) for the Ultra96 testing.
* 7b) Modify the script with PLATFORM=Alveo, and launch the script `python_tester_launcher.sh <path/where/deployed>/xclbn_alveo_u200` (or where you transfered the folder of the .xclbin) for the Alveo testing.
* 7c) the script will automatically detect the accelerator configuraiton (based on the folder name) and setup the testing of both, single accelerator, powell's, and 1+1 registrations, with a dataset structured as [descirbed here](#dataset_description).
8. `python_tester_extractor.sh <path/where/deployed>/bitstream_ultra96 <name for the csv result>` to automatically derive a .csv with most of the useful results of the experimental campaign.

If you wish to have a **single test of the accelerator** for a single bitstream please follow these steps after previous step 6.
7. set `BITSTREAM=<path_to_bits>`, `CLK=200`, `CORE_NR=<target_core_numbers>`, `PLATFORM=Alveo|Zynq`, `RES_PATH=path_results`, and source xrt on the Alveo host machine, e.g., `source /opt/xilinx/xrt/setup.sh`.
8. [Optional] `python3 test-single-mi.py --help` for a complete view of input parameters.
9. Execute the test `python3 test-single-mi.py -ol $BITSTREAM -clk $CLK -t $CORE_NR -p $PLATFORM -im 512 -rp $RES_PATH` (if on Zynq you will need `sudo`).

To execute a **single registration**, follow similar steps such as the previous single accelerator test, or [have a look here](#deploymentexample).
`python3 faber-powell-blocked.py --help` will show a complete view of input parameters for powell HW-based registration procedure.


### <a name="deploymentexample"></a> Deployment Example on Ultra96/Alveo u200/ Pure SW
An example of deployment is `make deploybitstr TRGT_PLATFORM=ultra96_v2 BRD_USR=xilinx BRD_IP=<board ip> BRD_DIR=/home/xilinx/faber` on the Ultra96.
For an Alveo u200 device `make deployxclbin TRGT_PLATFORM=alveo_u200 BRD_USR=<machine user> BRD_IP=<server ip> BRD_DIR=<path of where to deploy>`.

Connect to the host machine and go the target folder.
9 changes: 9 additions & 0 deletions artifacts_tpds22_scripts/accuracy_eval.bash
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
#!/bin/bash
cd ..
make hw_gen METRIC='mi' PE=1 CORE_NR=1 HT='float' FREQ_MHZ=200 TRGT_PLATFORM=alveo_u200;
make hw_gen METRIC='cc' PE=1 CORE_NR=1 FREQ_MHZ=200 TRGT_PLATFORM=alveo_u200;
make hw_gen METRIC='mse' PE=1 CORE_NR=1 FREQ_MHZ=200 TRGT_PLATFORM=alveo_u200;
make hw_gen METRIC='prz' PE=1 CORE_NR=1 HT='float' FREQ_MHZ=200 TRGT_PLATFORM=alveo_u200;
make resyn_extr_vts_alveo_u200

cd -
7 changes: 7 additions & 0 deletions artifacts_tpds22_scripts/build_soa.bash
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
#!/bin/bash
bash build_top_bits.bash
cd ..
make hw_gen PE=1 CORE_NR=3 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=zcu104 METRIC="mse" TRANSFORM="wax"
make resyn_extr_zynq_zcu104

cd -
20 changes: 20 additions & 0 deletions artifacts_tpds22_scripts/build_top_bits.bash
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
#!/bin/bash

cd ..
make default_ultra96
make default_alveo_u200

make hw_gen PE=1 CORE_NR=2 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=ultra96_v2 METRIC="mse" TRANSFORM="wax";
make hw_gen PE=1 CORE_NR=2 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=ultra96_v2 METRIC="mi" TRANSFORM="wax";
make hw_gen PE=1 CORE_NR=2 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=ultra96_v2 METRIC="nmi" TRANSFORM="wax";

echo "Hello moto"

make hw_gen METRIC='cc' PE=32 CORE_NR=1 FREQ_MHZ=300 TRGT_PLATFORM=alveo_u200;
make hw_gen METRIC='mse' PE=32 CORE_NR=1 FREQ_MHZ=300 TRGT_PLATFORM=alveo_u200;
make hw_gen METRIC='prz' PE=16 CORE_NR=1 HT='float' FREQ_MHZ=300 TRGT_PLATFORM=alveo_u200;

make resyn_extr_zynq_ultra96_v2
make resyn_extr_vts_alveo_u200

cd -
15 changes: 15 additions & 0 deletions artifacts_tpds22_scripts/evaluate_model.bash
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#!/bin/bash
cd ../src/model

python3 model.py -m cc -n 2 -pe 1 -b 8 -d 512 -p -t -i nn ultra96 > default_ultra96.log
python3 model.py -m mi -n 1 -pe 16 -b 8 -d 512 -p default_alveo_u200 > default_alveo_u200.log

python3 model.py -m cc -n 2 -pe 1 -b 8 -d 512 -p -t -i nn ultra96 > waxmse_u96.log
python3 model.py -m cc -n 2 -pe 1 -b 8 -d 512 -p -t -i nn ultra96 > waxmi_u96.log
python3 model.py -m cc -n 2 -pe 1 -b 8 -d 512 -p -t -i nn ultra96 > waxnmi_u96.log

python3 model.py -m cc -n 1 -pe 32 -b 8 -d 512 -p default_alveo_u200 > cc_alveo.log
python3 model.py -m mse -n 1 -pe 32 -b 8 -d 512 -p default_alveo_u200 > mse_alveo.log
python3 model.py -m nmi -n 1 -pe 16 -b 8 -d 512 -p default_alveo_u200 > nmi_alveo.log

cd -
4 changes: 4 additions & 0 deletions artifacts_tpds22_scripts/fps.bash
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
#!/bin/bash
cd ../; make default_ultra96; make default_ultra96 D=1024; make default_ultra96 D=2048; cd -
cd ../; make default_alveo_u200; make default_alveo_u200 D=1024; make default_alveo_u200 D=2048; cd -
cd ../; make resyn_extr_zynq_ultra96_v2; make resyn_extr_vts_alveo_u200; cd -
60 changes: 60 additions & 0 deletions artifacts_tpds22_scripts/metric_w-wo_transform.bash
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
#!/bin/bash
cd ..
############## Zynq builds
make hw_gen PE=1 CORE_NR=2 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=ultra96_v2 METRIC="mi" TRANSFORM="wax";
make hw_gen PE=2 CORE_NR=2 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=ultra96_v2 METRIC="mi" ;

make hw_gen PE=1 CORE_NR=2 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=zcu104 METRIC="mi" TRANSFORM="wax";
make hw_gen PE=2 CORE_NR=2 TARGET=hw CLK_FRQ=200 TRGT_PLATFORM=zcu104 METRIC="mi" ;

make resyn_extr_zynq_ultra96_v2

############## Alveo builds
PES=(32 16 8 4 2 1)
CORE_NRS=(1)
HTYPS=('float')
PE_ENTROP=(1)
CACHING=(false)
URAM=(false)
FREQZ=200
INTERPOLATIONS=('nearestn')

for p in ${PES[@]}; do
for cn in ${CORE_NRS[@]}; do
for h in ${HTYPS[@]}; do
for pe in ${PE_ENTROP[@]}; do
for it in ${INTERPOLATIONS[@]}; do
for c in ${CACHING[@]}; do
for u in ${URAM[@]}; do
for wu in ${CACHING[@]}; do
make hw_gen PE=$p CORE_NR=$cn HT=$h TARGET=hw OPT_LVL=3 \
FREQ_MHZ=$FREQZ PE_ENTROP=$pe CACHING=$c URAM=$u TRGT_PLATFORM=alveo_u200 \
WAX_URAM=$wu METRIC="mi" INTERP_TYPE=$it TRANSFORM="wax";
done;
done;
done;
done;
done;
done;
done;
done;


for p in ${PES[@]}; do
for cn in ${CORE_NRS[@]}; do
for h in ${HTYPS[@]}; do
for pe in ${PE_ENTROP[@]}; do
for c in ${CACHING[@]}; do
for u in ${URAM[@]}; do
make hw_gen PE=$p CORE_NR=$cn HT=$h TARGET=hw OPT_LVL=3 \
FREQ_MHZ=$FREQZ PE_ENTROP=$pe CACHING=$c URAM=$u TRGT_PLATFORM=alveo_u200 METRIC="mi";
done;
done;
done;
done;
done;
done;

make resyn_extr_vts_alveo_u200

cd -

0 comments on commit 15ec6e3

Please sign in to comment.