This repository contains the source code of the paper ``One Automaton To Rule Them All: Beyond Multiple Regular Expressions Execution'' where we deal with the problem of efficient Regular Expressions (REs) matching. The key idea of this work is to exploit REs morphological similarities within the same dataset to enable memory reduction when storing the patterns and improving the dataset-matching throughput. This idea is expressed through:
- the Multi-RE Finite State Automata (MFSA) model that extends the Finite State Automata (FSA) model to improve REs parallelization by leveraging similarities within a specific application ruleset
- a multi-level compilation framework to manage REs merging and optimization to produce MFSA(s) automatically
- an execution engine called iMFAnt, which is an extension of the state-of-the-art iNFAnt algorithm
- Repo Overview
- CGO'24 AE Instructions
- Artifacts Setup
- Running the Experiments
- Charts Visualization
- Customization Options
- Credits and Contributors
benchmark_scripts
andbenchmark_results
contain the scripts to evaluate datasets similarity and the numericalresult of such evaluation, respectively (Fig. 1).datasets
contains the RE datasetsframework
contains the compilation frameworkinput_streams
contains the data streams to matchmatching
contains the iMFAnt enginemfsa
contains the automata in ANML format, resulting from the compilation frameworkoutput_matches
contains the numeric results of MFSAs analysis (time, #matches)plots
contains the plots resulting from the execution of the scripts inscripts
folderplotting scripts
contains the scripts for the plotting of the numerical resultsraw_results
contains the numerical results corresponding to the plotsscripts
contains the scripts to plot the compression, compilation time, execution time and throughput results (Figs. 7, 8, 9, 10)
To customize the experiments go to the customization section
Run the scripts in iMFAnt/scripts
folder to reproduce the experimental results reported in the paper.
These include the automatic generation of the plots with the default merging factors (0, 1, 2, 5, 10, 20, 50, 100)
The following instructions are to reproduce the CGO'24 results. The fast way is to go to the artifacts setup
An x86 CPU is mandatory for the Docker setup, other CPUs are usable in general but not tested.
Clone the repository:
git clone https://github.com/necst/iMFAnt.git
and navigate into the repo folder
cd iMFAnt
Please follow one of the two setups: 1) native Ubuntu-based machine; 2) any OS with Docker and then to the Running the Experiments.
To execute CGO'24 AE are required
make cmake
flex
bison
g++
openmp
(included ing++
installation)python3.10
withnumpy pandas matplotlib
A possible command to install everything you need:
sudo apt-get install -y locales curl wget tar sudo git apt-utils nano
sudo apt-get install -y g++ python3 python3-pip automake make cmake flex bison
sudo pip3 install numpy pandas matplotlib
navigate into the scripts folder
cd scripts/
It is supposed you have installed and configured Docker according to the docker instructions. An x86 CPU is mandatory.
Open a terminal and then run the script
run_docker.sh
that will build and run the docker container.
navigate into the scripts folder
cd iMFAnt/scripts/
All the experiments take a shorter time than expected if you open them and reduce the number of repetitions, changing the value of -r
flag in the execution of python3 merging.py -r XXX
.
More info on experiment customization here
Each experiment will execute and collect the corresponding results and concur to the paper figure.
To reproduce Fig. 7 (ETA: 5 min)
./compression.sh
To visualize these results go here
To reproduce Fig. 8 (ETA: 30 min with 30 reps)
./compilation_time.sh
To visualize these results go here
To reproduce Figs. 9 and 10 (ETA: 12 h with 15 reps)
./iMFAnt_performance.sh
Figure 9 and 10 are expected to display different optimal merging factor according to testing machine characteristics. Anyhow, the MFSAs are expected to improve the simple multi-thread scaling according to the trends in the paper.
To visualize these results go here
The scripts run automatically the desired experiments and you will find the corresponding PDF plots in iMFAnt/plots
folder.
If you follow the Docker setup, exit
the container and you will find there the results.
Then open the PDFs as you prefer.
- To merge a new dataset save the RE file and the input stream file in
dataset
andinput_streams
folders, directly. Add the corresponding directory intoframework/compiler/merging.py
file, ininputRE
list and run it to perform the merging. - To pattern-match a new dataset, add the corresponding
mfsa
andinput_streams
directories toimfant.py
file and run it, ensuring that the merging factors are consistent with the ones employed during the merging.
-
The following options are available for the compilation of REs in
framework/re2automata/re2automata.cpp
:Variable Meaning STATES if 1 eval. states compression, else eval. trans. compression ANML if 1 output ANML file, else output DOT file STACK_TIME if 1 eval. compilation time V if 1 enable verbose mode These parameters are active by default.
-
Enter the compilation framework and compile the compiler:
cd framework/compiler
and build the compiler, entering the following:
cd framework/compiler
bison -d compiler.yy -v
flex -o compiler.yy.cc compiler.lex
- At compile-time, set the desired values of
STATES
andSTACK_TIME
according to the values in the table:
g++ -o compiler compiler.yy.cc compiler.tab.cc ast.cpp ../re2automata/re2automata.cpp -DSTATES=1 -DSTACK_TIME=0 -w -std=c++11 -g -O3
- Run
python merging.py -r REPS -b [M1 ... Mn]
to generate the automata for the benchmark datasets. -r
indicates the number of repetitions to evaluate the compilation time, and -b
indicates the list of merging factors to reproduce the results in the paper. In practice, any value for M is valid. To merge the entire dataset, set M=0. Adjust the merging factors consistently in matching/imfant.py
to run the pattern matching engine.
- The generated automata are in
mfsa
- The compression results are in
raw_results
- Compile the matching algorithm:
cd ../../matching/ && make clean all
- To match a single folder of MFSAs run:
export OMP_NUM_THREADS=N; make && ./multithreaded_imfant STREAM_IN + MFSA_DIR NUM OUTPUT_FILE
where N
is the number of threads, STREAM_IN
is the input stream to be matched, MFSA_DIR
is the directory of the folder containing the MFSAs to match, NUM
is the number of MFSAs to match in that directory, and OUTPUT_FILE
contains the results of the matching (matching time, #matches).
Contributors: Luisa Cicolini, Filippo Carloni, Marco Domenico Santambrogio, Conficconi Davide
If you find this repository useful, please use the following citation(s):
@inproceedings{cicolini2024one,
title={One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution},
author={Cicolini, Luisa and Carloni, Filippo and Santambrogio, Marco D and Conficconi, Davide},
booktitle={2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)},
pages={193--206},
year={2024},
organization={IEEE}
}