Pattern recognition for HGCAL reconstructions

This repository documents the AI approach to trackster smoothing for the CMS HGCAL reconstruction, developed in the scope of the master's thesis "Pattern Recognition for Particle Shower Reconstruction" by Eduard Cuba. The content is distilled from the original repository and covers data preprocessing and model training.

Pipeline

The pipeline includes:

  • Preprocessing ntuplized ROOT files into a PyTorch dataset
    • selecting tracksters of interest
    • finding tracksters in their neighborhood
    • feature extraction
  • Training the model
    • training
    • validation
    • evaluation
  • Export to ONNX (MLP model)
  • Integration into CMSSW (MLP model)

Setup

This work is organized as a collection of Jupyter notebooks, one per model, supported by a small Python codebase (reco). The training configuration, such as dataset paths and model parameters, is adjusted directly in the notebooks.

The project uses a virtualenv setup and requires Python 3.7 or higher. Several Python libraries are required, including torch, scikit-learn, and uproot. To avoid compatibility issues, use a clean virtual environment and install the requirements defined in requirements.txt as follows:

virtualenv ve                     # initialize the environment
. ve/bin/activate                 # enter the environment
pip install torch                 # torch needs to be installed first
pip install -r requirements.txt   # install the remaining project requirements

Notebooks

The notebooks are organized as follows:

  • Pairwise MLP: train_mlp_pairwise_PU.ipynb
  • GNN for Node Classification: train_graph_net_classification_PU.ipynb
  • GNN for Link Prediction: train_graph_net_link_prediction_PU.ipynb

Data preprocessing

The notebooks can work directly with ntuplized ROOT files, which need to be placed in the folder defined in the notebooks (raw_dir). Before model training, the events are extracted and stored in PyTorch dataset files. This process uses the wrapper classes reco.dataset_pair.TracksterPairs and reco.dataset_graph.TracksterGraph, which process a specified number of ROOT files and save the content as a PyTorch dataset in the data_root folder defined in the notebook. The configuration of each dataset is embedded in its filename; if a preprocessed version of the given configuration is already present in data_root, it is loaded directly and the preprocessing step is skipped.
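
A minimal sketch of how the pairwise wrapper might be invoked; the constructor arguments shown here are assumptions, and the configuration actually used is defined in the notebooks:

from reco.dataset_pair import TracksterPairs

# Hypothetical arguments: raw_dir holds the ntuplized ROOT files,
# data_root receives the preprocessed PyTorch dataset files.
dataset = TracksterPairs(
    raw_dir="data/raw",
    data_root="data",
    n_files=10,
)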

Feature extraction

During preprocessing, trackster pairs or trackster graphs are described by features. In the pairwise case, each pair is described by a total of 63 values. This includes most of the trackster features available in the ROOT files (for both tracksters), their pairwise distance, and the z-coordinates of the first and last layer-clusters along the z-axis. In the graph case, the feature vector is extended by a set of structural features describing the trackster composition; see reco.features.
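
For illustration, a hedged sketch of the extra pairwise quantities described above; the function name and array layout are made up for this example, and the real extraction lives in reco.features:

import numpy as np

def pairwise_extras(bary_a, bary_b, z_a, z_b):
    """Pairwise distance plus z-extent of two tracksters.

    bary_a, bary_b: (3,) barycentre coordinates of the two tracksters.
    z_a, z_b: z-coordinates of each trackster's layer clusters.
    """
    distance = np.linalg.norm(bary_a - bary_b)  # pairwise Euclidean distance
    # z-coordinates of the first and last layer cluster along the z-axis
    return np.array([distance, z_a.min(), z_a.max(), z_b.min(), z_b.max()])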

Batching

The datasets are randomly split into training and validation sets with a 90:10 ratio. A DataLoader object handles batching. In the graph case, note that the graphs in a batch are flattened into one large graph, so all operations on the set of nodes (such as aggregations or nearest-neighbor search) need to respect the data.batch node assignment.
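
A short sketch of batch-aware operations, assuming PyTorch Geometric (2.x) and torch-cluster; train_dataset and the data.pos attribute are placeholders for whatever the notebook actually provides:

from torch_geometric.loader import DataLoader
from torch_geometric.nn import global_mean_pool
from torch_cluster import knn_graph

loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

for data in loader:
    # The nodes of all graphs in the batch are stacked into one large graph;
    # data.batch maps each node back to the graph it came from.
    edge_index = knn_graph(data.pos, k=8, batch=data.batch)  # kNN restricted to each graph
    pooled = global_mean_pool(data.x, data.batch)            # one vector per graph, not per batch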

Models and CMSSW integration

Neural network components can be integrated into CMSSW via the ONNX format. This work compared standard feed-forward neural networks (MLPs) applied to trackster pairs with graph neural networks (GNNs) applied to entire events or calorimeter regions (typically a cylinder defined around a selected trackster). The GNNs were shown to outperform the MLP-based models by a small margin; however, due to its lower complexity, only the MLP architecture was used to demonstrate the CMSSW integration. Integrating GNNs is possible, but the dynamic nearest-neighbor selection used in dynamic graph convolutional networks (DGCNN), implemented via the torch-cluster module, is not currently supported by ONNX and would need to be implemented manually in CMSSW.

MLP in CMSSW

To run the MLP model in CMSSW, you need to:

  • export the trained model to the ONNX format (a sketch follows below)
  • load the exported ONNX model in CMSSW and run it on trackster pairs
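
A hedged sketch of the export step, assuming model is the trained pairwise MLP from the notebook; the file and tensor names are arbitrary, and the 63-value input matches the pairwise feature vector described above:

import torch
import onnxruntime as ort

model.eval()
dummy = torch.randn(1, 63)  # one pairwise feature vector (63 values)
torch.onnx.export(
    model, dummy, "mlp_pairwise.onnx",
    input_names=["features"], output_names=["score"],
    dynamic_axes={"features": {0: "batch"}},  # allow a variable batch size
)

# Sanity check outside CMSSW: run the exported model with ONNX Runtime.
session = ort.InferenceSession("mlp_pairwise.onnx")
print(session.run(None, {"features": dummy.numpy()}))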

Pretrained artifacts

The directory models contains serialized pre-trained models for each case. Using the default configuration, you can skip the training cell in the notebooks and directly load the pre-trained models.
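
A minimal loading sketch; the file name and the assumption that the artifact is a state dict are hypothetical, so check the models directory and the notebook code for the actual format:

import torch

# Hypothetical file name; model is the architecture instantiated in the notebook.
state = torch.load("models/mlp_pairwise.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()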
