Scripts to ease the reduction of MC data on the LST cluster at La Palma. With this package, the analysis/creation of R1/DL0/DL1/DL2/IRFs can be orchestrated.
Contact:
Thomas Vuillaume, thomas.vuillaume [at] lapp.in2p3.fr
Enrique Garcia, garcia [at] lapp.in2p3.fr
Lukas Nickel, lukas.nickel [at] tu-dortmund.de
If lstMCpipe was used for your analysis, please cite:
```
@misc{garcia2022lstmcpipe,
      title={The lstMCpipe library},
      author={Enrique Garcia and Thomas Vuillaume and Lukas Nickel},
      year={2022},
      eprint={2212.00120},
      archivePrefix={arXiv},
      primaryClass={astro-ph.IM}
}
```
in addition to the exact lstMCpipe version used from https://doi.org/10.5281/zenodo.6460727
You may also want to include the config file with your published code for reproducibility.
As a user:
```
wget https://raw.githubusercontent.com/cta-observatory/lstmcpipe/master/environment.yml
conda env create -f environment.yml
conda activate lstmcpipe
pip install lstmcpipe
```
This will set up a new environment with lstchain and the other needed tools in supported versions. If you already have an lstchain conda environment, you may simply activate it and install lstmcpipe there using `pip install lstmcpipe`.
HIPERTA (referred to as rta in the following) support is built in, but no installation instructions can be provided as of now.
Alternatively, you can install lstmcpipe in your own environment to use different versions of the analysis pipelines. WARNING: due to changing APIs and data models, we cannot support versions other than the ones specified in the environment.
As a developer:
```
git clone https://github.com/cta-observatory/lstmcpipe.git
cd lstmcpipe
conda env create -n lstmcpipe_dev -f environment.yml
conda activate lstmcpipe_dev
pip install -e .
pre-commit install
```
This will set up a pre-commit hook: provided you are in the right environment, it will run black on the files you are about to commit and reformat them if needed. (You need to stage the changes again after that.) This ensures the code formatting follows our guidelines and reduces the work of dealing with the code checker in the CI.
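As an illustration, a typical flow when the hook reformats a file looks like this (plain git usage, not lstmcpipe-specific):

```
git commit -m "fix stage ordering"   # hook runs black; commit is rejected if files were reformatted
git add -u                           # stage the reformatted files again
git commit -m "fix stage ordering"   # hook passes, commit succeeds
```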
You may find the list of already-run productions in the documentation. Please check this list to make sure the request you are about to make does not already exist!
As an LST member, you may require an MC analysis with a specific configuration, for example to later analyse a specific source with tuned MC parameters.
To do so, please:

- Open a pull request from your fork into lstMCpipe, adding the desired configuration in a new directory named `date_ProdID` in `production_configs`. You may have a look at `production_configs/template_prod` as an example.
- Add a descriptive production ID (e.g. `src***_psf_tuned`) to your directory and configuration.
- The requested config must contain (see the sketch after this list):
   - a lstchain config file (please provide an exhaustive config; it will help others and provides more explicit provenance information)
   - a lstmcpipe config file (to generate it, please refer to the documentation)
   - a readme with a short description of why you require this analysis to be run. Do not add information that should not appear publicly (such as source names) here. If you are requesting a production for a specific new source, please edit this table on the LST wiki. Also add the command line used to generate the lstmcpipe config; that will help debugging.
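As a rough illustration, a request directory could look like the following (the date, production ID and file names are hypothetical):

```
production_configs/
└── 20220601_src***_psf_tuned/
    ├── lstchain_config.json    # lstchain configuration
    ├── lstmcpipe_config.yml    # lstmcpipe configuration
    └── README.md               # short description + config generation command line
```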
The proposed configuration will be tested for validity by the continuous integration tests, and we will interact with you to run the analysis on the cluster at La Palma.
Depending on the number of requests, we may prioritise them.
To generate your lstmcpipe configuration file, use the `lstmcpipe_generate_config` command. If the type of production you want is not among the existing ones, you may create your own `PathConfig` class from an existing one, or generate a config from an existing prod type and edit the file manually.
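For example (the path-config class name and production ID here are illustrative; see `lstmcpipe_generate_config --help` for the options available in your version):

```
lstmcpipe_generate_config PathConfigProd5Trans80 --prod_id my_tuned_prod
```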
Once you have your configuration file, you may launch the pipeline with the stages described in the config using:
```
lstmcpipe -c config_MC_prod.yml -conf_lst lstchain_*.json [-conf_cta CONFIG_FILE_CTA] [-conf_rta CONFIG_FILE_RTA] [--debug] [--log-file LOG_FILE]
```
The `lstmcpipe_start.py` script is the orchestrator of the pipeline: it schedules the stages specified in the `onsite_MC_prod.yml` file. All the configuration related to the MC pipeline must be declared in this file (stages, particles to be analysed, zenith, pointing, type of MC production...). Pipeline-specific configuration options (such as cleaning or model parameters) are declared in a separate configuration file, which is passed via the `-conf_lst`/`-conf_cta`/`-conf_rta` options.
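For orientation, here is a minimal sketch of such a file with illustrative key names and values; the exact schema depends on your lstmcpipe version, so generate a real one with `lstmcpipe_generate_config`:

```yaml
prod_id: my_tuned_prod               # hypothetical production ID
prod_type: PathConfigProd5Trans80    # type of MC production
stages_to_run:                       # stages scheduled by the orchestrator, in order
  - r0_to_dl1
  - merge_dl1
  - train_pipe
  - dl1_to_dl2
  - dl2_to_irfs
# particles, zenith and pointing appear in the per-stage input/output paths
```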
Note: you can always launch this command without fear; an intermediate step verifies and displays the configuration that you are passing to the pipeline.
The use of SLURM job arrays in the r0_to_dl1 stage, combined with a cap on the number of jobs running at the same time, reduces the load on the cluster compared to previous versions. Please note, however, that processing a full MC production still requires a lot of resources. Think about other LP-IT cluster users.
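For reference, this kind of throttling is a standard SLURM feature; in a sketch (where `process_batch.sh` is a hypothetical wrapper script, not an lstmcpipe file):

```
# submit a 100-task job array, with at most 10 tasks running concurrently
sbatch --array=0-99%10 process_batch.sh
```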
After the pipeline is launched, all selected tasks are performed in order. These tasks are referred to as stages and are collected in `lstmcpipe/stages`. The following is a short overview of each stage that can be specified in the configuration.
r0_to_dl1
In this stage, simtel files are processed up to data level 1 and separated into files for training and files for testing. For efficiency, files are processed in batches: N files (N depends on the particle type, as that influences the average duration of the processing) are submitted as one job in a job array. To group the files together, the paths are saved in list files that are passed to Python scripts in `lstmcpipe/scripts`, which then call the selected pipeline's processing tool. These are:
- lstchain: `lstchain_mc_r0_to_dl1`
- ctapipe: `ctapipe-stage1`
- rta: `lstmcpipe_hiperta_r0_to_dl1lstchain` (`lstmcpipe/hiperta/hiperta_r0_to_dl1lstchain.py`)
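Conceptually, for the lstchain case, each job in the array then runs something like the following for every file in its list (flag names may differ between lstchain versions; the paths are placeholders):

```
lstchain_mc_r0_to_dl1 --input-file gamma_run101.simtel.gz --output-dir DL1/.../training --config lstchain_config.json
```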
dl1ab
As an alternative to processing simtel R0 files, existing DL1 files can be reprocessed. This can be useful to apply different cleanings or to alter the images, e.g. by adding noise. For this to work, the old files have to contain images, i.e. they need to have been processed with the `no_image: False` flag in the config. The config key `dl1_reference_id` is used to determine the input files. Its value needs to be the full prod_id including software versions (i.e. the name of the directories directly above the DL1 files). For lstchain the dl1ab script is used; ctapipe can use the same script as for simtel processing. There is no support for hiperta!
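A minimal sketch of the relevant lstmcpipe config entry (the value below is a hypothetical prod_id):

```yaml
# select the existing DL1 production to reprocess; must be the full prod_id
# including software versions, i.e. the directory name above the DL1 files
dl1_reference_id: 20221201_v0.9.2_prod5_trans_80_my_prod
```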
merge_dl1
In this stage the previously created DL1 files are merged, so that you end up with train and test datasets for the next stages.
train_test_split
Split the dataset into training and testing datasets, performing a random selection of files with the specified ratio (default = 0.5).
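A minimal sketch of the random selection described above (illustrative only, not lstmcpipe's actual implementation):

```python
import random

def train_test_split(files, ratio=0.5, seed=42):
    """Randomly assign files to a training and a testing set with the given ratio."""
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratio)
    return shuffled[:n_train], shuffled[n_train:]

train, test = train_test_split(["run1.h5", "run2.h5", "run3.h5", "run4.h5"])
```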
train_pipe
IMPORTANT: from here on, only `lstchain` tools are available. More about that at the end.
In this stage, the models to reconstruct the primary particle's properties are trained on the gamma-diffuse and proton train data. At present this means that random forests are created using lstchain's `lstchain_mc_trainpipe`. Models will be stored in the `models` directory.
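A sketch of what such a call can look like (flag names may differ between lstchain versions; the file names are placeholders):

```
lstchain_mc_trainpipe --fg dl1_gamma_diffuse_train.h5 --fp dl1_proton_train.h5 -o models/
```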
dl1_to_dl2
The previously trained models are evaluated on the merged DL1 files using `lstchain_dl1_to_dl2` from the lstchain package. DL2 data can be found in the `DL2` directory.
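Illustrative invocation (flag names may differ between lstchain versions; the paths are placeholders):

```
lstchain_dl1_to_dl2 --input-file dl1_gamma_test.h5 --path-models models/ --output-dir DL2/
```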
dl2_to_irfs
Point-like IRFs are produced for each set of offset gammas. The processing is performed by calling `lstchain_create_irf_files`.
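Illustrative invocation (flag names may differ between lstchain versions; the input DL2 files and output name are placeholders):

```
lstchain_create_irf_files --point-like -g dl2_gamma_off0.0deg.h5 -p dl2_proton_test.h5 -e dl2_electron_test.h5 -o irf_off0.0deg.fits.gz
```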
dl2_to_sensitivity
A sensitivity curve is estimated using a script based on pyirf, which performs a cut optimisation similar to EventDisplay's. The script can be found in `lstmcpipe/scripts/script_dl2_to_sensitivity.py`. This does not use the IRFs and cuts computed in dl2_to_irfs, so it cannot be compared to observed data. It is a mere benchmark for the pipeline.
NOTE: `lstmcpipe` expects the data to be located in a specific structure on the cluster. Output will be written in a standardized way next to the input data, to make sure everyone can access it. Analysing a custom dataset requires replicating parts of the directory structure and is not the intended use case for this package.
All the `r0_to_dl1` stage job logs are stored in `/fefs/aswg/data/mc/running_analysis/.../job_logs` and later moved to `/fefs/aswg/data/mc/analysis_logs/.../`.
Every time a full MC production is launched, two files with logging information are created:

- `log_reduced_Prod{3,5}_{PROD_ID}.yml`
- `log_onsite_mc_r0_to_dl3_Prod{3,5}_{PROD_ID}.yml`

The first one contains a reduced summary of all the scheduled job IDs (and which particle each job corresponds to), while the second one contains the same plus all the commands passed to SLURM.
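As a purely hypothetical sketch of the kind of job-ID-to-particle mapping the reduced log holds (the actual file layout may differ):

```yaml
r0_to_dl1:
  gamma: [1234567, 1234568]
  proton: [1234569, 1234570]
```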
The directory structure and the stages to run are determined by the config stages. The job dependencies between stages are then handled automatically.
- If the full workflow is launched, directories will not be verified as containing data. Overwriting will only happen when an MC prod sharing the same `prod_id` is run and analysed on the same day.
- If each step is launched independently (advanced users), no directory will be overwritten without prior confirmation from the user.
Example of default directory structure for a prod5 MC prod:
```
/fefs/aswg/data/
├── mc/
|   ├── DL0/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
|   |   └── simtel files
|   |
|   ├── running_analysis/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
|   |   └── YYYYMMDD_v{lstchain}_{prod_id}/
|   |       └── temporary dir for r0_to_dl1 + merging stages
|   |
|   ├── analysis_logs/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
|   |   └── YYYYMMDD_v{lstchain}_{prod_id}/
|   |       ├── file_lists_training/
|   |       ├── file_lists_testing/
|   |       └── job_logs/
|   |
|   ├── DL1/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
|   |   └── YYYYMMDD_v{lstchain}_{prod_id}/
|   |       ├── dl1 files
|   |       ├── training/
|   |       └── testing/
|   |
|   ├── DL2/20200629_prod5_trans_80/{particle}/zenith_20deg/south_pointing/
|   |   └── YYYYMMDD_v{lstchain}_{prod_id}/
|   |       └── dl2 files
|   |
|   └── IRF/20200629_prod5_trans_80/zenith_20deg/south_pointing/
|       └── YYYYMMDD_v{lstchain}_{prod_id}/
|           ├── off0.0deg/
|           ├── off0.4deg/
|           └── diffuse/
|
└── models/
    └── 20200629_prod5_trans_80/zenith_20deg/south_pointing/
        └── YYYYMMDD_v{lstchain}_{prod_id}/
            ├── reg_energy.sav
            ├── reg_disp_vector.sav
            └── cls_gh.sav
```
These scripts are not meant to support real data analysis. Use at your own risk.
So far the reference pipeline is `lstchain`, and only with it is a full analysis possible. There is, however, support for `ctapipe` and `hiperta` as well. The processing up to DL1 is relatively agnostic of the pipeline; working implementations exist for all of them. In the case of `hiperta`, a custom script converts the DL1 output to `lstchain`-compatible files, and the later stages run using `lstchain` scripts. In the case of `ctapipe`, DL1 files can be produced using `ctapipe-stage1`. Once the dependency issues are solved and ctapipe 0.12 is released, this will most likely switch to using `ctapipe-process`. We currently have no plans to keep supporting older versions longer than necessary. Because the files are not compatible with `lstchain` and there is no support for higher data levels in `ctapipe` yet, it is not possible to use any of the subsequent stages. This might change in the future.