This README contains instructions on how to run the code for the coursework.
The repository can be cloned with
git clone git@gitlab.developers.cam.ac.uk:phy/data-intensive-science-mphil/s1_assessment/ap2387.git
The conda environment can be created using the conda_env.yml file, which contains all the packages needed to run the code:
conda env create -n mphil --file conda_env.yml
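Once created, the environment can be activated with the standard conda command before running any of the scripts:
conda activate mphil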
The final report is presented in ap2387.pdf. The file is generated using LaTeX, but the LaTeX-related files are not committed, as specified in the .gitignore file.
The scripts to run the exercises can be found in the src folder, whereas the file Helpers/HelperFunctions.py contains the definitions of the Model() classes, as well as additional functions used in the code.
The code for part c) can be found in src/solve_part_c.py. The code can be run with several parser options, which can be listed with the following command:
python src/solve_part_c.py -h
The available options are -a, --alpha, -b, --beta and -n, --nentries, which allow one to change the interval range and the number of entries. In addition, the --plots flag can be used if one wants the plots to be displayed and not only saved.
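For reference, a possible invocation could look as follows (the numerical values are purely illustrative):
python src/solve_part_c.py -a 5.0 -b 5.6 -n 100 --plots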
The code will then generate N = n_entries models, selecting n_entries random values for the parameters. If the --plots flag is selected, the code will also produce a plot displaying the pdfs of all the models. This plot is then saved in the plots/ folder, under the name Part_c/a_alpha_beta_n_entries.pdf. It is recommended not to use this flag if the number of entries is greater than 1000, because otherwise the plot is too cluttered and takes a long time to open (roughly 85 seconds for 10000 entries).
Afterwards, the code will numerically evaluate the integral of the pdf over the chosen interval.
The code for part d) can be found in src/solve_part_d.py. This code simply plots the true distributions of the signal, the background and the combined pdf on the same canvas, fixing the values of the parameters. The --plots flag can be selected if you want to see the plot while running the code.
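For example, to run the script and display the plot while it runs:
python src/solve_part_d.py --plots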
The code for part e) can be found in src/solve_part_e.py. For this part, an additional option can be passed using -p, --points, which specifies how many points the user wants to generate (100000 by default).
This code generates p points according to the pdf of the Model, using the accept/reject method, and then performs an ML fit in order to estimate the parameters. The fit is performed with the minuit package and, after the minimisation of the NLL, the code prints out the best-fit values of the parameters with their uncertainties, as well as the Hessian matrix of the parameters. As an output, the code plots the generated sample with the estimated signal, background and total pdf overlaid. This plot can be found in plots/Part_e/fit.pdf.
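As an illustrative example (assuming the --plots flag is also available here, as for the other scripts), a run with the default number of points could be launched with:
python src/solve_part_e.py -p 100000 --plots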
The code for part f) can be found in src/solve_part_f.py. This code tests the discovery rate of our model, evaluating how much data must be collected to claim a discovery. A --fit flag was implemented, which forces the code to re-do the fit step for all the different dataset sizes. This flag is turned off by default, so that the plot can be generated more quickly.
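For example, to force the fits to be re-run for every dataset size (at the cost of a longer runtime):
python src/solve_part_f.py --fit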
The code for part g) can be found in src/solve_part_g.py. The structure of this code is essentially the same as that of parts e) and f), except that a different model is used. Therefore, three options can be passed via the command line: -p, --points, which specifies the number of points to be generated for a single fit to the generated data, as well as the usual --plots and --fit flags, which respectively show the plots and run the fit step to evaluate the discovery rate for each dataset size. Similarly to part e), the plot displaying the generated sample with the estimated signal, background and total pdf overlaid is saved in plots/Part_g/fit.pdf, whereas the true pdf distribution, like the one generated in part d), can be found in plots/Part_g/true_pdf.pdf. Finally, the discovery rate for different dataset sizes is plotted in plots/Part_g/discovery_rates.pdf and, similarly to part f), the numpy array containing the discovery rate values is saved in data/discovery_rates_g.npy, since the code takes roughly 26 minutes to run.
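As an example, a full run of this part, re-doing the fit step and showing the plots, could be launched with (the value passed to -p is illustrative):
python src/solve_part_g.py -p 100000 --plots --fit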