
Usage

The library aims to offer tools for two main operations:

  • reading your data from its original format (e.g. plain-text or ROOT files) into python lists, and
  • writing that data out in the YAML-based HEPData format.

All of this happens in a user-friendly python interface. :ref:`sec-usage-reading` is helpful if you need help getting your data as a python list. If you already have your data accessible in python, great! Skip right ahead to :ref:`sec-usage-writing`.

The following sections describe both operations in detail.

HEPData and its data format

The HEPData data model revolves around Tables and Variables. At its core, a Variable is a one-dimensional array of numbers with some additional (meta-)data, such as uncertainties, units, etc. assigned to it. A Table is simply a set of multiple Variables. This definition will immediately make sense to you when you think of a general table, which has multiple columns representing different variables.

Reading data

Reading from plain text

If you save your data in a text file, a simple-to-use tool is the numpy.loadtxt function, which loads column-wise data from plain-text files and returns it as a numpy.array.

import numpy as np
my_array = np.loadtxt("some_file.txt")

A detailed example is available here. For documentation on the loadtxt function, please refer to the numpy documentation.
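If your file stores several columns, the unpack=True argument of loadtxt splits them into one array per column. A minimal sketch (the file contents and variable names below are made up for illustration):

```python
import numpy as np

# Write a small example file as a stand-in for your own data file.
with open("some_file.txt", "w") as f:
    f.write("1.0 10.0\n2.0 5.0\n3.0 2.0\n")

# unpack=True returns one array per column instead of one row-wise array.
mass, limit = np.loadtxt("some_file.txt", unpack=True)

# hepdata_lib Variables take plain python lists, so convert if needed.
mass_list = mass.tolist()
limit_list = limit.tolist()
```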

Reading from ROOT files

In many cases, data from the experiments is available as one of several ROOT data types, such as TGraph, TH1, or TH2, which are saved in *.root files.

To facilitate reading these objects, the RootFileReader class is provided. The reader is instantiated by passing a path to the ROOT file to read from:

from hepdata_lib import RootFileReader
reader = RootFileReader("/path/to/myfile.root")

After initialization, individual methods are provided for access to different types of objects stored in the file.

  • Reading TGraph, TGraphErrors, TGraphAsymmErrors: RootFileReader.read_graph
  • Reading TH1: RootFileReader.read_hist_1d
  • Reading TH2: RootFileReader.read_hist_2d

While the details of each function are adapted to their respective use cases, they follow a common input/output logic. The methods are called by providing the path to the object inside the ROOT file. They return a dictionary containing lists of all relevant numbers that can be extracted from the object, such as x values, y values, uncertainties, etc.

As an example, if a TGraph is saved with the name mygraph in the directory topdir/subdir inside the ROOT file, it can be retrieved as:

data = reader.read_graph("topdir/subdir/mygraph")

Since a graph is simply a set of (x,y) pairs for each point, the data dictionary will have two key/value pairs:

  • key "x" -> list of x values.
  • key "y" -> list of y values.

More complex information will be returned for TGraphErrors, etc, which can also be read in this manner. For detailed descriptions of the extraction logic and returned data, please refer to the documentation of the individual methods.
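Since all readers return plain dictionaries of lists, the extracted numbers can be handled with ordinary python. A minimal sketch of consuming such a dictionary (hand-made here as a stand-in, since reading a real file requires ROOT):

```python
# Stand-in for the dictionary returned by RootFileReader.read_graph
# (a real call needs a ROOT file, so this is assumed example data).
data = {"x": [1.0, 2.0, 3.0], "y": [10.0, 5.0, 2.0]}

# Pair up the points, e.g. to inspect, filter, or reorder them
# before filling hepdata_lib Variables.
points = list(zip(data["x"], data["y"]))
```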

An example notebook shows how to read histograms from a ROOT file.

Writing data

Following the HEPData data model, hepdata_lib implements four main classes for writing data:

  • Submission
  • Table
  • Variable
  • Uncertainty

The Submission object

The Submission object is the central object where all threads come together. It represents the whole HEPData entry and thus carries the top-level metadata that applies equally to all the tables and variables you may want to enter. The object is also used to create the physical submission files you will upload to the HEPData web interface.

When using hepdata_lib to make an entry, you always need to create a Submission object. The most bare-bone submission consists of only a Submission object with no data in it:

from hepdata_lib import Submission
sub = Submission()
outdir = "./output"
sub.create_files(outdir)

The create_files function writes all the YAML output files you need and packs them up in a tar.gz file ready to be uploaded.

Please note: creating the output files also creates a submission folder containing the individual files going into the tarball. This folder exists merely for convenience, in order to make it easy to inspect each individual file. It is not recommended to attempt to manually manage or edit the files in the folder, and there is no guarantee that hepdata_lib will handle any of the changes you make in a graceful manner. As far as we are aware, there is no use case where manual editing of the files is necessary. If you have such a use case, please report it in a Github issue.

Adding resource links or files

Additional resources, hosted either externally or locally, can be linked with the add_additional_resource function of the Submission object.

sub.add_additional_resource("Web page with auxiliary material", "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/STDM-2012-02/")
sub.add_additional_resource("Some file", "root_file.root", copy_file=True)
sub.add_additional_resource("Some file", "root_file.root", copy_file=True, resource_license={"name": "CC BY 4.0", "url": "https://creativecommons.org/licenses/by/4.0/", "description": "This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator."})
sub.add_additional_resource("Archive of full likelihoods in the HistFactory JSON format", "Likelihoods.tar.gz", copy_file=True, file_type="HistFactory")
sub.add_additional_resource("Selection and projection function examples", "analysis.cxx", copy_file=True, file_type="ProSelecta")

The first argument is a description and the second is the location of the external link or local resource file. Several optional arguments are supported:

  • copy_file=True (default value of False) will copy a local file into the output directory.
  • resource_license can be used to define a data license for an additional resource. It is given as a dictionary with mandatory string name and url values, and an optional description.
  • file_type="HistFactory" (default value of None) can be used to identify statistical models provided in the HistFactory JSON format rather than relying on certain trigger words in the description (see pyhf section of submission documentation).
  • file_type="ProSelecta" can be used to identify C++ snippets in the ProSelecta format for use with the NUISANCE framework for event generators in neutrino physics (see NUISANCE section of submission documentation).

Please note: The default license applied to all data uploaded to HEPData is CC0. You do not need to specify a license for a resource file unless it differs from CC0.

The add_link function can alternatively be used to add a link to an external resource:

sub.add_link("Web page with auxiliary material", "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/STDM-2012-02/")

Again, the first argument is a description and the second is the location of the external link.

Adding links to related records

To add a link to a related record object, you can use the add_related_recid function of the Submission object.

Please note: values must be entered as integers.

sub.add_related_recid(1)
sub.add_related_recid(2)
sub.add_related_recid(3)

In the last example, we are adding a link to the submission with the record ID value of 3.

Please note: This field should not be used for self-referencing, the IDs inserted should be for OTHER related records.

The documentation for this feature can be found here: Linking records.

Tables and Variables

The real data is stored in Variables and Tables. Variables come in two flavors: independent and dependent. Whether a variable is independent or dependent may change with context, but the general idea is that the independent variable is what you put in, and the dependent variable is what comes out. For example, if you calculate a cross-section limit as a function of the mass of a hypothetical new particle, the mass would be independent and the limit dependent. The number of variables of either type is not limited, so if you have a scenario where you give N results as a function of M model parameters, you can have N dependent and M independent variables. All the variables are then bundled up and added to a Table object.

Let's see what this looks like in code:

from hepdata_lib import Table, Variable

mass = Variable("Graviton mass",
                is_independent=True,
                is_binned=False,
                units="GeV")
mass.values = [ 1, 2, 3 ]

limit = Variable("Cross-section limit",
                is_independent=False,
                is_binned=False,
                units="fb")
limit.values = [ 10, 5, 2 ]

table = Table("Graviton limits")
table.add_variable(mass)
table.add_variable(limit)

That's it! We have successfully created the Table and Variables and stored our results in them. The only task left is to tell the Submission object about our new Table:

sub.add_table(table)

After we have done this, the table will be included in the output files the Submission.create_files function writes (see :ref:`sec-usage-submission`).

Binned Variables

The above example uses unbinned Variables, which means that every point is simply a single number reflecting a localized value. In many cases, it is useful to use binned Variables, e.g. to represent the x axis of a histogram. In this case, everything works the same way as in the unbinned case, except that we have to specify is_binned=True in the Variable constructor, and change how we format the list of values:

mass_binned = Variable("Same mass as before, but this time it's binned",
                       is_binned=True,
                       is_independent=True)
mass_binned.values = [ (0.5, 1.5), (1.5, 2.5), (2.5, 3.5) ]

The list of values has an entry for each bin of the Variable. The entry is a tuple, where the first entry represents the lower edge of the bin, while the second entry represents the upper edge of the bin. You can simply plug this definition into the code snippet of the unbinned case above to go from an unbinned mass to a binned value. Note that binning a Variable only really makes sense for independent variables.
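If your binning is available as a list of bin edges, as is typical for a histogram axis, the (lower, upper) tuples can be built by pairing consecutive edges. A minimal sketch in plain python, using made-up edge values:

```python
# Bin edges, e.g. from a histogram axis (assumed example values).
edges = [0.5, 1.5, 2.5, 3.5]

# Pair consecutive edges into (lower, upper) tuples, one per bin.
binned_values = list(zip(edges[:-1], edges[1:]))
# binned_values == [(0.5, 1.5), (1.5, 2.5), (2.5, 3.5)]
```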

Two-dimensional plots

In some cases, you may want to define information based on multiple parameters, e.g. in the case of a two-dimensional histogram (TH2 in ROOT). This can be easily accomplished by defining two independent Variables in the same Table:

table = Table("A 2D table")

x = Variable("Variable on the x axis",
             is_independent=True,
             is_binned=True)
# x.values = [ ... ]

y = Variable("Variable on the y axis",
             is_independent=True,
             is_binned=True)
# y.values = [ ... ]

v1 = Variable("A variable depending on x and y",
              is_independent=False,
              is_binned=False)
# v1.values = [ ... ]

v2 = Variable("Another variable depending on x and y",
              is_independent=False,
              is_binned=False)
# v2.values = [ ... ]

table.add_variable(x)
table.add_variable(y)
table.add_variable(v1)
table.add_variable(v2)

Note that you can add as many dependent Variables as you would like, and that you can also make the independent variables unbinned.

One common use case with more than one independent Variable is that of correlation matrices. A detailed example implementation of this case is available here.
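Note that each entry across the independent and dependent variables must describe the same (x, y) cell, so a two-dimensional grid has to be flattened into parallel lists of equal length. A sketch of that flattening in plain python, with made-up bins and values:

```python
import itertools

# Assumed example binnings for the two independent variables.
x_bins = [(0, 1), (1, 2)]
y_bins = [(0, 10), (10, 20)]

# One entry per (x, y) cell, in a fixed order.
cells = list(itertools.product(x_bins, y_bins))
x_values = [c[0] for c in cells]
y_values = [c[1] for c in cells]

# Dependent values must follow the same cell order (made-up numbers).
z_values = [0.1, 0.2, 0.3, 0.4]

assert len(x_values) == len(y_values) == len(z_values)
```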

Adding a plot thumbnail to a table

HEPData supports the addition of thumbnail images to each table. This makes it easier for consumers of your entry to find what they are looking for, since they can simply look for the table that has the thumbnail of the plot they are interested in. If you have the full-size plot available on your drive, you can add it to your entry very easily:

table.add_image("path/to/image.pdf")

The library code then takes care of all the necessary steps, like converting the image to the right format and size, and copying it into your submission folder. The conversion relies on the ImageMagick library, and will only work if the convert command is available on your machine.

Adding resource links or files

In the same way as for the Submission object, additional resources, hosted either externally or locally, can be linked with the add_additional_resource function of the Table object.

table.add_additional_resource("Web page with auxiliary material", "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/STDM-2012-02/")
table.add_additional_resource("Some file", "root_file.root", copy_file=True)

For a description of the arguments, see :ref:`sec-usage-resource` for the Submission object. A possible use case is to attach the data for the table in its original format before it was transformed into the HEPData YAML format.

Adding keywords to a table

To make HEPData entries more searchable, keywords should be used to define what information is shown in a table. HEPData keeps track of keywords separately from the rest of the information in an entry, and provides dedicated functionality to search for and filter by a given set of keywords. If a user is interested in, for example, finding all tables relevant to graviton production, they can do so quite easily if the tables are labelled properly. This becomes much harder, or even impossible, if no keywords are used. It is therefore considered good practice to add a number of sensible keywords to your tables.

The keywords are stored as a simple dictionary for each table:

table.keywords["observables"] = ["ACC", "EFF"]
table.keywords["reactions"] = ["P P --> GRAVITON --> W+ W-", "P P --> WPRIME --> W+/W- Z0"]

In this example, we specify that the observables shown in a table are acceptance ("ACC") and efficiency ("EFF"). We also specify the reaction we are talking about, in this case graviton or W' production with decays to SM gauge bosons. This code snippet is taken from one of our examples.

Lists of recognized keywords are available from the hepdata documentation for Observables, Phrases, and Particles.

Adding links to related tables

To add a link to a related table object, you can use the add_related_doi function of the Table class.

Please note: your DOIs must match the format: 10.17182/hepdata.[RecordID].v[Version]/t[Table].

table.add_related_doi("10.17182/hepdata.72886.v2/t3")
table.add_related_doi("10.17182/hepdata.12882.v1/t2")

In the second example, we are adding a link to the table with a DOI value of 10.17182/hepdata.12882.v1/t2.

Please note: This field should not be used for self-referencing, the DOIs inserted should be for OTHER related tables.

The documentation for this feature can be found here: Linking tables.
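Since the DOIs must match the 10.17182/hepdata.[RecordID].v[Version]/t[Table] pattern, it can be worth validating them before calling add_related_doi. A hedged sketch using the standard re module (the pattern below is our reading of the format, not an official validator):

```python
import re

# Our reading of the expected format:
# 10.17182/hepdata.[RecordID].v[Version]/t[Table]
DOI_PATTERN = re.compile(r"^10\.17182/hepdata\.\d+\.v\d+/t\d+$")

def looks_like_table_doi(doi):
    """Return True if the DOI matches the expected related-table format."""
    return bool(DOI_PATTERN.match(doi))
```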

Adding a data license

You can add data license information to a table using the add_data_license function of the Table class. This function takes mandatory name and url string arguments, as well as an optional description.

Please note: The default license applied to all data uploaded to HEPData is CC0. You do not need to specify a license for a data table unless it differs from CC0.

table.add_data_license("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0/")
table.add_data_license("CC BY 4.0", "https://creativecommons.org/licenses/by/4.0/", "This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.")

Uncertainties

In many cases, you will want to give uncertainties on the central values provided in the Variable objects. Uncertainties can be symmetric or asymmetric (up and down variations of the central value either have the same or different magnitudes). For symmetric uncertainties, the values of the uncertainties are simply stored as a one-dimensional list. For asymmetric uncertainties, the up- and downward variations are stored as a list of two-component tuples:

from hepdata_lib import Uncertainty
unc1 = Uncertainty("A symmetric uncertainty", is_symmetric=True)
unc1.values = [ 0.1, 0.3, 0.5 ]

unc2 = Uncertainty("An asymmetric uncertainty", is_symmetric=False)
unc2.values = [ (-0.08, +0.15), (-0.13, +0.20), (-0.18, +0.27) ]

After creating the Uncertainty objects, the only additional step is to attach them to the Variable:

variable.add_uncertainty(unc1)
variable.add_uncertainty(unc2)

See Uncertainties for more guidance. In particular, note that hepdata_lib will omit the errors key from the YAML output if all uncertainties are zero for a particular bin, printing a warning message "Note that bins with zero content should preferably be omitted completely from the HEPData table". A legitimate use case is where there are multiple dependent variables and a (different) subset of the bins has missing content for some dependent variables. In this case the uncertainties should be set to zero for the missing bins with a non-numeric central value like '-'. The warning message can be suppressed by passing an optional argument zero_uncertainties_warning=False when defining an instance of the Variable class. Furthermore, note that None can be used to suppress the uncertainty for individual bins in cases where the uncertainty components may only apply to a subset of the values.
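If the up and down variations come as two separate lists, as they often do from fitting tools, the two-component tuple format shown above can be built with zip. A minimal sketch with made-up numbers:

```python
# Assumed example values: separate downward and upward variations.
down = [-0.08, -0.13, -0.18]
up = [0.15, 0.20, 0.27]

# Asymmetric Uncertainty values are stored as (down, up) tuples.
asym_values = list(zip(down, up))
# e.g. asym_values[0] == (-0.08, 0.15)
```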