
Add ability to read in YODA files #229

Open
GraemeWatt opened this issue Jul 6, 2023 · 10 comments
Labels: enhancement (New feature or request)

Comments

@GraemeWatt (Member) commented Jul 6, 2023

For cases where an analyser already has data in the YODA format for use with Rivet, it would be useful if hepdata_lib could read YODA files and convert them to the HEPData YAML format. Ideally, YODA would be an optional rather than a mandatory dependency. Converting YODA to HEPData YAML has been a long-standing request (HEPData/hepdata-converter#10), but it would be better handled by hepdata_lib than by hepdata-converter.

Cc: @20DM
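
A minimal sketch of what such a reader might look like, assuming the optional yoda Python module and hedging on its point-accessor names (these vary between YODA versions, so treat them as placeholders rather than a proposed hepdata_lib API):

```python
# Rough sketch only: convert one YODA Scatter2D into a hepdata_lib Table.
# The yoda point accessors used below (points, xMin, xMax, y, yErrs) are
# assumptions; check them against your installed YODA version.
from hepdata_lib import Table, Variable, Uncertainty

def yoda_scatter_to_table(filename, path):
    import yoda  # imported lazily so yoda stays an optional dependency
    aos = yoda.read(filename)     # dict: analysis-object path -> object
    points = aos[path].points()   # assumed Scatter2D point accessor

    x = Variable("x", is_independent=True, is_binned=True)
    x.values = [(p.xMin(), p.xMax()) for p in points]  # assumed accessors

    y = Variable("y", is_independent=False, is_binned=False)
    y.values = [p.y() for p in points]                 # assumed accessor
    tot = Uncertainty("total", is_symmetric=False)
    tot.values = [(-p.yErrs()[0], p.yErrs()[1]) for p in points]  # assumed
    y.add_uncertainty(tot)

    table = Table(path.strip("/").replace("/", "_"))
    table.add_variable(x)
    table.add_variable(y)
    return table
```

Importing yoda inside the function keeps it an optional dependency, as requested above.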

GraemeWatt added the enhancement (New feature or request) label on Jul 6, 2023
@20DM (Contributor) commented Jul 22, 2024

Hi Graeme!

I'm in the process of preparing submissions for the reference data files in Rivet that don't have a HepData entry yet. I'm currently struggling to use hepdata_lib for cases with inhomogeneous error breakdowns across bins. For instance, I have a distribution with three bins where the first two bins have error components 'A' and 'B' (but not 'C') and the third bin has error component 'C' (but not 'A' and 'B').

I know this is supported in principle, e.g. by simply omitting the respective components in the dictionary. However, when using the library, hepdata_lib/helpers.py raises a ValueError:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

Is there a trick?
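
For concreteness, the ragged structure behind this error can be reproduced directly in NumPy (illustrative values, not from a real analysis):

```python
# Illustrative reproduction of the NumPy error quoted above: a ragged
# per-bin error structure cannot be turned into a regular array.
import numpy as np

per_bin_errors = [
    {"A": 0.1, "B": 0.05},   # bin 1: components A and B
    {"A": 0.2, "B": 0.04},   # bin 2: components A and B
    {"C": 0.3},              # bin 3: component C only
]
ragged = [list(e.values()) for e in per_bin_errors]  # lengths 2, 2, 1
np.array(ragged)  # ValueError: ... inhomogeneous shape after 1 dimensions ...
```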

@20DM (Contributor) commented Jul 22, 2024

PS - just to be clear: of course I can "make it pass" by setting the missing uncertainties to zero, but then all bins will have three uncertainty components, some of them zero, which is not the same as the bin not having that component in its breakdown to begin with. I think the problem is that the check for non-zero uncertainties only tests whether there is at least one non-zero component and then adds all of them, regardless of their values. Can we make this more flexible?
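
In hepdata_lib terms, the zero-padding workaround looks roughly like this (a sketch with made-up values):

```python
# Sketch of the zero-padding workaround: every bin carries all three
# components, with zeros standing in for "not present", which is not
# the same as omitting the component from that bin's breakdown.
from hepdata_lib import Variable, Uncertainty

y = Variable("observable", is_independent=False, is_binned=False)
y.values = [1.0, 2.0, 3.0]

for label, values in [("A", [0.1, 0.2, 0.0]),
                      ("B", [0.05, 0.04, 0.0]),
                      ("C", [0.0, 0.0, 0.3])]:
    unc = Uncertainty(label)  # symmetric by default
    unc.values = values
    y.add_uncertainty(unc)
```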

@20DM (Contributor) commented Jul 23, 2024

On a different note: we have a few cases with a discrete (string) axis where a subset of the edges is technically a floating-point range. The library then throws an error like this:

error - independent_variable 'value' must not be a string range (use 'low' and 'high' to represent a range): '1.225-1.300' in 'independent_variables[0].values[6].value' (expected: {'type': 'number or string (not a range)'})

Of course I agree that a discrete axis where all bins are of the form float - float should just be a continuous axis, and it's great that the validator enforces this. However, there are also a number of examples on HepData with a mix of these kinds of bins and genuine discrete bins, so we might want to allow this kind of axis in general, no?

One simple example I'm looking at has two bins = [ "7 - 8", "13" ] corresponding to LHC centre-of-mass energies. One could get around the error by splitting this table into two tables with a continuous [7.0, 8.0] bin and a discrete [ "13" ] bin, respectively, but then the two measurement points would not end up in the same plot without additional post-processing, which seems a shame. 🤔

@20DM (Contributor) commented Jul 23, 2024

On second thought, I suspect this requirement comes from cases where a differential distribution is prepended/appended with a single bin corresponding to the average, which probably shouldn't be allowed. Maybe it's best to leave the validator as is; I will work around these cases (there are only 5 of them, so it should be manageable).

@GraemeWatt (Member, Author) commented Jul 23, 2024

This error comes from the hepdata-validator package rather than hepdata_lib. A common encoding mistake was for uploaders to specify a bin as a single value with the bin limits separated by a hyphen rather than giving separate low and high values (HEPData/hepdata-validator#33), so we implemented a check to catch it. I think hepdata_lib does not support mixed bins such as {low: 7, high: 8} and value: 13, although this is allowed in the HEPData YAML format. You could use {low: 13, high: 13} (unless a zero-width bin causes problems?) or use a separator other than "-" for the discrete bin "7 - 8", like "7 to 8" or "7 & 8".
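
For reference, the mixed-bin encoding that the HEPData YAML format itself allows would look something like this (a hand-written illustration, assuming a centre-of-mass-energy axis):

```yaml
# Illustration of a mixed axis in the HEPData YAML format: one binned
# value and one point value on the same independent variable.
independent_variables:
- header: {name: SQRT(S), units: TeV}
  values:
  - {low: 7, high: 8}   # binned centre-of-mass range
  - {value: 13}         # genuine discrete point on the same axis
```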

@20DM (Contributor) commented Jul 23, 2024

Well, there were only 5 cases where I encountered this issue, so I've just replaced the dash with a "to" or "&", depending on the context. It's sufficiently rare that this is probably good enough for now.

Good news, though: I've now managed to create submission tarballs that make the validator happy for all of the Rivet reference files that don't have a HepData entry yet. There's a total of 780 tarballs. What's the best way to submit them? I hope I don't have to upload them through the browser one by one? 😉

@20DM (Contributor) commented Jul 23, 2024

PS - I have a guest account for the IPPP cluster if it would be helpful for me to upload them there somewhere?

@GraemeWatt (Member, Author) commented

Great work! You should log into hepdata.net and click "Request Coordinator Privileges" on your Dashboard, then enter "Rivet" as the Experiment/Group. You can then click the "Submit" button to initiate a submission with an INSPIRE ID and specify an Uploader and Reviewer (maybe just yourself in both roles, unless you want a check from someone else). This will create an empty record that allows you to upload, then the record can be reviewed (there's a shortcut "Approve All Tables") and finalised from your Dashboard.

In terms of automation, we haven't yet encountered a need for bulk uploads like this, so unfortunately there's no easy way to finalise 780 records. The upload stage could be done from the command line (or from Python) using the hepdata-cli tool (see Example 9), but it requires an invitation cookie specific to each record. Record creation, reviewing and finalisation can only be done from the web interface. It might be possible to (semi-)automate these steps using something like Selenium, but I think each record should undergo a basic visual check by a human before it is finalised. I suggest you perform the create/upload/review/finalise workflow manually for a few records until you see what is involved; then you can decide whether it is worthwhile to write scripts to (semi-)automate the procedure.

@GraemeWatt (Member, Author) commented

I've approved your Coordinator request. I realised that we already have a module for bulk imports, originally written to import records from hepdata.net into a developer's local instance. Previously, we had a similar module for bulk migration of records from the old HepData site to the new hepdata.net site. The importer module bypasses the web interface of the normal submission system, so it would be a more efficient way of importing a large number of tarballs. If you could copy the tarballs to a web-accessible location and provide a list of INSPIRE IDs in a format similar to https://www.hepdata.net/search/ids?inspire_ids=true, I'll look into making the necessary changes to the importer module. I've opened a new issue (HEPData/hepdata#811), so please continue the discussion there, as it no longer relates to hepdata_lib.
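
(For clarity, that URL returns a plain JSON array of integer INSPIRE IDs, e.g. [1234567, 1234568] with made-up numbers here, so a simple text or JSON list in that shape should be easy to produce.)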

@20DM (Contributor) commented Jul 24, 2024

Great - thank you!! 🙏
