Read shapes in from pandas dataframe #548

shane-breeze · 2019-06-01T11:24:16Z

I kept converting pandas dataframes into TH1s for my binned shape fits, so I've built this conversion into combine; others might find it useful (for example, shapes can be saved in human-readable csv/json files or even Excel spreadsheets).

Changed ShapeTools.py to interpret files with the extensions [".csv", ".json", ".html", ".pkl", ".xlsx", ".h5", ".parquet"] as pandas dataframes (see the pandas IO documentation). Any other extension is handled as before, i.e. as a ROOT file. Note that multiindexed dataframes are used; for some file extensions the table has to be converted to a multiindex, with all but the last 2 columns used for indexing.
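To make the extension check and the multiindex conversion concrete, here is a minimal sketch. It is not the code in ShapeTools.py or DataFrameWrapper.py: the helper names (`is_dataframe_file`, `to_multiindex`), the column names and the numbers are all assumptions made for illustration.

```python
import io
import os

import pandas as pd

# Extensions listed in the PR description; anything else is treated as a ROOT file.
DATAFRAME_EXTENSIONS = (".csv", ".json", ".html", ".pkl", ".xlsx", ".h5", ".parquet")


def is_dataframe_file(path):
    """Return True if the file extension marks the file as a pandas dataframe."""
    return os.path.splitext(path)[1] in DATAFRAME_EXTENSIONS


def to_multiindex(df):
    """Index a flat table by every column except the last two."""
    index_columns = list(df.columns[:-2])
    return df.set_index(index_columns).sort_index()


# Hypothetical CSV content: column names and values are illustrative only.
CSV_TEXT = """channel,process,bin_low,sum_w,sum_ww
ch1,signal,0.0,1.5,0.25
ch1,signal,1.0,2.5,0.5
ch1,background,0.0,10.0,4.0
ch1,background,1.0,8.0,3.0
"""

flat = pd.read_csv(io.StringIO(CSV_TEXT))
shapes = to_multiindex(flat)

# Selecting on the leading index levels leaves one row per bin.
signal = shapes.loc[("ch1", "signal"), :]
print(signal)
```

A format like csv comes back from pandas as a flat table, which is why the index has to be rebuilt from the leading columns here.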

DataFrameWrapper.py adds a class that wraps a pandas dataframe and provides a Get method, which behaves like ROOT's TFile::Get and returns TH1s.
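The idea of a Get method on top of a table can be sketched in plain Python. This is only an illustration of the lookup pattern, not the DataFrameWrapper class: the class name, the "channel:process" key format and the returned dictionary are hypothetical, and a real implementation would build and return a TH1 instead.

```python
class ShapeTableSketch:
    """Hypothetical stand-in for a table wrapper with a TFile-like Get method."""

    def __init__(self, rows):
        # rows: iterable of (channel, process, bin_low, sum_w, sum_ww) tuples
        self._shapes = {}
        for channel, process, bin_low, sum_w, sum_ww in rows:
            key = (channel, process)
            self._shapes.setdefault(key, []).append((bin_low, sum_w, sum_ww))
        # Keep the bins of every shape ordered by their lower edge.
        for bins in self._shapes.values():
            bins.sort()

    def Get(self, name):
        """Look up "channel:process"; return None when the shape is missing."""
        channel, _, process = name.partition(":")
        bins = self._shapes.get((channel, process))
        if bins is None:
            return None
        return {
            "edges": [b[0] for b in bins],
            "contents": [b[1] for b in bins],
            "variances": [b[2] for b in bins],
        }


wrapper = ShapeTableSketch([
    ("ch1", "signal", 1.0, 2.5, 0.5),
    ("ch1", "signal", 0.0, 1.5, 0.25),
    ("ch1", "background", 0.0, 10.0, 4.0),
])
shape = wrapper.Get("ch1:signal")
print(shape)
```

Returning None for an unknown key keeps the calling code close to what it would do with a missing object in a ROOT file.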

An example csv file is included, and the following commands gave the same results (apart from file names and CPU time):

combine -M MultiDimFit --algo singles data/tutorials/shapes/simple-shapes-TH1.txt -v 4
combine -M MultiDimFit --algo singles data/tutorials/shapes/simple-shapes-df.txt -v 4

Excel spreadsheets depend on openpyxl and xlrd, parquet depends on pyarrow or fastparquet, and hdf depends on pytables (all can be installed through pip).
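A quick way to see which of these optional packages are importable in a given environment is sketched below. The helper name and the extension-to-module mapping are assumptions for illustration; the only detail beyond the list above is that PyTables is imported under the module name `tables`.

```python
import importlib.util

# Optional modules per file extension, following the dependency list above.
OPTIONAL_BACKENDS = {
    ".xlsx": ("openpyxl", "xlrd"),
    ".parquet": ("pyarrow", "fastparquet"),
    ".h5": ("tables",),  # PyTables is imported as "tables"
}


def available_backends():
    """Map each extension to the optional modules that can be imported."""
    found = {}
    for extension, modules in OPTIONAL_BACKENDS.items():
        found[extension] = [
            name for name in modules
            if importlib.util.find_spec(name) is not None
        ]
    return found


backends = available_backends()
for extension, modules in backends.items():
    print(extension, modules if modules else "no backend installed")
```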

This can be extended to unbinned fits by using a similar multiindex for channel/process categorisation and one or more columns holding the observable(s).

nucleosynthesis · 2019-06-03T08:25:10Z

Thanks @shane-breeze . Will this also support the use of the autoMCStats directive in the datacards?

shane-breeze · 2019-06-03T08:42:19Z

Yes. There's a variance column in the dataframe that fills the error of the histograms. Just tested the simple shapes card with autoMCStats and got the same results with TH1s and the dataframe.
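As a rough illustration of how a variance column maps onto histogram errors: a bin error is the square root of the bin's variance (sum of squared weights), and ROOT numbers its regular bins from 1, with bin 0 reserved for underflow. The helper below is a hypothetical sketch that only prepares the values; it covers regular bins only and is not the conversion code from this PR.

```python
import math


def histogram_fill_plan(yields, variances):
    """Return (root_bin_index, content, error) for each regular bin."""
    plan = []
    for i, (content, variance) in enumerate(zip(yields, variances)):
        # ROOT bin 0 is the underflow bin, so regular bins start at 1.
        plan.append((i + 1, content, math.sqrt(variance)))
    return plan


# Illustrative numbers only.
plan = histogram_fill_plan([4.0, 9.0], [4.0, 2.25])
for bin_index, content, error in plan:
    print(bin_index, content, error)
```

With errors filled this way, anything that reads per-bin uncertainties from the histogram sees the same inputs as it would from a TH1 stored in a ROOT file.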

shane-breeze · 2019-06-05T08:21:41Z

Added a cast of the index selection in the datacard to the index dtypes. The selection is read as a string from the datacard, but the dataframe index can be int, float, ...
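The reason for the cast can be shown with a small plain-Python sketch: a key built from datacard strings does not match an index that holds ints or floats until each value is converted to the type of its index level. The function name, the ":"-separated selection string and the lookup table are hypothetical.

```python
def cast_selection(raw_values, level_types):
    """Cast each string from the datacard to the type of its index level."""
    return tuple(cast(value) for cast, value in zip(level_types, raw_values))


# Stand-in for an index with (str, int, float) levels; values are illustrative.
yields = {("ch1", 2016, 0.5): 12.0}
level_types = (str, int, float)

raw = "ch1:2016:0.5".split(":")  # hypothetical selection, read as strings
key = cast_selection(raw, level_types)

print(tuple(raw) in yields)  # the uncast strings do not match the index
print(key in yields)         # the cast key does
```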

nsmith- · 2022-03-28T08:26:21Z

Just wondering about the label, what work needs to be done on this?

amarini · 2022-04-05T13:37:41Z

@nsmith- , the action items on this PR were the following in the last discussion:

* resolve the conflicts
* add a warning when pandas shapes are loaded that not all the features may be available

amarini · 2022-04-05T14:08:00Z

@shane-breeze , can you allow editing from maintainers?

shane-breeze · 2022-04-05T16:14:17Z

@amarini, I have unarchived my fork of this repository. It should now be writable.

nsmith- · 2022-04-05T22:43:15Z

not all the features may be available

autoMCStats is available (I checked it still works), so I think there is no missing feature. Unbinned data is anyway detected by the output of getShape so it shouldn't cause any datacard-level requirement to fail iiuc.

hcombbot · 2022-04-05T23:13:18Z

Pull Request Test.
Summary
========
Running options:
* MODE : cmssw
* COMBINE_TAG : 102x
* COMBINE_REPO : cms-analysis
* COMBINE_MERGE : shane-breeze/shapes-df
* GITHUB_PR : 548

Ratio to reference values:
--------
| comb_2019_hbb_boosted_standalone | comb_2019_hgg | comb_2019_hmm | comb_2019_htt | comb_2019_hww | comb_2019_tth_hbb | comb_2019_tth_hgg | comb_2019_tth_multilepton | comb_2019_vh_htt | comb_2019_vhbb | comb_2019_vhbb2017 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

You can find more detail at https://gitlab.cern.ch/cms-hcg/performances/ci/-/pipelines/3810969

Shane Breeze added 9 commits June 1, 2019 09:36

Add DataFrameWrapper to load shape TH1s from a dataframe in ShapeBuilder

733db07

DataFrameWrapper: Implement Get to return a TH1 using a staticmethod convert_to_th1

4d0fc74

DataFrameWrapper: user specified (in the datacard) columns to select for event yield and variance

d22f183

Add simple shape examples for reading in a dataframe

f09fae5

Rename simple shapes df input

a9588a5

DataFrameWrapper: last bin in dataframe is placed in the overflow bin of the histogram

24963bd

Update docs - shape datacards can read in from pandas dataframe

3672a96

DataFrameWrapper: update handling of json, html and xlsx

fbfd360

DataFrameWrapper: update docstring

c8bdac4

Add missing parenthesis

1839c9a

DataFrameWrapper: cast df index selection dtypes

eac4808

ajgilbert added the needs work label Mar 30, 2020

amarini added the safe to test label Mar 31, 2022

nsmith- mentioned this pull request Apr 5, 2022

Python 3 #647

Closed

Merge branch '102x' into shapes-df

685bc98

Conflicts: python/ShapeTools.py

nsmith- removed the needs work label Apr 6, 2022

nsmith- merged commit e8a7a2a into cms-analysis:102x Apr 7, 2022

nsmith- added the port to 112x label Apr 8, 2022

nsmith- added a commit that referenced this pull request Apr 8, 2022

Cherry-pick #548 merge into 112x

d6a9253

Merge pull request #548 from shane-breeze/shapes-df Read shapes in from pandas dataframe

nsmith- mentioned this pull request Apr 8, 2022

Cherry-pick #548 merge into 112x #756

Merged

nsmith- added a commit that referenced this pull request Apr 8, 2022

Merge pull request #756 from cms-analysis/rebase_548_onto_112x

e887c1c

Cherry-pick #548 merge into 112x
