Add pythonizations for collection subscript #570

m-fila · 2024-03-18T17:34:59Z

BEGINRELEASENOTES

Added mechanism to load cppyy pythonizations
Added pythonization for bound-check subscript operation in collections

ENDRELEASENOTES

#458 and key4hep/EDM4hep#288 show there is some interest in tuning the cppyy generated bindings. The cppyy has a feature for this called pythonization -> a callback can be registered for all the classes that fulfill some predicate (e.g. derived from podio::CollectionBase).

This PR contains:

an abstraction for adding pythonizations (each pythonization implemented in a class derived from podio.pythonizations.utils.Pythonizer)
pythonization adding bound check __getitem__ for classes derived from podio::CollectionBase

The mechanism of loading the pythonizations is inspired by ROOT pythonizations with a change that de order of loading of the pythonizations is defined.

To load the pythonizations in a downstream projects, for instance for project utilizing a "edm4hep" namespace

from podio.pythonizations import load_pythonizations

load_pythonizations("edm4hep")

alternatively a selection of pythonizations can be loaded by importing and applying them one by one

The alternative mechanism I tough about:

c++ callback pythonization - instead of in python the callback to cppyy could be defined in C++ using the Python API. The downsides are extra dependency for C++ code (Python::Module) and knowing the Python C API is required
generating python wrappers with jinja2 - it would be rather cumbersome to add another set jinja2 templates for tuning the python bindings

tmadlener · 2024-03-21T09:37:15Z

If I understand correctly this would effectively give an "opt-in" solution for generated EDMs, as the EDM has to explicitly load these pythonizations. For EDM4hep this would not really be a problem, as we have some python glue code already in any case.

For me this looks like a good solution, but we should add a bit of documentation on how to use this in generated EDMs.

m-fila · 2024-03-21T09:58:58Z

If I understand correctly this would effectively give an "opt-in" solution for generated EDMs, as the EDM has to explicitly load these pythonizations. For EDM4hep this would not really be a problem, as we have some python glue code already in any case.

yes, it's opt-in. load_pythonizations could be added to EDM4hep's __init__.py to load on import

For me this looks like a good solution, but we should add a bit of documentation on how to use this in generated EDMs.

Thanks, I will add it

python/podio/pythonizations/__init__.py

python/podio/pythonizations/collection_subscript.py

jmcarcell · 2024-03-25T06:47:27Z

Note that this will make indexing significantly slower:
With this PR and the one in EDM4hep:

In [2]: import edm4hep

In [3]: coll = edm4hep.MCParticleCollection()

In [4]: coll.create()
Out[4]: <cppyy.gbl.edm4hep.MutableMCParticle object at 0x55f984f08120>

In [5]: %timeit coll[0]
3.45 µs ± 72.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

and without this PR:

2.46 µs ± 26.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

It should be said somewhere that the current behaviour is you get an object when indexing out of bounds and then it crashes when you try to do something with that object. But I think the case of indexing out of bounds should be rare since typically you run either a for loop without indexes or if it has indexes then it's bounded by len(coll).

python/podio/pythonizations/collection_subscript.py

m-fila · 2024-03-25T09:23:50Z

Note that this will make indexing significantly slower

Thanks for pointing this out, I'll check if this can be improved by mapping __getitem__ to at and raising IndexError

m-fila · 2024-03-25T16:09:35Z

I tried benchmarking on following function:

def bench():
    try:
        coll[0]
    except:
        pass

coll = MCParticleCollection()
bench() # give cppyy time to load lazily
%timeit bench()

coll.create()
bench() # give cppyy time to load lazily
%timeit bench()

	Bound check	`__getitem__` implementation	Average time, no-exception branch	Average time, exception branch
1.	❌	Default cppyy mapping to `operator[]`	683 ns	-
2.	✔️	Originally proposed in this PR, bound-check in python	1.2 µs	779 ns
3.	✔️	Mapped to `at`	730 ns	30.8 µs
4.	✔️	`at` wrapped to raise `IndexError`	826 ns	32.1 µs

The implementations using at seems to be more performant in the usual case. Would it be acceptable to go with option 3. or 4.?

A C++ callback added to the Collection template could potentially achieve the same as at (option 3.) on the no-exception branch and slightly better than python side bound-check (option 2.) on the exception branch. But then extending with other c++ callbacks would be very ugly

The performance of at based implementations is poor if there is an exception, because by design cppyy tries to evaluate all eligible overloads if there was a throw (in this caseat(size_t) and at(size_t) const ). Since cppyy-1.7.0 it's possible to explicitly select between const and non-const overload, but it's newer than what we have

tmadlener · 2024-03-27T09:27:09Z

I see there has been quite some discussion (and progress?) here. A few comments / questions from my side in no particular order:

Did you by any chance check how the times compare to pure c++? Mainly for comparison, I don't think we learn too much from them other than a 30 % increase on the python side is pretty negligible compared to pure c++.
I think the exceptional case should not be the (main) driver for performance considerations. After all you are dealing with an error at that point and not with something "usual".
IIUC all of the presented methods are still possible with python callbacks, right?

m-fila · 2024-03-27T14:13:30Z

Did you by any chance check how the times compare to pure c++? Mainly for comparison, I don't think we learn too much from them other than a 30 % increase on the python side is pretty negligible compared to pure c++.

Pure C++ google benchmark between a single element access with operator[] vs at:

----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
squareBracketsNoError       4.59 ns         4.59 ns    152241304
atNoError                   5.26 ns         5.25 ns    131108856
atError                     1552 ns         1550 ns       452227

where

static void squareBracketsNoError(benchmark::State& state) {
  edm4hep::MCParticleCollection coll;
  coll.create();
  for (auto _ : state) {
    try {
      benchmark::DoNotOptimize(coll[0]);
      benchmark::ClobberMemory();
    } catch (const std::out_of_range&) {
    }
  }
}

static void atNoError(benchmark::State& state) {
  edm4hep::MCParticleCollection coll;
  coll.create();
  for (auto _ : state) {
    try {
      benchmark::DoNotOptimize(coll.at(0));
      benchmark::ClobberMemory();
    } catch (const std::out_of_range&) {
    }
  }
}

static void atError(benchmark::State& state) {
  edm4hep::MCParticleCollection coll;
  coll.create();
  for (auto _ : state) {
    try {
      benchmark::DoNotOptimize(coll.at(1));
      benchmark::ClobberMemory();
    } catch (const std::out_of_range&) {
    }
  }
}

both this and python benchmark are isolated single access, looping over the collection will probably give different results

IIUC all of the presented methods are still possible with python callbacks, right?

Yes, all the alternatives from that table were python callbacks

I think the exceptional case should not be the (main) driver for performance considerations. After all you are dealing with an error at that point and not with something "usual".

In the last commit I changed the implementation proposed in PR to option 4 from the table. It seems to me as a reasonable compromise

tmadlener

Thanks for also checking the c++ times. The orders of magnitude difference between c++ and python looks like expected, I would say.

I would also vote for option 4 here. I just have one minor question below.

python/podio/pythonizations/collection_subscript.py

jmcarcell

This functionality should be in podio. For EDM4hep, I would like to have only what's necessary since there can be some drawbacks for interactive usage

tmadlener · 2024-04-09T08:20:41Z

Can you rebase this onto the latest master to pick up the RNTuple API fixes? That should also make CI pass again.

Co-authored-by: Juan Miguel Carceller <22276694+jmcarcell@users.noreply.github.com>

Co-authored-by: Thomas Madlener <thomas.madlener@desy.de>

m-fila · 2024-04-25T07:46:47Z

Sorry for the delay. Rebased and added documentation. I also split the callback into two functions:

conditional to filter classes to by modified
modification body doing actual work here

I think previously it was not clear that callback should start with some conditional statement to select what should be modified

doc/python.md

tmadlener · 2024-04-26T09:29:45Z

python/podio/test_CodeGen.py

+        with self.assertRaises(IndexError):
+            _ = collection[20]


This fails on dev3 due to some overload resolution issue. I would assume it's a ROOT issue that we are hitting here, but I am not sure yet which one.

Well an exception is thrown, but of a different type than expected:

Traceback (most recent call last): File "/home/runner/work/podio/podio/python/podio/test_CodeGen.py", line 50, in test_bound_check _ = collection[20] File "/home/runner/work/podio/podio/python/podio/pythonizations/collection_subscript.py", line 22, in get_item return self.at(i) TypeError: none of the 2 overloaded methods succeeded. Full details: nsp::EnergyInNamespace nsp::EnergyInNamespaceCollection::at(size_t index) => out_of_range: deque::_M_range_check: __n (which is 20)>= this->size() (which is 1) nsp::MutableEnergyInNamespace nsp::EnergyInNamespaceCollection::at(size_t index) => out_of_range: deque::_M_range_check: __n (which is 20)>= this->size() (which is 1)

According to the docs if there is an exception then cppyy checks the rest of overloads and if they all throw the same type then the exception is propagated, if they all throw different types then TypeError is risen
Here both overloads throws std::out_of_range but somehow it isn't recognized as the same type

The possible solutions I could think of now are:

explicitly select overload - here the overloads are const and non-const. Selection between const and non-const was added in cppyy-1.7.0, nightlies have 1.6.2

change pythonization to catch either cppyy.gbl.std.out_of_range or TypeError and raise IndexError- I don't like it becasue it's also used for other things, like actual type mismatch, eg. calling with str instead of int

allow both types of exception in test - getting an exception is already an improvement, the message says it's out of range. The users get confused why it's TypeError not IndexError

Maybe this can be boiled down to a small reproducer that we can submit to ROOT because they probably didn't want to introduce too many "breaking" changes in their update of the cppyy they bundle.

Opened an issue wlav/cppyy#230 as the problems appears also outside of ROOT bundle

I also made the ROOT folks aware via root-project/root#15375

doc/python.md

Co-authored-by: Thomas Madlener <thomas.madlener@desy.de>

tmadlener · 2024-05-14T19:33:36Z

Looks like all the necessary fixes have landed in upstream root (master and v6.32). IIRC this can be merged now, right?

m-fila · 2024-05-14T19:48:52Z

Looks like all the necessary fixes have landed in upstream root (master and v6.32). IIRC this can be merged now, right?

Awesome 😎 Yes, it's ready to go

hegner · 2024-05-14T19:54:30Z

and off we go...

m-fila mentioned this pull request Mar 21, 2024

add loading pythonizations from podio key4hep/EDM4hep#290

Merged

jmcarcell reviewed Mar 25, 2024

View reviewed changes

python/podio/pythonizations/__init__.py Outdated Show resolved Hide resolved

python/podio/pythonizations/collection_subscript.py Outdated Show resolved Hide resolved

jmcarcell reviewed Mar 25, 2024

View reviewed changes

python/podio/pythonizations/collection_subscript.py Outdated Show resolved Hide resolved

tmadlener reviewed Mar 28, 2024

View reviewed changes

python/podio/pythonizations/collection_subscript.py Outdated Show resolved Hide resolved

tmadlener reviewed Apr 2, 2024

View reviewed changes

python/podio/pythonizations/collection_subscript.py Outdated Show resolved Hide resolved

jmcarcell approved these changes Apr 3, 2024

View reviewed changes

tmadlener linked an issue Apr 9, 2024 that may be closed by this pull request

Better out-of-bounds checks for collection access #458

Closed

m-fila and others added 6 commits April 24, 2024 11:14

added pythonizer base class

598e039

added collection subscript pythonization

8172c85

Update python/podio/pythonizations/__init__.py

d972f41

Co-authored-by: Juan Miguel Carceller <22276694+jmcarcell@users.noreply.github.com>

collection __getitem__ uses at wrapped to throw python exception

734b786

fix exception stacktrace readability

f724f4f

Update python/podio/pythonizations/collection_subscript.py

a489371

Co-authored-by: Thomas Madlener <thomas.madlener@desy.de>

m-fila force-pushed the pythonizations branch from 7ba418b to a489371 Compare April 24, 2024 14:22

m-fila added 2 commits April 24, 2024 16:24

split pythonization callback to predicate and modifcation

d438c5d

added documentation

16f8a7d

m-fila force-pushed the pythonizations branch from 5fabb23 to 16f8a7d Compare April 24, 2024 14:49

tmadlener reviewed Apr 26, 2024

View reviewed changes

Applied suggestions in docs

a7cf951

Co-authored-by: Thomas Madlener <thomas.madlener@desy.de>

hegner merged commit a18229b into AIDASoft:master May 14, 2024
18 checks passed

m-fila deleted the pythonizations branch May 23, 2024 13:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pythonizations for collection subscript #570

Add pythonizations for collection subscript #570

m-fila commented Mar 18, 2024

tmadlener commented Mar 21, 2024

m-fila commented Mar 21, 2024

jmcarcell commented Mar 25, 2024 •

edited

Loading

m-fila commented Mar 25, 2024

m-fila commented Mar 25, 2024 •

edited

Loading

tmadlener commented Mar 27, 2024

m-fila commented Mar 27, 2024

tmadlener left a comment

jmcarcell left a comment

tmadlener commented Apr 9, 2024

m-fila commented Apr 25, 2024

tmadlener Apr 26, 2024

m-fila Apr 26, 2024

tmadlener Apr 29, 2024

m-fila Apr 29, 2024

tmadlener Apr 29, 2024

tmadlener commented May 14, 2024

m-fila commented May 14, 2024

hegner commented May 14, 2024

Add pythonizations for collection subscript #570

Add pythonizations for collection subscript #570

Conversation

m-fila commented Mar 18, 2024

tmadlener commented Mar 21, 2024

m-fila commented Mar 21, 2024

jmcarcell commented Mar 25, 2024 • edited Loading

m-fila commented Mar 25, 2024

m-fila commented Mar 25, 2024 • edited Loading

tmadlener commented Mar 27, 2024

m-fila commented Mar 27, 2024

tmadlener left a comment

Choose a reason for hiding this comment

jmcarcell left a comment

Choose a reason for hiding this comment

tmadlener commented Apr 9, 2024

m-fila commented Apr 25, 2024

tmadlener Apr 26, 2024

Choose a reason for hiding this comment

m-fila Apr 26, 2024

Choose a reason for hiding this comment

tmadlener Apr 29, 2024

Choose a reason for hiding this comment

m-fila Apr 29, 2024

Choose a reason for hiding this comment

tmadlener Apr 29, 2024

Choose a reason for hiding this comment

tmadlener commented May 14, 2024

m-fila commented May 14, 2024

hegner commented May 14, 2024

jmcarcell commented Mar 25, 2024 •

edited

Loading

m-fila commented Mar 25, 2024 •

edited

Loading