feat: Improved FPE monitoring #2157

paulgessinger · 2023-05-26T16:11:50Z

The goal is to not fail a job as soon as an FPE occurs. Instead, the signal handler masks that FPE type, takes a stack trace, and resumes execution. The sequencer can then unmask the type again for the next algorithm. I implemented the resuming based on discussion with @stephenswat, and only for x86_64 for now. The monitor keeps stack traces, accumulates them across algorithms / events / threads, deduplicates them, and can print a summary at the end, looking something like this:

============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/pagessin/dev/acts, configfile: pytest.ini, testpaths: Examples/Python/tests
plugins: pytest_check-1.0.4, rerunfailures-10.2, xdist-3.2.1
collected 227 items / 226 deselected / 1 selected

Examples/Python/tests/test_fpe.py 18:10:42    Sequencer      INFO      Create Sequencer with -1 threads
18:10:42    Sequencer      INFO      Add Algorithm 'FpeMaker'
18:10:42    Sequencer      INFO      Processing events [0, 30)
18:10:42    Sequencer      INFO      Starting event loop with -1 threads
18:10:42    Sequencer      INFO        0 context decorators
18:10:42    Sequencer      INFO        1 sequence elements
18:10:42    Sequencer      INFO        0 readers
18:10:42    Sequencer      INFO        1 algorithms
18:10:42    Sequencer      INFO        0 writers
SIGACTION
floating point divide by zero
18:10:42    Sequencer      INFO      finished event 0
SIGACTION
floating point overflow
18:10:42    Sequencer      INFO      finished event 1
SIGACTION
floating point invalid operation
18:10:42    Sequencer      INFO      finished event 2
SIGACTION
floating point divide by zero
SIGACTION
floating point divide by zero
SIGACTION
floating point overflow18:10:42    Sequencer      INFO      finished event 15

18:10:42    Sequencer      INFO      finished event 7
SIGACTION
floating point invalid operation
18:10:42    Sequencer      INFO      finished event 5
SIGACTION
floating point divide by zero
SIGACTION
floating point overflow
18:10:42    Sequencer      INFO      finished event 18
SIGACTION
floating point invalid operation
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 26
18:10:43    Sequencer      INFO      finished event 6
SIGACTION
floating point overflow
18:10:43    Sequencer      INFO      finished event 16
18:10:43    Sequencer      INFO      finished event 22
SIGACTION
floating point overflow
SIGACTION
floating point overflow18:10:43    Sequencer      INFO      finished event 28

SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 19
18:10:43    Sequencer      INFO      finished event 23
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 8
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 27
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 17
SIGACTION
floating point invalid operation
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 20
SIGACTION
floating point divide by zero
SIGACTION
floating point divide by zero
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 9
18:10:43    Sequencer      INFO      finished event 24
SIGACTION
floating point overflow
SIGACTION
floating point invalid operation
SIGACTION
floating point overflow
SIGACTION
floating point overflow
SIGACTION
floating point overflow
18:10:43    Sequencer      INFO      finished event 25
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 3
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 13
18:10:43    Sequencer      INFO      finished event 29
18:10:43    Sequencer      INFO      finished event 21
18:10:44    Sequencer      INFO      finished event 10
18:10:44    Sequencer      INFO      finished event 14
18:10:44    Sequencer      INFO      finished event 4
18:10:44    Sequencer      INFO      finished event 12
18:10:44    Sequencer      INFO      finished event 11
FPE result summary:
- INTDIV: 0
- INTOVF: 0
- FLTDIV: 10
- FLTOVF: 10
- FLTUND: 0
- FLTRES: 0
- FLTINV: 10
- FLTSUB: 0

Stack traces:
- FLTDIV: (10 times)
 0# pybind11::cpp_function::initialize<pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#9}, void, , pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#9}&&, void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:224
 1# pybind11::cpp_function::dispatcher(_object*, _object*, _object*) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:929
 2# PyCFunction_Call in /home/pagessin/dev/acts/bindvenv/bin/python3
 3# _PyObject_MakeTpCall in /home/pagessin/dev/acts/bindvenv/bin/python3
 4# _PyEval_EvalFrameDefault in /home/pagessin/dev/acts/bindvenv/bin/python3
 5# _PyFunction_Vectorcall in /home/pagessin/dev/acts/bindvenv/bin/python3
 6# 0x000000000050B23C in /home/pagessin/dev/acts/bindvenv/bin/python3
 7# PyObject_CallObject in /home/pagessin/dev/acts/bindvenv/bin/python3
 8# pybind11::object pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, ActsExamples::AlgorithmContext const&>(ActsExamples::AlgorithmContext const&) const at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/cast.h:1631
 9# ActsExamples::IAlgorithm::internalExecute(ActsExamples::AlgorithmContext const&) at /home/pagessin/dev/acts/Examples/Framework/include/ActsExamples/Framework/IAlgorithm.hpp:51
10# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}::operator()(tbb::blocked_range<unsigned long> const&) const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:455
11# tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}, tbb::auto_partitioner const>::execute() at /usr/include/tbb/parallel_for.h:144
12# 0x00007F431EE37545 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
13# 0x00007F431EE3780F in /usr/lib/x86_64-linux-gnu/libtbb.so.2
14# 0x00007F431EE34B68 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
15# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:418
- FLTOVF: (10 times)
 0# pybind11::cpp_function::initialize<pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#10}, void, , pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#10}&&, void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:224
 1# pybind11::cpp_function::dispatcher(_object*, _object*, _object*) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:929
 2# PyCFunction_Call in /home/pagessin/dev/acts/bindvenv/bin/python3
 3# _PyObject_MakeTpCall in /home/pagessin/dev/acts/bindvenv/bin/python3
 4# _PyEval_EvalFrameDefault in /home/pagessin/dev/acts/bindvenv/bin/python3
 5# _PyFunction_Vectorcall in /home/pagessin/dev/acts/bindvenv/bin/python3
 6# 0x000000000050B23C in /home/pagessin/dev/acts/bindvenv/bin/python3
 7# PyObject_CallObject in /home/pagessin/dev/acts/bindvenv/bin/python3
 8# pybind11::object pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, ActsExamples::AlgorithmContext const&>(ActsExamples::AlgorithmContext const&) const at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/cast.h:1631
 9# ActsExamples::IAlgorithm::internalExecute(ActsExamples::AlgorithmContext const&) at /home/pagessin/dev/acts/Examples/Framework/include/ActsExamples/Framework/IAlgorithm.hpp:51
10# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}::operator()(tbb::blocked_range<unsigned long> const&) const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:455
11# tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}, tbb::auto_partitioner const>::execute() at /usr/include/tbb/parallel_for.h:144
12# 0x00007F431EE37545 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
13# 0x00007F431EE3780F in /usr/lib/x86_64-linux-gnu/libtbb.so.2
14# 0x00007F431EE34B68 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
15# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:418
- FLTINV: (10 times)
 0# pybind11::cpp_function::initialize<pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#11}, void, , pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#11}&&, void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:224
 1# pybind11::cpp_function::dispatcher(_object*, _object*, _object*) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:929
 2# PyCFunction_Call in /home/pagessin/dev/acts/bindvenv/bin/python3
 3# _PyObject_MakeTpCall in /home/pagessin/dev/acts/bindvenv/bin/python3
 4# _PyEval_EvalFrameDefault in /home/pagessin/dev/acts/bindvenv/bin/python3
 5# _PyFunction_Vectorcall in /home/pagessin/dev/acts/bindvenv/bin/python3
 6# 0x000000000050B23C in /home/pagessin/dev/acts/bindvenv/bin/python3
 7# PyObject_CallObject in /home/pagessin/dev/acts/bindvenv/bin/python3
 8# pybind11::object pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, ActsExamples::AlgorithmContext const&>(ActsExamples::AlgorithmContext const&) const at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/cast.h:1631
 9# ActsExamples::IAlgorithm::internalExecute(ActsExamples::AlgorithmContext const&) at /home/pagessin/dev/acts/Examples/Framework/include/ActsExamples/Framework/IAlgorithm.hpp:51
10# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}::operator()(tbb::blocked_range<unsigned long> const&) const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:455
11# tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}, tbb::auto_partitioner const>::execute() at /usr/include/tbb/parallel_for.h:144
12# 0x00007F431EE37545 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
13# 0x00007F431EE3780F in /usr/lib/x86_64-linux-gnu/libtbb.so.2
14# 0x00007F431EE34B68 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
15# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:418
18:10:44    Sequencer      INFO      Processed 30 events in 1.961989 s (wall clock)
18:10:44    Sequencer      INFO      Average time per event: 327.442344 ms/event
.
----------------------------- Root file has checks -----------------------------
NOTE: Root file hash checks were skipped, enable with ROOT_HASH_CHECKS=on
See https://acts.readthedocs.io/en/latest/examples/python_bindings.html#root-file-hash-regression-checks for more details

====================== 1 passed, 226 deselected in 3.49s =======================
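
For readers unfamiliar with the mask-and-resume step from the description, here is a minimal sketch of how it can work on x86_64 with glibc. It is not the code in `FpeMonitor.cpp`: the names `onFpe`, `divideWithFpeMonitoring`, `fpeCount` and `kFpeTrappingSupported` are hypothetical, the stack trace capture is left out, and the sketch assumes the operation runs on SSE, so only the MXCSR register is patched. On other platforms it falls back to a plain division without trapping.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <csignal>

#if defined(__GLIBC__) && defined(__x86_64__)
#include <fenv.h>  // feenableexcept / fedisableexcept (glibc extensions)
#include <signal.h>
#include <ucontext.h>

constexpr bool kFpeTrappingSupported = true;

namespace {
volatile std::sig_atomic_t g_fpeCount = 0;

// Hypothetical handler: count the FPE, mask its type in the saved register
// state, and return so the faulting instruction is re-executed while masked.
void onFpe(int /*signal*/, siginfo_t* info, void* rawContext) {
  g_fpeCount = g_fpeCount + 1;
  int flag = FE_ALL_EXCEPT;  // unknown code: mask everything
  switch (info->si_code) {
    case FPE_FLTDIV: flag = FE_DIVBYZERO; break;
    case FPE_FLTOVF: flag = FE_OVERFLOW; break;
    case FPE_FLTINV: flag = FE_INVALID; break;
    default: break;
  }
  auto* context = static_cast<ucontext_t*>(rawContext);
  // In MXCSR the mask bits sit 7 bits above the matching status flag bits.
  context->uc_mcontext.fpregs->mxcsr |= static_cast<unsigned>(flag) << 7;
}
}  // namespace

double divideWithFpeMonitoring(double numerator, double denominator) {
  struct sigaction action {};
  struct sigaction previous {};
  action.sa_sigaction = onFpe;
  action.sa_flags = SA_SIGINFO;
  sigemptyset(&action.sa_mask);
  sigaction(SIGFPE, &action, &previous);

  std::feclearexcept(FE_ALL_EXCEPT);
  feenableexcept(FE_DIVBYZERO | FE_OVERFLOW | FE_INVALID);  // unmask traps
  volatile double n = numerator;
  volatile double d = denominator;
  volatile double result = n / d;  // may raise SIGFPE, then resumes
  fedisableexcept(FE_ALL_EXCEPT);  // back to the default: all masked
  sigaction(SIGFPE, &previous, nullptr);
  return result;
}
#else
constexpr bool kFpeTrappingSupported = false;

namespace {
volatile std::sig_atomic_t g_fpeCount = 0;
}  // namespace

// Fallback without trapping, so the sketch still compiles elsewhere.
double divideWithFpeMonitoring(double numerator, double denominator) {
  return numerator / denominator;
}
#endif

// Number of FPEs seen by the handler so far.
int fpeCount() { return g_fpeCount; }
```

Calling `divideWithFpeMonitoring(1.0, 0.0)` on a supported platform should trigger the handler once and then return infinity, because the division is repeated with the divide-by-zero exception masked. Re-enabling the exceptions afterwards corresponds to the unmasking that the sequencer does before the next algorithm.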

Currently, this doesn't fail the job. The plan is to implement a masking mechanism based on the source file and line of the top-level stack frame, as well as summation per algorithm / reader / writer rather than a single global summary.
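
To illustrate the bookkeeping, the following is a rough sketch of deduplicating FPEs by their top stack frame, merging counts from several sources, and applying masks keyed on file and line. All types and names here (`FpeType`, `TopFrame`, `Mask`, `FpeSummary`) are hypothetical stand-ins and not the classes in `FpeMonitor.hpp`. Matching the file by suffix is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical subset of the FPE types shown in the summary.
enum class FpeType { FLTDIV, FLTOVF, FLTINV };

// Deduplication key: FPE type plus source location of the top stack frame.
struct TopFrame {
  FpeType type;
  std::string file;
  int line;
  bool operator<(const TopFrame& other) const {
    return std::tie(type, file, line) <
           std::tie(other.type, other.file, other.line);
  }
};

// A mask silences one FPE type at one source location.
struct Mask {
  FpeType type;
  std::string fileSuffix;  // assumption: match on the end of the path
  int line;
};

bool endsWith(const std::string& text, const std::string& suffix) {
  return text.size() >= suffix.size() &&
         text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

bool isMasked(const TopFrame& frame, const std::vector<Mask>& masks) {
  for (const Mask& mask : masks) {
    if (mask.type == frame.type && mask.line == frame.line &&
        endsWith(frame.file, mask.fileSuffix)) {
      return true;
    }
  }
  return false;
}

class FpeSummary {
 public:
  // Record n occurrences; identical top frames share one entry.
  void add(const TopFrame& frame, std::size_t n = 1) { m_counts[frame] += n; }

  // Fold in the counts from another summary (e.g. another thread).
  void merge(const FpeSummary& other) {
    for (const auto& entry : other.m_counts) {
      add(entry.first, entry.second);
    }
  }

  std::size_t count(const TopFrame& frame) const {
    auto it = m_counts.find(frame);
    return it == m_counts.end() ? 0 : it->second;
  }

  std::size_t distinct() const { return m_counts.size(); }

  // Total number of occurrences that no mask covers.
  std::size_t unmaskedCount(const std::vector<Mask>& masks) const {
    std::size_t total = 0;
    for (const auto& entry : m_counts) {
      if (!isMasked(entry.first, masks)) {
        total += entry.second;
      }
    }
    return total;
  }

 private:
  std::map<TopFrame, std::size_t> m_counts;
};
```

With one `FpeSummary` per algorithm, reader, or writer, a job could then be failed only when `unmaskedCount` is nonzero for some element, while the merged summary is still available for the final report.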


codecov · 2023-06-05T13:35:15Z

Codecov Report

Merging #2157 (8298167) into main (dfc4940) will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #2157      +/-   ##
==========================================
- Coverage   49.37%   49.36%   -0.02%     
==========================================
  Files         446      445       -1     
  Lines       25290    25259      -31     
  Branches    11657    11646      -11     
==========================================
- Hits        12488    12468      -20     
+ Misses       4515     4511       -4     
+ Partials     8287     8280       -7


github-actions · 2023-06-05T21:59:14Z

📊 Physics performance monitoring for `8298167`

Summary
Full report
Seeding: seeded, truth estimated, orthogonal
CKF: seeded, truth smeared, truth estimated, orthogonal
IVF: seeded, truth smeared, truth estimated, orthogonal
AMVF: seeded, truth smeared, truth estimated, orthogonal
Ambiguity resolution: seeded, orthogonal
Truth tracking
Truth tracking (GSF)

[Plot images not reproduced. Sections: Vertexing (Vertexing vs. mu; IVF and AMVF, each seeded, truth_smeared, truth_estimated, orthogonal); Seeding (seeded, truth_estimated, orthogonal); CKF (seeded, truth_smeared, truth_estimated, orthogonal); Ambiguity resolution (seeded); Truth tracking (Kalman Filter); Truth tracking (GSF).]

paulgessinger · 2023-06-17T07:50:23Z

It's green. Let's quickly merge it before it breaks again @andiwand 😅

paulgessinger · 2023-06-17T13:45:39Z

Need to debug the compiler segfault on Monday.

andiwand

Let's get this in.

Referenced in #2086:

Our full chain pulls are in a bad state. It looks like the reconstruction and simulation energy loss did not match up. This PR switches the Fatras interactions on, which should bring our pulls back to a standard normal distribution.

Fixes
- #1643

Blocked by
- #2157
- #2239
- #2295
- #2293
- #2294

paulgessinger added 9 commits May 25, 2023 16:22

basic improvement in!

f7ce657

fix boost backtrace setup

0bebb8b

fpemonitor bundles up stack traces!

31dfd0b

add count merging

2808e54

stack trace accumulation

5b4e304

refactor, add opaque wrapper around boost stack trace

e464475

stack trace deduplication based on top frame

fdbc71d

python level test harness

3ec89ae

prototype for sequencer level monitoring, no failure or masking yet

1d46873

paulgessinger added this to the next milestone May 26, 2023

add python fpe test alg

2cbb82a

andiwand self-requested a review May 26, 2023 16:32

andiwand reviewed May 26, 2023


andiwand reviewed May 28, 2023


paulgessinger and others added 5 commits May 31, 2023 10:31

Update Tests/UnitTests/Core/Utilities/CMakeLists.txt

9e61b9d

Co-authored-by: Andreas Stefl <stefl.andreas@gmail.com>

Update Core/include/Acts/Utilities/FpeMonitor.hpp

a666e09

Co-authored-by: Andreas Stefl <stefl.andreas@gmail.com>

Merge remote-tracking branch 'origin/main' into feat/fpe-imprv

be85c04

compile fixes

d210bf4

Merge branch 'feat/fpe-imprv' of github.com:paulgessinger/acts into f…

26e4c84

…eat/fpe-imprv

fpe monitor and stack trace collection uses preallocated buffer

580a336

paulgessinger and others added 8 commits June 6, 2023 15:11

progress

2bbd024

masking mechanism!

99972be

format

f39ceab

move to plugin

9e7b636

adding the plugin

95ac652

compile fixes

584a99f

x86_64

986fcec

switch to using manual buffer

5be424f

paulgessinger and others added 8 commits June 15, 2023 17:11

||true

f531711

add physmon artifact

d2e9a74

bump ccache size to 2G, always upload physmon output

a696eee

physmon path

ee38214

another mask

ca89f33

skip fpe tests if disabled

5932c22

re-fix gsf tests

58944ba

move envvar based override into C++

cb7f79d

andiwand approved these changes Jun 17, 2023


andiwand added the automerge label Jun 17, 2023

kodiakhq bot and others added 2 commits June 17, 2023 10:55

Merge branch 'main' into feat/fpe-imprv

3b3ba3d

reduce number of build threads

3c33ea7

paulgessinger added 🚧 WIP Work-in-progress and removed automerge labels Jun 17, 2023

andiwand and others added 2 commits June 18, 2023 21:56

Merge branch 'main' into feat/fpe-imprv

a624841

compilation fix

d76dcce

paulgessinger removed the 🚧 WIP Work-in-progress label Jun 19, 2023

Merge branch 'main' into feat/fpe-imprv

8298167

andiwand added the automerge label Jun 19, 2023

andiwand approved these changes Jun 19, 2023


kodiakhq bot merged commit 04bfbbc into acts-project:main Jun 19, 2023

paulgessinger deleted the feat/fpe-imprv branch June 19, 2023 15:27

github-actions bot removed the automerge label Jun 19, 2023

paulgessinger modified the milestones: next, v27.0.0 Jun 20, 2023

AJPfleger mentioned this pull request Nov 1, 2023

chore: Remove unused FpeMonitorTests from core #2607

Merged
