feat: Improved FPE monitoring #2157

Merged: 98 commits merged on Jun 19, 2023

paulgessinger (Member)

The goal is to not fail a job when an FPE occurs. Instead, the signal handler masks that FPE type, takes a stack trace, and resumes execution. The sequencer can then unmask the type again for the next algorithm. I implemented the resuming based on a discussion with @stephenswat, and only for x86_64 for now. The monitor keeps stack traces, accumulates them across algorithms / events / threads, deduplicates them, and can print a summary at the end, looking something like this:

============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/pagessin/dev/acts, configfile: pytest.ini, testpaths: Examples/Python/tests
plugins: pytest_check-1.0.4, rerunfailures-10.2, xdist-3.2.1
collected 227 items / 226 deselected / 1 selected

Examples/Python/tests/test_fpe.py 18:10:42    Sequencer      INFO      Create Sequencer with -1 threads
18:10:42    Sequencer      INFO      Add Algorithm 'FpeMaker'
18:10:42    Sequencer      INFO      Processing events [0, 30)
18:10:42    Sequencer      INFO      Starting event loop with -1 threads
18:10:42    Sequencer      INFO        0 context decorators
18:10:42    Sequencer      INFO        1 sequence elements
18:10:42    Sequencer      INFO        0 readers
18:10:42    Sequencer      INFO        1 algorithms
18:10:42    Sequencer      INFO        0 writers
SIGACTION
floating point divide by zero
18:10:42    Sequencer      INFO      finished event 0
SIGACTION
floating point overflow
18:10:42    Sequencer      INFO      finished event 1
SIGACTION
floating point invalid operation
18:10:42    Sequencer      INFO      finished event 2
SIGACTION
floating point divide by zero
SIGACTION
floating point divide by zero
SIGACTION
floating point overflow18:10:42    Sequencer      INFO      finished event 15

18:10:42    Sequencer      INFO      finished event 7
SIGACTION
floating point invalid operation
18:10:42    Sequencer      INFO      finished event 5
SIGACTION
floating point divide by zero
SIGACTION
floating point overflow
18:10:42    Sequencer      INFO      finished event 18
SIGACTION
floating point invalid operation
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 26
18:10:43    Sequencer      INFO      finished event 6
SIGACTION
floating point overflow
18:10:43    Sequencer      INFO      finished event 16
18:10:43    Sequencer      INFO      finished event 22
SIGACTION
floating point overflow
SIGACTION
floating point overflow18:10:43    Sequencer      INFO      finished event 28

SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 19
18:10:43    Sequencer      INFO      finished event 23
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 8
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 27
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 17
SIGACTION
floating point invalid operation
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 20
SIGACTION
floating point divide by zero
SIGACTION
floating point divide by zero
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 9
18:10:43    Sequencer      INFO      finished event 24
SIGACTION
floating point overflow
SIGACTION
floating point invalid operation
SIGACTION
floating point overflow
SIGACTION
floating point overflow
SIGACTION
floating point overflow
18:10:43    Sequencer      INFO      finished event 25
SIGACTION
floating point invalid operation
18:10:43    Sequencer      INFO      finished event 3
SIGACTION
floating point divide by zero
18:10:43    Sequencer      INFO      finished event 13
18:10:43    Sequencer      INFO      finished event 29
18:10:43    Sequencer      INFO      finished event 21
18:10:44    Sequencer      INFO      finished event 10
18:10:44    Sequencer      INFO      finished event 14
18:10:44    Sequencer      INFO      finished event 4
18:10:44    Sequencer      INFO      finished event 12
18:10:44    Sequencer      INFO      finished event 11
FPE result summary:
- INTDIV: 0
- INTOVF: 0
- FLTDIV: 10
- FLTOVF: 10
- FLTUND: 0
- FLTRES: 0
- FLTINV: 10
- FLTSUB: 0

Stack traces:
- FLTDIV: (10 times)
 0# pybind11::cpp_function::initialize<pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#9}, void, , pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#9}&&, void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:224
 1# pybind11::cpp_function::dispatcher(_object*, _object*, _object*) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:929
 2# PyCFunction_Call in /home/pagessin/dev/acts/bindvenv/bin/python3
 3# _PyObject_MakeTpCall in /home/pagessin/dev/acts/bindvenv/bin/python3
 4# _PyEval_EvalFrameDefault in /home/pagessin/dev/acts/bindvenv/bin/python3
 5# _PyFunction_Vectorcall in /home/pagessin/dev/acts/bindvenv/bin/python3
 6# 0x000000000050B23C in /home/pagessin/dev/acts/bindvenv/bin/python3
 7# PyObject_CallObject in /home/pagessin/dev/acts/bindvenv/bin/python3
 8# pybind11::object pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, ActsExamples::AlgorithmContext const&>(ActsExamples::AlgorithmContext const&) const at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/cast.h:1631
 9# ActsExamples::IAlgorithm::internalExecute(ActsExamples::AlgorithmContext const&) at /home/pagessin/dev/acts/Examples/Framework/include/ActsExamples/Framework/IAlgorithm.hpp:51
10# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}::operator()(tbb::blocked_range<unsigned long> const&) const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:455
11# tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}, tbb::auto_partitioner const>::execute() at /usr/include/tbb/parallel_for.h:144
12# 0x00007F431EE37545 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
13# 0x00007F431EE3780F in /usr/lib/x86_64-linux-gnu/libtbb.so.2
14# 0x00007F431EE34B68 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
15# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:418
- FLTOVF: (10 times)
 0# pybind11::cpp_function::initialize<pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#10}, void, , pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#10}&&, void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:224
 1# pybind11::cpp_function::dispatcher(_object*, _object*, _object*) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:929
 2# PyCFunction_Call in /home/pagessin/dev/acts/bindvenv/bin/python3
 3# _PyObject_MakeTpCall in /home/pagessin/dev/acts/bindvenv/bin/python3
 4# _PyEval_EvalFrameDefault in /home/pagessin/dev/acts/bindvenv/bin/python3
 5# _PyFunction_Vectorcall in /home/pagessin/dev/acts/bindvenv/bin/python3
 6# 0x000000000050B23C in /home/pagessin/dev/acts/bindvenv/bin/python3
 7# PyObject_CallObject in /home/pagessin/dev/acts/bindvenv/bin/python3
 8# pybind11::object pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, ActsExamples::AlgorithmContext const&>(ActsExamples::AlgorithmContext const&) const at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/cast.h:1631
 9# ActsExamples::IAlgorithm::internalExecute(ActsExamples::AlgorithmContext const&) at /home/pagessin/dev/acts/Examples/Framework/include/ActsExamples/Framework/IAlgorithm.hpp:51
10# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}::operator()(tbb::blocked_range<unsigned long> const&) const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:455
11# tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}, tbb::auto_partitioner const>::execute() at /usr/include/tbb/parallel_for.h:144
12# 0x00007F431EE37545 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
13# 0x00007F431EE3780F in /usr/lib/x86_64-linux-gnu/libtbb.so.2
14# 0x00007F431EE34B68 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
15# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:418
- FLTINV: (10 times)
 0# pybind11::cpp_function::initialize<pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#11}, void, , pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init_ActsPythonBindings(pybind11::module_&)::{lambda()#11}&&, void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:224
 1# pybind11::cpp_function::dispatcher(_object*, _object*, _object*) at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/pybind11.h:929
 2# PyCFunction_Call in /home/pagessin/dev/acts/bindvenv/bin/python3
 3# _PyObject_MakeTpCall in /home/pagessin/dev/acts/bindvenv/bin/python3
 4# _PyEval_EvalFrameDefault in /home/pagessin/dev/acts/bindvenv/bin/python3
 5# _PyFunction_Vectorcall in /home/pagessin/dev/acts/bindvenv/bin/python3
 6# 0x000000000050B23C in /home/pagessin/dev/acts/bindvenv/bin/python3
 7# PyObject_CallObject in /home/pagessin/dev/acts/bindvenv/bin/python3
 8# pybind11::object pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, ActsExamples::AlgorithmContext const&>(ActsExamples::AlgorithmContext const&) const at /home/pagessin/dev/acts/build/_deps/pybind11-src/include/pybind11/cast.h:1631
 9# ActsExamples::IAlgorithm::internalExecute(ActsExamples::AlgorithmContext const&) at /home/pagessin/dev/acts/Examples/Framework/include/ActsExamples/Framework/IAlgorithm.hpp:51
10# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}::operator()(tbb::blocked_range<unsigned long> const&) const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:455
11# tbb::interface9::internal::start_for<tbb::blocked_range<unsigned long>, ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const::{lambda(tbb::blocked_range<unsigned long> const&)#1}, tbb::auto_partitioner const>::execute() at /usr/include/tbb/parallel_for.h:144
12# 0x00007F431EE37545 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
13# 0x00007F431EE3780F in /usr/lib/x86_64-linux-gnu/libtbb.so.2
14# 0x00007F431EE34B68 in /usr/lib/x86_64-linux-gnu/libtbb.so.2
15# ActsExamples::Sequencer::run()::{lambda()#1}::operator()() const at /home/pagessin/dev/acts/Examples/Framework/src/Framework/Sequencer.cpp:418
18:10:44    Sequencer      INFO      Processed 30 events in 1.961989 s (wall clock)
18:10:44    Sequencer      INFO      Average time per event: 327.442344 ms/event
.
----------------------------- Root file has checks -----------------------------
NOTE: Root file hash checks were skipped, enable with ROOT_HASH_CHECKS=on
See https://acts.readthedocs.io/en/latest/examples/python_bindings.html#root-file-hash-regression-checks for more details

====================== 1 passed, 226 deselected in 3.49s =======================
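
The summary printed above pairs each FPE type with a count and a single deduplicated stack trace. As a rough illustration of how such accumulation could work, here is a minimal Python sketch. It is not the code from this PR (which touches Core/src/Utilities/FpeMonitor.cpp); `FpeSummary`, `add`, `merge`, `total` and `report` are hypothetical names, and a trace is assumed to be a tuple of frame strings, innermost frame first.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Tuple

# Hypothetical names throughout; this is not the ACTS FpeMonitor API.
Trace = Tuple[str, ...]  # one string per stack frame, innermost first


@dataclass
class FpeSummary:
    # Maps (fpe_type, trace) to the number of times that pair was seen.
    counts: Counter = field(default_factory=Counter)

    def add(self, fpe_type: str, trace: Trace) -> None:
        """Record one FPE; identical (type, trace) pairs share one entry."""
        self.counts[(fpe_type, tuple(trace))] += 1

    def merge(self, other: "FpeSummary") -> None:
        """Fold in results collected by another thread, event or algorithm."""
        self.counts.update(other.counts)  # Counter.update adds the counts

    def total(self, fpe_type: str) -> int:
        """Total occurrences of one FPE type across all distinct traces."""
        return sum(n for (t, _), n in self.counts.items() if t == fpe_type)

    def report(self) -> str:
        """Render a summary in the same spirit as the output above."""
        lines = ["FPE result summary:"]
        for t in sorted({t for t, _ in self.counts}):
            lines.append(f"- {t}: {self.total(t)}")
        lines.append("Stack traces:")
        for (t, trace), n in sorted(self.counts.items()):
            lines.append(f"- {t}: ({n} times)")
            lines.extend(f"{i:2d}# {frame}" for i, frame in enumerate(trace))
        return "\n".join(lines)
```

Keying on the pair of type and full trace means that the same trace seen on different threads collapses into one entry once the per-thread summaries are merged.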

Currently, this doesn't fail the job. The plan is to implement a masking mechanism based on the source file and line of the top-level stack frame, and to summarize per algorithm / reader / writer rather than in a single global summary.
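
The planned masking could look roughly like the following Python sketch, which decides whether a recorded FPE should be ignored based on its type and the source file and line of the top stack frame. This is an assumption about a design that is not part of this PR yet: `FpeMask`, `top_frame_location` and `is_masked` are hypothetical, and the frame format (`... at <file>:<line>`) is taken from the stack traces in the summary above.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical sketch of the planned masking; none of these names exist in ACTS.
# Frames with debug info end in " at <file>:<line>"; others end in " in <binary>".
_LOCATION = re.compile(r" at (?P<file>\S+):(?P<line>\d+)$")


@dataclass(frozen=True)
class FpeMask:
    fpe_type: str     # e.g. "FLTDIV"
    file_suffix: str  # matched against the end of the frame's file path
    line: int


def top_frame_location(trace: Sequence[str]) -> Optional[Tuple[str, int]]:
    """Return (file, line) of the innermost frame, or None if unavailable."""
    if not trace:
        return None
    match = _LOCATION.search(trace[0])
    if match is None:
        return None
    return match.group("file"), int(match.group("line"))


def is_masked(fpe_type: str, trace: Sequence[str], masks: Iterable[FpeMask]) -> bool:
    """True if some mask matches the FPE type and the top frame's location."""
    location = top_frame_location(trace)
    if location is None:
        return False  # without a source location nothing can be masked
    file, line = location
    return any(
        m.fpe_type == fpe_type and file.endswith(m.file_suffix) and line == m.line
        for m in masks
    )
```

Matching on a path suffix rather than the full path is a choice made here so that a mask would not depend on where the repository is checked out.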

paulgessinger added this to the next milestone May 26, 2023
andiwand self-requested a review May 26, 2023 16:32
Review threads (outdated, resolved):
- Tests/UnitTests/Core/Utilities/FpeMonitorTests.cpp (4 threads)
- Tests/UnitTests/Core/Utilities/CMakeLists.txt
- Core/src/Utilities/FpeMonitor.cpp (2 threads)
- Core/include/Acts/Utilities/FpeMonitor.hpp
codecov bot commented Jun 5, 2023

Codecov Report

Merging #2157 (8298167) into main (dfc4940) will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #2157      +/-   ##
==========================================
- Coverage   49.37%   49.36%   -0.02%     
==========================================
  Files         446      445       -1     
  Lines       25290    25259      -31     
  Branches    11657    11646      -11     
==========================================
- Hits        12488    12468      -20     
+ Misses       4515     4511       -4     
+ Partials     8287     8280       -7     

github-actions bot commented Jun 5, 2023

📊 Physics performance monitoring for 8298167

Summary
Full report
Seeding: seeded, truth estimated, orthogonal
CKF: seeded, truth smeared, truth estimated, orthogonal
IVF: seeded, truth smeared, truth estimated, orthogonal
AMVF: seeded, truth smeared, truth estimated, orthogonal
Ambiguity resolution: seeded, orthogonal
Truth tracking
Truth tracking (GSF)

[Plots not included. Panels in the report:]
- Vertexing: Vertexing vs. mu; IVF (seeded, truth_smeared, truth_estimated, orthogonal); AMVF (seeded, truth_smeared, truth_estimated, orthogonal)
- Seeding: seeded, truth_estimated, orthogonal
- CKF: seeded, truth_smeared, truth_estimated, orthogonal
- Ambiguity resolution: seeded
- Truth tracking (Kalman Filter)
- Truth tracking (GSF)

paulgessinger (Member, Author) commented Jun 17, 2023

It's green. Let's quickly merge it before it breaks again @andiwand 😅

paulgessinger added 🚧 WIP Work-in-progress and removed automerge labels Jun 17, 2023
paulgessinger (Member, Author) commented

Need to debug the compiler segfault on Monday.

paulgessinger removed the 🚧 WIP Work-in-progress label Jun 19, 2023
andiwand (Contributor) left a comment

Let's get this in

kodiakhq bot merged commit 04bfbbc into acts-project:main Jun 19, 2023
paulgessinger deleted the feat/fpe-imprv branch June 19, 2023 15:27
paulgessinger modified the milestones: next, v27.0.0 Jun 20, 2023
kodiakhq bot pushed a commit that referenced this pull request Jul 24, 2023
Our full chain pulls are in a bad state. It looks like the reconstruction and simulation energy loss did not match up. This PR switches the Fatras interactions on, which should bring our pulls back to a standard normal distribution.

Fixes
- #1643

Blocked by
- #2157
- #2239
- #2295
- #2293
- #2294
paulgessinger pushed a commit to paulgessinger/acts that referenced this pull request Jul 24, 2023
…roject#2086)
Labels: Component - Core, Component - Documentation, Component - Examples, Component - Plugins, Infrastructure, Track Fitting