Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to deal with syscall interruption due to (runtime) signals #853

Closed
lmb opened this issue Nov 15, 2022 · 0 comments · Fixed by #887
Closed

how to deal with syscall interruption due to (runtime) signals #853

lmb opened this issue Nov 15, 2022 · 0 comments · Fixed by #887
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@lmb
Copy link
Collaborator

lmb commented Nov 15, 2022

BPF_PROG_LOAD and BPF_PROG_RUN are interrupted when a signal is sent to the thread executing the syscall. This leads to an unfortunate interaction with the Go runtime, which uses signals to implement the following features:

  • SIGURG for async preemption
  • SIGPROF for runtime.CPUProfile
  • SIGSETXID for syscall.AllThreadSyscall

In my opinion, the library must deal with all of these gracefully.

Async preemption

Async preemption is not used when a thread is in a syscall, since that is a "blocked safe point" already. No need to worry about SIGURG therefore.

CPU profiling

The runtime periodically sends SIGPROF to all threads if CPU profiling is enabled. This includes threads in a syscall.

We can block SIGPROF at the expense of less accurate profiles. This solves the most common cause of problems. We currently do this for BPF_PROG_LOAD but not for BPF_PROG_RUN.

All threads syscall

syscall.AllThreadSyscall sends SIGSETXID to invoke runPerThreadSyscall from the signal handler.

Blocking SIGSETXID would prevent invoking runPerThreadSyscall and cause the runtime to hang. We have to leave the signal unblocked and retry the syscall instead.

Since we can't know when a syscall interruption was due to AllThreadsSyscall we have to always retry.

Conclusion

I think we should extend the BPF_PROG_LOAD behaviour to BPF_PROG_RUN:

  • Mask SIGPROF to reduce noise from the profiler
  • Retry on EINTR from PROG_RUN due to AllThreadsSyscall

I expect that AllThreadsSyscall interrupting PROG_RUN is actually very rare. So I propose to also remove RunOptions.Reset and instead document that AllThreadsSyscall or signals may skew the results of a Run. Most importantly we have to document that RunOptions.Repeat is basically a lower bound of how often a program is run due to retries.

@lmb lmb added the documentation Improvements or additions to documentation label Nov 15, 2022
@lmb lmb self-assigned this Dec 12, 2022
lmb added a commit to lmb/ebpf that referenced this issue Dec 12, 2022
Extend the profiler blocking we introduced for BPF_PROG_LOAD to
BPF_PROG_RUN, with the difference that EINTR is not handled in the
syscall wrapper itself. This allows us to have PROG_RUN specific
retry logic.

Deprecate RunOptions.Reset since its main motivation was to
account for profiler skew. Interruptions from the runtime are only
possible via runtime.AllThreadSyscall, which is very specialised
anyways. This doesn't protect from interruptions due to signals
that are being sent from "outside" the process.

Document that RunOptions.Repeat is a lower bound for how often
a program is run due to the retry behaviour. Also add a
workaround that makes the common case of repeat == 1 not suffer
from interruptions.

Fixes cilium#853
lmb added a commit to lmb/ebpf that referenced this issue Dec 12, 2022
Extend the profiler blocking we introduced for BPF_PROG_LOAD to
BPF_PROG_RUN, with the difference that EINTR is not handled in the
syscall wrapper itself. This allows us to have PROG_RUN specific
retry logic.

Deprecate RunOptions.Reset since its main motivation was to
account for profiler skew. Interruptions from the runtime are only
possible via runtime.AllThreadSyscall, which is very specialised
anyways. This doesn't protect from interruptions due to signals
that are being sent from "outside" the process, which will still
skew results and cause retries.

Document that RunOptions.Repeat is a lower bound for how often
a program is run due to the retry behaviour. Also add a
workaround that makes the common case of repeat == 1 not suffer
from interruptions.

Fixes cilium#853
lmb added a commit to lmb/ebpf that referenced this issue Dec 12, 2022
Extend the profiler blocking we introduced for BPF_PROG_LOAD to
BPF_PROG_RUN, with the difference that EINTR is not handled in the
syscall wrapper itself. This allows us to have PROG_RUN specific
retry logic.

Deprecate RunOptions.Reset since its main motivation was to
account for profiler skew. Interruptions from the runtime are only
possible via runtime.AllThreadSyscall, which is very specialised
anyways. This doesn't protect from interruptions due to signals
that are being sent from "outside" the process, which will still
skew results and cause retries.

Document that RunOptions.Repeat is a lower bound for how often
a program is run due to the retry behaviour. Also add a
workaround that makes the common case of repeat == 1 not suffer
from interruptions.

Fixes cilium#853
lmb added a commit to lmb/ebpf that referenced this issue Dec 13, 2022
Extend the profiler blocking we introduced for BPF_PROG_LOAD to
BPF_PROG_RUN, with the difference that EINTR is not handled in the
syscall wrapper itself. This allows us to have PROG_RUN specific
retry logic.

Deprecate RunOptions.Reset since its main motivation was to
account for profiler skew. Interruptions from the runtime are only
possible via runtime.AllThreadSyscall, which is very specialised
anyways. This doesn't protect from interruptions due to signals
that are being sent from "outside" the process, which will still
skew results and cause retries.

Document that RunOptions.Repeat is a lower bound for how often
a program is run due to the retry behaviour. Also add a
workaround that makes the common case of repeat == 1 not suffer
from interruptions.

Fixes cilium#853
ti-mo pushed a commit that referenced this issue Dec 13, 2022
Extend the profiler blocking we introduced for BPF_PROG_LOAD to
BPF_PROG_RUN, with the difference that EINTR is not handled in the
syscall wrapper itself. This allows us to have PROG_RUN specific
retry logic.

Deprecate RunOptions.Reset since its main motivation was to
account for profiler skew. Interruptions from the runtime are only
possible via runtime.AllThreadSyscall, which is very specialised
anyways. This doesn't protect from interruptions due to signals
that are being sent from "outside" the process, which will still
skew results and cause retries.

Document that RunOptions.Repeat is a lower bound for how often
a program is run due to the retry behaviour. Also add a
workaround that makes the common case of repeat == 1 not suffer
from interruptions.

Fixes #853
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant