Make Orbit aware of pid namespaces


Objective

Currently, Orbit implicitly assumes that the target process runs in the root namespace. There is no notion of pid namespaces in the code. This leads to mix-ups and duplication of threads when handling tids from different namespaces. To fix this, all variables in the code that hold tids from a nested namespace should be mapped to the corresponding tids in the root namespace, so that threads can be uniquely identified.

Note that the solution described in this document is not yet implemented in Orbit - see here.

Considering the non-trivial performance impact (see here) of this solution, the alternative described here might still be worth considering.

Requirements

  • Events that need to have their pid/tid fields mapped

    • user space instrumentation
    • manual instrumentation
    • Vulkan layer
  • Introspection needs to work properly; these events come from OrbitService itself (which runs in the root namespace), and their pid/tid fields must not be translated.

  • Are there other places where tids need to be mapped?

    • OrbitService keeps track of the threads spawned by the library injected into the target process for handling user space instrumentation. These threads are identified by their root namespace tids (compare GetNewOrbitThreads in src/UserSpaceInstrumentation/InstrumentProcess.cpp). These tids are transferred into the injected lib, where they are used as a blocklist so that no events from these threads are emitted. Since the injected lib resides in the target process' namespace, the tids need to be translated into the target process namespace (which can easily be done via the NSpid field of /proc/[pid]/status; see the sketch after this list).
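
As an illustration, here is a minimal sketch of such a translation (the function name is made up; error handling is reduced to returning no value). It reads the NSpid line of /proc/<tid>/status, whose last entry is the tid in the innermost namespace:

#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Translates a root namespace tid into the tid the target process sees.
// Returns nullopt if the thread is gone or the kernel lacks NSpid support.
std::optional<int> TidInTargetNamespace(int root_ns_tid) {
  std::ifstream status("/proc/" + std::to_string(root_ns_tid) + "/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("NSpid:", 0) != 0) continue;  // Not the NSpid line.
    // E.g. "NSpid:  4711  17": one entry per namespace level, outermost
    // (root) namespace first, innermost namespace last.
    std::istringstream fields(line.substr(6));
    int tid = -1;
    while (fields >> tid) {
    }  // After the loop, tid holds the last (innermost) entry.
    return tid;
  }
  return std::nullopt;
}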

Non-requirements

  • Handle problems with other namespaces (the mount namespace is an obvious candidate)
  • Make this work with multiple target processes. Orbit currently only supports profiling one process. This is baked into the code in many locations, but some places are prepared to disambiguate between processes (e.g. user space instrumentation sends an event including a ‘pid’ field). When profiling multiple processes, theoretically each of the processes could live in a different namespace. For now we will assume a single target process, either running in the root namespace or in its own.

Background

Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources and another set of processes sees a different set of resources. This doc is exclusively concerned with the ‘pid namespace’.

Some helpful context can be found e.g. here. The syscall to disassociate parts of the process execution context is called “unshare”. There is also a command line tool of the same name that allows starting a new process in a namespace (taking a number of flags specifying exactly which types of namespaces should be affected).

Helpful for testing things concerning pid namespaces: the following command gives you a bash shell in an unshared pid namespace (taken from the link above):

unshare -Urfp --mount-proc

Processes started outside any container live in the root namespace. Processes that “unshare” the pid namespace will see pids/tids from their namespace in an independent numbering scheme. However, each thread in a nested namespace also has a tid in all the namespaces above its own. Specifically, there is a root namespace tid for every thread on the system.

Currently, Orbit implicitly assumes that the target process runs in the root namespace. This leads to misattribution of events when that is not the case: the events we obtain from the kernel via perf_event_open carry tids from the root namespace, while the events obtained from within the target process (as mentioned above: user space instrumentation, manual instrumentation, and the Vulkan layer) use tids from the target process namespace. So the association between events and threads is broken. Concretely, this leads to threads showing up twice in the Orbit UI (once with each tid). Besides that, there are other problems, e.g. around matching Vulkan events with tids obtained from tracepoints.

Design ideas

We choose to run OrbitService in the root namespace and to identify threads by their root namespace tid. The tids we obtain from the kernel via perf_event_open (sampling, tracepoints, ...) are already root namespace tids.

For user space instrumentation, manual instrumentation, and the Vulkan layer, the code collecting the data runs inside the target process and therefore only sees the tids from its namespace. We will need to translate these tids. So there are two steps:

  • Maintain the mapping of thread identifiers from the target process namespace to the root namespace. Note that this mapping needs to change whenever the target process creates or exits threads. Special care needs to be taken to ensure that a thread's mapping is created before any events produced by that thread are processed.
  • Apply the mapping at the correct places.

Maintain tid mapping

At the start of the capture we parse the proc file system for the existing threads of the target process. The status file of each thread contains an NSpid field that enumerates its tid in all nested namespaces (usually this will be two: root and target process namespace).
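
A minimal sketch of this initial scan (assuming C++17 and inventing the function name) could look as follows; it walks /proc/<pid>/task and records, for every thread, the mapping from target namespace tid to root namespace tid:

#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<int, int> InitialTidMapping(int pid) {
  std::unordered_map<int, int> target_ns_to_root_ns;
  const std::filesystem::path task_dir =
      "/proc/" + std::to_string(pid) + "/task";
  for (const auto& entry : std::filesystem::directory_iterator(task_dir)) {
    std::ifstream status(entry.path() / "status");
    std::string line;
    while (std::getline(status, line)) {
      if (line.rfind("NSpid:", 0) != 0) continue;
      // E.g. "NSpid:  4711  17": the first entry is the root namespace
      // tid, the last one is the tid inside the target process' namespace.
      std::istringstream fields(line.substr(6));
      std::vector<int> tids;
      int tid = 0;
      while (fields >> tid) tids.push_back(tid);
      if (!tids.empty()) target_ns_to_root_ns[tids.back()] = tids.front();
      break;
    }
  }
  return target_ns_to_root_ns;
}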

During runtime we observe the tracepoints for task_newtask and clone{3}_exit. task_newtask is triggered for each new thread that is created. It reports the tid of the parent thread and the tid of the new thread, both in the root namespace.

Immediately after this, the clone or clone3 syscall that triggered the creation of the new thread returns. The clone exit tracepoint also provides the tid of the parent thread (in the root namespace) and the return value of clone, which is the tid of the new thread in the namespace of the target process. We use the parent tid to match these two tracepoints and thereby obtain the mapping from the tid in the namespace of the target process to the tid in the root namespace. This is done in the LinuxTracing module.
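
A condensed sketch of this matching logic (class and member names are invented for illustration; the actual implementation belongs in the LinuxTracing module):

#include <cstdint>
#include <unordered_map>

class TidMapper {
 public:
  // Called for the task_newtask tracepoint (both tids in root namespace).
  void OnTaskNewtask(uint32_t parent_tid_root_ns, uint32_t new_tid_root_ns) {
    pending_by_parent_[parent_tid_root_ns] = new_tid_root_ns;
  }

  // Called when clone/clone3 returns in the parent; `ret` is the new
  // thread's tid as seen in the parent's (target process') namespace.
  void OnCloneExit(uint32_t parent_tid_root_ns, uint64_t ret) {
    auto it = pending_by_parent_.find(parent_tid_root_ns);
    if (it == pending_by_parent_.end()) return;
    target_ns_to_root_ns_[static_cast<uint32_t>(ret)] = it->second;
    pending_by_parent_.erase(it);
  }

  // Falls back to the input tid if no mapping is known (e.g. the target
  // already runs in the root namespace).
  uint32_t ToRootNamespace(uint32_t target_ns_tid) const {
    auto it = target_ns_to_root_ns_.find(target_ns_tid);
    return it == target_ns_to_root_ns_.end() ? target_ns_tid : it->second;
  }

 private:
  std::unordered_map<uint32_t, uint32_t> pending_by_parent_;
  std::unordered_map<uint32_t, uint32_t> target_ns_to_root_ns_;
};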

Apply tid mapping

The data to map tids can easily be made available in the Visitors that translate the PerfEvents into ProducerCaptureEvent protos (compare the overview graphics below). The events are sorted by timestamp before they end up in the Visitors (more precisely: since events generally arrive out of order, we only process events older than 333 ms and assume that by then every event has arrived; there are more details and optimizations to this - compare PerfEventQueue.h). Therefore one can be sure that the tid mappings exist and are up to date when they are applied here.
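
For illustration, a drastically simplified sketch of that buffering idea (the actual logic lives in PerfEventQueue.h and is more elaborate):

#include <cstdint>
#include <queue>
#include <vector>

struct Event {
  uint64_t timestamp_ns;
  // ... payload ...
};

struct LaterTimestamp {
  bool operator()(const Event& a, const Event& b) const {
    return a.timestamp_ns > b.timestamp_ns;
  }
};

class DelayedOrderingBuffer {
 public:
  void Push(Event event) { queue_.push(event); }

  // Releases events in timestamp order, but only those older than the
  // delay, so that late arrivals can still be sorted in before them.
  template <typename Sink>
  void ProcessOldEvents(uint64_t now_ns, Sink&& sink) {
    constexpr uint64_t kDelayNs = 333'000'000;  // 333 ms.
    while (!queue_.empty() && queue_.top().timestamp_ns + kDelayNs < now_ns) {
      sink(queue_.top());
      queue_.pop();
    }
  }

 private:
  std::priority_queue<Event, std::vector<Event>, LaterTimestamp> queue_;
};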

Since the user space instrumentation events are piped through the UprobesUnwindingVisitor, this is the location where these events should be translated.

Since the mapping is present in the UprobesUnwindingVisitor, the easiest solution is to also route the manual instrumentation / Vulkan layer events through this visitor and translate their tids there.
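
As a rough sketch (all type and member names here are made up; the real visitor deals with Orbit's PerfEvent types), the translation inside the visitor could look like this:

#include <cstdint>
#include <unordered_map>
#include <utility>

// Stand-in for an event produced inside the target process, e.g. from
// manual instrumentation; it carries a tid from the target's namespace.
struct ManualInstrumentationEvent {
  uint64_t timestamp_ns = 0;
  uint32_t tid = 0;
};

class TidTranslatingVisitor {
 public:
  explicit TidTranslatingVisitor(std::unordered_map<uint32_t, uint32_t> map)
      : target_ns_to_root_ns_(std::move(map)) {}

  void Visit(ManualInstrumentationEvent* event) {
    // Rewrite to the root namespace tid before the proto is emitted;
    // leave the tid untouched if no mapping is known.
    auto it = target_ns_to_root_ns_.find(event->tid);
    if (it != target_ns_to_root_ns_.end()) event->tid = it->second;
  }

 private:
  std::unordered_map<uint32_t, uint32_t> target_ns_to_root_ns_;
};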

Remaining problems

Performance of the solution outlined above is not great: piping the events from manual instrumentation back into LinuxTracing causes a performance hit. Consider the following test case: a trivial single-threaded program produces ~340k scopes per second.

Without any tid translation we end up with ~43% CPU load for OrbitService.

With the change of piping manual instrumentation back through LinuxTracing we ended up with ~64% CPU load.

Note that the 43% and 64% above include collecting scheduler information and low frequency sampling (10 ms intervals). So the relative overhead of manual instrumentation alone is even larger than the numbers suggest. On the other hand, the test case is somewhat extreme: 340k scopes per second of manual instrumentation is not what one would expect from a real-world use case.

At least we might consider making the tid mapping a capture option, such that we only take the detour through LinuxTracing when we really need it.

Alternatives considered

Do the tid translation in ProducerEventProcessor.

Pros

We don’t need to pipe the manual instrumentation / Vulkan layer events back into LinuxTracing. Doing so results in the performance hit mentioned above.

Cons

This doesn’t really solve anything. The events from manual instrumentation / Vulkan layer are not synchronized with the events from LinuxTracing (specifically with the tid mapping obtained there). So we would need a way to synchronize the events in the ProducerEventProcessor, leading to a different set of problems (we’d need to buffer events until the tid mapping arrives …).

Deploy OrbitService inside the target process container.

Pros

Consistent use of target process pids out of the box. Little or no work in the service. Therefore the performance issue mentioned above would simply not occur.

Cons

The root user in the container is not the root user of the system. We need perf_event_open for all sorts of things (scheduler tracepoints, …), so this would need to be worked around. I have not done any testing around this.

Additional complexity of deployment: Do we offer to deploy into the container as well? Or do we deploy a helper service as root first and then start an instance of OrbitService inside the container? How does ssh forwarding to the container work? Depending on the answers to the above, profiling other processes outside the container might become less convenient.

All dependencies of OrbitService need to be visible inside the container. Arguably we need to solve half the issue anyway: the injected libraries need to run in the context of the target process.

Loss of consistency with top, ps, and perf run as root on the gamelet: they all see the container from the outside.

Current State

There is an implementation processing user space instrumentation events here. It was reverted because it is unclear whether there is still a use case for this.

Manual instrumentation is handled in this unsubmitted draft PR. There might be some open details, but in principle it works. Processing the events from the Vulkan layer could follow the same pattern; there is no implementation of that yet, though.