Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test4_4 seems to be in deadlock on amd64_ubu14 #274

Closed
mxz297 opened this issue Nov 23, 2016 · 9 comments
Closed

Test4_4 seems to be in deadlock on amd64_ubu14 #274

mxz297 opened this issue Nov 23, 2016 · 9 comments
Labels

Comments

@mxz297
Copy link
Member

mxz297 commented Nov 23, 2016

@cuviper We observed that current master branch will lead to a deadlock when running test4_4 on Ubuntu 14. Preliminary debugging suggests that the mutator is forever looping for checking whether two threads have finished. Merging your two pull requests about proccontrol does not seem solve this issue.

I attached the debugging log by enabling DYNINST_DEBUG_PROCCONTROL.

Do you have any insights on this issue?

https://gist.github.com/mxz297/23be1aa22fdecee056aa9fbfbe41464a

@cuviper
Copy link
Contributor

cuviper commented Nov 23, 2016

Is the failure consistent? It seems to be fine on Fedora 25.

A lot of suspicious lines in that log, like these stand out to me:

[process.C:1512-U] - Error: plat_readMem failed!

[int_thread_db.C:1715-U] - Error: Attempt to read user thread info of 19593/19593 before user thread create

[process.C:1076-U] - No threads are running to produce events

@wrwilliams
Copy link
Member

It's quite consistent (deterministic, even) and only showing up on Ubuntu 14 so far as we've been able to tell. We do seem to have something go sideways in parsing such that it takes quite a long time to make it through libc, but that also seems to be triggering some bad proccontrol-level behavior.

@cuviper
Copy link
Contributor

cuviper commented Nov 23, 2016

FWIW, the infinite loop is not the new one I added, as that would say "Waiting for neonatal threads" in the log. It's the testsuite's contAndWaitForAllProcs that's stuck. (Maybe you already knew that.)

Do you have an older commit that works on the same machine? If so, can you git-bisect it?

@mplegendre
Copy link
Contributor

The lines about failing to read user thread info and getTIDs are red herrings. Dyninst isn't careful about not reading user-thread info before that info is initialized, but it's not harmful.

I think the problem starts on line 5436 of the DYNINST_DEBUG_PROCCONTROL trace. After an exec, ProcControlAPI goes to re-create the breakpoint that monitors library loading, but fails to insert it. In the new process that's inserted into 7faba38e3940. In the previous process that was 7f45619a7940. Since only the upper bits changed, I'm guessing we have the symbol offset correct and the ld.so library's load address incorrect.

The ld.so load address should have come from the interpreter base read via /proc/PID/auxv in AuxvParser::readAuxvInfo in common/src/linuxKludges.C. Unfortunately, there's no good debug printing there, so you may have to add some to see if something's going wrong.

@cuviper
Copy link
Contributor

cuviper commented Nov 28, 2016

I can reproduce this running as a normal user on Ubuntu 14.04, but as root it's fine. Root indeed gets the same "Error: Attempt to read user thread info...", but does not get the "Error: plat_readMem failed!" I suspected Ubuntu's "yama" security model, and sysctl kernel.yama.ptrace_scope=0 does in fact make it work again.

I'm surprised that ptrace_scope=1 would cause a failure here, because I wouldn't think an exec should affect the parent-child connection. But Fedora kernels now have yama too, disabled by default, and here it also fails plat_readMem when enabled.

@cuviper
Copy link
Contributor

cuviper commented Nov 28, 2016

Maybe plat_readMem is also a red herring -- if the libraries were not detected properly in the new process, then that could simply be a bad address. But it's clear to me that yama is the gating factor.

@jdetter jdetter added the bug label Nov 29, 2016
@mplegendre
Copy link
Contributor

I'd normally expect that if it were ptrace_scope issues then we wouldn't be able to operate on child process 19593 at all, but we handled its fork just fine. It's after the exec that we lose access to the process.

In the above log, the parent process 19591 exits soon before we start having problems with 19593. Could 19591's exit cause a re-parenting of 19593, and trigger the yama/ptrace_scope restrictions that prevent you from debugging processes you're not the parent of? We might be able to test that by throwing a sleep before the exit of the test4_4 parent.

I'm curious, does test pc_fork_exec also fail on these systems?

@cuviper
Copy link
Contributor

cuviper commented Nov 30, 2016

pc_fork_exec is fine here.

OK, that's interesting about re-parenting though.

The actual call that's failing is process_vm_readv with EPERM. We have a fallback for bulk r/w to use normal ptrace calls after EFAULT, mainly because IIRC process_vm respects page protection and ptrace doesn't. If I extend that to fallback for EPERM too, then the test passes.

Now, ptrace only checks for security access when you're first attaching, whereas process_vm makes the check every time you call it. So yama_ptrace_access_check sees that the process is no longer a descendant of the tracer, since the re-parenting gave the orphan to init (pid 1), and denies process_vm. There's no leeway for the fact that we have an actual ptrace relationship already.

There is an escape hatch in ptracer_exception_found, but this is for relationships pre-arranged by prctl(PR_SET_PTRACER), which can set a specific parent or PR_SET_PTRACER_ANY. That might be an option just for the testsuite, but I don't see any evidence that this is inherited across forks, which makes it less useful.


I'll send a PR for the process_vm EPERM fallback, at least. This is easy and should be perfectly safe, because if we're already attached to a process, there's no other reason I can imagine for an EPERM besides yama obstruction. Note however that these access checks are also used on a lot on procfs files, so it's still possible we'll have other issues.

I'm also tempted to send a kernel patch for yama to allow active ptracers free reign. I'll think on it.

cuviper added a commit to cuviper/dyninst that referenced this issue Nov 30, 2016
Having sysctl kernel.yama.ptrace_scope=1, one may only call ptrace
attach on direct descendants.  The same restriction is also checked for
`process_vm_readv`/`writev` and certain procfs files.  However, if an
intermediate parent process already exited, we could end up with a
grandchild that we're still ptracing but isn't our descendant, so we
can't use the `process_vm` functions anymore -> `EPERM`.

We already had a fallback here for `EFAULT`, to just use `ptrace` memory
access, so use the same fallback after `EPERM` too.

Fixes dyninst#274.
@cuviper
Copy link
Contributor

cuviper commented Dec 1, 2016

I'm also tempted to send a kernel patch for yama to allow active ptracers free reign. I'll think on it.

https://lkml.org/lkml/2016/11/30/820

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants