EPOLL_CTL_ADD returns EEXIST #1068
Comments
The latest Windows 10 preview builds contain enough functionality in Ubuntu/Bash on Windows (a.k.a. Windows Subsystem for Linux, or WSL) that .NET Core mostly works now. With the fix for #11058, we can even build/test the CoreFx tree in this environment. Currently there are a few test failures; I've been working on identifying the causes of these, and filing bugs in the appropriate repos. This change implements workarounds for all of the current known issues in the "inner loop" tests. Mostly, I've just conditionally disabled tests when running in the WSL environment. There is still an issue, under investigation, causing non-deterministic failures in the System.Net.Http tests (microsoft/WSL#1068). And I haven't tried OuterLoop runs of anything but the networking tests.
We have opened a bug internally to track this. Thanks for the report.
Thank you for reporting this @ericeil. epoll_ctl(…, EPOLL_CTL_ADD, …) should return EEXIST when the supplied file descriptor is already registered with this epoll instance (manpages). I see exactly that happening in the strace log: the same process adds the same file descriptor to the same epoll file descriptor twice, and the error is returned. We use NT thread scheduling rather than Linux thread scheduling, so I wonder if there could be a race condition in the test code that just doesn't reveal itself on real Ubuntu?
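For reference, the documented EEXIST case can be reproduced in a few lines. This is a minimal sketch using Python's select.epoll wrapper on Linux rather than the .NET code from the report; the helper name add_twice is made up for illustration.

```python
import errno
import os
import select


def add_twice():
    """Register the same descriptor twice; return the errno of the second attempt."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)      # first EPOLL_CTL_ADD succeeds
        try:
            ep.register(r, select.EPOLLIN)  # second EPOLL_CTL_ADD on the same fd
        except OSError as exc:
            return exc.errno
        return None
    finally:
        ep.close()
        os.close(r)
        os.close(w)


result = add_twice()
print(errno.errorcode.get(result))
```

If the strace log really showed this pattern with no intervening close, EEXIST would be the expected result.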
It's entirely possible that we have a race condition that is only causing problems when we run on WSL. But when I looked at strace logs from my own runs, it did not appear that we were registering the same file descriptor twice. I did see cases where we registered a file descriptor, then closed that file descriptor, then created a new one with the same value, and registered that new one. As I understand it, that should be perfectly legal (the FD should be removed from the epoll atomically as it's closing).
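The close-then-reuse sequence described above can be sketched as follows. This is an illustration only, written against Python's select.epoll on Linux; it is not the .NET test code, and the helper name register_after_reuse is hypothetical. It assumes nothing else in the process opens descriptors in between, so the freed number is handed out again.

```python
import os
import select


def register_after_reuse():
    """Close a registered descriptor, reuse its number, and register again."""
    ep = select.epoll()
    r1, w1 = os.pipe()
    ep.register(r1, select.EPOLLIN)
    os.close(r1)        # last reference to the read end: epoll should forget it
    r2, w2 = os.pipe()  # lowest free number is reused, normally the old r1
    try:
        ep.register(r2, select.EPOLLIN)  # raises FileExistsError if state is stale
        registered = True
    except FileExistsError:
        registered = False
    finally:
        ep.close()
        for fd in (w1, r2, w2):
            os.close(fd)
    return r1 == r2, registered


same_number, registered = register_after_reuse()
print(same_number, registered)
```

On Linux the second registration is accepted; a stale entry left behind for the old descriptor is the kind of state that would turn it into EEXIST.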
I have a fix for this under review now. When a file descriptor gets closed, it needs to be removed from the epoll context as well, which we were only doing when the file object was actually deleted (i.e. when there were no more references to it). The test calls the clone syscall with the CLONE_FILES flag, which makes the cloned process share the file descriptor table of the original process, and then closes the file descriptor. But since there were multiple file descriptors opened, we wouldn't clean up the epoll state, which could cause hangs or EEXIST errors. With this fix the test mostly succeeds, but there are still spurious failures (typically 1 test case) on both Ubuntu and WSL.
After a delay, I checked in a fix for this. I wanted to cover all the edge cases when using epoll with fork/clone, as epoll is weird once you fork: links
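One documented example of that weirdness on Linux (see epoll(7)) is that a registration belongs to the open file description, not to the descriptor number, so it survives a close for as long as another descriptor still refers to the same description. The sketch below uses os.dup in place of fork/clone to keep everything in one process; that substitution and the helper name events_after_close are assumptions for illustration, and the code says nothing about WSL's internals.

```python
import os
import select


def events_after_close():
    """Close a registered fd while a duplicate keeps its description alive."""
    ep = select.epoll()
    r, w = os.pipe()
    ep.register(r, select.EPOLLIN)
    keep = os.dup(r)     # second descriptor for the same open file description
    os.close(r)          # registration stays: `keep` still references the description
    os.write(w, b"x")    # make the read end readable
    events = ep.poll(0)  # non-blocking poll
    ep.close()
    os.close(keep)
    os.close(w)
    return r, events


closed_fd, events = events_after_close()
print(closed_fd, events)
```

The event is reported under the number that was already closed, which is why code mixing epoll with fork, clone, or dup generally needs to remove registrations explicitly before closing.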
I'm testing on rs_preview build 14915.
Running the .NET CoreFx System.Net.Http functional tests, I pretty reliably have test failures due to epoll_ctl(…, EPOLL_CTL_ADD, …) returning EEXIST. This does not occur on "real" Ubuntu.
I only see this if I allow the tests to run in parallel; when run sequentially, they all pass. So I assume this is some sort of concurrency-related bug. I’ve tried putting together a smaller repro, but haven’t had any luck. Here are the repro steps:
Most of the time when I run these, at least one test fails with output like this:
Sometimes I've also seen the test hang. This may or may not be related; I haven't investigated the hang yet.
Very occasionally, the test passes. Run it again a couple of times, and you should see the failure in SocketAsyncContext.Register (which calls epoll_ctl).