EPOLL_CTL_ADD returns EEXIST #1068

Closed
ericeil opened this issue Sep 8, 2016 · 5 comments

Comments


ericeil commented Sep 8, 2016

I'm testing on rs_preview build 14915.

Running the .NET CoreFx System.Net.Http functional tests, I pretty reliably have test failures due to epoll_ctl(…, EPOLL_CTL_ADD, …) returning EEXIST. This does not occur on "real" Ubuntu.

I only see this if I allow the tests to run in parallel; when run sequentially, they all pass. So I assume this is some sort of concurrency-related bug. I’ve tried putting together a smaller repro, but haven’t had any luck. Here are the repro steps:

  1. Install the .NET Core SDK
  2. Get a build of System.Net.Http.Functional.Tests from my branch of CoreFx (I'll send binaries I've built, privately).
  3. Run the tests:
./corerun xunit.console.netcore.exe System.Net.Http.Functional.Tests.dll -notrait Benchmark=true -notrait category=nonnetcoreapp1.0tests -notrait category=failing -notrait category=nonlinuxtests

Most of the time when I run these, at least one test fails with output like this:

      System.Net.InternalException : Exception of type 'System.Net.InternalException' was thrown.
      Stack Trace:
            at System.Net.Sockets.SocketAsyncContext.Register(SocketEvents events)
            ...

Sometimes I've also seen the test hang. This may or may not be related; I haven't investigated the hang yet.

Very occasionally, the test passes. Run it again a couple of times, and you should see the failure in SocketAsyncContext.Register (which calls epoll_ctl).

ericeil pushed a commit to dotnet/corefx that referenced this issue Sep 10, 2016
The latest Windows 10 preview builds contain enough functionality in Ubuntu/Bash on Windows (a.k.a. Windows Subsystem for Linux, or WSL) that .NET Core mostly works now. With the fix for #11058, we can even build/test the CoreFx tree in this environment. Currently there are a few test failures; I've been working on identifying the causes of these, and filing bugs in the appropriate repos. 
This change implements workarounds for all of the current known issues in the "inner loop" tests. Mostly, I've just conditionally disabled tests when running in the WSL environment. 
There is still an issue, under investigation, causing non-deterministic failures in the System.Net.Http tests (microsoft/WSL#1068). And I haven't tried OuterLoop runs of anything but the networking tests.
@sunilmut
Member

We have opened a bug internally to track this. Thanks for the report.


misenesi commented Oct 3, 2016

Thank you for reporting this @ericeil.

epoll_ctl(…, EPOLL_CTL_ADD, …) should return EEXIST when the supplied file descriptor is already registered with this epoll instance (manpages). I see exactly that happening in the strace log: the same process adds the same file descriptor to the same epoll file descriptor twice, and the error is returned.
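For reference, the documented EEXIST case can be reproduced on Linux with a few lines of Python. This is only a minimal sketch using the standard `select.epoll` wrapper; the helper name `register_twice` is made up for illustration, and it is not the CoreFx code path.

```python
import errno
import os
import select


def register_twice():
    """Register the same fd twice; return the errno of the second attempt."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)      # first EPOLL_CTL_ADD succeeds
        try:
            ep.register(r, select.EPOLLIN)  # second EPOLL_CTL_ADD, same fd
        except OSError as exc:
            return exc.errno
        return None                         # no error was raised
    finally:
        ep.close()
        os.close(r)
        os.close(w)


second_add_errno = register_twice()
print(errno.errorcode.get(second_add_errno))
```

On Linux the second registration is expected to fail with EEXIST, which Python surfaces as an `OSError` carrying that errno.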

We use NT thread scheduling rather than Linux thread scheduling, so I wonder if there could possibly be a race condition in the test code that just doesn't reveal itself on real Ubuntu?


ericeil commented Oct 6, 2016

It's entirely possible that we have a race condition that is only causing problems when we run on WSL. But when I looked at strace logs from my own runs, it did not appear that we were registering the same file descriptor twice. I did see cases where we registered a file descriptor, then closed that file descriptor, then created a new one with the same value, and registered that new one. As I understand it, that should be perfectly legal (the FD should be removed from the epoll atomically as it's closing).
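The close-then-reuse sequence described here can be sketched in Python on Linux. This is an illustrative, single-threaded sketch, not the CoreFx test; the helper name `reuse_fd_number` is hypothetical, and the check that the new descriptor gets the same number relies on the kernel handing out the lowest free descriptor.

```python
import os
import select


def reuse_fd_number():
    """Register an fd, close it, then register a new fd with the same number."""
    ep = select.epoll()
    r, w = os.pipe()
    ep.register(r, select.EPOLLIN)
    old_number = r
    os.close(r)   # only reference to the read end, so epoll drops it too

    r2, w2 = os.pipe()  # lowest free number is handed out first
    same_number = (r2 == old_number)
    try:
        ep.register(r2, select.EPOLLIN)  # should not raise EEXIST
        registered_ok = True
    except OSError:
        registered_ok = False
    finally:
        ep.close()
        for fd in (w, r2, w2):
            os.close(fd)
    return same_number, registered_ok


same_number, registered_ok = reuse_fd_number()
print(same_number, registered_ok)
```

If the second registration failed here, that would point at the close not having cleaned up the epoll state, which is the symptom this issue is about.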

@misenesi

I have a fix for this under review now. When a file descriptor gets closed, it needs to be removed from the epoll context as well, which we were only doing when the file object was actually deleted (i.e. there were no more references to it). The test calls the clone syscall with the CLONE_FILES flag, which makes the cloned process share the file descriptor table of the original process, and then closes the file descriptor. But since there were multiple file descriptors opened, we wouldn't clean the epoll state, which could cause hangs or EEXIST errors.

With this fix the tests mostly succeed, but there are still spurious failures (typically 1 test case) on both Ubuntu and WSL.
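For comparison, here is a small Python sketch of how Linux itself behaves when a registered descriptor is closed while another descriptor still refers to the same open file. It uses `os.dup` as a stand-in for the extra reference; it says nothing about WSL internals or about what the CoreFx test does, and the helper name `stale_event_after_close` is made up for illustration.

```python
import os
import select


def stale_event_after_close():
    """Close a registered fd while a duplicate keeps the open file alive."""
    r, w = os.pipe()
    dup_r = os.dup(r)   # second descriptor for the same open file
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN)
        os.close(r)             # the file stays open through dup_r
        os.write(w, b"x")       # make the read end readable
        events = ep.poll(0)     # non-blocking poll
        return r, events
    finally:
        ep.close()
        os.close(dup_r)
        os.close(w)


closed_fd, events = stale_event_after_close()
print(closed_fd, events)
```

On Linux the registration survives the close because the underlying open file is still referenced, so the poll still reports the closed descriptor number. This is one reason closing a descriptor and cleaning up epoll state are easy to get out of step.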

@misenesi

After a delay, I checked in a fix for this. I wanted to cover all the edge cases when using epoll with fork/clone, as epoll is weird once you fork:

links
http://lists.schmorp.de/pipermail/libev/2016q1/002681.html
http://lkml.iu.edu/hypermail/linux/kernel/0403.0/0299.html
https://lkml.org/lkml/2007/10/27/25
http://lists.schmorp.de/pipermail/libev/2016q1/002680.html

@jackchammons jackchammons removed the bug label Apr 11, 2017
carlossanlop pushed a commit to carlossanlop/maintenance-packages that referenced this issue Jan 26, 2024