Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout in HTTP::Server spec with preview_mt #13211

Closed
straight-shoota opened this issue Mar 22, 2023 · 3 comments · Fixed by #14097
Closed

Timeout in HTTP::Server spec with preview_mt #13211

straight-shoota opened this issue Mar 22, 2023 · 3 comments · Fixed by #14097

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Mar 22, 2023

Some HTTP::Server specs occasionally time out in CI after reaching the 15 second timout.
Example:

  • HTTP::Server #bind fails after listen (Duration (p95): 11ms, 98% success rate)
  • HTTP::Server #listen fails after listen (Duration (p95): 7ms, 99% success rate)
  • HTTP::Server doesn't raise on accept after close #2692 (Duration (p95): 8s, 99% success rate)

Stats from https://app.circleci.com/insights/github/crystal-lang/crystal/workflows/nightly_release/tests?branch=master

We could consider increasing the timeout similar to #13043. But that would just be ignoring the symptoms. It really shouldn't take that long to execute these specs. The 95-percentile of the duration being several orders of magnitude below the timeout confirms that.

Maybe there is some kind of deadlock or perhaps all available threads are saturated for such a long time. It's flaky. We need to investigate what's actually going on and figure out what can be done to improve this.

@ysbaddaden
Copy link
Contributor

ysbaddaden commented Dec 4, 2023

The problems are easily reproduced on local, though I never get the 15s timeout. During some run I either get an infinite hang (strace shows repeated epoll_wait) or a libevent warn about a bad file descriptor... which is reminiscent of #2692 that is quoted in one of the reported failing spec.

Indeed, this is a race condition induced by MT: we're listening and closing a server from different fibers that happen to be run on different threads, and are no longer synchronized, so a spec suddenly closes a server while it's being started, ...

@ysbaddaden
Copy link
Contributor

To reproduce:

$ bin/crystal build spec/std/http/server/server_spec.cr -Dpreview_mt

Then repeatedly run ./server_spec. It usually fails every half dozen runs for me.

@ysbaddaden
Copy link
Contributor

This is hard to fix properly. So far only adding a few sleep 0.001 avoids the issue 😞

ysbaddaden added a commit to ysbaddaden/crystal that referenced this issue Dec 14, 2023
ysbaddaden added a commit to ysbaddaden/crystal that referenced this issue Dec 15, 2023
The core of the issue is a synchronization issue between fibers that
test edge cases in HTTP::Server, for example what's happening after
closing the fiber (from another fiber) while we have no guarantee that
the server is actually listening.

I couldn't find a proper way to fix the issue. The only solution that
fixed the reproducibility of the issue was adding a one millisecond
sleep after starting the server.

closes crystal-lang#13211
ysbaddaden added a commit to ysbaddaden/crystal that referenced this issue Dec 15, 2023
The core of the issue is a synchronization issue between fibers that
test edge cases in HTTP::Server, for example what's happening after
closing the fiber (from another fiber) while we have no guarantee that
the server is actually listening.

I couldn't find a proper way to fix the issue. The only solution that
fixed the reproducibility of the issue was adding a one millisecond
sleep after starting the server.

closes crystal-lang#13211
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants