
broadcasters stuck after bandwidth pressure #1640

Closed
iameli opened this issue Oct 22, 2020 · 0 comments

iameli commented Oct 22, 2020

Describe the bug
@ericxtang and I were running a big load test. After roughly 400 streams, something broke that we're still investigating. The working theory is that we ran out of bandwidth in the transcoding cluster.

Afterward, many of the broadcasters in the cluster believed they still had active streams. No data was flowing into the broadcasters, but GET /status still listed a large number of streams, and the broadcaster logs showed them consistently doing watchdog resets:

I1022 05:22:28.632438       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=1 dur=2000 started=2020-10-22 04:42:52.596496644 +0000 UTC m=+5627.615492239
I1022 05:22:28.902201       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=4 dur=1600 started=2020-10-22 04:58:10.896452534 +0000 UTC m=+6545.915448149
I1022 05:22:29.261122       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=158 dur=2000 started=2020-10-22 04:42:53.2199255 +0000 UTC m=+5628.238921105
I1022 05:22:38.652812       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=145 dur=2000 started=2020-10-22 04:42:08.642852375 +0000 UTC m=+5583.661847980
I1022 05:22:44.069157       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=163 dur=2000 started=2020-10-22 04:42:14.058966498 +0000 UTC m=+5589.077962124
I1022 05:22:54.067318       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=213 dur=2000 started=2020-10-22 04:42:24.050308485 +0000 UTC m=+5599.069304070

Full unredacted logs and /status output are in the internal Livepeer Discord here.

It seems like there may be a failure condition we aren't handling that causes the session to remain open.
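
To make the suspected failure mode concrete, here's a minimal sketch (not the actual mediaserver.go code; the names and timeout are made up) of how a timer-based watchdog can keep a dead session alive if the reset path counts retries as activity even when no new segments are arriving:

```go
// Hypothetical sketch only, not go-livepeer code: a watchdog that is reset by
// the retry path as well as by successful segment delivery. If upstream data
// stops but the retry loop keeps firing, the session never expires.
package main

import (
	"log"
	"time"
)

const httpPushTimeout = 1 * time.Minute // assumed timeout value

type session struct {
	mid      string
	watchdog *time.Timer
}

func newSession(mid string, onExpire func()) *session {
	return &session{
		mid:      mid,
		watchdog: time.AfterFunc(httpPushTimeout, onExpire),
	}
}

// resetWatchdog is called whenever "activity" is observed. The bug class this
// issue suspects: if retries of an old, failed segment also count as activity,
// a stuck session keeps resetting its own watchdog forever.
func (s *session) resetWatchdog(seq int, dur time.Duration, started time.Time) {
	log.Printf("watchdog reset mid=%s seq=%d dur=%v started=%v", s.mid, seq, dur, started)
	s.watchdog.Reset(httpPushTimeout)
}

func main() {
	started := time.Now()
	s := newSession("example-mid", func() {
		log.Printf("session example-mid expired, cleaning up")
	})
	// Simulate a retry loop that keeps resetting the watchdog even though no
	// new data is flowing in: the expiry callback above never fires.
	for i := 0; i < 3; i++ {
		time.Sleep(10 * time.Millisecond)
		s.resetWatchdog(1, 2*time.Second, started)
	}
}
```

If that's roughly the shape of the bug, resetting the watchdog only on successful segment ingest (rather than on every retry) would let these sessions expire.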

To Reproduce
I don't yet have a repro outside of stressing one of our clusters in a bandwidth-constrained setting.

Expected behavior
The streams should time out and the active stream count should return to 0.
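
For verification, something like the sketch below could poll the broadcaster's /status endpoint after the load test and confirm the session count drains back to zero. The "Manifests" field name and the 7935 CLI port are assumptions based on a default setup; adjust to whatever your broadcaster build actually exposes:

```go
// Rough check for the expected behavior: poll GET /status on the broadcaster's
// CLI port until no streams are reported, or time out. Field/port names are
// assumptions, not a confirmed API contract.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type status struct {
	Manifests map[string]json.RawMessage `json:"Manifests"` // assumed field name
}

func activeStreams(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var s status
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return 0, err
	}
	return len(s.Manifests), nil
}

func main() {
	const url = "http://127.0.0.1:7935/status" // assumed default CLI port
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		n, err := activeStreams(url)
		if err != nil {
			fmt.Println("status error:", err)
		} else if n == 0 {
			fmt.Println("all sessions drained")
			return
		} else {
			fmt.Printf("%d sessions still reported by /status\n", n)
		}
		time.Sleep(30 * time.Second)
	}
	fmt.Println("timed out waiting for sessions to drain")
}
```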
