
broadcasters stuck after bandwidth pressure #1640

Closed
iameli opened this issue Oct 22, 2020 · 0 comments

iameli commented Oct 22, 2020

Describe the bug
@ericxtang and I were running a big load test. After roughly 400 streams, something broke that we're still investigating. The working theory is that we ran out of bandwidth in the transcoding cluster.

Afterward, many of the broadcasters in the cluster believed they still had active streams. No data was flowing into the broadcasters, but GET /status still listed a large number of streams, and the broadcaster logs showed them consistently doing watchdog resets:

I1022 05:22:28.632438       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=1 dur=2000 started=2020-10-22 04:42:52.596496644 +0000 UTC m=+5627.615492239
I1022 05:22:28.902201       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=4 dur=1600 started=2020-10-22 04:58:10.896452534 +0000 UTC m=+6545.915448149
I1022 05:22:29.261122       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=158 dur=2000 started=2020-10-22 04:42:53.2199255 +0000 UTC m=+5628.238921105
I1022 05:22:38.652812       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=145 dur=2000 started=2020-10-22 04:42:08.642852375 +0000 UTC m=+5583.661847980
I1022 05:22:44.069157       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=163 dur=2000 started=2020-10-22 04:42:14.058966498 +0000 UTC m=+5589.077962124
I1022 05:22:54.067318       1 mediaserver.go:771] watchdog reset mid=[redacted] seq=213 dur=2000 started=2020-10-22 04:42:24.050308485 +0000 UTC m=+5599.069304070

Full unredacted logs and /status output are in the internal Livepeer Discord here.

It seems like there may be a failure condition we aren't handling that causes the session to remain open.
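
To make the suspected failure mode concrete, here's a minimal sketch (not the actual mediaserver.go code; the names and timeout are made up) of how a timer-based watchdog can keep a dead session alive if the reset path counts retries as activity even when no new segments are arriving:

```go
// Hypothetical sketch only, not go-livepeer code: a watchdog that is reset by
// the retry path as well as by successful segment delivery. If upstream data
// stops but the retry loop keeps firing, the session never expires.
package main

import (
	"log"
	"time"
)

const httpPushTimeout = 1 * time.Minute // assumed timeout value

type session struct {
	mid      string
	watchdog *time.Timer
}

func newSession(mid string, onExpire func()) *session {
	return &session{
		mid:      mid,
		watchdog: time.AfterFunc(httpPushTimeout, onExpire),
	}
}

// resetWatchdog is called whenever "activity" is observed. The bug class this
// issue suspects: if retries of an old, failed segment also count as activity,
// a stuck session keeps resetting its own watchdog forever.
func (s *session) resetWatchdog(seq int, dur time.Duration, started time.Time) {
	log.Printf("watchdog reset mid=%s seq=%d dur=%v started=%v", s.mid, seq, dur, started)
	s.watchdog.Reset(httpPushTimeout)
}

func main() {
	started := time.Now()
	s := newSession("example-mid", func() {
		log.Printf("session example-mid expired, cleaning up")
	})
	// Simulate a retry loop that keeps resetting the watchdog even though no
	// new data is flowing in: the expiry callback above never fires.
	for i := 0; i < 3; i++ {
		time.Sleep(10 * time.Millisecond)
		s.resetWatchdog(1, 2*time.Second, started)
	}
}
```

If that's roughly the shape of the bug, resetting the watchdog only on successful segment ingest (rather than on every retry) would let these sessions expire.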

To Reproduce
I don't yet have a repro outside of stressing one of our clusters in a bandwidth-constrained setting.

Expected behavior
The streams should time out and the active stream count should return to 0.
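
For verification, something like the sketch below could poll the broadcaster's /status endpoint after the load test and confirm the session count drains back to zero. The "Manifests" field name and the 7935 CLI port are assumptions based on a default setup; adjust to whatever your broadcaster build actually exposes:

```go
// Rough check for the expected behavior: poll GET /status on the broadcaster's
// CLI port until no streams are reported, or time out. Field/port names are
// assumptions, not a confirmed API contract.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type status struct {
	Manifests map[string]json.RawMessage `json:"Manifests"` // assumed field name
}

func activeStreams(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var s status
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return 0, err
	}
	return len(s.Manifests), nil
}

func main() {
	const url = "http://127.0.0.1:7935/status" // assumed default CLI port
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		n, err := activeStreams(url)
		if err != nil {
			fmt.Println("status error:", err)
		} else if n == 0 {
			fmt.Println("all sessions drained")
			return
		} else {
			fmt.Printf("%d sessions still reported by /status\n", n)
		}
		time.Sleep(30 * time.Second)
	}
	fmt.Println("timed out waiting for sessions to drain")
}
```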
