
High memory pressure with live dash playback #6070

Closed
Illmatikx opened this issue Jan 10, 2024 · 45 comments
Assignees
Labels
component: DASH The issue involves the MPEG DASH manifest format priority: P1 Big impact or workaround impractical; resolve before feature release status: archived Archived and locked; will not be updated type: bug Something isn't working correctly type: performance A performance issue
Milestone

Comments

@Illmatikx

Have you read the FAQ and checked for duplicate open issues?
Yes

If the problem is related to FairPlay, have you read the tutorial?
Not applicable

What version of Shaka Player are you using?
4.7.0

Can you reproduce the issue with our latest release version?
Not tested, but compared the branch against 4.7.0

Can you reproduce the issue with the latest code from main?
Not tested, but compared the branch against main

Are you using the demo app or your own custom app?
Custom app (just a page with shaka player)

If custom app, can you reproduce the issue using our demo app?
Not tried

What browser and OS are you using?
Web: Windows Chrome 120.0.6099.199 (Official Build) (64-bit)
webOS 4/6: Chromium 53 and 79 respectively
Tizen 4: Chromium 56

For embedded devices (smart TVs, etc.), what model and firmware version are you using?
webOS 4: 43UM7300PLB (2019)
webOS 6: 43UP81006LA (2021)
Tizen 4: UE43NU7400U (2018)

What are the manifest and license server URIs?
I will send you the URL via email.

What configuration are you using? What is the output of player.getConfiguration()?
Player configuration is attached.
player_config_web.txt

What did you do?
Started playback of an identical live DASH stream with Shaka versions 4.3.0 and 4.7.0 and kept it running for up to 2 hours. I monitored memory with different tools depending on the platform; on web I took heap dumps every 10-30 minutes.

What did you expect to happen?
The memory footprint should not keep growing after the buffer fills with 4.7.0 (as is the case with 4.3.0). The memory allocation rate should also be similar to 4.3.0 (tested on smart TV devices with webOS 4/6 and Tizen 4).

What actually happened?
The Shaka 4.7.0 memory footprint grew over time during live playback (no interaction with the custom demo app; tested on web and webOS 4/6). Memory growth appears to depend on the DVR window size, since it stabilized after 60 minutes of testing on web (timeShiftBufferDepth is 1 hour). In two tests I observed the heap reaching 39 and 68 MB respectively, whereas with Shaka 4.3.0 it stayed around 11-12 MB.
[screenshot: shaka-smtvweb-heapdumps]
A second issue: the magnitude of memory allocation/cleanup is much higher for 4.7.0 than for 4.3.0 under the same testing conditions (tested on webOS 4/6 and Tizen 4 with a 1h DVR). The consequence of high memory pressure on low-end smart TV devices is UI freezing during interaction with the app: the OS and browser are busy with memory housekeeping such as swapping, GC, and OS-level memory allocation delays. Plain live playback with our own app started to freeze on the webOS 4 (2019) device after ~15 minutes without any UI interaction.
[screenshot: webos6_shaka_470_memory]
[screenshot: webos6_shaka_430_memory]

The heap dump comparison points to memory retained by SegmentIndex/XML DOM data structures. As for the high memory allocation rate, I suspect the evict() changes in the TimelineSegmentIndex class: slice() was added for the timeline and references objects, which produces shallow copies of those arrays (the relevant changes may be in #5061).
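
For illustration, here is a minimal sketch of the allocation difference I have in mind (illustrative only, not Shaka's actual eviction code): slice() hands back a fresh shallow copy on every eviction pass, while splice() trims the existing array in place.

```js
// Illustrative sketch only, not Shaka's actual code.

// slice(): every eviction pass returns a fresh shallow copy, so the old
// backing array becomes garbage each time evict() runs.
function evictWithSlice(entries, numToEvict) {
  return entries.slice(numToEvict);
}

// splice(): the leading entries are removed in place and no new backing
// array is allocated.
function evictWithSplice(entries, numToEvict) {
  entries.splice(0, numToEvict);
  return entries;
}
```

On a live stream that evicts continuously, the slice() variant would show up as allocation/GC churn rather than as retained heap, which matches the allocation-rate symptom more than the leak.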

@Illmatikx Illmatikx added the type: bug Something isn't working correctly label Jan 10, 2024
@shaka-bot shaka-bot added this to the v5.0 milestone Jan 10, 2024
@avelad avelad added type: performance A performance issue component: DASH The issue involves the MPEG DASH manifest format priority: P1 Big impact or workaround impractical; resolve before feature release labels Jan 10, 2024
@OrenMe
Contributor

OrenMe commented Jan 17, 2024

Updating the issue from a Slack chat with @avelad.
Profiling the Shaka SDK on a standalone test page (just Shaka and a test stream), we can see the memory increase:
[screenshot]

The image above is from a test run with the nightly build, compiled.
We ran it again with uncompiled code and can see that there is retention of TimelineSegmentIndex:
[screenshot]

It is strange, because the eviction logic seems correct, yet we still see the retention.
We tested streams with a 1-hour DVR vs. no (short) DVR, and the memory retention is very visible on the DVR streams:
[screenshot]

In any case, even with no DVR there is some memory pressure with the new version compared to the old one (we were on 4.3.0 and planning to upgrade).

We are going to test version by version to find where this behaviour started, so we can track down the change that caused it.

@ricardo-a-alves-alb
Contributor

On our side, Tizen and webOS seemed to be crashing with out-of-memory errors. We have rolled back to 4.5.0 for now, as it seems stable.

@nrcrast
Contributor

nrcrast commented Jan 17, 2024

Interesting. I will start to investigate this a bit too. Maybe using splice instead of slice would avoid some of the extra allocations.

@OrenMe
Contributor

OrenMe commented Jan 17, 2024

Thanks for the feedback @ricardo-a-alves-alb, and thanks @nrcrast; it sounds like you might know the possible root cause. We were about to start bisecting to find the PR that introduced it, but from these comments maybe you can spot it faster than us.

@nrcrast
Contributor

nrcrast commented Jan 17, 2024

It's probably just the PR that introduced the TimelineSegmentIndex -- #5061. Not 100% sure but that was a large change and I'd imagine if there was a memory issue with it, it would have been there since the beginning.

@OrenMe
Contributor

OrenMe commented Jan 17, 2024

I remember this PR; I actually think @avelad mentioned it to me on another issue related to smart TV performance, but we didn't notice an improvement in VST specifically and assumed our app might be consuming too much memory to feel the gains, so we focused on improving the FE side.
I also suspect that if you do not see this issue it might be related to DASH manifest features, and that specific manifests suffer more than others. Just a guess, but you did impressive work in that PR and I believe you would have noticed such issues.
Are you able to check this and maybe create a PR? We can test it on our side and provide feedback.

@nrcrast
Contributor

nrcrast commented Jan 19, 2024

I think I've found one of the issues -- the initSegmentReference sticks around forever. When all references have been evicted from the TimelineSegmentIndex, we should be freeing everything we can so it can be garbage collected.

At that point, the TimelineSegmentIndex really only exists as a shell to keep track of numEvicted.
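
A rough sketch of that idea, with field names that are illustrative rather than Shaka's actual implementation: once the last reference has been evicted, keep only the eviction counter and null out everything else so it can be garbage collected.

```js
// Illustrative sketch, not Shaka's actual implementation.
class TimelineSegmentIndexSketch {
  constructor(templateInfo, initSegmentReference) {
    this.templateInfo_ = templateInfo;
    this.initSegmentReference_ = initSegmentReference;
    this.references_ = [];
    this.numEvicted_ = 0;
  }

  evictAll() {
    // Once the last reference is gone, the index only needs to remember
    // how many entries it has dropped; releasing the rest lets the init
    // segment reference and the template data be garbage collected.
    this.numEvicted_ += this.references_.length;
    this.references_ = [];
    this.initSegmentReference_ = null;
    this.templateInfo_ = null;
  }
}
```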

@OrenMe
Contributor

OrenMe commented Jan 22, 2024

@nrcrast thanks for the update, any chance you are planning to push this fix so we can test it?

@nrcrast
Contributor

nrcrast commented Jan 22, 2024

Hi! I will push something this week. I want to do some long-running tests here first so I'm confident in the fix before I push anything. I can push a branch at some point, though, if you want to do some preliminary testing before I open a PR.

@OrenMe
Contributor

OrenMe commented Jan 22, 2024

@nrcrast for sure; if you can set up a side branch, we can build and test from it.

@Iragne
Contributor

Iragne commented Jan 26, 2024

I think I've found one of the issues -- the initSegmentReference sticks around forever. When all references have been evicted from the TimelineSegmentIndex, we should be freeing everything we can so it can be garbage collected.

At that point, the TimelineSegmentIndex really only exists as a shell to keep track of numEvicted.

Do you mean that in the release function of "shaka.dash.TimelineSegmentIndex" we should ensure the segment references are released?
I'm not sure I follow what you mean.

Happy to check your branch; I'm testing it too.

@nrcrast
Contributor

nrcrast commented Jan 26, 2024

It cannot be released in the release method, as the init segment reference can only be released once it's guaranteed that the SegmentIndex has evicted all of its entries.

The issue with releasing everything in the overridden release method is that release is also called when you switch variants. When you switch variants, the segment index for variant A is closed and then re-opened if you later switch back to that variant. At that point, all the information needed to re-create the segment index must still exist somewhere in memory.

I am still chasing down the issue here -- it seems that even though I'm setting initSegmentReference to null in the TimelineSegmentIndex, something is still holding onto it and preventing it from being garbage collected.

I've just made a branch with this change, as well as the splice change, if you also want to do some testing:
https://github.com/nrcrast/shaka-player/tree/timeline-segment-index-free-all
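
To illustrate the constraint, a simplified sketch under the assumptions above (field and method names are illustrative, not Shaka's exact code): release() must leave enough state behind to rebuild the index after a variant switch, so only the materialized references can be dropped there.

```js
// Simplified sketch of the lifecycle constraint, not Shaka's exact code.
class TimelineSegmentIndexSketch {
  constructor(templateInfo, initSegmentReference) {
    this.templateInfo_ = templateInfo;
    this.initSegmentReference_ = initSegmentReference;
    this.references_ = [];
  }

  release() {
    // Called on variant switches: the index may be re-created if this
    // variant is selected again, so templateInfo_ and
    // initSegmentReference_ must be kept. Only the materialized
    // references can safely be dropped here.
    this.references_ = [];
  }
}
```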

@Iragne
Contributor

Iragne commented Jan 27, 2024

Hi @nrcrast, I tried to chase it down to the segment reference "partialReferences" and "initSegmentReference" fields, but I'm not sure that's correct.

Thanks for the information about release; in the end that's what I thought after reading the code more closely.

@Illmatikx
Author

Hello. It seems the memory leak and high allocation rate are still present in the timeline-segment-index-free-all branch.
[screenshot: shaka-webos6-retest]
Heap snapshots for the same demo app (Shaka live playback with a hardcoded Full HD channel) on web:
[screenshot]
The same heap snapshots with a debug build:
debug-smtv-web-shaka.zip
Some thoughts on the heap leak:
The references arrays of the two TimelineSegmentIndex instances (the playing audio and video) grow over time due to a growing number of SegmentReference elements. Each of them retains ~9 KB of heap memory through the InitSegmentReference assigned to it:
[screenshot: shaka-web-retest]
As for the high memory allocation rate: Tizen/webOS tests show that it fluctuates by up to 50-70 MB, and this does not happen within the heap itself. It is native memory underlying the data structures, which in turn may be related to the DASH manifest in-memory manipulation changes.

@nrcrast
Contributor

nrcrast commented Jan 30, 2024

Interesting. From what I can tell, the eviction logic in the TimelineSegmentIndex seems to work pretty much the same way as the eviction logic in the base SegmentIndex, plus it only creates SegmentReference objects when it's asked to.

So I'm surprised at the number still remaining. That's definitely concerning.

Have we checked if #5762 had any impact? I have no reason to suspect that it did, but it would be an interesting test.

@nrcrast
Contributor

nrcrast commented Jan 30, 2024

Some more interesting info!

Here is a comparison of my fork of v3.3.10 on the left and current shaka main on the right. Both contain the Timeline Segment Index, but my fork does not contain #5762 or any other changes that happened since my initial contribution.

[screenshot]

Going to play the "binary search" game and see if I can reproduce similar results using official shaka releases and not my own fork.

@nrcrast
Contributor

nrcrast commented Jan 30, 2024

[screenshot]

Comparison of current Shaka main (on the right) and commit 8d2b657 (#5842). This is after a bug fix for a race condition in the Timeline Segment Index, so it seems something went sideways sometime after this point.

Continuing the hunt!

@Illmatikx
Author

Illmatikx commented Jan 30, 2024

"Have we checked if #5762 had any impact? I have no reason to suspect that it did, but it would be an interesting test."

I just tagged and checked two versions from the timeline-segment-index-free-all branch on web: the commit right before #5762 (#5848) and #5762 itself. The heap size looks quite stable, with no memory leak detected:
[screenshot]

@Illmatikx
Author

It seems I found the root cause (checked for the memory leak on web and webOS 6 so far): #5899.
Calling appendTemplateInfo() on TimelineSegmentIndex on each manifest arrival assigns init segment references via this.initSegmentReference_ = initSegmentReference and thereby "binds" the referenced data to the long-living object (see the sketch at the end of this comment).
I compared a tagged build of PR #5899 against its predecessor #5729 (the 4.6 release).
web:
[screenshot]
webOS 6:
[screenshot: webos6-PR-5899-memory-comparison]

The memory allocation magnitude is still high on webOS 6 for DASH live playback:

  1. 25-30 MB for the 4.3 release.
  2. 45-55 MB for the chore(main): release 4.6.0 (#5729) build.
  3. 70-80 MB for the timeline-segment-index-free-all branch.

Some additional investigation of this is needed on TV devices.
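
A minimal sketch of that binding, with names and signatures simplified rather than Shaka's exact API: every manifest refresh hands the long-lived index a freshly built InitSegmentReference, and older instances stay reachable through whatever captured them while they were current.

```js
// Illustrative sketch; names and signatures are simplified, not Shaka's API.
class TimelineSegmentIndexSketch {
  appendTemplateInfo(templateInfo, initSegmentReference) {
    this.templateInfo_ = templateInfo;
    // A brand-new InitSegmentReference arrives with every manifest
    // refresh; older instances stay reachable through any
    // SegmentReference that captured them while they were current, for
    // as long as those references remain in the DVR window.
    this.initSegmentReference_ = initSegmentReference;
  }
}
```

With a 1h DVR window, that would allow many generations of init segment references (and whatever manifest data they hold onto) to stay alive at the same time.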

@nrcrast
Contributor

nrcrast commented Jan 31, 2024

Interesting.

I don't fully understand, though, what the real difference is or how that PR could cause this. Even before that PR, the TSI was holding a reference to an initSegmentReference -- that PR just makes that reference change periodically.

It's also unclear to me why my timeline-segment-index-free-all wouldn't have fixed this issue.

Something isn't quite adding up for me 🤔

@OrenMe
Contributor

OrenMe commented Jan 31, 2024

We are testing it commit by commit, so it is hard to tell. One gotcha is that in JS land the code we read and the code we ship are not the same: a change may look safe, but the compiled code might optimize things and create some closure.

@Illmatikx
Author

According to the heap dumps, init segment references are assigned to the initSegmentReference property of SegmentReference objects on each fresh manifest arrival and appendTemplateInfo() call. The TimelineSegmentIndex and its references array hold a lot of SegmentReference elements: with a 1h DVR it is 1800+ for the video and audio indexes respectively.
[screenshot]

@nrcrast
Contributor

nrcrast commented Feb 1, 2024

According to the heap dumps, init segment references are assigned to the initSegmentReference property of SegmentReference objects on each fresh manifest arrival and appendTemplateInfo() call. The TimelineSegmentIndex and its references array hold a lot of SegmentReference elements: with a 1h DVR it is 1800+ for the video and audio indexes respectively.

@Illmatikx I think this has always been the case, though.

Before the Timeline Segment Index was introduced, all of these SegmentReferences were created during manifest parse time, and each Segment Reference had a ref to the InitSegmentReference.

With the introduction of the Timeline Segment Index, these Segment References are only created upon retrieval (SegmentIndex.get).

@OrenMe The only difference between your two tests was that single commit for #5899? It just seems like such an innocuous change! @avelad any idea?

Maybe this is some weird compilation issue, but on a linear stream Segment References will be periodically evicted. They should then be garbage collected. As long as there is no other reference to the InitSegmentReference (which I tried to fix in my branch), the InitSegmentReference should be collected as well eventually.

In practice, I must be missing something, because it's not happening that way 😆 .
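
To make the contrast between the two models concrete, here is a small sketch (makeSegmentReference is a hypothetical stand-in, not Shaka's API):

```js
// Illustrative only; makeSegmentReference is a hypothetical stand-in for
// constructing a SegmentReference that also holds the init reference.
function makeSegmentReference(timelineEntry, initSegmentReference) {
  return {
    start: timelineEntry.start,
    end: timelineEntry.end,
    initSegmentReference,
  };
}

// Pre-TimelineSegmentIndex model: every reference is materialized at
// manifest parse time and held for the whole window.
function buildEagerReferences(timelineEntries, initSegmentReference) {
  return timelineEntries.map(
      (entry) => makeSegmentReference(entry, initSegmentReference));
}

// TimelineSegmentIndex model: keep only the raw timeline and build a
// reference on demand when get(position) is called.
function getLazyReference(timelineEntries, initSegmentReference, position) {
  return makeSegmentReference(
      timelineEntries[position], initSegmentReference);
}
```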

@cristian-atehortua
Contributor

Apart from that, do you think it is appropriate to create a new initSegmentReference on each call to SegmentTemplate.createStreamInfo?

Shouldn't we check whether the properties used to create the initSegmentReference have changed, and only create a new one in that case? (A sketch of that check follows below.)
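
A hedged sketch of such a check, using illustrative field names rather than Shaka's exact InitSegmentReference shape: only swap in the new object when the inputs that define it actually differ, so existing references keep sharing a single instance.

```js
// Illustrative only; field names do not necessarily match Shaka's API.
function pickInitSegmentReference(current, candidate) {
  const unchanged = current &&
      current.uris.join(',') === candidate.uris.join(',') &&
      current.startByte === candidate.startByte &&
      current.endByte === candidate.endByte;
  // Reuse the existing instance when nothing that defines it changed.
  return unchanged ? current : candidate;
}
```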

@avelad
Member

avelad commented Apr 29, 2024

Shouldn't we check whether the properties used to create the initSegmentReference have changed, and only create a new one in that case?

Yes, please do it

joeyparrish pushed a commit that referenced this issue Apr 29, 2024
… by updating old initSegmentReference (#6499)

Helps on #6070
@avelad
Member

avelad commented May 6, 2024

@Illmatikx Can you share the same comparison against 4.8.1? Thanks!

@Illmatikx
Author

@avelad that is the plan on my side, but I am blocked at the moment by #6533.
I managed to run some heap memory comparison tests on web, though, and 4.8.1 looks fine in terms of the notorious memory leak:
[screenshot: shaka481-memory-test]
I intend to run the test on a webOS device and monitor the memory allocation rate with the changes. That is a must for us, since the UI scrolling freezes were caused not by the leak but by the high memory allocation.

joeyparrish pushed a commit that referenced this issue May 7, 2024
… by updating old initSegmentReference (#6499)

Helps on #6070

Backported to v4.7.x
@avelad
Member

avelad commented May 9, 2024

@Illmatikx Yesterday we released 4.8.3, can you test on WebOS if this issue is fixed now?

@avelad avelad added the status: waiting on response Waiting on a response from the reporter(s) of the issue label May 13, 2024
@shaka-bot
Collaborator

Closing due to inactivity. If this is still an issue for you or if you have further questions, the OP can ask shaka-bot to reopen it by including @shaka-bot reopen in a comment.

@shaka-bot shaka-bot removed the status: waiting on response Waiting on a response from the reporter(s) of the issue label May 20, 2024
@Illmatikx
Author

Hello @avelad, sorry for the late response, but I managed to take a look at version 4.8.4 with our demo app. The test targeted a start-over video with a 15h35m timeShiftBufferDepth, so it was a kind of edge case.
The memory allocation magnitude has dropped from 40-65 MB (Shaka 4.6.1) to 5-8 MB thanks to the manifest processing changes for TimelineSegmentIndex.
[screenshot: memory-allocation-shaka-comparison]
At the same time, there was still a "benign" leak observed on the webOS 6 device, related to the TimelineSegmentIndex in-memory structure itself:
[screenshot: memory-trend-shaka484]
It leaked a couple of dozen MB during a 1h20m test.

@avelad avelad reopened this May 20, 2024
@Illmatikx
Author

@avelad would it be doable to enable/disable the use of TimelineSegmentIndex from #5061 via a flag/configuration option? That way one could choose per use case: either the manifest parsing time improvement or the memory saving.

@avelad
Member

avelad commented May 28, 2024

@Illmatikx can you test if #6610 (comment) reduces the pressure? (It seems like a bit of an ugly hack, but... I would be willing to add it if it works for everyone.) Thanks!

@avelad avelad modified the milestones: v4.9, v4.10 May 30, 2024
@avelad avelad modified the milestones: v4.10, v4.11 Jul 1, 2024
@avelad
Member

avelad commented Jul 15, 2024

In version 4.10.6 we introduced some improvements. Can you validate whether they are enough, or whether we need more? Thanks!

@avelad avelad added the status: waiting on response Waiting on a response from the reporter(s) of the issue label Jul 16, 2024
@shaka-bot
Collaborator

Closing due to inactivity. If this is still an issue for you or if you have further questions, the OP can ask shaka-bot to reopen it by including @shaka-bot reopen in a comment.

@shaka-bot shaka-bot removed the status: waiting on response Waiting on a response from the reporter(s) of the issue label Jul 23, 2024
@shaka-bot shaka-bot added the status: archived Archived and locked; will not be updated label Sep 21, 2024
@shaka-project shaka-project locked as resolved and limited conversation to collaborators Sep 21, 2024