
[VL] spill starts when reserved size is less than offheap size #7380

Open · FelixYBW opened this issue Sep 28, 2024 · 6 comments
Labels: bug (Something isn't working), triage

FelixYBW (Contributor) commented Sep 28, 2024

Backend

VL (Velox)

Bug description

Off-heap is 8.5 GB, no fallback. When an operator reserves ~6.5 GB the spill is triggered, i.e. at about 75% of the off-heap size.

@zhztheplayer

FelixYBW added the bug (Something isn't working) and triage labels on Sep 28, 2024
FelixYBW (Contributor, Author) commented Sep 29, 2024

It's caused by the config spark.gluten.memory.overAcquiredMemoryRatio.
The config was introduced when Velox's spill wasn't mature enough. On every request, Gluten reserves 30% more memory, so Velox can only use about 70% of the off-heap memory size.

Now that Velox's spill has become much more mature, we may decrease the ratio to 10% or 0 and see if there are any bugs.
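
As a rough sanity check against the numbers in the bug description (a sketch, not Gluten's exact accounting): if every reservation is inflated by the 30% over-acquire ratio, the budget visible to operators is roughly offheap / (1 + ratio) = 8.5 GB / 1.3 ≈ 6.5 GB, which matches the point where spill was observed.

```scala
// Illustrative only: approximates the operator-visible budget under over-acquire.
// Assumes the simple model "each reservation of N bytes also over-acquires N * ratio";
// Gluten's real memory accounting may differ in detail.
object OverAcquireBudget {
  def main(args: Array[String]): Unit = {
    val offHeapGiB = 8.5          // spark.memory.offHeap.size
    val overAcquiredRatio = 0.3   // spark.gluten.memory.overAcquiredMemoryRatio (30% per this thread)
    val usableGiB = offHeapGiB / (1 + overAcquiredRatio)
    println(f"operator-visible budget ≈ $usableGiB%.2f GiB") // ≈ 6.54 GiB, ~75% of off-heap
  }
}
```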

In this case there are still "killed by YARN" errors, which means there is still a lot of memory allocation that is not tracked.

@Yohahaha, @ulysses-you @zhli1142015 @jackylee-ch @kecookier @surnaik @WangGuangxin
In case you didn't notice it.

zhztheplayer (Member) commented Sep 29, 2024

I think it's safer to remove that option now, as long as we can run enough tests to prove our assumption. The daily TPC test we use internally is far from covering real-world cases.

I've filed PR #7384. We can proceed once we are confident.

surnaik (Contributor) commented Sep 29, 2024

> I think it's safer to remove that option now, as long as we can run enough tests to prove our assumption. The daily TPC test we use internally is far from covering real-world cases.
>
> I've filed PR #7384. We can proceed once we are confident.

I agree, let's remove the config for now. If there are bugs in the future, we can fix the underlying issue.

FelixYBW (Contributor, Author) commented:

Decreasing the config to 0 will cause more "killed by YARN" errors. But "killed by YARN" is usually caused by a Velox bug.

zhztheplayer (Member) commented:

> Decreasing the config to 0 will cause more "killed by YARN" errors. But "killed by YARN" is usually caused by a Velox bug.

Let's run some tests, and if that's true, we can increase the default memory overhead to address it.

Yohahaha (Contributor) commented:

> It's caused by the config spark.gluten.memory.overAcquiredMemoryRatio. The config was introduced when Velox's spill wasn't mature enough. On every request, Gluten reserves 30% more memory, so Velox can only use about 70% of the off-heap memory size.
>
> Now that Velox's spill has become much more mature, we may decrease the ratio to 10% or 0 and see if there are any bugs.
>
> In this case there are still "killed by YARN" errors, which means there is still a lot of memory allocation that is not tracked.
>
> @Yohahaha @ulysses-you @zhli1142015 @jackylee-ch @kecookier @surnaik @WangGuangxin In case you didn't notice it.

Thanks for the information; I always set it to 0 in our jobs.
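
For anyone who wants to try the same thing, here is a minimal sketch of the relevant session configuration. The Gluten key comes from this thread; spark.executor.memoryOverhead is the standard Spark knob for the "killed by YARN" concern above; all values are illustrative, and in practice these would be set at submit time rather than in an already running session.

```scala
// Sketch only: disables Gluten's over-acquire and leaves room for untracked
// native allocations via executor memory overhead. Values are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8.5g")
  .config("spark.gluten.memory.overAcquiredMemoryRatio", "0") // reserves 30% extra by default per this thread
  .config("spark.executor.memoryOverhead", "2g")              // bump if "killed by YARN" shows up
  .getOrCreate()
```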
