
[VL] spill starts when reserved size is less than offheap size #7380

Open · FelixYBW opened this issue Sep 28, 2024 · 6 comments
Labels: bug (Something isn't working), triage

FelixYBW (Contributor) commented Sep 28, 2024

Backend

VL (Velox)

Bug description

Off-heap is 8.5 GB, no fallback. When an operator reserves ~6.5 GB the spill is triggered, i.e. at about 75% of the off-heap size.

@zhztheplayer

FelixYBW added the bug (Something isn't working) and triage labels on Sep 28, 2024
FelixYBW (Contributor, Author) commented Sep 29, 2024

It's caused by the config spark.gluten.memory.overAcquiredMemoryRatio.
The config was introduced when Velox's spill wasn't mature enough. On every request, Gluten reserves 30% more memory, so Velox can only use about 70% of the off-heap memory size.

Now that Velox's spill has become much more mature, we may decrease the ratio to 10% or 0 and see if there are any bugs.
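
As a rough sanity check against the numbers in the bug description (a sketch, not Gluten's exact accounting): if every reservation is inflated by the 30% over-acquire ratio, the budget visible to operators is roughly offheap / (1 + ratio) = 8.5 GB / 1.3 ≈ 6.5 GB, which matches the point where spill was observed.

```scala
// Illustrative only: approximates the operator-visible budget under over-acquire.
// Assumes the simple model "each reservation of N bytes also over-acquires N * ratio";
// Gluten's real memory accounting may differ in detail.
object OverAcquireBudget {
  def main(args: Array[String]): Unit = {
    val offHeapGiB = 8.5          // spark.memory.offHeap.size
    val overAcquiredRatio = 0.3   // spark.gluten.memory.overAcquiredMemoryRatio (30% per this thread)
    val usableGiB = offHeapGiB / (1 + overAcquiredRatio)
    println(f"operator-visible budget ≈ $usableGiB%.2f GiB") // ≈ 6.54 GiB, ~75% of off-heap
  }
}
```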

In this case there are still "killed by YARN" errors, which means there is still a lot of memory allocation that is not tracked.

@Yohahaha, @ulysses-you @zhli1142015 @jackylee-ch @kecookier @surnaik @WangGuangxin
In case you didn't notice it.

zhztheplayer (Member) commented Sep 29, 2024

I think it's safer to remove that option now, as long as we can run enough tests to prove our assumption. The daily TPC test we use internally is far from covering real-world cases.

I've filed PR #7384. We can proceed once we are confident.

surnaik (Contributor) commented Sep 29, 2024

> I think it's safer to remove that option now, as long as we can run enough tests to prove our assumption. The daily TPC test we use internally is far from covering real-world cases.
>
> I've filed PR #7384. We can proceed once we are confident.

I agree, let's remove the config for now. If there are bugs in the future, we can fix the underlying issue.

FelixYBW (Contributor, Author) commented:

Decreasing the config to 0 will cause more "killed by YARN" errors. But "killed by YARN" is usually caused by a Velox bug.

zhztheplayer (Member) commented:

> Decreasing the config to 0 will cause more "killed by YARN" errors. But "killed by YARN" is usually caused by a Velox bug.

Let's run some tests, and if that's true, we can increase the default memory overhead to address it.

Yohahaha (Contributor) commented:

> It's caused by the config spark.gluten.memory.overAcquiredMemoryRatio. The config was introduced when Velox's spill wasn't mature enough. On every request, Gluten reserves 30% more memory, so Velox can only use about 70% of the off-heap memory size.
>
> Now that Velox's spill has become much more mature, we may decrease the ratio to 10% or 0 and see if there are any bugs.
>
> In this case there are still "killed by YARN" errors, which means there is still a lot of memory allocation that is not tracked.
>
> @Yohahaha @ulysses-you @zhli1142015 @jackylee-ch @kecookier @surnaik @WangGuangxin In case you didn't notice it.

Thanks for the information; I always set it to 0 in our jobs.
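
For anyone who wants to try the same thing, here is a minimal sketch of the relevant session configuration. The Gluten key comes from this thread; spark.executor.memoryOverhead is the standard Spark knob for the "killed by YARN" concern above; all values are illustrative, and in practice these would be set at submit time rather than in an already running session.

```scala
// Sketch only: disables Gluten's over-acquire and leaves room for untracked
// native allocations via executor memory overhead. Values are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8.5g")
  .config("spark.gluten.memory.overAcquiredMemoryRatio", "0") // reserves 30% extra by default per this thread
  .config("spark.executor.memoryOverhead", "2g")              // bump if "killed by YARN" shows up
  .getOrCreate()
```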
