cap live_bytes to zero in a few places where GC intervals are computed #170

Merged: 2 commits merged into v1.10.2+RAI from dcn-cap-live-bytes-to-zero on Aug 12, 2024

Conversation

@d-netto (Member) commented Jul 29, 2024

PR Description

Caps live_bytes to 0 when computing GC intervals.


@d-netto requested a review from @kpamnany on Jul 29, 2024 at 19:51
The github-actions bot added the labels port-to-v1.10 (This change should apply to Julia v1.10 builds) and port-to-master (This change should apply to all future Julia builds) on Jul 29, 2024
@d-netto removed the port-to-master label on Jul 29, 2024
@d-netto (Member, Author) commented Jul 29, 2024

As usual, we should definitely benchmark this before merging. Bonus points if we can run this on one of the problematic workloads pointed out by Todd.

@kpamnany (Collaborator) left a comment

Approving, but let's test this in RAICode before merging here.

// XXX: we've observed that the `live_bytes` was negative in a few cases
// which is not expected. We should investigate this further, but let's just
// cap it to 0 for now.
int64_t live_bytes_for_interval_computation = live_bytes < 0 ? 0 : live_bytes;
Collaborator (inline review comment on the diff above):

Do we also want to cap it in the other direction? It should not exceed maximum memory, right?

Member Author (reply):

If I were to cap anything in the other direction (i.e., put an upper bound on something), it would probably be gc_num.interval.

With this change, I believe it's already upper-bounded by max_total_memory, but we could make that bound a bit tighter (e.g. max_total_memory / 2 or so).

@tveldhui (reply):

If this is just a patch for v1.10.2+RAI and not intended for upstream, could we limit gc_num.interval to max_total_memory / log2(max_total_memory)?

Member Author (reply):

What's the motivation for dividing by log?

Member Author (follow-up):

To be more precise: why the choice of log here? And not some other function?

@tveldhui commented Aug 8, 2024

The OOMGuardian heuristic triggers a full GC when the heap size increases by 15% over the high-water mark within a rolling window (10 minutes). This avoids letting the heap grow more than 15% larger than it needs to be (resulting in shorter GC pauses and fewer TLB misses on pointer dereferences). It also means that if the reachable heap size is increasing monotonically, we do no more than a logarithmic number of collections. In the scenario where we don't know the actual heap size because of the accounting problem, I'm going for a logarithmic number of collections in the physical memory size. The max_total_memory / log2(max_total_memory) calculation would give an interval of 0.5 GiB for 16 GiB of RAM (32 collections until all of physical memory is allocated), and 6.7 GiB for 256 GiB of RAM (38 collections).
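
A standalone sketch that checks this arithmetic (the RAM sizes are the ones quoted above; log2 here is the C math-library function, and nothing in this snippet is taken from src/gc.c):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // RAM sizes quoted in the comment above, in bytes.
    uint64_t ram_bytes[] = {16ULL << 30, 256ULL << 30};
    for (int i = 0; i < 2; i++) {
        double mem = (double)ram_bytes[i];
        double interval = mem / log2(mem);    // proposed cap on the GC interval
        double collections = mem / interval;  // equals log2(mem)
        printf("%3.0f GiB RAM -> interval %.2f GiB, ~%.0f collections\n",
               mem / (1ULL << 30), interval / (1ULL << 30), collections);
    }
    return 0;
}

Compiled with -lm, this lands close to the quoted figures: roughly 0.47 GiB and ~34 collections for 16 GiB of RAM, and roughly 6.74 GiB and 38 collections for 256 GiB.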

Member Author (reply):

Implemented in latest commit.
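
A minimal sketch of what that cap could look like, assuming the gc_num.interval and max_total_memory names used elsewhere in this thread and that <math.h> is available; this is a fragment illustrating the idea, and the actual commit may compute and place it differently:

// Upper-bound the collection interval so that, even when live_bytes
// accounting is unreliable, at most ~log2(max_total_memory) collections
// happen before all of physical memory has been allocated.
uint64_t interval_cap = (uint64_t)(max_total_memory / log2((double)max_total_memory));
if (gc_num.interval > interval_cap)
    gc_num.interval = interval_cap;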

@tveldhui commented Aug 9, 2024

Can we bump up the urgency of fixing this? It is causing major headaches for customers running XL instances in SPCS (e.g. cashapp).

@d-netto (Member, Author) commented Aug 9, 2024

Acked. Will open a PR to test this on raicode and try to merge this ASAP.

@tveldhui left a comment

LGTM, thanks!

@tveldhui commented

Are we close to merging this?

@d-netto (Member, Author) commented Aug 12, 2024

Will meet with @kpamnany to discuss the performance results from https://github.com/RelationalAI/raicode/pull/20626.

If there are no significant regressions, we expect to merge it today.

@d-netto (Member, Author) commented Aug 12, 2024

Benchmarks look fine.

@d-netto merged commit 1c192fd into v1.10.2+RAI on Aug 12, 2024 (2 checks passed).
@d-netto deleted the dcn-cap-live-bytes-to-zero branch on August 12, 2024 at 18:46.
Labels: port-to-v1.10 (This change should apply to Julia v1.10 builds)