ui: graphs don't load on 240 node cluster #72986

Closed
nvanbenschoten opened this issue Nov 19, 2021 · 4 comments · Fixed by #74662
Comments

@nvanbenschoten
Member

This is identical to #24018, only at a larger scale. We resolved that issue by dropping down the number of query workers by a factor of 8. This increased the per timeseries query memory limit from 1MiB to 8MiB. It should come as no surprise then that we once again see this issue on 240 node clusters (given 30*8=240).
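
The arithmetic behind that observation can be sketched in Go. This is a minimal illustration, assuming the per-query limit is simply the server-wide budget split evenly across query workers. The worker counts (64, then 8) are inferred from the 1MiB and 8MiB figures above, and `perQueryLimit` is a hypothetical helper, not a function from the CockroachDB source.

```go
package main

import "fmt"

const mib = int64(1) << 20

// perQueryLimit splits a server-wide time-series memory budget evenly
// across query workers. Hypothetical helper, not CockroachDB code.
func perQueryLimit(totalBudget, workers int64) int64 {
	return totalBudget / workers
}

func main() {
	total := 64 * mib // the overall limit mentioned in this issue

	// Worker counts are assumptions inferred from the quoted limits:
	// 64 MiB / 1 MiB = 64 workers, reduced by a factor of 8 to 8 workers.
	fmt.Println(perQueryLimit(total, 64)/mib, "MiB per query with 64 workers")
	fmt.Println(perQueryLimit(total, 8)/mib, "MiB per query with 8 workers")
}
```

If per-query memory grows roughly in proportion to node count, as the 30*8=240 remark suggests, an 8x larger per-query limit would be expected to run out at about 8x the cluster size.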

@mrtracy laid out a few options on that issue #24018 (comment). I think we should explore the first one - "Raise the overall limit (currently 64MiB)". Specifically, I think it would make sense to scale this server-wide limit by the available memory on the machine. On a machine with 100s of GiB of available memory (e.g. n2-standard-48 machines have 192 GiB of memory), 64MiB isn't much. Or instead of a static limit, we should hook this memory budget into the --max-sql-memory budget.

cc. @piyush-singh

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-ts Relating to time-series management. labels Nov 19, 2021
@thtruo
Contributor

thtruo commented Dec 2, 2021

cc @nkodali: assigned to our team label for triage and discussion in next Monday's backlog grooming session.

@thtruo
Contributor

thtruo commented Dec 14, 2021

Confirming that this is in our December milestone

@dhartunian
Collaborator

(sorry repeat comment since I accidentally made it in Jira)

@nvanbenschoten have you considered running Prometheus alongside these clusters instead of using DB Console for timeseries metrics? That could alleviate this directly. I do want to tackle this issue but Prometheus will likely work for you immediately.

We have a demo configuration with matching dashboards from DB Console under /monitoring/demo.


I am still going ahead with a proposed solution but just wanted to throw out the workaround above. Some large customers will hit this limit at some point soonish so I definitely want to address it directly.

I like the idea of using an existing memory limit like system-wide or --max-sql-memory. I'll put up a quick prototype for review and see if that will be adequate.

dhartunian added a commit to dhartunian/cockroach that referenced this issue Jan 11, 2022
Previously, the memory limit per-`tsdb` worker was set at a static
64MiB. This cap created issues seen in cockroachdb#24018 where this limit was hit
on a 30 node cluster. To alleviate the issue, the number of workers was
reduced.

We've currently hit this limit again as part of load testing with larger
clusters and have decided to make the per-query worker memory limit
dynamic. The per-worker limit is now raised based on the amount of memory
available to the SQL Pool via the `MemoryPoolSize` configuration
variable. This is set to be 25% of the system memory by default. The
`tsdb` memory cap per-worker is now doubled until it reaches `1/128` of
the memory pool setting.

For example, on a node with 128 - 256 GiB of memory, this will
correspond to 512 MiB allocated per worker.

TODO(davidh): Can the tests be faster? They iterate on a server create
TODO(davidh): Is 1/128 a good setting? How do we validate this.
TODO(davidh): Should this behavior be gated behind a feature flag? It's
possible on some memory-constrained deployments a sudden spike in memory
usage by tsdb could cause problems.

Resolves cockroachdb#72986

Release note (ops change): customers running clusters with 240 nodes or
more can effectively access tsdb metrics.
@nvanbenschoten
Member Author

> @nvanbenschoten have you considered running Prometheus alongside these clusters instead of using DB Console for timeseries metrics? That could alleviate this directly. I do want to tackle this issue but Prometheus will likely work for you immediately.

This is a good point. I was planning to run with a patched version of CRDB with a higher queryMemoryMax value if this issue wasn't fixed in time, but you're right that Prometheus would also work to get around this issue for the time being.

dhartunian added a commit to dhartunian/cockroach that referenced this issue Jan 14, 2022
Previously, the memory limit for all `tsdb` workers was set at a static
64MiB. This cap created issues seen in cockroachdb#24018 where this limit was hit
on a 30 node cluster. To alleviate the issue, the number of workers was
reduced, raising the per-worker allocation.

We've currently hit this limit again as part of load testing with larger
clusters and have decided to make the per-query worker memory limit
dynamic. The per-worker limit is now raised based on the amount of memory
available to the SQL Pool via the `MemoryPoolSize` configuration
variable. This is set to be 25% of the system memory by default. The
`tsdb` memory cap per-worker is now doubled until it reaches `1/128` of
the memory pool setting.

For example, on a node with 128 - 256 GiB of memory, this will
correspond to 512 MiB allocated for all running `tsdb` queries.

In addition, the ts server is now connected to the same `BytesMonitor`
instance as the SQL memory monitor and workers will be capped at double
the query limit. Results are monitored as before but a cap is not
introduced there since we didn't have one present previously.

This behavior is gated behind a private cluster setting that's enabled
by default.

TODO(davidh): Can the tests be faster? They iterate on a server create
TODO(davidh): Is 1/128 a good setting? How do we validate this.

Resolves cockroachdb#72986

Release note (ops change): customers running clusters with 240 nodes or
more can effectively access tsdb metrics.
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jan 27, 2022
Previously, the memory limit for all `tsdb` workers was set at a static
64MiB. This cap created issues seen in cockroachdb#24018 where this limit was hit
on a 30 node cluster. To alleviate the issue, the number of workers was
reduced, raising the per-worker allocation.

We've currently hit this limit again as part of load testing with larger
clusters and have decided to make the per-query worker memory limit
dynamic. The per-worker limit is now raised based on the amount of memory
available to the SQL Pool via the `MemoryPoolSize` configuration
variable. This is set to be 25% of the system memory by default. The
`tsdb` memory cap per-worker is now doubled until it reaches `1/128` of
the memory pool setting.

For example, on a node with 128 - 256 GiB of memory, this will
correspond to 512 MiB allocated for all running `tsdb` queries.

In addition, the ts server is now connected to the same `BytesMonitor`
instance as the SQL memory monitor and workers will be capped at double
the query limit. Results are monitored as before but a cap is not
introduced there since we didn't have one present previously.

This behavior is gated behind a private cluster setting that's enabled
by default and sets the ratio at 1/128 of the SQL memory pool.

Resolves cockroachdb#72986

Release note (ops change): customers running clusters with 240 nodes or
more can effectively access tsdb metrics.
dhartunian added a commit to dhartunian/cockroach that referenced this issue Jan 31, 2022
craig bot pushed a commit that referenced this issue Feb 15, 2022
74563: kv,kvcoord,sql: poison txnCoordSender after a retryable error r=lidorcarmel a=lidorcarmel

Previously kv users could lose parts of a transaction without getting an
error. After Send() returned a retryable error the state of txn got reset
which made it usable again. If the caller ignored the error they could
continue applying more operations without realizing the first part of the
transaction was discarded. See more details in the issue (#22615).

The simple case example is where the retryable closure of DB.Txn() returns
nil instead of returning the retryable error back to the retry loop - in this
case the retry loop declares success without realizing we lost the first part
of the transaction (all the operations before the retryable error).

This PR leaves the txn in a "poisoned" state after encountering an error, so
that all future operations fail fast. The caller is therefore expected to
reset the txn handle back to a usable state intentionally, by calling
Txn.PrepareForRetry(). In the simple case of DB.Txn() the retry loop will
reset the handle and run the retry even if the callback returned nil.
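
The fail-fast behavior described above can be illustrated with a toy model. This is a sketch under stated assumptions: `handle`, `send`, and `prepareForRetry` are hypothetical stand-ins invented for this example and do not reproduce the real `kv.Txn` or `txnCoordSender` APIs.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRetryable = errors.New("retryable error")
	errPoisoned  = errors.New("txn poisoned by earlier retryable error")
)

// handle is a toy stand-in for a transaction handle (hypothetical type).
type handle struct {
	poisoned bool
	ops      []string
}

// send records op, or fails fast if the handle is poisoned. When fail is
// true it simulates a retryable error: earlier operations are discarded
// and the handle is poisoned.
func (h *handle) send(op string, fail bool) error {
	if h.poisoned {
		return errPoisoned
	}
	if fail {
		h.poisoned = true
		h.ops = nil
		return errRetryable
	}
	h.ops = append(h.ops, op)
	return nil
}

// prepareForRetry deliberately resets the handle to a usable state.
func (h *handle) prepareForRetry() {
	h.poisoned = false
	h.ops = nil
}

func main() {
	h := &handle{}
	_ = h.send("put a", false)
	_ = h.send("put b", true) // retryable error that the caller ignores

	// Without poisoning, this would succeed on a silently reset transaction.
	fmt.Println(h.send("put c", false))

	h.prepareForRetry()
	fmt.Println(h.send("put a", false))
}
```

In this model, a caller that swallows the retryable error gets an explicit failure on the next operation instead of committing a transaction whose first part was lost.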

Closes #22615

Release note: None

74662: tsdb: expand mem per worker based on sql pool size r=dhartunian a=dhartunian

Previously, the memory limit for all `tsdb` workers was set at a static
64MiB. This cap created issues seen in #24018 where this limit was hit
on a 30 node cluster. To alleviate the issue, the number of workers was
reduced, raising the per-worker allocation.

We've currently hit this limit again as part of load testing with larger
clusters and have decided to make the per-query worker memory limit
dynamic. The per-worker limit is now raised based on the amount of memory
available to the SQL Pool via the `MemoryPoolSize` configuration
variable. This is set to be 25% of the system memory by default. The
`tsdb` memory cap per-worker is now doubled until it reaches `1/128` of
the memory pool setting.

For example, on a node with 128 - 256 GiB of memory, this will
correspond to 512 MiB allocated for all running `tsdb` queries.

In addition, the ts server is now connected to the same `BytesMonitor`
instance as the SQL memory monitor and workers will be capped at double
the query limit. Results are monitored as before but a cap is not
introduced there since we didn't have one present previously.

This behavior is gated behind a private cluster setting that's enabled
by default and sets the ratio at 1/128 of the SQL memory pool.

Resolves #72986

Release note (ops change): customers running clusters with 240 nodes or
more can effectively access tsdb metrics.
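
The doubling rule in this description can be sketched as follows. This is one reading of "doubled until it reaches `1/128` of the memory pool setting", assuming the cap starts at the old static 64MiB and doubles while it is still below the target. `scaledTSDBLimit` is a hypothetical name, and the loop condition is an assumption rather than the code from the PR.

```go
package main

import "fmt"

const (
	mib = int64(1) << 20
	gib = int64(1) << 30
)

// scaledTSDBLimit doubles the old static 64 MiB cap until it is at least
// 1/128 of the SQL memory pool. Hypothetical sketch, not the PR's code.
func scaledTSDBLimit(sqlPoolBytes int64) int64 {
	limit := 64 * mib
	target := sqlPoolBytes / 128
	for limit < target {
		limit *= 2
	}
	return limit
}

func main() {
	for _, sysMem := range []int64{8 * gib, 192 * gib, 256 * gib} {
		pool := sysMem / 4 // default SQL pool: 25% of system memory
		fmt.Printf("%d GiB system memory -> %d MiB\n",
			sysMem/gib, scaledTSDBLimit(pool)/mib)
	}
}
```

Under this reading, nodes with more than 128 GiB and up to 256 GiB of memory land on 512 MiB, which matches the example in the description, while small nodes keep the original 64 MiB.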

75677: randgen: add PopulateRandTable r=mgartner a=msbutler

PopulateRandTable populates the caller's table with random data. This helper
function aims to make it easier for engineers to develop randomized tests that
leverage randgen / sqlsmith.

Informs #72345

Release note: None

76334: opt: fix missing filters after join reordering r=mgartner a=mgartner

#### opt: add TES, SES, and rules to reorderjoins

This commit updates the output of the `reorderjoins` opt test command to
display the initial state of the `JoinOrderBuilder`. It adds additional
information to the output including the TES, SES, and conflict rules for
each edge.

Release note: None

#### opt: fix missing filters after join reordering

This commit eliminates logic in the `assoc`, `leftAsscom`, and
`rightAsscom` functions in the join order builder that aimed to prevent
generating "orphaned" predicates, where one or more referenced relations
are not in a join's input. In rare cases, this logic had the side effect
of creating invalid conflict rules for edges, which could prevent valid
predicates from being added to reordered join trees.

It is safe to remove these conditionals because they are unnecessary.
The CD-C algorithm already prevents generation of orphaned predicates by
checking that the total eligibility set (TES) is a subset of a join's
input vertices. In our implementation, this is handled by the
`checkNonInnerJoin` and `checkInnerJoin` functions.
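
The subset test at the heart of that check can be illustrated with a bitmask. This is a minimal sketch, assuming relations are numbered and a set of relations is stored as bits in an integer. `vertexSet` and `isSubsetOf` are hypothetical names and are not taken from the optimizer's code.

```go
package main

import "fmt"

// vertexSet is a bitmask of base relations: bit i set means relation i
// is a member. Hypothetical type for illustration only.
type vertexSet uint64

// isSubsetOf reports whether every relation in s is also in other.
func (s vertexSet) isSubsetOf(other vertexSet) bool {
	return s&^other == 0
}

func main() {
	const a, b, c vertexSet = 1 << 0, 1 << 1, 1 << 2

	// A predicate whose total eligibility set (TES) references a and c.
	tes := a | c

	fmt.Println(tes.isSubsetOf(a | b))     // c is missing from the join inputs
	fmt.Println(tes.isSubsetOf(a | b | c)) // all referenced relations present
}
```

A predicate is only attached to a join when the check passes, so a predicate that references a relation outside the join's inputs is never placed there.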

Fixes #76522

Release note (bug fix): A bug has been fixed which caused the query optimizer
to omit join filters in rare cases when reordering joins, which could
result in incorrect query results. This bug was present since v20.2.


Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
@craig craig bot closed this as completed in 97150df Feb 15, 2022
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
Previously, the memory limit for all `tsdb` workers was set at a static
64MiB. This cap created issues seen in cockroachdb#24018 where this limit was hit
on a 30 node cluster. To alleviate the issue, the number of workers was
reduced, raising the per-worker allocation.

We've currently hit this limit again as part of load testing with larger
clusters and have decided to make the per-query worker memory limit
dynamic.

This PR introduces a new CLI flag `--max-tsdb-memory` to mirror the
functionality of the `--max-sql-memory` flag by accepting bytes or a
percentage of system memory. The default is set to be `1%` of system
memory or 64 MiB, whichever is greater. This ensures that performance
after this PR is equal or better for timeseries queries without eating
too far into memory budgets for SQL.

In addition, the ts server is now connected to the same `BytesMonitor`
instance as the SQL memory monitor and workers will be capped at double
the query limit. Results are monitored as before but a cap is not
introduced there since we didn't have one present previously.

Resolves cockroachdb#72986

Release note (cli change, ops change): A new CLI flag `--max-tsdb-memory`
is now available, that can set the memory budget for timeseries queries
when processing requests from the Metrics page in DB Console. Most
customers should not need to tweak this setting as the default of 1% of
system memory or 64 MiB, whichever is greater, is adequate for most
deployments. In the case where a deployment of hundreds of nodes has
low per-node memory available (below 8 GiB for instance) it may be
necessary to increase this value to `2%` or higher in order to render
timeseries graphs for the cluster using the DB Console. Otherwise, the
default settings will be adequate for the vast majority of deployments.
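
The default described in this release note can be sketched as follows. This is a minimal illustration of "1% of system memory or 64 MiB, whichever is greater". `defaultTSDBBudget` is a hypothetical helper, and the actual flag parsing (bytes or a percentage) is not shown.

```go
package main

import "fmt"

const (
	mib = int64(1) << 20
	gib = int64(1) << 30
)

// defaultTSDBBudget returns the larger of 1% of system memory and 64 MiB,
// following the default described in the release note. Hypothetical helper.
func defaultTSDBBudget(systemMemBytes int64) int64 {
	onePercent := systemMemBytes / 100
	if floor := 64 * mib; onePercent < floor {
		return floor
	}
	return onePercent
}

func main() {
	for _, sysMem := range []int64{4 * gib, 8 * gib, 192 * gib} {
		fmt.Printf("%d GiB system memory -> %d MiB budget\n",
			sysMem/gib, defaultTSDBBudget(sysMem)/mib)
	}
}
```

With these numbers the 64 MiB floor applies to any node with less than 6400 MiB (6.25 GiB) of memory, so small nodes keep the previous budget while larger nodes scale with their memory.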