
do not hash the entire schema on every query plan cache lookup #5374

Merged 4 commits from geal/schema_hash into dev on Jun 7, 2024

Conversation

@Geal (Contributor) commented Jun 7, 2024

Hashing the entire schema on every query plan cache lookup is causing performance issues on big schemas.
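The fix described by the PR title is the standard precompute-once pattern: derive the schema's hash a single time when the schema is parsed, store it in the `Schema` struct behind an `Arc`, and have every cache lookup clone the `Arc` instead of re-hashing the SDL. A minimal, dependency-free sketch follows; the type and field names mirror the diff below, but the hashing (std's `DefaultHasher` instead of the router's actual Sha256-based schema ID) and the `parse` constructor are simplified stand-ins, not the router's real API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical, simplified stand-in for the router's Schema type.
struct Schema {
    #[allow(dead_code)]
    raw_sdl: Arc<String>,
    /// Precomputed once at construction, instead of re-hashed on every
    /// cache lookup.
    hash: Arc<String>,
}

impl Schema {
    fn parse(sdl: &str) -> Self {
        // Hash the SDL exactly once, when the schema is built.
        let mut hasher = DefaultHasher::new();
        sdl.hash(&mut hasher);
        Schema {
            raw_sdl: Arc::new(sdl.to_string()),
            hash: Arc::new(format!("{:x}", hasher.finish())),
        }
    }

    /// Lookups now pay for a pointer copy, not a full-SDL hash.
    fn schema_id(&self) -> Arc<String> {
        Arc::clone(&self.hash)
    }
}

fn main() {
    let schema = Schema::parse("type Query { hello: String }");
    let a = schema.schema_id();
    let b = schema.schema_id();
    // Both lookups return the same precomputed allocation.
    assert!(Arc::ptr_eq(&a, &b));
    println!("schema id: {}", a);
}
```

On a 500 kB schema, the per-lookup cost drops from hashing half a megabyte of text to a single atomic reference-count increment, which is the effect measured later in this thread.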


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible (1)
  • Documentation completed (2)
  • Performance impact assessed and acceptable
  • Tests added and passing (3)
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@Geal Geal requested review from a team as code owners June 7, 2024 08:51


router-perf bot commented Jun 7, 2024

CI performance tests

  • step - Basic stress test that steps up the number of users over time
  • reload - Reload test over a long period of time at a constant rate of users
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • xlarge-request - Stress test with 10 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • xxlarge-request - Stress test with 100 MB request payload
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • no-graphos - Basic stress test, no GraphOS.
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • large-request - Stress test with a 1 MB request payload
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled

@abernix (Member) left a comment

On this PR, I think it's reasonable to check the "Changes are compatible" checkbox but not the new-tests boxes, so long as manual testing can be done, mostly because I think this might be complicated to test. And of course, "Performance impact assessed and acceptable" should be checked, if that is indeed true, since this is performance related.

Overall, it looks good to me, and thank you for taking care to make sure it matches our existing schema ID.

@SimonSapin (Contributor) left a comment

Since schema_id() is now precomputed and kept in the Schema struct it’s tempting to make its other use (inject_schema_id) take it from Schema rather than compute it, but Schema is not created by that point. It’s something we can refactor later (potentially easier after removing Deno entirely?) but doesn’t need to block this PR.

@Geal (Contributor, Author) commented Jun 7, 2024

Yes, I'd like to have Schema parsed before injecting the schema id. I think the reason it was not done there yet was that the API schema needed the JS planner to be available, so that will be removed soon.
We're actually already parsing the schema once before that, when checking for enterprise features in the schema, but at that point we only work with apollo-parser IIRC.
I really want to clean all of that up :)

@Geal (Contributor, Author) commented Jun 7, 2024

The perf tests have run and they're OK, but I suspect we would only see an effect on a huge schema.

@@ -261,7 +261,7 @@ where
     query: query.clone(),
     operation: operation.clone(),
     hash: doc.hash.clone(),
-    sdl: Arc::clone(&self.schema.raw_sdl),
+    schema_id: Arc::clone(&self.schema.hash),
@IvanGoncharov (Member) commented Jun 7, 2024

nit: if schema_id contains schema.hash maybe it should be called schema_hash.

just thinking out loud 💭

@Geal (Contributor, Author) replied:

I renamed it to schema_id everywhere in fa19613, since we were already naming it that way elsewhere.
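The diff above shows the lookup side of the change: the cached query-plan entry now carries the precomputed schema id as a cheap `Arc` clone, so hashing a cache key touches a short id string rather than the full SDL. A sketch of such a cache key with hypothetical names (`CachingQueryKey` and its fields loosely mirror the diff, but this is not the router's actual definition):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical cache key: the schema is represented by its precomputed
// id, so deriving Hash touches a short string, not a 500 kB SDL.
#[derive(Clone, PartialEq, Eq, Hash)]
struct CachingQueryKey {
    query: String,
    operation: Option<String>,
    schema_id: Arc<String>, // Arc<String> hashes via the inner String
}

fn main() {
    // The id is computed once elsewhere; lookups only clone the Arc.
    let schema_id = Arc::new("abc123".to_string());
    let mut cache: HashMap<CachingQueryKey, &str> = HashMap::new();

    let key = CachingQueryKey {
        query: "{ hello }".to_string(),
        operation: None,
        schema_id: Arc::clone(&schema_id), // cheap pointer copy per lookup
    };
    cache.insert(key.clone(), "planned");
    assert_eq!(cache.get(&key), Some(&"planned"));
}
```

Because `Arc<T>` forwards `Hash` and `Eq` to the inner value, keys built from different `Arc` clones of the same id still compare and hash identically.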

@abernix (Member) commented Jun 7, 2024

> the perf tests have run and they're ok, but I suspect we would only see an effect on a huge schema

Ok sounds good. Can you indicate that in the PR checklist? And check manual tests?

@Geal (Contributor, Author) commented Jun 7, 2024

Manual tests on a 500 kB schema show that schema hashing and cache lookup previously accounted for up to 15% of CPU time; it is now down to less than 1%, with a 20% improvement in requests per second.

@xuorig (Contributor) commented Jun 7, 2024

We have a similar fix running in prod right now that showed similar improvements on real workloads 👍

@Geal merged commit 2c8531f into dev on Jun 7, 2024; 14 checks passed.
@Geal deleted the geal/schema_hash branch June 7, 2024 13:56
@lrlna lrlna mentioned this pull request Jun 18, 2024

5 participants