Introduce make_fake_batch() to avoid racy caches and redact more otel trace details #5638

garypen · 2024-07-10T15:26:50Z

It transpires that there are multiple race conditions in our tests which makes snapshotting traces problematic. This problem is particularly acute when we are testing batch requests, because of the way we build up our batch requests. The most significant impact has been on the otel traces tests which have been very flaky.

A number of the tests expose subtle races in their expectations around span creation. This is more complex and warrants a detailed explanation.

We already have a mechanism for generating a query from a fake supergraph query builder. When we added support for batching, we leveraged this functionality to convert a simple single request into a batch request by doing something (conceptually at least) like this:

map the request, so that we clone it and wrap it in an array which now contains two identical requests.

That is simple and easy to implement, so we've used it in a few places in our tests.

However, when the router executes....

At various places in the router we check if items are cached to decide whether or not to do some processing. That processing may involve the generation of a tracing span. If we don't do the processing, because, let's say, data is in a cache, then we now have a potential race condition, because not doing the processing means the various snapshots that we rely on in our tests will (or won't) match.

This, in a nutshell, are the problems that this PR fixes.

Make items in a batch different enough that caching effects are avoided.
Redact various details so that sequencing is not as much of an issue in the otel traces tests

This results in a lot of updated snapshots, but the changes aren't actually that significant.

Unblocks: #5644

I'll explain better in later commits if the idea pans out. Just creating a branch for now.

router-perf · 2024-07-10T15:27:41Z

CI performance tests

Make test batch data have a different operation name.

It's useful for testing and pointless to have three different versions of the same function.

:)

The global lock should be acquired when the test starts.

Not entirely sure why I had to change the apollo_reports.rs batch metrics test, but I'll figure that out later.

I don't know why, but I'm relaxing this test for now.

Maybe the backout of the spawn_blocking() for document parsing?

timbotnik

This seems fine - you are more of an expert here wrt batching. It would be nice if we could have some of the tests not rely on batching and keep a tighter match on important span attributes (the signature is one of the only "must have" pieces of the Apollo OTel trace ingestor, and I noticed in this PR we are now redacting it from snapshots).

garypen · 2024-07-12T11:25:29Z

This seems fine - you are more of an expert here wrt batching. It would be nice if we could have some of the tests not rely on batching and keep a tighter match on important span attributes (the signature is one of the only "must have" pieces of the Apollo OTel trace ingestor, and I noticed in this PR we are now redacting it from snapshots).

I know it's not ideal. I think we'd have to have a different version of the assert_report macro for batching vs non-batching. The non-batching wouldn't redact the additional details. That might be good enough.

I'll try it out later.

To try and strike a balance between flakiness and testing effectiveness, only redact additional information for batching tests.

garypen · 2024-07-12T12:53:07Z

I modified the macro to only perform the additional redaction if the test is a batch test. I think that's a nice compromise and keeps the tighter match that @timbotnik requested above.

These tests are too flaky. They never pass. Try to disable them for now.

… trace details (#5638)

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Update | Change | |---|---|---| | [apollographql/router](https://github.com/apollographql/router) | minor | `v1.51.0` -> `v1.52.0` | --- ### Release Notes <details> <summary>apollographql/router (apollographql/router)</summary> ### [`v1.52.0`](https://github.com/apollographql/router/releases/tag/v1.52.0) [Compare Source](https://github.com/apollographql/router/compare/v1.51.0-rc.0...v1.52.0-rc.0) #### 🚀 Features ##### Provide helm support for when router's health_check's default path is not being used([Issue #5652](https://github.com/apollographql/router/issues/5652)) When helm chart is defining the liveness and readiness check probes, if the router has been configured to use a non-default health_check path, use that rather than the default ( /health ) By [Jon Christiansen](https://github.com/theJC) in [https://github.com/apollographql/router/pull/5653](https://github.com/apollographql/router/pull/5653) ##### Support new span and metrics formats for entity caching ([PR #5625](https://github.com/apollographql/router/pull/5625)) Metrics of the router's entity cache have been converted to the latest format with support for custom telemetry. The following example configuration shows the the `cache` instrument, the `cache` selector in the subgraph service, and the `cache` attribute of a subgraph span: ```yaml telemetry: instrumentation: instruments: default_requirement_level: none cache: apollo.router.operations.entity.cache: attributes: entity.type: true subgraph.name: subgraph_name: true supergraph.operation.name: supergraph_operation_name: string subgraph: only_cache_hit_on_subgraph_products: type: counter value: cache: hit unit: hit description: counter of subgraph request cache hit on subgraph products condition: all: - eq: - subgraph_name: true - products - gt: - cache: hit - 0 attributes: subgraph.name: true supergraph.operation.name: supergraph_operation_name: string ``` To learn more, go to [Entity caching docs](https://www.apollographql.com/docs/router/configuration/entity-caching). By [@Geal](https://github.com/Geal) and [@bnjjj](https://github.com/bnjjj) in [https://github.com/apollographql/router/pull/5625](https://github.com/apollographql/router/pull/5625) ##### Helm: Support renaming key for retrieving APOLLO_KEY secret ([Issue #5661](https://github.com/apollographql/router/issues/5661)) A user of the router Helm chart can now rename the key used to retrieve the value of the secret key referenced by `APOLLO_KEY`. Previously, the router Helm chart hardcoded the key name to `managedFederationApiKey`. This didn't support users whose infrastructure required custom key names when getting secrets, such as Kubernetes users who need to use specific key names to access a `secretStore` or `externalSecret`. This change provides a user the ability to control the name of the key to use in retrieving that value. By [Jon Christiansen](https://github.com/theJC) in [https://github.com/apollographql/router/pull/5662](https://github.com/apollographql/router/pull/5662) #### 🐛 Fixes ##### Prevent Datadog timeout errors in logs ([Issue #2058](https://github.com/apollographql/router/issue/2058)) The router's Datadog exporter has been updated to reduce the frequency of logged errors related to connection pools. Previously, the connection pools used by the Datadog exporter frequently timed out, and each timeout logged an error like the following: 2024-07-19T15:28:22.970360Z ERROR OpenTelemetry trace error occurred: error sending request for url (http://127.0.0.1:8126/v0.5/traces): connection error: Connection reset by peer (os error 54) Now, the pool timeout for the Datadog exporter has been changed so that timeout errors happen much less frequently. By [@BrynCooke](https://github.com/BrynCooke) in [https://github.com/apollographql/router/pull/5692](https://github.com/apollographql/router/pull/5692) ##### Allow service version overrides ([PR #5689](https://github.com/apollographql/router/pull/5689)) The router now supports configuration of `service.version` via YAML file configuration. This enables users to produce custom versioned builds of the router. The following example overrides the version to be `1.0`: ```yaml telemetry: exporters: tracing: common: resource: service.version: 1.0 ``` By [@BrynCooke](https://github.com/BrynCooke) in [https://github.com/apollographql/router/pull/5689](https://github.com/apollographql/router/pull/5689) ##### Populate Datadog `span.kind` ([PR #5609](https://github.com/apollographql/router/pull/5609)) Because Datadog traces use `span.kind` to differentiate between different types of spans, the router now ensures that `span.kind` is correctly populated using the OpenTelemetry span kind, which has a 1-2-1 mapping to those set out in [dd-trace](https://github.com/DataDog/dd-trace-go/blob/main/ddtrace/ext/span_kind.go). By [@BrynCooke](https://github.com/BrynCooke) in [https://github.com/apollographql/router/pull/5609](https://github.com/apollographql/router/pull/5609) ##### Remove unnecessary internal metric events from traces and spans ([PR #5649](https://github.com/apollographql/router/pull/5649)) The router no longer includes some internal metric events in traces and spans that shouldn't have been included originally. By [@bnjjj](https://github.com/bnjjj) in [https://github.com/apollographql/router/pull/5649](https://github.com/apollographql/router/pull/5649) ##### Support Datadog span metrics ([PR #5609](https://github.com/apollographql/router/pull/5609)) When using the APM view in Datadog, the router now displays span metrics for top-level spans or spans with the `_dd.measured` flag set. The router sets the `_dd.measured` flag by default for the following spans: - `request` - `router` - `supergraph` - `subgraph` - `subgraph_request` - `http_request` - `query_planning` - `execution` - `query_parsing` To enable or disable span metrics for any span, configure `span_metrics` for the Datadog exporter: ```yaml telemetry: exporters: tracing: datadog: enabled: true span_metrics: ### Disable span metrics for supergraph supergraph: false ### Enable span metrics for my_custom_span my_custom_span: true ``` By [@BrynCooke](https://github.com/BrynCooke) in [https://github.com/apollographql/router/pull/5609](https://github.com/apollographql/router/pull/5609) and [https://github.com/apollographql/router/pull/5703](https://github.com/apollographql/router/pull/5703) ##### Use spawn_blocking for query parsing and validation ([PR #5235](https://github.com/apollographql/router/pull/5235)) To prevent its executor threads from blocking on large queries, the router now runs query parsing and validation in a Tokio blocking task. By [@xuorig](https://github.com/xuorig) in [https://github.com/apollographql/router/pull/5235](https://github.com/apollographql/router/pull/5235) #### 🛠 Maintenance ##### chore: Update rhai to latest release (1.19.0) ([PR #5655](https://github.com/apollographql/router/pull/5655)) In Rhai 1.18.0, there were changes to how exceptions within functions were created. For details see: https://github.com/rhaiscript/rhai/blob/7e0ac9d3f4da9c892ed35a211f67553a0b451218/CHANGELOG.md?plain=1#L12 We've modified how we handle errors raised by Rhai to comply with this change, which means error message output is affected. The change means that errors in functions will no longer document which function the error occurred in, for example: ```diff - "rhai execution error: 'Runtime error: I have raised an error (line 223, position 5)\nin call to function 'process_subgraph_response_string''" + "rhai execution error: 'Runtime error: I have raised an error (line 223, position 5)'" ``` Making this change allows us to keep up with the latest version (1.19.0) of Rhai. By [@garypen](https://github.com/garypen) in [https://github.com/apollographql/router/pull/5655](https://github.com/apollographql/router/pull/5655) ##### Add version in the entity cache hash ([PR #5701](https://github.com/apollographql/router/pull/5701)) The hashing algorithm of the router's entity cache has been updated to include the entity cache version. \[!IMPORTANT] If you have previously enabled [entity caching](https://www.apollographql.com/docs/router/configuration/entity-caching), you should expect additional cache regeneration costs when updating to this version of the router while the new hashing algorithm comes into service. By [@bnjjj](https://github.com/bnjjj) in [https://github.com/apollographql/router/pull/5701](https://github.com/apollographql/router/pull/5701) ##### Improve testing by avoiding cache effects and redacting tracing details ([PR #5638](https://github.com/apollographql/router/pull/5638)) We've had some problems with flaky tests and this PR addresses some of them. The router executes in parallel and concurrently. Many of our tests use snapshots to try and make assertions that functionality is continuing to work correctly. Unfortunately, concurrent/parallel execution and static snapshots don't co-operate very well. Results may appear in pseudo-random order (compared to snapshot expectations) and so tests become flaky and fail without obvious cause. The problem becomes particularly acute with features which are specifically designed for highly concurrent operation, such as batching. This set of changes addresses some of the router testing problems by: 1. Making items in a batch test different enough that caching effects are avoided. 2. Redacting various details so that sequencing is not as much of an issue in the otel traces tests. By [@garypen](https://github.com/garypen) in [https://github.com/apollographql/router/pull/5638](https://github.com/apollographql/router/pull/5638) #### 📚 Documentation ##### Update router naming conventions ([PR #5400](https://github.com/apollographql/router/pull/5400)) Renames our router product to distinguish between our non-commercial and commercial offerings. Instead of referring to the **Apollo Router**, we now refer to the following: - **Apollo Router Core** is Apollo’s free-and-open (ELv2 licensed) implementation of a routing runtime for supergraphs. - **GraphOS Router** is based on the Apollo Router Core and fully integrated with GraphOS. GraphOS Routers provide access to GraphOS’s commercial runtime features. By [@shorgi](https://github.com/shorgi) in [https://github.com/apollographql/router/pull/5400](https://github.com/apollographql/router/pull/5400) #### 🧪 Experimental ##### Enable Rust-based API schema implementation ([PR #5623](https://github.com/apollographql/router/pull/5623)) The router has transitioned to solely using a Rust-based API schema generation implementation. Previously, the router used a Javascript-based implementation. After testing for a few months, we've validated the improved performance and robustness of the new Rust-based implementation, so the router now only uses it. By [@goto-bus-stop](https://github.com/goto-bus-stop) in [https://github.com/apollographql/router/pull/5623](https://github.com/apollographql/router/pull/5623) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR was generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View the [repository job log](https://developer.mend.io/github/apollographql/rover).  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

Definitively modify the document to try and avoid races

c2b7d85

I'll explain better in later commits if the idea pans out. Just creating a branch for now.

garypen self-assigned this Jul 10, 2024

This comment has been minimized.

Sign in to view

garypen added 7 commits July 10, 2024 16:43

Make more changes in case it matters in other places.

6b91378

Make test batch data have a different operation name.

re-organise some code to make make_fake_batch public

2220a86

It's useful for testing and pointless to have three different versions of the same function.

Don't re-introduce hyper...

58f3a55

:)

Change locking strategy in the various otel/apollo exporter tests

031bc19

The global lock should be acquired when the test starts.

Some changes to try and help otel traces tests to pass.

8b6179c

Not entirely sure why I had to change the apollo_reports.rs batch metrics test, but I'll figure that out later.

improve the comments for make_fake_batch

dfc4483

test_batch_stats seems to vary between 1 and 2

9f683c4

I don't know why, but I'm relaxing this test for now.

garypen changed the title ~~Definitively modify the document to try and avoid races~~ Introduce make_fake_batch() to avoid racy caches and redact more otel trace details Jul 11, 2024

garypen requested review from timbotnik and bnjjj July 11, 2024 14:59

Merge branch 'dev' into garypen/modify-batch-for-tracing

1a906be

garypen marked this pull request as ready for review July 11, 2024 14:59

garypen requested review from a team as code owners July 11, 2024 14:59

garypen added 2 commits July 11, 2024 16:09

add a changeset

34f5de3

merge with dev seems to have perturbed the snapshots

bc07b42

Maybe the backout of the spawn_blocking() for document parsing?

garypen mentioned this pull request Jul 12, 2024

Reintroduce "use spawn_blocking for parsing" #5644

Merged

garypen requested a review from IvanGoncharov July 12, 2024 09:27

timbotnik approved these changes Jul 12, 2024

View reviewed changes

IvanGoncharov approved these changes Jul 12, 2024

View reviewed changes

only redact additional information for batch tests

c668fe8

To try and strike a balance between flakiness and testing effectiveness, only redact additional information for batching tests.

bnjjj approved these changes Jul 12, 2024

View reviewed changes

garypen enabled auto-merge (squash) July 12, 2024 13:11

try to disable the failing entity caching tests

76f0ffe

These tests are too flaky. They never pass. Try to disable them for now.

garypen merged commit 31c1bd9 into dev Jul 15, 2024
13 of 14 checks passed

garypen deleted the garypen/modify-batch-for-tracing branch July 15, 2024 09:44

garypen added a commit that referenced this pull request Jul 15, 2024

Introduce make_fake_batch() to avoid racy caches and redact more otel…

c0582b3

… trace details (#5638)

abernix pushed a commit that referenced this pull request Jul 16, 2024

Introduce make_fake_batch() to avoid racy caches and redact more otel…

76ffdce

… trace details (#5638)

abernix pushed a commit that referenced this pull request Jul 16, 2024

Introduce make_fake_batch() to avoid racy caches and redact more otel…

99dedb0

… trace details (#5638)

bnjjj mentioned this pull request Jul 30, 2024

prep release: v1.52.0 #5744

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce make_fake_batch() to avoid racy caches and redact more otel trace details #5638

Introduce make_fake_batch() to avoid racy caches and redact more otel trace details #5638

garypen commented Jul 10, 2024 •

edited by abernix

Loading

This comment has been minimized.

router-perf bot commented Jul 10, 2024

timbotnik left a comment

garypen commented Jul 12, 2024

garypen commented Jul 12, 2024

Introduce make_fake_batch() to avoid racy caches and redact more otel trace details #5638

Introduce make_fake_batch() to avoid racy caches and redact more otel trace details #5638

Conversation

garypen commented Jul 10, 2024 • edited by abernix Loading

This comment has been minimized.

router-perf bot commented Jul 10, 2024

timbotnik left a comment

Choose a reason for hiding this comment

garypen commented Jul 12, 2024

garypen commented Jul 12, 2024

garypen commented Jul 10, 2024 •

edited by abernix

Loading