Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: fetch sourcemaps from Elasticsearch #9722

Merged
merged 152 commits into from
Feb 7, 2023

Conversation

kruskall
Copy link
Member

@kruskall kruskall commented Dec 2, 2022

Motivation/summary

Add a new caching metadata fetcher that fetch sourcemap metadata from ES and forward the request to the backend fetcher if it exists. The backend fetcher is an ES fetcher wrapped in a LRU cache to cache the sourcemap body.
Sourcemap metadata are periodically synced with ES by the caching metadata fetcher.
Keep fleet and kibana fetcher until the UI team has updated the sourcemap uploading flow.
Update ES sourcemap response to include sourcemap metadata.

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Related issues

Closes #9643

Add a new caching metadata fetcher that fetch sourcemap metadata from ES
and forward the request to the backend fetcher if it exists.
The backend fetcher is an ES fetcher wrapped in a LRU cache to cache the
sourcemap body.
Sourcemap metadata are periodically synced with ES by the caching metadata
fetcher.
Keep fleet and kibana fetcher until the UI team has updated the sourcemap
uploading flow.
Update ES sourcemap response to include sourcemap metadata.
@mergify
Copy link
Contributor

mergify bot commented Dec 2, 2022

This pull request does not have a backport label. Could you fix it @kruskall? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.x is the label to automatically backport to the 7.x branch.
  • backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 2, 2022
@apmmachine
Copy link
Contributor

apmmachine commented Dec 2, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-02-07T16:09:40.824+0000

  • Duration: 8 min 27 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

internal/sourcemap/elasticsearch.go Outdated Show resolved Hide resolved
internal/sourcemap/elasticsearch.go Outdated Show resolved Hide resolved
internal/beater/beater.go Outdated Show resolved Hide resolved
sourcemaps are compressed with zlib and encoded as base64, keep that
in mind when retrieving the content from ES.
The document structure has been updated to reflect the current latest
commit in the kibana PR.
Use the _id to fetch the actual sourcemap since the service fields are
doc-value-only and cannot be used for filtering.
Log an error whenever sync fails.
the new kibana PR is upload sourcemaps to ES and migrating
old one to it.
We can restore the sourcemap test to use the standard method
instead of upload the sourcemap manually to ES.
The sourcemap cache is now refreshed whenever the content hashcode
changes or a sourcemap is removed from ES.
Update the sourcemap cache to use a LRU cache.
@kruskall
Copy link
Member Author

kruskall commented Dec 13, 2022

How to test:

@mergify
Copy link
Contributor

mergify bot commented Dec 13, 2022

This pull request is now in conflicts. Could you fix it @kruskall? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feat/es-sourcemap upstream/feat/es-sourcemap
git merge upstream/main
git push upstream feat/es-sourcemap

@apmmachine
Copy link
Contributor

apmmachine commented Dec 19, 2022

📚 Go benchmark report

Diff with the main branch

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/internal/agentcfg
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                                  │ build/main/bench.out │             bench.out              │
                                  │        sec/op        │    sec/op     vs base              │
geomean                                     62.16n         62.57n        +0.66%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ build/main/bench.out │              bench.out              │
                                  │         B/op         │    B/op      vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                  │ build/main/bench.out │              bench.out              │
                                  │      allocs/op       │  allocs/op   vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/beater/request
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op     vs base               │
ContextReset/Forwarded_ipv4-12                         811.1n ± ∞ ¹   647.6n ± ∞ ¹  -20.16% (p=0.016 n=5)
geomean                                                938.8n         894.3n         -4.74%
¹ need >= 6 samples for confidence interval at level 0.95

                                             │ build/main/bench.out │               bench.out               │
                                             │         B/op         │     B/op       vs base                │
geomean                                                           ³                  +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                             │ build/main/bench.out │              bench.out              │
                                             │      allocs/op       │  allocs/op   vs base                │
geomean                                                           ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/publish
             │ build/main/bench.out │          bench.out           │
             │        sec/op        │   sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out            │
             │         B/op         │     B/op       vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out           │
             │      allocs/op       │  allocs/op    vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics
                 │ build/main/bench.out │           bench.out           │
                 │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                 │ build/main/bench.out │            bench.out             │
                 │         B/op         │     B/op       vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                 │ build/main/bench.out │           bench.out            │
                 │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics
                        │ build/main/bench.out │           bench.out           │
                        │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                        │ build/main/bench.out │           bench.out            │
                        │         B/op         │    B/op      vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                        │ build/main/bench.out │           bench.out            │
                        │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
               │ build/main/bench.out │             bench.out              │
               │        sec/op        │    sec/op     vs base              │
geomean                  580.2n         609.8n        +5.09%
¹ need >= 6 samples for confidence interval at level 0.95

               │ build/main/bench.out │               bench.out               │
               │         B/op         │     B/op       vs base                │
geomean                             ³                  +0.43%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

               │ build/main/bench.out │              bench.out              │
               │      allocs/op       │  allocs/op   vs base                │
geomean                             ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage
                                            │ build/main/bench.out │             bench.out              │
                                            │        sec/op        │    sec/op     vs base              │
ReadEvents/nop_codec_big_tx/399_events-12             476.6µ ± ∞ ¹   448.1µ ± ∞ ¹  -5.97% (p=0.016 n=5)
IsTraceSampled/sampled-12                             68.51n ± ∞ ¹   69.90n ± ∞ ¹  +2.03% (p=0.032 n=5)
IsTraceSampled/unknown-12                             374.1n ± ∞ ¹   386.1n ± ∞ ¹  +3.21% (p=0.008 n=5)
geomean                                               29.02µ         29.28µ        +0.92%
¹ need >= 6 samples for confidence interval at level 0.95

                                            │ build/main/bench.out │               bench.out                │
                                            │         B/op         │      B/op       vs base                │
ReadEvents/nop_codec_big_tx/1000_events-12           2.079Mi ± ∞ ¹    2.090Mi ± ∞ ¹  +0.51% (p=0.032 n=5)
geomean                                              31.31Ki          31.50Ki        +0.61%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                            │ build/main/bench.out │              bench.out               │
                                            │      allocs/op       │  allocs/op    vs base                │
geomean                                                144.7          144.7        +0.00%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, left some comment related to updating the cache. Please also update the description as the fallback logic is now removed.

We should also add metrics for better insights into the sourcemap handling: at least track the number of requests to ES for sourcemaps, track sucess of metadata requests.

internal/sourcemap/elasticsearch.go Outdated Show resolved Hide resolved
if len(s.set) == 0 {
// If cache is empty and we're trying to fetch a sourcemap
// make sure the initial cache population has ended.
s.once.Do(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if ES is not yet reachable at this point but later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back at this, I don't think it's a good idea, this will basically block until the cache is populated since we cannot retrieve sourcemaps until then with the current implementation.

I'm not sure what's the best way to proceed here. Returning nothing might be misleading. Should we bypass the cache and send the request to ES if the cache is not populated yet ?

Copy link
Contributor

@simitt simitt Dec 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be if the metadata cache is not ready. If a sourcemap is part of the metadata cache but the body is not yet cached, it will be fetched from ES on request.
So let's take a look at the metadata cache. What are the scenarios for when the cache might not be ready?

  • ES is not reachable, due to a restart, misconfiguration, etc.: in this case, trying to fetch directly from ES would not succeed either.
  • Fetching metadata from ES is ongoing. The fetching and cache update for metadata can be expected to be fairly quick, even when many sourcemaps are stored. Bypassing the metadata cache would work here, but seems like a hack. I'd rather either block the request processing (and collect logs & metrics so we get an understanding how long this blocking period is) or treat it as no sourcemap is available (and again, collect logs & metrics for the scenario). The third option would be to block the APM Server listeners until everything is readily set up, but this would definitely lead to apm event loss, which is definitely worse.

Copy link
Member Author

@kruskall kruskall Dec 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ES is not reachable, due to a restart, misconfiguration, etc.: in this case, trying to fetch directly from ES would not succeed either.

I don't think the metadata cache should be concerned If ES is not ready/reachable/available. A failure in that scenario would be "working as intended" imo because the metadata cache is proxying what it got from the ES fetcher (error, timeout, etc.).

Fetching metadata from ES is ongoing. The fetching and cache update for metadata can be expected to be fairly quick, even when many sourcemaps are stored. Bypassing the metadata cache would work here, but seems like a hack. I'd rather either block the request processing (and collect logs & metrics so we get an understanding how long this blocking period is) or treat it as no sourcemap is available (and again, collect logs & metrics for the scenario). The third option would be to block the APM Server listeners until everything is readily set up, but this would definitely lead to apm event loss, which is definitely worse.

I was thinking of "if init is not completed, forward the request to the backend fetcher". In theory the first request for each sourcemap would be forwarded to ES anyway so this wouldn't create additional work unless we get a request for a missing sourcemap on ES during init. The reasoning behind this was that when the metadata cache is not initialized:

  • if the sourcemap exists on ES: the metadata fetcher will send the request to the es fetcher anyway. No additional time/work/blocking: this would be the same as the metadata cache receiving a request for the first time.
  • if the sourcemap does not exist on ES: the metadata will send the request to the ES fetcher, creating additional work. The additional processing time is offset by the fact we would have had to wait for the init phase to finish if we had blocked so my assumption was that there wasn't gonna be a time difference between the two in the end.

WDYT ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is really used for triggering the first fetch, in case it hasn't yet been initiated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic has been reworked and now we forward the request to ES if the metadata cache is not ready.

This does not create additional costs since the response is cached in the lru cache.

internal/sourcemap/metadata.go Outdated Show resolved Hide resolved
@simitt
Copy link
Contributor

simitt commented Feb 7, 2023

IMO the blocking behaviour is fine since it's bounded to a ping + search request, which is roughly the same as the old ES implementation. @simitt do you still have any concerns about that?

The existing behavior pre 8.7 was already blocking RUM ingest requests until sourcemaps were fetched, so while I don't think it's ideal, I have no concerns on moving forward with blocking.

@kruskall kruskall requested a review from simitt February 7, 2023 14:35
Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good now!

The sourcemap_fetcher_tests only test the happy path, some checks for error cases would also be good, but test coverage looks good enough to be merged now.

Thanks for all the effort that went into this!

internal/beater/config/config_test.go Outdated Show resolved Hide resolved
@@ -3,6 +3,8 @@ apm_server:
indices:
- names: ['apm-*', 'traces-apm*', 'logs-apm*', 'metrics-apm*']
privileges: ['write','create_index','manage','manage_ilm']
- names: ['.apm-source-map']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the tests use admin credentials for sourcemap only?

I wouldn't suggest this, with admin credentials it is easy to miss breaking changes

IMO it's fine to define the privilege here, in alignment with privilege definition for the apm datastreams.

internal/sourcemap/metadata_fetcher.go Outdated Show resolved Hide resolved
@kruskall kruskall requested a review from simitt February 7, 2023 15:18
@kruskall kruskall enabled auto-merge (squash) February 7, 2023 16:09
@kruskall kruskall merged commit 4f59fa4 into elastic:main Feb 7, 2023
@kruskall kruskall deleted the feat/es-sourcemap branch February 7, 2023 17:13
@axw axw self-assigned this Feb 22, 2023
@axw
Copy link
Member

axw commented Feb 22, 2023

First test with Fleet-managed APM Server is successful:

  1. Created an Elastic Cloud deployment
  2. Send a RUM error, confirmed that its stack trace was not source mapped (naturally, there are no source maps uploaded yet)
  3. Uploaded a source map
  4. Deleted existing error documents, send another RUM error; confirmed that it was source mapped

I will run some more tests running without Fleet.

@axw
Copy link
Member

axw commented Feb 23, 2023

I tried running Elastic Agent locally in Docker, communicating with Elasticsearch in an Elastic Cloud us-west2 deployment. Errors were indexed with:

error.exception.stacktrace.sourcemap.error: failed to ping es cluster: fetcher unavailable: context deadline exceeded

Looking at the logs, this is coming from the metadata fetcher, which has a 1 second ping timeout.

Opened #10338

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify test-plan test-plan-ok v8.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fetch sourcemaps from Elasticsearch
7 participants