feat: fetch sourcemaps from Elasticsearch #9722

kruskall · 2022-12-02T02:42:15Z

Motivation/summary

Add a new caching metadata fetcher that fetch sourcemap metadata from ES and forward the request to the backend fetcher if it exists. The backend fetcher is an ES fetcher wrapped in a LRU cache to cache the sourcemap body.
Sourcemap metadata are periodically synced with ES by the caching metadata fetcher.
Keep fleet and kibana fetcher until the UI team has updated the sourcemap uploading flow.
Update ES sourcemap response to include sourcemap metadata.

Checklist

Update CHANGELOG.asciidoc
Update package changelog.yml (only if changes to apmpackage have been made)
Documentation has been updated

For functional changes, consider:

Is it observable through the addition of either logging or metrics?
Is its use being published in telemetry to enable product improvement?
Have system tests been added to avoid regression?

How to test these changes

Related issues

Closes #9643

Add a new caching metadata fetcher that fetch sourcemap metadata from ES and forward the request to the backend fetcher if it exists. The backend fetcher is an ES fetcher wrapped in a LRU cache to cache the sourcemap body. Sourcemap metadata are periodically synced with ES by the caching metadata fetcher. Keep fleet and kibana fetcher until the UI team has updated the sourcemap uploading flow. Update ES sourcemap response to include sourcemap metadata.

mergify · 2022-12-02T02:42:49Z

This pull request does not have a backport label. Could you fix it @kruskall? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.x is the label to automatically backport to the 7.x branch.
backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

apmmachine · 2022-12-02T02:49:08Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2023-02-07T16:09:40.824+0000
Duration: 8 min 27 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate and publish the docker images.
/test windows : Build & tests on Windows.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

internal/sourcemap/elasticsearch.go

internal/beater/beater.go

internal/sourcemap/metadata.go

sourcemaps are compressed with zlib and encoded as base64, keep that in mind when retrieving the content from ES. The document structure has been updated to reflect the current latest commit in the kibana PR. Use the _id to fetch the actual sourcemap since the service fields are doc-value-only and cannot be used for filtering. Log an error whenever sync fails.

the new kibana PR is upload sourcemaps to ES and migrating old one to it. We can restore the sourcemap test to use the standard method instead of upload the sourcemap manually to ES.

The sourcemap cache is now refreshed whenever the content hashcode changes or a sourcemap is removed from ES. Update the sourcemap cache to use a LRU cache.

kruskall · 2022-12-13T01:41:05Z

How to test:

checkout [APM] Migrate fleet sourcemaps to APM managed index kibana#147208 and build a kibana docker image with it
Replace the current kibana docker image with the custom one at https://github.com/elastic/apm-server/blob/main/docker-compose.yml#L34
Bump elasticsearch docker image at https://github.com/elastic/apm-server/blob/main/docker-compose.yml#L4 so that it includes Update kibana_system privileges with access to .apm-source-map elasticsearch#92160
Change the ES user to admin at https://github.com/elastic/apm-server/blob/main/systemtest/apmservertest/config.go#L34
Validate sourcemap fetching is working with GOTESTFLAGS="-run=TestSourcemapElasticsearch -v" make system-test

mergify · 2022-12-13T06:26:48Z

This pull request is now in conflicts. Could you fix it @kruskall? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feat/es-sourcemap upstream/feat/es-sourcemap
git merge upstream/main
git push upstream feat/es-sourcemap

apmmachine · 2022-12-19T06:07:43Z

📚 Go benchmark report

Diff with the main branch

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/internal/agentcfg
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                                  │ build/main/bench.out │             bench.out              │
                                  │        sec/op        │    sec/op     vs base              │
geomean                                     62.16n         62.57n        +0.66%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ build/main/bench.out │              bench.out              │
                                  │         B/op         │    B/op      vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                  │ build/main/bench.out │              bench.out              │
                                  │      allocs/op       │  allocs/op   vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/beater/request
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op     vs base               │
ContextReset/Forwarded_ipv4-12                         811.1n ± ∞ ¹   647.6n ± ∞ ¹  -20.16% (p=0.016 n=5)
geomean                                                938.8n         894.3n         -4.74%
¹ need >= 6 samples for confidence interval at level 0.95

                                             │ build/main/bench.out │               bench.out               │
                                             │         B/op         │     B/op       vs base                │
geomean                                                           ³                  +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                             │ build/main/bench.out │              bench.out              │
                                             │      allocs/op       │  allocs/op   vs base                │
geomean                                                           ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/publish
             │ build/main/bench.out │          bench.out           │
             │        sec/op        │   sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out            │
             │         B/op         │     B/op       vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out           │
             │      allocs/op       │  allocs/op    vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics
                 │ build/main/bench.out │           bench.out           │
                 │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                 │ build/main/bench.out │            bench.out             │
                 │         B/op         │     B/op       vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                 │ build/main/bench.out │           bench.out            │
                 │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics
                        │ build/main/bench.out │           bench.out           │
                        │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                        │ build/main/bench.out │           bench.out            │
                        │         B/op         │    B/op      vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                        │ build/main/bench.out │           bench.out            │
                        │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
               │ build/main/bench.out │             bench.out              │
               │        sec/op        │    sec/op     vs base              │
geomean                  580.2n         609.8n        +5.09%
¹ need >= 6 samples for confidence interval at level 0.95

               │ build/main/bench.out │               bench.out               │
               │         B/op         │     B/op       vs base                │
geomean                             ³                  +0.43%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

               │ build/main/bench.out │              bench.out              │
               │      allocs/op       │  allocs/op   vs base                │
geomean                             ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage
                                            │ build/main/bench.out │             bench.out              │
                                            │        sec/op        │    sec/op     vs base              │
ReadEvents/nop_codec_big_tx/399_events-12             476.6µ ± ∞ ¹   448.1µ ± ∞ ¹  -5.97% (p=0.016 n=5)
IsTraceSampled/sampled-12                             68.51n ± ∞ ¹   69.90n ± ∞ ¹  +2.03% (p=0.032 n=5)
IsTraceSampled/unknown-12                             374.1n ± ∞ ¹   386.1n ± ∞ ¹  +3.21% (p=0.008 n=5)
geomean                                               29.02µ         29.28µ        +0.92%
¹ need >= 6 samples for confidence interval at level 0.95

                                            │ build/main/bench.out │               bench.out                │
                                            │         B/op         │      B/op       vs base                │
ReadEvents/nop_codec_big_tx/1000_events-12           2.079Mi ± ∞ ¹    2.090Mi ± ∞ ¹  +0.51% (p=0.032 n=5)
geomean                                              31.31Ki          31.50Ki        +0.61%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                            │ build/main/bench.out │              bench.out               │
                                            │      allocs/op       │  allocs/op    vs base                │
geomean                                                144.7          144.7        +0.00%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

simitt

Generally looks good, left some comment related to updating the cache. Please also update the description as the fallback logic is now removed.

We should also add metrics for better insights into the sourcemap handling: at least track the number of requests to ES for sourcemaps, track sucess of metadata requests.

internal/sourcemap/elasticsearch.go

simitt · 2022-12-20T10:19:21Z

internal/sourcemap/metadata.go

+	if len(s.set) == 0 {
+		// If cache is empty and we're trying to fetch a sourcemap
+		// make sure the initial cache population has ended.
+		s.once.Do(func() {


What happens if ES is not yet reachable at this point but later?

Looking back at this, I don't think it's a good idea, this will basically block until the cache is populated since we cannot retrieve sourcemaps until then with the current implementation.

I'm not sure what's the best way to proceed here. Returning nothing might be misleading. Should we bypass the cache and send the request to ES if the cache is not populated yet ?

This would be if the metadata cache is not ready. If a sourcemap is part of the metadata cache but the body is not yet cached, it will be fetched from ES on request.
So let's take a look at the metadata cache. What are the scenarios for when the cache might not be ready?

ES is not reachable, due to a restart, misconfiguration, etc.: in this case, trying to fetch directly from ES would not succeed either.

Fetching metadata from ES is ongoing. The fetching and cache update for metadata can be expected to be fairly quick, even when many sourcemaps are stored. Bypassing the metadata cache would work here, but seems like a hack. I'd rather either block the request processing (and collect logs & metrics so we get an understanding how long this blocking period is) or treat it as no sourcemap is available (and again, collect logs & metrics for the scenario). The third option would be to block the APM Server listeners until everything is readily set up, but this would definitely lead to apm event loss, which is definitely worse.

ES is not reachable, due to a restart, misconfiguration, etc.: in this case, trying to fetch directly from ES would not succeed either.

I don't think the metadata cache should be concerned If ES is not ready/reachable/available. A failure in that scenario would be "working as intended" imo because the metadata cache is proxying what it got from the ES fetcher (error, timeout, etc.).

Fetching metadata from ES is ongoing. The fetching and cache update for metadata can be expected to be fairly quick, even when many sourcemaps are stored. Bypassing the metadata cache would work here, but seems like a hack. I'd rather either block the request processing (and collect logs & metrics so we get an understanding how long this blocking period is) or treat it as no sourcemap is available (and again, collect logs & metrics for the scenario). The third option would be to block the APM Server listeners until everything is readily set up, but this would definitely lead to apm event loss, which is definitely worse.

I was thinking of "if init is not completed, forward the request to the backend fetcher". In theory the first request for each sourcemap would be forwarded to ES anyway so this wouldn't create additional work unless we get a request for a missing sourcemap on ES during init. The reasoning behind this was that when the metadata cache is not initialized:

if the sourcemap exists on ES: the metadata fetcher will send the request to the es fetcher anyway. No additional time/work/blocking: this would be the same as the metadata cache receiving a request for the first time.

if the sourcemap does not exist on ES: the metadata will send the request to the ES fetcher, creating additional work. The additional processing time is offset by the fact we would have had to wait for the init phase to finish if we had blocked so my assumption was that there wasn't gonna be a time difference between the two in the end.

WDYT ?

So this is really used for triggering the first fetch, in case it hasn't yet been initiated?

The logic has been reworked and now we forward the request to ES if the metadata cache is not ready.

This does not create additional costs since the response is cached in the lru cache.

internal/sourcemap/metadata.go

simitt · 2023-02-07T09:51:00Z

IMO the blocking behaviour is fine since it's bounded to a ping + search request, which is roughly the same as the old ES implementation. @simitt do you still have any concerns about that?

The existing behavior pre 8.7 was already blocking RUM ingest requests until sourcemaps were fetched, so while I don't think it's ideal, I have no concerns on moving forward with blocking.

simitt

Changes look good now!

The sourcemap_fetcher_tests only test the happy path, some checks for error cases would also be good, but test coverage looks good enough to be merged now.

Thanks for all the effort that went into this!

internal/beater/config/config_test.go

simitt · 2023-02-07T12:39:17Z

testing/docker/elasticsearch/roles.yml

@@ -3,6 +3,8 @@ apm_server:
  indices:
    - names: ['apm-*', 'traces-apm*', 'logs-apm*', 'metrics-apm*']
      privileges: ['write','create_index','manage','manage_ilm']
+    - names: ['.apm-source-map']


make the tests use admin credentials for sourcemap only?

I wouldn't suggest this, with admin credentials it is easy to miss breaking changes

IMO it's fine to define the privilege here, in alignment with privilege definition for the apm datastreams.

internal/sourcemap/metadata_fetcher.go

axw · 2023-02-22T06:41:00Z

First test with Fleet-managed APM Server is successful:

Created an Elastic Cloud deployment
Send a RUM error, confirmed that its stack trace was not source mapped (naturally, there are no source maps uploaded yet)
Uploaded a source map
Deleted existing error documents, send another RUM error; confirmed that it was source mapped

I will run some more tests running without Fleet.

axw · 2023-02-23T02:47:42Z

I tried running Elastic Agent locally in Docker, communicating with Elasticsearch in an Elastic Cloud us-west2 deployment. Errors were indexed with:

error.exception.stacktrace.sourcemap.error: failed to ping es cluster: fetcher unavailable: context deadline exceeded

Looking at the logs, this is coming from the metadata fetcher, which has a 1 second ping timeout.

Opened #10338

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 2, 2022

lint: fix linter issues

3d61fc1

kruskall commented Dec 2, 2022

View reviewed changes

internal/sourcemap/elasticsearch.go Outdated Show resolved Hide resolved

internal/sourcemap/elasticsearch.go Outdated Show resolved Hide resolved

internal/beater/beater.go Outdated Show resolved Hide resolved

carsonip reviewed Dec 2, 2022

View reviewed changes

internal/sourcemap/metadata.go Outdated Show resolved Hide resolved

kruskall added 9 commits December 4, 2022 14:01

lint: fix staticcheck issue

d18afa0

test: udpdate sourcemap test to use the kibana sourcemap API

27883df

the new kibana PR is upload sourcemaps to ES and migrating old one to it. We can restore the sourcemap test to use the standard method instead of upload the sourcemap manually to ES.

feat: implement cache invalidation in sourcemap cache

4d8c011

The sourcemap cache is now refreshed whenever the content hashcode changes or a sourcemap is removed from ES. Update the sourcemap cache to use a LRU cache.

feat: remove fleet and kibana sourcemap fetchers

f04473e

refactor: code cleanup

7a64c06

fix: do not try to decode the body if the sourcemap was not found

c96587d

test: update tests for sourcemap changes

855cb02

lint: revert update task changes

cd01a10

kruskall added 4 commits December 19, 2022 06:41

Merge remote-tracking branch 'upstream/main' into feat/es-sourcemap

9890493

build: remove unused imports

9518384

Merge remote-tracking branch 'upstream/main' into feat/es-sourcemap

761802b

lint: fix linter issues

ac529cd

feat: add support for relative path

a9957ea

simitt reviewed Dec 20, 2022

View reviewed changes

kruskall added 5 commits December 23, 2022 00:29

Merge branch 'main' into feat/es-sourcemap

260d3d5

feat: add support for injected api key

995e254

lint: fix formatting issue

cf86bb3

Merge branch 'main' into feat/es-sourcemap

813cdfb

refactor: use meaningful variable names

6f0fb18

kruskall added 3 commits February 7, 2023 03:18

docs: remove stale comment

c24db8d

fix: close ping request response body

08ae602

Merge branch 'main' into feat/es-sourcemap

5b0dfb6

kruskall added 4 commits February 7, 2023 12:23

fix: do not send empty invalidations

2e814f4

test: add metadata fetcher tests

53c7d97

test: add sourcemap fetcher tests

8028c46

Merge branch 'main' into feat/es-sourcemap

8c61b76

kruskall requested a review from simitt February 7, 2023 14:35

simitt approved these changes Feb 7, 2023

View reviewed changes

kruskall added 2 commits February 7, 2023 16:13

docs: remove todo

4de127a

feat: remove cache expiration option

7b75446

kruskall requested a review from simitt February 7, 2023 15:18

kruskall added 2 commits February 7, 2023 17:08

Merge branch 'main' into feat/es-sourcemap

a182dd2

lint: fix linter issues

1d74a28

kruskall enabled auto-merge (squash) February 7, 2023 16:09

kruskall merged commit 4f59fa4 into elastic:main Feb 7, 2023

kruskall deleted the feat/es-sourcemap branch February 7, 2023 17:13

axw mentioned this pull request Feb 8, 2023

Update changelog for 8.7 release #10186

Merged

axw added test-plan v8.7.0 labels Feb 20, 2023

axw self-assigned this Feb 22, 2023

axw added the test-plan-regression label Feb 23, 2023

axw mentioned this pull request Feb 23, 2023

8.7 Manual Test Plan #10200

Closed

11 tasks

axw mentioned this pull request Mar 10, 2023

fix: remove hardcoded timeout and don't block when fetching sourcemaps #10367

Merged

3 tasks

marclop added test-plan-ok and removed test-plan-regression labels Mar 21, 2023

carsonip mentioned this pull request Sep 27, 2023

Outdated docs on privileges to use RUM sourcemapping #11722

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: fetch sourcemaps from Elasticsearch #9722

feat: fetch sourcemaps from Elasticsearch #9722

kruskall commented Dec 2, 2022

mergify bot commented Dec 2, 2022

apmmachine commented Dec 2, 2022 •

edited

Loading

Build stats

kruskall commented Dec 13, 2022 •

edited

Loading

mergify bot commented Dec 13, 2022

apmmachine commented Dec 19, 2022 •

edited

Loading

simitt left a comment

simitt Dec 20, 2022

kruskall Dec 29, 2022

simitt Dec 29, 2022 •

edited by kruskall

Loading

kruskall Dec 29, 2022 •

edited

Loading

simitt Jan 17, 2023

kruskall Jan 23, 2023

simitt commented Feb 7, 2023

simitt left a comment

simitt Feb 7, 2023

axw commented Feb 22, 2023

axw commented Feb 23, 2023

feat: fetch sourcemaps from Elasticsearch #9722

feat: fetch sourcemaps from Elasticsearch #9722

Conversation

kruskall commented Dec 2, 2022

Motivation/summary

Checklist

How to test these changes

Related issues

mergify bot commented Dec 2, 2022

apmmachine commented Dec 2, 2022 • edited Loading

💚 Build Succeeded

Build stats

🤖 GitHub comments

kruskall commented Dec 13, 2022 • edited Loading

mergify bot commented Dec 13, 2022

apmmachine commented Dec 19, 2022 • edited Loading

📚 Go benchmark report

simitt left a comment

Choose a reason for hiding this comment

simitt Dec 20, 2022

Choose a reason for hiding this comment

kruskall Dec 29, 2022

Choose a reason for hiding this comment

simitt Dec 29, 2022 • edited by kruskall Loading

Choose a reason for hiding this comment

kruskall Dec 29, 2022 • edited Loading

Choose a reason for hiding this comment

simitt Jan 17, 2023

Choose a reason for hiding this comment

kruskall Jan 23, 2023

Choose a reason for hiding this comment

simitt commented Feb 7, 2023

simitt left a comment

Choose a reason for hiding this comment

simitt Feb 7, 2023

Choose a reason for hiding this comment

axw commented Feb 22, 2023

axw commented Feb 23, 2023

apmmachine commented Dec 2, 2022 •

edited

Loading

kruskall commented Dec 13, 2022 •

edited

Loading

apmmachine commented Dec 19, 2022 •

edited

Loading

simitt Dec 29, 2022 •

edited by kruskall

Loading

kruskall Dec 29, 2022 •

edited

Loading