9.0 test plan #15569

Open
7 tasks done
inge4pres opened this issue Feb 5, 2025 · 6 comments
@inge4pres (Contributor) commented Feb 5, 2025

Manual Test Plan

List of changes: v8.18.0...v9.0.0 (neither tag exists yet)

Smoke Testing ESS setup

Thanks to #8303, further smoke tests now run automatically on ESS.
Consider extending the smoke tests to cover additional test cases we'd like to include.

go-docappender library

No changes, same dependency version used

apm-data library

No changes, same dependency version used

Test cases from the GitHub board

Add yourself as assignee on the PR before you start testing.

apm-server 9.0.0 test-plan

Tasks

Regressions

@endorama (Member) commented Feb 11, 2025

I tested #15094 (and related #15360).

Test scenario:

  • created a new deployment with 9.0.0-beta1 in Cloud
  • enabled Logs and Metrics collection to the same deployment
  • ran apmsoak with the apm-server scenario
  • checked Stack Monitoring UI
  • checked .monitoring-beats-8-mb data stream

Stack Monitoring was displaying APM metrics, and the beats_stats.metrics.apm-server.* fields were visible in the monitoring data stream.
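
For anyone repeating this check, a minimal way to query the monitoring data stream directly is sketched below; the field path is the one named above, while $ES_URL and $ES_API_KEY are placeholders, not values from this test.

# Hedged sketch: fetch the newest monitoring document and print the
# beats_stats.metrics.apm-server.* section. $ES_URL and $ES_API_KEY are
# placeholders for the deployment endpoint and a read-capable API key.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  "$ES_URL/.monitoring-beats-8-mb/_search?size=1&sort=@timestamp:desc" \
  | jq '.hits.hits[0]._source.beats_stats.metrics["apm-server"]'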

Pictures: (screenshots attached)

@inge4pres (Contributor, Author) commented:

Tested #15211 with otelgen, adding a RecordError() call to all spans.
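
As a quick way to verify the result of such a run, the sketch below counts the APM error documents produced by the RecordError() calls; the logs-apm.error-* data stream pattern is an assumption about a default setup, and $ES_URL / $ES_API_KEY are placeholders.

# Hedged sketch: count APM error documents after the otelgen run.
# Data stream pattern and credentials are assumptions, not from this issue.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  "$ES_URL/logs-apm.error-*/_count" | jq .count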

(screenshot attached)

@raultorrecilla commented:

#14921 tested in #14921 (comment)

@simitt (Contributor) commented Feb 14, 2025

Tested the upgrade scenarios, details in upgrade scenario testing

One blocker issue was found related to the Cloud UI (and potentially API), which is tracked in https://elasticco.atlassian.net/browse/CP-10318.

@1pkg (Member) commented Feb 14, 2025

I validated the changes in #15524 with a standalone 9.0 APM Server build.

I ran 3 scenarios, with the following TBS settings:

  sampling.tail:
    enabled: true
    interval: 1m
    policies:
      - sample_rate: .5
    discard_on_write_failure: true

1. Disk capacity at 75%, fresh APM Server deployment; load generated with continuous apmbench.

As expected, before the 80% disk threshold was reached, TBS worked in normal mode.
When the threshold was reached, the following warning logs appeared and excess incoming traces were discarded.

{"log.level":"warn","@timestamp":"2025-02-14T14:52:47.655-0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling.(*Processor).Run.func7","file.name":"sampling/processor.go","file.line":450},"message":"received error writing sampled trace: disk usage threshold 0.80: configured limit reached (current: 2438291456, limit: 2436697292)","service.name":"apm-server","ecs.version":"1.6.0"}

and

{"log.level":"warn","@timestamp":"2025-02-14T14:55:48.489-0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling.(*Processor).ProcessBatch","file.name":"sampling/processor.go","file.line":124},"message":"processing trace failed, discarding by default","service.name":"apm-server","error":{"message":"disk usage threshold 0.80: configured limit reached (current: 2443640832, limit: 2436697292)"},"ecs.version":"1.6.0"}

TBS continued to make incremental progress as TTL records expired in the DB.
The disk threshold was respected and usage never rose above the configured 80% thereafter.

df .
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1        2974484 2381332    576768  81% /
du -sh  data/tail_sampling/
245M    data/tail_sampling/
(screenshot attached)

2. Disk capacity at 80%, APM Server deployment restarted with an existing DB; load generated with continuous apmbench.

As in scenario 1, the same expected behavior continued.

3. Disk capacity at 85%, fresh APM Server deployment; load generated with continuous apmbench.

In this scenario disk utilization is above the configured threshold right from the start. The result is the same warning logs from the APM Server as in scenario 1, but this time all traces are discarded, since there is no space to store any sampled traces.

du -sh  data/tail_sampling/
32K     data/tail_sampling/

One surprising observation: sampling decisions were still being posted to Elasticsearch despite the disk threshold. This doesn't cause any direct problem, but it should be investigated separately later.
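
As a starting point for that follow-up, the sketch below counts the sampling-decision documents that were written despite the threshold; the traces-apm.sampled-default data stream name is an assumption about a default setup, and $ES_URL / $ES_API_KEY are placeholders.

# Hedged sketch: count sampling-decision documents in Elasticsearch.
# Data stream name and credentials are assumptions, not from this test run.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  "$ES_URL/traces-apm.sampled-default/_count" | jq .count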

(screenshot attached)

@1pkg (Member) commented Feb 15, 2025

When validating #15235 I used the checklist provided by @carsonip:

  • All of the below should be tested on prem and on ECH
  • run TBS with e.g. 2 policies, and ensure that policies are respected.
  • upgrade from an existing TBS setup; old Badger files should be ignored, apm-server should start up properly, and the new TBS implementation should work normally
  • run TBS over 2 * TTL, ensure disk usage is bounded by checking TBS monitoring metrics (and actual local disk usage if running locally), and does not OOM.

1. The upgrade from a previous TBS setup with Badger to a new TBS setup with Pebble.

Validated with a local build of APM Server - everything worked as expected and no errors were observed. To validate the new behavior I used a scaled-up version of the sendotlp CLI, which dispenses 100K test traces. With the following TBS policies, both setups worked identically, respecting the defined sample_rate. The new setup correctly ignored the old DB left behind by the old setup.

    policies:
      - sample_rate: 0.1
        trace.name: "foo"
      - sample_rate: 0.25
        trace.name: "bar"
      - sample_rate: 0.05
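
As a rough sanity check that the per-policy sample rates were respected, the sketch below buckets sampled transactions by name so the relative volumes can be compared against the 0.1 / 0.25 / 0.05 rates; the traces-apm-* pattern, the fields used, and $ES_URL / $ES_API_KEY are assumptions about a default APM setup, not values from this test.

# Hedged sketch: bucket sampled transactions by name and eyeball the ratios.
# Index pattern, fields, and credentials are assumptions, not from this test.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  -H 'Content-Type: application/json' \
  "$ES_URL/traces-apm-*/_search?size=0" -d '{
    "query": { "term": { "processor.event": "transaction" } },
    "aggs": { "by_name": { "terms": { "field": "transaction.name" } } }
  }' | jq '.aggregations.by_name.buckets'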

2. Run TBS over 2 * TTL, ensure disk usage is bounded by checking TBS monitoring metrics (and actual local disk usage if running locally), and does not OOM.

Validated both on-prem and on ECH with a scaled-up apmbench -count=100 run. Disk usage grew steadily until the TTL swapping logic kicked in, then fluctuated around 3GB on-prem and 1GB on ECH. No errors, excessive memory pressure, or OOMs were observed.
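
For the local/on-prem half of that check, disk usage can be tracked over a window longer than 2 * TTL with a simple loop like the sketch below (the data/tail_sampling/ path is the one shown earlier in this issue; the interval and log file name are arbitrary).

# Hedged sketch: sample TBS disk usage once a minute and keep a log of it.
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  du -sh data/tail_sampling/
  sleep 60
done | tee -a tbs_disk_usage.log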

(screenshots attached)
