9.0 test plan #15569

Open
7 tasks done
inge4pres opened this issue Feb 5, 2025 · 6 comments
@inge4pres (Contributor) commented Feb 5, 2025

Manual Test Plan

List of changes: v8.18.0...v9.0.0 (neither tag exists yet)

Smoke Testing ESS setup

Thanks to #8303, further smoke tests now run automatically on ESS.
Consider extending the smoke tests to cover additional test cases we'd like to include.

go-docappender library

No changes, same dependency version used

apm-data library

No changes, same dependency version used

Test cases from the GitHub board

Add yourself as assignee on the PR before you start testing.

apm-server 9.0.0 test-plan

Tasks

Regressions

@endorama (Member) commented Feb 11, 2025

I tested #15094 (and related #15360).

Test scenario:

  • created a new deployment with 9.0.0-beta1 in Cloud
  • enabled Logs and Metrics collection to the same deployment
  • ran apmsoak with the apm-server scenario
  • checked Stack Monitoring UI
  • checked .monitoring-beats-8-mb data stream

Stack Monitoring was displaying APM metrics, and the beats_stats.metrics.apm-server.* fields were visible in the monitoring data stream.
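
For anyone repeating this check, a minimal way to query the monitoring data stream directly is sketched below; the field path is the one named above, while $ES_URL and $ES_API_KEY are placeholders, not values from this test.

# Hedged sketch: fetch the newest monitoring document and print the
# beats_stats.metrics.apm-server.* section. $ES_URL and $ES_API_KEY are
# placeholders for the deployment endpoint and a read-capable API key.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  "$ES_URL/.monitoring-beats-8-mb/_search?size=1&sort=@timestamp:desc" \
  | jq '.hits.hits[0]._source.beats_stats.metrics["apm-server"]'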

Pictures: (screenshots attached)

@inge4pres (Contributor, Author) commented:

Tested #15211 with otelgen, adding a RecordError() call to all spans.
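
As a quick way to verify the result of such a run, the sketch below counts the APM error documents produced by the RecordError() calls; the logs-apm.error-* data stream pattern is an assumption about a default setup, and $ES_URL / $ES_API_KEY are placeholders.

# Hedged sketch: count APM error documents after the otelgen run.
# Data stream pattern and credentials are assumptions, not from this issue.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  "$ES_URL/logs-apm.error-*/_count" | jq .count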

(screenshot attached)

@raultorrecilla commented:

#14921 tested in #14921 (comment)

@simitt (Contributor) commented Feb 14, 2025

Tested the upgrade scenarios, details in upgrade scenario testing

One blocker issue was found related to the Cloud UI (and potentially API), which is tracked in https://elasticco.atlassian.net/browse/CP-10318.

@1pkg (Member) commented Feb 14, 2025

I validated the changes in #15524 with a standalone 9.0 APM Server build.

I ran 3 scenarios, with the following TBS settings:

  sampling.tail:
    enabled: true
    interval: 1m
    policies:
      - sample_rate: .5
    discard_on_write_failure: true

1. Disk capacity at 75%, fresh APM Server deployment; load generated with continuous apmbench.

As expected, before the 80% disk threshold was reached, TBS worked in normal mode.
When the threshold was reached, the following warning logs appeared and excess incoming traces were discarded.

{"log.level":"warn","@timestamp":"2025-02-14T14:52:47.655-0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling.(*Processor).Run.func7","file.name":"sampling/processor.go","file.line":450},"message":"received error writing sampled trace: disk usage threshold 0.80: configured limit reached (current: 2438291456, limit: 2436697292)","service.name":"apm-server","ecs.version":"1.6.0"}

and

{"log.level":"warn","@timestamp":"2025-02-14T14:55:48.489-0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling.(*Processor).ProcessBatch","file.name":"sampling/processor.go","file.line":124},"message":"processing trace failed, discarding by default","service.name":"apm-server","error":{"message":"disk usage threshold 0.80: configured limit reached (current: 2443640832, limit: 2436697292)"},"ecs.version":"1.6.0"}

TBS continued to make incremental progress as TTL records expired in the DB.
The disk threshold was respected and usage never rose above the configured 80% thereafter.

df .
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1        2974484 2381332    576768  81% /
du -sh  data/tail_sampling/
245M    data/tail_sampling/
(screenshot attached)

2. Disk capacity at 80%, APM Server deployment restarted with an existing DB; load generated with continuous apmbench.

As in scenario 1, the same expected behavior continued.

3. Disk capacity at 85%, fresh APM Server deployment; load generated with continuous apmbench.

In this scenario disk utilization is above the configured threshold right from the start. The result is the same warning logs from the APM Server as in scenario 1, but this time all traces are discarded, since there is no space to store any sampled traces.

du -sh  data/tail_sampling/
32K     data/tail_sampling/

One surprising observation: sampling decisions were still being posted to Elasticsearch despite the disk threshold. This doesn't cause any direct problem, but it should be investigated separately later.
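
As a starting point for that follow-up, the sketch below counts the sampling-decision documents that were written despite the threshold; the traces-apm.sampled-default data stream name is an assumption about a default setup, and $ES_URL / $ES_API_KEY are placeholders.

# Hedged sketch: count sampling-decision documents in Elasticsearch.
# Data stream name and credentials are assumptions, not from this test run.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  "$ES_URL/traces-apm.sampled-default/_count" | jq .count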

(screenshot attached)

@1pkg (Member) commented Feb 15, 2025

When validating #15235 I used the checklist provided by @carsonip:

  • All of the below should be tested on prem and on ECH
  • run TBS with e.g. 2 policies, and ensure that policies are respected.
  • upgrade from an existing TBS setup; old Badger files should be ignored, apm-server should start up properly, and the new TBS implementation should work normally
  • run TBS over 2 * TTL, ensure disk usage is bounded by checking TBS monitoring metrics (and actual local disk usage if running locally), and does not OOM.

1. The upgrade from a previous TBS setup with Badger to a new TBS setup with Pebble.

Validated with a local build of APM Server - everything worked as expected and no errors were observed. To validate the new behavior I used a scaled-up version of the sendotlp CLI, which dispenses 100K test traces. With the following TBS policies, both setups worked identically, respecting the defined sample_rate. The new setup correctly ignored the old DB left behind by the old setup.

    policies:
      - sample_rate: 0.1
        trace.name: "foo"
      - sample_rate: 0.25
        trace.name: "bar"
      - sample_rate: 0.05
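
As a rough sanity check that the per-policy sample rates were respected, the sketch below buckets sampled transactions by name so the relative volumes can be compared against the 0.1 / 0.25 / 0.05 rates; the traces-apm-* pattern, the fields used, and $ES_URL / $ES_API_KEY are assumptions about a default APM setup, not values from this test.

# Hedged sketch: bucket sampled transactions by name and eyeball the ratios.
# Index pattern, fields, and credentials are assumptions, not from this test.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  -H 'Content-Type: application/json' \
  "$ES_URL/traces-apm-*/_search?size=0" -d '{
    "query": { "term": { "processor.event": "transaction" } },
    "aggs": { "by_name": { "terms": { "field": "transaction.name" } } }
  }' | jq '.aggregations.by_name.buckets'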

2. Run TBS over 2 * TTL, ensure disk usage is bounded by checking TBS monitoring metrics (and actual local disk usage if running locally), and does not OOM.

Validated both on-prem and on ECH with a scaled-up apmbench -count=100 run. Disk usage grew steadily until the TTL swapping logic kicked in, then fluctuated around 3GB on-prem and 1GB on ECH. No errors, excessive memory pressure, or OOMs were observed.
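
For the local/on-prem half of that check, disk usage can be tracked over a window longer than 2 * TTL with a simple loop like the sketch below (the data/tail_sampling/ path is the one shown earlier in this issue; the interval and log file name are arbitrary).

# Hedged sketch: sample TBS disk usage once a minute and keep a log of it.
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  du -sh data/tail_sampling/
  sleep 60
done | tee -a tbs_disk_usage.log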

(screenshots attached)
