functional tests: tests fail due to issues with aggregation data streams #15531

endorama opened this issue Feb 3, 2025 · 0 comments

There are two issues:

  1. 4 data streams found instead of 8
  2. 1 index found instead of 2

They appear related, as both involve aggregation data streams, but they do not occur together: each shows up independently on different CI runs.

4 data streams found instead of 8

Example CI failure log:

=== RUN TestUpgrade_8_15_4_to_8_16_0

8_15_test.go:44: creating deployment with terraform

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/b8996afb-7656-4977-847f-a5ab2a686198/terraform version -json

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/b8996afb-7656-4977-847f-a5ab2a686198/terraform init -no-color -input=false -backend=true -get=true -upgrade=true

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/b8996afb-7656-4977-847f-a5ab2a686198/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true -var ec_target=pro -var ec_region=eu-west-1 -var stack_version=8.15.4 -var name=TestUpgrade_8_15_4_to_8_16_0

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/b8996afb-7656-4977-847f-a5ab2a686198/terraform output -no-color -json

8_15_test.go:52: time elapsed: 4m26.976542345s

8_15_test.go:72: created deployment [https://69dbf192bfcc4e00aa37870cc673ab33.eu-west-1.aws.found.io:443](https://69dbf192bfcc4e00aa37870cc673ab33.eu-west-1.aws.found.io:443)

8_15_test.go:80: creating APM API key

logger.go:146: 2025-02-03T03:09:05.386Z INFO ingest data

logger.go:146: 2025-02-03T03:09:25.648Z INFO restarting integrations server to flush apm server data

8_15_test.go:91: time elapsed: 6m1.714300587s

8_15_test.go:97: check data streams

8_15_test.go:101:

Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:82

/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:101

Error: "[{0xc00946904b <nil> 1 false 0xc0093bfaa0 [{0xc0093bf910 .ds-logs-apm.error-default-2025.02.03-000001 Ge0DxljQRVSvXpJFN4D1pw Data stream lifecycle 0xc009468dab}] 0xc000484920 map[description:[34 73 110 100 101 120 32 116 101 109 112 108 97 116 101 32 102 111 114 32 108 111 103 115 45 97 112 109 46 101 114 114 111 114 45 42 34] managed:[116 114 117 101]] logs-apm.error-default Data stream lifecycle false 0xc00946907b false yellow 0xc00946902b logs-apm.error@template {@timestamp}} {0xc0094694bb <nil> 1 false 0xc0093bfe90 [{0xc0093bfd00 .ds-metrics-apm.app.opbeans_python-default-2025.02.03-000001 Jt9NP83GRMiiv46LF9NA1g Data stream lifecycle 0xc00946921b}] 0xc000484940 map[description:[34 73 110 100 101 120 32 116 101 109 112 108 97 116 101 32 102 111 114 32 109 101 116 114 105 99 115 45 97 112 109 46 97 112 112 46 42 45 42 34] managed:[116 114 117 101]] metrics-apm.app.opbeans_python-default Data stream lifecycle false 0xc0094694eb false yellow 0xc00946949b metrics-apm.app@template {@timestamp}} {0xc00946992b <nil> 1 false 0xc009470300 [{0xc009470170 .ds-metrics-apm.internal-default-2025.02.03-000001 qhWQNXqDSIGrXS9wpKm_iQ Data stream lifecycle 0xc00946968b}] 0xc000484960 map[description:[34 73 110 100 101 120 32 116 101 109 112 108 97 116 101 32 102 111 114 32 109 101 116 114 105 99 115 45 97 112 109 46 105 110 116 101 114 110 97 108 45 42 34] managed:[116 114 117 101]] metrics-apm.internal-default Data stream lifecycle false 0xc00946995b false yellow 0xc00946990b metrics-apm.internal@template {@timestamp}} {0xc009469d9b <nil> 1 false 0xc0094706f0 [{0xc009470560 .ds-traces-apm-default-2025.02.03-000001 Zji6sIweT-OJlFzul-cALw Data stream lifecycle 0xc009469afb}] 0xc000484980 map[description:[34 73 110 100 101 120 32 116 101 109 112 108 97 116 101 32 102 111 114 32 116 114 97 99 101 115 45 97 112 109 45 42 34] managed:[116 114 117 101]] traces-apm-default Data stream lifecycle false 0xc009469dcb false yellow 0xc009469d7b traces-apm@template {@timestamp}}]" 
should have 8 item(s), but has 4

Test: TestUpgrade_8_15_4_to_8_16_0

Messages: number of APM datastream differs from expectations

The four data streams found are:

  • traces-apm-default
  • metrics-apm.internal-default
  • metrics-apm.app.opbeans_python-default
  • logs-apm.error-default

The aggregation data streams are missing. In all my tests, opening Kibana and inspecting the indices (either through the UI or via an API call in Dev Tools) revealed all of them as expected.
So this looks like a race condition between when we finish ingesting and when we query Elasticsearch for data stream information.

Note

To mitigate this issue we introduced a restart of APM Server. On shutdown, APM Server should flush all in-flight data, including aggregations (confirmed by Marc).
This restart (which takes at least ~1 minute) should take care of flushing the data and allow Elasticsearch enough time to write the aggregation data.

1 index found instead of 2

Example CI failure log:

=== RUN TestUpgrade_8_15_4_to_8_16_0

8_15_test.go:44: creating deployment with terraform

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform version -json

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform init -no-color -input=false -backend=true -get=true -upgrade=true

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true -var ec_target=qa -var ec_region=aws-eu-west-1 -var stack_version=8.15.4 -var name=TestUpgrade_8_15_4_to_8_16_0

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform output -no-color -json

8_15_test.go:52: time elapsed: 3m3.518383674s

8_15_test.go:72: created deployment [https://23055f16416845878d501df3082a4267.eu-west-1.aws.qa.cld.elstc.co:443](https://23055f16416845878d501df3082a4267.eu-west-1.aws.qa.cld.elstc.co:443)

8_15_test.go:80: creating APM API key

logger.go:146: 2025-02-03T03:07:37.883Z INFO ingest data

logger.go:146: 2025-02-03T03:08:18.623Z INFO restarting integrations server to flush apm server data

8_15_test.go:91: time elapsed: 4m58.467183249s

8_15_test.go:97: check data streams

8_15_test.go:108: time elapsed: 5m1.246496003s

8_15_test.go:110: upgrade to 8.16.0

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true -var ec_target=qa -var ec_region=aws-eu-west-1 -var name=TestUpgrade_8_15_4_to_8_16_0 -var stack_version=8.16.0

logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform output -no-color -json

8_15_test.go:112: time elapsed: 11m36.057462133s

8_15_test.go:114: check number of documents after upgrade

8_15_test.go:123: check data streams after upgrade, no rollover expected

logger.go:146: 2025-02-03T03:16:12.726Z INFO ingest data

logger.go:146: 2025-02-03T03:16:45.083Z INFO restarting integrations server to flush apm server data

8_15_test.go:135: time elapsed: 13m24.446239277s

8_15_test.go:137: check number of documents

8_15_test.go:145: check data streams and verify lazy rollover happened

8_15_test.go:148:

Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93

/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148

Error: "[{0xc000403f60 .ds-metrics-apm.service_destination.1m-default-2025.02.03-000001 4BAl8fuzSFqrELTSUNd_Fg Data stream lifecycle 0xc00011478b}]" should have 2 item(s), but has 1

Test: TestUpgrade_8_15_4_to_8_16_0

Messages: datastream metrics-apm.service_destination.1m-default should have 2 indices

8_15_test.go:148:

Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93

/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148

Error: "[{0xc000045ec0 .ds-metrics-apm.service_summary.1m-default-2025.02.03-000001 IgXXxQRaSdqb_uJsObYkEw Data stream lifecycle 0xc000114f5b}]" should have 2 item(s), but has 1

Test: TestUpgrade_8_15_4_to_8_16_0

Messages: datastream metrics-apm.service_summary.1m-default should have 2 indices

8_15_test.go:148:

Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93

/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148

Error: "[{0xc000236940 .ds-metrics-apm.service_transaction.1m-default-2025.02.03-000001 XJmNGBDVSW20TInnL11jaQ Data stream lifecycle 0xc00028e54b}]" should have 2 item(s), but has 1

Test: TestUpgrade_8_15_4_to_8_16_0

Messages: datastream metrics-apm.service_transaction.1m-default should have 2 indices

8_15_test.go:148:

Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93

/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148

Error: "[{0xc000236fb0 .ds-metrics-apm.transaction.1m-default-2025.02.03-000001 DR3GAdHjQx-uV6ppCxpLeA Data stream lifecycle 0xc00028e9fb}]" should have 2 item(s), but has 1

Test: TestUpgrade_8_15_4_to_8_16_0

Messages: datastream metrics-apm.transaction.1m-default should have 2 indices

8_15_test.go:155: time elapsed: 13m27.788784241s

8_15_test.go:56: cleanup terraform resources

This issue affects the aggregation data streams:

  • metrics-apm.transaction.1m-default
  • metrics-apm.service_transaction.1m-default
  • metrics-apm.service_summary.1m-default
  • metrics-apm.service_destination.1m-default

My hypothesis is that this issue is caused by a missing lazy rollover on those data streams. The fact that these are the aggregation data streams makes this very similar to the previously described error.

Lazy rollover only happens on write: if no write happens, as the previous error suggests, it is expected that the lazy rollover does not happen either.
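The lazy-rollover mechanics described above can be sketched with a tiny model; the type and method names are illustrative only, not the Elasticsearch implementation:

```go
package main

import "fmt"

// dataStream models lazy rollover: the upgrade marks the stream for
// rollover, but the new backing index is only created when the next
// write arrives. If ingestion never writes to the stream, the index
// count stays at 1, matching the failures seen above.
type dataStream struct {
	backingIndices  int
	rolloverPending bool
}

func (d *dataStream) markLazyRollover() { d.rolloverPending = true }

func (d *dataStream) write() {
	if d.rolloverPending {
		d.backingIndices++ // rollover materializes on the first write
		d.rolloverPending = false
	}
}

func main() {
	ds := &dataStream{backingIndices: 1}
	ds.markLazyRollover()
	fmt.Println(ds.backingIndices) // 1: no write yet, so no rollover

	ds.write()
	fmt.Println(ds.backingIndices) // 2: the write triggered the rollover
}
```

Under this model, the "1 index found instead of 2" failure is exactly the first state: the upgrade marked the aggregation data streams for rollover, but no aggregation document was written afterwards, so the second backing index never appeared.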
