The two issues seem related, since both involve the aggregation data streams. However, they do not appear in sequence; they occur independently, on different CI runs.
The aggregation data streams are missing. In all of my tests, opening Kibana and looking at the indices (either in the UI or through an API call in Dev Tools) revealed all of them, as expected.
So this looks like a race condition between when we finish ingesting and when we query Elasticsearch for data stream information.
Note
To solve this issue we introduced a restart of APM Server. On shutdown, APM Server should flush all in-flight data, including aggregations (confirmed by Marc).
This restart, which takes at least about 1 minute, should flush the data and give Elasticsearch enough time to write the aggregation data.
1 index found instead of 2
Example CI failure log
=== RUN TestUpgrade_8_15_4_to_8_16_0
8_15_test.go:44: creating deployment with terraform
logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform version -json
logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform init -no-color -input=false -backend=true -get=true -upgrade=true
logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true -var ec_target=qa -var ec_region=aws-eu-west-1 -var stack_version=8.15.4 -var name=TestUpgrade_8_15_4_to_8_16_0
logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform output -no-color -json
8_15_test.go:52: time elapsed: 3m3.518383674s
8_15_test.go:72: created deployment https://23055f16416845878d501df3082a4267.eu-west-1.aws.qa.cld.elstc.co:443
8_15_test.go:80: creating APM API key
logger.go:146: 2025-02-03T03:07:37.883Z INFO ingest data
logger.go:146: 2025-02-03T03:08:18.623Z INFO restarting integrations server to flush apm server data
8_15_test.go:91: time elapsed: 4m58.467183249s
8_15_test.go:97: check data streams
8_15_test.go:108: time elapsed: 5m1.246496003s
8_15_test.go:110: upgrade to 8.16.0
logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true -var ec_target=qa -var ec_region=aws-eu-west-1 -var name=TestUpgrade_8_15_4_to_8_16_0 -var stack_version=8.16.0
logger.go:32: [INFO] running Terraform command: /home/runner/work/_temp/a90136c1-43e7-4509-95af-6b51695008f4/terraform output -no-color -json
8_15_test.go:112: time elapsed: 11m36.057462133s
8_15_test.go:114: check number of documents after upgrade
8_15_test.go:123: check data streams after upgrade, no rollover expected
logger.go:146: 2025-02-03T03:16:12.726Z INFO ingest data
logger.go:146: 2025-02-03T03:16:45.083Z INFO restarting integrations server to flush apm server data
8_15_test.go:135: time elapsed: 13m24.446239277s
8_15_test.go:137: check number of documents
8_15_test.go:145: check data streams and verify lazy rollover happened
8_15_test.go:148:
Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93
/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148
Error: "[{0xc000403f60 .ds-metrics-apm.service_destination.1m-default-2025.02.03-000001 4BAl8fuzSFqrELTSUNd_Fg Data stream lifecycle 0xc00011478b}]" should have 2 item(s), but has 1
Test: TestUpgrade_8_15_4_to_8_16_0
Messages: datastream metrics-apm.service_destination.1m-default should have 2 indices
8_15_test.go:148:
Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93
/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148
Error: "[{0xc000045ec0 .ds-metrics-apm.service_summary.1m-default-2025.02.03-000001 IgXXxQRaSdqb_uJsObYkEw Data stream lifecycle 0xc000114f5b}]" should have 2 item(s), but has 1
Test: TestUpgrade_8_15_4_to_8_16_0
Messages: datastream metrics-apm.service_summary.1m-default should have 2 indices
8_15_test.go:148:
Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93
/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148
Error: "[{0xc000236940 .ds-metrics-apm.service_transaction.1m-default-2025.02.03-000001 XJmNGBDVSW20TInnL11jaQ Data stream lifecycle 0xc00028e54b}]" should have 2 item(s), but has 1
Test: TestUpgrade_8_15_4_to_8_16_0
Messages: datastream metrics-apm.service_transaction.1m-default should have 2 indices
8_15_test.go:148:
Error Trace: /home/runner/work/apm-server/apm-server/functionaltests/main_test.go:93
/home/runner/work/apm-server/apm-server/functionaltests/8_15_test.go:148
Error: "[{0xc000236fb0 .ds-metrics-apm.transaction.1m-default-2025.02.03-000001 DR3GAdHjQx-uV6ppCxpLeA Data stream lifecycle 0xc00028e9fb}]" should have 2 item(s), but has 1
Test: TestUpgrade_8_15_4_to_8_16_0
Messages: datastream metrics-apm.transaction.1m-default should have 2 indices
8_15_test.go:155: time elapsed: 13m27.788784241s
8_15_test.go:56: cleanup terraform resources
This issue affects aggregation data streams:
metrics-apm.transaction.1m-default
metrics-apm.service_transaction.1m-default
metrics-apm.service_summary.1m-default
metrics-apm.service_destination.1m-default
My hypothesis is that this issue is caused by a lazy rollover that never happened on those data streams. The fact that these are the aggregation data streams makes this very similar to the previously described error.
Lazy rollover happens on write. If the write does not happen, as the previous error seems to suggest, then it is correct to expect that the lazy rollover does not happen either.
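One way to probe this hypothesis is to look at the get data stream API response for each affected stream. The sketch below assumes the response carries a `rollover_on_write` flag alongside the `indices` list, as recent Elasticsearch versions report; a stream that still has one backing index and `rollover_on_write` set to true would be consistent with a lazy rollover that is waiting for a write. The helper name `notRolledOver` is hypothetical, and the JSON payload is a hand-written illustration, not a captured response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dataStream holds the subset of fields this sketch reads from the
// get data stream API response (field availability is an assumption).
type dataStream struct {
	Name            string `json:"name"`
	RolloverOnWrite bool   `json:"rollover_on_write"`
	Indices         []struct {
		IndexName string `json:"index_name"`
	} `json:"indices"`
}

type dataStreamsResponse struct {
	DataStreams []dataStream `json:"data_streams"`
}

// notRolledOver returns the data streams that have fewer than want
// backing indices.
func notRolledOver(body []byte, want int) ([]dataStream, error) {
	var resp dataStreamsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding data stream response: %w", err)
	}
	var short []dataStream
	for _, ds := range resp.DataStreams {
		if len(ds.Indices) < want {
			short = append(short, ds)
		}
	}
	return short, nil
}

// sample is an illustrative payload. The second data stream name and both
// of its index names are made up for the example.
const sample = `{
  "data_streams": [
    {
      "name": "metrics-apm.transaction.1m-default",
      "rollover_on_write": true,
      "indices": [
        {"index_name": ".ds-metrics-apm.transaction.1m-default-2025.02.03-000001"}
      ]
    },
    {
      "name": "logs-example-default",
      "rollover_on_write": false,
      "indices": [
        {"index_name": ".ds-logs-example-default-2025.02.03-000001"},
        {"index_name": ".ds-logs-example-default-2025.02.03-000002"}
      ]
    }
  ]
}`

func main() {
	short, err := notRolledOver([]byte(sample), 2)
	if err != nil {
		panic(err)
	}
	for _, ds := range short {
		fmt.Printf("%s: %d index(es), rollover_on_write=%t\n",
			ds.Name, len(ds.Indices), ds.RolloverOnWrite)
	}
}
```

Logging the flag next to the index count in the failing assertion would help distinguish "no write reached the data stream" from "the rollover itself did not trigger".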
There are 2 issues:
4 data streams found instead of 8
CI failure log example
This is a log from a CI run failure.
The 4 found are: