[Meta] Flaky integration tests #3357

AndersonQ · 2023-09-05T14:14:54Z

We're observing constant flakiness on some of the agent's integration tests:

Tests known to fail:

TestInstallWithoutBasePath
- Most likely the test setup needs /opt/Elastic/Agent to not exist. However, instead of ensuring that, the test fails if the folder is there. The folder might have been left behind by another test that ran before it. I thought I had fixed this; I need to look for the PR/commit that should have fixed it.
8.10: TestStandaloneUpgradeToSpecificSnapshotBuild
- [8.10](backport #3293) skip downgrade test when only one snapshot version is available #3297
- When there is only one snapshot version for the current version, the test fails because it relies on there being at least 2 snapshots. It used to fail with something like "1 is not greater than 1" (require.Greater(t, len(builds.Builds), 1)).

Fixes and restructure of the test framework and tests

There are several fixes and changes to some tests and to the test framework itself. They make it easier to understand what is going on and clearer how to use the test framework, what to use and when. They are all together in:

Refactor pkg/testing and small improvements to testing/integration #3378

Tests failing due to a known issue:

TestFleetManagedUpgrade any variation of /Upgrade_managed_agent_from_x.y.z_to_8.11.0-SNAPSHOT:
- UPDATE: actual bug: Uninstall does not stop a running watcher after upgrade #3371
- flaky upgrade
- issue collecting diagnostics: (fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory)
- judging by the test and diagnostics errors, it seems the agent is removed before the test finishes.


=== RUN   TestFleetManagedUpgrade/Upgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT
    fetcher_artifact.go:176: Downloading artifact from https://staging.elastic.co/7.17.13-d6f555d2/downloads/beats/elastic-agent/elastic-agent-7.17.13-linux-x86_64.tar.gz
    fetcher_artifact.go:255: Downloading artifact progress 15.24%
    fetcher_artifact.go:255: Downloading artifact progress 28.27%
    fetcher_artifact.go:255: Downloading artifact progress 41.92%
    fetcher_artifact.go:255: Downloading artifact progress 58.76%
    fetcher_artifact.go:255: Downloading artifact progress 74.01%
    fetcher_artifact.go:255: Downloading artifact progress 91.52%
    fetcher_artifact.go:255: Downloading artifact progress 100.00%
    fetcher_artifact.go:222: Completed downloading artifact from https://staging.elastic.co/7.17.13-d6f555d2/downloads/beats/elastic-agent/elastic-agent-7.17.13-linux-x86_64.tar.gz
    fetcher_artifact.go:176: Downloading artifact from https://staging.elastic.co/7.17.13-d6f555d2/downloads/beats/elastic-agent/elastic-agent-7.17.13-linux-x86_64.tar.gz.sha512
    fetcher_artifact.go:222: Completed downloading artifact from https://staging.elastic.co/7.17.13-d6f555d2/downloads/beats/elastic-agent/elastic-agent-7.17.13-linux-x86_64.tar.gz.sha512
    fixture.go:221: Extracting artifact elastic-agent-7.17.13-linux-x86_64.tar.gz to /tmp/TestFleetManagedUpgradeUpgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT2893997491/001
    fixture.go:234: Completed extraction of artifact elastic-agent-7.17.13-linux-x86_64.tar.gz to /tmp/TestFleetManagedUpgradeUpgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT2893997491/001
    fixture.go:540: Components were not modified from the fetched artifact
    upgrade_test.go:108: Creating Agent policy...
    upgrade_test.go:121: Creating Agent enrollment API key...
    upgrade_test.go:128: Getting default Fleet Server URL...
    upgrade_test.go:132: Enrolling Elastic Agent...
    fixture.go:365: >> running agent with: [/tmp/TestFleetManagedUpgradeUpgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT2893997491/001/elastic-agent-7.17.13-linux-x86_64/elastic-agent install --force --url https://36b2225240c14e059b0ae4a251644d36.fleet.us-central1.gcp.qa.cld.elstc.co:443 --enrollment-token MFgtSFlJb0I4YWoxN21QMjhjZUc6dlpXS1ZCajVRdnl2a09fN0ZlMFNwUQ==]
    upgrade_test.go:155: Waiting for enrolled Agent status to be "online"...
    upgrade_test.go:158: Upgrade Elastic Agent to version 8.11.0-SNAPSHOT...
    upgrade_test.go:162: Waiting for enrolled Agent status to be "online"...
    tools.go:34: Agent status: updating
    tools.go:34: Agent status: updating
[...]
    tools.go:34: Agent status: updating
    tools.go:34: Agent status: offline
[...]
    tools.go:34: Agent status: offline
    upgrade_test.go:163: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_test.go:163
        	            				/home/ubuntu/agent/testing/integration/upgrade_test.go:99
        	Error:      	Condition never satisfied
        	Test:       	TestFleetManagedUpgrade/Upgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT
        	Messages:   	Agent status is not online
    upgrade_test.go:151: Un-enrolling Elastic Agent...
    fixture_install.go:124: collecting diagnostics; test failed
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestFleetManagedUpgrade/Upgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT-diagnostics-2023-09-04T14:20:57Z.zip]
    fixture_install.go:228: failed to collect diagnostics to /home/ubuntu/agent/build/diagnostics/TestFleetManagedUpgrade/Upgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT-diagnostics-2023-09-04T14:20:57Z.zip (fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory): 
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
--- FAIL: TestFleetManagedUpgrade/Upgrade_managed_agent_from_7.17.13_to_8.11.0-SNAPSHOT (637.22s)

TestInstallWithEndpointSecurityAndRemoveEndpointIntegration/ unprotected OR protected:
- This issue is recurring; opened a dedicated issue: Flaky Test: TestInstallWithEndpointSecurityAndRemoveEndpointIntegration/* #3480
- UPDATE 2: New occurrence after the fix for the actual bug was merged ---> https://buildkite.com/elastic/elastic-agent/builds/3524#018ad0da-36cf-41bc-8161-7d1620878f38
  This may be fixed by reintroducing the wait for the watcher in this PR: Change the selection of downgrade version in TestUpgradeBrokenPackageVersion #3458
- UPDATE 1: should be caused by the actual bug: Uninstall does not stop a running watcher after upgrade #3371
- agent stuck in updating state on fleet
- failed to collect diagnostics: (fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory)
  - judging by the test and diagnostics errors, it seems the agent is removed before the test finishes.


=== RUN   TestInstallWithEndpointSecurityAndRemoveEndpointIntegration/unprotected
    endpoint_security_test.go:329: Enrolling the agent in Fleet
    endpoint_security_test.go:352: Creating enrollment API key...
    endpoint_security_test.go:352: Unpacking and installing Elastic Agent
    fetcher.go:90: Using existing artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz
    fixture.go:221: Extracting artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestInstallWithEndpointSecurityAndRemoveEndpointIntegrationunprotected3635982924/001
    fixture.go:234: Completed extraction of artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestInstallWithEndpointSecurityAndRemoveEndpointIntegrationunprotected3635982924/001
    fixture.go:540: Components were not modified from the fetched artifact
    fixture.go:365: >> running agent with: [/tmp/TestInstallWithEndpointSecurityAndRemoveEndpointIntegrationunprotected3635982924/001/elastic-agent-8.11.0-SNAPSHOT-linux-arm64/elastic-agent install --force --non-interactive --url https://7a21b773fefe41879de25ba733c88d0e.fleet.us-central1.gcp.qa.cld.elstc.co:443 --enrollment-token WDQ5QlVZb0IycVZtNXJLc25kUnM6RHI4YjR1OFVSdnVxR2FSZ2ZES2E3QQ==]
    endpoint_security_test.go:352: >>> Ran Enroll. Output: Installing in non-interactive mode.{"log.level":"info","@timestamp":"2023-09-01T15:00:01.675Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":479},"message":"Starting enrollment to URL: https://7a21b773fefe41879de25ba733c88d0e.fleet.us-central1.gcp.qa.cld.elstc.co:443/","ecs.version":"1.6.0"}
        {"log.level":"info","@timestamp":"2023-09-01T15:00:04.928Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":277},"message":"Successfully triggered restart on running Elastic Agent.","ecs.version":"1.6.0"}
        Successfully enrolled the Elastic Agent.
        Elastic Agent has been successfully installed.
    tools.go:34: Agent status: updating
[...]
    tools.go:34: Agent status: updating
    endpoint_security_test.go:352: 
        	Error Trace:	/home/ubuntu/agent/pkg/testing/tools/tools.go:136
        	            				/home/ubuntu/agent/pkg/testing/tools/tools.go:84
        	            				/home/ubuntu/agent/testing/integration/endpoint_security_test.go:352
        	            				/home/ubuntu/agent/testing/integration/endpoint_security_test.go:319
        	Error:      	Condition never satisfied
        	Test:       	TestInstallWithEndpointSecurityAndRemoveEndpointIntegration/unprotected
        	Messages:   	Elastic Agent status is not online
    fixture_install.go:124: collecting diagnostics; test failed
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestInstallWithEndpointSecurityAndRemoveEndpointIntegration/unprotected-diagnostics-2023-09-01T15:05:04Z.zip]
    fixture_install.go:228: failed to collect diagnostics to /home/ubuntu/agent/build/diagnostics/TestInstallWithEndpointSecurityAndRemoveEndpointIntegration/unprotected-diagnostics-2023-09-01T15:05:04Z.zip (fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory): 
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
--- FAIL: TestInstallWithEndpointSecurityAndRemoveEndpointIntegration/unprotected (341.23s)

TestInstallAndUnenrollWithEndpointSecurity:
- Flaky Test: TestInstallAndUnenrollWithEndpointSecurity #3260
TestUpgradeBrokenPackageVersion
- Flaky test: TestUpgradeBrokenPackageVersion #3454


elasticmachine · 2023-09-05T14:27:56Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

blakerouse · 2023-09-07T14:19:11Z

The "TestFleetManagedUpgrade" tests are not flaky; they have discovered an actual bug in the Elastic Agent. I have filed the issue that is causing the test failures here - #3371

pierrehilbert added the Team:Elastic-Agent label (Label for the Agent team) Sep 5, 2023

pierrehilbert added the meta label Sep 5, 2023

pchila mentioned this issue Oct 3, 2023

Broken/Flaky integration tests on main #3502

Closed

3 tasks

pierrehilbert closed this as completed Nov 10, 2023
