Adding upgrade test for unhealthy upgraded Agent #3274

ycombinator · 2023-08-21T22:24:38Z

What does this PR do?

This PR adds an integration test that ensures that an unhealthy Agent is rolled back.

Why is it important?

To ensure upgrade rollbacks work as expected.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
~~I have added tests that prove my fix is effective or that my feature works~~
~~I have added an entry in ./changelog/fragments using the changelog tool~~
I have added an integration test or an E2E test

How to test this PR locally

Check out this PR's branch and then:

DEV=true PLATFORMS=linux/arm64,linux/amd64 PACKAGES=tar.gz mage package
mage integration:single TestStandaloneUpgradeFailsStatus

Related issues

Relates Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

mergify · 2023-08-21T22:25:18Z

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine · 2023-08-21T22:49:17Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2023-09-08T14:48:54.210+0000
Duration: 26 min 2 sec

Test stats 🧪

Test	Results
Failed	0
Passed	6277
Skipped	55
Total	6332

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages.
run integration tests : Run the Elastic Agent Integration tests.
run end-to-end tests : Generate the packages and run the E2E Tests.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-08-21T22:49:27Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	98.78% (`81/82`)	👍
Files	66.212% (`194/293`)	👍
Classes	65.562% (`356/543`)	👍
Methods	52.612% (`1118/2125`)	👍
Lines	38.185% (`12731/33340`)	👍 0.013
Conditionals	100.0% (`0/0`)	💚

elasticmachine · 2023-08-23T16:48:27Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pchila

Left a couple of comments but overall it seems fine as it is a variation on other upgrade tests we already have and run...

testing/integration/upgrade_test.go

pchila · 2023-08-25T06:49:18Z

testing/integration/upgrade_test.go

+	// input when the Agent version matches the upgraded Agent version. This way
+	// the pre-upgrade version of the Agent runs healthy, but the post-upgrade
+	// version doesn't.
+	invalidInputPolicy := fastWatcherCfg + fmt.Sprintf(`


How do we guarantee that the fastWatcherCfg is supported? is it even needed here ?

We can't guarantee it but the settings in fastWatcherCfg should just be ignored by any versions of Agent that don't support it. Also, checkUpgradeWatcherRan will just skip checking if the upgrade watcher finished running for Agents that don't support the fastWatcherCfg settings:

elastic-agent/testing/integration/upgrade_test.go

Lines 558 to 565 in 22c276d

func checkUpgradeWatcherRan(t *testing.T, agentFixture *atesting.Fixture, fromVersion *version.ParsedSemVer) {

t.Helper()

if fromVersion.Less(*version_8_9_0_SNAPSHOT) {

t.Logf("Version %q is too old for a quick update marker check, skipping...", fromVersion)

return

}

So I think we are okay here?

If the checkUpgradeWatcherRan skips here, doesn't that make the test run pointless?
If I understood correctly, the point of the test is to verify that the watcher rollback the upgrade because of the unhealthy state of the new agent (please correct me if this is not the correct assumption). In such a case wouldn't the upgrade watcher terminate before the full 10 mins default grace period because of the amount of errors ?

Currently on the main branch, the fromVersion is 8.11.0. Which means checkUpgradeWatcherRan won't be skipped. If we backport this test to 8.10 and even 8.9, checkUpgradeWatcherRan won't be skipped.

If we were to backport this test to versions before 8.9, then checkUpgradeWatcherRan would be skipped and the rest of the test will fail because the assertion for checking if Agent has rolled back to the pre-upgrade version will timeout. But we don't plan to backport this test to any versions before 8.9 so I think we're good.

On an unrelated note, I just noticed that checkUpgradeWatcherRan skips if the fromVersion is < 8.9.0. That made sense when Agent was spawning the Upgrade Watcher from the old (pre-upgrade) Agent's binary. However, starting from 8.10.0 onwards, Agent will spawn the Upgrade Watcher from the new (post-upgrade) Agent's binary. So I think we need an additional check here to skip if the fromVersion is >= 8.10.0 && toVersion < 8.9.0. WDYT?

I think that the check is correct as it refers to the earliest version that understand settings for a shorter watch period... The fact that the watcher is spawned from the old or the new shouldn't matter as the outcome in the case of an unhealthy upgraded agent should stay the same...
It actually provides an interesting additional test: that even if the upgraded agent process itself is unhealthy, the agent process stays alive and take the correct action ;)

ycombinator · 2023-09-06T01:15:04Z

This test is failing on Ubuntu arm64 but not on Ubuntu amd64. The failure is bizzare too, in that it happens in the very first step of the test, where we try to install the Agent fixture and check that it's healthy, and the failure is evidently due to /opt/Elastic/Agent/elastic-agent not existing!

=== RUN   TestStandaloneUpgradeFailsStatus
    upgrade_test.go:944: Testing Elastic Agent upgrade from 8.11.0-SNAPSHOT to 8.10.0-SNAPSHOT...
    fetcher.go:90: Using existing artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz
    fixture.go:221: Extracting artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestStandaloneUpgradeFailsStatus2517823418/001
    fixture.go:234: Completed extraction of artifact elastic-agent-8.11.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestStandaloneUpgradeFailsStatus2517823418/001
    fixture.go:540: Components were not modified from the fetched artifact
    upgrade_test.go:971: Install the built Agent
    fixture.go:365: >> running agent with: [/tmp/TestStandaloneUpgradeFailsStatus2517823418/001/elastic-agent-8.11.0-SNAPSHOT-linux-arm64/elastic-agent install --force --non-interactive]
    upgrade_test.go:973: Installing in non-interactive mode.Elastic Agent has been successfully installed.
        
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent status --output yaml]
    upgrade_test.go:978: error getting the agent state: fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory
    upgrade_test.go:977: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_test.go:977
        	Error:      	Condition never satisfied
        	Test:       	TestStandaloneUpgradeFailsStatus
        	Messages:   	Agent never became healthy
    fixture_install.go:124: collecting diagnostics; test failed
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestStandaloneUpgradeFailsStatus-diagnostics-2023-09-05T17:40:27Z.zip]
    fixture_install.go:228: failed to collect diagnostics to /home/ubuntu/agent/build/diagnostics/TestStandaloneUpgradeFailsStatus-diagnostics-2023-09-05T17:40:27Z.zip (fork/exec /opt/Elastic/Agent/elastic-agent: no such file or directory): 
    fixture.go:365: >> running agent with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
--- FAIL: TestStandaloneUpgradeFailsStatus (150.71s)

If I run just just this test locally on an ARM64 Ubuntu machine, it passes. I'm going to re-run the Buildkite build and see if the test fails again in the same way or in a different way, to perhaps give us some ideas into what might be going on.

jlind23 · 2023-09-06T06:12:15Z

@pazone any reason why in this sonar notification #3274 (comment) we don't get any coverage metric?

cmacknz · 2023-09-06T12:36:29Z

The test failure here appears similar to https://buildkiteartifacts.com/e0f3970e-3a75-4621-919f-e6c773e2bb12/0187faac-2a90-4642-bc78-c33cefa167ba/018a667e-d0ba-4c91-908f-63a0d2c6aed0/018a667f-9c04-4961-b012-33d27279e3e7/build/TEST-report.html?response-content-type=text%2Fhtml&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAQPCP3C7LUJWWWKNY%2F20230906%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230906T123542Z&X-Amz-Expires=600&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEJj%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJGMEQCIG4GMwSJbrlL46bN51XihMmWxfUSFJUuBoR9hS%2FHZh6rAiB%2FwtX73xzbCLFdQ%2BJYDxvJPqBRNuDoe34VtKaLtOc9GyrxAwhwEAAaDDAzMjM3OTcwNTMwMyIMT8Tk9L8wlAORa2SBKs4DXAnvnfYgB9flWasmNmfEM1to2pYULoMDPz6tXgx3JMIr2ll%2FTARSmqupwAkM6KcD1pFZ1%2BDSJ1YjMRR6a0PCvFEFOS%2FZc3wT0zSzlFb6IsuiRnpg15MPQ95YL22%2Fw0wa61BTBDELCDo1TpDX0HGPpUpFwltJweP2No%2Fu%2FIrzp2V11xueoJ3mrJdtric89GviJT06ufpiqAAIO1CKswUfOD4g5wdKKSZtL0RQb6wZ4h0l%2BXa%2B2bUBzR3%2F0EEYT5EoH0crn3vddghjy3Wo2lUtxqdQE0H9hVfuB5Tzt02nq9DNCBB7Yx7wtkAb6Ue1tGTk8SRKi4r%2FywQGjfwbted8vVdrMzV747Mcj4PXlC7ySzlCzM%2BA6Vry9yBuG9XxtjQL5u4cAz3kGLnaMGrrRYoKheVE6aSEh3xm%2B5qIecWuLNfAWvKWkLVQWbfzbT%2FCngNqn%2BxuwHZWHg9UWpzkyFunrpA0aPf%2Bw0Tgpli%2FdiQoS%2BRZOKM%2BSKuHiHdRZ5y%2FZAG6q9V7%2BM7S9oKQjXKv9p98Sk7dqc4%2FkcWlq2Vir8r%2BzLRTQEtFj53vbtbOzUjdfSTwvi%2BerMalccfhPkHuPbE0eKVYeO7YzEWxrYwISKsWMOPQ4KcGOqYBygdRoJZM%2B9QUizmT2OEUBqqG%2FEX%2BSfgWJc9N5fFtNz52IHFSONdamz9qu26KRzNLH0UEOBoC4lJPlhq0VCiBh4hvLga%2FKWMYcQWZ9cVow7SiBwhphpHGEPOBdSDw%2FG6YPNyU0FlfO7sd3NFvvoQKh361LxfCNWnUUZgHU14E7mHk8ytK9gS5vOrKESiSyeqGn%2FIYVndgtGWPniF9qjQxNu2ibCSf6w%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=b3631f059508567bf67d11fbdf5c959d3e0f3f6cb2917c7a870618607a219f6c#github.com/elastic/elastic-agent/testing/integration(linux-arm64-ubuntu-2204)(sudo).git.luolix.top/elastic/elastic-agent/testing/integration.TestFleetManagedUpgrade/Upgrade_managed_agent_from_7.17.13_to_8.10.0-SNAPSHOT on main.

I did notice it is more common on arm but also happens (or very similar problems happen) on amd64.

ycombinator · 2023-09-07T05:01:40Z

This time TestStandaloneUpgradeFailsStatus did not fail in Buildkite. Re-running and hoping it fails so we can see what might be uninstalling Agent in the middle of the test...

ycombinator · 2023-09-07T05:01:47Z

buildkite test this

ycombinator · 2023-09-07T11:26:34Z

This time TestStandaloneUpgradeFailsStatus did not fail in Buildkite. Re-running and hoping it fails so we can see what might be uninstalling Agent in the middle of the test...

Re-running caused TestStandaloneUpgradeFailsStatus to fail again with the same error but the logging I added in dbfc886 didn't reveal much. This type of failure — where Agent randomly disappears during integration tests in CI — is happening on main as well. We think the root cause of this is #3371.

This PR here is, therefore, blocked on #3371 for the moment.

ycombinator · 2023-09-08T13:33:13Z

Workaround for #3371 was just merged: #3370. Re-basing on main...

elastic-sonarqube · 2023-09-08T15:00:46Z

SonarQube Quality Gate

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

* [WIP] Adding upgrade test for unhealthy upgraded Agent * Fixing up logging * WIP: fixing some issues with fake package creation * Use fast watcher config * [WIP] Different approach to making upgraded Agent unhealthy * WIP: Adding TODO * Flesh out rest of the test * Adding logging * Make invalid input try to run in upgraded Agent version * Removing OS restriction * Tuning status check frequency * Use artifacts API * Clarify comment (cherry picked from commit 5531ad3)

* [WIP] Adding upgrade test for unhealthy upgraded Agent * Fixing up logging * WIP: fixing some issues with fake package creation * Use fast watcher config * [WIP] Different approach to making upgraded Agent unhealthy * WIP: Adding TODO * Flesh out rest of the test * Adding logging * Make invalid input try to run in upgraded Agent version * Removing OS restriction * Tuning status check frequency * Use artifacts API * Clarify comment (cherry picked from commit 5531ad3) Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

mergify bot assigned ycombinator Aug 21, 2023

mergify bot added the backport-skip label Aug 21, 2023

ycombinator mentioned this pull request Aug 21, 2023

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

Closed

15 tasks

ycombinator added skip-changelog Team:Elastic-Agent Label for the Agent team labels Aug 21, 2023

ycombinator force-pushed the integ-test-upgrade-fails-status branch 2 times, most recently from 4c9a949 to f386baf Compare August 22, 2023 22:36

ycombinator marked this pull request as ready for review August 23, 2023 16:48

ycombinator requested a review from a team as a code owner August 23, 2023 16:48

ycombinator requested review from michalpristas and blakerouse August 23, 2023 16:48

ycombinator force-pushed the integ-test-upgrade-fails-status branch from a72ba62 to d4350a3 Compare August 24, 2023 21:11

pierrehilbert requested a review from pchila August 25, 2023 06:23

pchila approved these changes Aug 25, 2023

View reviewed changes

pierrehilbert linked an issue Aug 27, 2023 that may be closed by this pull request

Create tests to ensure that Elastic Agent's upgrade are working as expected #2176

Closed

15 tasks

ycombinator force-pushed the integ-test-upgrade-fails-status branch from c33192b to 789f69b Compare August 28, 2023 13:41

ycombinator requested a review from pchila August 28, 2023 14:10

pchila approved these changes Aug 28, 2023

View reviewed changes

ycombinator force-pushed the integ-test-upgrade-fails-status branch from 789f69b to fcb09a4 Compare August 29, 2023 12:50

ycombinator added backport-v8.10.0 Automated backport with mergify and removed backport-skip labels Aug 29, 2023

ycombinator enabled auto-merge (squash) August 29, 2023 15:13

ycombinator force-pushed the integ-test-upgrade-fails-status branch from fcb09a4 to 9aeade4 Compare September 5, 2023 16:46

ycombinator added 15 commits September 8, 2023 06:33

[WIP] Adding upgrade test for unhealthy upgraded Agent

38e1f98

Fixing up logging

97a25ee

WIP: fixing some issues with fake package creation

9e90aa8

Use fast watcher config

67bef53

[WIP] Different approach to making upgraded Agent unhealthy

e409188

WIP: Adding TODO

0c0d1b9

Flesh out rest of the test

cdc681a

Adding logging

8d23098

Make invalid input try to run in upgraded Agent version

336b1b6

Removing OS restriction

6db5fa1

Tuning status check frequency

b89a675

Use artifacts API

634a6e5

Clarify comment

84cdeff

Add logging for debugging

4fb5c62

Removing unnecessary const

3df1098

ycombinator force-pushed the integ-test-upgrade-fails-status branch from dbfc886 to 3df1098 Compare September 8, 2023 13:37

ycombinator mentioned this pull request Sep 8, 2023

[Integration Test] New Agent after upgrade fails to start #3085

Merged

7 tasks

Update watcher completion check function call

00bd22b

ycombinator merged commit 5531ad3 into elastic:main Sep 8, 2023
8 of 11 checks passed

mergify bot mentioned this pull request Sep 8, 2023

[8.10](backport #3274) Adding upgrade test for unhealthy upgraded Agent #3387

Merged

ycombinator deleted the integ-test-upgrade-fails-status branch September 11, 2023 16:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding upgrade test for unhealthy upgraded Agent #3274

Adding upgrade test for unhealthy upgraded Agent #3274

ycombinator commented Aug 21, 2023 •

edited

Loading

mergify bot commented Aug 21, 2023

elasticmachine commented Aug 21, 2023 •

edited

Loading

Build stats

Test stats 🧪

elasticmachine commented Aug 21, 2023 •

edited

Loading

elasticmachine commented Aug 23, 2023

pchila left a comment

pchila Aug 25, 2023

ycombinator Aug 25, 2023

pchila Aug 28, 2023

ycombinator Aug 28, 2023 •

edited

Loading

ycombinator Aug 28, 2023

pchila Aug 28, 2023 •

edited

Loading

ycombinator commented Sep 6, 2023

jlind23 commented Sep 6, 2023

cmacknz commented Sep 6, 2023

ycombinator commented Sep 7, 2023 •

edited

Loading

ycombinator commented Sep 7, 2023

ycombinator commented Sep 7, 2023 •

edited

Loading

ycombinator commented Sep 8, 2023

elastic-sonarqube bot commented Sep 8, 2023

	func checkUpgradeWatcherRan(t testing.T, agentFixture atesting.Fixture, fromVersion *version.ParsedSemVer) {
	t.Helper()

	if fromVersion.Less(*version_8_9_0_SNAPSHOT) {
	t.Logf("Version %q is too old for a quick update marker check, skipping...", fromVersion)
	return
	}

Adding upgrade test for unhealthy upgraded Agent #3274

Adding upgrade test for unhealthy upgraded Agent #3274

Conversation

ycombinator commented Aug 21, 2023 • edited Loading

What does this PR do?

Why is it important?

Checklist

How to test this PR locally

Related issues

mergify bot commented Aug 21, 2023

elasticmachine commented Aug 21, 2023 • edited Loading

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

elasticmachine commented Aug 21, 2023 • edited Loading

🌐 Coverage report

elasticmachine commented Aug 23, 2023

pchila left a comment

Choose a reason for hiding this comment

pchila Aug 25, 2023

Choose a reason for hiding this comment

ycombinator Aug 25, 2023

Choose a reason for hiding this comment

pchila Aug 28, 2023

Choose a reason for hiding this comment

ycombinator Aug 28, 2023 • edited Loading

Choose a reason for hiding this comment

ycombinator Aug 28, 2023

Choose a reason for hiding this comment

pchila Aug 28, 2023 • edited Loading

Choose a reason for hiding this comment

ycombinator commented Sep 6, 2023

jlind23 commented Sep 6, 2023

cmacknz commented Sep 6, 2023

ycombinator commented Sep 7, 2023 • edited Loading

ycombinator commented Sep 7, 2023

ycombinator commented Sep 7, 2023 • edited Loading

ycombinator commented Sep 8, 2023

elastic-sonarqube bot commented Sep 8, 2023

ycombinator commented Aug 21, 2023 •

edited

Loading

elasticmachine commented Aug 21, 2023 •

edited

Loading

elasticmachine commented Aug 21, 2023 •

edited

Loading

ycombinator Aug 28, 2023 •

edited

Loading

pchila Aug 28, 2023 •

edited

Loading

ycombinator commented Sep 7, 2023 •

edited

Loading

ycombinator commented Sep 7, 2023 •

edited

Loading