Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and refactor the state store #4441

Merged
merged 68 commits into from
Jun 27, 2024
Merged

Conversation

AndersonQ
Copy link
Member

@AndersonQ AndersonQ commented Mar 20, 2024

Proposed commit message:

Title: refactor the state store to use JSON

Body:
Fix action store -> state store migration tests (#4235)
* use golden file for store migration test
* remove withFile function
* migrate take in a storage.Store instead of the storage's path. It's needed so we can set the encrypted store vault's path

refactor state store (#4253)
It modifies the state store API to match the current needs.

update action model to match fleet-server schema (#4240)
* simplify fleetapi.Actions.UnmarshalJSON
* add test to ensure the state store is correctly loaded from disk
* skip state store migration tests, they will be fixes on a follow-up PR as part of #3912

add migrations for action and state stores (#4305)

What does this PR do?

Fixes the agent's state store. This PR is the following PRs combined and a few needed changes to fix the conflicts with main.

As explained here, the integration test for this will be done on another issue.

Why is it important?

Currently, the agent loses data when loading actions from the state store. With the fix, it won't loose data anymore and now the action schema and serialisation are the same for the state store and the fleetapi

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

How to test this PR locally

reproduce the issue decribed on #3912. Now it should not fail.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

* use golden file for store migration test
* remove withFile function
* migrate take in a storage.Store instead of the storage's path. It's needed so we can set the encrypted store vault's path
* simplify fleetapi.Actions.UnmarshalJSON
* add test to ensure the state store is correctly loaded from disk
* skip state store migration tests, they will be fixes on a follow-up PR as part of #3912
It modifies the state store API to match the current needs.
@AndersonQ AndersonQ added bug Something isn't working Team:Elastic-Agent Label for the Agent team labels Mar 20, 2024
@AndersonQ AndersonQ self-assigned this Mar 20, 2024
@AndersonQ AndersonQ requested a review from a team as a code owner March 20, 2024 06:10
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Copy link
Contributor

mergify bot commented Mar 20, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b bugfix/3912-borken-stat-store upstream/bugfix/3912-borken-stat-store
git merge upstream/main
git push upstream bugfix/3912-borken-stat-store

Copy link
Contributor

mergify bot commented Mar 20, 2024

This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

 - take bugfix/3912-borken-stat-store as the correct one for any store change
 - the code ill likely be broken after this merge
@AndersonQ AndersonQ force-pushed the bugfix/3912-borken-stat-store branch from fe0f59c to f61bbdf Compare April 2, 2024 12:05
@AndersonQ AndersonQ requested review from blakerouse and removed request for michalpristas April 2, 2024 12:37
Copy link
Member Author

@AndersonQ AndersonQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still one thing I need to check, I put a panic so the tests will fail with a TODO message for myself

// if on-disk creation fails an in-memory buffer is used.
f, s, err := h.diagFile(aDiag, uDiag, cDiag, action.ExcludeEventsLog)
f, s, err := h.diagFile(aDiag, uDiag, cDiag, action.Data.ExcludeEventsLog)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so just to confirm, #4384 is already the ticket for this test, right?

err2 = fmt.Errorf("migration storeLoad failed to close reader: %w", err2)
}
if err != nil {
err = errors.Join(err, err2)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The migrated stores are deleted afterwards, once the migration succeeds. Therefore not closing the file here might create problems to delete it after.

I could try to manually test it, I'd guess it could be a problem on windows

return nil
}

if action.Type != fleetapi.ActionTypePolicyChange {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is this only handling actions that were persisted by previous versions of the agent which are known to only write this action type?

yes, those are migrations only

return nil
}

if action.Type != fleetapi.ActionTypePolicyChange {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thing about the unenroll is that an unenrolled agent cannot be upgraded. I can add it to migrate unenroll as well for completeness.

Data: fleetapi.ActionPolicyChangeData{
Policy: conv.YAMLMapToJSONMap(yamlStore.Action.Policy)},
}
case fleetapi.ActionTypeUnenroll:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a migration, no action will be added on an older version. Besides this is the whole problem of the current implementation, support reading actions from YAML with different schema. Using JSON it's already handled on fleetapi and the store uses the same approach and reuses the fleetapi code

s.mx.Lock()
defer s.mx.Unlock()

switch v := a.(type) {
case *fleetapi.ActionPolicyChange, *fleetapi.ActionUnenroll:
// Only persist the action if the action is different.
if s.state.action != nil && s.state.action.ID() == v.ID() {
if s.state.ActionSerializer.Action != nil &&
s.state.ActionSerializer.Action.ID() == v.ID() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly. Now I fixed the old migration tests to run and use a golden file, looking at golden file for the action store, it has only one action whereas the state store, the evolution of the action, store has the queue.

}

reader, err := yamlToReader(&serialize)
reader, err := jsonToReader(&s.state)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

require.Equal(t, ackToken, store.AckToken())
})

t.Run("errors when saving invalid action type", func(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

internal/pkg/fleetapi/action.go Outdated Show resolved Hide resolved
return nil
// UnmarshalYAML prevents to decode actions from .
func (a *Actions) UnmarshalYAML(_ func(interface{}) error) error {
// TODO(AndersonQ): we need this to migrate the store from YAML to JSON
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nop, fixed

@cmacknz
Copy link
Member

cmacknz commented Jun 20, 2024

Are there any implications for being able to downgrade the agent when a migration has to occur?

This is something Fleet does not allow today, but we want to make it possible: elastic/kibana#172745

@AndersonQ
Copy link
Member Author

AndersonQ commented Jun 21, 2024

Are there any implications for being able to downgrade the agent when a migration has to occur?

This is something Fleet does not allow today, but we want to make it possible: elastic/kibana#172745

I don't believe so. The same is valid for the other migrations, the store isn't designed with a rollback in mind, the old files are deleted and the store does not support a migration rollback.

However, it isn't a problem in itself. Considering an agent rollback is intended to replace a broken agent, I'd assume it'll install the older version overriding the current installation. It'd recreate the store from scratch, thus no problem. Also trying to revert the store migration might even be worse, if the issue is on the store migration, it could corrupt the store of the rollback agent. Without extensive thought I see as the best having the agent to recreate its store from scratch.

What we might need is to keep the fleet.enc file/state for the rollback agent, avoiding to perform a new enroll during the rollback.

Comment on lines 73 to 78
if err2 != nil {
err2 = fmt.Errorf("migration storeLoad failed to close reader: %w", err2)
}
if err != nil {
err = errors.Join(err, err2)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably join the errors whether err is nil or not

Suggested change
if err2 != nil {
err2 = fmt.Errorf("migration storeLoad failed to close reader: %w", err2)
}
if err != nil {
err = errors.Join(err, err2)
}
if err2 != nil {
err = errors.Join(err, fmt.Errorf("migration storeLoad failed to close reader: %w", err2))
}

Comment on lines +71 to +78
defer func() {
errClose := reader.Close()
if errClose != nil {
errClose = fmt.Errorf(
"migration storeLoad failed to close reader: %w", errClose)
}
err = errors.Join(err, errClose)
}()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmacknz, answering your question:

I tested on windows and as I expected it breaks if it fails to close the reader.
Also the current implementation only returns the error if there is already an error. But I changed it now, better to return it anyway.

here is what happens if the reader isn't closed:

--- FAIL: TestStoreMigrations (1.65s)
    --- FAIL: TestStoreMigrations/action_store_to_YAML_state_store (1.65s)
        --- FAIL: TestStoreMigrations/action_store_to_YAML_state_store/unenroll (1.65s)
            migrations_test.go:147:
                        Error Trace:    C:/vagrant/internal/pkg/agent/storage/store/migrations_test.go:147
                        Error:          Received unexpected error:
                                        migration storeLoad failed to close reader: fake reader.Close() error
                        Test:           TestStoreMigrations/action_store_to_YAML_state_store/unenroll
                        Messages:       migration action store -> state store failed
            testing.go:1225: TempDir RemoveAll cleanup: remove C:\Users\vagrant\AppData\Local\Temp\1\TestStoreMigrationsaction_store_to_YAML_state_storeunenroll2169855176\001\action_store.yml: The process cannot access the file because it is being used by another process.
FAIL
FAIL    github.com/elastic/elastic-agent/internal/pkg/a

to emulate this error, I just removed the code closing the reader

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of typos and I'm not entirely comfortable with silently making SetAction a no-op if the action type is unexpected...

for _, a := range yamlStore.ActionQueue {
if a.Type != fleetapi.ActionTypeUpgrade {
log.Warnf(
"loaded a unsupported %s action from the deprecated YAML state store action queue, ignoring it",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"loaded a unsupported %s action from the deprecated YAML state store action queue, ignoring it",
"loaded a unsupported %s action from the deprecated YAML state store action queue, it will be dropped",

}
if st.Version != Version {
return nil, fmt.Errorf(
"invalid state store version, got %q isntead of %s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"invalid state store version, got %q isntead of %s",
"invalid state store version, expected %q, got %q ",

Comment on lines +166 to +167
// SetAction sets the current action. It accepts ActionPolicyChange or
// ActionUnenroll. Any other type will be silently discarded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a default case in the switch statement with a debug log with action id and type being discarded because it's not supported ?
Maybe our future selves will be grateful to have it 😅
Furthermore, we check the Action type both here and in Save()... is there a difference between thse two checks ?

store.dirty = true
store.state.ActionSerializer.Action = fleetapi.NewAction(fleetapi.ActionTypeUnknown)
err = store.Save()
require.Error(t, err, "expected and error when saving sore with invalid state")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.Error(t, err, "expected and error when saving sore with invalid state")
require.Error(t, err, "expected an error when saving store with invalid state")

@AndersonQ
Copy link
Member Author

A couple of typos and I'm not entirely comfortable with silently making SetAction a no-op if the action type is unexpected...

this is the current behaviour. I understand your concern and actually agree with you. I just don't think this PR is the right place to change the store's behaviour

@AndersonQ AndersonQ requested review from pchila and cmacknz June 21, 2024 15:33
@AndersonQ AndersonQ enabled auto-merge (squash) June 21, 2024 15:40
@AndersonQ AndersonQ merged commit b4877f5 into main Jun 27, 2024
13 checks passed
@AndersonQ AndersonQ deleted the bugfix/3912-borken-stat-store branch June 27, 2024 07:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
8 participants