-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and refactor the state store #4441
Conversation
* use golden file for store migration test * remove withFile function * migrate take in a storage.Store instead of the storage's path. It's needed so we can set the encrypted store vault's path
* simplify fleetapi.Actions.UnmarshalJSON * add test to ensure the state store is correctly loaded from disk * skip state store migration tests, they will be fixes on a follow-up PR as part of #3912
It modifies the state store API to match the current needs.
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
This pull request is now in conflicts. Could you fix it? 🙏
|
This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
NOTE: |
- take bugfix/3912-borken-stat-store as the correct one for any store change - the code ill likely be broken after this merge
fe0f59c
to
f61bbdf
Compare
…astic-agent into bugfix/3912-borken-stat-store
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still one thing I need to check, I put a panic so the tests will fail with a TODO message for myself
// if on-disk creation fails an in-memory buffer is used. | ||
f, s, err := h.diagFile(aDiag, uDiag, cDiag, action.ExcludeEventsLog) | ||
f, s, err := h.diagFile(aDiag, uDiag, cDiag, action.Data.ExcludeEventsLog) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so just to confirm, #4384 is already the ticket for this test, right?
err2 = fmt.Errorf("migration storeLoad failed to close reader: %w", err2) | ||
} | ||
if err != nil { | ||
err = errors.Join(err, err2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The migrated stores are deleted afterwards, once the migration succeeds. Therefore not closing the file here might create problems to delete it after.
I could try to manually test it, I'd guess it could be a problem on windows
return nil | ||
} | ||
|
||
if action.Type != fleetapi.ActionTypePolicyChange { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is this only handling actions that were persisted by previous versions of the agent which are known to only write this action type?
yes, those are migrations only
return nil | ||
} | ||
|
||
if action.Type != fleetapi.ActionTypePolicyChange { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the thing about the unenroll is that an unenrolled agent cannot be upgraded. I can add it to migrate unenroll as well for completeness.
Data: fleetapi.ActionPolicyChangeData{ | ||
Policy: conv.YAMLMapToJSONMap(yamlStore.Action.Policy)}, | ||
} | ||
case fleetapi.ActionTypeUnenroll: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a migration, no action will be added on an older version. Besides this is the whole problem of the current implementation, support reading actions from YAML with different schema. Using JSON it's already handled on fleetapi and the store uses the same approach and reuses the fleetapi code
s.mx.Lock() | ||
defer s.mx.Unlock() | ||
|
||
switch v := a.(type) { | ||
case *fleetapi.ActionPolicyChange, *fleetapi.ActionUnenroll: | ||
// Only persist the action if the action is different. | ||
if s.state.action != nil && s.state.action.ID() == v.ID() { | ||
if s.state.ActionSerializer.Action != nil && | ||
s.state.ActionSerializer.Action.ID() == v.ID() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly. Now I fixed the old migration tests to run and use a golden file, looking at golden file for the action store, it has only one action whereas the state store, the evolution of the action, store has the queue.
} | ||
|
||
reader, err := yamlToReader(&serialize) | ||
reader, err := jsonToReader(&s.state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
require.Equal(t, ackToken, store.AckToken()) | ||
}) | ||
|
||
t.Run("errors when saving invalid action type", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
internal/pkg/fleetapi/action.go
Outdated
return nil | ||
// UnmarshalYAML prevents to decode actions from . | ||
func (a *Actions) UnmarshalYAML(_ func(interface{}) error) error { | ||
// TODO(AndersonQ): we need this to migrate the store from YAML to JSON |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nop, fixed
Are there any implications for being able to downgrade the agent when a migration has to occur? This is something Fleet does not allow today, but we want to make it possible: elastic/kibana#172745 |
I don't believe so. The same is valid for the other migrations, the store isn't designed with a rollback in mind, the old files are deleted and the store does not support a migration rollback. However, it isn't a problem in itself. Considering an agent rollback is intended to replace a broken agent, I'd assume it'll install the older version overriding the current installation. It'd recreate the store from scratch, thus no problem. Also trying to revert the store migration might even be worse, if the issue is on the store migration, it could corrupt the store of the rollback agent. Without extensive thought I see as the best having the agent to recreate its store from scratch. What we might need is to keep the fleet.enc file/state for the rollback agent, avoiding to perform a new enroll during the rollback. |
…astic-agent into bugfix/3912-borken-stat-store
if err2 != nil { | ||
err2 = fmt.Errorf("migration storeLoad failed to close reader: %w", err2) | ||
} | ||
if err != nil { | ||
err = errors.Join(err, err2) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can probably join the errors whether err is nil or not
if err2 != nil { | |
err2 = fmt.Errorf("migration storeLoad failed to close reader: %w", err2) | |
} | |
if err != nil { | |
err = errors.Join(err, err2) | |
} | |
if err2 != nil { | |
err = errors.Join(err, fmt.Errorf("migration storeLoad failed to close reader: %w", err2)) | |
} |
defer func() { | ||
errClose := reader.Close() | ||
if errClose != nil { | ||
errClose = fmt.Errorf( | ||
"migration storeLoad failed to close reader: %w", errClose) | ||
} | ||
err = errors.Join(err, errClose) | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmacknz, answering your question:
I tested on windows and as I expected it breaks if it fails to close the reader.
Also the current implementation only returns the error if there is already an error. But I changed it now, better to return it anyway.
here is what happens if the reader isn't closed:
--- FAIL: TestStoreMigrations (1.65s)
--- FAIL: TestStoreMigrations/action_store_to_YAML_state_store (1.65s)
--- FAIL: TestStoreMigrations/action_store_to_YAML_state_store/unenroll (1.65s)
migrations_test.go:147:
Error Trace: C:/vagrant/internal/pkg/agent/storage/store/migrations_test.go:147
Error: Received unexpected error:
migration storeLoad failed to close reader: fake reader.Close() error
Test: TestStoreMigrations/action_store_to_YAML_state_store/unenroll
Messages: migration action store -> state store failed
testing.go:1225: TempDir RemoveAll cleanup: remove C:\Users\vagrant\AppData\Local\Temp\1\TestStoreMigrationsaction_store_to_YAML_state_storeunenroll2169855176\001\action_store.yml: The process cannot access the file because it is being used by another process.
FAIL
FAIL github.com/elastic/elastic-agent/internal/pkg/a
to emulate this error, I just removed the code closing the reader
…astic-agent into bugfix/3912-borken-stat-store
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of typos and I'm not entirely comfortable with silently making SetAction a no-op if the action type is unexpected...
for _, a := range yamlStore.ActionQueue { | ||
if a.Type != fleetapi.ActionTypeUpgrade { | ||
log.Warnf( | ||
"loaded a unsupported %s action from the deprecated YAML state store action queue, ignoring it", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"loaded a unsupported %s action from the deprecated YAML state store action queue, ignoring it", | |
"loaded a unsupported %s action from the deprecated YAML state store action queue, it will be dropped", |
} | ||
if st.Version != Version { | ||
return nil, fmt.Errorf( | ||
"invalid state store version, got %q isntead of %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"invalid state store version, got %q isntead of %s", | |
"invalid state store version, expected %q, got %q ", |
// SetAction sets the current action. It accepts ActionPolicyChange or | ||
// ActionUnenroll. Any other type will be silently discarded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a default case in the switch statement with a debug log with action id and type being discarded because it's not supported ?
Maybe our future selves will be grateful to have it 😅
Furthermore, we check the Action type both here and in Save()
... is there a difference between thse two checks ?
store.dirty = true | ||
store.state.ActionSerializer.Action = fleetapi.NewAction(fleetapi.ActionTypeUnknown) | ||
err = store.Save() | ||
require.Error(t, err, "expected and error when saving sore with invalid state") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.Error(t, err, "expected and error when saving sore with invalid state") | |
require.Error(t, err, "expected an error when saving store with invalid state") |
this is the current behaviour. I understand your concern and actually agree with you. I just don't think this PR is the right place to change the store's behaviour |
Quality Gate passedIssues Measures |
Proposed commit message:
What does this PR do?
Fixes the agent's state store. This PR is the following PRs combined and a few needed changes to fix the conflicts with
main
.As explained here, the integration test for this will be done on another issue.
Why is it important?
Currently, the agent loses data when loading actions from the state store. With the fix, it won't loose data anymore and now the action schema and serialisation are the same for the state store and the fleetapi
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testHow to test this PR locally
reproduce the issue decribed on #3912. Now it should not fail.
Related issues
error parsing version ""
if agent restarts before the upgrade starts #3912Questions to ask yourself