Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: fix empty version when loading upgrade action from disk #4184

Conversation

AndersonQ
Copy link
Member

@AndersonQ AndersonQ commented Feb 2, 2024

What does this PR do?

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@AndersonQ AndersonQ added Team:Elastic-Agent Label for the Agent team backport-v8.12.0 Automated backport with mergify labels Feb 2, 2024
@AndersonQ AndersonQ self-assigned this Feb 2, 2024
@AndersonQ
Copy link
Member Author

TL;DR: YAML isn't JSON. Ok, let me explain that:

When using JSON it's possible to have a []byte field and when unmarshaling it have the byte representation of the JSON string and it can be parsed later. That's the strategy used to marshal/unmarshal a
[].fleet.Action, an interface, which actually parses the data into a []fleet.FleetAction then into the specific Action.

It works for the fleet defined JSON action, which has a opaque data field with a different schema for each cation type. In Go, it's a json.RawMessage, a.k.a []byte and it works just fine. However the same isn't true for YAML, it's not even possible to have a raw field (see this), and we use yaml.v2. Besides the concrete Action structs we use do not have a data field, they're flat. What makes their marshaled version different from what they were unmarshaled from.

That's why we're loosing data like the version for the upgrade actions, the original data inside the data field is marshaled to disk in a different place than the one expected by unmarshal. Again, as JSON isn't YAML, even if the specific fields for each action were saved to disk under a data field, the code would break anyway because the data field would fail to unmarshal to a []byte/json.RawMessage.

Some possible solutions:

  1. Refactor Actions to Reflect JSON Schema:
    So far I believe the best solution would be to refactor all the actions to reflect the JSON schema by having a nested Data struct with the action's specific fields. This should make the marshal/unmarshal trivial. However this means changing a substantial amount of code and therefore the potential for things to break.

  2. Do not use YAML for actions:
    We could also just get rid of the yaml for actions. I confess right now I wonder if we really need yaml for them. Using only JSON could be a easy solution. And of course adding some good old panic when marshalling/unmarshalling the actions so we'd be sure no one uses them with YAML.

  3. Explore if yaml.v3 can replicate the JSON raw message behaviour:
    We could try to use yaml.v3 to see if it'd be possible to replicate the JSON behaviour. However, I haven't explored all that could be done if we'd to use yaml.v3 because I'm concerned if we don't change everywhere we might end up with some package using the v2 and messing things when marshalling/unmarshalling the actions.

  4. Have specific structs used only to marshal/unmarshal actions:
    Another option is to have specific structs to represent the YAML version of the actions, those structs would only be used for marshalling/unmarshalling what would hide the complexity of making the actions work in JSON form as well as in YAML form.

Regardless of the solution: we also need some good set of tests to ensure this mix and match of types that an action can be do not loose data when persisting them to disk and reading them back. Mainly if we add new fields or action types. It's exactly what our current tests are missing, to assert the specific fields for each action type.

Right now, for me, I'd like to avoid future problems, what requires a solution allowing the model of the actions be altered or new actions be created without requiring to change any marshal/unmarshal code. What makes me side with 1) and accept the burden of the refactor. However I haven't accessed how far all the actions spread in the agent's code, so I cannot guess the amount of work it'd require.

@cmacknz
Copy link
Member

cmacknz commented Feb 2, 2024

Consistently using JSON everywhere seems like the correct approach to eliminate compatibility errors. I am unsure why we use YAML to store actions on disk. Perhaps @michel-laterman knows or can give us guidance on the best approach since he was involved in the original implementation.

The one complication to changing the format is that actions are persisted on disk, which means if we change the format we have to treat this as a data migration.

I don't think there are a lot of actions persisted in general, in fact I'm not sure the most important one in upgrades even gets persisted. That said we should avoid throwing away actions that otherwise would have been processed. I can think of a couple of ways to deal with this, but am open to suggestion on which approach we should take.

@michel-laterman
Copy link
Contributor

@cmacknz I think yaml was used to be consistent with the format of the other files the elastic-agent wrote to disk.

@AndersonQ AndersonQ force-pushed the 3912-fix-state-empty-version branch from 3dd8f7b to 1313ded Compare February 8, 2024 14:05
@AndersonQ AndersonQ changed the base branch from main to bugfix/3912-borken-stat-store February 9, 2024 16:37
@AndersonQ AndersonQ changed the title WIP: fix empty version when loading upgrade action from disk Makes internal action schema to match Fleet Server schema Feb 12, 2024
@AndersonQ AndersonQ changed the title Makes internal action schema to match Fleet Server schema WIP: fix empty version when loading upgrade action from disk Feb 12, 2024
Copy link

Quality Gate passed Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

4 New issues
0 Security Hotspots
47.0% 47.0% Coverage on New Code
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

Copy link
Contributor

mergify bot commented Feb 13, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 3912-fix-state-empty-version upstream/3912-fix-state-empty-version
git merge upstream/bugfix/3912-borken-stat-store
git push upstream 3912-fix-state-empty-version

@AndersonQ
Copy link
Member Author

closed in favour of #4441

@AndersonQ AndersonQ closed this Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.12.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduled upgrade fails with error parsing version "" if agent restarts before the upgrade starts
3 participants