New plugin sync and patch management framework #885

Merged 10 commits on Dec 16, 2022

@kralicky kralicky commented Dec 2, 2022

New plugin sync and patch management framework
Co-authored-by: Alexandre Lamarre <alexandre.lamarre@suse.com>

This is a large rework of the plugin sync and patching mechanism. The old system relied on agents to compute binary diffs of plugins, which caused several unforeseen issues. Most notably, computing patches is CPU- and memory-intensive, so agents running with limited resources would run out of memory (OOM) or slow down severely during upgrades.

In the new system, the gateway is responsible for computing patches. It will cache plugins and patches on disk, and manage this cache intelligently based on the state of known agents.

To enable this, the gateway now tracks the "last known connection details" for all the agents that connect to it. When an agent connects, its plugin manifest is saved to its cluster metadata in the kv store. Because of this new bookkeeping, the gateway can now determine precisely which plugin revisions and patches to keep around in its cache, and clean any "unreachable" objects from the cache on every startup.

A few implementation details are important to keep in mind, because they are what make this work:

  1. The gateway and any agents connected to it all run the exact same plugins. This is enforced for all agents when they connect to the gateway by comparing their plugin manifest against the gateway's plugin manifest.
  2. The gateway's plugins are distributed to it via container image, specifically by syncing the image to the specific version the manager container is running. All pods that run the opni image are automatically synced to the manager's image whenever the manager's image is changed.
  3. The agent starts out with no plugins. It runs a different image that only includes the minimal opni binary with non-agent features excluded (via build tags). Agent plugins are persisted on disk.
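The manifest check in [1] can be pictured as a comparison of two maps from plugin name to revision. The following is a minimal sketch, not the code in this PR: the types `Op` and `Action`, the function names, and the idea of a "remove" action are all hypothetical and only illustrate how a mismatch could be turned into per-plugin sync operations.

```go
package main

import (
	"fmt"
	"sort"
)

// Action is a hypothetical label for what an agent must do with one plugin.
type Action string

const (
	ActionNone   Action = "none"   // agent already has the gateway's revision
	ActionPatch  Action = "patch"  // agent has an older revision; apply a patch
	ActionCreate Action = "create" // agent does not have the plugin at all
	ActionRemove Action = "remove" // agent has a plugin the gateway does not (assumed behavior)
)

// Op describes the sync operation for a single plugin (hypothetical type).
type Op struct {
	Plugin string
	Action Action
	OldRev string
	NewRev string
}

// diffManifests compares an agent manifest against the gateway manifest.
// Both manifests map plugin name to a revision string (for example a digest).
func diffManifests(agent, gateway map[string]string) []Op {
	var ops []Op
	for name, newRev := range gateway {
		oldRev, ok := agent[name]
		switch {
		case !ok:
			ops = append(ops, Op{Plugin: name, Action: ActionCreate, NewRev: newRev})
		case oldRev != newRev:
			ops = append(ops, Op{Plugin: name, Action: ActionPatch, OldRev: oldRev, NewRev: newRev})
		default:
			ops = append(ops, Op{Plugin: name, Action: ActionNone, OldRev: oldRev, NewRev: newRev})
		}
	}
	for name, oldRev := range agent {
		if _, ok := gateway[name]; !ok {
			ops = append(ops, Op{Plugin: name, Action: ActionRemove, OldRev: oldRev})
		}
	}
	// Sort by plugin name so the result is deterministic.
	sort.Slice(ops, func(i, j int) bool { return ops[i].Plugin < ops[j].Plugin })
	return ops
}

// inSync reports whether the agent passes verification, i.e. nothing to do.
func inSync(ops []Op) bool {
	for _, op := range ops {
		if op.Action != ActionNone {
			return false
		}
	}
	return true
}

func main() {
	// A fresh agent starts out with no plugins, so everything is a "create".
	fresh := diffManifests(map[string]string{}, map[string]string{"metrics": "rev-2"})
	fmt.Println(fresh, inSync(fresh))

	// An agent with persisted plugins from before a gateway upgrade needs a patch.
	stale := diffManifests(map[string]string{"metrics": "rev-1"}, map[string]string{"metrics": "rev-2"})
	fmt.Println(stale, inSync(stale))
}
```

A fresh agent (item [3]) and a stale agent go through the same comparison; only the resulting actions differ.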

The next time the gateway starts up, its plugins may be different (due to [2]), which causes every agent to fail plugin verification when it reconnects (due to [3]). Since all previously connected agents were running the same plugins the gateway was just running (due to [1]), the gateway can use the previous set of plugins, cached on the last startup, and the current set, cached on this startup, to compute a patch that can be served to all agents at the same time. The patch is then kept in the gateway's cache until at least the next restart, at which point the gateway garbage-collects the "unreachable" plugins and patches from previous revisions.
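To illustrate why one patch can serve every agent, here is a rough sketch of a patch cache keyed by (old revision, new revision). It is an assumption-laden model, not the PR's implementation: `PatchCache`, `PatchFunc`, and the hit/miss counters are hypothetical names, and the patch function is a placeholder standing in for bsdiff.

```go
package main

import (
	"fmt"
	"sync"
)

// PatchFunc computes a binary patch between two plugin binaries.
// Hypothetical: a real gateway would plug a bsdiff implementation in here.
type PatchFunc func(oldData, newData []byte) []byte

// patchKey identifies a patch by the revisions it connects.
type patchKey struct{ oldRev, newRev string }

// PatchCache memoizes patches so each (old, new) pair is computed only once.
type PatchCache struct {
	mu      sync.Mutex
	compute PatchFunc
	patches map[patchKey][]byte
	Hits    int
	Misses  int
}

func NewPatchCache(compute PatchFunc) *PatchCache {
	return &PatchCache{compute: compute, patches: make(map[patchKey][]byte)}
}

// Get returns the cached patch for oldRev -> newRev, computing it on a miss.
// The lock is held while computing, so concurrent requests for the same patch
// wait for the first computation instead of repeating it.
func (c *PatchCache) Get(oldRev, newRev string, oldData, newData []byte) []byte {
	key := patchKey{oldRev: oldRev, newRev: newRev}
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.patches[key]; ok {
		c.Hits++
		return p
	}
	c.Misses++
	p := c.compute(oldData, newData)
	c.patches[key] = p
	return p
}

func main() {
	// Placeholder "patch": NOT a real diff, just enough to exercise the cache.
	fakeDiff := func(oldData, newData []byte) []byte {
		return []byte(fmt.Sprintf("%d->%d", len(oldData), len(newData)))
	}
	cache := NewPatchCache(fakeDiff)

	oldBin, newBin := []byte("plugin v1"), []byte("plugin v2 with changes")
	// Three agents reconnect with the same stale revision.
	for i := 0; i < 3; i++ {
		cache.Get("rev-1", "rev-2", oldBin, newBin)
	}
	fmt.Println("misses:", cache.Misses, "hits:", cache.Hits)
}
```

Holding the lock during computation is the simplest choice for a sketch; a production cache would more likely lock per key so unrelated patches can be computed in parallel.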

The garbage collection logic is very simple:

  1. On startup (before any agents connect), the gateway checks its list of known agents and their last known connection details, and builds a list of all unique plugin revisions. For example, if every known agent was connected before the gateway went down, this list would contain only the previous set of plugin revisions.
  2. Any plugins in the cache that are not contained in this aggregated list are removed, along with any patches associated with that plugin.
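The two steps above can be sketched as follows. This is a simplified in-memory model under stated assumptions: `AgentRecord`, `Patch`, and `Cache` are hypothetical types, and a patch is assumed to be "associated with" a plugin when that plugin is either end of the patch.

```go
package main

import (
	"fmt"
	"sort"
)

// AgentRecord is a hypothetical stand-in for an agent's last known
// connection details: the plugin revisions it reported when it last connected.
type AgentRecord struct {
	ID              string
	PluginRevisions []string
}

// Patch is a hypothetical cache entry connecting two plugin revisions.
type Patch struct {
	From, To string
}

// Cache is a hypothetical in-memory model of the gateway's on-disk cache.
type Cache struct {
	Plugins map[string]struct{} // keyed by plugin revision
	Patches []Patch
}

// reachableRevisions implements step 1: the set of unique plugin revisions
// referenced by any known agent.
func reachableRevisions(agents []AgentRecord) map[string]struct{} {
	reachable := make(map[string]struct{})
	for _, a := range agents {
		for _, rev := range a.PluginRevisions {
			reachable[rev] = struct{}{}
		}
	}
	return reachable
}

// collectGarbage implements step 2: it deletes cached plugins that no known
// agent references, then drops every patch whose source or target revision is
// no longer cached. It returns the removed revisions in sorted order.
func collectGarbage(c *Cache, agents []AgentRecord) []string {
	reachable := reachableRevisions(agents)
	var removed []string
	for rev := range c.Plugins {
		if _, ok := reachable[rev]; !ok {
			delete(c.Plugins, rev) // deleting during range is safe in Go
			removed = append(removed, rev)
		}
	}
	kept := c.Patches[:0]
	for _, p := range c.Patches {
		_, fromOK := c.Plugins[p.From]
		_, toOK := c.Plugins[p.To]
		if fromOK && toOK {
			kept = append(kept, p)
		}
	}
	c.Patches = kept
	sort.Strings(removed)
	return removed
}

func main() {
	cache := &Cache{
		Plugins: map[string]struct{}{"rev-a": {}, "rev-b": {}, "rev-c": {}},
		Patches: []Patch{{From: "rev-a", To: "rev-b"}, {From: "rev-b", To: "rev-c"}},
	}
	agents := []AgentRecord{
		{ID: "agent-1", PluginRevisions: []string{"rev-b", "rev-c"}},
		{ID: "agent-2", PluginRevisions: []string{"rev-c"}},
	}
	removed := collectGarbage(cache, agents)
	fmt.Println("removed plugins:", removed)
	fmt.Println("patches kept:", cache.Patches)
}
```

In this example no agent references rev-a, so that plugin and the rev-a to rev-b patch are dropped, while everything needed to bring agent-1 from rev-b to rev-c stays cached.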

Other details:

  • Patches are in bsdiff4 format (the format is configurable, but there is only one implementation right now). bsdiff produces very small patches but is slow to generate them. In the future we could add an option to use zstd raw-format dictionary compression, which is much faster, but there is currently no pure-Go implementation. Most of a bsdiff patch is bz2 data, so it is already compressed.
  • Plugins are stored on disk, as well as transmitted over the wire, in compressed zstd format.
  • The patching server adds a few Prometheus metrics to the gateway that track cache behavior (cache misses, cache hits, plugin count, patch count, total size).
  • Plugins now have a simplified and consistent main()
  • Metadata about which modes (gateway/agent) a plugin supports is made available, and the server uses this info to filter out gateway-only plugins from agents. To support this, plugin discovery has been improved with the ability to add filters and query modes.
  • Gateway/agent plugin config has been updated to allow only a single directory for loading plugins. Multi-directory plugin loading was never used, and removing it allowed us to simplify the plugin manifest.
  • Added new CLI command opni cluster show <id> which shows detailed info on the cluster, including its plugin manifest, last known connection details, and creation timestamp.
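As an illustration of the mode-based filtering mentioned in the list above, the sketch below filters a plugin list by supported mode. The `Mode` and `PluginMeta` types and the function names are hypothetical; the actual metadata format and discovery API in this PR may look different.

```go
package main

import "fmt"

// Mode is a hypothetical label for where a plugin can run.
type Mode string

const (
	ModeGateway Mode = "gateway"
	ModeAgent   Mode = "agent"
)

// PluginMeta is a hypothetical metadata record for a discovered plugin.
type PluginMeta struct {
	Name  string
	Modes []Mode
}

// supports reports whether the plugin declares support for the given mode.
func supports(p PluginMeta, m Mode) bool {
	for _, mode := range p.Modes {
		if mode == m {
			return true
		}
	}
	return false
}

// filterByMode returns only the plugins that support the given mode,
// preserving their original order.
func filterByMode(plugins []PluginMeta, m Mode) []PluginMeta {
	var out []PluginMeta
	for _, p := range plugins {
		if supports(p, m) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Plugin names here are made up for the example.
	discovered := []PluginMeta{
		{Name: "plugin_a", Modes: []Mode{ModeGateway, ModeAgent}},
		{Name: "plugin_b", Modes: []Mode{ModeGateway}},
		{Name: "plugin_c", Modes: []Mode{ModeAgent}},
	}
	// Only plugins that support agent mode would be offered to agents.
	for _, p := range filterByMode(discovered, ModeAgent) {
		fmt.Println(p.Name)
	}
}
```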

Closes #910
Closes #738 (obsolete)
Closes #911
Closes #789

@kralicky kralicky force-pushed the patch-improvements branch 2 times, most recently from adedf53 to 0457c52 Compare December 9, 2022 01:43
@kralicky kralicky changed the title new server-side-only patch management New plugin sync and patch management framework Dec 9, 2022
@kralicky kralicky marked this pull request as ready for review December 9, 2022 06:09
@kralicky kralicky marked this pull request as draft December 9, 2022 17:46
…in loader client to verify plugins before loading
@kralicky kralicky marked this pull request as ready for review December 9, 2022 22:59
@kralicky kralicky requested a review from tybalex December 9, 2022 23:00
@alexandreLamarre alexandreLamarre force-pushed the patch-improvements branch 2 times, most recently from bb2aad6 to fb301d1 Compare December 12, 2022 23:00
@kralicky kralicky merged commit b7a3337 into main Dec 16, 2022
@kralicky kralicky deleted the patch-improvements branch December 16, 2022 02:47