Add bundle data snapshotting for resolution #433

m1kola · 2023-10-04T13:24:26Z

Description

We now snapshot bundle data in the client for the duration of single resolution.

This reduces reads (from network or cache) and unmarshalling since data in memory. And it also ensures that different variable sources involved in the resolution process have the same view of catalogs and bundles.

Closes #418

Reviewer Checklist

API Go Documentation
Tests: Unit Tests (and E2E Tests, if appropriate)
Comprehensive Commit Messages
Links to related GitHub Issue(s)

stevekuznetsov

It's not clear to me how this resolves the substantive issue at hand, which also manifests when the resolution calls out to read cluster state more than once.

I would suggest a different approach to the resolution library top-down instead: don't pass in delegated / enclosed sources that fetch data on their own time, instead, expect as input to the resolution pass a full set of variables over which resolution will occur. The client/caller must assemble these intentionally and run them through the resolver.

This would make it much more clear what the lifetime of the objects are and much less likely that somewhere in the myriad layers of abstraction that exist today in sources other mistakes occur.

codecov · 2023-10-04T13:32:55Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (6808a50) 84.00% compared to head (a7f848f) 83.68%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #433      +/-   ##
==========================================
- Coverage   84.00%   83.68%   -0.32%     
==========================================
  Files          23       23              
  Lines         844      852       +8     
==========================================
+ Hits          709      713       +4     
- Misses         93       95       +2     
- Partials       42       44       +2

Flag	Coverage Δ
e2e	`65.96% <85.71%> (-0.16%)`	⬇️
unit	`76.03% <20.00%> (-0.75%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
cmd/manager/main.go	`72.22% <100.00%> (-0.51%)`	⬇️
internal/controllers/operator_controller.go	`80.08% <100.00%> (-0.74%)`	⬇️
internal/catalogmetadata/client/client.go	`93.22% <75.00%> (-2.86%)`	⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

m1kola · 2023-10-04T16:14:11Z

It's not clear to me how this resolves the substantive issue at hand, which also manifests when the resolution calls out to read cluster state more than once.

This aims to address #418 is about bundle information being retrieved and processed multiple times. We no longer do that: we retrieve and process it once and hold for duration of resolution.

I would suggest a different approach to the resolution library top-down instead: don't pass in delegated / enclosed sources that fetch data on their own time, instead, expect as input to the resolution pass a full set of variables over which resolution will occur. The client/caller must assemble these intentionally and run them through the resolver.

I tried out a few different approaches and one of them was fetching bundles somewhere before this line:

operator-controller/internal/controllers/operator_controller.go

Line 133 in 6808a50

solution, err := r.Resolver.Solve(ctx)

And passing a slice of all bundles down to all variable sources somehow like this:

solution, err := r.NewSolver(allBundles).Solve(ctx)

I think what I ended up with wasn't better than having a client¹:

We have to pass around and store a ref to a allBundles slice in a lot of places ². It might not be an issue today since we only need to pass bundles, but eventualy we might need some other catalog information which we ignore today (packages, for example). With client it is a bit easier to extend: we would not have to update argumetns and fields on all involved structs.
I think it is more natural to use client here: I call this method on a client and it gives me info on all bundles and I don't care whether it comes from file cache or network or memory.

This would make it much more clear what the lifetime of the objects are and much less likely that somewhere in the myriad layers of abstraction that exist today in sources other mistakes occur.

I think variable sources are a bit messy today and can be simplified. But I do not see how passing a slince of bundles around instead of a thing¹ which gives me bundles makes it more clear.

Also why would I want to worry about lifetime of an object withing variable soruces? From a variable source perspective it doesn't matter lifetime of the objects are. From the whole resolution process it is important that variable sources see the same view of catalogs and we achieve that.

So my thinking is that:

Let variable source worry about doing their job
Let some other component worry about lifetime of the objects

I don't like caling this a "client", but I lack imagination to come up with different name. Maybe "store" as it was in Joe's POC here PoC: refactoring variable sources, experimenting with more constraints #331? ↩ ↩²
Which is another issue: I think we don't need so many variable sources. ↩

stevekuznetsov · 2023-10-04T16:56:29Z

We have to pass around and store a ref to a allBundles slice in a lot of places. It might not be an issue today since we only need to pass bundles, but eventualy we might need some other catalog information which we ignore today (packages, for example). With client it is a bit easier to extend: we would not have to update argumetns and fields on all involved structs.

Devil's advocate: what is the problem with updating your functions as they require new arguments? A clear list of arguments for a static transformation (variable inputs -> resolution output) is easy to read and defines all the requiremens up front. How often do you think you need to change the set of inputs to your resolution? How much work is it to add the new data source to the function signature? Is this something that's happening so often that you need to find a way to optimize the amount of time it takes to write this code?

I think variable sources are a bit messy today and can be simplified. But I do not see how passing a slince of bundles around instead of a thing which gives me bundles makes it more clear.

It's very simple: if I give you a static reference to the data you operate over, and your operation is a pure function, it's impossible to ever get into multiple-fetch / incoherent states and trivial to unit test. Passing in objects that allow you to fetch data on-demand is how the issues arise. Sure, you can wrap every single one of those to only ever return a slice of data, but to what end? Passing the slice itself is simple, and it's impossible to mess up and accidentally add a source in the future that changes during resolution.

ncdc · 2023-10-04T18:05:08Z

+1 - make the caller assemble the inputs, and pass a static set of inputs in/around.

joelanford · 2023-10-04T18:33:51Z

I think we need to be more prescriptive than "assemble inputs, pass them around" because, technically, that's exactly what already happens, right?

The resolution code asks all of its variable sources for the inputs, and then it passes the static set of variables to the resolver.

There are multiple levels of inputs:

Collect all relevant cluster state
Use (1) to build resolver variables
Use (2) to run the sat solver.

joelanford · 2023-10-04T18:41:45Z

In that PoC I did awhile back, collecting all the cluster state up front is essentially what I did: https://github.com/joelanford/operator-controller/blob/31082f42fa9fbe61a4c9155683d0956e2425b774/internal/resolution/v2/variablesources/operator.go#L29-L42

In that PR, the only variable source was that Operator variable source, so this is the only place (afaik) that actually fetches cluster data in the context of a resolution.

m1kola · 2023-10-05T09:41:54Z

Devil's advocate: what is the problem with updating your functions as they require new arguments?

You have to pass arguments and store a reference even in places where you don't need them (intermediate components). At least with the current state of variables sources. When you pass around a struct/an interface - you add a new method and imidiately can use them deep in the call stack.

There is also readability aspect of it: easier to read shorter list of arguments + when you have a "client" - it is intuitively clear where the data is comming from. If something is not clear - you can read a comment in a single place (on "client") which explains it and don't have to follow the breadcrumbs of slices passed from a component to component.

It's very simple: if I give you a static reference to the data you operate over, and your operation is a pure function, it's impossible to ever get into multiple-fetch / incoherent states and trivial to unit test. Passing in objects that allow you to fetch data on-demand is how the issues arise. Sure, you can wrap every single one of those to only ever return a slice of data, but to what end? Passing the slice itself is simple, and it's impossible to mess up and accidentally add a source in the future that changes during resolution.

In variable soruce we very often filter input data. I think It is easy to pass down a filtered slice of data down to the next function instead of unfiltered slice of all bundles and get the wrong solution. Harder to do this when you pass the "client" which always returns unfiltered data.

I can see a number of potential bugs / failure scenarios with both approaches and I don't think that one approach is substantially better than the other.

In that PR, the only variable source was that Operator variable source, so this is the only place (afaik) that actually fetches cluster data in the context of a resolution.

I would like to reduce the number of variable sources. I created #437 for this topic.

We now snapshot bundle data in the client for the duration of single resolution. This reduces reads (from network or cache) and unmarshalling since data in memory. And it also ensures that different variable sources involved in the resolution process have the same view of catalogs and bundles. Signed-off-by: Mikalai Radchuk <mradchuk@redhat.com>

m1kola · 2023-10-05T10:22:29Z

Marking this ready for review.

I'm not convinced so far that we should take another approach here: I do not think that passing slices around is substantially better than passing a client.

Issue #418 is not on my priority list at the moment and I already spent more time on it than initially planned. I'm happy to address feedback with the current approach, but if we decide that we need to do it differently - I'll leave it for another time.

stevekuznetsov · 2023-10-10T12:51:06Z

You have to pass arguments and store a reference even in places where you don't need them (intermediate components).

I think this statement is entirely predicated on how the code is factored - to what extent are intermediate components even necessary? Why? In any case, this seems like a small price to pay.

There is also readability aspect of it: easier to read shorter list of arguments + when you have a "client" - it is intuitively clear where the data is comming from.

I'll be honest, I just got familiar with the v0 resolver implementation - which, if I understand correctly - is the basis for this one - and the myriad layers of abstraction are way, way harder to read and understand. I think you are really an expert in this domain so take this with a grain of salt, but your experience is not universal.

In variable soruce we very often filter input data. I think It is easy to pass down a filtered slice of data down to the next function instead of unfiltered slice of all bundles and get the wrong solution. Harder to do this when you pass the "client" which always returns unfiltered data.

This doesn't seem convincing. The resolver continues to expect that callers provide it the correct data. The surface of this problem has not changed one iota. A static approach simply calls GetVariables() once, upfront. Any programmer error influencing the output of that call that could have happened before, can happen now, and vice versa.

I can see a number of potential bugs / failure scenarios with both approaches and I don't think that one approach is substantially better than the other.

The critical thing is that one class of bug - multiple fetches leading to performance and correctness issues that cannot be mitigated - are entirely impossible. This is an incredibly valuable property and critical to a system that can be trusted to be correct over time. Adding dynamism to an inherently static process (the transform of resolution over inputs) in order to aid in readability at the cost of correctness seems like a bad trade-off. The dozen open-forever bugs against v0 resolution that cannot be closed without rearchitecting the system are the result. If you would like to propose a different approach that meets your readability bar without sacrificing correctness, I'm all ears! I just suggested the static data approach as it's the simplest one I thought of at the time.

m1kola · 2023-10-11T13:23:51Z

I think this statement is entirely predicated on how the code is factored - to what extent are intermediate components even necessary? Why?

the myriad layers of abstraction are way, way harder to read and understand

They are not necessary and I do not like how variable sources are currently structured. It was quite hard for me to follow what is going on when I was learning operator-controller codebase.

There is #437 to simplify variable sources.

A static approach simply calls GetVariables() once, upfront.

Adding dynamism to an inherently static process

Variables sources (GetVariables()) are meant to be dynamic: this is where we collect data from different sources (kube API, catalogd API) to create static variables.

We only call GetVariables() once - here. After this the resolution is happening on a static set of variables.

The critical thing is that one class of bug - multiple fetches leading to performance and correctness issues that cannot be mitigated - are entirely impossible.

Why impossible? In both approaches programmer error leads to correctness issues. Be it someone accidentally overriding a slice of "all bundles" or someone accidentally deleting caching code. The result is incoherent view of the world in both approaches.

at the cost of correctness seems like a bad trade-off

Where specifically we undermine correctness here? As far as I understand - you are talking about potential bugs and regressions. But as we discussed above - both approaches have this potential:

This doesn't seem convincing. The resolver continues to expect that callers provide it the correct data. The surface of this problem has not changed one iota. A static approach simply calls GetVariables() once, upfront. Any programmer error influencing the output of that call that could have happened before, can happen now, and vice versa.

Exactly. I think we can not say for sure that with approach 1 programmers are less likely to mess up compared to approach 2.

stevekuznetsov · 2023-10-11T14:51:18Z

Why impossible?

@m1kola sorry, I am not sure where I am failing to explain this. Given two functions:

func ResolveDynamic(client client.Client) {}

func ResolveStatic(data []Item) {}

The ResolveStatic is a construct that literally cannot allow for programmer error within that would cause multiple client calls. ResolveDynamic has this unfortunate possibility. The errors we are talking about are not hypothetical - if you remember the issue you linked that was the root cause of flakes in downstream CI was an example of the class of failure that occurs due to this design. There are quite a number of others, and they have a good number of customer cases attached. They have been open for years and cannot be solved without fundamentally re-architecting the system.

Programmer error always occurs. Using "but it's possible to horrendously clobber changes to this code" as an argument is not useful. I am making a fundamental point about what is and is not possible in the architectural constraints within ResolveStatic. No matter how hard one tries, they may never lead to the class of performance and correctness bugs that are possible in ResolveDynamic, unless they explicitly refactor to allow it. Adding another call to a client is possible even on accident or as a side-effect of other changes in the dynamic system.

You claim that variable resolution is done only once - if that's true, two things:

nothing enforces that to be true, and the lack of enforcement allows for that class of correctness bugs
if there's only ever one call, what's the point of this PR? No snapshotting should be necessary if the data is only read once.

m1kola · 2023-10-11T16:20:57Z

@stevekuznetsov I'll start with these questions:

You claim that variable resolution is done only once - if that's true, two things:

nothing enforces that to be true, and the lack of enforcement allows for that class of correctness bugs

if there's only ever one call, what's the point of this PR? No snapshotting should be necessary if the data is only read once.

Do you suggest to wrap this in sync.Once or something like that? We can do that, but I'm not sure that we need any enforcement here given that we do not run this code concurrently in any way and lifespan of a solver is withing one reconciliation (at least after this PR).
Variable resolution only happens once on a set of static variables produced by GetVariables(). But we may call client multiple times form different units (e.g. different variable sources/funcs).

Think of the client as of shared informer (again - client is probably no the best name for this).

The ResolveStatic is a construct that literally cannot allow for programmer error within that would cause multiple client calls. ResolveDynamic has this unfortunate possibility.

We were talking about possibility of accidentally introducing correctness issues. Here is one of the examples of how it is possible with ResolveStatic:

Example

package main

import "fmt"

func ResolveStatic(data []Item) {
	Step1(data)

	// Oops, accidentally damaged something and
    // rest of the steps get wrong data
	data = data[:1]

	Step2(data)
}

func Step1(data []Item) {
	fmt.Println("Step1", data)
}

func Step2(data []Item) {
	fmt.Println("Step2", data)
}

func main() {
	data := []Item{
		{name: "1"},
		{name: "2"},
	}

	ResolveStatic(data)

	fmt.Println("end", data)
}

I do think that there is no material difference in this aspect.

The errors we are talking about are not hypothetical - if you remember the issue you linked that was the root cause of flakes in downstream CI was an example of the class of failure that occurs due to this design. There are quite a number of others, and they have a good number of customer cases attached. They have been open for years and cannot be solved without fundamentally re-architecting the system.

What you referring to was we due to incoherent caches (we had multiple caches where we only needed one, if I remember correctly).

I see how this supports "correctness is important" argument and I totally agree that correctness is important.

However I do not see how this supports "passing slice around is better than client/other data structure" argument. I do not think there is substantial difference: a programmer can mess up with any of these accidentally as I illustrated above.

Programmer error always occurs. Using "but it's possible to horrendously clobber changes to this code" as an argument is not useful.

I'm just trying to show that there is no substantial difference between two approaches:

With client it is possible to accidentally remove snapshotting when refactoring
With slices it is possible to accidentally pass incorrect data when refactoring

Adding another call to a client is possible even on accident or as a side-effect of other changes in the dynamic system.

Yes, it is true. But adding another call to a client/store doesn't matter if we make sure on a certain level that we return correct objects (like we do in this PR).

In PR we dealt with correctness & performance issues (same view on all levels, no extra unmarshalling) so it all boils down to how we pass the data (in slices or encapsulated in a client/store).

stevekuznetsov · 2023-10-11T16:41:18Z

What you referring to was we due to incoherent caches (we had multiple caches where we only needed one, if I remember correctly).

That is only partially correct. The v0 code - and, likely, the v1 code - fetches data (either from live clients or from listers) more than once. Even parts of the system that do not use a cache suffer from correctness problems simply because they access data more than once from the cluster. The system is not guaranteed to see the same list of CSVs - or catalog data, or whatever - every time it is fetched. If the system calls for the same data more than once during a resolution, your sources can be incoherent and the output of the resolution will not be correct.

Variable resolution only happens once on a set of static variables produced by GetVariables(). But we may call client multiple times form different units (e.g. different variable sources/funcs).

This is a pedantic distinction that makes no difference from the point of view of correctness. It does not matter how many layers of abstraction are put on top of the data fetching. If data fetching is allowed to happen more than once, the issue arises. My apologies if my imprecise language is getting in the way of communicating this in the specific terms of this library, but I do hope the broader point is clear.

Do you suggest to wrap this in sync.Once or something like that?

No. Working around the issue (with the sync package, or approaches like that in this PR) may be possible, but my point has been - from the beginning - that it is inherently unnecessary. The problem statement of resolution runs against the state of the system at a particular point in time. Designing resolution to allow for fetching the state of the system more than once introduces these failure modes. Time is better spent forming a design that better mirrors the problem domain and makes those failure modes impossible.

m1kola · 2023-10-11T17:02:25Z

That is only partially correct. The v0 code - and, likely, the v1 code - fetches data (either from live clients or from listers) more than once. Even parts of the system that do not use a cache suffer from correctness problems simply because they access data more than once from the cluster. The system is not guaranteed to see the same list of CSVs - or catalog data, or whatever - every time it is fetched.

@stevekuznetsov I agree. In v1 we deal with catalog data in this PR, but we still need to do that for kube API data. I have some ideas for #437 and I think as a side effect we will be fetching cluster data once as well. But we can create a separate issue to track that (if we don't already have one, need it check).

If the system calls for the same data more than once during a resolution, your sources can be incoherent and the output of the resolution will not be correct.

I agree here too. That's why we prevent fetching the same data more than once in this PR. Once resolution - one fetch (either from network or from FS cache).

This is a pedantic distinction that makes no difference from the point of view of correctness.

🤷‍♂️ Just answering questions.

It does not matter how many layers of abstraction are put on top of the data fetching. If data fetching is allowed to happen more than once, the issue arises.

Absolutely. If we merge this PR, we will not be allowing fetching more than once during the same resolution.

Designing resolution to allow for fetching the state of the system more than once introduces these failure modes.

With this PR - it won't be allowed to fetch state of the system (catalog) more than once. It will be exactly as you say: "resolution runs against the state of the system at a particular point in time".

But good point on fetching the state of the cluster. I'm wrapping up for today, but tomorrow I'll check whether we have an issue for this already so we don't rely on resolving it as a side effect of #437

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 4, 2023

stevekuznetsov requested changes Oct 4, 2023

View reviewed changes

ncdc changed the title ~~Adds bundle data snapshotting for resolution~~ Add bundle data snapshotting for resolution Oct 4, 2023

m1kola force-pushed the reduce_unmarshalling branch from a29337a to 42f2555 Compare October 4, 2023 16:15

m1kola force-pushed the reduce_unmarshalling branch from 42f2555 to a7f848f Compare October 5, 2023 09:53

m1kola marked this pull request as ready for review October 5, 2023 10:22

m1kola requested a review from a team as a code owner October 5, 2023 10:22

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 5, 2023

m1kola requested a review from stevekuznetsov October 5, 2023 10:22

m1kola closed this Oct 13, 2023

m1kola deleted the reduce_unmarshalling branch November 3, 2023 11:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add bundle data snapshotting for resolution #433

Add bundle data snapshotting for resolution #433

m1kola commented Oct 4, 2023

stevekuznetsov left a comment

codecov bot commented Oct 4, 2023 •

edited

Loading

m1kola commented Oct 4, 2023

stevekuznetsov commented Oct 4, 2023

ncdc commented Oct 4, 2023

joelanford commented Oct 4, 2023 •

edited

Loading

joelanford commented Oct 4, 2023

m1kola commented Oct 5, 2023

m1kola commented Oct 5, 2023 •

edited

Loading

stevekuznetsov commented Oct 10, 2023

m1kola commented Oct 11, 2023

stevekuznetsov commented Oct 11, 2023 •

edited

Loading

m1kola commented Oct 11, 2023

stevekuznetsov commented Oct 11, 2023

m1kola commented Oct 11, 2023

Add bundle data snapshotting for resolution #433

Add bundle data snapshotting for resolution #433

Conversation

m1kola commented Oct 4, 2023

Description

Reviewer Checklist

stevekuznetsov left a comment

Choose a reason for hiding this comment

codecov bot commented Oct 4, 2023 • edited Loading

Codecov Report

m1kola commented Oct 4, 2023

Footnotes

stevekuznetsov commented Oct 4, 2023

ncdc commented Oct 4, 2023

joelanford commented Oct 4, 2023 • edited Loading

joelanford commented Oct 4, 2023

m1kola commented Oct 5, 2023

m1kola commented Oct 5, 2023 • edited Loading

stevekuznetsov commented Oct 10, 2023

m1kola commented Oct 11, 2023

stevekuznetsov commented Oct 11, 2023 • edited Loading

m1kola commented Oct 11, 2023

stevekuznetsov commented Oct 11, 2023

m1kola commented Oct 11, 2023

codecov bot commented Oct 4, 2023 •

edited

Loading

joelanford commented Oct 4, 2023 •

edited

Loading

m1kola commented Oct 5, 2023 •

edited

Loading

stevekuznetsov commented Oct 11, 2023 •

edited

Loading