-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve artifact support by storing artifact_id in the rule evaluation table and store artifact information during webhook processing #760
Conversation
Fixes: #715 (there's one item in the 715 todo list which I'll split into a separate ticket as this PR is already quite big) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code look good to me, but I would like @yrobla to take a look who has history with the implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upsert logic seems fine to me. But, I'm confused by the in-code signature verification, that should be done at the policy level instead AFAIU. Let's catch up on this.
So we did have a rather long discussion with @JAORMX and @yrobla and agreed that since we're storing the data about signatures as part of the artifact version, we can just keep doing that, but we'll make the policy evaluation simpler and more effective by checking the signature and storing the result when we fetch the artifact version and then the policy will make a decision based on that -- currently the builtin policy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, I understand there's a little rework going on here at the moment.
@@ -240,7 +240,7 @@ CREATE UNIQUE INDEX users_organization_id_username_lower_idx ON users (organizat | |||
CREATE UNIQUE INDEX repositories_repo_id_idx ON repositories(repo_id); | |||
CREATE UNIQUE INDEX policies_group_id_policy_name_idx ON policies(provider, group_id, name); | |||
CREATE UNIQUE INDEX rule_type_idx ON rule_type(provider, group_id, name); | |||
CREATE UNIQUE INDEX rule_evaluation_status_results_idx ON rule_evaluation_status(policy_id, repository_id, entity, rule_type_id); | |||
CREATE UNIQUE INDEX rule_evaluation_status_results_idx ON rule_evaluation_status(policy_id, repository_id, COALESCE(artifact_id, -1), entity, rule_type_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOT for this PR, but I'm feeling a little itchy on the schema here -- I'm not sure what to suggest yet, but the Upsert (which I introduced) is starting to feel like the beginnings of a hairball. On the other hand, I think "Upsert" is absolutely the right semantic if we want to know the latest rule evaluation result -- I think it's more more about the possible 4-way ID key with NULLs.
Would using DEFAULT -1
for artifact_id
in the database simplify the rest, or just make things worse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I think Ozz or Yolanda warned me about this, to be fair)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code below, it looks like we should be able to use 0
as the sentinel value as well as -1
. I have a slight preference for using 0
, as that aligns with Go and Proto's default values. It does mean that we need to start our artifact ID numbering at 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starting at 1 should already be the case (Postgres SERIAL
starts as 1 unless you tell it otherwise using CREATE SEQUENCE
IIRC). I chose -1 specifically because it's clearly invalid, but so is 0, so I don't really have a preference and can change that.
The upsert does start to become a bit hairy, if it becomes too much as we add more entity types, we'll use just a single entity id instead and use a stored procedure instead of the cascading delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did you think about DEFAULT -1
or DEFAULT 0
instead of needing to use COALESCE
in the queries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since artifact_id
is a foreign key to the artifacts
table, would that work transparently w/o requiring a dummy artifact with id of -1? Additionally, would ON DELETE CASCADE
still work? (Honest questions, that works with a NULL value, but I'm unsure if we defaulted to -1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question! I'm not sure -- I'm not used to using quite so many relational features (mostly living a NoSQL world), so I'm not sure. Testing, it appears that you're right, and if we set a REFERENCES
foreign key constraint, we can't have a DEFAULT value that doesn't exist in the corresponding table.
> CREATE TABLE test_base (id SERIAL PRIMARY KEY);
> CREATE TABLE test_derived (id SERIAL PRIMARY KEY, parent_id INTEGER NOT NULL REFERENCES test_base(id) ON DELETE CASCADE);
> INSERT INTO test_base VALUES (1);
> INSERT INTO test_derived (parent_id) VALUES (1)
> INSERT INTO test_derived (parent_id) VALUES (0)
insert or update on table "test_derived" violates foreign key constraint "test_derived_parent_id_fkey"
> ALTER TABLE test_derived ALTER COLUMN parent_id SET DEFAULT 0
> ALTER TABLE test_derived ADD COLUMN description TEXT
> INSERT INTO test_derived (description) VALUES ('Happy days')
insert or update on table "test_derived" violates foreign key constraint "test_derived_parent_id_fkey"
> ALTER TABLE test_derived ALTER COLUMN parent_id DROP NOT NULL
> ALTER TABLE test_derived ALTER COLUMN parent_id DROP DEFAULT
> INSERT INTO test_derived (description) VALUES ('Happy days')
select * FROM test_derived
+----+-----------+-------------+
| id | parent_id | description |
|----+-----------+-------------|
| 1 | 1 | <null> |
| 4 | <null> | Happy days |
+----+-----------+-------------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is to say that you're completely right about using COALESCE
.
In addition to referencing repository entities from the rule_evaluation_status table, we need to reference artifacts as well. To do that, we add a new column artifact_id to the rule_evaluation_table so that we can take advantage of cascaded deletes. Because this column can be nullable, we need to COALESCE() the column to a valid integer value, but one that's not valid for the model because otherwise comparing NULL that's passed in with NULL in the conflicting row never yields true. The same COALESCE() must be applied to the unique index.
…g artifact rules Changes the signature of createOrUpdateEvalStatus() to allow storing artifact rules when artifactID is passed. Because the number of parameters was growing, also changes the signature to accept a struct type so that the parameters are easily named.
In order to list rule evaluation status for both artifacts and repositories, let's add a new parameter to the ListRuleEvaluationStatusByPolicyIdParams method that specifies the entity type. Based on that type, the DB query performs a switch-case, either returning a matching instance of one entity type or all.
When mediator receives an event notification about a new artifact, the payload doesn't contain the artifact visibility and we'll need to retreive it by calling the GH API. In order to avoid passing a string around, let's add a new attribute to the ArtifactEventPayload struct
We'll have to get additional artifact information from github and we need a new API to do that.
…artifacts In order to handle artifact webhook events, we need to reuse several functions that were previously only available in the reconciler. This patch moves the functions to the container module, making them reusable.
…ceived via a webhook event Extends the executor so that artifact version is stored properly upon receiving a webhook message informing mediator about a new package. This allows, among others, the artifact information to be displayed to the clients. Fixes: #715
The ArtifactEventPayload protobuf message was a relic of how the artifacts were initially implemented by just looking at the github API. It had several problems, mainly that it didn't reflect how artifacts are stored in the DB, supported only a single tag. Let's converge on using VersionedArtifact which just contains an artifact and one version. This way we would be able to use the same struct in webhooks, init and reconciler.
e689009
to
1bb3737
Compare
eval_status = $5, | ||
details = $6, | ||
) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW()) | ||
ON CONFLICT(policy_id, repository_id, COALESCE(artifact_id, 0), entity, rule_type_id) DO UPDATE SET |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't we want to coalesce repository_id
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There would be no harm in coalescing, but currently repository_id cannot be NULL I think, at least the code doesn't ever pass NULL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add a NOT NULL
constraint (possibly including a DEFAULT
) for these columns, with a comment that NULL
doesn't play nicely with unique indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, NOT NULL
for what exactly? The artifact_id /is/ NULL is the DB for entities that are not artifacts, the coalesce works only for a) the unique index and b) detecting conflicts.
The schema of the artifact_signature rule_type was built with an artifact having a single tag and was built to be used through the now removed ArtifactEventPayload struct. Changing the schema allows us to specify the tags better, as a subset of container tags. Since evaluating the schema with the builtin ingester would be too complex and would require special-casing the individual keys and values, it also makes sense to switch to a new ingester created specifically for artifacts.
- change coalesce to 0 to play better with default go types - rename cli to client to distinguish from CLI - safer type cast of packageVersionName to string, to be followed up on - move deletion of old artifacts inside a transaction - guard against NULL params - more descriptive comments on how do createOrUpdateEvalStatusParams work
1bb3737
to
7e4c6ce
Compare
the latest push should just make the linter happy |
Since @yrobla and @evankanderson also reviewed the PR, I'm going to give them a bit to check if their concerns have been addressed. Thanks for the review and the time spent brainstorming @JAORMX ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bunch of small comments, but I won't be bent out of shape if they don't get addressed.
eval_status = $5, | ||
details = $6, | ||
) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW()) | ||
ON CONFLICT(policy_id, repository_id, COALESCE(artifact_id, 0), entity, rule_type_id) DO UPDATE SET |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add a NOT NULL
constraint (possibly including a DEFAULT
) for these columns, with a comment that NULL
doesn't play nicely with unique indexes?
// NewArtifactDataIngest creates a new artifact rule data ingest engine | ||
func NewArtifactDataIngest( | ||
_ *pb.ArtifactType, | ||
) (*Ingest, error) { | ||
return &Ingest{}, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this signature (just curious -- are we matching some other contract)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort of. It's more of a kludge because I found this part of mediator code hard to follow.
If you check NewRuleDataIngest
, you'll see that we're instantiating an ingester based on what ingester was configured in the rule_type
definition. The ingester configuration (which might be e.g. API endpoint for the REST ingester) is represented as a protobuf message, in this case called Artifact
, but because currently the artifact ingester has no configuration, the structure is empty and because it's empty, it's unused in the NewArtifactDataIngest
"constructor" so I used the empty assignment name.
But I thought (and this is the kludge because I had a hard time deconstructing the code) that passing the configuration object, even if empty around might make it easier for a future developer to just change the protobuf message by adding a field there and changing the NewArtifactDataIngest
function.
Hope it makes sense.
for _, tag := range cfg.Tags { | ||
if slices.Contains(versionedArtifact.Version.Tags, tag) { | ||
return true | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not required, but consider using the k8s sets
library (already in use elsewhere) for semantic clarity on what the loop intends:
haveTags := sets.New(versionedArtifact.Version.Tags)
return haveTags.HasAny(cfg.Tags...)
Note that your semantics here are a bit strange -- empty would be a HasAll
, but if you have at least one element, then you're using HasAny
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the part about HasAll
? Otherwise suggestion taken. By the way, we're debating the semantics of how do we match all tags/all names in the epic, so this might change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out that I don't like fancy relational features, even when I understand their intended value. 😁
This pull request does two things in fact:
extends the DB schema and the associated Go code to be able to reference
artifacts when storing rule evaluation status. This allows to return policy
results that pertain to artifacts to the clients
extends the executor code that handles artifact published events to also
store artifact versions so that newly published artifacts can be evaluated
There will also be a follow up that will improve the
EntityInfo map[string]string
attribute of theRuleEvaluationStatus
message to sendartifact-specific extended information to the clients, but I didn't want to
delay this PR any longer.
Arguably this PR could have been split into two, one adding just the support
for referencing artifacts in the
rule_evaluation_status
table and the otherextending the executor code to store artifact information, but I just coded
and tested everything together and it made for easier testing to test with
newly released artifacts.
There's multiple patches which should hopefully make the review easier, I also
tried to write reasonably descriptive commit messages.