Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve artifact support by storing artifact_id in the rule evaluation table and store artifact information during webhook processing #760

Merged
merged 11 commits into from
Aug 30, 2023

Conversation

jhrozek
Copy link
Contributor

@jhrozek jhrozek commented Aug 25, 2023

This pull request does two things in fact:

  1. extends the DB schema and the associated Go code to be able to reference
    artifacts when storing rule evaluation status. This allows to return policy
    results that pertain to artifacts to the clients

  2. extends the executor code that handles artifact published events to also
    store artifact versions so that newly published artifacts can be evaluated

There will also be a follow up that will improve the EntityInfo map[string]string attribute of the RuleEvaluationStatus message to send
artifact-specific extended information to the clients, but I didn't want to
delay this PR any longer.

Arguably this PR could have been split into two, one adding just the support
for referencing artifacts in the rule_evaluation_status table and the other
extending the executor code to store artifact information, but I just coded
and tested everything together and it made for easier testing to test with
newly released artifacts.

There's multiple patches which should hopefully make the review easier, I also
tried to write reasonably descriptive commit messages.

@jhrozek
Copy link
Contributor Author

jhrozek commented Aug 25, 2023

Fixes: #715 (there's one item in the 715 todo list which I'll split into a separate ticket as this PR is already quite big)

lukehinds
lukehinds previously approved these changes Aug 25, 2023
Copy link
Contributor

@lukehinds lukehinds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code look good to me, but I would like @yrobla to take a look who has history with the implementation

Copy link
Contributor

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upsert logic seems fine to me. But, I'm confused by the in-code signature verification, that should be done at the policy level instead AFAIU. Let's catch up on this.

internal/engine/executor.go Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
@jhrozek
Copy link
Contributor Author

jhrozek commented Aug 28, 2023

The upsert logic seems fine to me. But, I'm confused by the in-code signature verification, that should be done at the policy level instead AFAIU. Let's catch up on this.

So we did have a rather long discussion with @JAORMX and @yrobla and agreed that since we're storing the data about signatures as part of the artifact version, we can just keep doing that, but we'll make the policy evaluation simpler and more effective by checking the signature and storing the result when we fetch the artifact version and then the policy will make a decision based on that -- currently the builtin policy ValidateSignature also grabs the signatures again during evaluation again after we've grabbed and stored them earlier in the db.

Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, I understand there's a little rework going on here at the moment.

@@ -240,7 +240,7 @@ CREATE UNIQUE INDEX users_organization_id_username_lower_idx ON users (organizat
CREATE UNIQUE INDEX repositories_repo_id_idx ON repositories(repo_id);
CREATE UNIQUE INDEX policies_group_id_policy_name_idx ON policies(provider, group_id, name);
CREATE UNIQUE INDEX rule_type_idx ON rule_type(provider, group_id, name);
CREATE UNIQUE INDEX rule_evaluation_status_results_idx ON rule_evaluation_status(policy_id, repository_id, entity, rule_type_id);
CREATE UNIQUE INDEX rule_evaluation_status_results_idx ON rule_evaluation_status(policy_id, repository_id, COALESCE(artifact_id, -1), entity, rule_type_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOT for this PR, but I'm feeling a little itchy on the schema here -- I'm not sure what to suggest yet, but the Upsert (which I introduced) is starting to feel like the beginnings of a hairball. On the other hand, I think "Upsert" is absolutely the right semantic if we want to know the latest rule evaluation result -- I think it's more more about the possible 4-way ID key with NULLs.

Would using DEFAULT -1 for artifact_id in the database simplify the rest, or just make things worse?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think Ozz or Yolanda warned me about this, to be fair)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code below, it looks like we should be able to use 0 as the sentinel value as well as -1. I have a slight preference for using 0, as that aligns with Go and Proto's default values. It does mean that we need to start our artifact ID numbering at 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting at 1 should already be the case (Postgres SERIAL starts as 1 unless you tell it otherwise using CREATE SEQUENCE IIRC). I chose -1 specifically because it's clearly invalid, but so is 0, so I don't really have a preference and can change that.

The upsert does start to become a bit hairy, if it becomes too much as we add more entity types, we'll use just a single entity id instead and use a stored procedure instead of the cascading delete.

Copy link
Member

@evankanderson evankanderson Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did you think about DEFAULT -1 or DEFAULT 0 instead of needing to use COALESCE in the queries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since artifact_id is a foreign key to the artifacts table, would that work transparently w/o requiring a dummy artifact with id of -1? Additionally, would ON DELETE CASCADE still work? (Honest questions, that works with a NULL value, but I'm unsure if we defaulted to -1).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question! I'm not sure -- I'm not used to using quite so many relational features (mostly living a NoSQL world), so I'm not sure. Testing, it appears that you're right, and if we set a REFERENCES foreign key constraint, we can't have a DEFAULT value that doesn't exist in the corresponding table.

> CREATE TABLE test_base (id SERIAL PRIMARY KEY);
> CREATE TABLE test_derived (id SERIAL PRIMARY KEY, parent_id INTEGER NOT NULL REFERENCES test_base(id) ON DELETE CASCADE);
> INSERT INTO test_base VALUES (1);
> INSERT INTO test_derived (parent_id) VALUES (1)
> INSERT INTO test_derived (parent_id) VALUES (0)
insert or update on table "test_derived" violates foreign key constraint "test_derived_parent_id_fkey"
> ALTER TABLE test_derived ALTER COLUMN parent_id SET DEFAULT 0
> ALTER TABLE test_derived ADD COLUMN description TEXT
> INSERT INTO test_derived (description) VALUES ('Happy days')
insert or update on table "test_derived" violates foreign key constraint "test_derived_parent_id_fkey"
> ALTER TABLE test_derived ALTER COLUMN parent_id DROP NOT NULL
> ALTER TABLE test_derived ALTER COLUMN parent_id DROP DEFAULT
> INSERT INTO test_derived (description) VALUES ('Happy days')
select * FROM test_derived
+----+-----------+-------------+
| id | parent_id | description |
|----+-----------+-------------|
| 1  | 1         | <null>      |
| 4  | <null>    | Happy days  |
+----+-----------+-------------+

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is to say that you're completely right about using COALESCE.

internal/engine/executor.go Outdated Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
internal/engine/executor.go Show resolved Hide resolved
internal/engine/executor.go Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
In addition to referencing repository entities from the
rule_evaluation_status table, we need to reference artifacts as well.

To do that, we add a new column artifact_id to the
rule_evaluation_table so that we can take advantage of cascaded deletes.

Because this column can be nullable, we need to COALESCE() the column
to a valid integer value, but one that's not valid for the model because
otherwise comparing NULL that's passed in with NULL in the conflicting
row never yields true. The same COALESCE() must be applied to the unique
index.
…g artifact rules

Changes the signature of createOrUpdateEvalStatus() to allow storing
artifact rules when artifactID is passed. Because the number of
parameters was growing, also changes the signature to accept a struct
type so that the parameters are easily named.
In order to list rule evaluation status for both artifacts and
repositories, let's add a new parameter to the
ListRuleEvaluationStatusByPolicyIdParams method that specifies the
entity type. Based on that type, the DB query performs a switch-case,
either returning a matching instance of one entity type or all.
When mediator receives an event notification about a new artifact, the
payload doesn't contain the artifact visibility and we'll need to
retreive it by calling the GH API. In order to avoid passing a string
around, let's add a new attribute to the ArtifactEventPayload struct
We'll have to get additional artifact information from github and we
need a new API to do that.
…artifacts

In order to handle artifact webhook events, we need to reuse several
functions that were previously only available in the reconciler. This
patch moves the functions to the container module, making them reusable.
…ceived via a webhook event

Extends the executor so that artifact version is stored properly upon
receiving a webhook message informing mediator about a new package. This
allows, among others, the artifact information to be displayed to the
clients.

Fixes: #715
The ArtifactEventPayload protobuf message was a relic of how the
artifacts were initially implemented by just looking at the github API.
It had several problems, mainly that it didn't reflect how artifacts are
stored in the DB, supported only a single tag.

Let's converge on using VersionedArtifact which just contains an
artifact and one version. This way we would be able to use the same
struct in webhooks, init and reconciler.
database/migrations/000001_init.up.sql Show resolved Hide resolved
eval_status = $5,
details = $6,
) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW())
ON CONFLICT(policy_id, repository_id, COALESCE(artifact_id, 0), entity, rule_type_id) DO UPDATE SET
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't we want to coalesce repository_id as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would be no harm in coalescing, but currently repository_id cannot be NULL I think, at least the code doesn't ever pass NULL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a NOT NULL constraint (possibly including a DEFAULT) for these columns, with a comment that NULL doesn't play nicely with unique indexes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, NOT NULL for what exactly? The artifact_id /is/ NULL is the DB for entities that are not artifacts, the coalesce works only for a) the unique index and b) detecting conflicts.

The schema of the artifact_signature rule_type was built with an
artifact having a single tag and was built to be used through the now
removed ArtifactEventPayload struct. Changing the schema allows us to
specify the tags better, as a subset of container tags.

Since evaluating the schema with the builtin ingester would be too
complex and would require special-casing the individual keys and values,
it also makes sense to switch to a new ingester created specifically for
artifacts.
- change coalesce to 0 to play better with default go types
- rename cli to client to distinguish from CLI
- safer type cast of packageVersionName to string, to be followed up on
- move deletion of old artifacts inside a transaction
- guard against NULL params
- more descriptive comments on how do createOrUpdateEvalStatusParams work
@jhrozek
Copy link
Contributor Author

jhrozek commented Aug 29, 2023

the latest push should just make the linter happy

JAORMX
JAORMX previously approved these changes Aug 29, 2023
@jhrozek
Copy link
Contributor Author

jhrozek commented Aug 29, 2023

Since @yrobla and @evankanderson also reviewed the PR, I'm going to give them a bit to check if their concerns have been addressed. Thanks for the review and the time spent brainstorming @JAORMX !

evankanderson
evankanderson previously approved these changes Aug 29, 2023
Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of small comments, but I won't be bent out of shape if they don't get addressed.

eval_status = $5,
details = $6,
) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW())
ON CONFLICT(policy_id, repository_id, COALESCE(artifact_id, 0), entity, rule_type_id) DO UPDATE SET
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a NOT NULL constraint (possibly including a DEFAULT) for these columns, with a comment that NULL doesn't play nicely with unique indexes?

internal/engine/executor.go Show resolved Hide resolved
internal/engine/executor.go Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
internal/engine/executor.go Outdated Show resolved Hide resolved
Comment on lines +40 to +45
// NewArtifactDataIngest creates a new artifact rule data ingest engine
func NewArtifactDataIngest(
_ *pb.ArtifactType,
) (*Ingest, error) {
return &Ingest{}, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this signature (just curious -- are we matching some other contract)?

Copy link
Contributor Author

@jhrozek jhrozek Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort of. It's more of a kludge because I found this part of mediator code hard to follow.

If you check NewRuleDataIngest, you'll see that we're instantiating an ingester based on what ingester was configured in the rule_type definition. The ingester configuration (which might be e.g. API endpoint for the REST ingester) is represented as a protobuf message, in this case called Artifact, but because currently the artifact ingester has no configuration, the structure is empty and because it's empty, it's unused in the NewArtifactDataIngest "constructor" so I used the empty assignment name.

But I thought (and this is the kludge because I had a hard time deconstructing the code) that passing the configuration object, even if empty around might make it easier for a future developer to just change the protobuf message by adding a field there and changing the NewArtifactDataIngest function.

Hope it makes sense.

internal/engine/ingester/artifact/artifact.go Outdated Show resolved Hide resolved
Comment on lines 107 to 111
for _, tag := range cfg.Tags {
if slices.Contains(versionedArtifact.Version.Tags, tag) {
return true
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required, but consider using the k8s sets library (already in use elsewhere) for semantic clarity on what the loop intends:

haveTags := sets.New(versionedArtifact.Version.Tags)
return haveTags.HasAny(cfg.Tags...)

Note that your semantics here are a bit strange -- empty would be a HasAll, but if you have at least one element, then you're using HasAny.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the part about HasAll ? Otherwise suggestion taken. By the way, we're debating the semantics of how do we match all tags/all names in the epic, so this might change.

internal/engine/ingester/ingester.go Show resolved Hide resolved
pkg/container/container.go Show resolved Hide resolved
@jhrozek jhrozek dismissed stale reviews from evankanderson and JAORMX via 2217c4f August 29, 2023 19:43
Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that I don't like fancy relational features, even when I understand their intended value. 😁

@jhrozek jhrozek merged commit c8c4e92 into mindersec:main Aug 30, 2023
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants