Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add git ingester #775

Merged
merged 1 commit into from
Sep 1, 2023
Merged

Add git ingester #775

merged 1 commit into from
Sep 1, 2023

Conversation

JAORMX
Copy link
Contributor

@JAORMX JAORMX commented Aug 28, 2023

The git ingester will take a specific branch, clone it in a temporary
directory, and set it as a context for the policy to evaluate on top
of it.

The intention is to be able to run policies on git contents.

@JAORMX JAORMX changed the title Add git ingester WIP: Add git ingester Aug 28, 2023
@JAORMX JAORMX force-pushed the git-ingester branch 3 times, most recently from 1ff0075 to 85a06f8 Compare August 28, 2023 14:28
Copy link
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments inline. I'm a bit confused about the tracking. Also cloning on every ingest might get expensive fast, should we think about implementing a timeout or LRU?

// Git is the engine for a rule type that uses git data ingest
type Git struct {
accessToken string
fileTrack map[string]struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could the ingester ever be called concurrently? (thinking about syncMap)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there is a single instance of the ingester being executed at a time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for anyone else confused like me, we do instantiate a per-rule instance for every rule which in turns instantiates a new evaluator per rule (thanks @JAORMX for the explanation on slack)

internal/engine/ingester/git/git.go Outdated Show resolved Hide resolved
internal/engine/ingester/git/git.go Outdated Show resolved Hide resolved
Copy link
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more comment about auth - didn't see it earlier, sorry. I guess in order for this PR to not be WIP, we also need to a new Message to mediator.proto that represents the git ingester along with the plumbing to instantiate it?

internal/engine/ingester/git/git.go Outdated Show resolved Hide resolved
evankanderson
evankanderson previously approved these changes Aug 29, 2023
Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where you're headed with this yet, but the git clone part looks reasonable.

Comment on lines 66 to +68
// Eval calls the jq library to evaluate the rule
func (jqe *Evaluator) Eval(ctx context.Context, pol map[string]any, obj any) error {
func (jqe *Evaluator) Eval(ctx context.Context, pol map[string]any, res *engif.Result) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a type assertion that Evaluator fulfils engif.Evaluator?

Suggested change
// Eval calls the jq library to evaluate the rule
func (jqe *Evaluator) Eval(ctx context.Context, pol map[string]any, obj any) error {
func (jqe *Evaluator) Eval(ctx context.Context, pol map[string]any, res *engif.Result) error {
var _ engif.Evaluator = (*Evaluator)(nil)
// Eval calls the jq library to evaluate the rule
func (jqe *Evaluator) Eval(ctx context.Context, pol map[string]any, res *engif.Result) error {

internal/engine/ingester/git/git.go Outdated Show resolved Hide resolved
internal/engine/ingester/git/git.go Outdated Show resolved Hide resolved
internal/engine/ingester/git/git.go Outdated Show resolved Hide resolved
Comment on lines +78 to +96
// We clone to the memfs go-billy filesystem driver, which doesn't
// allow for direct access to the underlying filesystem. This is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing -- I think you mean "doesn't store any contents on the local filesystem". The fs object can still be used to read files from the repo.

One future enhancement item might be to be able to further filter the files included in the fetch, so we don't need to allocate hundreds of megabytes for large repos. I think a TODO is sufficient for that case, though.

internal/engine/ingester/git/git.go Outdated Show resolved Hide resolved
}

return &engif.Result{
Object: struct{}{},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should Object be nil here or r, rather than an empty struct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm not sure I understand the semantics of Result, I guess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, too, is this supposed to ensure that evaluators that consume an Object will not throw an error, but rather never evaluate? I would say that nil might be more explicit and safer (thinking about an jq evaluator that makes sure that an input does NOT have a path, I think this empty object would pass that evaluator..)

lukehinds
lukehinds previously approved these changes Aug 30, 2023
Copy link
Contributor

@lukehinds lukehinds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 from me. I spoke with @JAORMX and we discussed allowing multiple ways of accessing files, as cloning could be costly for our ingress. @JAORMX added #793 as a countermeasure.

@JAORMX JAORMX dismissed stale reviews from lukehinds and evankanderson via 42a9691 September 1, 2023 06:53
@JAORMX JAORMX force-pushed the git-ingester branch 2 times, most recently from 42a9691 to f5b2ff5 Compare September 1, 2023 08:19
@JAORMX JAORMX changed the title WIP: Add git ingester Add git ingester Sep 1, 2023
Copy link
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits and questions, but overall looks good. Thanks for adding the tests!

internal/engine/eval/rego/eval.go Show resolved Hide resolved
// Fs is the filesystem that was created as a result of the ingestion. This
// is normally used by the evaluator to do rule evaluation. The filesystem
// may be a git repo, or a memory filesystem.
Fs billy.Filesystem
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of evaluator will be plugged into the git ingestor that consumes the FS? Evaluating vulnerabilities comes to mind, but what I wondered when I read this code is if we want to be able to express a contract between an ingestor and evaluator and throw an error on parsing the rule that ingestor X cannot work with evaluator Y rather than throwing a runtime error saying that an object or a FS is missing during evaluation..

// in different calls as opposed to having to set it in the rule type.
type IngesterConfig struct {
Branch string `json:"branch" yaml:"branch" mapstructure:"branch"`
CloneURL string `json:"clone_url" yaml:"clone_url" mapstructure:"clone_url"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the size limit option would come at a later point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's right

internal/engine/ingester/git/config.go Outdated Show resolved Hide resolved
}

return &engif.Result{
Object: struct{}{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, too, is this supposed to ensure that evaluators that consume an Object will not throw an error, but rather never evaluate? I would say that nil might be more explicit and safer (thinking about an jq evaluator that makes sure that an input does NOT have a path, I think this empty object would pass that evaluator..)

}

// If the entity is a repository get it from the entity
// else, get it from the configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't seem to match what the code does (IOW, the code prefers the config)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw what is the use-case for git ingester w/o a repository?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a solid answer right now, but I'm guessing we could do something interesting with comparing contents from artifacts with something in a git repo.

gi := gitengine.NewGitIngester(&pb.GitType{}, "")
got, err := gi.Ingest(context.Background(), &pb.Artifact{}, map[string]any{
"clone_url": "https://github.com/octocat/Hello-World.git",
"branch": "test",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to assert that we cloned the right branch here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That branch has other contents, that the master branch doesn't have, maybe it would be sufficient to check that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but it's fine to do it later

jhrozek
jhrozek previously approved these changes Sep 1, 2023
Copy link
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually let me approve this so that you're not blocked (I'd be curious about the questions I had either way)

The git ingester will take a specific branch, clone it in a temporary
directory, and set it as a context for the policy to evaluate on top
of it.

The intention is to be able to run policies on git contents.
Copy link
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, there is some follow-up work to do I guess, but this PR can go in as-is

@JAORMX JAORMX merged commit 5cea6d9 into main Sep 1, 2023
14 checks passed
@JAORMX JAORMX deleted the git-ingester branch September 1, 2023 10:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants