Add batch creation logic for the reminder service #3413

Vyom-Yadav · 2024-05-23T16:00:20Z

Summary


Part - 2 #2262

Algorithm:

Generate a random v4 uuid as starting point.
Iterate and pick repos until the end of the table is reached.
After end of table is reached, start iterating from the first row in the table.

Only the starting point is random; batch creation becomes sequential once the end of the table is reached. If the iteration point were selected randomly again after reaching the end of the table, coverage of all repos would not be guaranteed.
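The iteration described above can be sketched in memory as follows. This is only an illustration under stated assumptions: short strings stand in for repository UUIDs, a sorted slice stands in for the table, and `nextBatch` is a hypothetical helper, not a function from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// nextBatch returns up to size IDs strictly greater than cursor from the
// sorted slice ids, plus the cursor to use for the next call. When the end
// of the slice is reached, the returned cursor is the empty string, so the
// next call starts again from the first row.
func nextBatch(ids []string, cursor string, size int) ([]string, string) {
	if size <= 0 {
		return nil, cursor
	}
	// Index of the first ID strictly greater than the cursor.
	start := sort.Search(len(ids), func(i int) bool { return ids[i] > cursor })
	end := start + size
	if end >= len(ids) {
		// Reached the end of the table: wrap around on the next call.
		return ids[start:], ""
	}
	return ids[start:end], ids[end-1]
}

func main() {
	// Stand-ins for sorted repository UUIDs.
	ids := []string{"1a", "3c", "5e", "7g", "9i"}
	// Stand-in for the random starting point; it need not exist in the table.
	cursor := "4d"
	for i := 0; i < 4; i++ {
		var batch []string
		batch, cursor = nextBatch(ids, cursor, 2)
		fmt.Println(batch, cursor)
	}
}
```

Because only the first cursor is random and every later cursor comes from the previous batch, each ID is visited once per pass over the table, which is the coverage property the description relies on.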

coveralls · 2024-05-23T16:08:50Z

Changes unknown when pulling dc04b46 on Vyom-Yadav:addReminderLogic into stacklok:main.

Vyom-Yadav · 2024-05-25T09:46:26Z

database/migrations/000060_repo_reminder.up.sql

+ALTER TABLE repositories ADD COLUMN reminder_last_sent TIMESTAMP;
+
+CREATE EXTENSION IF NOT EXISTS tsm_system_rows;


I'm not sure if enabling extensions should be part of migration files.

This is a reasonable way to do it, but I'd prefer to have less postgres magic rather than more. It's not clear that we need to fetch a random valid repository at startup -- we can simply start with a random UUID and march forward on valid repository IDs from there.

If we did need to get a random repository, I'd be inclined to simply use:

SELECT * FROM repositories WHERE id > gen_random_uuid() LIMIT 1;

(plus a little bit to fetch the lowest-numbered repository if that query returns zero rows.)

suggestion: I think we can get rid of the GetRandomRepository query by calculating a random UUID on the application side and passing it straight to ListEligibleRepositoriesAfterID. Added benefits would be:

we have one less query

one less step in the process

the whole process becomes deterministic (and easier to test)

If we want to start at a random point, fetching a valid random repo is required. If we can start at any point, then sequential iteration would do the job.

If we don't fetch a valid repo and instead generate a random uuid, we can end up with a generated uuid that is out of range, and we then have to either generate a uuid again or start from the beginning (which defeats the purpose of random generation).

It all comes down to whether we want to start from a valid random point or not.

Vyom-Yadav · 2024-05-25T09:54:44Z

sqlc.yaml

+  - db_type: 'pg_catalog.interval'
+    go_type: 'github.com/jackc/pgtype.Interval'
+  - db_type: 'pg_catalog.interval'
+    go_type: 'github.com/jackc/pgtype.NullInterval'
+    nullable: true


See: sqlc-dev/sqlc#429 (comment)

Another option would be to use make_interval(secs => {{ duration.Seconds() }})

Another option would simply be to pass a string through here, rather than a time.Duration. Given that this is a configuration constant, I'd prefer to pull in fewer dependencies to support it.

rdimitrov

I missed if it was discussed but should we also have a way to disable reminder for certain repos by filtering repos out of the batch creation logic?

Vyom-Yadav · 2024-05-30T08:42:48Z

I missed if it was discussed but should we also have a way to disable reminder for certain repos by filtering repos out of the batch creation logic?

User configurable option?

Why would someone like to do that? To preserve their rate limit? I don't think reminder should be exposed to end users; it's just background reconciliation to keep the system up to date.

blkt

Small suggestion as I keep looking at the code.

blkt · 2024-05-30T19:57:29Z

database/query/repositories.sql

+-- name: GetRandomRepository :one
+SELECT * FROM repositories
+TABLESAMPLE SYSTEM_ROWS(1);
+


Hi @Vyom-Yadav, thank you for your great work! 🙇

Although I don't think that adding tsm_system_rows would be a problem, I'm not convinced it's the right tool in this scenario.

If I got the idea right, here we're trying to get a row at random to use its ID as cursor, and all subsequent queries would be based on WHERE r.id > $1, which is OK and is most likely guaranteed to go by index.

I was wondering, why not get rid of the additional dependency on tsm_system_rows and simply generate one "first" random uuid on the application side and then use that to start? If I got the procedure right, we would also be able to remove this statement, as we control randomness from the outside.

"first" random uuid on the application side and then use that to start?

I don't know of any way to generate a random uuid within some range (and I'd rather not play with that). If we generate on the application side, we can end up with an out-of-range uuid, which leads to simple sequential iteration.

I believe we're using v4 UUIDs, which are random, except for about 7 bits in the middle of the string. (It looks like entity_event.go uses v1 UUIDs...). In particular, the default for the id field on repositories is gen_random_uuid(), which is a v4 UUID.
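For reference, here is a sketch of how a v4 UUID is built, using only the Go standard library. Under RFC 4122, six bits are fixed (four version bits and two variant bits) and the remaining 122 bits are random. The function names are hypothetical; the project itself would more likely call a UUID library.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newV4 returns 16 random bytes with the RFC 4122 version and variant bits set.
func newV4() ([16]byte, error) {
	var u [16]byte
	if _, err := rand.Read(u[:]); err != nil {
		return u, err
	}
	u[6] = (u[6] & 0x0f) | 0x40 // version 4 in the high nibble of byte 6
	u[8] = (u[8] & 0x3f) | 0x80 // variant "10" in the top two bits of byte 8
	return u, nil
}

// format renders the UUID in the canonical 8-4-4-4-12 hex layout.
func format(u [16]byte) string {
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	u, err := newV4()
	if err != nil {
		panic(err)
	}
	fmt.Println(format(u))
}
```

The leading bytes are fully random, so, assuming UUIDs are ordered byte by byte, a freshly generated v4 value should land at a roughly uniform position among existing v4 IDs.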

evankanderson

Good to see you back! I'm going to encourage you to be a little less precise here in favor of ensuring there's an upper bound on the amount of work that we put into the system at any given time, which I think is the more critical piece for background operation.

evankanderson · 2024-05-30T13:07:58Z

database/migrations/000060_repo_reminder.up.sql

+ALTER TABLE repositories ADD COLUMN reminder_last_sent TIMESTAMP;
+
+CREATE EXTENSION IF NOT EXISTS tsm_system_rows;



evankanderson · 2024-05-30T22:14:04Z

database/query/repositories.sql

+SELECT r.* FROM repositories r
+  INNER JOIN rule_evaluations re ON re.repository_id = r.id
+  INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
+WHERE r.id > $1
+GROUP BY r.id
+HAVING MIN(rde.last_updated) + sqlc.arg('min_elapsed')::interval < NOW()
+ORDER BY r.id
+LIMIT sqlc.narg('limit')::bigint;


I can see where this is coming from -- I'm also worried about the cost of this query on the database side. Even with a fairly small amount of data, the query ends up being a lot more expensive than a full table scan (e.g. full scans of the tables are about 40-ish cost in this example).

Digging in a bit, I ran explain on a small database:

EXPLAIN SELECT r.* FROM repositories r
  INNER JOIN rule_evaluations re ON re.repository_id = r.id
  INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE r.id > '8e4d7b85-2022-4d95-8d1c-96d2097d73d2'::uuid
GROUP BY r.id
HAVING MIN(rde.last_updated) + interval '1 hour' < NOW()
ORDER BY r.id
LIMIT 10;
                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Limit  (cost=57.74..57.76 rows=10 width=338)
   ->  Sort  (cost=57.74..57.80 rows=24 width=338)
         Sort Key: r.id
         ->  HashAggregate  (cost=55.94..57.22 rows=24 width=338)
               Group Key: r.id
               Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
               ->  Hash Join  (cost=32.24..54.65 rows=258 width=346)
                     Hash Cond: (rde.rule_eval_id = re.id)
                     ->  Seq Scan on rule_details_eval rde  (cost=0.00..17.80 rows=780 width=24)
                     ->  Hash  (cost=30.12..30.12 rows=169 width=354)
                           ->  Hash Join  (cost=13.66..30.12 rows=169 width=354)
                                 Hash Cond: (re.repository_id = r.id)
                                 ->  Seq Scan on rule_evaluations re  (cost=0.00..15.10 rows=510 width=32)
                                 ->  Hash  (cost=12.75..12.75 rows=73 width=338)
                                       ->  Seq Scan on repositories r  (cost=0.00..12.75 rows=73 width=338)
                                             Filter: (id > '8e4d7b85-2022-4d95-8d1c-96d2097d73d2'::uuid)

The fundamental limit here seems to be that since we don't have an index on rde.last_updated, we'll always incur a sequential scan on rule_details_eval (one of our larger tables).

It's possible that on a large database we'd see different query planning, but I worry that the layers of indirection and aggregation will fundamentally make this query hard to plan efficiently. In the worst case, we might have a database with 100M rows where, at any given query, only 100 rule_details_eval rows are older than our threshold. In this scenario, we'd end up needing to scan the 100M rows on each query for work, finding 10 repositories, and then repeating.

My gut feeling is that it's more important to have a steady amount of load on the database (many light queries) than to get exactly as much work as possible in any particular iteration. One way to do this would be to use a sub-select to limit the number of repositories considered in the query (e.g. SELECT ... FROM (SELECT * FROM repositories WHERE id > $1 ORDER BY id LIMIT 50) AS r ...). Another option would simply be to do the work outside of a single SQL query, e.g. SELECT * FROM repositories WHERE id > $1 LIMIT 30, and then process each repository after that.

Using the sub-select with limit on repositories with my sample query and a limit of 20, I see a 25-30% reduction in query time, but I suspect the behavior is stronger with larger databases.

Given that we have reminder_last_sent, it might also be reasonable to simply use that -- we'd have a one-time extra batch of revisits when this rolls out, and then the query could be:

SELECT * FROM (SELECT * FROM repositories WHERE id > $1 LIMIT 50) WHERE reminder_last_sent < NOW() - sqlc.arg('min_elapsed')::interval ORDER BY id;

Agreed on the statement analysis. The current query (without the sub-select) guarantees that we get limit rows (if they are there). With a sub-select, selected rows might not be eligible i.e. last_updated was recent.

Agreed that this is a single time-consuming query rather than a steady load of light queries. To keep things (including mocking) simpler on the application side, I'd go with a sub-query rather than two separate queries, i.e. one for selection and one for filtration.

Just using reminder_last_sent isn't a good parameter IMO. We can have a case where a reminder was sent 24h ago but the repo was recently updated (edge based), so this would result in extra load on the server (reconciling the repo).

evankanderson · 2024-05-30T22:37:22Z

internal/reminder/reminder.go

 	}
-	err := r.restoreCursorState(ctx)
+
+	randomRepo, err := r.store.GetRandomRepository(ctx)


Why not simply r.repositoryCursor = uuid.Random()?

It may generate an out-of-range uuid, which will result in sequential iteration. There is nothing wrong with sequential iteration, but I coded it in a way in which we start from a random valid uuid (theoretically speaking, it is possible to get the first uuid as a random uuid from the db which will result in sequential iteration)

This point of generating a random valid uuid may not be that important, so I'm willing to gen random uuid on application side if that's better (in terms of simplicity)

I'm not sure what an out-of-range uuid is -- I'd meant uuid.NewRandom(), which generates a valid (v4) UUID.

So, there is one edge case here: we may generate a UUID which is larger than any of the UUIDs in the database (at which point, we should start at the sequentially-first UUID). I believe that we're using v4 (random) UUIDs, so we should end up with a roughly even distribution of UUIDs when choosing at random. (If we were using time-sorted UUIDs, picking a random UUID would likely either start at the beginning or the end of the sequence.)

evankanderson · 2024-05-30T22:37:56Z

internal/reminder/reminder.go

 	if err != nil {
-		// Non-fatal error, if we can't restore the cursor state, we'll start from scratch.
-		logger.Error().Err(err).Msg("error restoring cursor state")
+		return nil, err


This means reminder will exit with an empty database (say, when setting up for the first time).

Yes, it should be:

if err != nil && !errors.Is(err, sql.ErrNoRows) {
	return nil, err
}

evankanderson · 2024-05-30T22:39:51Z

internal/reminder/reminder.go

-		"repoListCursor":    r.repoListCursor,
+	// Update the reminder_last_sent for each repository to export as metrics
+	for _, repo := range repos {
+		logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)


Can you include both the repo and the old value of reminder_last_sent as structured fields in the logs? This could be an easy way to see what the actual delay on updates is.

evankanderson · 2024-05-30T22:57:48Z

internal/reminder/reminder.go

+		Limit: sql.NullInt64{
+			Int64: int64(r.cfg.RecurrenceConfig.BatchSize),
+			Valid: true,
+		},


Why make this nullable?

Bad copy paste, changed it.

evankanderson · 2024-05-30T23:00:46Z

internal/reminder/reminder.go

+	// Only fetch additional repositories if we are under the limit
+	if len(repos) < r.cfg.RecurrenceConfig.BatchSize {
+		additionalRepos, err = r.getAdditionalRepos(ctx, repos)
+		if err != nil {
+			return nil, err
+		}
+
+		repos, intersectionPoint = r.mergeRepoBatch(repos, additionalRepos)
 	}


This feels complicated -- it feels like we should call ListEligibleRepositoriesAfterID and then use the last returned element of that query to set the cursor for the next run (checking RepositoryExistsAfterID and setting the cursor to the zero UUID if no more repos exist).

evankanderson · 2024-05-30T23:03:30Z

internal/reminder/reminder.go

+	// There may be an intersection between the two sets
+	// If there is an intersection, then we need to update the cursor to the last fetched
+	// non-common repository


I'm not sure I understand how this happens, unless there are only e.g. 4 repos eligible in the whole database.

evankanderson · 2024-05-30T23:10:07Z

internal/reminder/reminder.go

+	intersectionPoint := -1
+	var additionalRepos []db.Repository
+
+	// Only fetch additional repositories if we are under the limit


I don't think it's a requirement that we find exactly BatchSize repos, just that we don't exceed BatchSize repos in one pass. We're trying to upper-bound the amount of background work added to the system.

I think this will get a lot simpler if we just do one pass:

repos, err := r.store.ListEligibleRepositoriesAfterID(ctx, {...})
if err != nil {
	return nil, err
}
// Don't actually ignore the error here...
if len(repos) == 0 || !RepositoryExistsAfterID(ctx, repos[len(repos)-1].ID) {
	r.repositoryCursor = uuid.UUID{}
} else {
	r.repositoryCursor = repos[len(repos)-1].ID
}
return repos, nil

evankanderson · 2024-05-30T23:14:46Z

internal/reminder/reminder.go

+	// non-common repository
+	intersectionPoint := -1
+	for i, repo := range additionalRepos {
+		if reposSet.Has(repo) {


This is using repo as a hashable object, but we've done two different queries, so the repos might differ in some minutiae like updated_at and be included in the batch twice. Given that we know that the two lists are sorted, we could also march through additionalRepos comparing less-than the first item in repos and find the intersection point without needing to use the set.

But, as mentioned, I think we can avoid a lot of this code if we're willing to not exactly fill each reminder batch.

Ahh, I missed this, but now we fetch only once, so this isn't required.

Vyom-Yadav · 2024-06-03T06:48:34Z

@evankanderson, I updated the PR to fetch only once, and batch_size will only be used as the upper limit. The only thing remaining is to decide how the random starting-point UUID is generated.

Vyom-Yadav · 2024-06-03T06:53:41Z

pkg/api/protobuf/go/minder/v1/minder.pb.gw.go

-	conn, err := grpc.DialContext(ctx, endpoint, opts...)
+	conn, err := grpc.NewClient(endpoint, opts...)


Haven't verified the changes that came with protoc-gen-go v1.34.1

evankanderson

This is looking pretty close -- just two concerns:

It looks like LATERAL is a lot slower than just a standard inner join. /shrug
I think your logic for when to loop back to the beginning isn't quite right.

evankanderson · 2024-06-03T22:47:57Z

database/query/repositories.sql

+-- name: GetRandomRepository :one
+SELECT * FROM repositories
+TABLESAMPLE SYSTEM_ROWS(1);
+



evankanderson · 2024-06-03T23:05:15Z

database/query/repositories.sql

+  ORDER BY id
+  LIMIT sqlc.arg('limit')::bigint
+) r
+JOIN LATERAL (


It looks like LATERAL converts this from a HashAggregate of a NestedLoop to a NestedLoop that runs an Aggregate inside:

EXPLAIN SELECT r.*
FROM (
  SELECT * FROM repositories
  ORDER BY id
  LIMIT 10
) r
JOIN LATERAL (
  SELECT MIN(rde.last_updated) AS min_last_updated
  FROM rule_evaluations re
    INNER JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
  WHERE re.repository_id = r.id
) sub ON sub.min_last_updated + interval '1h' < NOW()
ORDER BY r.id;
                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=31.75..318.77 rows=10 width=338)
   ->  Limit  (cost=0.14..2.30 rows=10 width=338)
         ->  Index Scan using repositories_pkey on repositories  (cost=0.14..47.45 rows=220 width=338)
   ->  Aggregate  (cost=31.61..31.63 rows=1 width=8)
         Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
         ->  Nested Loop  (cost=8.12..31.60 rows=5 width=8)
               ->  Bitmap Heap Scan on rule_evaluations re  (cost=7.97..15.08 rows=3 width=16)
                     Recheck Cond: (repository_id = repositories.id)
                     ->  Bitmap Index Scan on rule_evaluations_results_name_lower_idx  (cost=0.00..7.97 rows=3 width=0)
                           Index Cond: (repository_id = repositories.id)
               ->  Index Scan using idx_rule_detail_eval_ids on rule_details_eval rde  (cost=0.15..5.50 rows=1 width=24)
                     Index Cond: (rule_eval_id = re.id)
(12 rows)

vs

EXPLAIN SELECT r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
FROM (
  SELECT * FROM repositories
  ORDER BY id
  LIMIT 10
) r
JOIN rule_evaluations AS re on re.repository_id = r.id
JOIN rule_details_eval rde ON rde.rule_eval_id = re.id
WHERE re.repository_id = r.id
GROUP BY r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
HAVING MIN(rde.last_updated) + interval '1h' < NOW()
ORDER BY r.id;
                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=28.33..28.36 rows=13 width=146)
   Sort Key: r.id
   ->  HashAggregate  (cost=27.39..28.09 rows=13 width=146)
         Group Key: r.id, r.provider, r.project_id, r.repo_owner, r.repo_name, r.repo_id, r.is_private, r.is_fork, r.webhook_id
         Filter: ((min(rde.last_updated) + '01:00:00'::interval) < now())
         ->  Nested Loop  (cost=2.67..26.39 rows=40 width=154)
               ->  Hash Join  (cost=2.52..19.79 rows=26 width=162)
                     Hash Cond: (re.repository_id = r.id)
                     ->  Seq Scan on rule_evaluations re  (cost=0.00..15.10 rows=510 width=32)
                     ->  Hash  (cost=2.40..2.40 rows=10 width=146)
                           ->  Subquery Scan on r  (cost=0.14..2.40 rows=10 width=146)
                                 ->  Limit  (cost=0.14..2.30 rows=10 width=338)
                                       ->  Index Scan using repositories_pkey on repositories  (cost=0.14..47.45 rows=220 width=338)
               ->  Index Scan using idx_rule_detail_eval_ids on rule_details_eval rde  (cost=0.15..0.25 rows=1 width=24)
                     Index Cond: (rule_eval_id = re.id)
(15 rows)

While there are more steps in the non-lateral query plan, it seems to plan and execute substantially faster than the LATERAL version. In an environment with more rows, this is the difference between EXPLAIN ANALYZE taking 30ms and 2ms of execution time.

evankanderson · 2024-06-03T23:10:05Z

internal/reminder/reminder.go

+
+	// Update the reminder_last_sent for each repository to export as metrics
+	for _, repo := range repos {
+		logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)


Zerolog supports structured logging, which allows us to later easily query and extract (for example) log lines that reference a specific repository id:

Suggested change

-		logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
+		logger.Debug().Str("repo", repo.ID.String()).Time("previously", repo.ReminderLastSent.Time).
+			Msg("updating reminder_last_sent")

evankanderson · 2024-06-03T23:10:54Z

internal/reminder/reminder.go

+		logger.Debug().Msgf("previous reminder_last_sent: %s", repo.ReminderLastSent.Time)
+		err := r.store.UpdateReminderLastSentById(ctx, repo.ID)
+		if err != nil {
+			logger.Error().Err(err).Msgf("unable to update reminder_last_sent for repository: %s", repo.ID)


Suggested change

-			logger.Error().Err(err).Msgf("unable to update reminder_last_sent for repository: %s", repo.ID)
+			logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")

evankanderson · 2024-06-03T23:13:03Z

internal/reminder/reminder.go

+	if len(repos) == 0 {
+		r.repositoryCursor = uuid.Nil


Since we're filtering the repos we get back from the batch in the database, it's possible that we read e.g. 50 repos that all got filtered. I think you want to check RepositoryExistsAfterID here.

I missed this branch 😞 . I updated the logic. Hopefully I haven't missed any logical branches this time.

evankanderson · 2024-06-03T23:15:01Z

internal/reminder/reminder.go

+		logger.Error().Err(err).Msgf("unable to check if repository exists after cursor: %s", r.repositoryCursor)
+		logger.Info().Msg("resetting cursor to zero uuid")


I'd tend to put this as one message in the logs, rather than two separate messages. Especially when searching structured logs (say, for the last 48 hours across 8 or 10 servers), it can be hard to "scroll" forward or backward to the next log line from a particular server.

Vyom-Yadav · 2024-06-05T18:03:19Z

database/query/profile_status.sql

+-- ListOldestRuleEvaluationsByRepositoryId has casts in select statement as sqlc generates incorrect types.
+-- Though repository_id doesn't have non null constraint, but it always has a value in the database.
+-- cast after MIN is required due to a known bug in sqlc: https://github.com/sqlc-dev/sqlc/issues/1965
+
+-- name: ListOldestRuleEvaluationsByRepositoryId :many
+SELECT re.repository_id::uuid AS repository_id, MIN(rde.last_updated)::timestamp AS oldest_last_updated
+FROM rule_evaluations re
+    INNER JOIN rule_details_eval rde ON re.id = rde.rule_eval_id
+WHERE re.repository_id = ANY (sqlc.arg('repository_ids')::uuid[])
+GROUP BY re.repository_id;


Can re.repository_id be null? I was under the assumption that minder needs a repo to function on.

Also

EXPLAIN SELECT re.repository_id::uuid AS repository_id, MIN(rde.last_updated)::timestamp AS oldest_last_updated
FROM rule_evaluations re
    INNER JOIN rule_details_eval rde ON re.id = rde.rule_eval_id
WHERE re.repository_id = ANY (array['de0b2ad2-bc90-4126-b0a2-63abc1cce808','81b2a2ce-85fb-4528-a43e-7eef02f596f2','7e0813ce-6201-48f0-b136-bf08be8efcb9']::uuid[])
GROUP BY re.repository_id;

 GroupAggregate  (cost=37.19..37.36 rows=8 width=24)
   Group Key: re.repository_id
   ->  Sort  (cost=37.19..37.22 rows=12 width=24)
         Sort Key: re.repository_id
         ->  Hash Join  (cost=17.11..36.97 rows=12 width=24)
               Hash Cond: (rde.rule_eval_id = re.id)
               ->  Seq Scan on rule_details_eval rde  (cost=0.00..17.80 rows=780 width=24)
               ->  Hash  (cost=17.01..17.01 rows=8 width=32)
                     ->  Seq Scan on rule_evaluations re  (cost=0.00..17.01 rows=8 width=32)
                           Filter: (repository_id = ANY ('{de0b2ad2-bc90-4126-b0a2-63abc1cce808,81b2a2ce-85fb-4528-a43e-7eef02f596f2,7e0813ce-6201-48f0-b136-bf08be8efcb9}'::uuid[]))

It does a sequential scan on both tables. I can understand the sequential scan on rule_evaluations table as there is no index on re.repository_id. But after obtaining re.ids associated with re.repository_id, why does it need to perform a sequential scan on rule_details_eval rde? There is an index on rde.rule_eval_id.

repository_id can be null for resources which aren't associated with a repository (for example, we're working on a DockerHub provider that wouldn't have git repos). I think it's fine for Reminder not to work with those yet.

It looks like we don't have a relevant index on repository_id:

\d rule_evaluations
                   Table "public.rule_evaluations"
     Column      |   Type   | Collation | Nullable |      Default
-----------------+----------+-----------+----------+-------------------
 id              | uuid     |           | not null | gen_random_uuid()
 entity          | entities |           | not null |
 profile_id      | uuid     |           | not null |
 rule_type_id    | uuid     |           | not null |
 repository_id   | uuid     |           |          |
 artifact_id     | uuid     |           |          |
 pull_request_id | uuid     |           |          |
 rule_name       | text     |           | not null |
Indexes:
    "rule_evaluations_pkey" PRIMARY KEY, btree (id)
    "rule_evaluations_results_name_lower_idx" UNIQUE, btree (profile_id, lower(rule_name), repository_id, COALESCE(artifact_id, '00000000-0000-0000-0000-000000000000'::uuid), entity, rule_type_id, COALESCE(pull_request_id, '00000000-0000-0000-0000-000000000000'::uuid)) NULLS NOT DISTINCT

This somewhat surprises me, as I'd have expected the foreign key constraint on repository_id to create an index. We should probably add indexes to support these -- feel free to only add the one for repository_id right now. (This may actually have been the proximate cause of a lot of the other queries having screwy cost estimates -- adding the index changed the explain from a Seq Scan + Filter to Bitmap Heap Scan(Recheck Cond) + Bitmap Index Scan(Index Cond).)

rule_details_eval seems to be fine:

Indexes:
    "rule_details_eval_pkey" PRIMARY KEY, btree (id)
    "idx_rule_detail_eval_ids" UNIQUE, btree (rule_eval_id)

Added index on rule_evaluations(repository_id). The index will be created in concurrent mode as I thought blocking rule_evaluations writes isn't a good option.

Vyom-Yadav · 2024-06-05T18:07:04Z

@evankanderson To address the problem of updating the cursor properly, I have split the fetching and filtering queries. Now, repositories would be fetched unconditionally (to update the cursor) and later filtered on the application side.

evankanderson

This is getting really close!

I'd actually missed the "query might return zero repos but we still need to advance the cursor" issue -- nice catch!

evankanderson · 2024-06-06T20:07:11Z

database/query/profile_status.sql

+-- ListOldestRuleEvaluationsByRepositoryId has casts in select statement as sqlc generates incorrect types.
+-- Though repository_id doesn't have non null constraint, but it always has a value in the database.
+-- cast after MIN is required due to a known bug in sqlc: https://github.com/sqlc-dev/sqlc/issues/1965
+
+-- name: ListOldestRuleEvaluationsByRepositoryId :many
+SELECT re.repository_id::uuid AS repository_id, MIN(rde.last_updated)::timestamp AS oldest_last_updated
+FROM rule_evaluations re
+    INNER JOIN rule_details_eval rde ON re.id = rde.rule_eval_id
+WHERE re.repository_id = ANY (sqlc.arg('repository_ids')::uuid[])
+GROUP BY re.repository_id;



evankanderson · 2024-06-06T20:09:33Z

internal/reminder/reminder.go

+		err := r.store.UpdateReminderLastSentById(ctx, repo.ID)
+		if err != nil {
+			logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")
+			return []error{err}


It looks like this accumulates a list of errors by the signature, but this code does an early-return. Do you want to accumulate the errors, or just return err?

This function will largely change in the next PR i.e. connecting minder and reminder. A slice of errors is returned as we can get errors while creating messages, sending them, etc. Again, we can discuss this in the next PR. (See sendReminders function in the old PR)

evankanderson · 2024-06-06T21:00:00Z

internal/reminder/reminder.go

+	idToLastUpdatedMap := make(map[uuid.UUID]time.Time)
+	for _, oldestRuleEval := range oldestRuleEvals {
+		idToLastUpdatedMap[oldestRuleEval.RepositoryID] = oldestRuleEval.OldestLastUpdated
+	}
+
+	for _, repo := range repos {
+		if oldestRuleEval, ok := idToLastUpdatedMap[repo.ID]; ok &&
+			oldestRuleEval.Add(r.cfg.RecurrenceConfig.MinElapsed).Before(time.Now()) {


Why do this two-step loop here? It feels like either passing in the cutoff to the query (i.e. SELECT ... WHERE rde.last_updated < $1 AND re.repository_id = ANY(...) GROUP BY re.repository_id or the equivalent HAVING), or looping through the results once should be sufficient:

cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
for _, evalTime := range oldestRuleEvals {
	if evalTime.OldestLastUpdated.Before(cutoff) {
		eligibleRepos = append(eligibleRepos, repo)
	}
}

ListOldestRuleEvaluationsByRepositoryId only queries rule_evaluations and rule_details_eval, so it returns a slice of:

type ListOldestRuleEvaluationsByRepositoryIdRow struct {
	RepositoryID      uuid.UUID `json:"repository_id"`
	OldestLastUpdated time.Time `json:"oldest_last_updated"`
}

We have to iterate over repos as the repo object isn't returned.

Ah, that seems like it's worth a comment about how we're doing a bunch of transforms of types to fit into the sqlc-generated code (rather than for some other reason, like performance or thread-safety).

Also, a slight preference for:

cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
for _, repo := range repos {
	if t, ok := idToLastUpdate[repo.ID]; ok && t.Before(cutoff) {
		....
	}
}

This has two benefits:

You can fit the condition on one line (shortening the time var that lives for a single line, removing Map from the name of the map, and doing the date math on a separate line).

You only do the date math once, and only fetch time.Now() once, rather than using a slightly different time for each check.

Signed-off-by: Vyom-Yadav <jackhammervyom@gmail.com>

evankanderson · 2024-06-11T20:07:18Z

internal/reminder/reminder.go

+	idToLastUpdatedMap := make(map[uuid.UUID]time.Time)
+	for _, oldestRuleEval := range oldestRuleEvals {
+		idToLastUpdatedMap[oldestRuleEval.RepositoryID] = oldestRuleEval.OldestLastUpdated
+	}
+
+	for _, repo := range repos {
+		if oldestRuleEval, ok := idToLastUpdatedMap[repo.ID]; ok &&
+			oldestRuleEval.Add(r.cfg.RecurrenceConfig.MinElapsed).Before(time.Now()) {


Ah, that seems like it's worth a comment about how we're doing a bunch of transforms of types to fit into the sqlc-generated code (rather than for some other reason, like performance or thread-safety).

Also, a slight preference for:

    cutoff := time.Now().Add(-r.cfg.RecurrenceConfig.MinElapsed)
    for _, repo := range repos {
        if t, ok := idToLastUpdate[repo.ID]; ok && t.Before(cutoff) {
            ....
        }
    }

This has two benefits:

You can fit the condition on one line (shortening the time var that lives for a single line, removing Map from the name of the map, and doing the date math on a separate line).

You only do the date math once, and only fetch time.Now() once, rather than using a slightly different time for each check.

evankanderson · 2024-06-11T20:13:11Z

I'm going to merge and then send a PR for the cleanup in getEligibleRepositories to avoid needing to do another go-round. 😁

Vyom-Yadav requested a review from a team as a code owner May 23, 2024 16:00

Vyom-Yadav marked this pull request as draft May 23, 2024 18:16

Vyom-Yadav force-pushed the addReminderLogic branch 4 times, most recently from 19f4f0a to 8fd38d5 Compare May 25, 2024 09:52

Vyom-Yadav commented May 25, 2024

Vyom-Yadav marked this pull request as ready for review May 25, 2024 09:55

rdimitrov reviewed May 30, 2024

blkt reviewed May 30, 2024

evankanderson reviewed May 30, 2024

Vyom-Yadav force-pushed the addReminderLogic branch 2 times, most recently from e273a44 to efed08c Compare June 3, 2024 06:44

Vyom-Yadav commented Jun 3, 2024

evankanderson reviewed Jun 3, 2024

Vyom-Yadav force-pushed the addReminderLogic branch 3 times, most recently from ef39a5d to 6c92820 Compare June 5, 2024 18:02

Vyom-Yadav commented Jun 5, 2024

Vyom-Yadav force-pushed the addReminderLogic branch from 6c92820 to dc04b46 Compare June 5, 2024 18:15

evankanderson reviewed Jun 6, 2024

Vyom-Yadav force-pushed the addReminderLogic branch 2 times, most recently from 413869a to 7afc72f Compare June 11, 2024 07:59

Add batch creation logic for the reminder service

2542103

Signed-off-by: Vyom-Yadav <jackhammervyom@gmail.com>

Vyom-Yadav force-pushed the addReminderLogic branch from 7afc72f to 2542103 Compare June 11, 2024 08:55

evankanderson approved these changes Jun 11, 2024

evankanderson merged commit a2b09c7 into mindersec:main Jun 11, 2024
23 checks passed

evankanderson mentioned this pull request Jun 11, 2024

Add comments to getEligibleRepositories, reduce time work in same, fix migration numbering #3580

Merged

10 tasks

		ALTER TABLE repositories ADD COLUMN reminder_last_sent TIMESTAMP;

		CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

		conn, err := grpc.DialContext(ctx, endpoint, opts...)
		conn, err := grpc.NewClient(endpoint, opts...)

	logger.Debug().Msgf("updating reminder_last_sent for repository: %s", repo.ID)
	logger.Debug().Str("repo", repo.ID.String()).Time("previously", repo.ReminderLastSent.Time).
	Msg("updating reminder_last_sent")

	logger.Error().Err(err).Msgf("unable to update reminder_last_sent for repository: %s", repo.ID)
	logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")

		logger.Error().Err(err).Msgf("unable to check if repository exists after cursor: %s", r.repositoryCursor)
		logger.Info().Msg("resetting cursor to zero uuid")
