Correctness issue: Change to batched entry upload to support witnessing #1067

Closed · haydentherapper opened this issue Sep 21, 2022 · 6 comments · Fixed by #1461
Labels: bug (Something isn't working)

haydentherapper commented Sep 21, 2022

Witnesses monitor the consistency of the log, verifying that the log is append-only and immutable. Roughly, the verification process for a witness is:

  • Persist a checkpoint (signed tree head)
  • Periodically, request the current checkpoint
  • Verify the signature on the checkpoint
  • Request and verify a consistency proof between the persisted checkpoint and current checkpoint
  • Persist the new checkpoint
  • Publish a countersignature over the current checkpoint

Witnesses also help mitigate split-view attacks, where the log presents different views to different clients. As a consumer, I can use witnesses' countersignatures to verify that all witnesses have seen the same checkpoint. For a client that wants to verify an entry in the log and verify the log is consistent and no split-view attack is occurring, the client needs to:

  • Verify the entry's inclusion proof
    • If presented with an SET (offline promise of inclusion), the client should request an inclusion proof
  • Verify consistency by reaching quorum with a set of witnesses. This involves a) checking signatures for each countersignature from a witness, and b) verifying that the root hash that's been signed over is the same as the root hash of the inclusion proof.
    • Note that a client doesn't need to request a consistency proof if they trust the set of witnesses
    • It's up to the client how many witnesses constitutes quorum
    • Note that quorum could be reached via a gossip protocol too, where witnesses share signatures with one another.

This quorum mechanism requires that witnesses have seen the same root hash, which means the root hash cannot update frequently. With the current state of Rekor, the root hash updates with almost every entry, because we've set Trillian's MMD (maximum merge delay, the time it takes for Trillian to process an entry for inclusion) to 0. As a result, witnesses are highly unlikely to ever observe the same checkpoint.

The proposed fix would be to move towards a model like CT, where entries are batched and processed periodically. Whatever this period is, we can set witnesses to poll more frequently than this period, giving them a chance to witness the same checkpoint. For example, if we set MMD to 5 minutes, we can have witnesses poll every minute or two.

The impact of this change is:

  • The SET will no longer include a log index. This will be a breaking change to clients, because the SET signature is over the log index
  • The SET will no longer include the integrated time. This timestamp will have to be the received/requested time, or a signed timestamp.
  • The response on entry upload will no longer include the inclusion proof, also a breaking client change. I do not see this as a significant problem: technically, an inclusion proof is a stronger statement than an SET (which is just a promise), but the log can always lie about the inclusion proof and present a split view. You still need to a) verify consistency with that inclusion proof's root hash, and b) find quorum. Persisting an inclusion proof does save one call to Rekor, however, since for SET verification you should also request an inclusion proof to confirm that the promise has been resolved and the entry is actually in Rekor.
  • Entries can't be verified online immediately. This is not likely an issue for two reasons:
    • Most Sigstore clients have just been using SETs to "verify inclusion" and never querying the Rekor log
    • Artifacts are likely not going to be verified immediately after upload, but later when they're consumed. We just need to keep the MMD low enough (<1hr would be my recommendation).
haydentherapper added the bug (Something isn't working) and ga_candidate (Proposed blocking issue for GA release) labels Sep 21, 2022
@haydentherapper (Contributor Author)

cc @asraa @priyawadhwa @znewman01 for more discussion

@haydentherapper (Contributor Author)

An alternative to making breaking changes, while still increasing the batch size, would be to keep the current design and increase the MMD to O(minutes). Rekor already polls Trillian until the upload is complete, so this would only require a configuration change. However, the downside to this is that connections will be kept open for minutes, and the SLO for upload would have to be significantly increased. I assume this is not a viable solution due to filling up the connection pool, but maybe there's some way around this? If anyone has knowledge of Go + networking, please chime in!

Also worth noting the current state:

  • We can deploy a GHA-based witness using rekor-monitor. This currently just verifies consistency.
  • Trillian's omniwitness verifies consistency, publishes a countersignature, and gathers signatures from other witnesses ("distributors"). If you look at one distributor, you'll see that it has not been able to gather matching checkpoints from different distributors in months (checkpoint.2/checkpoint.3), and I assume it was only able to do so because of some outage in Rekor.
  • Witnesses verifying consistency without a quorum mechanism will not prevent split-view attacks, because each witness could be presented a different view of the log.

dlorenc commented Sep 22, 2022

I'm not sure I agree that witnessing and gossip protocols are necessary to mitigate split-view attacks. To maintain a long-term split view, the server would need to track individual users across IP addresses over long periods of time in order to target them.

A split-view attack is impractical if you assume that clients/witnesses are relatively anonymous. There's plenty more work that can be done here to make these attacks even harder, but this is still pretty effective today.

znewman01 commented Sep 22, 2022

I still think it's a good idea to make the SET just a promise, and not wait for completion.

This has some advantages architecturally: as the Merkle tree grows, each append will be slower, and batching helps, so we may want to batch for reasons other than security.

EDIT: plus it's weirdly inconsistent with CT

priyawadhwa removed the ga_candidate (Proposed blocking issue for GA release) label Sep 22, 2022

asraa commented Sep 22, 2022

> This has some advantages architecturally: as the Merkle tree grows, each append will be slower, and batching helps, so we may want to batch for reasons other than security.

@rbehjati has also brought up points in favor of batched updates, in the face of the CAP theorem as well; ideally we can make the batching period a configurable parameter of the server.


haydentherapper commented Feb 15, 2023

I have written a proposal for supporting witnessing in Rekor - https://docs.google.com/document/d/1NFSrkcIvrwvqV-2Ax2NoE0FdKh6ccnZufdqawAetKmY/edit?resourcekey=0-p8eFLQub4klCf3wrOAFSTg (shared with sigstore-dev@)

tl;dr: We don't need any breaking changes, and we don't need to switch to batching entries. Rekor will publish a checkpoint periodically that witnesses will sign. As an artifact verifier, I will use this witnessed checkpoint (and the log size from that checkpoint) when computing inclusion proofs. The consequence of this is that artifacts aren't immediately verifiable when added to the log, but this is no different than if we went with a batched approach. This approach also comes with privacy improvements (k-anonymity), since checkpoints no longer uniquely identify an entry.

Proposed changes to Rekor are to:

  • Add an API for fetching a to-be-witnessed checkpoint (with a mechanism to compute this periodically)
  • Add support for fetching an inclusion proof by tree size (supported by Trillian, but not Rekor's current API)

There will be changes to Sigstore clients to support this stronger online verification and witnesses to use the new API.

haydentherapper self-assigned this Feb 15, 2023
haydentherapper added a commit to haydentherapper/rekor that referenced this issue May 2, 2023
Currently, a log witness countersigns the latest checkpoint Rekor
publishes. Rekor updates its checkpoint on every entry upload, which is
extremely frequent. This means that two witnesses are very unlikely to
countersign the same checkpoint. While gossiping, it will not be
possible to reach a consensus on the same checkpoint, and therefore we
can't mitigate split-view attacks.

This change publishes a checkpoint, a "stable checkpoint", every 5
minutes (configurable) to Redis. This runs as a goroutine, with a Redis
key derived from the current time rounded to the nearest 5 minutes.
We use set-if-not-exist for Redis, meaning you can run replicated instances of
Rekor, with all instances writing to the same Redis key.

For a client that wants to gossip, this means waiting 5 minutes before
a checkpoint is published that witnesses will countersign (Note that
this is an area of active development and research too).
The stable checkpoint can be accessed with a query parameter.

Fixes sigstore#1067. There is still
value in batching in terms of reliability, but stable checkpoints
solve the gossiping issue without a breaking change.

Signed-off-by: Hayden Blauzvern <hblauzvern@google.com>
haydentherapper added a commit to haydentherapper/rekor that referenced this issue May 9, 2023
bobcallaway added a commit that referenced this issue May 14, 2023
* Publish stable checkpoint periodically to Redis

* Use latest key to access latest checkpoint
* Add test
* Return early if key already exists
* Add comment explaining failure handling
* Apply suggestions from code review
* Fix goroutine leak, check if redis client is configured
* Add test for goroutine leak

Co-authored-by: Bob Callaway <bobcallaway@users.noreply.github.com>