
Add a namespace field to tuf metadata storage #1110

Closed
wants to merge 7 commits into from

Conversation

ecordell
Contributor

@ecordell ecordell commented Mar 1, 2017

This PR adds “channels” to tuf metadata. A Published and Staged channel are added and the migration puts all existing metadata into the Published channel.

Motivation

The primary motivation for introducing namespaces is to support “staged” views of metadata, so that a notary server can coordinate threshold signing between remote users.

As discussed in #841, a single “staged” copy of metadata is insufficient due to race conditions. There will instead be a maximum of N staged copies, where N is the number of users allowed to sign a particular metadata file. Limiting staged copies to those users also protects against DoS via uploading many staged versions (each signer gets at most one staged copy).

This is just the first step towards that, to keep things manageable.

Design decisions

This is not user-facing; users do not get to decide which channel their metadata lives in. For this reason the namespace is not exposed to the client. Channel decisions will be made on the server during the update process.

A new channels table is added to allow channels to be easily added/removed. The expectation is that a few (e.g. Published and Staged) will be fairly static, but additional Channels will come and go. For example, there may be a new channel for each key that can sign metadata (which will only exist during the staging process).

The Changefeed only publishes changes in the Published channel for now. It may make sense to include other channels at some point, or to provide channel-specific changefeeds, but this first version preserves the existing behavior.
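For reference, here is a minimal sketch of what the channel model might look like in Go. The struct fields, tags, and string values below are illustrative assumptions; only the existence of the Published and Staged channels is established by this PR.

package storage

// Channel is a named view of TUF metadata (sketch; the real fields may differ).
type Channel struct {
	ID   uint   `gorm:"primary_key"`
	Name string `sql:"type:varchar(255);not null;unique"`
}

// Default channels created by the migration: existing metadata is moved into
// Published, and Staged is reserved for the future threshold-signing work.
var (
	Published = Channel{ID: 1, Name: "published"}
	Staged    = Channel{ID: 2, Name: "staged"}
)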

@ecordell ecordell force-pushed the namespaced-metadata branch from 5c25351 to a1e369a on March 1, 2017 16:38
@ecordell ecordell force-pushed the namespaced-metadata branch 2 times, most recently from 3e7c864 to 5615777 on March 1, 2017 16:53
@@ -0,0 +1,18 @@
CREATE TABLE `namespaces` (
Contributor

@endophage endophage Mar 1, 2017

YAAAAASSSS! I was just explaining this pattern to @riyazdf and @cyli a month or so ago :-D

(the pattern being a single column table rather than an enum)

@endophage
Contributor

Overall I think this looks pretty good. I have two thoughts, one of which may not be immediately relevant:

  1. is "namespace" the right terminology? As a precursor towards thresholding and staged metadata, it's really a "state" or "stage" of the publishing process.
    a. a sub point, I think we should be more explicit about the consts we use for the value. "default" doesn't tell me anything. Is "default" staged or published. I know which from reading the code, but I had to read more code than should be necessary to determine it. I'd say the default should be "published" and the const itself should be something like "FooPublished" where "Foo" is whatever we settle on as far as "Namespace", "Stage", "State", or something else.
  2. (possibly for later) We need to take another look at the DB indices. I think the current implementation will prevent multiple iterations of the same role/version existing at the same time. This is desirable for published metadata, I can't remember what/if we settled on what would happen with pending metadata (I think I would be OK with only allowing one staged thing but want to make sure we've hashed out that experience.

@ecordell
Contributor Author

ecordell commented Mar 1, 2017

Working on the RDB tests and will update!

is "namespace" the right terminology? As a precursor towards thresholding and staged metadata, it's really a "state" or "stage" of the publishing process.

What about changing the table name from namespaces to labels, and changing the namespace field to state_label? Then additional labels can be added in the future to address some of the other use-cases of namespaces I suggested.

a sub point, I think we should be more explicit about the consts we use for the value. "default" doesn't tell me anything.

How about published as you suggested, or perhaps canonical or official?

We need to take another look at the DB indices. I think the current implementation will prevent multiple iterations of the same role/version existing at the same time.

I changed the constraints to include the namespace field, which means you can have the same role/version in multiple records as long as they have different namespaces.

This is desirable for published metadata; I can't remember what/if we settled on what would happen with pending metadata (I think I would be OK with only allowing one staged thing, but want to make sure we've hashed out that experience).

We might be talking about different things, but the plan was to make the namespace (or state_label) be used for pending metadata.

For example:

  • You and I are both signers for targets/releases.json with a threshold of 2.
  • I sign a new version and push it up
  • The notary server detects that the threshold is <2 and stashes it with namespace = ecordell (or perhaps namespace = <my_kid> would make more sense)
  • If you push up conflicting metadata, it would get stashed under namespace = endophage
  • Or if you pull my staged metadata, sign it, and push it up, it would be promoted to the default/canonical namespace and the stashed namespace = ecordell copy would be garbage collected.

That's the rough plan for pending metadata for thresholding, were you thinking of a different type of "pending" metadata?

@ecordell ecordell force-pushed the namespaced-metadata branch 2 times, most recently from 823d196 to 011654d on March 2, 2017 13:11
@endophage
Contributor

Ah thanks for the explanation. I thought the "namespaces" (I'll stick with the term for now) would be more like "pending_signatures", "pending" (has enough signatures but maybe needs a final OK or for, say, a periodic publish to happen) and "published", rather than referencing a signer by one means or another (name/keyID). If that's the plan, I don't think it even makes sense to have a table for them. Would we write every new "namespace" into that table to satisfy the foreign key relationship? What's the point of that (and of having the foreign key relationship) if new "namespaces" can be dynamically created (by creating new delegations with thresholds) and we'll just update the table when we see them? I don't think it would actually provide any meaningful guarantees.

I just re-read your initial comments and I think it might be worth addressing some of the ideas you had for future uses of namespaces:

Namespaces could be used to share a single notary server between multiple classes of objects. For example, a “container” namespace and a “python package” namespace.

Notary doesn't care if you're signing Docker images or Python packages; it's just a TUF repo. GUNs are already meant to indicate ownership/context, hence all Docker Hub images are signed under docker.io/* and I would expect Python packages to be signed under something like pypi.org/*.

Could be used to “stash” a set of metadata when a user performs a potentially destructive operation (e.g. deleting a root) so that it can be recovered later.

The ORM (gorm) already supports soft deletes. It's only our uniqueness constraints that are preventing us from using it. Additionally, one would still run into the problem if, say, they deleted the same role/version twice at different points in time, so this only half-solves the problem in its most basic form.

Could be used to store different “views” of metadata a la the PyPI maximum security model. (A user could decide to look for targets only in the tree rooted at the offline key, or they could request from the online key as well).

Those views are achieved client side. To make it work as you describe, I believe it would require multiple root files referencing different targets keys. PEP 480 doesn't describe this structure as far as I remember (and having taken another brief scan); instead it would be on the user to say "I know packages in the 'claimed' bucket are more secure so I only want to use those" (something the user would still have to do even if we maintained a special view server side).

@ecordell
Contributor Author

ecordell commented Mar 2, 2017

If that's the plan, I don't think it even makes sense to have a table for them. Would we write every new "namespace" into that table to satisfy the foreign key relationship? What's the point of that (and of having the foreign key relationship) if new "namespaces" can be dynamically created (by creating new delegations with thresholds) and we'll just update the table when we see them? I don't think it would actually provide any meaningful guarantees.

That pretty much has to be the plan - it's been a while, but if you read back through #841 (and related linked discussions: https://github.com/theupdateframework/tuf/issues/339), we hashed out how the coordination would have to work to avoid DoSing and/or bad things in the case of a single key compromise. In the case where you have only 1 staged copy at a time, the compromise of a single delegation key allows someone to continually clobber the staged metadata. (e.g. if you, cyli, and I all sign for some metadata, I can continue to push garbage "staged" metadata and the two of you would have to race me to actually hit a 2/3 threshold!)

As for utility, I was thinking having a separate table + fk constraint would be useful for garbage collection? You can construct a list of candidate namespaces by querying the namespace table. I'm not 100% sold on the utility, so I can drop it for now; we can always add it in the future if needed.

I'll update to address all of this.

@endophage
Contributor

@ecordell sorry, maybe I wasn't clear. My issue isn't with the process, only the implementation (the utility, as you put it). It doesn't make sense to have the extra table unless the foreign key constraint is enforceable, but for it to be enforceable in a meaningful way, it also needs to be a static set of possibilities. If we're just creating namespaces dynamically, there's no point in having it; it's just not giving us anything.

@ecordell
Contributor Author

ecordell commented Mar 3, 2017

@endophage updated so that there's simply a field with no fk, and no extra table.

@ecordell ecordell force-pushed the namespaced-metadata branch 2 times, most recently from 7db65dc to a95d5fa on March 3, 2017 14:22
@endophage
Contributor

endophage commented Mar 4, 2017

Other than still really disliking the name of the new Namespace type, I think I'm good to go on this. @cyli @riyazdf @HuKeping any thoughts on the new type name? My concerns are that I don't think it really represents what the new field is about (i.e. the publish state, and somewhat which signer proposed the pending version), and also that "Namespace" gets thrown around so much I wouldn't want anybody to confuse it with some other use in a different Docker or associated project. For example, I think containerd is officially adopting our "canonical" GUNs and calling them "Namespaces".

@ecordell
Contributor Author

ecordell commented Mar 7, 2017

Working on a new design that should address both the naming concerns and also make this more flexible (I realized that this one won't easily support some of the queries we'd want to do, e.g. "if there's staged metadata, give me the version with the most signatures").

@ecordell
Contributor Author

ecordell commented Mar 14, 2017

Okay so this took a bit longer than planned thanks to some fun gorm issues, but here's the revised design:

  1. Instead of namespaces, I've added Channels which are the "alternate realities" of a TUF repository.
  2. There are two default Channels: Published and Staged
  3. Everything is transitioned to Published currently, and Staged is unused for now.
  4. Channels are many-to-many with TUFFiles, so when we add threshold signing, each signer will get their own channel.

This lets us do nice queries like "staged metadata for GUN/Role with the most signatures" or "staged metadata signed by X but not yet signed by Y or Z", which we will need when eventually presenting these states to the user. You'll notice that compared to the namespace design, the APIs for channels are much nicer (IMO).

I checked all of the queries being generated in SQL and they look good. The joins are minimal and use indexes. I've also updated gorm to v1.0 and made the necessary changes. The older version that we were using before was generating bad SQL when using a join table.
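For illustration, a channel-scoped lookup along the lines of the TufFilesInChannels scope used in this PR might be issued like this with gorm. This is a sketch, not the PR's exact code: the column names follow the migration shown further down, and Staged.ID is an assumed field.

// Latest staged version of a role for a GUN (sketch).
var row TUFFile
err := db.
	Joins("INNER JOIN channels_tuf_files ON tuf_files.id = channels_tuf_files.tuf_file_id").
	Where("channels_tuf_files.channel_id = ?", Staged.ID).
	Where("gun = ? AND role = ?", gun.String(), role.String()).
	Order("version desc").
	Limit(1).
	First(&row).Error
if err == gorm.ErrRecordNotFound {
	// no staged metadata exists for this GUN/role
}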

What is the status of RethinkDB? I was going to add support for that but noticed the changefeed isn't implemented there either. Is it deprecated?

@@ -407,7 +407,7 @@ func TestGetGUNPRefixes(t *testing.T) {
// For sanity, make sure we can always parse the sample config
func TestSampleConfig(t *testing.T) {
var registerCalled = 0
_, _, err := parseServerConfig("../../fixtures/server-config.json", fakeRegisterer(&registerCalled), false)
_, _, err := parseServerConfig("../../fixtures/server-config.sqlite.json", fakeRegisterer(&registerCalled), false)
Contributor Author

gorm 1.0 errors if it can't establish a connection immediately, so I switched to sqlite for testing config parsing.

@ecordell ecordell force-pushed the namespaced-metadata branch 2 times, most recently from 7ce2685 to 9527b8c on March 14, 2017 17:55
@ecordell ecordell force-pushed the namespaced-metadata branch from e36c6b7 to 184c434 on March 15, 2017 15:04
Contributor

@endophage endophage left a comment

I only consider the check in isValidTargetsStructure to be a necessary change. The rest are questions/future minor polish but I don't consider them blockers.


// GetChecksum returns the given TUF role file and creation date for the
// GUN with the provided checksum. If the given (gun, role, checksum) are
// not found, it returns storage.ErrNotFound
GetChecksum(gun data.GUN, tufRole data.RoleName, checksum string) (created *time.Time, data []byte, err error)
GetChecksum(gun data.GUN, tufRole data.RoleName, checksum string, channels ...*Channel) (created *time.Time, data []byte, err error)
Contributor

Checksum is sufficiently specific that I wonder if it's necessary to also pass a channel?

Contributor Author

The same checksum can live in multiple channels when staging metadata

Contributor

Ah of course!

@@ -33,17 +33,17 @@ type MetaStore interface {
// GetCurrent returns the modification date and data part of the metadata for
// the latest version of the given GUN and role. If there is no data for
// the given GUN and role, an error is returned.
GetCurrent(gun data.GUN, tufRole data.RoleName) (created *time.Time, data []byte, err error)
GetCurrent(gun data.GUN, tufRole data.RoleName, channels ...*Channel) (created *time.Time, data []byte, err error)
Contributor

Should GetCurrent implicitly mean published and maybe there should be a new method GetLatest that takes a channel?

Contributor Author

@ecordell ecordell Jun 7, 2017

The current behavior is that GetCurrent(gun, role) uses the published channel, and GetCurrent(gun, role, channel) uses the specified channel. I'm happy to add another wrapper method, but that seems not dissimilar from your suggestion.
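A minimal sketch of how that default is presumably applied inside each implementation (assumed, not taken from the PR):

// If the caller passes no channels, fall back to the Published channel.
func getChannels(channels ...*Channel) []*Channel {
	if len(channels) == 0 {
		return []*Channel{&Published}
	}
	return channels
}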

Contributor

Ahhh, yeah, this is fine. I had a moment of forgetfulness that a variadic ... can also be left empty by the caller.

Contributor

It seems like in all the implementations, if multiple channels are passed, only the data from the first channel in which it is found is returned (and the channels are checked in order of preference). Should this behavior be documented as a requirement in the interface, and a general test be added (maybe in server/storage/storage_test.go) to make sure that even if a TUF file with the same GUN and role is in multiple channels, this function will only return the data from the first?

Similarly for GetVersion. I guess for GetChecksum it doesn't really matter, since if it's the same checksum, the data will be the same no matter which channel it's in, although the creation time may differ. If the creation time matters, then perhaps a similar test can be added for GetChecksum?

}

func checksumKey(gun data.GUN) string {
return fmt.Sprintf("%s", gun)
Contributor

If we really need this function at all it should just be return gun.String()

}

// Published is the channel all fully signed, validated metadata lives in
var Published = Channel{
Contributor

use a var block

var (
    Published = ...
    Staged = ...
)

@@ -43,7 +42,7 @@ func isValidTargetsStructure(t Targets, roleName RoleName) error {
}

for _, roleObj := range t.Delegations.Roles {
if !IsDelegation(roleObj.Name) || path.Dir(roleObj.Name.String()) != roleName.String() {
if !IsDelegation(roleObj.Name) {
Contributor

the longer check should stay in. It's checking that all the delegations are direct children of the parent. i.e. if the targets.json file contains a "targets/foo/bar" delegation, that would be invalid.

Contributor Author

I was trying to get away from the strict hierarchy, looking forward to multi-role delegation. The DFS for delegations just needs to check for a cycle and otherwise doesn't really need to care about the names. Is there a reason for enforcing the hierarchy that I'm missing?

@endophage
Contributor

endophage commented Jun 6, 2017 via email

ecordell added 6 commits June 7, 2017 09:45
Signed-off-by: Evan Cordell <cordell.evan@gmail.com>
Signed-off-by: Evan Cordell <cordell.evan@gmail.com>
Signed-off-by: Evan Cordell <cordell.evan@gmail.com>
Signed-off-by: Evan Cordell <cordell.evan@gmail.com>
Signed-off-by: Evan Cordell <cordell.evan@gmail.com>
Signed-off-by: Evan Cordell <cordell.evan@gmail.com>
@ecordell ecordell force-pushed the namespaced-metadata branch from ea277f8 to fe55a08 on June 7, 2017 13:45
@ecordell ecordell force-pushed the namespaced-metadata branch from fe55a08 to e407047 on June 7, 2017 13:51
Signed-off-by: Evan Cordell <cordell.evan@gmail.com>
Contributor

@endophage endophage left a comment

LGTM!

@ecordell
Contributor Author

@cyli @riyazdf @HuKeping anyone else have some time to review? I will rebase on Monday!

@cyli
Contributor

cyli commented Jun 15, 2017

@ecordell Looking now, although apologies, it will take me a little bit to re-familiarize myself :)

Contributor

@cyli cyli left a comment

Apologies, I haven't finished reviewing everything yet - I got sidetracked. I just wanted to include a couple of comments first - and I'll do another pass tomorrow.

I had a non-blocking question about garbage collection, should we choose to implement it - how do we want to implement that? It can totally be done in a different PR in the future, but I just wanted to make sure we considered it so that we didn't make a change that we'd have to undo in order to support it.

Would it be like "find every tuf_file ID that is not matched to the Published channel, and for each one, delete every other entry in tuf_file with the same gun, role, and version but different ID"?

@@ -0,0 +1,25 @@
ALTER TABLE "tuf_files" DROP CONSTRAINT "tuf_files_gun_role_version_key";
Contributor

Dumb SQL question: Is this constraint automatically created in 0001_initial.up.sql when we specify UNIQUE ("gun", "role", "version")? I didn't see this particular name anywhere in our repo, so was just wondering if this was something Postgres-specific.

Contributor Author

Yep, exactly. It's created by UNIQUE in postgres.

Contributor

Cool thanks for explaining!


Role data.RoleName
Version int
Data []byte
Channels []*Channel
Contributor

It looks like, from the in-memory implementation (I am not very familiar with advanced SQL syntax like joins and grouping, so I hadn't gotten to looking at the SQL implementation yet :)), that if multiple channels are passed, the same data will be added to all the channels.

Out of curiosity, what is the use case for staging the same data to multiple channels at once? In the case of published and staging, you probably don't need to write to both. In the case where each channel is a particular user or key ID, it also seems strange to write to multiple key IDs. Is the use case writing to both staging and a particular key ID? (I was a little fuzzy on how that would work for threshold signing.)

If that's the case, and adding the same update should apply to every channel in the channel list (as opposed to the Get* functions, for which if multiple channels are passed only the data from the first channel that gun and role are in will be returned), should this be documented in the storage interface's UpdateCurrent function declaration? And in server/storage/storage_test.go's testUpdateCurrentInChannel, can we update multiple channels at the same time and ensure that getting for both channels returns the same data?

Contributor Author

Is the case for writing to both staging and a particular key ID? (I was a little fuzzy on how that would work for threshold signing).

Yeah, that was the plan. You upload metadata and it gets put into channels based on the keys it's signed with, then users can coordinate about which ones to promote by picking the versions they want to sign. Then you can get "all staged" by just querying the staging channel, or a particular staged version by querying for (staged, key_id).

(as opposed to the Get* functions, for which if multiple channels are passed only the data from the first channel that gun and role are in will be returned),

I think you're just seeing the deficiencies of the in-memory implementation. For testing purposes I didn't think it was necessary to try to replicate everything that sql gives you, but maybe it is worth the effort?

The SQL implementation gives you what you would expect from the interfaces: when you Get from a set of channels, the file must exist in all of them, and if you Update to a set of channels, the metadata will be placed in all of them.
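Putting those two behaviors together, here is a rough sketch of staging and then reading back a per-signer copy through the MetaStore interface. The per-key channel name and the surrounding variables (store, gun, signedJSON) are hypothetical.

// Hypothetical per-signer channel used while a threshold is not yet met.
signerChannel := &Channel{Name: "key:abcd1234"}

// Update places the metadata in every listed channel.
err := store.UpdateCurrent(gun, MetaUpdate{
	Role:     data.CanonicalTargetsRole,
	Version:  5,
	Data:     signedJSON,
	Channels: []*Channel{&Staged, signerChannel},
})
if err != nil {
	// handle the update error
}

// Get only succeeds if the file exists in all of the requested channels.
_, stagedBytes, err := store.GetCurrent(gun, data.CanonicalTargetsRole, &Staged, signerChannel)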

Contributor Author

(The issue with the in-memory implementation is the answer to your other comment as well, for some reason GH won't let me reply to it)

Contributor

for some reason GH won't let me reply to it

I think that's because I started a review, but in my "review" I replied to someone else's comment, and github posts my response both as part of my review and as part of the original comment. :| You might be able to respond in the original comment.

The SQL implementation gives you what you would expect from the interfaces: when you Get from a set of channels, the file must exist in all of them, and if you Update to a set of channels, the metadata will be placed in all of them.

Ah ok, thanks for explaining. Apologies, I had misunderstood the intended behavior of Get* in the SQL implementation because I missed the PRIMARY KEY (tuf_file_id, channel_id) in the channel table creation SQL, and because I just saw Where("channel_id IN (?)", channelIDs) in TufFilesInChannels without reading further.

I think you're just seeing the deficiencies of the in-memory implementation. For testing purposes I didn't think it was necessary to try to replicate everything that sql gives you, but maybe it is worth the effort?

I think the intention of the in-memory implementation is to provide basically a verified fake for others to use, and as a reference implementation for how everything else works. I'm ok with leaving it for now and we can address that in a future PR, so long as we document the deficiency.

I do think we should add more test cases for getting data from multiple channels, though, to be shared across the implementations of the storage interface (e.g. server/storage/storage_test.go, although in the in-memory implementation tests can just not call that particular function), and document this expected behavior in the interface, just because it can be easier to see what the intended behavior is if someone (like me :D) isn't super familiar with SQL joins. I'd also be afraid of making a change in that SQL implementation code and breaking it!

@ecordell
Contributor Author

Would it be like "find every tuf_file ID that is not matched to the Published channel, and for each one, delete every other entry in tuf_file with the same gun, role, and version but different ID"?

That's what I was thinking, probably with a configurable age for deleting them
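A sketch of what that collection pass could look like as a raw query; this is entirely hypothetical, and the cutoff and created_at column are illustrative assumptions.

// Hypothetical GC: delete rows that are not in the Published channel and are
// older than a configurable cutoff. Assumes tuf_files has a created_at column.
cutoff := time.Now().Add(-30 * 24 * time.Hour)
err := db.Exec(`DELETE FROM tuf_files
	WHERE created_at < ?
	AND id NOT IN (
		SELECT tuf_file_id FROM channels_tuf_files WHERE channel_id = ?
	)`, cutoff, Published.ID).Error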

@@ -115,7 +115,7 @@ func (rdb RethinkDB) UpdateCurrent(gun data.GUN, update MetaUpdate) error {

// UpdateCurrentWithTSChecksum adds new metadata version for the given GUN with an associated
// checksum for the timestamp it belongs to, to afford us transaction-like functionality
func (rdb RethinkDB) UpdateCurrentWithTSChecksum(gun, tsChecksum string, update MetaUpdate) error {
func (rdb RethinkDB) UpdateCurrentWithTSChecksum(gun data.GUN, tsChecksum string, update MetaUpdate) error {
Contributor

Thanks for catching this! 👍

// Find tuf files matching all passed channels
return func(db *gorm.DB) *gorm.DB {
return db.Joins("INNER JOIN channels_tuf_files ON tuf_files.id = channels_tuf_files.tuf_file_id").
Where("channel_id IN (?)", channelIDs).
Contributor

@cyli cyli Jun 16, 2017

I am likely just misunderstanding how this works due to lack of familiarity, but would this only enforce that the rows were in the right number of channels, as opposed to the channels specified? For instance, if there were 3 channels (Alice, Bob, and Carol), and I wanted to get the latest metadata for GUN and Timestamp that was in both Alice and Carol's channels, but the database had:

  1. id=5, GUN, Timestamp, channel=Alice (via the channel_tuf_files table)
  2. id=5, GUN, Timestamp, channel=Bob (via the channel_tuf_files table)

Would the scope consider that good because id=5 is in one of (Alice, Carol) and it is also in 2 distinct channels (Alice, Bob)?

Contributor Author

Sorry, gorm makes this harder to read than it should be.

The scope says:

  • Ignore any rows that aren't in the set of requested channels
  • Count how many distinct channels a tuf file is in
  • Because (tuf_file_id, channel_id) is unique, if number of unique channels == number of requested channels, the tuf file is in all requested channels.

For your specific example it would be something like this:

  • Query by GUN, TS in Alice and Carol channels
  • tuf_files is joined with channel_tuf_files
  • All rows in the joined table with channel_id != Alice or Carol are discarded
    • this is the part that specifically answers your question - there is no longer a (5, GUN, Timestamp, Bob) row in the result set
  • The result set only has (5, Carol)
  • COUNT of that set is 1
  • 2 channels were requested, so the result set returns empty
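In other words, the full scope is presumably along these lines (reconstructed from the description above; the actual code may differ in its details):

// TufFilesInChannels restricts a query to tuf_files present in all given channels.
func TufFilesInChannels(channels ...*Channel) func(*gorm.DB) *gorm.DB {
	channelIDs := make([]uint, 0, len(channels))
	for _, c := range channels {
		channelIDs = append(channelIDs, c.ID)
	}
	return func(db *gorm.DB) *gorm.DB {
		return db.Joins("INNER JOIN channels_tuf_files ON tuf_files.id = channels_tuf_files.tuf_file_id").
			Where("channel_id IN (?)", channelIDs).
			Group("tuf_files.id").
			// (tuf_file_id, channel_id) is unique, so this count equals the number
			// of requested channels only when the file is in every one of them.
			Having("COUNT(channel_id) = ?", len(channelIDs))
	}
}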

Contributor

Ah ok, thanks for explaining. I didn't realize it ran the query for every channel in the Where() and then joined.

}

// The Channel is added separately from the TUFFile record to prevent gorm from issuing an Update
if err = tx.Model(&tufFile).Association("Channels").Append(update.Channels).Error; err != nil {
Contributor

Apologies, the Append implementation in the Gorm code is kinda opaque - I think this will not error if a relationship already exists, since I think it eventually does an INSERT ... WHERE NOT EXISTS ..., but would it be possible to add a test case for when the metadata already exists in one of the channels specified, but not in the others?

Just to make sure this behavior is consistent should we ever choose to switch ORMs, or upgrade, etc.

Contributor Author

Good point, I will add a test for it.

Getting gorm to generate the right SQL for this PR took me the better part of a day; I was almost ready to rip it out and replace it... but of course that would've made this much harder to review.
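For reference, a rough shape for that test against the storage interface; the function name and fixture values here are made up, and the real test would live alongside testUpdateCurrentInChannel in server/storage/storage_test.go.

// Sketch: re-adding metadata that already exists in one of the requested
// channels should succeed and leave it retrievable from all of them.
func testUpdateCurrentExistingChannel(t *testing.T, store MetaStore) {
	var gun data.GUN = "docker.io/example/repo"
	update := MetaUpdate{
		Role:     data.CanonicalTargetsRole,
		Version:  1,
		Data:     []byte(`{}`),
		Channels: []*Channel{&Staged},
	}
	require.NoError(t, store.UpdateCurrent(gun, update))

	// Same gun/role/version, now associated with both Staged and Published;
	// the pre-existing Staged association should not cause an error.
	update.Channels = []*Channel{&Staged, &Published}
	require.NoError(t, store.UpdateCurrent(gun, update))

	_, _, err := store.GetCurrent(gun, data.CanonicalTargetsRole, &Staged, &Published)
	require.NoError(t, err)
}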

@@ -127,8 +164,9 @@ func (db *SQLStorage) UpdateMany(gun data.GUN, updates []MetaUpdate) error {
// called, version ordering in the updates list must be enforced
Contributor

Totally non-blocking observation, unrelated to this PR - can we just sort the updates and then call UpdateCurrent? cc @endophage

},
).First(&row)
}).
Order("version desc").Limit(1).First(&row)
Contributor

Non-blocking: I don't think it hurts anything, but is ordering necessary here? If the version is different, won't the checksum be different?

Contributor Author

I agree, I see no reason this is needed and can't remember why I changed it. It's potentially just a copy/paste error.

@@ -241,6 +289,7 @@ func (db *SQLStorage) Delete(gun data.GUN) error {
}
// if there weren't actually any records for the GUN, don't write
// a deletion change record.
// don't write a change if this isn't published yet
Contributor

@cyli cyli Jun 16, 2017

Was this supposed to correspond to some channel logic? (e.g. don't delete if the data isn't in the published channel?) If we're deleting everything in a GUN, should even staged metadata be deleted (since it probably won't be published in the future?)

Contributor Author

heh, yes this is a straggling comment - originally I scoped the delete but realized, as you point out, that when we delete a GUN we want to delete all staged copies as well

exists := db.Where("gun = ? and role = ? and version >= ?",
gun.String(), update.Role.String(), update.Version).First(&TUFFile{})

exists := db.Scopes(TufFilesInChannels(update.Channels...)).
Contributor

Non-blocking observations - I don't think any of these things are problems if we have garbage collection, but I just wanted to note them in case I missed something.

If we're updating the non-published channels, it's possible to add much older staged versions of metadata (if there was no staged metadata at that version or higher). Any attempt to publish this old staged metadata will fail anyway, so it's not an issue.

If someone publishes metadata with a really high version in a non-published channel, that will only prevent users from staging thresholded metadata for a limited amount of time: since it does not necessarily correspond to something in published, it will be garbage collected eventually.

Contributor Author

I think these are very good points, and we should add additional checks for those cases when actually implementing the threshold support.

CREATE TABLE `channels_tuf_files` (
`channel_id` INT(11) NOT NULL,
`tuf_file_id` INT(11) NOT NULL,
FOREIGN KEY (channel_id) REFERENCES channels(`id`) ON DELETE CASCADE,
Contributor

Out of curiosity, does gorm automatically add this to both the mysql and postgres table generation? (I didn't see anything in the model that corresponded to this directive.)

Contributor Author

I don't think so, I didn't use gorm to generate the table or migration

@cyli
Contributor

cyli commented Jun 16, 2017

Sorry for taking so long to review this @ecordell! This looks pretty good to me - my only 2 concerns are:

  1. Add a namespace field to tuf metadata storage #1110 (comment)
  2. Could we add the following test cases?
    • UpdateCurrent in storage_test.go (general behavior test):
      1. Adding to multiple channels at once, if the metadata doesn't exist in those channels, succeeds and Get* succeeds for the metadata we just updated if we specify all of those channels or one of those channels
      2. Adding to multiple channels at once, if the metadata exists in only some of those channels, succeeds and Get* succeeds for the metadata we just updated if we specify all of those new channels + all of the old channels, or just a subset of the channels
      3. Adding to one or multiple channels at once fails if the metadata already exists in all the channels specified
    • Get* in storage_test.go (general behavior test):
      1. Specifying Staging when we get will fail if the metadata is only in Published
      2. Test for Add a namespace field to tuf metadata storage #1110 (comment) - if there are 3 channels, and the metadata is in the first two, and we want to get metadata that is in the last two channels, getting fails.

I also had a question for the future thresholding implementations: Will there be an API to create more channels? Deleting channels? If channels correspond to key IDs, will they be auto-reaped if the key is removed?

@ecordell
Contributor Author

I also had a question for the future thresholding implementations: Will there be an API to create more channels? Deleting channels? If channels correspond to key IDs, will they be auto-reaped if the key is removed?

That's open for discussion, but IMO if possible we should keep the channels as an implementation detail and hide them from the user. Exposing them might require more careful consideration and possibly a TAP?

@cyli
Contributor

cyli commented Jun 19, 2017

That's open for discussion, but IMO if possible we should keep the channels as an implementation detail and hide them from the user. Exposing them might require more careful consideration and possibly a TAP?

Makes sense - seems like hiding from the user makes the CLI/API simpler, too.

@ecordell ecordell closed this Nov 17, 2023