RFD 149 PostgreSQL Schema For Manta buckets #112
Just a few general questions (hopefully I'm not treading dead ground here :P):

A number of non-primary key fields are marked unique, though the DDL does not contain a corresponding constraint for them.

If I've read the descriptions correctly, …

If any of the …

Any thoughts about putting the values of 'sharks' into their own table and just referencing it from …?
Thanks for these good questions, Jason.
I'm guessing you mean the …
If all of our data were resident on the same database, I would definitely look more into using foreign key constraints. In the buckets model, however, the database records for a bucket are not guaranteed to be co-located with the records for the objects in that bucket, so we would be left trying to manually enforce those constraints across multiple databases and servers, which would be a huge ordeal.
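To sketch the kind of constraint we'd be giving up (the table and column names here are illustrative only, not the RFD's exact DDL):

```sql
-- Only enforceable if the bucket row and its object rows were guaranteed to
-- live in the same PostgreSQL database, which the buckets design does not
-- guarantee across shards.
ALTER TABLE manta_bucket_object
    ADD CONSTRAINT manta_bucket_object_bucket_fk
    FOREIGN KEY (bucket_id) REFERENCES manta_bucket (id);
```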
The …
I originally thought about having a separate table for this information, but in some of our early design discussions we reached the conclusion that it might be best to avoid using …

I'll push some updates to the RFD soon to better cover some of these questions. Thanks again!
Thanks for writing this up! It looks great. I have several small suggestions below.

"Overview" section
That should probably say "The storage for buckets metadata in Manta".

"Schema" section
That should read "The proposed schema", I think.

In the various places where you define a name:
I'd add "as assigned by the end user", since "name" can mean a lot of different things.
That might lead one to think that names can't be reused across accounts. Maybe something like "Bucket names are unique within an account, but may be duplicated across accounts."
To make this more explicit, what do you think of adding: "Whenever a bucket is used by the API, the system would generally resolve an (owner, name) to a particular bucket id, and then use the bucket id for all subsequent operations within the API request." (That's just to emphasize to readers that it shouldn't be re-resolved within a request, since the result can change.)

Is there any reason to prefer …

I wonder if it would be useful to add metadata for operations that create a bucket, delete a bucket, or delete an object? I'm envisioning a "tag" of sorts that in practice could be a muskie zonename and request id. (It should probably be free-form, though, so that other components -- like an internal command-line tool for manipulating the namespace -- could record whatever makes sense for them.) That way it would be easy to go back and find the log entry for more details about an operation. I didn't suggest doing this for object create operations because that's the most expensive case -- since we have to store it indefinitely for all existing objects.

I'm torn about whether the …

Is there a particular reason to assign timestamps in PostgreSQL rather than having callers specify those?

I'm assuming we don't anticipate a need to distinguish between objects and directories?

I didn't understand this comment:
I think you'd need two different deleted or overwritten bucket versions, right? It seems like that should be impossible. Relatedly:
This feels dicey. It seems hard to enforce that we won't accidentally create dependencies on these being unique in the future. If it happens to work, that might be fine, but I'm not sure it makes sense to go out of our way to support it. I think we could make the constraint violation clearer to diagnose. There are a few other fields we either have today or may want in the future:
"Queries" sectionFor all of these, I'd suggest mentioning what's expected to be given. In particular, it might not be obvious to a reader that when reading a bucket, you'd generally have the owner uuid and name (or why that would be the case). So they might wonder why the lookup isn't by "id". I would also emphasize for each one that there's an index that should be able to efficiently satisfy queries on each of the It might also be worth talking about sharding at some point so that people realize what things must be on the same shard and what things may not be. (That might preempt future questions about, e.g., the foreign key approach.) It might be worth showing a query for writing and deleting an object conditional on a particular etag. It might be worth expanding on the bucket listing case, showing what subsequent pages look like and maybe mentioning that it's subject to further research (if we're still planning to see what other systems support efficiently here). There's a decent blog post on pagination in PostgreSQL you could link to. I know that's a lot of feedback, but it's mostly suggestions for clarification and a few small items. Thanks for doing this -- it looks great. |
Thanks for the great feedback! I'll be pushing some changes to reflect many of these suggestions.
In this case I think …
This is a very interesting point. My reasoning for allowing …
My thought is we do this to limit the potential for clock skew or latency to …
Correct, the presentation of directories by different tools that provide …
Sure, let me try to explain this a bit more. The first of my comments you …
I disagree with characterizing it as dicey. We're not really going out of our way …
I strongly advocate we foreclose on ever supporting object linking.
I wrote up a comment on MANTA-3878 on how I think online gc for buckets could …
Thanks, I'll revisit this and see what we might need to consider adding in the …
Good call. There's a lot still in flux with regards to this topic, but I'll try …
Thanks for the thoughtful replies!
I don't think there's anything wrong with …
I think either approach seems reasonable.
I don't think this probably matters too much either way -- I just wanted to raise it for consideration. I find the system as a whole a bit easier to reason about if the values are generally constructed in one place (which very much isn't the case today), and I can imagine use-cases where clients would want to set their own timestamp anyway or cases where we'd want the client timestamp instead of the database timestamp. I don't think the variance in clocks among the databases is likely to be any different than the variance in clocks among the clients (which are the same physical machines today), and I think timestamp values shouldn't really be used for correctness in Manta. Again, all that said, I don't think it matters too much either way.
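(For concreteness, the two approaches sketched in SQL, with the column name assumed for illustration:)

```sql
-- Option 1: database-assigned timestamp via a column default.
ALTER TABLE manta_bucket_object
    ADD COLUMN created timestamptz NOT NULL DEFAULT now();

-- Option 2: caller-assigned timestamp, passed explicitly by muskie
-- (or whatever client performs the insert).
INSERT INTO manta_bucket_object (id, owner, bucket_id, name, created)
    VALUES ($1, $2, $3, $4, $5);
```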
I agree that this situation is very unlikely but possible, and it makes sense to document the expected behavior. I think you're saying that it's a nice property if this part of Manta will do the right thing if we happen to have a uuid collision. However, if other parts of Manta would do very much the wrong thing (e.g., serve the wrong object data), then I wonder if it would be better to identify this situation and forbid it when possible, even if we can't catch it all the time because objects may be on different shards? At the very least, if the rationale is going to be "so that we can support duplicate uuids in this part of the system", I think we ought to include the major caveat that the rest of the system may do disastrous things in this situation.
It's pretty strong to foreclose on it forever (at least, without requiring a major database migration). I think we've found them a pretty useful feature. (Snaplinks were initially intended as a way for customers to build the equivalent of S3 object versions.) Normally I'd suggest deferring it, and we can definitely defer full support for links, but adding one boolean now would make that significantly easier for us in the future and that seems potentially worth the storage cost now.
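(For concreteness, the addition being discussed is just something along these lines, with the column name assumed:)

```sql
-- Placeholder for possible future link support; nothing would set it to true yet.
ALTER TABLE manta_bucket_object
    ADD COLUMN is_link boolean NOT NULL DEFAULT false;
```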
Okay. I haven't had a chance to look at this yet. Thanks!
Sorry for being late to the party here; just thought I'd add a couple of comments, specifically on the use of hstore. From the PG documentation on the hstore type: …
The hstore type is suggested to be used in the manta_bucket_object and manta_bucket_deleted_object relations to store shark and header information. For the shark information the key would be the datacenter name and the value would be the storage ID. I'm thinking about times when we've had bugs (MANTA-1735) and/or legitimate configuration options (single-DC muskie) where Muskie will insert two key/value pairs with the same 'key' into this table for a single object. It sounds like Postgres essentially will silently ignore one of the key/value pairs. I think hstore is a fine data type here, but we'll have to double check somewhere that we're not trying to insert duplicate keys into a single hstore field. I think this example might be working because of an inconsistency in the datacenter names (us-east-1 and us-east1).
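A quick psql illustration of that behavior (the storage-id values here are made up):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Duplicate keys in an hstore: only one key/value pair is kept, and the
-- documentation makes no guarantee about which one.
SELECT 'us-east-1=>1.stor.example.com, us-east-1=>2.stor.example.com'::hstore;

-- Distinct keys (note "us-east-1" vs "us-east1"): both pairs are kept, which is
-- likely why the example in the RFD appears to work.
SELECT 'us-east-1=>1.stor.example.com, us-east1=>2.stor.example.com'::hstore;
```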
I am ok with adding a boolean for this. I'll continue to argue we shouldn't ever actually implement links, but I agree the overhead to add it is very minimal so I'm fine with it.
Thanks so much for catching this! I am currently testing alternatives and will be pushing an update to reflect that once the testing is done.
This is for discussion of RFD 149 PostgreSQL Schema For Manta buckets.