RFD 149 PostgreSQL Schema For Manta buckets #112
Just a few general questions (hopefully I'm not treading dead ground here :P):

A number of non-primary key fields are marked unique, though the DDL does not contain a corresponding constraint for them.

If I've read the descriptions correctly, …

If any of the …

Any thoughts about putting the values of 'sharks' into their own table and just referencing it from …?
Thanks for these good questions, Jason.
I'm guessing you mean the …
If all of our data were resident on the same database, I would definitely look more into using foreign key constraints. In the buckets model, however, the database records for a bucket are not guaranteed to be co-located with the records for the objects in that bucket, so we would be left trying to manually enforce those constraints across multiple databases and servers, which would be a huge ordeal.
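To sketch the kind of constraint we'd be giving up (the table and column names here are illustrative only, not the RFD's exact DDL):

```sql
-- Only enforceable if the bucket row and its object rows were guaranteed to
-- live in the same PostgreSQL database, which the buckets design does not
-- guarantee across shards.
ALTER TABLE manta_bucket_object
    ADD CONSTRAINT manta_bucket_object_bucket_fk
    FOREIGN KEY (bucket_id) REFERENCES manta_bucket (id);
```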
The …
I originally thought about having a separate table for this information, but in some of our early design discussions we reached the conclusion that it might be best to avoid using …

I'll push some updates to the RFD soon to better cover some of these questions. Thanks again!
Thanks for writing this up! It looks great. I have several small suggestions below.

"Overview" section
That should probably say "The storage for buckets metadata in Manta".

"Schema" section
That should read "The proposed schema", I think.

In the various places where you define a name:
I'd add "as assigned by the end user", since "name" can mean a lot of different things.
That might lead one to think that names can't be reused across accounts. Maybe something like "Bucket names are unique within an account, but may be duplicated across accounts."
To make this more explicit, what do you think of adding: "Whenever a bucket is used by the API, the system would generally resolve an (owner, name) to a particular bucket id, and then use the bucket id for all subsequent operations within the API request." (That's just to emphasize to readers that it shouldn't be re-resolved within a request, since the result can change.)

Is there any reason to prefer …

I wonder if it would be useful to add metadata for operations that create a bucket, delete a bucket, or delete an object? I'm envisioning a "tag" of sorts that in practice could be a muskie zonename and request id. (It should probably be free-form, though, so that other components -- like an internal command-line tool for manipulating the namespace -- could record whatever makes sense for them.) That way it would be easy to go back and find the log entry for more details about an operation. I didn't suggest doing this for object create operations because that's the most expensive case -- since we have to store it indefinitely for all existing objects.

I'm torn about whether the …

Is there a particular reason to assign timestamps in PostgreSQL rather than having callers specify those?

I'm assuming we don't anticipate a need to distinguish between objects and directories?

I didn't understand this comment:
I think you'd need two different deleted or overwritten bucket versions, right? It seems like that should be impossible. Relatedly:
This feels dicey. It seems hard to enforce that we won't accidentally create dependencies on these being unique in the future. If it happens to work, that might be fine, but I'm not sure it makes sense to go out of our way to support it. I think we could make the constraint violation clearer to diagnose. There are a few other fields we either have today or may want in the future:
"Queries" sectionFor all of these, I'd suggest mentioning what's expected to be given. In particular, it might not be obvious to a reader that when reading a bucket, you'd generally have the owner uuid and name (or why that would be the case). So they might wonder why the lookup isn't by "id". I would also emphasize for each one that there's an index that should be able to efficiently satisfy queries on each of the It might also be worth talking about sharding at some point so that people realize what things must be on the same shard and what things may not be. (That might preempt future questions about, e.g., the foreign key approach.) It might be worth showing a query for writing and deleting an object conditional on a particular etag. It might be worth expanding on the bucket listing case, showing what subsequent pages look like and maybe mentioning that it's subject to further research (if we're still planning to see what other systems support efficiently here). There's a decent blog post on pagination in PostgreSQL you could link to. I know that's a lot of feedback, but it's mostly suggestions for clarification and a few small items. Thanks for doing this -- it looks great. |
Thanks for the great feedback! I'll be pushing some changes to reflect many of these suggestions.
In this case I think …
This is a very interesting point. My reasoning for allowing …
My thought is we do this to limit the potential for clock skew or latency to …
Correct, the presentation of directories by different tools that provide …
Sure, let me try to explain this a bit more. The first of my comments you …
I disagree with characterizing it as dicey. We're not really going out of our way …
I strongly advocate we foreclose on ever supporting object linking.
I wrote up a comment on MANTA-3878 on how I think online gc for buckets could …
Thanks, I'll revisit this and see what we might need to consider adding in the …
Good call. There's a lot still in flux with regards to this topic, but I'll try …
Thanks for the thoughtful replies!
I don't think there's anything wrong with …
I think either approach seems reasonable.
I don't think this probably matters too much either way -- I just wanted to raise it for consideration. I find the system as a whole a bit easier to reason about if the values are generally constructed in one place (which very much isn't the case today), and I can imagine use-cases where clients would want to set their own timestamp anyway or cases where we'd want the client timestamp instead of the database timestamp. I don't think the variance in clocks among the databases is likely to be any different than the variance in clocks among the clients (which are the same physical machines today), and I think timestamp values shouldn't really be used for correctness in Manta. Again, all that said, I don't think it matters too much either way.
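(For concreteness, the two approaches sketched in SQL, with the column name assumed for illustration:)

```sql
-- Option 1: database-assigned timestamp via a column default.
ALTER TABLE manta_bucket_object
    ADD COLUMN created timestamptz NOT NULL DEFAULT now();

-- Option 2: caller-assigned timestamp, passed explicitly by muskie
-- (or whatever client performs the insert).
INSERT INTO manta_bucket_object (id, owner, bucket_id, name, created)
    VALUES ($1, $2, $3, $4, $5);
```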
I agree that this situation is very unlikely but possible, and it makes sense to document the expected behavior. I think you're saying that it's a nice property if this part of Manta will do the right thing if we happen to have a uuid collision. However, if other parts of Manta would do very much the wrong thing (e.g., serve the wrong object data), then I wonder if it would be better to identify this situation and forbid it when possible, even if we can't catch it all the time because objects may be on different shards? At the very least, if the rationale is going to be "so that we can support duplicate uuids in this part of the system", I think we ought to include the major caveat that the rest of the system may do disastrous things in this situation.
It's pretty strong to foreclose on it forever (at least, without requiring a major database migration). I think we've found them a pretty useful feature. (Snaplinks were initially intended as a way for customers to build the equivalent of S3 object versions.) Normally I'd suggest deferring it, and we can definitely defer full support for links, but adding one boolean now would make that significantly easier for us in the future and that seems potentially worth the storage cost now.
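(For concreteness, the addition being discussed is just something along these lines, with the column name assumed:)

```sql
-- Placeholder for possible future link support; nothing would set it to true yet.
ALTER TABLE manta_bucket_object
    ADD COLUMN is_link boolean NOT NULL DEFAULT false;
```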
Okay. I haven't had a chance to look at this yet. Thanks!
Sorry for being late to the party here; just thought I'd add a couple of comments, specifically on the use of hstore. From the PG documentation on the hstore type: …
The hstore type is suggested to be used in the manta_bucket_object and manta_bucket_deleted_object relations to store shark and header information. For the shark information the key would be the datacenter name and the value would be the storage ID. I'm thinking about times when we've had bugs (MANTA-1735) and/or legitimate configuration options (single-DC muskie) where Muskie will insert two key/value pairs with the same 'key' into this table for a single object. It sounds like Postgres essentially will silently ignore one of the key/value pairs. I think hstore is a fine data type here, but we'll have to double check somewhere that we're not trying to insert duplicate keys into a single hstore field. I think this example might be working because of an inconsistency in the datacenter names (us-east-1 and us-east1).
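A quick psql illustration of that behavior (the storage-id values here are made up):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Duplicate keys in an hstore: only one key/value pair is kept, and the
-- documentation makes no guarantee about which one.
SELECT 'us-east-1=>1.stor.example.com, us-east-1=>2.stor.example.com'::hstore;

-- Distinct keys (note "us-east-1" vs "us-east1"): both pairs are kept, which is
-- likely why the example in the RFD appears to work.
SELECT 'us-east-1=>1.stor.example.com, us-east1=>2.stor.example.com'::hstore;
```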
I am ok with adding a boolean for this. I'll continue to argue we shouldn't ever actually implement links, but I agree the overhead to add it is very minimal so I'm fine with it.
Thanks so much for catching this! I am currently testing alternatives and will be pushing an update to reflect that once the testing is done.
This is for discussion of RFD 149 PostgreSQL Schema For Manta buckets.