rfc: array value encoding #16172
Conversation
Very glad to see this written up. So many RFCs being produced.

docs/RFCS/array_encoding.md, line 75 at r1 (raw file):
There are two limits on the maximum array size: the size of a range and the maximum size of a Raft command. The max range size is 64 MB; while that is configurable, we haven't significantly tested larger range sizes. The maximum size of a Raft command is also 64 MB, but that might be too large in practice (a rough capacity sketch follows this comment).

docs/RFCS/array_encoding.md, line 85 at r1 (raw file):
Where is the length of each dimension specified? Is it dependent on what non-NULL entries have been set?

docs/RFCS/array_encoding.md, line 173 at r1 (raw file):
I think it is definitely worth thinking through how very large arrays could be stored across multiple keys even if it is deemed unnecessary to implement right now.
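For a rough sense of scale, here's a back-of-the-envelope Go sketch of how many elements fit under that 64 MB ceiling; the 8-byte element width is an assumption (fixed-width integers), and header/bitmap overhead is ignored:

```go
package main

import "fmt"

func main() {
	const maxValueBytes = 64 << 20 // 64 MiB: the range / Raft command ceiling quoted above
	const elemWidth = 8            // assumed fixed-width 8-byte integer elements
	// Ignores the per-value header and NULL bitmap, so this is an upper bound.
	fmt.Printf("at most ~%d INT elements per array value\n", maxValueBytes/elemWidth)
	// Output: at most ~8388608 INT elements per array value
}
```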
docs/RFCS/array_encoding.md, line 34 at r1 (raw file):
Can you point to some typical use cases for arrays to help validate the design below? (in particular to help us understand whether the deviations from postgres behavior are likely to cause problems)

docs/RFCS/array_encoding.md, line 39 at r1 (raw file):
Would it be reasonable to treat array types as syntactic sugar for an interleaved table? I'm guessing that's not going to turn out to be very practical, but if it did work out it might present a different solution to the indexing problem. If the array is a separate table then you may be able to use regular indexes instead of introducing the new concept of GIN indexes, and interleaved tables reduce but do not eliminate the overhead of a separate table.

docs/RFCS/array_encoding.md, line 116 at r1 (raw file):
What is the use case for ordering by an array column? I'm surprised to see comparisons other than equality as a requirement before we get to indexing.

docs/RFCS/array_encoding.md, line 117 at r1 (raw file):
Is a key-encoding of array values going to be required? I don't think postgres supports "ordinary" indexes of array types, only GIN indexes, in which case we may not need a sortable encoding for the array as a whole.

docs/RFCS/array_encoding.md, line 129 at r1 (raw file):
This seems worrisome - in most programming languages I can think of where arrays are comparable, the dimension is not a factor (it is common for strings and one-dimensional arrays of characters to behave the same way, and strings of different lengths always do elementwise comparisons; a small sketch of that behavior follows this comment). If this is motivated by a not-yet-designed key encoding, I'd rather make comparisons of arrays with unequal dimensions an error for now instead of enshrining a behavior that may not be what we want.

docs/RFCS/array_encoding.md, line 146 at r1 (raw file):
It sounds like very large arrays would be pretty slow in postgres, given this encoding. Do we have evidence that people try to use very large array columns?
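As a reference point for the comparison semantics described above, here's a minimal Go sketch of elementwise (lexicographic) comparison of 1-D integer slices; it isn't tied to any particular key encoding:

```go
package main

import "fmt"

// compareInts does an elementwise (lexicographic) comparison: the first
// unequal element decides, and if one slice is a prefix of the other, the
// shorter one sorts first. Length and dimension are not compared up front.
func compareInts(a, b []int) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}

func main() {
	fmt.Println(compareInts([]int{1, 2}, []int{1, 2, 3})) // -1: the prefix sorts first
	fmt.Println(compareInts([]int{1, 9}, []int{1, 2, 3})) // 1: decided elementwise
}
```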
docs/RFCS/array_encoding.md, line 116 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
This is more of a limitation of the current codebase than a requirement, but at the moment all types need ordering operators because […]. I'd be happy to see a refactor to permit unordered types and to then punt on array ordering.

docs/RFCS/array_encoding.md, line 117 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Postgres actually does support ordinary array indexes, but perhaps we don't need to ever. It doesn't seem that useful.
Nice writeup, @justinj!

docs/RFCS/array_encoding.md, line 48 at r1 (raw file):
These limitations seem different enough that we should definitely look into some typical array use cases, as @bdarnell suggested above, to validate whether our array feature will still be useful to the people that want it.

docs/RFCS/array_encoding.md, line 105 at r1 (raw file):
Is the type system ready for this yet?

docs/RFCS/array_encoding.md, line 129 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Yeah, we should be careful about deviating too far from postgres here unless there's a strong reason. Our stated goal is that "when we do decide to deviate from Postgres, we won't do so in a way that makes it difficult to write queries that function identically in both Postgres and Cockroach". It's a pain for us to do so, but can be a big deal for any users trying to change databases.

docs/RFCS/array_encoding.md, line 155 at r1 (raw file):
Did you ever explain why it would anywhere in this doc? It'd be helpful for motivating the decision.
docs/RFCS/array_encoding.md, line 34 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
+1, it would be very useful to have some concrete examples of what people use arrays for; it's hard to decide on the tradeoffs otherwise.

docs/RFCS/array_encoding.md, line 47 at r1 (raw file):
"include the dimensionality" suggests we completely fix the size of the array. Should make it clear that this is just about the number of dimensions and not the dimensions themselves. docs/RFCS/array_encoding.md, line 85 at r1 (raw file):
Will these be varints (so we don't waste a lot of space for small arrays)?

docs/RFCS/array_encoding.md, line 86 at r1 (raw file):
The size of the bitmap can be determined from the lengths; this only needs to be a flag.
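A tiny Go sketch of that point: the NULL bitmap's size follows directly from the dimension lengths (one bit per element, rounded up to whole bytes), so the header only needs a has-NULLs flag:

```go
package main

import "fmt"

// nullBitmapBytes computes the NULL-bitmap size implied by the dimension
// lengths: one bit per element, rounded up to whole bytes.
func nullBitmapBytes(dimLengths []int) int {
	elems := 1
	for _, d := range dimLengths {
		elems *= d
	}
	return (elems + 7) / 8
}

func main() {
	fmt.Println(nullBitmapBytes([]int{10}))   // 2 bytes for 10 elements
	fmt.Println(nullBitmapBytes([]int{3, 4})) // 2 bytes for 12 elements
}
```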
docs/RFCS/array_encoding.md, line 47 at r1 (raw file): Previously, RaduBerinde wrote…
fyi

docs/RFCS/array_encoding.md, line 105 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
yes

docs/RFCS/array_encoding.md, line 116 at r1 (raw file): Previously, jordanlewis (Jordan Lewis) wrote…
I think it's worthwhile to shave this yak here and now so as to side-step the entire discussion of array sorting altogether.

docs/RFCS/array_encoding.md, line 129 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
See above about side-stepping that conversation for now.
docs/RFCS/array_encoding.md, line 39 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I think that this alternative would be tough to implement and space inefficient to boot. We really don't want to desugar at the IR layer, because the array concatenation operator would turn queries that look like they're operating on single keys into queries that are operating on multiple keys (and these operations may be wrapped in […]).
docs/RFCS/array_encoding.md, line 34 at r1 (raw file): Previously, RaduBerinde wrote…
+1 It might be worth just asking the people who have commented on the ARRAY GitHub issue what concrete things they're thinking of storing there. It's hard to evaluate a design without knowing what the intended uses are.
docs/RFCS/array_encoding.md, line 34 at r1 (raw file): Previously, cuongdo (Cuong Do) wrote…
I commented on the issue and it's already gotten some interesting replies!

docs/RFCS/array_encoding.md, line 39 at r1 (raw file): Previously, eisenstatdavid (David Eisenstat) wrote…
I'll have to defer to Eisen for the difficulties with this approach, but I will agree that it's a very appealing approach for the reasons you mentioned (GIN-style indexes for free would be amazing).

docs/RFCS/array_encoding.md, line 47 at r1 (raw file): Previously, knz (kena) wrote…
👍 I'll add a note for clarity.

docs/RFCS/array_encoding.md, line 48 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I'm not sure about the dimensionality constraint, but I've looked at a couple (n=2) Postgres drivers and they both ignored the lower bounds. Anecdotally, based on what I've seen, using multidimensional arrays at all is a somewhat uncommon use-case. I agree these arguments aren't that convincing though, so I'll see if I can find some more info on this.

docs/RFCS/array_encoding.md, line 85 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
It's specified by the result-valued array, which contains the […].

docs/RFCS/array_encoding.md, line 85 at r1 (raw file): Previously, RaduBerinde wrote…
That's a good idea. I'm not familiar with the tradeoffs of using varints for something like this; do you have an opinion? Off the top of my head (and based on feedback we've gotten) I would think that (very) small arrays are a common enough case that this would be a good idea.

docs/RFCS/array_encoding.md, line 86 at r1 (raw file): Previously, RaduBerinde wrote…
👍

docs/RFCS/array_encoding.md, line 116 at r1 (raw file): Previously, knz (kena) wrote…
Very happy to side-step that discussion.

docs/RFCS/array_encoding.md, line 146 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Posts like this imply to me that they are indeed pretty slow if you're using them like arrays from other languages. There are also some anecdotes on this issue.

docs/RFCS/array_encoding.md, line 155 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Agreed, but I'm going to attempt to completely avoid the issue for now and remove this stuff.

docs/RFCS/array_encoding.md, line 173 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
I'll stop by core office hours today and try to get some info to include in this RFC.
docs/RFCS/array_encoding.md, line 34 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
So far what I've gathered from this (admittedly very small dataset) is:
docs/RFCS/array_encoding.md, line 86 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
After talking to @bdarnell, he suggested that having the offset rather than a flag gives us some freedom to expand on the data available in this header in the future if we see fit (such as by adding an index, or something). @bdarnell: do you (or anyone else) have any advice on how this could be implemented to be forwards-compatible without being too bloated? For example, one option could be that we tag each header entry with a […].

docs/RFCS/array_encoding.md, line 173 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
To follow up on this, @bdarnell suggested (and correct me if I'm misrepresenting what you said) that one option would be to later introduce a header in the very first kv entry that could provide pointers to the various other kv entries. This should be fine to retrofit if need be, as long as we leave ourselves the flexibility in the header format to adjust it later.
docs/RFCS/array_encoding.md, line 86 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
Storing the offset instead of a boolean has the advantage that if we add new stuff to the header in arrayV2, a node that only understands arrayV1 can still find the data (but not the new stuff we added, so whether that's a good or bad thing depends on whether the new stuff is optional or not. For example, we couldn't use this flexibility to introduce the multi-data-block stuff discussed below). I wouldn't get too clever about future-proofing here; just make sure we have the option of moving to an entirely different format in the future (which we can probably do by just allocating a new type tag, but if we want to do this within the one array type tag we might want to add a version number).

docs/RFCS/array_encoding.md, line 173 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
Instead of adjusting it later, if we can identify the eventual format we'll want, it would be nice to encode things from the beginning in the form that single-block arrays will use in the multi-block world. As a strawman, we could store the number of data blocks in the header somewhere, which would just be a constant 1 for now. When we support the new format, we'd set that field to the number of blocks, which would also serve as the version number to indicate that the record also contains whatever new fields we need to introduce for multi-block storage.
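A rough Go sketch of that strawman header; the field names, widths, and layout here are hypothetical illustrations, not the encoding the RFC actually specifies:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// arrayHeader is a hypothetical header in which numBlocks is written from day
// one: it is a constant 1 for single-block arrays, and a later multi-block
// format can raise it, doubling as the signal that extra fields follow.
type arrayHeader struct {
	hasNulls  bool
	numDims   uint8
	numBlocks uint64
}

func (h arrayHeader) encode() []byte {
	var flags byte
	if h.hasNulls {
		flags |= 1 // bit 0: a NULL bitmap follows the header
	}
	buf := []byte{flags, h.numDims}
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], h.numBlocks)
	return append(buf, tmp[:n]...)
}

func main() {
	h := arrayHeader{hasNulls: false, numDims: 1, numBlocks: 1}
	fmt.Printf("% x\n", h.encode()) // 00 01 01
}
```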
docs/RFCS/array_encoding.md, line 86 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
How do we handle a world where there are non-optional changes made (like multi-block storage) and there are nodes that only understand arrayV1? Is there an existing mechanism in Cockroach to deal with that kind of problem? I assume we have to handle this problem when we add new datatypes in the first place? After discussing with @jordanlewis we landed on the idea of just replacing the boolean with a bitmap that would allow us to add new flags in the future (such as, "this array is multi-block and the block info is included in the header"). This doesn't solve the problem of old nodes and new data, but given that we have no hope in cases where the changes to the data are required anyway I'm not especially concerned about it (but should I be?).
docs/RFCS/array_encoding.md, line 86 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
The simplest way to handle this (from a migration perspective) would be to treat arrayV2 as a completely separate type, and keep using the old format indefinitely unless and until there's a schema change to force us to switch. That's similar to what we did with some of the index format changes just before beta (you'd keep using the old format until you dropped and recreated the index).
This RFC is entering the final comment period.
LGTM |
Did you settle on introducing an entirely new type for the arrayV2 encoding, or are you planning on baking a version number/bit into arrayV1's header?

docs/RFCS/array_encoding.md, line 47 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
This isn't formatting right! Markdown is interpreting your leading hyphen […].

docs/RFCS/array_encoding.md, line 85 at r1 (raw file): Previously, justinj (Justin Jaffray) wrote…
Varints are slower to decode and waste a bit per byte. But given that most arrays will be small (< 128 elements), it seems like you'll want varints here to save three bytes on every dimension length (a small size sketch follows this comment).

docs/RFCS/array_encoding.md, line 25 at r2 (raw file):
nit: fence these with "```sql" for syntax highlighting.

docs/RFCS/array_encoding.md, line 76 at r2 (raw file):
If I'm understanding correctly, […]
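To illustrate the varint point above, here's a small Go sketch using the standard library's uvarint encoding; the specific lengths are just examples:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// Dimension lengths below 128 fit in a single uvarint byte, versus four
	// bytes for a fixed-width uint32; larger lengths grow gradually.
	for _, length := range []uint64{5, 127, 128, 1 << 20} {
		var buf [binary.MaxVarintLen64]byte
		n := binary.PutUvarint(buf[:], length)
		fmt.Printf("length %7d: %d varint byte(s) vs 4 fixed\n", length, n)
	}
}
```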
We're planning on baking in the version number, using those reserved bits, but really we're just keeping the door open to go either route. Simpler changes could up the version number, and if need be we could still introduce a new type.

docs/RFCS/array_encoding.md, line 47 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
🤦‍♂️

docs/RFCS/array_encoding.md, line 85 at r1 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Sounds good. I think that small arrays are indeed a common case.

docs/RFCS/array_encoding.md, line 25 at r2 (raw file): Previously, benesch (Nikhil Benesch) wrote…
Done.

docs/RFCS/array_encoding.md, line 76 at r2 (raw file): Previously, benesch (Nikhil Benesch) wrote…
There's an existing […]
LGTM too, but don't ascribe me much weight.
Thanks for the reviews everyone!
In accordance with cockroachdb#16172, we will need to disallow ordering by array columns to keep open the door to decide on how we should order them later on.

I debated making the check a method on the Type interface, but I decided that was overkill for this single check on a single Type. Regardless, the check is isolated into a single function so it can be changed easily. This approach doesn't play well with wrapped Oid types like INT2VECTOR and INT4VECTOR, but I'm not too concerned about those because users can't construct them, and since they only exist for pg_catalog compatibility anyway we might as well allow ordering by them as in Postgres.

We currently don't support `IN` or `MAX`/`MIN` on arrays, so that's not something we need to be concerned about. As far as I'm aware there are no other situations where the ordering of arrays is exposed to users. I tried adding a panic to TArray's Compare method and the only test failure was a test calling Compare explicitly as an assertion. This is technically a breaking change which should be announced to users once it is released, though I doubt anyone is making use of this.
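A minimal Go sketch of the kind of isolated check that commit describes; the type and function names are illustrative stand-ins, not CockroachDB's actual definitions:

```go
package main

import "fmt"

// Type and TArray are illustrative stand-ins for the SQL type representations.
type Type interface{ String() string }

type TInt struct{}

func (TInt) String() string { return "INT" }

type TArray struct{ Elem Type }

func (t TArray) String() string { return t.Elem.String() + "[]" }

// columnTypeIsOrderable is the single isolated check: arrays are rejected from
// ORDER BY, indexes, and primary keys so their ordering semantics can be
// decided later without a breaking change.
func columnTypeIsOrderable(t Type) bool {
	_, isArray := t.(TArray)
	return !isArray
}

func main() {
	for _, t := range []Type{TInt{}, TArray{Elem: TInt{}}} {
		if columnTypeIsOrderable(t) {
			fmt.Printf("ORDER BY %s: allowed\n", t)
		} else {
			fmt.Printf("ORDER BY %s: unsupported\n", t)
		}
	}
}
```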
This commit introduces support for an ARRAY column type as described in the RFC (cockroachdb#16172) with the following limitations:

* NULL values in arrays are currently not supported
* Collated strings as array contents are currently not supported
* No additional operations on arrays have been implemented
* Arrays are only 1-dimensional

This commit also disallows using arrays as primary keys or as an indexed column, however it's not clear to me yet if there are other situations in which a value could become key-encoded. I think more testing is in order, but I wanted to get some eyes on this.
This commit introduces support for an ARRAY column type as described in the RFC (cockroachdb#16172) with the following limitations:

* NULL values in arrays are currently not supported
* Collated strings as array contents are currently not supported
* No additional operations on arrays have been implemented
* Arrays are only 1-dimensional

This commit also disallows using arrays as primary keys or as an indexed column.
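For context, exercising the 1-D ARRAY column type those commits describe from a Go client might look roughly like this; the lib/pq driver usage, connection string, and table schema are assumptions for illustration, not part of the commits:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	"github.com/lib/pq"
)

func main() {
	// Placeholder connection string for a local CockroachDB node.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A 1-D STRING[] column; note that arrays can't be primary keys or indexed.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS posts (id INT PRIMARY KEY, tags STRING[])`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`INSERT INTO posts VALUES ($1, $2)`, 1, pq.Array([]string{"sql", "arrays"})); err != nil {
		log.Fatal(err)
	}

	var tags []string
	if err := db.QueryRow(`SELECT tags FROM posts WHERE id = $1`, 1).Scan(pq.Array(&tags)); err != nil {
		log.Fatal(err)
	}
	fmt.Println(tags) // [sql arrays]
}
```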