
rfc: array value encoding #16172

Merged
merged 1 commit into cockroachdb:master on Jun 15, 2017

Conversation

justinj
Contributor

@justinj justinj commented May 26, 2017

No description provided.

@justinj justinj requested review from knz and cuongdo May 26, 2017 17:30

@petermattis
Collaborator

Very glad to see this written up. So many RFCs being produced.


docs/RFCS/array_encoding.md, line 75 at r1 (raw file):

Arrays will be encoded into a single kv entry. This places a restriction on the
maximum size of an array, but significantly reduces the complexity of
implementation.

There are 2 limits on the maximum array size: the size of a range and the maximum size of a Raft command. The max range size is 64 MB. While that is configurable, we haven't significantly tested larger range sizes. The maximum size of a Raft command is also 64 MB, but that might be too large in practice.
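
For scale, a back-of-the-envelope Go sketch of how many elements fit under a 64 MB value (the 8-byte element width is an assumption for INT elements, and header overhead is ignored):

```go
package main

import "fmt"

func main() {
	const maxValueBytes = 64 << 20 // 64 MB, the limit described above
	const intWidth = 8             // assumed bytes per INT element
	fmt.Println(maxValueBytes / intWidth) // 8388608 elements, ~8.4M
}
```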


docs/RFCS/array_encoding.md, line 85 at r1 (raw file):

* Element type tag
* The number of dimensions in the array as a uint8
* Length of each dimension, as uint32s

Where is the length of each dimension specified? Is it dependent on what non-NULL entries have been set?
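
For concreteness, a minimal Go sketch of the header layout quoted above, including the data-offset field discussed further down; the function name, big-endian byte order, and exact layout are illustrative assumptions, not the final format:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeArrayHeader sketches the proposed header: a one-byte element
// type tag, the number of dimensions as a uint8, each dimension's
// length as a uint32, and a uint32 offset to the data (0 when there is
// no NULLs bitmap). Byte order and the tag value are assumptions.
func encodeArrayHeader(elemTag byte, dimLengths []uint32, dataOffset uint32) []byte {
	buf := []byte{elemTag, byte(len(dimLengths))}
	for _, l := range dimLengths {
		buf = binary.BigEndian.AppendUint32(buf, l)
	}
	return binary.BigEndian.AppendUint32(buf, dataOffset)
}

func main() {
	// A hypothetical 2x3 INT[][] value with no NULLs bitmap.
	fmt.Printf("% x\n", encodeArrayHeader(0x01, []uint32{2, 3}, 0))
	// Output: 01 02 00 00 00 02 00 00 00 03 00 00 00 00
}
```

(The discussion below ends up replacing the fixed-width lengths with varints and the offset with a flags byte.)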


docs/RFCS/array_encoding.md, line 173 at r1 (raw file):

* Are there any considerations for the encoding of arrays that would make it
easier for us to migrate to a multiple-kv-entry-per-array setup in the
future?

I think it is definitely worth thinking through how very large arrays could be stored across multiple keys even if it is deemed unnecessary to implement right now.



@bdarnell
Contributor

docs/RFCS/array_encoding.md, line 34 at r1 (raw file):

# Motivation

Arrays are a popular feature of Postgres and are a frequently requested feature

Can you point to some typical use cases for arrays to help validate the design below? (in particular to help us understand whether the deviations from postgres behavior are likely to cause problems)


docs/RFCS/array_encoding.md, line 39 at r1 (raw file):

While the proper relational way to handle the problem arrays solve is to
create a separate table and do a join, in many cases it can be easier,
cleaner, and faster to simply have an array as a column type.

Would it be reasonable to treat array types as syntactic sugar for an interleaved table? I'm guessing that's not going to turn out to be very practical, but if it did work out it might present a different solution to the indexing problem. If the array is a separate table then you may be able to use regular indexes instead of introducing the new concept of GIN indexes, and interleaved tables reduce but do not eliminate the overhead of a separate table.


docs/RFCS/array_encoding.md, line 116 at r1 (raw file):

Comparison needs to be implemented at this stage in order to support
operations such as using arrays as an `ORDER BY` field or as part of `IN`

What is the use case for ordering by an array column? I'm surprised to see comparisons other than equality as a requirement before we get to indexing.


docs/RFCS/array_encoding.md, line 117 at r1 (raw file):

Comparison needs to be implemented at this stage in order to support
operations such as using arrays as an `ORDER BY` field or as part of `IN`
expressions. One important consideration is that the future key-encoding of

Is a key-encoding of array values going to be required? I don't think Postgres supports "ordinary" indexes of array types, only GIN indexes, in which case we may not need a sortable encoding for the array as a whole.


docs/RFCS/array_encoding.md, line 129 at r1 (raw file):

`NULL`, the result of comparison is `NULL`.

Not being strictly lexicographic is a departure from how Postgres handles

This seems worrisome - in most programming languages I can think of where arrays are comparable, the dimension is not a factor (it is common for strings and one-dimensional arrays of characters to behave the same way, and strings of different lengths always do elementwise comparisons).

If this is motivated by a not-yet-designed key encoding, I'd rather make comparisons of arrays with unequal dimensions an error for now instead of enshrining a behavior that may not be what we want.
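
To make the two behaviors under discussion concrete, here is a Go sketch contrasting strictly lexicographic (elementwise) comparison with the length-first rule the draft describes, for 1-D int arrays; illustrative only, not CockroachDB code:

```go
package main

import "fmt"

// cmpLexicographic compares two 1-D int slices purely elementwise, the
// way strings compare (the behavior argued for above); a shorter array
// that is a prefix of a longer one sorts first.
func cmpLexicographic(a, b []int) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}

// cmpLengthFirst compares lengths before contents (the length-first
// behavior the RFC draft described).
func cmpLengthFirst(a, b []int) int {
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return cmpLexicographic(a, b)
}

func main() {
	a, b := []int{1, 9}, []int{2}
	fmt.Println(cmpLexicographic(a, b)) // -1: 1 < 2 decides immediately
	fmt.Println(cmpLengthFirst(a, b))   // 1: the longer array sorts after
}
```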


docs/RFCS/array_encoding.md, line 146 at r1 (raw file):

* Restricting storage to a single kv entry per array places a hard limit on
the maximum size of a single array (64MB). This may be surprising to users who
expect arrays to allow storing a lot of data, or who are used to Postgres's

It sounds like very large arrays would be pretty slow in postgres, given this encoding. Do we have evidence that people try to use very large array columns?



@jordanlewis
Member

docs/RFCS/array_encoding.md, line 116 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What is the use case for ordering by an array column? I'm surprised to see comparisons other than equality as a requirement before we get to indexing.

This is more of a limitation of the current codebase than a requirement, but at the moment all types need ordering operators because IN expressions sort their RHS before searching.

I'd be happy to see a refactor to permit unordered types and to then punt on array ordering.
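
A minimal Go sketch of the mechanism described here: the RHS is sorted once and then binary-searched per lookup, which is why the element type currently needs ordering operators. Illustrative, not the actual evaluator:

```go
package main

import (
	"fmt"
	"sort"
)

// evalIn sketches why IN currently requires an ordering on the type:
// the right-hand side is sorted once, then each lookup is a binary
// search over the sorted values.
func evalIn(needle int, rhs []int) bool {
	sort.Ints(rhs) // this sort is what forces the type to be orderable
	i := sort.SearchInts(rhs, needle)
	return i < len(rhs) && rhs[i] == needle
}

func main() {
	fmt.Println(evalIn(3, []int{5, 3, 1})) // true
	fmt.Println(evalIn(4, []int{5, 3, 1})) // false
}
```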


docs/RFCS/array_encoding.md, line 117 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is a key-encoding of array values going to required? I don't think postgres supports "ordinary" indexes of array types, only GIN indexes, in which case we may not need a sortable encoding for the array as a whole.

Postgres actually does support ordinary array indexes, but perhaps we don't need to ever. It doesn't seem that useful.

jordan=# create table a(foo int[] primary key);
CREATE TABLE
jordan=# \d a
        Table "public.a"
 Column |   Type    | Modifiers
--------+-----------+-----------
 foo    | integer[] | not null
Indexes:
    "a_pkey" PRIMARY KEY, btree (foo)


@a-robinson
Contributor

Nice writeup, @justinj!


docs/RFCS/array_encoding.md, line 48 at r1 (raw file):

Unlike Postgres, we include the
dimensionality of an array in the column type, so a given column can only
contain arrays with the same number of dimensions. We also do not support
the Postgres feature of lower bounds on dimensions other than 1.

These limitations seem different enough that we should definitely look into some typical array use cases, as @bdarnell suggested above, to validate whether our array feature will still be useful to the people that want it.


docs/RFCS/array_encoding.md, line 105 at r1 (raw file):

The existing in-memory array values will need to be augmented to support
multidimensionality and elements besides ints and strings.

Is the type system ready for this yet?


docs/RFCS/array_encoding.md, line 129 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This seems worrisome - in most programming languages I can think of where arrays are comparable, the dimension is not a factor (it is common for strings and one-dimensional arrays of characters to behave the same way, and strings of different lengths always do elementwise comparisons).

If this is motivated by a not-yet-designed key encoding, I'd rather make comparisons of arrays with unequal dimensions an error for now instead of enshrining a behavior that may not be what we want.

Yeah, we should be careful about deviating too far from postgres here unless there's a strong reason. Our stated goal is that "when we do decide to deviate from Postgres, we won't do so in a way that makes it difficult to write queries that function identically in both Postgres and Cockroach". It's a pain for us to do so, but can be a big deal for any users trying to change databases.


docs/RFCS/array_encoding.md, line 155 at r1 (raw file):

(and being strictly lexicographic is arguably more intuitive than going by
length first). This will make creating the key-encoding later on more
difficult, however.

Did you ever explain why it would anywhere in this doc? It'd be helpful for motivating the decision.



@RaduBerinde
Member

docs/RFCS/array_encoding.md, line 34 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Can you point to some typical use cases for arrays to help validate the design below? (in particular to help us understand whether the deviations from postgres behavior are likely to cause problems)

+1. It would be very useful to have some concrete examples of what people use arrays for; it's hard to decide on the tradeoffs otherwise.


docs/RFCS/array_encoding.md, line 47 at r1 (raw file):

Our arrays, like Postgres' arrays, are 1-indexed, homogenous, and rectangular.
Arrays can have at most 16 dimensions (this limit is arbitrary -
Postgres restricts to 6 dimensions). Unlike Postgres, we include the

"include the dimensionality" suggests we completely fix the size of the array. Should make it clear that this is just about the number of dimensions and not the dimensions themselves.


docs/RFCS/array_encoding.md, line 85 at r1 (raw file):

* Element type tag
* The number of dimensions in the array as a uint8
* Length of each dimension, as uint32s

Will these be varints (so we don't waste a lot of space for small arrays)?


docs/RFCS/array_encoding.md, line 86 at r1 (raw file):

* The number of dimensions in the array as a uint8
* Length of each dimension, as uint32s
* An offset to the data (or 0 if there's no NULLs bitmap), as a uint32

The size of the bitmap can be determined from the lengths, this only needs to be a flag.
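
A Go sketch of that observation: one bit per element, rounded up to whole bytes, computed entirely from the header's dimension lengths (illustrative):

```go
package main

import "fmt"

// nullBitmapSize derives the NULLs bitmap size purely from the
// per-dimension lengths already in the header, which is why the header
// only needs a has-bitmap flag rather than an offset.
func nullBitmapSize(dimLengths []uint32) int {
	elems := uint64(1)
	for _, l := range dimLengths {
		elems *= uint64(l)
	}
	return int((elems + 7) / 8) // one bit per element, whole bytes
}

func main() {
	fmt.Println(nullBitmapSize([]uint32{2, 3})) // 6 elements -> 1 byte
	fmt.Println(nullBitmapSize([]uint32{100}))  // 100 elements -> 13 bytes
}
```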



@knz
Contributor

knz commented May 30, 2017

docs/RFCS/array_encoding.md, line 47 at r1 (raw file):

Previously, RaduBerinde wrote…

"include the dimensionality" suggests we completely fix the size of the array. Should make it clear that this is just about the number of dimensions and not the dimensions themselves.

fyi
"dimensionality" = number of dimensions
"shape" = number of dimensions + size in each dimension


docs/RFCS/array_encoding.md, line 105 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Is the type system ready for this yet?

yes


docs/RFCS/array_encoding.md, line 116 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

This is more of a limitation of the current codebase than a requirement, but at the moment all types need ordering operators because IN expressions sort their RHS before searching.

I'd be happy to see a refactor to permit unordered types and to then punt on array ordering.

I think it's worthwhile to shave this yak here and now so as to side-step the entire discussion of array sorting altogether.


docs/RFCS/array_encoding.md, line 129 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Yeah, we should be careful about deviating too far from postgres here unless there's a strong reason. Our stated goal is that "when we do decide to deviate from Postgres, we won't do so in a way that makes it difficult to write queries that function identically in both Postgres and Cockroach". It's a pain for us to do so, but can be a big deal for any users trying to change databases.

See above about side-stepping that conversation for now.



@eisenstatdavid

docs/RFCS/array_encoding.md, line 39 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Would it be reasonable to treat array types as syntactic sugar for an interleaved table? I'm guessing that's not going to turn out to be very practical, but if it did work out it might present a different solution to the indexing problem. If the array is a separate table then you may be able to use regular indexes instead of introducing the new concept of GIN indexes, and interleaved tables reduce but do not eliminate the overhead of a separate table.

I think that this alternative would be tough to implement and space inefficient to boot. We really don't want to desugar at the IR layer, because the array concatenation operator would turn queries that look like they're operating on single keys into queries that are operating on multiple keys (and these operations may be wrapped in IF and other control flow). We could desugar at the encoding/decoding layer instead, which is roughly equivalent to supporting multiple-KV arrays but may hide some of the benefits of interleaving for query optimization.



@cuongdo
Contributor

cuongdo commented May 30, 2017

docs/RFCS/array_encoding.md, line 34 at r1 (raw file):

Previously, RaduBerinde wrote…

+1 it would be very useful to have some concrete examples of what people use arrays for, it's hard to decide on the tradeoffs otherwise.

+1 It might be worth just asking the people who have commented on the ARRAY GitHub issue what concrete things they're thinking of storing there. It's hard to evaluate a design without knowing what the intended uses are.



@justinj
Contributor Author

justinj commented May 31, 2017

docs/RFCS/array_encoding.md, line 34 at r1 (raw file):

Previously, cuongdo (Cuong Do) wrote…

+1 It might be worth just asking the people who have commented on the ARRAY GitHub issue what concrete things they're thinking of storing there. It's hard to evaluate a design without knowing what the intended uses are.

I commented on the issue and it's already gotten some interesting replies!


docs/RFCS/array_encoding.md, line 39 at r1 (raw file):

Previously, eisenstatdavid (David Eisenstat) wrote…

I think that this alternative would be tough to implement and space inefficient to boot. We really don't want to desugar at the IR layer, because the array concatenation operator would turn queries that look like they're operating on single keys into queries that are operating on multiple keys (and these operations may be wrapped in IF and other control flow). We could desugar at the encoding/decoding layer instead, which is roughly equivalent to supporting multiple-KV arrays but may hide some of the benefits of interleaving for query optimization.

I'll have to defer to Eisen for the difficulties with this approach, but I will agree that it's a very appealing approach for the reasons you mentioned (GIN-style indexes for free would be amazing).


docs/RFCS/array_encoding.md, line 47 at r1 (raw file):

Previously, knz (kena) wrote…

fyi
"dimensionality" = number of dimensions
"shape" = number of dimensions + size in each dimension

👍 I'll add a note for clarity.


docs/RFCS/array_encoding.md, line 48 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Unlike Postgres, we include the
dimensionality of an array in the column type, so a given column can only
contain arrays with the same number of dimensions. We also do not support
the Postgres feature of lower bounds on dimensions other than 1.

These limitations seem different enough that we should definitely look into some typical array use cases, as @bdarnell suggested above, to validate whether our array feature will still be useful to the people that want it.

I'm not sure about the dimensionality constraint, but I've looked at a couple (n=2) Postgres drivers and they both ignored the lower bounds. Anecdotally, based on what I've seen, using multidimensional arrays at all is a somewhat uncommon use-case. I agree these arguments aren't that convincing though, so I'll see if I can find some more info on this.


docs/RFCS/array_encoding.md, line 85 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Where is the length of each dimension specified? Is it dependent on what non-NULL entries have been set?

It's specified by the result-valued array, which contains the NULLs as values. We do omit the NULL values in the data here, which I should clarify.


docs/RFCS/array_encoding.md, line 85 at r1 (raw file):

Previously, RaduBerinde wrote…

Will these be varints (so we don't waste a lot of space for small arrays)?

That's a good idea. I'm not familiar with the tradeoffs of using varints for something like this; do you have an opinion? Off the top of my head (and based on feedback we've gotten), I would think that (very) small arrays are a common enough case that this would be a good idea.


docs/RFCS/array_encoding.md, line 86 at r1 (raw file):

Previously, RaduBerinde wrote…

The size of the bitmap can be determined from the lengths, this only needs to be a flag.

👍


docs/RFCS/array_encoding.md, line 116 at r1 (raw file):

Previously, knz (kena) wrote…

I think it's worthwhile to shave this yak here and now so as to side-step the entire discussion of array sorting altogether.

Very happy to side-step that discussion.


docs/RFCS/array_encoding.md, line 146 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It sounds like very large arrays would be pretty slow in postgres, given this encoding. Do we have evidence that people try to use very large array columns?

Posts like this imply to me that they are indeed pretty slow if you're using them like arrays from other languages. There are also some anecdotes on this issue.


docs/RFCS/array_encoding.md, line 155 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Did you ever explain why it would anywhere in this doc? It'd be helpful for motivating the decision.

Agreed, but I'm going to attempt to avoid the issue entirely for now and remove this stuff.


docs/RFCS/array_encoding.md, line 173 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think it is definitely worth thinking through how very large arrays could be stored across multiple keys even if it is deemed unnecessary to implement right now.

I'll stop by core office hours today and try to get some info to include in this RFC.



@justinj
Contributor Author

justinj commented Jun 5, 2017

docs/RFCS/array_encoding.md, line 34 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

I commented on the issue and it's already gotten some interesting replies!

So far what I've gathered from this (admittedly very small dataset) is:

  • non-GIN indexing is not very important
  • GIN indexing is important, but there are still use cases without it
  • some users would have large arrays, but many would have small(-ish) ones.
  • lots of the desire for arrays is driven by performance, which we already somewhat solve via interleaved tables
  • indexing into the arrays to retrieve individual elements does not seem particularly common
  • it seems very common for people to use arrays as sets (to the point where I almost feel as if a set datatype would be a more valuable addition, if not for postgres compatibility)

docs/RFCS/array_encoding.md, line 86 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

👍

After talking to @bdarnell, he suggested that having the offset rather than a flag gives us some freedom to expand on the data available in this header in the future if we see fit (such as by adding an index, or something). @bdarnell: do you (or anyone else) have any advice on how this could be implemented to be forwards-compatible without being too bloated? For example one option could be that we tag each header entry with a kind and its length, but I'm not sure if that's overkill for what we're doing here.


docs/RFCS/array_encoding.md, line 173 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

I'll stop by core office hours today and try to get some info to include in this RFC.

To follow up on this, @bdarnell suggested (and correct me if I'm misrepresenting what you said) that one option would be to introduce, later on, a header in the very first kv entry that could provide pointers to the various other kv entries. This should be fine to retrofit if need be, as long as we leave ourselves the flexibility in the header format to adjust it later.



@bdarnell
Contributor

bdarnell commented Jun 5, 2017

docs/RFCS/array_encoding.md, line 86 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

After talking to @bdarnell, he suggested that having the offset rather than a flag gives us some freedom to expand on the data available in this header in the future if we see fit (such as by adding an index, or something). @bdarnell: do you (or anyone else) have any advice on how this could be implemented to be forwards-compatible without being too bloated? For example one option could be that we tag each header entry with a kind and its length, but I'm not sure if that's overkill for what we're doing here.

Storing the offset instead of a boolean has the advantage that if we add new stuff to the header in arrayV2, a node that only understands arrayV1 can still find the data (but not the new stuff we added, so whether that's a good or bad thing depends on whether the new stuff is optional or not. For example, we couldn't use this flexibility to introduce the multi-data-block stuff discussed below). I wouldn't get too clever about future-proofing here, just make sure we have the option of moving to an entirely different format in the future (which we can probably do by just allocating a new type tag, but if we want to do this within the one array type tag we might want to add a version number)


docs/RFCS/array_encoding.md, line 173 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

To follow up on this, @bdarnell suggested (and correct me if I'm misrepresenting what you said) that one option would be to later on, introduce a header in the very first kv entry that could provide pointers to the various other kv entries. This should be fine to retrofit on if need be, as long as we leave ourselves the flexibility in the header format to adjust it later.

Instead of adjusting it later, if we can identify the eventual format we'll want, it would be nice to encode things from the beginning in the form that single-block arrays will use in the multi-block world. As a strawman, we could store the number of data blocks in the header somewhere, which would just be a constant 1 for now. When we support the new format, we'd set that field to the number of blocks, which would also serve as the version number to indicate that the record also contains whatever new fields we need to introduce for multi-block storage.
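
A Go sketch of that strawman; the struct and field names are hypothetical:

```go
package main

import "fmt"

// arrayHeader sketches the strawman above: the header always carries a
// data-block count, written as a constant 1 in today's single-kv
// format. A future multi-block format writes the real count, which
// doubles as the signal that the extra multi-block fields follow.
type arrayHeader struct {
	numDataBlocks uint32 // 1 in the single-kv format
	// hypothetical future multi-block fields (per-block pointers) here
}

func isMultiBlock(h arrayHeader) bool {
	return h.numDataBlocks > 1
}

func main() {
	fmt.Println(isMultiBlock(arrayHeader{numDataBlocks: 1})) // false: today's format
	fmt.Println(isMultiBlock(arrayHeader{numDataBlocks: 4})) // true: hypothetical future
}
```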



@justinj
Contributor Author

justinj commented Jun 6, 2017

docs/RFCS/array_encoding.md, line 86 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Storing the offset instead of a boolean has the advantage that if we add new stuff to the header in arrayV2, a node that only understands arrayV1 can still find the data (but not the new stuff we added, so whether that's a good or bad thing depends on whether the new stuff is optional or not. For example, we couldn't use this flexibility to introduce the multi-data-block stuff discussed below). I wouldn't get too clever about future-proofing here, just make sure we have the option of moving to an entirely different format in the future (which we can probably do by just allocating a new type tag, but if we want to do this within the one array type tag we might want to add a version number)

How do we handle a world where there are non-optional changes made (like multi-block storage) and there are nodes that only understand arrayV1? Is there an existing mechanism in Cockroach to deal with that kind of problem? I assume we have to handle this problem when we add new datatypes in the first place?

After discussing with @jordanlewis we landed on the idea of just replacing the boolean with a bitmap that would allow us to add new flags in the future (such as, "this array is multi-block and the block info is included in the header"). This doesn't solve the problem of old nodes and new data, but given that we have no hope in cases where the changes to the data are required anyway I'm not especially concerned about it (but should I be?).
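
A Go sketch of the flags-byte idea; the specific bit assignments are hypothetical:

```go
package main

import "fmt"

// A bitmap of flags instead of a lone boolean leaves room to add new
// flags later without changing the header layout.
const (
	flagHasNulls   byte = 1 << 0 // a NULLs bitmap follows the header
	flagMultiBlock byte = 1 << 1 // reserved: data spans multiple kv entries
)

func main() {
	hdr := flagHasNulls
	fmt.Println(hdr&flagHasNulls != 0)   // true
	fmt.Println(hdr&flagMultiBlock != 0) // false: single-kv for now
}
```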



@bdarnell
Contributor

bdarnell commented Jun 6, 2017

docs/RFCS/array_encoding.md, line 86 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

How do we handle a world where there are non-optional changes made (like multi-block storage) and there are nodes that only understand arrayV1? Is there an existing mechanism in Cockroach to deal with that kind of problem? I assume we have to handle this problem when we add new datatypes in the first place?

After discussing with @jordanlewis we landed on the idea of just replacing the boolean with a bitmap that would allow us to add new flags in the future (such as, "this array is multi-block and the block info is included in the header"). This doesn't solve the problem of old nodes and new data, but given that we have no hope in cases where the changes to the data are required anyway I'm not especially concerned about it (but should I be?).

The simplest way to handle this (from a migration perspective) would be to treat arrayV2 as a completely separate type, and keep using the old format indefinitely unless and until there's a schema change to force us to switch. That's similar to what we did with some of the index format changes just before beta (you'd keep using the old format until you dropped and recreated the index).



@justinj
Contributor Author

justinj commented Jun 7, 2017

This RFC is entering the final comment period.

@RaduBerinde
Member

LGTM

@benesch
Contributor

benesch commented Jun 9, 2017

Did you settle on introducing an entirely new type for the arrayV2 encoding, or are you planning on baking in a version number/bit into arrayV1's header?



docs/RFCS/array_encoding.md, line 47 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

👍 I'll add a note for clarity.

This isn't formatting right! Markdown is interpreting your leading hyphen (-) as the start of a bulleted list. I'd suggest using an em dash (—) with no adjacent spaces instead.


docs/RFCS/array_encoding.md, line 85 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

That's an good idea, I'm not familiar with the tradeoffs of using varints for something like this, do you have an opinion? Off the top of my head (and based on feedback we've gotten) I would think that (very) small arrays are a common enough case that this would be a good idea.

Varints are slower to decode and waste a bit per byte. But given that most arrays will be small (< 128 elements), it seems like you'll want varints here to save three bytes on every dimension length.
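
A quick Go illustration of the savings, using the standard library's uvarint encoding:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Dimension lengths written as uvarints instead of fixed uint32s:
// any length under 128 fits in one byte, saving three bytes per
// dimension in the common small-array case.
func main() {
	var buf []byte
	for _, dimLen := range []uint64{2, 3} { // a hypothetical 2x3 array
		buf = binary.AppendUvarint(buf, dimLen)
	}
	fmt.Println(len(buf)) // 2 bytes, versus 8 as fixed uint32s
}
```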


docs/RFCS/array_encoding.md, line 25 at r2 (raw file):

    CREATE INDEX idx ON t(intArr);
    CREATE TABLE u (INT[] arr PRIMARY KEY);

nit: fence these with "```sql" for syntax highlighting.


docs/RFCS/array_encoding.md, line 76 at r2 (raw file):

The column type protobuf will be modified to include an `ARRAY` value for
the `Kind`, and a nullable field indicating the element type if the type is
an array.

If I'm understanding correctly, int[][] will be encoded as {..., Kind: ARRAY, ArrayType: INT }, and its two-dimensionality will be deduced only when the value is decoded?



@justinj
Contributor Author

justinj commented Jun 9, 2017

We're planning on baking in the version number, using those reserved bits, but really we're just keeping the door open to go either route. Simpler changes could bump the version number, and if need be we could still introduce a new type.



docs/RFCS/array_encoding.md, line 47 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

This isn't formatting right! Markdown is interpreting your leading hypen (-) as the start of a bulleted list. I'd suggest using an emdash (—) with no adjacent spaces instead.

🤦‍♂️


docs/RFCS/array_encoding.md, line 85 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Varints are slower to decode and waste a bit per byte. But given that most arrays will be small (< 128 elements), seems like you'll want varints here to save three bytes on every dimension length.

Sounds good. I think that small arrays are indeed a common case.


docs/RFCS/array_encoding.md, line 25 at r2 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: fence these with "```sql" for syntax highlighting.

Done.


docs/RFCS/array_encoding.md, line 76 at r2 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

If I'm understanding correctly, int[][] will be encoded as {..., Kind: ARRAY, ArrayType: INT }, and its two-dimensionality will be deduced only when the value is decoded?

There's an existing repeated array_dimensions field in structured.proto (which I believe, though I haven't checked, is currently unused) that we were going to use (with placeholder values like -1, I guess), but come to think of it, we could probably just replace it with an int...



@bdarnell
Contributor

:lgtm:



@benesch
Contributor

benesch commented Jun 13, 2017

:lgtm: too, but don't ascribe me much weight



@justinj
Contributor Author

justinj commented Jun 15, 2017

Thanks for the reviews everyone!



@justinj justinj merged commit da96863 into cockroachdb:master Jun 15, 2017
justinj pushed a commit to justinj/cockroach that referenced this pull request Jun 15, 2017
In accordance with cockroachdb#16172, we will need to disallow ordering
by array columns to keep the door open to decide how we should order
them later on.

I debated making the check a method on the Type interface, but I decided
that was overkill for this single check on a single Type.  Regardless,
the check is isolated into a single function so it can be changed
easily. This approach doesn't play well with wrapped Oid types like
INT2VECTOR and INT4VECTOR, but I'm not too concerned about those because
users can't construct them, and since they only exist for pg_catalog
compatibility anyway we might as well allow ordering by them as in
Postgres.

We currently don't support `IN` or `MAX`/`MIN` on arrays, so that's not
something we need to be concerned about.

As far as I'm aware there are no other situations where the ordering of
arrays is exposed to users. I tried adding a panic to TArray's Compare
method and the only test failure was a test calling Compare explicitly
as an assertion.

This is technically a breaking change which should be announced to users
once it is released, though I doubt anyone is making use of this.
justinj pushed a commit to justinj/cockroach that referenced this pull request Jun 28, 2017
This commit introduces support for an ARRAY column type as described in
the RFC (cockroachdb#16172) with the following limitations:

* NULL values in arrays are currently not supported
* Collated strings as array contents are currently not supported
* No additional operations on arrays have been implemented
* Arrays are only 1-dimensional

This commit also disallows using arrays as primary keys or as an indexed
column, however it's not clear to me yet if there are other situations
in which a value could become key-encoded. I think more testing is in
order, but I wanted to get some eyes on this.