sql: add support for ARRAY type #2115

Closed · 7 tasks done
petermattis opened this issue Aug 14, 2015 · 29 comments
Labels: A-sql-semantics, C-enhancement (solution expected to add code/behavior and preserve backward compatibility; pg compat issues are the exception)
@petermattis (Collaborator) commented Aug 14, 2015

@vivekmenezes (Contributor):

I've been thinking about how YouTube could store all of its user comments on CockroachDB. On YouTube you can post a comment on a video, and you can also reply to an existing comment on a video. So the comment space can be modeled as a table where every row represents a video, with a column containing an array of all the comments related to that video. In addition, comments can reference other comments, so the comment type needs to be a tuple like <id, other-comment-id, comment>.

The array type should support fast append and update operations, to allow quick posting and editing of comments.
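
A rough sketch of the schema this describes, in Postgres-style syntax (the composite type and all names here are illustrative, not a committed design):

```sql
-- One row per video, with comments denormalized into an array of tuples.
-- Illustrative only: assumes array-of-composite support à la Postgres.
CREATE TYPE comment AS (
    id          BIGINT,
    reply_to_id BIGINT,  -- id of the comment being replied to; NULL for top-level
    body        TEXT
);

CREATE TABLE videos (
    video_id BIGINT PRIMARY KEY,
    comments comment[]   -- would need fast append/update to scale
);
```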

@kevincox:

@vivekmenezes Your method would work for small comment sets, but you would quickly run into contention on that array for videos where people comment quickly, and into size problems for videos with large numbers of comments. However, I think your suggestion of update operations is a great one. Having array operations built in would avoid having to read the value, process it locally in a transaction, and send the new array back to the server. Using update commands would create much shorter transactions, which would obviously be good for performance.

@JackKrupansky:

  1. I'd recommend that arrays within a row be limited to "modest" size (dozens, maybe hundreds of entries at most). In any case, a row should be a manageable size.
  2. User-defined types are great for defining the tuple to be stored in an array. Synchronizing a bunch of parallel arrays (user ID, timestamp, comment text) is a pain. We need to look at how they would be queried, such as all comments by a particular user or all comments within a timestamp range.
  3. Non-modest arrays are best handled as rows in a table, accessed via JOIN, a nested SELECT, or a separate manual query.
  4. Non-modest arrays frequently need to be queryable, so rows in a table make sense.
  5. The eternal question: should we just do what Postgres does? (Or at least use it as a starting point for discussion.)
  6. Most comments these days are threaded (hierarchical), which further argues that user comments should be distinct rows.

@vivekmenezes (Contributor):

One of the reasons I put in a real-world example is so that the considerations here can include an architecture where each element of the array maps to a different key/value pair on the underlying KV store. Reading the entire array would still mean a lot of contention, but perhaps we need to support reads at a timestamp a few milliseconds in the past anyway, or reads up to a certain array index, to reduce contention.

Alternatively, propose a different schema for the domain: a table with each individual comment stored in an individual row, the comment ID acting as the primary key, and a column for the video ID with a secondary index on it. The comments are then read using the secondary index on the video.

A key design consideration is the ability to read all the comments off a single node (or a few nodes).
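
A minimal sketch of that normalized alternative (names illustrative):

```sql
-- One row per comment; comments for a video are read via a secondary index.
CREATE TABLE comments (
    comment_id  BIGINT PRIMARY KEY,
    video_id    BIGINT NOT NULL,
    reply_to_id BIGINT,  -- NULL for top-level comments
    body        TEXT
);

CREATE INDEX comments_by_video ON comments (video_id);

-- Fetch all comments for one video through the secondary index.
SELECT comment_id, reply_to_id, body FROM comments WHERE video_id = 42;
```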


@kevincox:

WWPGD (What Would Postgres Do) seems like a pretty good starting point for DB-related questions (it has served SQLite well, even though SQLite is very different from CockroachDB).

I suspect that inline arrays should probably not go over a "modest" size. I'm not saying larger arrays should be forbidden, but they should probably be discouraged and not optimized for.

At the very least, there comes a point where returning the whole array is no longer very useful, which probably means you need to filter, and once you filter you should probably be using a separate table.

That's my 2 cents, at least.

@JackKrupansky:

We might also consider the Cassandra model of collections: list, set, and map. Sure, sometimes people actually do want mathematical arrays, matrices, and multi-dimensional arrays (à la Postgres), but sometimes they just want simple lists, are trying to simulate a set using an array, or are similarly storing keyword/value pairs in two-dimensional arrays or parallel linear arrays.

The map collection type is nice since it lets you store and query arbitrary semi-structured data, typically keyword/value pairs, much more naturally.

Cassandra has convenient operations on lists, sets, and maps, such as appending and prepending to a list, removing list elements, removing all occurrences of a value from a list or set, removing from a map by keyword, etc. Doing all this with raw arrays would be... not so fun.
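
For comparison, the Postgres-array equivalents of a few of those list operations look roughly like this (the emails table and labels column are hypothetical):

```sql
-- Append and prepend to a text[] column.
UPDATE emails SET labels = labels || 'todo'   WHERE id = 1;  -- append
UPDATE emails SET labels = 'urgent' || labels WHERE id = 1;  -- prepend

-- Remove every occurrence of a value.
UPDATE emails SET labels = array_remove(labels, 'todo') WHERE id = 1;
```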

And I would encourage the same philosophy that Cassandra adopted for lists, sets, and maps: "Collections are meant for storing/denormalizing relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all the messages sent by a given user”, “events registered by a sensor”, ...), then collections are not appropriate anymore and a specific table (with clustering columns) should be used."

It all depends on the use case.

See:
https://cassandra.apache.org/doc/cql3/CQL.html#collections

@kevincox:

The Cassandra collection model looks really nice, and it appears similar to what I was recommending. I have quoted what I deem the important parts below:

  • "Collections are meant for storing/denormalizing relatively small amount of data."
  • "Collections are always read in their entirety"
  • "Collections cannot have more than 65535 elements."

That limit seems pretty high, but I wouldn't be surprised if it is considered bad practice to get anywhere near it. Personally, I would expect embedded collections to be suited for "tens of elements" at most.

Again, there are multiple right answers, just sharing what I'm thinking :)

@MehSha commented Feb 10, 2016

+1 for the Cassandra collection model.

Arrays are very cool: they reduce the number of queries and checks needed to get data, in both one-to-many and many-to-many cases. In many cases we emulate a set with two queries (even when we have an array), and maps eliminate two-column tables and unnecessary relations. All this inches us closer to the embedding model (as in MongoDB) and improves both convenience (fewer queries) and performance.

Please implement.

@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed SQL labels Feb 13, 2016
@amangoeliitb:

Any updates on this, @petermattis?

@petermattis (Collaborator, Author):

@amangoeliitb ARRAY is on our roadmap, but not scheduled for the near term. Do you mind explaining what your use case is? Perhaps there is a suitable alternative using the existing SQL functionality.

@knz mentioned this issue Aug 4, 2016
@kulshekhar:

@petermattis One of the best use cases for ARRAY and JSON types would be to enable the implementation of a simple row level ACL system. They could also be a more performant alternative to JOINs in several cases, especially in a distributed setup.

@tlvenn (Contributor) commented Dec 6, 2016

I have seen quite a few array-related PRs being merged lately and was wondering how far we are from basic support for the array type. Thanks!

@jhartman86:

+1 re: @kulshekhar's comment "could also be more performant alternative to joins in several cases"

@cuongdo (Contributor) commented Dec 8, 2016

What specific functionality do you need for arrays? Postgres exposes a lot of syntax and functions for arrays, so we've been limiting our work to what's needed to support Hibernate ORM.

@kulshekhar:

@cuongdo

  • index the array field
  • check if an array contains a particular value

These are the most important, from my perspective. Additional nice-to-have functionality (see the Postgres operators sketched below this list) would be:

  • check if an array contains any value from another list
  • check if an array contains all values present in another list
  • append a value
  • merge/concatenate another array
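
For reference, Postgres covers each item on that list with existing operators and functions, which makes it a plausible compatibility target (table and column names here are illustrative):

```sql
-- Index the array field (GIN serves the @> and && operators below).
CREATE INDEX t_tags_gin ON t USING GIN (tags);

-- Contains a particular value / contains all values of another list.
SELECT * FROM t WHERE tags @> ARRAY['red'];
SELECT * FROM t WHERE tags @> ARRAY['red', 'blue'];

-- Contains any value from another list (overlap).
SELECT * FROM t WHERE tags && ARRAY['red', 'blue'];

-- Append a value / concatenate another array.
UPDATE t SET tags = array_append(tags, 'green')    WHERE id = 1;
UPDATE t SET tags = tags || ARRAY['green', 'teal'] WHERE id = 1;
```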

@tlvenn (Contributor) commented Dec 9, 2016

Honestly, it would already be useful to be able to simply store and retrieve an array.

Then comes the list @kulshekhar posted, and I agree: being able to index an array and check whether it contains a given value are probably the most desired features after that.

@spencerkimball (Member):

@jordanlewis @nvanbenschoten could you guys update this issue?

@petermattis petermattis removed this from the 1.0 milestone Mar 29, 2017
@cuongdo cuongdo removed their assignment Apr 5, 2017
@cuongdo (Contributor) commented May 2, 2017

Another key functionality @bdarnell brought up is indexing of ARRAY columns.

@justinj (Contributor) commented May 30, 2017

Hi @MehSha, @kulshekhar, @tlvenn, @jhartman86, @amangoeliitb, @kajmagnus, @beldpro-ci, @thearchitect, @yinhm, @nickjackson, @arypurnomoz, @mbertschler, @sudhakar, @tlvenn, @bkleef, @domnulnopcea, @iamcarbon, @muesli, @heiskr, @arduanov, @daliwali, @manigandham since you expressed interest in this issue, I was wondering if you had any insight you could share on how you plan to use arrays within Cockroach!

I'm interested in a couple of main questions:

  • How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
  • How large do you expect your arrays to become (either in elements or in data)?
  • A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
  • What's the appeal of using arrays for you, vs. using a normalized table?

Thanks so much!

@kevincox commented May 30, 2017

TL;DR I use arrays to contain "tags" which I almost exclusively query to see if some value is in the array of tags.

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

Most of my array usage is checking membership. For example, I have short arrays of small data (numbers or small strings), and I query whether x is a member of that array. Occasionally I will return the whole set of array elements in a query.

My second most common use case is a small set of data, such as email addresses, which I will almost always return in its entirety.

I almost never access by index.
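
In Postgres terms, those membership checks look roughly like this (schema illustrative):

```sql
-- Is 42 an element of the tag_ids array on this row?
SELECT * FROM items WHERE 42 = ANY (tag_ids);
-- Equivalent containment form, which a GIN index can serve.
SELECT * FROM items WHERE tag_ids @> ARRAY[42];
```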

How large do you expect your arrays to become (either in elements or in data)?

Very small. Most are 0–2 elements, and a couple get to around 20. Again, these are integers or short strings.

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

I occasionally use a GIN index to find rows that contain a certain element in an array.

What's the appeal of using arrays for you, vs. using a normalized table?

I expect that they will be stored closer to the row itself for quick checking as well as the ease of indexing and querying.

Hope that helps explain my use case. Feel free to ask for clarification or anything else.

@kulshekhar:

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

I mainly use these fields with queries. When I have to use the data from an array field, I normally parse it in the application.

How large do you expect your arrays to become (either in elements or in data)?

I have applications with array fields ranging up to 20,000 elements.

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

GIN indexing does it for me

What's the appeal of using arrays for you, vs. using a normalized table?

The main use case is application-level ACL: arrays allow row-level permissions to be managed in a flexible and performant manner. I imagine it would also work better than joins in a distributed setup.
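
A minimal sketch of that row-level ACL pattern, with hypothetical names:

```sql
-- Each row carries the list of users allowed to see it.
CREATE TABLE documents (
    doc_id        BIGINT PRIMARY KEY,
    allowed_users TEXT[] NOT NULL,
    body          TEXT
);

-- Return only the rows the current user may see; a GIN index on
-- allowed_users would keep this membership check fast.
SELECT doc_id, body FROM documents WHERE allowed_users @> ARRAY['alice'];
```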

@heiskr commented May 31, 2017

I'll join in, why not? :)

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

Querying using ANY.
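
That is, queries of roughly this shape (names illustrative):

```sql
-- Is the literal an element of the array column on this row?
SELECT * FROM cards WHERE 'v42' = ANY (version_ids);
```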

How large do you expect your arrays to become (either in elements or in data)?

Not unbounded; rarely more than a list of a dozen IDs or so.

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

I'm not at the scale where that matters... yet ;)

What's the appeal of using arrays for you, vs. using a normalized table?

I'm not a 3rd-normal-form queen, just as long as I don't have duplicate data. I'm actually transitioning from NoSQL, so even just having a schema is a step in the traditional direction. In these tables I'm storing versions of things, and the relationships are part of the versioning, so joining over the different versions in a different table would look like (parent_kind, parent_id, parent_version_id, child_kind, ... etc.). Not a pleasant query to run every time I want to read, which is way, way more common than write.

I would say though that jsonb support is the bigger blocker at this point.

@kulshekhar:

I would say though that jsonb support is the bigger blocker at this point

I would agree with this. jsonb could also be used for arrays.

@gr0uch commented May 31, 2017

How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?

unnest, plus the array functions array_cat, array_append, array_remove, array_length, etc. I wouldn't be replacing entire arrays.
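
For instance, unnest expands an array column into rows that can then be filtered or joined, while functions like array_append and array_remove update the stored array without a client-side read-modify-write (the posts table and tags column here are illustrative):

```sql
-- Expand a text[] column into one row per element (implicit LATERAL).
SELECT p.id, t.tag
FROM   posts AS p, unnest(p.tags) AS t(tag);

-- Update the array server-side instead of replacing the whole value.
UPDATE posts SET tags = array_append(tags, 'new-tag') WHERE id = 1;
UPDATE posts SET tags = array_remove(tags, 'old-tag') WHERE id = 1;
SELECT array_length(tags, 1) FROM posts WHERE id = 1;
```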

How large do you expect your arrays to become (either in elements or in data)?

Arbitrarily large, perhaps up to the row size limit.

A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.

I haven't had a use case that would benefit from indexing array columns yet.

What's the appeal of using arrays for you, vs. using a normalized table?

Better read performance and fewer joins, plus more flexible data modelling.

@justinj (Contributor) commented Jun 5, 2017

Thanks so much all of you for your input! It's much appreciated.

@alex-sherwin commented Aug 18, 2017

I haven't had a use case that would benefit from indexing array columns yet.

A simple, and probably common, scenario would be a standard MIME email that has been broken out into parts so that it's indexed/searchable.

Such a domain model would (usually) contain a small number of TO and CC headers representing a (typically) small number of recipients.

In this case, it's quite convenient to have a native array type in an indexed column so you can search for a specific email address. In Postgres, this is easily accomplished with a text[] column, a GIN index, and array operators in queries (which GIN indexes support; see https://www.postgresql.org/docs/9.6/static/indexes-types.html).
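
Concretely, that setup looks like this in Postgres (names illustrative):

```sql
-- Recipients stored inline as text[] and made searchable with a GIN index.
CREATE TABLE email_parts (
    message_id BIGINT PRIMARY KEY,
    to_addrs   TEXT[],
    cc_addrs   TEXT[]
);

CREATE INDEX email_to_gin ON email_parts USING GIN (to_addrs);

-- Find every message addressed to a specific recipient; served by the GIN index.
SELECT message_id FROM email_parts WHERE to_addrs @> ARRAY['alice@example.com'];
```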

@justinj (Contributor) commented Sep 19, 2017

Going to close this since the only outstanding issue is not blocking array usage and has its own issue.

@justinj justinj closed this as completed Sep 19, 2017
@alex-sherwin:

Going to close this since the only outstanding issue is not blocking array usage and has its own issue.

@justinj

It wasn't immediately obvious to me, when checking this issue's updates, that the feature was actually implemented and is part of 1.1 (currently in beta). Just noting this for anyone else looking here as well.

knz added a commit to knz/cockroach that referenced this issue Nov 22, 2018
knz added a commit to knz/cockroach that referenced this issue Nov 22, 2018
knz added a commit to knz/cockroach that referenced this issue Nov 27, 2018
knz added a commit to knz/cockroach that referenced this issue Nov 27, 2018