How to implement support for multi-values fields (#1300)? #1733

Yury-Fridlyand · 2023-06-13T00:46:40Z

Yury-Fridlyand
Jun 13, 2023
Maintainer

I opened this topic to discuss how to implement a fix for #1300. Unlike an issue, a discussion allows threaded replies to every message, which could be more useful here.

I want to share different ways to implement the fix below and collect your opinions and votes. Before that, let me clarify some requirements and describe why the legacy engine works incorrectly (given that we can't implement this in V2 the same way as in V1).

Requirements

A column K in a table N can't change its type.
All values in a column K in a table N must always be of the same type.
Docs can be added to or deleted from an OpenSearch index represented by a table N.
An OpenSearch index mapping cannot be changed.
The main (targeted) consumer of the SQL plugin is a third-party application such as a JDBC driver.

Sample data

An index called ~~doesn't-matter-how~~ dbg with 3 docs; each doc has only one field.
To make the responses easier to follow, I'll add _doc to them even though V1 doesn't provide it.

The mapping

"mappings" :
{
  "properties" :
  {
    "myNum" :
    {
      "type" : "long"
    }
  }
}

The docs

[
  {
    "myNum" : 5
  },
  {
    "myNum" : [3, 4]
  },
  {
    "myNum" : [[1, 2], [3, 4], 5]
  }
]

The query

SELECT * FROM dbg

V1 response

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5
    ],
    [
      "doc_2",
      [
        3,
        4
      ]
    ],
    ...
  ],
  "size": 2,
  "status": 200
}

It is erroneous even though it represents the data as is. It violates the second requirement. The schema in that response also does not apply to all rows.

V2 response

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5
    ],
    [
      "doc_2",
      3
    ],
    [
      "doc_3",
      1
    ]
  ],
  "size": 2,
  "status": 200
}

Or as a table:

_doc	myNum
doc_1	5
doc_2	3
doc_3	1

This response is valid in terms of SQL, but it loses data: only the first value of each multi-value field is returned.

Yury-Fridlyand · 2023-06-13T00:46:56Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 1: flatten values

Flatten values the way the NESTED function does, by adding extra rows.

Query:

SELECT * FROM dbg

Response:

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5
    ],
    [
      "doc_2",
      3
    ],
    [
      "doc_2",
      4
    ],
    [
      "doc_3",
      1
    ],
    [
      "doc_3",
      2
    ],
    [
      "doc_3",
      3
    ],
    [
      "doc_3",
      4
    ],
    [
      "doc_3",
      5
    ]
  ],
  "size": 2,
  "status": 200
}

Converted to a table, that response looks like:

_doc	myNum
doc_1	5
doc_2	3
doc_2	4
doc_3	1
doc_3	2
doc_3	3
doc_3	4
doc_3	5

Cons:

This may add a significant number of rows.
Rows are added when the user didn't request them.
Unfortunately, grouping is lost unless extra info (as a column, for example) is added to the response.
This violates the principle of one row per doc.
Could be incompatible with other features, for example, pagination.
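
To make the row expansion concrete, here is a minimal Python sketch of Option 1 applied to the sample docs. It is only an illustration of the idea, not plugin code: the helper names `flatten` and `flatten_rows` are hypothetical, and the doc ids are the `_doc` labels used in this thread.

```python
from typing import Any, Iterator


def flatten(value: Any) -> Iterator[Any]:
    """Yield every scalar found in a (possibly nested) list, depth-first."""
    if isinstance(value, list):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def flatten_rows(docs: dict, field: str) -> list:
    """Emit one [_doc, value] row per scalar value of `field` (hypothetical helper)."""
    return [[doc_id, v] for doc_id, doc in docs.items() for v in flatten(doc.get(field))]


docs = {
    "doc_1": {"myNum": 5},
    "doc_2": {"myNum": [3, 4]},
    "doc_3": {"myNum": [[1, 2], [3, 4], 5]},
}
rows = flatten_rows(docs, "myNum")
print(len(rows))  # 8 rows for 3 docs
```

Note that nothing in a row says which array (or nesting level) a value came from, which is the lost grouping mentioned in the cons.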

0 replies

Yury-Fridlyand · 2023-06-13T00:47:08Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 2: Expand values

Add extra columns to the entire result set if a doc has a field with multiple values.

Query:

SELECT * FROM dbg

Response:

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    },
    {
      "name": "myNum[0]",
      "type": "long"
    },
    {
      "name": "myNum[1]",
      "type": "long"
    },
    {
      "name": "myNum[0][0]",
      "type": "long"
    },
    {
      "name": "myNum[0][1]",
      "type": "long"
    },
    {
      "name": "myNum[1][0]",
      "type": "long"
    },
    {
      "name": "myNum[1][1]",
      "type": "long"
    },
    {
      "name": "myNum[2]",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5,
      null,
      null,
      null,
      null,
      null,
      null,
      null
    ],
    [
      "doc_2",
      null,
      3,
      4,
      null,
      null,
      null,
      null,
      null
    ],
    [
      "doc_3",
      null,
      null,
      null,
      1,
      2,
      3,
      4,
      5
    ]
  ],
  "size": 2,
  "status": 200
}

Converted to a table, that response looks like:

_doc	myNum	myNum[0]	myNum[1]	myNum[0][0]	myNum[0][1]	myNum[1][0]	myNum[1][1]	myNum[2]
doc_1	5	null	null	null	null	null	null	null
doc_2	null	3	4	null	null	null	null	null
doc_3	null	null	null	1	2	3	4	5

Cons:

This may add a significant number of columns.
Columns are added when the user didn't request them.
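
A minimal Python sketch of how the expanded column names could be derived, assuming the `name[i][j]` naming used in the response above. The helpers `expand` and `expand_rows` are hypothetical and only illustrate the idea.

```python
from typing import Any


def expand(name: str, value: Any) -> dict:
    """Map a (possibly nested) list to {indexed column name: scalar}."""
    if not isinstance(value, list):
        return {name: value}
    columns = {}
    for i, item in enumerate(value):
        columns.update(expand(f"{name}[{i}]", item))
    return columns


def expand_rows(docs: dict, field: str):
    """Return (schema, rows); the schema is the union of all columns seen."""
    expanded = {doc_id: expand(field, doc.get(field)) for doc_id, doc in docs.items()}
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    schema = list(dict.fromkeys(col for cols in expanded.values() for col in cols))
    rows = [[doc_id] + [cols.get(col) for col in schema] for doc_id, cols in expanded.items()]
    return schema, rows


docs = {
    "doc_1": {"myNum": 5},
    "doc_2": {"myNum": [3, 4]},
    "doc_3": {"myNum": [[1, 2], [3, 4], 5]},
}
schema, rows = expand_rows(docs, "myNum")
print(schema)
```

The schema can only be computed after every doc in the result set has been inspected, so it depends on the data returned.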

0 replies

Yury-Fridlyand · 2023-06-13T00:47:19Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 3: provide value as JSON

Add one extra column with raw JSON that contains the field.

Query:

SELECT * FROM dbg

Response:

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    },
    {
      "name": "myNum.json",
      "type": "string"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5,
      "{ \"myNum\": 5 }"
    ],
    [
      "doc_2",
      null,
      "{ \"myNum\": [3, 4] }"
    ],
    [
      "doc_3",
      null,
      "{ \"myNum\": [[1, 2], [3, 4], 5] }"
    ]
  ],
  "size": 2,
  "status": 200
}

Converted to a table, that response looks like:

_doc	myNum	myNum.json
doc_1	5	{ "myNum": 5 }
doc_2	null	{ "myNum": [3, 4] }
doc_3	null	{ "myNum": [[1, 2], [3, 4], 5] }

Cons:

An extra column would be added to all fields.
An extra column would be added to those fields which have multiple values:
- Requires sampling the entire response.
- Response depends on data and may be changed.
- Incompatible with pagination and probably, with some other features.

Pros:

Behavior could be configurable and switchable by settings or query parameters/hints. See Option 5 and Option X below.
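
As a rough Python sketch of Option 3, the row for each doc could be built as shown here. The helper `with_json_column` is hypothetical, and `json.dumps` formats the string slightly differently from the hand-written samples above.

```python
import json


def with_json_column(doc_id: str, doc: dict, field: str) -> list:
    """Return [_doc, scalar-or-null, raw JSON] for one doc (hypothetical helper)."""
    value = doc.get(field)
    # The typed column keeps only values that fit the mapped scalar type.
    scalar = None if isinstance(value, list) else value
    # The extra column carries the field exactly as it was indexed.
    raw = json.dumps({field: value})
    return [doc_id, scalar, raw]


row = with_json_column("doc_3", {"myNum": [[1, 2], [3, 4], 5]}, "myNum")
print(row)
```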

0 replies

Yury-Fridlyand · 2023-06-13T00:47:27Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 4: ignore incompatible values

A value such as [3, 4] can't be converted to a number, so it should be skipped.
Don't be too quick to reject this option; see the other options below.

Query:

SELECT * FROM dbg

Response:

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5
    ],
    [
      "doc_2",
      null
    ],
    [
      "doc_3",
      null
    ]
  ],
  "size": 2,
  "status": 200
}

Converted to a table, that response looks like:

_doc	myNum
doc_1	5
doc_2	null
doc_3	null

Cons:

Some values are omitted.

Pros:

No ambiguity at all.

0 replies

Yury-Fridlyand · 2023-06-13T00:47:37Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 5: cast or convert

This implies Option 4 and Option 3, but allows the user to use one or more functions to convert multi-value fields.

Query:

SELECT CAST(myNum AS JSON) AS `myNum.json`, myNum FROM dbg;

Response:

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    },
    {
      "name": "myNum.json",
      "type": "string"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5,
      "{ \"myNum\": 5 }"
    ],
    [
      "doc_2",
      null,
      "{ \"myNum\": [3, 4] }"
    ],
    [
      "doc_3",
      null,
      "{ \"myNum\": [[1, 2], [3, 4], 5] }"
    ]
  ],
  "size": 2,
  "status": 200
}

Converted to a table, that response looks like:

_doc	myNum	myNum.json
doc_1	5	{ "myNum": 5 }
doc_2	null	{ "myNum": [3, 4] }
doc_3	null	{ "myNum": [[1, 2], [3, 4], 5] }

Pros:

No ambiguity.
No cons from Option 3.

0 replies

Yury-Fridlyand · 2023-06-13T00:47:46Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 6: PartiQL

This implies Option 4, but allows the user to use PartiQL syntax and define how to deal with multiple values.
A number of sub-options are possible here; the syntax and behavior in each case should be discussed further.

Query:

SELECT CASE WHEN (myNum IS ARRAY) THEN myNum[0]
       ELSE myNum END AS myNum
FROM dbg;

or

SELECT num AS myNum FROM dbg as d, d.myNum[0] as num

Example of the response:

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5
    ],
    [
      "doc_2",
      3
    ],
    [
      "doc_3",
      null
    ]
  ],
  "size": 2,
  "status": 200
}

Converted to a table, that response looks like:

_doc	myNum
doc_1	5
doc_2	3
doc_3	null

Cons:

Complex implementation.

Pros:

Allows users to reshape and represent their data as they want.
Allows users to filter.

0 replies

Yury-Fridlyand · 2023-06-13T00:48:09Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 7: painless script

This implies Option 4, but allows the user to submit a Painless script to process docs/fields. This script will be passed straight to the OpenSearch engine.

Query:

SELECT myNum, painless_script('...') ...

or

SELECT *, painless_script(myNum, '...') ...

or

{
  "query" : "SELECT * ...",
  "painless_script" : "..."
}

Response:

{
  "schema": [
    {
      "name": "_doc",
      "type": "string"
    },
    {
      "name": "myNum",
      "type": "long"
    },
    {
      "name": "myNum.scripted",
      "type": "long"
    }
  ],
  "total": 2,
  "datarows": [
    [
      "doc_1",
      5,
      null
    ],
    [
      "doc_2",
      null,
      4
    ],
    [
      "doc_3",
      null,
      5
    ]
  ],
  "size": 2,
  "status": 200
}

Converted to a table, that response looks like:

_doc	myNum	myNum.scripted
doc_1	5	null
doc_2	null	4
doc_3	null	5

Cons:

Potential security breach.
Implementation could be complex.
The script language is similar to JavaScript and could be inconvenient for SQL developers.
Not so painless, actually.

Pros:

The user takes full responsibility for how the data is processed.

0 replies

Yury-Fridlyand · 2023-06-13T00:48:17Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Option 8: error on multiple values

This is a modification of Option 4: once the SQL engine meets a field with multiple values, like [3, 4], it returns an error to the user if no instruction was given on how to process multiple values.
This could be seamlessly combined with any of Options 5 to 7 or, with a query parameter/hint (see Option X), with all options.

Query:

SELECT * FROM dbg

Response:

{
  "error": {
    "reason": "Error occurred in SQL engine",
    "details": "...",
    "type": "SomeTypeOfException"
  },
  "status": 503
}

Cons:

May return a valid result set or an error on the same query, depending on the data in the index.

Pros:

Does not affect users who don't have multiple values.
Clearly notifies the user and requests an action (changing the request).
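
A minimal Python sketch of the Option 8 behaviour, assuming a hypothetical exception type `MultiValueFieldError` and a hypothetical helper `read_scalar`; neither exists in the plugin. The point is that the error names the offending field so the user knows what to change.

```python
class MultiValueFieldError(Exception):
    """Hypothetical error raised when a field holds multiple values."""


def read_scalar(doc_id: str, doc: dict, field: str):
    """Return the field value, or raise if it holds multiple values."""
    value = doc.get(field)
    if isinstance(value, list):
        raise MultiValueFieldError(
            f"field '{field}' in {doc_id} has multiple values; "
            "specify how to process them"
        )
    return value


print(read_scalar("doc_1", {"myNum": 5}, "myNum"))  # prints 5
```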

0 replies

Yury-Fridlyand · 2023-06-13T00:48:24Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Now the highlight of the program comes.

Option X: let user pick the option

This requires the following steps:

Define a default option. This could also be configurable by plugin settings.
Define which options to support.
Define how the user can pick an option per query. Any (or all) could be implemented:
1. As a function: SELECT flatten_values(myNum) ..., SELECT convert_to_json(myNum) ...
2. As an engine hint: /*! flatten_values(myNum) */ SELECT * ..., SELECT /*! convert_to_json */ myNum ...
3. As a request parameter: { "query": "SELECT * ... ", "flatten_values": "myNum, otherColumn" }, { "query": "SELECT myNum ... ", "convert_to_json": "*" }
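
The request-parameter variant could be resolved roughly like this Python sketch. Everything here is an assumption for illustration: the strategy names, the `STRATEGIES` table, and the `pick_strategy` helper are hypothetical, and only two per-value strategies are shown.

```python
import json
from typing import Any, Callable


def ignore(value: Any) -> Any:
    """Option 4: multi-valued fields become null."""
    return None if isinstance(value, list) else value


def convert_to_json(value: Any) -> str:
    """Options 3/5: return the value as a JSON string."""
    return json.dumps(value)


# Hypothetical registry: request parameter name -> handler.
STRATEGIES = {"ignore": ignore, "convert_to_json": convert_to_json}


def pick_strategy(request: dict, field: str, default: str = "ignore") -> Callable:
    """Pick the handler for `field` from request parameters, else the default."""
    for name, handler in STRATEGIES.items():
        selected = [f.strip() for f in request.get(name, "").split(",")]
        if "*" in selected or field in selected:
            return handler
    return STRATEGIES[default]


request = {"query": "SELECT myNum FROM dbg", "convert_to_json": "*"}
print(pick_strategy(request, "myNum")([3, 4]))  # prints [3, 4] as a JSON string
```

Strategies that change the shape of the result set, such as flattening into rows, would need to hook in at a different level than this per-value dispatch.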

0 replies

Yury-Fridlyand · 2023-06-13T00:51:03Z

Yury-Fridlyand
Jun 13, 2023
Maintainer Author

Please feel free to update/edit, add options, and comment on them.

My choice is to implement (in order):

Option 4 or 8
Option 5
Option X
Options 1, 2 and 6 and maybe 3

0 replies

MaxKsyunz · 2023-06-13T06:04:10Z

MaxKsyunz
Jun 13, 2023
Maintainer

Thank you @Yury-Fridlyand for starting this discussion. I like this layout.

As to multi-valued fields, from my perspective:

For the SQL plugin to be trustworthy, it must return all the data requested.
For the SQL plugin to be easy to use, the API user must be able to get the data in the shape it was indexed in.

If {"myNum" : [[1, 2], [3, 4], 5] } is what was indexed, then as an application developer, I'd expect SELECT myNum FROM dbg to return [[1, 2], [3, 4], 5] in the myNum field. Presumably that shape makes sense for my application.

If I'm using the REST endpoint, then I expect a JSON response, and dealing with this case should be straightforward.

How to map [[1, 2], [3, 4], 5] as a value in JDBC (or ODBC) is a different question. The best option I see right now is to return such values as user-defined types whose semantics we define.

The one thing I dislike about this is that the schema could change based on the data returned, but I don't see anything in the JDBC documentation that expects it to be the same.

This is where configurability could be helpful. If a client can say how to convert a multi-valued field to a single value, then they can get a consistent schema from query to query with simple types.

1 reply

Yury-Fridlyand Jun 13, 2023
Maintainer Author

the schema could change based on the data returned

That violates the first requirement I posted at the very top: a column K from a table N returned by any/all queries should always be the same type, regardless of the data returned or stored in the table.

A user can re-create the table, changing the column type. That will be a column K' from a table N' with the same rule applied.
A user can do a convert/cast/transform, which would produce f(K) with the same rule applied.

acarbonetto · 2023-06-20T20:05:46Z

acarbonetto
Jun 20, 2023
Maintainer

My suggestion is that we implement #5 first: #1300 (comment)
That way we have a workaround for anyone interested in using data as arrays (rather than the current implementation in V2, which only supports non-array values).

0 replies

acarbonetto · 2023-06-21T06:13:57Z

acarbonetto
Jun 21, 2023
Maintainer

We can add support for arrays by including a meta tag in the mapping. Meta fields and properties can be added dynamically using the mapping API. See: PUT /<target>/_mapping endpoint. This will allow one to add field properties to an existing mapping.

For existing/new mappings, we could use the meta.asArray (RFC on naming) property as a trigger to determine if the field should be treated exclusively as an array overriding the existing V2 behaviour (which forces the data type to be a single element). Note that meta fields must be string settings - so a regular true value won't work here.

For example:

            "int0" : {
                "type" : "integer",
                "meta": {
                    "asArray": "true"
                }
            }

In PPL, the JSON array is output. We should consider an overriding meta property that would behave like V2 in all cases (which forces the data type to be a single element). We could use the same meta property with the false value to force behaviour.

For example:

            "int0" : {
                "type" : "integer",
                "meta": {
                    "asArray": "false"
                }
            }

Once multi-field support is available, we could make both options available to the user, for example:

            "int0" : {
                "type" : "integer",
                "fields" : {
                    "array": {
                        "type": "integer",
                        "meta": {
                            "asArray": "true"
                        }
                    }
                }
            }
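
A small Python sketch of how the proposed flag could be written and read. The `asArray` name is the proposal from this comment (still open for naming), and both helpers are hypothetical. The body is what one would send to the PUT /<target>/_mapping endpoint; sending it is left out to keep the sketch self-contained.

```python
import json


def as_array_mapping_body(field: str, field_type: str, as_array: bool) -> dict:
    """Build a mapping update body that sets the proposed meta.asArray flag."""
    return {
        "properties": {
            field: {
                "type": field_type,
                # meta values must be strings, so the flag is "true"/"false".
                "meta": {"asArray": "true" if as_array else "false"},
            }
        }
    }


def treat_as_array(field_mapping: dict) -> bool:
    """Read the flag back; a missing flag keeps the current single-value behaviour."""
    return field_mapping.get("meta", {}).get("asArray") == "true"


body = as_array_mapping_body("int0", "integer", True)
print(json.dumps(body, indent=2))
```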

0 replies

acarbonetto · 2023-06-26T21:32:01Z

acarbonetto
Jun 26, 2023
Maintainer

Note: in PostgreSQL, the columns are strictly defined. We can use meta to strictly enforce the typing. It allows for multi-dimensional arrays too. For example:

CREATE TABLE sal_emp (
    name            text,
    pay_by_quarter  integer[],
    schedule        text[][]
);

SELECT pay_by_quarter[3] FROM sal_emp;
 pay_by_quarter
----------------
          10000
          25000

Reference: https://www.postgresql.org/docs/current/arrays.html

0 replies

forestmvey · 2023-06-29T16:30:05Z

forestmvey
Jun 29, 2023
Maintainer

PartiQL also supports querying arrays, including arrays of literals, using square brackets.

{ 
    'root': {
        'array': << 
       [
            [1, 2], [3, 4], 5,
            'NA'
        ]
        >> 
    }
}

Query whole array:

PartiQL> select array from root;
==='
<<
  {
    'array': [
      [
        1,
        2
      ],
      [
        3,
        4
      ],
      5,
      'NA'
    ]
  }
>>
---

Query First Index Of Array:

PartiQL> SELECT array[0] FROM root;
==='
<<
  {
    '_1': [
      1,
      2
    ]
  }
>>
---

Query First Index of First Index Array:

PartiQL> SELECT array[0][0] FROM root;
==='
<<
  {
    '_1': 1
  }
>>
---

Query Third Index Of Array:

PartiQL> SELECT array[2] FROM root;
==='
<<
  {
    '_1': 5
  }
>>
---

With this implementation in the V2 engine we can avoid a breaking change. If the user does not specify array indexing in their SQL query, the current behaviour is kept and the first value in the array is returned. When the user uses square brackets, the responsibility for that index being valid falls on the user. The following examples and edge cases show this implementation in the SQL plugin.

Arrays of Objects

"accounts": [{"id": 1}, {"id": 2}]

Query:

SELECT accounts[0].id FROM people;

Response:
1

Example query that fails because the array is not indexed correctly; a missing value is returned.

Query:

SELECT accounts.id[0] FROM people;

Response:
null

Example query that fails because the array index is out of bounds; a missing value is returned.

Query:

SELECT accounts[2].id FROM people;

Response:
null

Example query that returns whole array.

Query:

SELECT accounts[:] FROM people;

Response:
[{"id": 1}, {"id": 2}]

Example query that returns the whole array without a trailing pair of square brackets. The whole array is returned by default when the user uses square brackets in any part of the path of the selected field.

Query:

SELECT accounts[0] FROM people;

Response:
{"id": 1}

Arrays of Objects with inner arrays

"accounts": [{"id": [[1, 1], 2]}, {"id": [3, 4]}]

An example where we return the whole resulting array implicitly:

Query:

SELECT accounts[0].id FROM people;

Response:
[[1, 1], 2]

An example where we index into the inner arrays explicitly:

Query:

SELECT accounts[0].id[0][0] FROM people;

Response:
1
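
The indexing rules in the examples above can be sketched in Python as follows. This is only an illustration of the described semantics, not the plugin's parser: the `evaluate` helper and its `TOKEN` regex are hypothetical, the `[:]` form is not handled, and the current V2 default of returning the first value when no brackets are used is not modelled.

```python
import re
from typing import Any

# One token is either "[<digits>]" or an identifier with an optional leading dot.
TOKEN = re.compile(r"\[(\d+)\]|\.?([A-Za-z_]\w*)")


def evaluate(path: str, doc: Any) -> Any:
    """Walk `path` through `doc`; return None (missing) on any invalid step."""
    current = doc
    pos = 0
    while pos < len(path):
        match = TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"cannot parse path at {path[pos:]!r}")
        pos = match.end()
        index, key = match.group(1), match.group(2)
        if index is not None:
            # Indexing a non-array or going out of bounds yields a missing value.
            if not isinstance(current, list) or int(index) >= len(current):
                return None
            current = current[int(index)]
        else:
            # A key lookup on anything but an object yields a missing value.
            if not isinstance(current, dict) or key not in current:
                return None
            current = current[key]
    return current


people = {"accounts": [{"id": 1}, {"id": 2}]}
print(evaluate("accounts[0].id", people))  # prints 1
print(evaluate("accounts.id[0]", people))  # prints None
```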

Limitations:
TBD

0 replies

normanj-bitquill · 2024-10-18T17:31:09Z

normanj-bitquill
Oct 18, 2024

There is a method OpenSearchExprValueFactory.construct(). It has a parameter supportArrays. For the SQL plugin (including PPL), this is always set to false. As a result arrays (with any level of nesting) are collapsed down to the first value.

If supportArrays is set to true, then it will correctly parse array values (including nesting). These can flow through to the results correctly (at least for JSON results).

Given this data:

{
  "name": "value1",
  "value": [1]
}

{
  "name": "value2",
  "value": [2, 3]
}

{
  "name": "value3",
  "value": [
    [4,5],
    [6,7],
    8
  ]
}

We can use the SQL query:

SELECT * FROM test1

Or the PPL query:

source=test1

And get these results:

{
  "schema": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "value",
      "type": "long"
    }
  ],
  "datarows": [
    [
      "value1",
      [
        1
      ]
    ],
    [
      "value2",
      [
        2,
        3
      ]
    ],
    [
      "value3",
      [
        [
          4,
          5
        ],
        [
          6,
          7
        ],
        8
      ]
    ]
  ],
  "total": 3,
  "size": 3
}

Need to investigate how this would impact SQL queries in particular. May need to move the logic for collapsing an array closer to the JDBC connector. Some consumers of JSON results would be able to use the data with arrays.
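
The difference between the two settings can be mimicked with a few lines of Python. This is a sketch of the behaviour described above, not a port of OpenSearchExprValueFactory.construct(); the `construct` function below is a hypothetical stand-in, and returning None for an empty array is an assumption.

```python
from typing import Any


def construct(value: Any, support_arrays: bool) -> Any:
    """Mimic the described behaviour: preserve arrays, or collapse to the first value."""
    if not isinstance(value, list):
        return value
    if support_arrays:
        return [construct(item, True) for item in value]
    # Collapse: keep descending into the first element until a non-list is reached.
    return construct(value[0], False) if value else None


value3 = [[4, 5], [6, 7], 8]
print(construct(value3, support_arrays=False))  # prints 4
print(construct(value3, support_arrays=True))   # prints [[4, 5], [6, 7], 8]
```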

2 replies

normanj-bitquill Oct 18, 2024

Only one test failed, but it was an interesting case. The query is:

source = people | fields accounts, accounts.id

Uses data from deep_nested_index_data.json. It produces this result:

fetched rows / total rows = 1/1
    +-----------------------+-------------+
    | accounts              | accounts.id |
    |-----------------------+-------------|
    | [{'id': 1},{'id': 2}] | 1           |
    +-----------------------+-------------+

Note: it included both accounts, which matches the data in the source document. accounts.id is only the first value from the accounts array.

Does accounts.id even make sense? That is actually a key of an object in the array. In JQuery, this would be more like accounts[].id.
Should accounts.id return an array of values?

normanj-bitquill Oct 18, 2024

Here are two possibilities:

Whenever possible parse an array as an array. This will change existing behaviour, but may be more intuitive to users.
Require the user to update the _meta for an index to indicate which fields should be treated as arrays. Preserves existing behaviour, but requires the user to update the index before they can receive arrays back from queries.

penghuo · 2024-10-19T00:17:09Z

penghuo
Oct 19, 2024
Maintainer

Could we prioritize the basic case 'source=people | where accounts=1 | fields accounts'? For other cases that we cannot support, throw an exception.

1 reply

normanj-bitquill Oct 21, 2024

For clarity, if people.accounts is an array field, it should also be an array in the results of the above query.

normanj-bitquill · 2024-10-22T21:50:37Z

normanj-bitquill
Oct 22, 2024

Initially we will throw errors for operators and functions that do not handle multi-valued fields correctly. The error message should include the field that caused the failure.

0 replies

normanj-bitquill · 2024-10-28T21:10:05Z

normanj-bitquill
Oct 28, 2024

The initial change has been merged in. Array values by default are preserved. They will fail with most operators and functions. At this point, a user should only specify a field with array values in the projection list.

0 replies

normanj-bitquill · 2024-10-28T21:58:24Z

normanj-bitquill
Oct 28, 2024

Here is a list of operators that need to be updated to handle arrays.

GROUP BY
ORDER BY
JOIN
=
!=
>
>=
<
<=
ABS() - should throw an error with the failing expression in the message
ADD() - should throw an error with the failing expression in the message
CBRT() - should throw an error with the failing expression in the message
CEIL() - should throw an error with the failing expression in the message
CONV() - should throw an error with the failing expression in the message
CRC32() - should throw an error with the failing expression in the message
DIVIDE() - should throw an error with the failing expression in the message
EXP() - should throw an error with the failing expression in the message
EXPM1() - should throw an error with the failing expression in the message
FLOOR() - should throw an error with the failing expression in the message
LN() - should throw an error with the failing expression in the message
LOG() - should throw an error with the failing expression in the message
LOG2() - should throw an error with the failing expression in the message
LOG10() - should throw an error with the failing expression in the message
MOD() - should throw an error with the failing expression in the message
MODULUS() - should throw an error with the failing expression in the message
MULTIPLY() - should throw an error with the failing expression in the message
POW() - should throw an error with the failing expression in the message
POWER() - should throw an error with the failing expression in the message
RAND() - should throw an error with the failing expression in the message
RINT() - should throw an error with the failing expression in the message
ROUND() - should throw an error with the failing expression in the message
SIGN() - should throw an error with the failing expression in the message
SIGNUM() - should throw an error with the failing expression in the message
SQRT() - should throw an error with the failing expression in the message
STRCMP() - should throw an error with the failing expression in the message
SUBTRACT() - should throw an error with the failing expression in the message
TRUNCATE() - should throw an error with the failing expression in the message
+ - should throw an error with the failing expression in the message
- - should throw an error with the failing expression in the message
* - should throw an error with the failing expression in the message
/ - should throw an error with the failing expression in the message
% - should throw an error with the failing expression in the message
ACOS() - should throw an error with the failing expression in the message
ASIN() - should throw an error with the failing expression in the message
ATAN() - should throw an error with the failing expression in the message
ATAN2() - should throw an error with the failing expression in the message
COS() - should throw an error with the failing expression in the message
COSH() - should throw an error with the failing expression in the message
COT() - should throw an error with the failing expression in the message
DEGREES() - should throw an error with the failing expression in the message
RADIANS() - should throw an error with the failing expression in the message
SIN() - should throw an error with the failing expression in the message
SINH() - should throw an error with the failing expression in the message
TAN() - should throw an error with the failing expression in the message
ADDDATE() - should throw an error with the failing expression in the message
ADDTIME() - should throw an error with the failing expression in the message
CONVERT_TZ() - should throw an error with the failing expression in the message
DATE() - should throw an error with the failing expression in the message
DATEDIFF() - should throw an error with the failing expression in the message
DATETIME() - should throw an error with the failing expression in the message
DATE_ADD() - should throw an error with the failing expression in the message
DATE_FORMAT() - should throw an error with the failing expression in the message
DATE_SUB() - should throw an error with the failing expression in the message
DAY() - should throw an error with the failing expression in the message
DAYNAME() - should throw an error with the failing expression in the message
DAYOFMONTH() - should throw an error with the failing expression in the message
DAYOFWEEK() - should throw an error with the failing expression in the message
DAYOFYEAR() - should throw an error with the failing expression in the message
DAY_OF_MONTH() - should throw an error with the failing expression in the message
DAY_OF_WEEK() - should throw an error with the failing expression in the message
DAY_OF_YEAR() - should throw an error with the failing expression in the message
EXTRACT() - should throw an error with the failing expression in the message
FROM_DAYS() - should throw an error with the failing expression in the message
FROM_UNIXTIME() - should throw an error with the failing expression in the message
GET_FORMAT() - should throw an error with the failing expression in the message
HOUR() - should throw an error with the failing expression in the message
HOUR_OF_DAY() - should throw an error with the failing expression in the message
LAST_DAY() - should throw an error with the failing expression in the message
MAKEDATE() - should throw an error with the failing expression in the message
MAKETIME() - should throw an error with the failing expression in the message
MICROSECOND() - should throw an error with the failing expression in the message
MINUTE() - should throw an error with the failing expression in the message
MINUTE_OF_DAY() - should throw an error with the failing expression in the message
MINUTE_OF_HOUR() - should throw an error with the failing expression in the message
MONTHNAME() - should throw an error with the failing expression in the message
PERIOD_ADD() - should throw an error with the failing expression in the message
PERIOD_DIFF() - should throw an error with the failing expression in the message
QUARTER() - should throw an error with the failing expression in the message
SECOND() - should throw an error with the failing expression in the message
SECOND_OF_MINUTE() - should throw an error with the failing expression in the message
SEC_TO_TIME() - should throw an error with the failing expression in the message
SUBDATE() - should throw an error with the failing expression in the message
SUBTIME() - should throw an error with the failing expression in the message
STR_TO_DATE() - should throw an error with the failing expression in the message
TIME() - should throw an error with the failing expression in the message
TIMEDIFF() - should throw an error with the failing expression in the message
TIMESTAMP() - should throw an error with the failing expression in the message
TIMESTAMPADD() - should throw an error with the failing expression in the message
TIMESTAMPDIFF() - should throw an error with the failing expression in the message
TIME_FORMAT() - should throw an error with the failing expression in the message
TIME_TO_SEC() - should throw an error with the failing expression in the message
TO_DAYS() - should throw an error with the failing expression in the message
TO_SECONDS() - should throw an error with the failing expression in the message
UNIX_TIMESTAMP() - should throw an error with the failing expression in the message
WEEK() - should throw an error with the failing expression in the message
WEEKOFYEAR() - should throw an error with the failing expression in the message
WEEK_OF_YEAR() - should throw an error with the failing expression in the message
YEAR() - should throw an error with the failing expression in the message
YEARWEEK() - should throw an error with the failing expression in the message
LIKE - should throw an error with the failing expression in the message
ASCII() - should throw an error with the failing expression in the message
CONCAT()
CONCAT_WS()
LEFT() - should throw an error with the failing expression in the message
LENGTH() - should throw an error with the failing expression in the message
LOCATE() - should throw an error with the failing expression in the message
REPLACE() - should throw an error with the failing expression in the message
RIGHT() - should throw an error with the failing expression in the message
RTRIM() - should throw an error with the failing expression in the message
SUBSTRING() - should throw an error with the failing expression in the message
TRIM() - should throw an error with the failing expression in the message
UPPER() - should throw an error with the failing expression in the message
AVG() - should throw an error with the failing expression in the message
COUNT()
MAX()
MIN()
PERCENTILE() - should throw an error with the failing expression in the message
STDDEV_POP() - should throw an error with the failing expression in the message
STDDEV_SAMP() - should throw an error with the failing expression in the message
SUM() - should throw an error with the failing expression in the message
VAR_POP() - should throw an error with the failing expression in the message
VAR_SAMP() - should throw an error with the failing expression in the message
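
One way to apply the same guard to many functions is a shared wrapper. The Python sketch below only illustrates the pattern (the plugin itself is not written in Python); the `reject_arrays` decorator, the `MultiValuedArgumentError` type, and `sql_sqrt` are hypothetical names.

```python
import math
from functools import wraps


class MultiValuedArgumentError(TypeError):
    """Hypothetical error for functions that cannot take array arguments."""


def reject_arrays(name: str):
    """Wrap a scalar function so array arguments fail with the expression in the message."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args):
            for arg in args:
                if isinstance(arg, list):
                    expression = f"{name}({', '.join(repr(a) for a in args)})"
                    raise MultiValuedArgumentError(
                        f"{expression}: multi-valued argument is not supported"
                    )
            return fn(*args)
        return wrapper
    return decorator


@reject_arrays("SQRT")
def sql_sqrt(x):
    return math.sqrt(x)


print(sql_sqrt(9))  # prints 3.0
```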

0 replies

normanj-bitquill · 2024-10-29T22:33:17Z

normanj-bitquill
Oct 29, 2024

@penghuo Here are the next steps that I am thinking of:

1. Update functions and operators that produce incorrect results with array values to throw an exception. This is meant as a short-term solution to prevent users from receiving incorrect results. For example, update the COUNT function to throw an exception when the argument is a multi-valued field.
2. Go back and fix each function and operator from step 1 to produce correct results.

An example is the COUNT function. It currently counts every element in each array value instead of counting rows.
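To make step 1 concrete, here is a minimal sketch in Python (not the plugin's actual Java code). The names `MultiValueNotSupportedError`, `reject_multi_valued` and `count_rows` are hypothetical and only illustrate the idea of failing loudly, with the failing expression in the message, instead of returning a wrong count.

```python
class MultiValueNotSupportedError(Exception):
    """Hypothetical error type: a function received an array where a scalar was expected."""


def reject_multi_valued(function_name, field_name, value):
    """Return the value unchanged, or raise if it is a list (a multi-valued field)."""
    if isinstance(value, list):
        # Include the failing expression in the message so the user can locate it.
        raise MultiValueNotSupportedError(
            f"{function_name}({field_name}): multi-valued field is not supported, got {value!r}"
        )
    return value


def count_rows(field_name, rows):
    """COUNT(field) over rows: counts non-null scalar values, rejects arrays (step 1)."""
    total = 0
    for row in rows:
        value = reject_multi_valued("COUNT", field_name, row.get(field_name))
        if value is not None:
            total += 1
    return total
```

With this guard, `count_rows("y", [{"y": 1}, {"y": [2, 3]}])` raises an error rather than silently returning a misleading number. Step 2 would later replace the guard with the correct semantics for each function.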

Does this approach make sense?

0 replies

normanj-bitquill · 2024-10-29T22:41:56Z

normanj-bitquill
Oct 29, 2024

Looking into COUNT, the aggregation appears to be performed in the OpenSearch engine, which counts every element across all array values. In the request and response below, 5 documents produce a count of 10.

MIN and MAX show similar issues. The unexpected result is produced in the OpenSearch engine.

POST test3/_search
{
  "size": 0,
  "aggs": {
    "number_of_values": {
      "value_count": {
        "field": "y"
      }
    }
  }
}

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "number_of_values": {
      "value": 10
    }
  }
}
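The gap between the two numbers in that response (5 hits, value count of 10) can be reproduced with a small Python sketch. The documents here are made up for illustration, since the contents of `test3` are not shown; they are only chosen so that 5 documents hold 10 values in total. `value_count` and `row_count` are hypothetical helper names.

```python
# Hypothetical documents: 5 docs whose field "y" holds 10 values in total.
docs = [
    {"y": [1, 2]},
    {"y": [3, 4, 5]},
    {"y": [6]},
    {"y": [7, 8, 9]},
    {"y": [10]},
]


def value_count(docs, field):
    """Mimic a per-value count: every array element is counted separately."""
    total = 0
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        total += len(value) if isinstance(value, list) else 1
    return total


def row_count(docs, field):
    """Count rows (documents) that have a non-null value, as a relational COUNT(y) would."""
    return sum(1 for doc in docs if doc.get(field) is not None)
```

On this data `value_count(docs, "y")` gives 10 while `row_count(docs, "y")` gives 5, which is the same mismatch a SQL user would see between the pushed-down aggregation and the expected row count.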

0 replies

normanj-bitquill · 2024-10-29T22:53:55Z

normanj-bitquill
Oct 29, 2024

If a sort order (ORDER BY clause) is applied to a multi-valued field, then the OpenSearch engine will only perform the sort based on the first element in each array.

0 replies

normanj-bitquill · 2024-10-31T22:51:38Z

normanj-bitquill
Oct 31, 2024

There are limitations in the OpenSearch server that cause aggregations to behave differently from relational databases.

The SQL plugin can work around this by not pushing down aggregations on multi-valued fields. However, the mapping currently provides no way to distinguish a single-valued field from a multi-valued field.

Here is an example mapping for an index. Field y is multi-valued, yet nothing in the mapping indicates this.

{
  "properties" : {
    "x" : {
      "type" : "long"
    },
    "y": {
      "type": "long"
    }
  }
}

The SQL plugin could use information in the _meta section of the mapping to know which fields are multi-valued. This requires the user to update the mapping. It should not affect other consumers of the mapping, since the properties section is not changed.
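As a rough sketch of how that could look, the Python snippet below adds a `_meta` entry to the example mapping and uses it to decide whether an aggregation may be pushed down. The key name `multi_valued_fields` and both helper functions are assumptions made for illustration; no such convention is defined by the plugin in this thread.

```python
# Example mapping from above, extended with a _meta section.
# The "multi_valued_fields" key is a hypothetical convention, not an existing one.
mapping = {
    "_meta": {"multi_valued_fields": ["y"]},
    "properties": {
        "x": {"type": "long"},
        "y": {"type": "long"},
    },
}


def multi_valued_fields(mapping, meta_key="multi_valued_fields"):
    """Return the set of field names the user has declared as multi-valued."""
    return set(mapping.get("_meta", {}).get(meta_key, []))


def can_push_down_aggregation(mapping, field):
    """Allow push-down only for fields not declared as multi-valued."""
    return field not in multi_valued_fields(mapping)
```

Here `can_push_down_aggregation(mapping, "x")` is true and `can_push_down_aggregation(mapping, "y")` is false, so an aggregation on `y` would be evaluated in the plugin instead. A mapping without `_meta` keeps today's behavior, because every field is then treated as single-valued.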

0 replies
