Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Filtering over nested arrays of scalars #182

Merged
merged 15 commits into from
Oct 18, 2024

Conversation

daniel-chambers
Copy link
Collaborator

@daniel-chambers daniel-chambers commented Sep 25, 2024

Note: This PR is currently stacked on #187 and will be rebased once that PR has merged. Rebased on main.

RFC Rendered

This PR also makes some minor quality of life improvements:

  • Adds some useful targets to the justfile
  • Widens the width of content in the ndc-spec documentation website by 150px to help give a bit more space to accommodate the capabilities tables
  • RFC
  • Specification updates
    • Changelog
    • Specification pages
    • Tutorial pages
    • reference/types.md
  • Implement your feature in the reference connector
  • Generate test cases in ndc-test if appropriate
  • Does your feature add a new capability?
    • Update specification/capabilities.md in the specification
    • Check the capability in ndc-test

hallettj
hallettj previously approved these changes Sep 25, 2024
Copy link

@hallettj hallettj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I appreciate the new comparison target idea since that makes disambiguating easier for the connector. I agree that this should only apply to arrays of scalars.

@daniel-chambers
Copy link
Collaborator Author

daniel-chambers commented Sep 30, 2024

There is a problem with this approach. It doesn't allow use of logical operators beyond the new ArrayComparison::ExistsInNestedArray boundary. So, for example, if the following data existed:

[
  Customer {
    nested_numbers: [ [2,1], [1,0] ]
  },
  Customer {
    nested_numbers: [ [2,3], [1,0] ]
  }
]  

and I wanted to ask the following question:

give me all customers where the inner array inside nested_numbers contains 1 and also contains 2

query {
  Customer(where: { nested_numbers: { inner: { _and: [ { _eq: 1 }, { _eq: 2 } ] } } }) {
    id
  }
}

I couldn't because there's no way to nest logical operators inside a ArrayComparison::ExistsInNestedArray.

collection: customers
query:
  fields:
    id:
      type: column
      column: id
  predicate:
    type: array_comparison
    column:
      type: column
      name: nested_numbers
      path: []
    comparison:
      type: exists_in_nested_array
      nested_comparison:
        type: and # This does not exist
        comparisons:
          - type: exists_binary
            operator: eq
            value:
              type: literal
              value: 1
          - type: exists_binary
            operator: eq
            value:
              type: literal
              value: 2

arguments: {}
collection_relationships: {}

Perhaps a simple solution to this, that doesn't require another level of "And", "Or", "Not" inside ArrayComparison, is to make ArrayComparison::ExistsInNestedArray take an array of nested comparisons which are implicitly "and-ed" together.

enum ArrayComparison {
    ...
    ExistsInNestedArray {
        nested_comparisons: Vec<ArrayComparison>
    }
}

This would allow us to do logical conjunction (ie "and") within the same exists scope. For disjunction, my understanding is that we can always lift the logical "or" to the top of the expression. Here's me asking ChatGPT about it: https://chatgpt.com/share/66fa38a6-c7a4-8000-a290-6af0aaa258a5 (its answer about "Existential Quantification and Conjunction (∃ and ∧)" is revealing)

@paf31
Copy link
Collaborator

paf31 commented Sep 30, 2024

For disjunction, my understanding is that we can always lift the logical "or" to the top of the expression. Here's me asking ChatGPT about it: https://chatgpt.com/share/66fa38a6-c7a4-8000-a290-6af0aaa258a5 (its answer about "Existential Quantification and Conjunction (∃ and ∧)" is revealing)

Only if you want to perform DNF translation or something else whenever an OR moves past an AND.

@hallettj
Copy link

I had a thought - we had said that it would be helpful if Expression were partially-applied instead of assuming that every expression evaluates in the context of a record. Could we get to a similar place by adding a variant to ComparisonTarget like ComparisonTarget::ContextuallyRelevantValue? Then instead of defining a parallel set of expressions for applying to arrays we could basically duplicate the Exists variant.

pub enum Expression {
    // ...
    ExistsInArray {
        column: ComparisonTarget,
        predicate: Option<Box<Expression>>,
    },
}

pub enum ComparisonTarget {
    // ...
    ContextuallyRelevantValue, // unit variant
}

In that formulation given the GraphQL query,

query {
  Customer(where: { nested_numbers: { inner: { _and: [ { _eq: 1 }, { _eq: 2 } ] } } }) {
    id
  }
}

I'm imagining an NDC query like this,

collection: customers
query:
  fields:
    id:
      type: column
      column: id
  predicate:
    type: exists_in_array
    column:
      type: column
      name: nested_numbers
      path: []
    predicate:
      type: exists_in_array
      column: { type: "contextually_relevant_value" }
      predicate:
        type: and
        expressions:
          - type: binary_comparison_operator
            column: { type: "contextually_relevant_value" }
            operator: eq
            value:
              type: literal
              value: 1
          - type: binary_comparison_operator
            column: { type: "contextually_relevant_value" }
            operator: eq
            value:
              type: literal
              value: 2

arguments: {}
collection_relationships: {}

But @daniel-chambers I think I disagree with how the expressions apply in that query. I think that GraphQL query asks if there exists one element that is equal to 1 and also equal to 2 so I expect that query wouldn't return any rows. To filter where the inner array contains possibly-different elements where one is equal to 1 and one is equal to 2 I think the and operator should be moved higher:

query {
  Customer(where: { nested_numbers: { _and: [{ inner: { _eq: 1 } }, { inner: { _eq: 2 } }] } }) {
    id
  }
}

But either way I see your point that we need a change to allow boolean conjunctions.

So my idea might lead to some possible nonsensical expressions. But we can avoid problems by not producing such expressions. I think it would be better to change the Expression type to use that partial-application idea. But maybe extending ComparisonTarget is easier.

@paf31
Copy link
Collaborator

paf31 commented Sep 30, 2024

Along the same lines as contextually_relevant_value, we could maybe solve this by adding a new variant to ExistsInCollection:

pub enum ExistsInCollection {
    Related {..},
    Unrelated {..},
    NestedCollection {..},
    NestedScalarCollection {
        column_name: FieldName,
        arguments: BTreeMap<ArgumentName, Argument>,
        field_path: Vec<FieldName>,
    }

NestedScalarCollection explodes an array of scalars into a nested collection, with a single column whose name can be static and chosen by us, similar to __value for functions. The caller can then use the usual expressions to filter by rows of this nested collection.

@hallettj
Copy link

hallettj commented Sep 30, 2024

@paf31 Does ExistsInCollection::NestedScalarCollection handle cases of arrays of arrays? Either an array of arrays of scalars, or an array of arrays of records?

Edit: I guess why wouldn't those cases work? Sounds good.

@daniel-chambers
Copy link
Collaborator Author

daniel-chambers commented Oct 1, 2024

I think that GraphQL query asks if there exists one element that is equal to 1 and also equal to 2 so I expect that query wouldn't return any rows.

@hallettj It's not asking if there exists one element, it's there exists any element in the inner array. Remember this is a doubly-nested array in the example, so "inner" represents an array of ints. So its the same inner array that must contain both 1 and 2. Here's an example:

[ [1,4], [3,1] ] -> false
[ [4,6], [3,0] ] -> false
[ [2,6], [3,1] ] -> false
[ [2,1], [1,0] ] -> true

EDIT: Actually, I think you're right. Sorry!! I think to do what I wanted above you can't use two nested exists, you need an exists and contains (eg ∃ inner. ( CONTAINS(inner, 1) && CONTAINS(inner, 2) ), not ∃ inner. ( ∃ int. (int == 1 && int == 2 ) )).


I think @paf31's latest idea seems reasonable... it certainly follows an existing pattern we use for functions (a virtual column). I think I prefer it to extending ComparisonTarget with ContextuallyRelevantValue because ContextuallyRelevantValue is only really valid once you descend through Expression::ExistsInArray. Whereas hijacking ComparisonTarget::Column by introducing a virtual __value column means that the actual data structure is always valid, it's just the schema used in it (ie the column name) that might be invalid.

I'll update the RFC with the latest idea.

@daniel-chambers
Copy link
Collaborator Author

What we have lost, however, is the is_empty and contains functions over any array. We could keep Expression::ArrayComparison only keep ArrayComparison::Contains and ArrayComparison::IsEmpty to continue to support these two things.

I've updated the RFC with these ideas.

@daniel-chambers daniel-chambers force-pushed the daniel/rfc-nested-scalar-array-comparisons branch from 6d90ace to 36c3e5a Compare October 1, 2024 06:01
hallettj
hallettj previously approved these changes Oct 8, 2024
Copy link

@hallettj hallettj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Do we have an idea what GraphQL expressions that map to ArrayComparisons will look like? Like will contains and is_empty be reserved operator names that supersede scalar comparisons?

@daniel-chambers
Copy link
Collaborator Author

@hallettj Check out the v3-engine RFC for more information about graphql shape.

As for reserved names, how they are named in graphql is up to how you configure the naming in OpenDD. If the connector defines both and their names clash, you will need to rename one or the other.

@daniel-chambers daniel-chambers force-pushed the daniel/rfc-nested-scalar-array-comparisons branch from 9a5a95f to b4b06fa Compare October 16, 2024 06:21
@daniel-chambers daniel-chambers changed the base branch from main to daniel/extend-test-data October 16, 2024 06:22
@daniel-chambers daniel-chambers force-pushed the daniel/rfc-nested-scalar-array-comparisons branch from b4b06fa to 3eff113 Compare October 17, 2024 04:11
@daniel-chambers daniel-chambers changed the base branch from daniel/extend-test-data to daniel/add-arguments-to-places October 17, 2024 04:12
@daniel-chambers daniel-chambers self-assigned this Oct 17, 2024
@daniel-chambers daniel-chambers force-pushed the daniel/rfc-nested-scalar-array-comparisons branch from 78ee304 to 4787e85 Compare October 17, 2024 07:06

```rust,no_run,noplayground
{{#include ../../../../../ndc-reference/bin/reference/main.rs:eval_expression_exists}}
```

Note in particular, we push the current row onto the stack of `scopes` before executing the inner query, so that [references to columns in those scopes](../../../specification/queries/filtering.md#referencing-a-column-from-a-collection-in-scope) can be resolved correctly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use subheadings here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

@@ -0,0 +1,280 @@
/* Globals */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deliberate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I widened the width of content in the ndc-spec documentation website by 150px to help give a bit more space to accommodate the capabilities tables

paf31
paf31 previously approved these changes Oct 17, 2024
Copy link
Collaborator

@paf31 paf31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks.

Just so I understand the overlap (contains and is_empty can be expressed using exists-in-scalar-array), I assume this is meant to future-proof things for connectors which can implement one but not the other? Do we have such a connector in mind or is it speculative?

@daniel-chambers
Copy link
Collaborator Author

@paf31 ClickHouse was the connector I had in mind, because I remember @BenoitRanque saying that arbitrary existential quantification across nested arrays wasn't going to be possible.

Base automatically changed from daniel/add-arguments-to-places to main October 18, 2024 03:24
@daniel-chambers daniel-chambers dismissed paf31’s stale review October 18, 2024 03:24

The base branch was changed.

@daniel-chambers daniel-chambers force-pushed the daniel/rfc-nested-scalar-array-comparisons branch from ad09a0b to 685841d Compare October 18, 2024 03:27
@daniel-chambers daniel-chambers merged commit 96ddd9d into main Oct 18, 2024
11 checks passed
@daniel-chambers daniel-chambers deleted the daniel/rfc-nested-scalar-array-comparisons branch October 18, 2024 03:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants