[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

qianheng-aws · 2024-07-29T08:54:40Z

Is your feature request related to a problem?
Currently in a logical plan of OpenSearch, the LogicalEval is a blocker for many push-down optimization rules like TableScanPushDown.PUSH_DOWN_PROJECT, TableScanPushDown.PUSH_DOWN_SORT, TableScanPushDown.PUSH_DOWN_LIMIT and etc. Not only does it reduce the performance of execution, it also lead to wrong results if we cannot do pushdown optimization.

E.g. Below 2 PPL are semantic equal, but the latter one with eval retrieves all fields from OpenSearch(which is unnecessary), while not enough documents due to the limitation of settings of plugins.query.size_limit.

# no eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":200,"timeout":"1m","_source":{"includes":["FlightTimeMin"],"excludes":[]}}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

--------------------------------------------------------

# with eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights | eval FlightMin = FlightTimeMin |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 10,
          "offset": 0
        },
        "children": [
          {
            "name": "EvalOperator",
            "description": {
              "expressions": {
                "FlightMin": "FlightTimeMin"
              }
            },
            "children": [
              {
                "name": "OpenSearchIndexScan",
                "description": {
                  "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":300,"timeout":"1m"}, searchDone=false)"""
                },
                "children": []
              }
            ]
          }
        ]
      }
    ]
  }
}

What solution would you like?
Since eval operator only adds or updates fields without any benefit on skipping or project data, and it could block other optimization, we should push down sort/limit/project operator through eval to evaluate eval's output as late as possible.

1. Limit

For limit, we could push down it under eval directly since it doesn't have any expressions.

2. Sort

For sort, for each sort field, we should analyze whether it's produced by eval.

2.1 All sort fields are not produced by eval, which means they can all be resolved by the Environment before eval, so we can push down it under eval directly

2.2 One or more fields are produced by eval, there are also some different cases we can dive deep whether replacement can work as semantic equally like sort(a) - eval(a=b+1), or other cases it cannot like sort(a) - eval(a=b^2). And there may be cases sort field are calculated by more than 1 original fields. I think we should talk about this case in another issue for advance enhancement.

2.2.1 the field is produced by equation to ReferenceExpression which is just simple rename like eval(sortField=originField), we can also push down sort by do some field replacement for these fields. Of course, there is some more complex cases like eval(newField=originField, sortField=newField), we should do replacement iteratively from right to left, or we just narrow down this case to not let it push down.
2.2.2 the field is produced by function, there are also some different cases we can dive deep whether replacement can work as semantic equally like sort(a) - eval(a=b+1), or other cases it cannot like sort(a) - eval(a=b^2). And there may be cases sort field are calculated by more than 1 original fields. I think we should talk about this case in another issue for advance enhancement.

3. Project

When we say pushing down a project under eval, it actually inserts another project before eval instead of pushing down the origin project itself.

We should consruct the new projectList by append all eval reference expressions to the original projectList, and kick off all reference expressions produced by eval. For example, if we have plan project(a, b, d) - eval(b=c+2, a=b+1), we should optimize it to project(a, b, c) - eval(a=b+1, b=c+2)) - project(c, d). Thus, we can do project in advance or push down project(c, d) into TableScan if possible, and ensure they are semantic equal.

Sub-Tasks

What alternatives have you considered?

Do you have any additional context?

The text was updated successfully, but these errors were encountered:

LantaoJin · 2024-08-02T06:11:37Z

Not all scenarios for operators sort/limit/project in a query are able to push down through eval. Could you explain more details of your design in this description? @qianheng-aws

qianheng-aws added enhancement New feature or request untriaged labels Jul 29, 2024

qianheng-aws changed the title ~~[FEATURE] Pull Up LogicalEval to evaluate as late as possible~~ [FEATURE] Push Down Sort/Limit/Project through LogicalEval Jul 29, 2024

qianheng-aws mentioned this issue Jul 31, 2024

Push down limit through eval #2876

Merged

7 tasks

LantaoJin changed the title ~~[FEATURE] Push Down Sort/Limit/Project through LogicalEval~~ [RFC] Push Down Sort/Limit/Project through LogicalEval Aug 2, 2024

LantaoJin removed the untriaged label Aug 2, 2024

qianheng-aws mentioned this issue Aug 15, 2024

Push down sort through eval #2937

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

qianheng-aws commented Jul 29, 2024 •

edited by LantaoJin

Loading

LantaoJin commented Aug 2, 2024

[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

Comments

qianheng-aws commented Jul 29, 2024 • edited by LantaoJin Loading

1. Limit

2. Sort

3. Project

LantaoJin commented Aug 2, 2024

qianheng-aws commented Jul 29, 2024 •

edited by LantaoJin

Loading