[FEATURE] Perform big data performance and regression tests #1862

Open
Yury-Fridlyand opened this issue Jul 12, 2023 · 0 comments
Labels
ci infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc.

Comments

@Yury-Fridlyand
Collaborator

Is your feature request related to a problem?

Issue from the customer: https://opensearch.slack.com/archives/C0526AVT84S/p1689036508739229

For context, my_alias points to 800 indices. Each index is sorted by hw_id and snapshot_day, holds exactly one unique value of snapshot_day, and contains ~750 million records distributed across 10 shards of ~30GB each.

The cluster is in AWS, has 20 data nodes and 3 master nodes. All are r6g.12xlarge.search.
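
To give a sense of the scale involved, the figures quoted above can be multiplied out. This is only a back-of-the-envelope sketch: the replica count is not stated in the report, so the numbers below assume primary shards only, and all inputs are the approximate values given by the customer.

```python
# Approximate figures quoted in the report above.
# ASSUMPTION: replicas are not mentioned, so only primary shards are counted.
INDICES = 800
RECORDS_PER_INDEX = 750_000_000
SHARDS_PER_INDEX = 10
SHARD_SIZE_GB = 30
DATA_NODES = 20

total_records = INDICES * RECORDS_PER_INDEX           # records behind my_alias
total_shards = INDICES * SHARDS_PER_INDEX             # primary shards behind my_alias
total_size_tb = total_shards * SHARD_SIZE_GB / 1000   # decimal terabytes
shards_per_node = total_shards / DATA_NODES           # if spread evenly

print(f"records:         {total_records:,}")
print(f"primary shards:  {total_shards:,}")
print(f"data size (TB):  {total_size_tb:,.0f}")
print(f"shards per node: {shards_per_node:,.0f}")
```

Under these assumptions the alias covers roughly 600 billion records in 8,000 primary shards (about 240 TB), or about 400 shards per data node.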

SQL query runs ~8 seconds:

select *
from my_alias
where hw_id = 'abcd'
  and (snapshot_day between cast('2023-04-27' as date) and cast('2023-05-03' as date)
    or snapshot_day between cast('2023-06-30' as date) and cast('2023-07-05' as date))
limit 10000

The DSL equivalent runs in ~4 seconds:

{
  "from": 0,
  "size": 10000,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "hw_id": "abcd"
          }
        },
        {
          "bool": {
            "should": [
              {
                "range": {
                  "snapshot_day": {
                    "gte": "2023-04-27",
                    "lte": "2023-05-03"
                  }
                }
              },
              {
                "range": {
                  "snapshot_day": {
                    "gte": "2023-06-30",
                    "lte": "2023-07-05"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
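
For a test suite that needs many variations of this query, the request body could be generated rather than written by hand. The following is a minimal sketch, assuming the field names from the query above; `build_query` is a hypothetical helper, not part of any OpenSearch client.

```python
import json


def build_query(hw_id, day_ranges, size=10000):
    """Build the bool/filter request body shown above.

    `day_ranges` is a list of (start, end) date strings, both inclusive.
    Hypothetical helper for illustration only.
    """
    should = [
        {"range": {"snapshot_day": {"gte": start, "lte": end}}}
        for start, end in day_ranges
    ]
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"hw_id": hw_id}},
                    {"bool": {"should": should}},
                ]
            }
        },
    }


body = build_query(
    "abcd",
    [("2023-04-27", "2023-05-03"), ("2023-06-30", "2023-07-05")],
)
print(json.dumps(body, indent=2))
```

The printed JSON should match the request body above, so the same helper could feed both the DSL run and a comparison against the SQL plugin.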

NOTE
Follow the whole Slack discussion: the query was later optimized and accelerated.

What solution would you like?

  • Get or generate a huge dataset
  • Allocate a cluster for tests
  • Build a test framework (maybe reuse Jenkins)
  • Run the test and investigate this specific issue
  • Run more tests to find other bottlenecks
  • Re-run all tests on all OpenSearch releases to detect degradation
  • Update release workflow to repeat these tests before every code freeze
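
The core of such a framework could be as small as a timing loop plus a comparison against a stored baseline. The sketch below is only an illustration under stated assumptions: `measure` and `is_regression` are hypothetical names, the query is passed in as any callable (for example a function that sends the SQL or DSL request to the test cluster), and the 20% tolerance is an arbitrary placeholder.

```python
import statistics
import time


def measure(run_query, repeats=5):
    """Return the median wall-clock time, in seconds, of `run_query` over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def is_regression(current_s, baseline_s, tolerance=0.2):
    """Flag a run that is more than `tolerance` (as a fraction) slower than the baseline."""
    return current_s > baseline_s * (1 + tolerance)


# Example with a stand-in workload; a real test would call the cluster here.
elapsed = measure(lambda: sum(range(100_000)), repeats=3)
print(f"median: {elapsed:.6f}s, regression vs 1s baseline: {is_regression(elapsed, 1.0)}")
```

Using the median rather than the mean keeps a single slow run (cold caches, GC pauses) from dominating the result. With the timings from this issue, an 8 second run against a 4 second baseline would be flagged.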

What alternatives have you considered?

  • Create Best Practices documentation section
  • Automatically optimize some functions by replacing them with literals, e.g. DATE('...') with DATE '...', PI() with 3.1415, and so on, to reduce or even completely avoid scripts in pushed-down filters
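
The literal-replacement idea can be illustrated on the query text itself. This is a rough sketch only: it uses regular expressions on the raw SQL string, whereas an actual implementation in the plugin would more likely rewrite the parsed expression tree, and a regex will also rewrite matches that happen to sit inside quoted strings. `fold_constants` is a hypothetical name.

```python
import math
import re

# DATE('<literal>') with a single quoted argument, e.g. DATE('2023-04-27')
_DATE_CALL = re.compile(r"\bDATE\s*\(\s*'([^']*)'\s*\)", re.IGNORECASE)
# PI() with no arguments
_PI_CALL = re.compile(r"\bPI\s*\(\s*\)", re.IGNORECASE)


def fold_constants(sql):
    """Replace constant function calls with equivalent literals (illustrative only)."""
    sql = _DATE_CALL.sub(lambda m: f"DATE '{m.group(1)}'", sql)
    sql = _PI_CALL.sub(repr(math.pi), sql)
    return sql


print(fold_constants("SELECT * FROM t WHERE d = DATE('2023-04-27')"))
print(fold_constants("SELECT PI() * r FROM t"))
```

Calls whose arguments are not literals, such as DATE(some_column), are left untouched because they cannot be evaluated ahead of time.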

Do you have any additional context?

Opened #1847

Yury-Fridlyand added the enhancement, untriaged, infrastructure, ci, and maintenance labels and removed the untriaged label on Jul 12, 2023
acarbonetto removed the enhancement and maintenance labels on Jul 31, 2023