chore(query): improve project set #16326

Merged: 7 commits, Aug 27, 2024


@Dousir9 Dousir9 commented Aug 25, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Improve the performance of project set for flatten, json_each, json_array_elements, and jq.

Performance Test (Databend Cloud XSMALL)

select sum(id), sum(LENGTH(value)) from t, lateral flatten(input => c);

Query Duration: 7.6s → 3.1s (about 2.45x faster)
Snowflake XSMALL: 2.6s
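
As a quick check on the figures above, the ratios can be worked out directly. This is only arithmetic on the three timings quoted in this description; the variable names are illustrative and not part of the PR.

```python
# Timings quoted in this PR description (seconds, XSMALL warehouses).
before_s = 7.6      # Databend Cloud before this change
after_s = 3.1       # Databend Cloud after this change
snowflake_s = 2.6   # Snowflake reference run

speedup = before_s / after_s               # how many times faster the query now runs
reduction = 1 - after_s / before_s         # fraction of wall time removed
ratio_to_snowflake = after_s / snowflake_s # remaining gap to the reference run

print(f"speedup: {speedup:.2f}x")
print(f"time reduced by: {reduction:.0%}")
print(f"duration relative to Snowflake: {ratio_to_snowflake:.2f}x")
```

Reading the improvement as a speedup factor avoids the ambiguity of a bare percentage, which could mean either a ratio or a reduction.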

Full Test Script

Prepare json data

import json
import csv
import os

def create_json_data(num_entries):
    data = []
    for i in range(num_entries):
        entry = {
            "id": i + 1,
            "name": f"name_{i + 1}",
            "age": 20 + (i % 10),
            "email": f"user{i + 1}@example.com",
            "address": {
                "street": f"street_{i + 1}",
                "city": f"city_{i % 5}",
                "postal_code": f"{10000 + i}"
            }
        }
        data.append(entry)
    return data

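def save_as_variant_csv(data, filename):
    # Write next to this script; abspath guards against a relative __file__.
    dir_path = os.path.dirname(os.path.abspath(__file__))
    with open(os.path.join(dir_path, filename), "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        # Each row is (row index, JSON text); the JSON text loads into the variant column.
        for index, entry in enumerate(data):
            writer.writerow([index, json.dumps(entry)])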

if __name__ == "__main__":
    num_entries = 1000000
    filename = "dataset.csv"

    json_data = create_json_data(num_entries)
    save_as_variant_csv(json_data, filename)
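
Before loading a million rows, it can help to confirm that the JSON text survives CSV quoting. The sketch below writes one row the same way the script does, but into an in-memory buffer, and parses it back. The sample entry is a hypothetical, shortened version of what create_json_data produces.

```python
import csv
import io
import json

# Hypothetical sample entry, shaped like the generated data but shortened.
entry = {"id": 1, "name": "name_1", "address": {"city": "city_0"}}

# Write one (index, JSON text) row into memory instead of a file.
buf = io.StringIO(newline="")
csv.writer(buf).writerow([0, json.dumps(entry)])

# Read the row back; csv quoting should preserve the JSON text exactly.
buf.seek(0)
index, variant_text = next(csv.reader(buf))
round_tripped = json.loads(variant_text)

print(index, round_tripped == entry)
```

The embedded double quotes in the JSON are doubled by csv.writer, which is why the file is loaded with a CSV file format rather than split on commas by hand.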

Create table

create or replace table t(id int, c variant);
COPY INTO t FROM 'fs:////Users/xujinkai/Desktop/MyTest/databend/json' files = ('dataset.csv')  file_format = (type = CSV);

Test

select sum(id), sum(LENGTH(value)) from t, lateral flatten(input => c);
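
To make clear what this query exercises, here is a rough Python model of a non-recursive flatten over one row. It assumes flatten emits one output row per top-level key of the object, and it uses the length of the JSON text as a stand-in for LENGTH(value); the engine's exact text form may differ. The function name flatten_top_level and the sample values are hypothetical.

```python
import json

def flatten_top_level(doc):
    """Yield (key_or_index, value_as_json_text) for each top-level element."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield key, json.dumps(value)
    elif isinstance(doc, list):
        for i, value in enumerate(doc):
            yield i, json.dumps(value)
    # Scalars produce no rows in this simplified model.

# One input row: the table's id column and its variant column c.
row_id = 7
c = {
    "id": 8,
    "name": "name_8",
    "age": 27,
    "email": "user8@example.com",
    "address": {"street": "street_8", "city": "city_2", "postal_code": "10007"},
}

rows = list(flatten_top_level(c))

# The lateral join repeats the outer id once per flattened row.
sum_id = sum(row_id for _ in rows)
sum_len = sum(len(text) for _, text in rows)

print(len(rows), sum_id, sum_len)
```

Under this model each generated document yields five output rows, so the one-million-row table fans out to roughly five million rows before aggregation, which is the path this PR speeds up.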

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Covered by existing tests

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-chore label ("this PR only has small changes that no need to record, like coding styles.") Aug 25, 2024
@Dousir9 Dousir9 added the ci-cloud label ("Build docker image for cloud test") Aug 26, 2024

Docker Image for PR

  • tag: pr-16326-88125a4-1724684469

note: this image tag is only available for internal use,
please check the internal doc for more details.

@Dousir9 Dousir9 marked this pull request as ready for review August 26, 2024 15:51
@Dousir9 Dousir9 added this pull request to the merge queue Aug 27, 2024
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Aug 27, 2024
@BohuTANG BohuTANG merged commit a055124 into databendlabs:main Aug 27, 2024
71 checks passed