Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect scan metrics and feed them into the datafusion analyze framework #696

Closed
ShiKaiWi opened this issue Mar 3, 2023 · 1 comment · Fixed by #714
Closed

Collect scan metrics and feed them into the datafusion analyze framework #696

ShiKaiWi opened this issue Mar 3, 2023 · 1 comment · Fixed by #714
Assignees
Labels
feature New feature or request

Comments

@ShiKaiWi
Copy link
Member

ShiKaiWi commented Mar 3, 2023

Describe This Problem

By datafusion analyze utilities, we can use explain analyze [sql] to retrieve a description for the costs of every execution stage of a sql, e.g.:

"CoalescePartitionsExec, metrics=[output_rows=1, elapsed_compute=21.596µs, spill_count=0, spilled_bytes=0, mem_used=0]
  ProjectionExec: expr=[tsid@0 as tsid, opdate@1 as opdate, f_info_delistdate@2 as f_info_delistdate, f_info_corp_fundmanagementid@3 as f_info_corp_fundmanagementid, f_sales_service_rate@4 as f_sales_service_rate, f_info_custodianbankid@5 as f_info_custodianbankid, max_num_holder@6 as max_num_holder, f_investment_area@7 as f_investment_area, max_num_coltarget@8 as max_num_coltarget, investstrategy@9 as investstrategy, close_institu_oef_down@10 as close_institu_oef_down, risk_return@11 as risk_return, f_pchredm_pchminamt@12 as f_pchredm_pchminamt, close_institu_oef_up@13 as close_institu_oef_up, f_pchredm_pchminamt_ex@14 as f_pchredm_pchminamt_ex, f_info_investobject@15 as f_info_investobject, s_fellow_distor@16 as s_fellow_distor, f_info_investconception@17 as f_info_investconception, f_info_trustee@18 as f_info_trustee, f_info_firstinvesttype@19 as f_info_firstinvesttype, f_info_decision_basis@20 as f_info_decision_basis, f_personal_subtype@21 as f_personal_subtype, is_indexfund@22 as is_indexfund, f_pchredm_pchstartdate@23 as f_pchredm_pchstartdate, f_info_setupdate@24 as f_info_setupdate, f_info_type@25 as f_info_type, f_info_redmstartdate@26 as f_info_redmstartdate, f_info_maturitydate@27 as f_info_maturitydate, close_institu_subtype@28 as close_institu_subtype, f_info_isinitial@29 as f_info_isinitial, f_info_minbuyamount@30 as f_info_minbuyamount, f_info_pinyin@31 as f_info_pinyin, f_issue_totalunit@32 as f_issue_totalunit, f_info_investscope@33 as f_info_investscope, f_issue_oef_startdateinst@34 as f_issue_oef_startdateinst, f_info_managementfeeratio@35 as f_info_managementfeeratio, f_info_registrant@36 as f_info_registrant, opmode@37 as opmode, f_issue_oef_dnddateinst@38 as f_issue_oef_dnddateinst, f_personal_startdateind@39 as f_personal_startdateind, f_info_custodianfeeratio@40 as f_info_custodianfeeratio, f_personal_enddateind@41 as f_personal_enddateind, f_info_parvalue@42 as f_info_parvalue, f_info_fund_id@43 as f_info_fund_id, crny_code@44 as crny_code, f_info_trusttype@45 as f_info_trusttype, f_info_listdate@46 as f_info_listdate, f_info_restrictedornot@47 as f_info_restrictedornot, f_info_ptmyear@48 as f_info_ptmyear, f_info_anndate@49 as f_info_anndate, f_info_expectedrateofreturn@50 as f_info_expectedrateofreturn, f_info_structuredornot@51 as f_info_structuredornot, f_closed_operation_period@52 as f_closed_operation_period, f_info_fullname@53 as f_info_fullname, f_info_issuingplace@54 as f_info_issuingplace, f_closed_operation_interval@55 as f_closed_operation_interval, f_info_exchmarket@56 as f_info_exchmarket, f_info_name@57 as f_info_name, object_id@58 as object_id, f_info_benchmark@59 as f_info_benchmark, f_info_firstinveststyle@60 as f_info_firstinveststyle, f_info_windcode@61 as f_info_windcode, f_info_corp_fundmanagementcomp@62 as f_info_corp_fundmanagementcomp, f_info_status@63 as f_info_status, f_info_front_code@64 as f_info_front_code, f_info_issuedate@65 as f_info_issuedate, f_info_custodianbank@66 as f_info_custodianbank, f_info_backend_code@67 as f_info_backend_code, is_dfd_del@68 as is_dfd_del], metrics=[output_rows=1, elapsed_compute=10.073µs, spill_count=0, spilled_bytes=0, mem_used=0]
    CoalesceBatchesExec: target_batch_size=4096, metrics=[output_rows=1, elapsed_compute=139.837µs, spill_count=0, spilled_bytes=0, mem_used=0]
      FilterExec: is_dfd_del@68 IS NULL OR is_dfd_del@68 = 0 AND f_info_windcode@61 = 002298.OF AND opdate@1 >= 0 AND opdate@1 <= 1677665081864, metrics=[output_rows=1, elapsed_compute=70.81µs, spill_count=0, spilled_bytes=0, mem_used=0]
        ScanTable: table=vanguardwind_chinamutualfunddescription_v1, parallelism=8, order=None, , metrics=[]

However, the metrics of ScanTable which is provided by CeresDB is empty, and the most of the performance issues happen in this stage, so without the metrics, it is hard to troubleshoot performance problems.

Proposal

Collect the metrics of ScanTable stage, and feed them into the datafusion analyze framework.

Additional Context

No response

@ShiKaiWi ShiKaiWi added the feature New feature or request label Mar 3, 2023
@ShiKaiWi ShiKaiWi self-assigned this Mar 3, 2023
@jiacai2050
Copy link
Contributor

Some metrics I think of

  • Number of SST to read
  • Read cost of each SST
  • Pushdowned filters

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants