Support IGNORE_ERRORS when scanning from pyarrow/pandas #4646

Open
royi-luo wants to merge 19 commits into master from royi/ignore-errors-df-polars

Collaborator

@royi-luo royi-luo commented Dec 17, 2024

Description

Adds support for the default IGNORE_ERRORS behaviour (skipping rows that trigger PK-related exceptions) when scanning from pandas/pyarrow.
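
To illustrate the behaviour described above, here is a minimal C++ sketch of skipping rows whose primary key is already present. It is not Kuzu code: `Row`, `copyRows`, and the `std::unordered_map` standing in for a node table are all hypothetical names chosen for the example.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row type: a primary key plus one payload column.
struct Row {
    int64_t id;
    std::string name;
};

// Copies rows into `table`, keyed by primary key.
// With ignoreErrors == false, the first duplicate key aborts the copy.
// With ignoreErrors == true, offending rows are skipped and counted.
// Returns the number of skipped rows.
inline uint64_t copyRows(const std::vector<Row>& rows,
    std::unordered_map<int64_t, std::string>& table, bool ignoreErrors) {
    uint64_t numSkipped = 0;
    for (const auto& row : rows) {
        const bool inserted = table.emplace(row.id, row.name).second;
        if (inserted) {
            continue;
        }
        if (!ignoreErrors) {
            throw std::runtime_error("Duplicate primary key: " + std::to_string(row.id));
        }
        ++numSkipped;
    }
    return numSkipped;
}
```

In this sketch an aborted copy leaves the earlier rows in place; a real database would presumably roll the transaction back instead.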

Fixes #4534

Contributor agreement

@royi-luo royi-luo self-assigned this Dec 17, 2024
@royi-luo royi-luo force-pushed the royi/ignore-errors-df-polars branch from 3fdb113 to 2f21964 Compare December 17, 2024 21:35
codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 62.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 86.51%. Comparing base (970f5c4) to head (46d076d).
Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/function/table/bind_data.cpp | 60.00% | 2 Missing ⚠️ |
| ...rc/include/function/table/simple_table_functions.h | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4646   +/-   ##
=======================================
  Coverage   86.50%   86.51%           
=======================================
  Files        1369     1372    +3     
  Lines       57955    57999   +44     
  Branches     7203     7209    +6     
=======================================
+ Hits        50136    50175   +39     
- Misses       7652     7657    +5     
  Partials      167      167           

This comment was marked as outdated.

@royi-luo royi-luo force-pushed the royi/ignore-errors-df-polars branch from 615623b to 074809c Compare December 18, 2024 17:55
@royi-luo royi-luo force-pushed the royi/ignore-errors-df-polars branch from 074809c to dbad6a6 Compare December 18, 2024 18:17

@royi-luo royi-luo force-pushed the royi/ignore-errors-df-polars branch 2 times, most recently from 65e95d3 to 2f0172f Compare December 19, 2024 02:49
@royi-luo royi-luo force-pushed the royi/ignore-errors-df-polars branch from 1ee740f to ed44e03 Compare December 19, 2024 03:26
@royi-luo royi-luo marked this pull request as ready for review December 19, 2024 03:53
@royi-luo royi-luo requested a review from ray6080 December 19, 2024 03:53
Benchmark Result

Master commit hash: 970f5c43e58661c004038fd0a07e18f5fc3e1dcb
Branch commit hash: 5577693b03752ff14708163f93d5c8269ef0597f

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 648.40 | 656.85 | -8.45 (-1.29%) |
| aggregation | q28 | 11835.33 | 11744.63 | 90.70 (0.77%) |
| filter | q14 | 125.15 | 125.64 | -0.49 (-0.39%) |
| filter | q15 | 126.86 | 129.02 | -2.16 (-1.67%) |
| filter | q16 | 310.27 | 302.46 | 7.81 (2.58%) |
| filter | q17 | 455.60 | 447.10 | 8.51 (1.90%) |
| filter | q18 | 1922.69 | 1963.15 | -40.46 (-2.06%) |
| filter | zonemap-node | 87.02 | 87.08 | -0.06 (-0.07%) |
| filter | zonemap-node-lhs-cast | 89.01 | 87.37 | 1.64 (1.87%) |
| filter | zonemap-rel | 5792.17 | 5464.53 | 327.64 (6.00%) |
| fixed_size_expr_evaluator | q07 | 578.19 | 584.23 | -6.04 (-1.03%) |
| fixed_size_expr_evaluator | q08 | 808.46 | 813.72 | -5.26 (-0.65%) |
| fixed_size_expr_evaluator | q09 | 810.85 | 817.39 | -6.54 (-0.80%) |
| fixed_size_expr_evaluator | q10 | 244.45 | 240.40 | 4.06 (1.69%) |
| fixed_size_expr_evaluator | q11 | 236.03 | 233.31 | 2.71 (1.16%) |
| fixed_size_expr_evaluator | q12 | 234.70 | 231.42 | 3.28 (1.42%) |
| fixed_size_expr_evaluator | q13 | 1454.99 | 1484.51 | -29.52 (-1.99%) |
| fixed_size_seq_scan | q23 | 115.55 | 112.83 | 2.72 (2.41%) |
| join | q29 | 597.91 | 613.93 | -16.02 (-2.61%) |
| join | q30 | 1513.31 | 1545.78 | -32.47 (-2.10%) |
| join | q31 | 5.82 | 6.39 | -0.57 (-8.93%) |
| join | SelectiveTwoHopJoin | 56.53 | 47.77 | 8.75 (18.32%) |
| ldbc_snb_ic | q35 | 2722.90 | 2603.36 | 119.54 (4.59%) |
| ldbc_snb_ic | q36 | 511.24 | 558.12 | -46.88 (-8.40%) |
| ldbc_snb_is | q32 | 4.29 | 5.18 | -0.89 (-17.17%) |
| ldbc_snb_is | q33 | 13.11 | 12.15 | 0.96 (7.93%) |
| ldbc_snb_is | q34 | 1.10 | 1.08 | 0.02 (2.14%) |
| multi-rel | multi-rel-large-scan | 1214.24 | 1216.57 | -2.34 (-0.19%) |
| multi-rel | multi-rel-lookup | 33.88 | 30.38 | 3.50 (11.51%) |
| multi-rel | multi-rel-small-scan | 68.09 | 74.65 | -6.56 (-8.78%) |
| order_by | q25 | 134.78 | 134.91 | -0.13 (-0.10%) |
| order_by | q26 | 462.13 | 456.55 | 5.59 (1.22%) |
| order_by | q27 | 1485.00 | 1478.33 | 6.68 (0.45%) |
| recursive_join | recursive-join-bidirection | 293.34 | 260.87 | 32.47 (12.45%) |
| recursive_join | recursive-join-dense | 7353.50 | 5001.85 | 2351.65 (47.02%) |
| recursive_join | recursive-join-path | 23847.94 | 23387.10 | 460.85 (1.97%) |
| recursive_join | recursive-join-sparse | 14457.29 | 13850.58 | 606.71 (4.38%) |
| recursive_join | recursive-join-trail | 7321.76 | 5044.11 | 2277.66 (45.15%) |
| scan_after_filter | q01 | 170.70 | 176.81 | -6.11 (-3.45%) |
| scan_after_filter | q02 | 156.02 | 161.55 | -5.53 (-3.42%) |
| shortest_path_ldbc100 | q37 | 89.53 | 95.79 | -6.26 (-6.54%) |
| shortest_path_ldbc100 | q38 | 368.59 | 355.75 | 12.85 (3.61%) |
| shortest_path_ldbc100 | q39 | 64.32 | 57.57 | 6.74 (11.71%) |
| shortest_path_ldbc100 | q40 | 437.41 | 433.75 | 3.66 (0.84%) |
| var_size_expr_evaluator | q03 | 2087.17 | 2142.09 | -54.92 (-2.56%) |
| var_size_expr_evaluator | q04 | 2225.50 | 2304.84 | -79.34 (-3.44%) |
| var_size_expr_evaluator | q05 | 2628.59 | 2634.66 | -6.07 (-0.23%) |
| var_size_expr_evaluator | q06 | 1363.39 | 1353.87 | 9.52 (0.70%) |
| var_size_seq_scan | q19 | 1465.67 | 1488.09 | -22.41 (-1.51%) |
| var_size_seq_scan | q20 | 2558.03 | 2420.47 | 137.56 (5.68%) |
| var_size_seq_scan | q21 | 2308.75 | 2300.22 | 8.53 (0.37%) |
| var_size_seq_scan | q22 | 127.94 | 128.75 | -0.80 (-0.62%) |

Contributor

@ray6080 ray6080 left a comment

There are several KUZU_API changes here, and I don't follow why they're now needed in this PR.

CMakeLists.txt Outdated
@@ -317,7 +317,7 @@ add_subdirectory(third_party)
if(${BUILD_KUZU})
add_definitions(-DKUZU_ROOT_DIRECTORY="${PROJECT_SOURCE_DIR}")
add_definitions(-DKUZU_CMAKE_VERSION="${CMAKE_PROJECT_VERSION}")
add_definitions(-DKUZU_EXTENSION_VERSION="0.7.0")
Contributor

Hmm, do we need to upgrade the extension version?

Collaborator Author

I added a virtual function to TableFuncBindData, which is inherited by some classes in the extensions (e.g. JSONScanBindData), so I probably need to bump the extension version.

That single change (needing to export ScanBindData/TableFuncBindData) caused a number of dynamic casts in the macOS extension test to fail because the cast targets were not exported, which is why I added all those KUZU_API changes. I'm not completely sure why the failures showed up now and not before, though.

Contributor

Okay. We no longer need to bump the extension version manually in PRs. The CI pipeline will handle this.

Contributor

@ray6080 ray6080 Dec 20, 2024

This is a bit strange, as TableFuncBindData already had a virtual function before this PR. I wonder why we didn't need to mark it KUZU_API previously but do now. @acquamarin Can you also take a look at the KUZU_API changes in this PR?

Collaborator Author

It's probably because the only virtual method before was copy, which is pure virtual (so the definitions are all in the child classes). The method I added has a default implementation, so it actually needs to be exported for child classes in the extensions to access it.
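
A minimal C++ sketch of the situation described above, under stated assumptions: `SKETCH_API` stands in for KUZU_API, and `BindDataBase`, `JsonLikeBindData`, and `getIgnoreErrorsOption` are hypothetical names rather than the actual Kuzu declarations. Everything is in one file here; in a real build the base class would be compiled into the core library and the child into an extension.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical export macro standing in for KUZU_API.
#if defined(_WIN32)
#define SKETCH_API __declspec(dllexport)
#else
#define SKETCH_API __attribute__((visibility("default")))
#endif

// Base class that would be declared in a header of the core library.
struct SKETCH_API BindDataBase {
    virtual ~BindDataBase() = default;

    // Pure virtual: the core library has no body for this; each child supplies one.
    virtual std::unique_ptr<BindDataBase> copy() const = 0;

    // Virtual with a default body that is defined out of line below.
    virtual bool getIgnoreErrorsOption() const;
};

// In a real project this definition would sit in a .cpp file compiled into the
// core library, so its symbol has to be visible to extensions that link against it.
bool BindDataBase::getIgnoreErrorsOption() const {
    return false;
}

// Child class that would live in an extension (a separate shared library).
struct JsonLikeBindData final : BindDataBase {
    std::string path;

    explicit JsonLikeBindData(std::string filePath) : path{std::move(filePath)} {}

    std::unique_ptr<BindDataBase> copy() const override {
        return std::make_unique<JsonLikeBindData>(path);
    }
    // getIgnoreErrorsOption() is not overridden, so calls resolve to the base body.
};
```

One plausible contributing factor, not confirmed in this thread: under the Itanium C++ ABI (used on macOS and Linux), the first non-pure, non-inline virtual function is the class's key function, and the vtable and type info are emitted only in the translation unit that defines it. Adding an out-of-line default implementation could therefore move that type info into the core library, where it needs default visibility for `dynamic_cast` to work across the library boundary.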

src/binder/bound_scan_source.cpp (resolved)
@@ -17,6 +18,7 @@ namespace kuzu {
PyArrowScanConfig::PyArrowScanConfig(const common::case_insensitive_map_t<Value>& options) {
Contributor

I wonder why PyArrowScan and PandasScan have different ways to convert scanConfig. I need to take another look to see if we can unify them, but can you also take a look?

Collaborator Author

I think the reason is that we don't support the pyarrow scan options (skipNum and limitNum) in pandas scan, so before this PR we didn't actually support any configuration for pandas scan. I could look into whether it's feasible to support those options in pandas scan, and if it is, I'll add the support and unify the config in a separate PR.
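
For discussion, a unified config could look roughly like the C++ sketch below. This is only an illustration under assumptions: `ScanConfigSketch` and `parseScanConfig` are hypothetical, the option keys `SKIP`, `LIMIT`, and `IGNORE_ERRORS` are assumed spellings, and a plain `std::map` with upper-cased keys stands in for `case_insensitive_map_t<Value>`.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <limits>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical unified config shared by the pyarrow and pandas scans.
struct ScanConfigSketch {
    uint64_t skipNum = 0;
    uint64_t limitNum = std::numeric_limits<uint64_t>::max();
    bool ignoreErrors = false;
};

// Upper-cases a key so that option names match case-insensitively.
inline std::string toUpper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

// Builds a config from user options; unknown keys and negative counts are rejected.
inline ScanConfigSketch parseScanConfig(const std::map<std::string, int64_t>& options) {
    ScanConfigSketch config;
    for (const auto& option : options) {
        const std::string name = toUpper(option.first);
        const int64_t value = option.second;
        if (name == "IGNORE_ERRORS") {
            config.ignoreErrors = value != 0;
        } else if (name == "SKIP" || name == "LIMIT") {
            if (value < 0) {
                throw std::invalid_argument(option.first + " must be non-negative");
            }
            if (name == "SKIP") {
                config.skipNum = static_cast<uint64_t>(value);
            } else {
                config.limitNum = static_cast<uint64_t>(value);
            }
        } else {
            throw std::invalid_argument("Unrecognized scan option: " + option.first);
        }
    }
    return config;
}
```

With one struct like this, a scan that does not support skipping or limiting could still reuse the parser and simply reject non-default values for the fields it cannot honour.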

Benchmark Result

Master commit hash: bfe46c071c48fcf8fcdaa911238929bf617c53d1
Branch commit hash: 41af2155e39eac31dcf0619f368e3843afca570b

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 649.95 | 642.92 | 7.03 (1.09%) |
| aggregation | q28 | 11348.79 | 11143.70 | 205.09 (1.84%) |
| filter | q14 | 125.65 | 125.51 | 0.14 (0.11%) |
| filter | q15 | 128.80 | 131.77 | -2.97 (-2.25%) |
| filter | q16 | 302.14 | 300.58 | 1.56 (0.52%) |
| filter | q17 | 447.86 | 444.57 | 3.29 (0.74%) |
| filter | q18 | 1950.47 | 1936.40 | 14.07 (0.73%) |
| filter | zonemap-node | 87.58 | N/A | N/A |
| filter | zonemap-node-lhs-cast | 86.64 | N/A | N/A |
| filter | zonemap-rel | 5948.40 | N/A | N/A |
| fixed_size_expr_evaluator | q07 | 573.05 | 547.12 | 25.93 (4.74%) |
| fixed_size_expr_evaluator | q08 | 818.57 | 761.17 | 57.40 (7.54%) |
| fixed_size_expr_evaluator | q09 | 813.74 | 761.27 | 52.46 (6.89%) |
| fixed_size_expr_evaluator | q10 | 241.72 | 241.04 | 0.68 (0.28%) |
| fixed_size_expr_evaluator | q11 | 234.97 | 235.99 | -1.02 (-0.43%) |
| fixed_size_expr_evaluator | q12 | 225.94 | 234.97 | -9.03 (-3.84%) |
| fixed_size_expr_evaluator | q13 | 1480.76 | 1468.12 | 12.64 (0.86%) |
| fixed_size_seq_scan | q23 | 117.44 | 121.73 | -4.29 (-3.52%) |
| join | q29 | 633.61 | 652.48 | -18.87 (-2.89%) |
| join | q30 | 1554.59 | 1504.33 | 50.26 (3.34%) |
| join | q31 | 5.68 | 8.07 | -2.39 (-29.65%) |
| join | SelectiveTwoHopJoin | 53.00 | N/A | N/A |
| ldbc_snb_ic | q35 | 2669.22 | 388.13 | 2281.10 (587.72%) |
| ldbc_snb_ic | q36 | 533.62 | 35.88 | 497.74 (1387.05%) |
| ldbc_snb_is | q32 | 5.12 | 7.02 | -1.89 (-26.99%) |
| ldbc_snb_is | q33 | 13.28 | 16.35 | -3.07 (-18.76%) |
| ldbc_snb_is | q34 | 1.03 | 4.13 | -3.10 (-75.08%) |
| multi-rel | multi-rel-large-scan | 1390.97 | 1737.67 | -346.70 (-19.95%) |
| multi-rel | multi-rel-lookup | 29.42 | 61.18 | -31.76 (-51.91%) |
| multi-rel | multi-rel-small-scan | 85.39 | 67.45 | 17.94 (26.59%) |
| order_by | q25 | 135.65 | 135.63 | 0.02 (0.01%) |
| order_by | q26 | 460.31 | 459.01 | 1.30 (0.28%) |
| order_by | q27 | 1468.06 | 1467.03 | 1.03 (0.07%) |
| recursive_join | recursive-join-bidirection | 305.03 | N/A | N/A |
| recursive_join | recursive-join-dense | 7409.49 | N/A | N/A |
| recursive_join | recursive-join-path | 24095.37 | N/A | N/A |
| recursive_join | recursive-join-sparse | 14151.22 | N/A | N/A |
| recursive_join | recursive-join-trail | 7389.07 | N/A | N/A |
| scan_after_filter | q01 | 172.80 | 170.69 | 2.12 (1.24%) |
| scan_after_filter | q02 | 158.14 | 159.40 | -1.27 (-0.79%) |
| shortest_path_ldbc100 | q37 | 83.80 | 3336.32 | -3252.52 (-97.49%) |
| shortest_path_ldbc100 | q38 | 365.89 | 68.44 | 297.45 (434.64%) |
| shortest_path_ldbc100 | q39 | 63.58 | 85.71 | -22.13 (-25.82%) |
| shortest_path_ldbc100 | q40 | 425.11 | 74.58 | 350.53 (469.99%) |
| var_size_expr_evaluator | q03 | 2119.21 | 2057.00 | 62.21 (3.02%) |
| var_size_expr_evaluator | q04 | 2251.00 | 2241.83 | 9.16 (0.41%) |
| var_size_expr_evaluator | q05 | 2713.13 | 2625.31 | 87.82 (3.35%) |
| var_size_expr_evaluator | q06 | 1351.10 | 1346.92 | 4.17 (0.31%) |
| var_size_seq_scan | q19 | 1478.50 | 1468.72 | 9.78 (0.67%) |
| var_size_seq_scan | q20 | 2857.87 | 2766.27 | 91.60 (3.31%) |
| var_size_seq_scan | q21 | 2396.68 | 2263.77 | 132.91 (5.87%) |
| var_size_seq_scan | q22 | 127.48 | 128.98 | -1.49 (-1.16%) |

Benchmark Result

Master commit hash: ea98fb1d3ebd69858317b682e02046974116f56c
Branch commit hash: 7b44801175c67bf6756898e944cb6d674feed6a8

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 669.36 | 655.83 | 13.52 (2.06%) |
| aggregation | q28 | 11892.68 | 11940.05 | -47.37 (-0.40%) |
| filter | q14 | 125.54 | 128.75 | -3.22 (-2.50%) |
| filter | q15 | 133.39 | 129.98 | 3.40 (2.62%) |
| filter | q16 | 316.15 | 304.38 | 11.77 (3.87%) |
| filter | q17 | 448.53 | 454.03 | -5.50 (-1.21%) |
| filter | q18 | 1906.39 | 1962.50 | -56.12 (-2.86%) |
| filter | zonemap-node | 88.72 | 90.91 | -2.19 (-2.41%) |
| filter | zonemap-node-lhs-cast | 88.63 | 89.18 | -0.56 (-0.62%) |
| filter | zonemap-node-null | 84.89 | 85.24 | -0.35 (-0.41%) |
| filter | zonemap-rel | 5675.46 | 5804.58 | -129.12 (-2.22%) |
| fixed_size_expr_evaluator | q07 | 570.46 | 590.29 | -19.84 (-3.36%) |
| fixed_size_expr_evaluator | q08 | 801.09 | 822.09 | -21.00 (-2.55%) |
| fixed_size_expr_evaluator | q09 | 803.61 | 807.54 | -3.92 (-0.49%) |
| fixed_size_expr_evaluator | q10 | 235.89 | 245.48 | -9.59 (-3.91%) |
| fixed_size_expr_evaluator | q11 | 229.14 | 236.87 | -7.73 (-3.26%) |
| fixed_size_expr_evaluator | q12 | 225.73 | 238.14 | -12.41 (-5.21%) |
| fixed_size_expr_evaluator | q13 | 1477.69 | 1468.58 | 9.10 (0.62%) |
| fixed_size_seq_scan | q23 | 117.29 | 120.64 | -3.35 (-2.78%) |
| join | q29 | 599.38 | 638.23 | -38.85 (-6.09%) |
| join | q30 | 1553.34 | 1602.93 | -49.60 (-3.09%) |
| join | q31 | 4.51 | 3.53 | 0.98 (27.63%) |
| join | SelectiveTwoHopJoin | 42.37 | 52.94 | -10.58 (-19.98%) |
| ldbc_snb_ic | q35 | 2591.06 | 2637.72 | -46.66 (-1.77%) |
| ldbc_snb_ic | q36 | 573.38 | 531.72 | 41.65 (7.83%) |
| ldbc_snb_is | q32 | 6.60 | 6.04 | 0.57 (9.40%) |
| ldbc_snb_is | q33 | 13.51 | 15.45 | -1.94 (-12.54%) |
| ldbc_snb_is | q34 | 1.10 | 1.04 | 0.06 (5.73%) |
| multi-rel | multi-rel-large-scan | 1223.99 | 1200.95 | 23.04 (1.92%) |
| multi-rel | multi-rel-lookup | 20.64 | 17.02 | 3.63 (21.31%) |
| multi-rel | multi-rel-small-scan | 93.98 | 103.27 | -9.29 (-9.00%) |
| order_by | q25 | 137.73 | 136.17 | 1.56 (1.14%) |
| order_by | q26 | 447.92 | 456.77 | -8.85 (-1.94%) |
| order_by | q27 | 1463.35 | 1489.28 | -25.93 (-1.74%) |
| recursive_join | recursive-join-bidirection | 287.94 | 297.74 | -9.80 (-3.29%) |
| recursive_join | recursive-join-dense | 7410.09 | 7439.58 | -29.49 (-0.40%) |
| recursive_join | recursive-join-path | 23988.30 | 24135.27 | -146.97 (-0.61%) |
| recursive_join | recursive-join-sparse | 14384.62 | 14878.48 | -493.86 (-3.32%) |
| recursive_join | recursive-join-trail | 7365.13 | 7387.09 | -21.96 (-0.30%) |
| scan_after_filter | q01 | 173.59 | 171.14 | 2.45 (1.43%) |
| scan_after_filter | q02 | 160.20 | 158.43 | 1.78 (1.12%) |
| shortest_path_ldbc100 | q37 | 84.83 | 98.94 | -14.11 (-14.26%) |
| shortest_path_ldbc100 | q38 | 368.38 | 348.53 | 19.85 (5.70%) |
| shortest_path_ldbc100 | q39 | 62.94 | 65.64 | -2.70 (-4.11%) |
| shortest_path_ldbc100 | q40 | 430.61 | 446.69 | -16.08 (-3.60%) |
| var_size_expr_evaluator | q03 | 2067.73 | 2095.21 | -27.48 (-1.31%) |
| var_size_expr_evaluator | q04 | 2202.20 | 2289.30 | -87.10 (-3.80%) |
| var_size_expr_evaluator | q05 | 2688.44 | 2653.28 | 35.16 (1.33%) |
| var_size_expr_evaluator | q06 | 1339.62 | 1361.55 | -21.93 (-1.61%) |
| var_size_seq_scan | q19 | 1450.09 | 1480.01 | -29.92 (-2.02%) |
| var_size_seq_scan | q20 | 2673.31 | 2780.40 | -107.10 (-3.85%) |
| var_size_seq_scan | q21 | 2315.31 | 2316.18 | -0.87 (-0.04%) |
| var_size_seq_scan | q22 | 129.13 | 129.22 | -0.09 (-0.07%) |

Successfully merging this pull request may close these issues.

Feature: Support IGNORE_ERRORS when scanning from in-memory sources