GH-33212: [C++][Python] Add use_threads to pyarrow.substrait.run_query #33623
Conversation
… & function_registry to the various DeclarationToXyz methods
CC @vibhatha mind taking a look?
Sure @westonpace, I will take a look now.
if (use_threads) {
  return DeclarationToTableAsync(std::move(declaration), *threaded_exec_context());
ExecContext ctx(memory_pool, ::arrow::internal::GetCpuThreadPool(),
Just curious, is this always going to be using the CPU thread pool?
Yes. If the user is passing in bool use_threads, then they are choosing between "current thread" and "CPU thread pool". This keeps the simple case simple for most users. If the user doesn't want the CPU thread pool and doesn't want to do everything on the calling thread, they can use the overload that takes a custom ExecContext.
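The dispatch pattern described above can be sketched in plain Python. This is an illustrative stand-in (using `concurrent.futures`, with a hypothetical `run_plan` function and a zero-argument callable in place of a real declaration), not PyArrow's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# A shared pool standing in for Arrow's CPU thread pool.
_cpu_pool = ThreadPoolExecutor(max_workers=4)

def run_plan(plan, use_threads=True):
    """Run `plan` (a zero-argument callable standing in for a declaration)
    either on the shared pool or on the calling thread."""
    if use_threads:
        # "CPU thread pool" case: hand the work to the shared executor.
        return _cpu_pool.submit(plan).result()
    # "current thread" case: run synchronously on the caller's thread.
    return plan()

print(run_plan(lambda: sum(range(10))))                     # threaded path
print(run_plan(lambda: sum(range(10)), use_threads=False))  # serial path
```

A third option, taking a caller-supplied executor instead of a bool, would correspond to the overload accepting a custom ExecContext.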
namespace {

/// \brief A SinkNodeConsumer specialized to output ExecBatches via PushGenerator
class SubstraitSinkConsumer : public compute::SinkNodeConsumer {
I remember doing a similar cleanup (based on your suggestion) in another PR (surely not merged yet), but I cannot remember which one. Either way, it is better that we get this cleaned up; as I recall, this was one of your suggestions.
Yes, that sounds likely. If you can remember the PR I can take another look.
But it is probably better to merge your version; it is in the simplest form.
ARROW_ENGINE_EXPORT Result<std::shared_ptr<RecordBatchReader>> ExecuteSerializedPlan(
    const Buffer& substrait_buffer, const ExtensionIdRegistry* registry = NULLPTR,
    compute::FunctionRegistry* func_registry = NULLPTR,
    const ConversionOptions& conversion_options = {});
    const ConversionOptions& conversion_options = {}, bool use_threads = true,
Are we going to use threads by default? Is that okay? I mean, depending on the use case, callers may already be running things in parallel. Just curious about the usage.
The default is generally to maximize performance at whatever expense to CPU & RAM. I think this is ok. Users usually want things to run as quickly as possible.
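The nested-parallelism concern above has a common escape hatch: a caller that already fans work out across its own pool can opt each call out of internal threading. A hedged stdlib sketch (with a hypothetical `run_query_stub` standing in for a per-query entry point such as run_query; the real call would execute a Substrait plan):

```python
from concurrent.futures import ThreadPoolExecutor

def run_query_stub(n, use_threads=True):
    # Stand-in for a per-query entry point; the real call would execute
    # a Substrait plan.  Here it just squares n.
    return n * n

# The caller already parallelizes across queries with its own pool, so
# each individual query opts out of internal threading to avoid
# oversubscribing the CPU.
with ThreadPoolExecutor(max_workers=4) as outer:
    results = list(outer.map(
        lambda n: run_query_stub(n, use_threads=False), range(5)))

print(results)  # [0, 1, 4, 9, 16]
```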
python/pyarrow/_substrait.pyx (outdated)
@@ -39,7 +39,7 @@ cdef CDeclaration _create_named_table_provider(dict named_args, const std_vector
        c_name = names[i]
        py_names.append(frombytes(c_name))

        py_table = named_args["provider"](py_names)
Not sure why this is indented? Why should it run within the loop?
Oops, good catch! I was experimenting with this function for something unrelated and didn't notice the change because I was hiding whitespace changes when doing self-review.
Thanks, I have reverted this.
@westonpace thank you for working on this. I added a few suggestions. Except for those minor changes, I think the current changes are good to go.
Co-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@users.noreply.github.com>
https://github.com/apache/arrow/actions/runs/3907791966/jobs/6677356655
/arrow/cpp/src/arrow/compute/exec/exec_plan.h:427: error: The following parameter of arrow::compute::DeclarationToTable(Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR) is not documented:
parameter 'declaration' (warning treated as error, aborting now)
I think this is the issue. I have created an issue here: #33649. I will work on this.
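The CI failure above comes from a documentation-completeness check: every parameter of a documented function must itself be documented. The same discipline applies to Python docstrings; a hedged, illustrative stand-in for the DeclarationToTable signature (names mirror the C++ API, but this function is hypothetical):

```python
def declaration_to_table(declaration, use_threads=True,
                         memory_pool=None, function_registry=None):
    """Illustrative stand-in for DeclarationToTable's signature.

    Parameters
    ----------
    declaration : object
        The plan to execute (the parameter the Doxygen check flagged
        as undocumented in the C++ header).
    use_threads : bool, default True
        If True, run on the CPU thread pool; otherwise run on the
        calling thread.
    memory_pool : object, optional
        Memory pool used for allocations during execution.
    function_registry : object, optional
        Registry used to look up functions referenced by the plan.
    """
    # Echo the arguments so the signature is easy to exercise.
    return (declaration, use_threads, memory_pool, function_registry)
```

Documenting every parameter up front avoids exactly this class of warning-as-error build break.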
I created the PR here: #33651 Let's wait for the CIs.
Benchmark runs are scheduled for baseline = 97998d8 and contender = e1027dc. e1027dc is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Also adds memory_pool and function_registry to the various DeclarationToXyz methods. Converts ExecuteSerializedPlan to DeclarationToReader instead of the bespoke thing it was doing before.