GH-33212: [C++][Python] Add use_threads to pyarrow.substrait.run_query #33623

Conversation

westonpace (Member) commented Jan 12, 2023:

Also adds memory_pool and function_registry to the various DeclarationToXyz methods. Converts ExecuteSerializedPlan to use DeclarationToReader instead of the bespoke approach it used before.
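As a rough illustration (not taken from this PR's diff), the expanded C++ entry points can be called as below; the DeclarationToTable signature matches the doxygen text quoted later in this thread, and DeclarationToReader is assumed here to take the same trailing parameters:

#include <iostream>
#include <memory>
#include <arrow/api.h>
#include <arrow/compute/exec/exec_plan.h>

namespace cp = arrow::compute;

arrow::Status RunDeclaration(cp::Declaration declaration) {
  // Collect the plan's output into a Table, using the CPU thread pool, an
  // explicit memory pool, and the default function registry.
  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::Table> table,
      cp::DeclarationToTable(declaration, /*use_threads=*/true,
                             arrow::default_memory_pool(),
                             /*function_registry=*/nullptr));

  // Alternatively, stream the results on the calling thread; DeclarationToReader
  // is assumed to accept the same optional arguments.
  ARROW_ASSIGN_OR_RAISE(auto reader, cp::DeclarationToReader(
                                         std::move(declaration),
                                         /*use_threads=*/false));

  std::cout << "rows: " << table->num_rows()
            << ", stream schema: " << reader->schema()->ToString() << std::endl;
  return arrow::Status::OK();
}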

… & function_registry to the various DeclarationToXyz methods
westonpace (Member, Author):

CC @vibhatha mind taking a look?

vibhatha (Collaborator):

> CC @vibhatha mind taking a look?

Sure @westonpace, I will take a look now.

if (use_threads) {
return DeclarationToTableAsync(std::move(declaration), *threaded_exec_context());
ExecContext ctx(memory_pool, ::arrow::internal::GetCpuThreadPool(),
vibhatha (Collaborator):

Just curious, is this always going to be using the CPU thread pool?

westonpace (Member, Author):

Yes. If the user is passing in bool use_threads then they are choosing between "current thread" and "CPU thread pool". This keeps the simple case simple for most users.

If the user doesn't want the CPU thread pool and they don't want to do everything on the calling thread then they can use the overload that takes a custom ExecContext.
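A minimal sketch of the two paths (the executor argument is hypothetical; the ExecContext constructor arguments and the async overload follow the snippet quoted above):

#include <arrow/api.h>
#include <arrow/compute/exec/exec_plan.h>
#include <arrow/util/thread_pool.h>

namespace cp = arrow::compute;

arrow::Result<std::shared_ptr<arrow::Table>> RunOnCustomExecutor(
    cp::Declaration declaration, arrow::internal::Executor* executor) {
  // Simple path, for comparison: use_threads=true runs on the CPU thread
  // pool, use_threads=false runs on the calling thread.
  //   cp::DeclarationToTable(declaration, /*use_threads=*/false);

  // Escape hatch: an explicit ExecContext carries the memory pool and the
  // caller's executor; block on the async overload for a synchronous result.
  cp::ExecContext ctx(arrow::default_memory_pool(), executor);
  return cp::DeclarationToTableAsync(std::move(declaration), ctx).result();
}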

Comment on lines -43 to -46
namespace {

/// \brief A SinkNodeConsumer specialized to output ExecBatches via PushGenerator
class SubstraitSinkConsumer : public compute::SinkNodeConsumer {
vibhatha (Collaborator):

I remember doing a similar cleanup (based on your suggestion) in another PR (not merged yet), but I cannot remember which one. Either way, it is better that we get this cleaned up here; as I recall, it was one of your suggestions.

westonpace (Member, Author):

Yes, that sounds likely. If you can remember the PR I can take another look.

vibhatha (Collaborator):

But it is probably better to merge your version; it is in the simplest form.

 ARROW_ENGINE_EXPORT Result<std::shared_ptr<RecordBatchReader>> ExecuteSerializedPlan(
     const Buffer& substrait_buffer, const ExtensionIdRegistry* registry = NULLPTR,
     compute::FunctionRegistry* func_registry = NULLPTR,
-    const ConversionOptions& conversion_options = {});
+    const ConversionOptions& conversion_options = {}, bool use_threads = true,
vibhatha (Collaborator):

Are we going to use threads by default? Is that okay? I mean, there could be various use cases that are already running things in parallel. Just curious about the usage.

westonpace (Member, Author):

The default is generally to maximize performance at whatever expense to CPU & RAM. I think this is ok. Users usually want things to run as quickly as possible.
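For a caller that already manages its own parallelism, the opt-out is just the new flag. A hedged sketch against the signature in the diff above (the header path and any trailing defaulted parameters are assumptions):

#include <arrow/api.h>
#include <arrow/engine/api.h>

// Run a serialized Substrait plan entirely on the calling thread.
arrow::Result<std::shared_ptr<arrow::RecordBatchReader>> RunSingleThreaded(
    const arrow::Buffer& substrait_buffer) {
  // use_threads=false keeps execution off the CPU thread pool; the registry,
  // function registry, and conversion options keep their defaults.
  return arrow::engine::ExecuteSerializedPlan(
      substrait_buffer, /*registry=*/nullptr, /*func_registry=*/nullptr,
      /*conversion_options=*/{}, /*use_threads=*/false);
}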

@@ -39,7 +39,7 @@ cdef CDeclaration _create_named_table_provider(dict named_args, const std_vector
         c_name = names[i]
         py_names.append(frombytes(c_name))

-    py_table = named_args["provider"](py_names)
+        py_table = named_args["provider"](py_names)
vibhatha (Collaborator):

Not sure why this is indented? Why should it run within the loop?

westonpace (Member, Author):

Oops, good catch! I was experimenting with this function for something unrelated and didn't notice the change because I was hiding whitespace changes when doing self-review.

Thanks, I have reverted this.

vibhatha (Collaborator) left a review:

@westonpace thank you for working on this. I added a few suggestions. Except for those minor changes, I think the current changes are good to go.

westonpace and others added 2 commits January 12, 2023 14:24
Co-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@users.noreply.github.com>
westonpace merged commit e1027dc into apache:master on Jan 13, 2023.
js8544 (Collaborator) commented Jan 13, 2023:

https://github.com/apache/arrow/actions/runs/3907791966/jobs/6677356655
@westonpace, I think this CI failure is caused by this PR. Could you please have a look?

vibhatha (Collaborator):

/arrow/cpp/src/arrow/compute/exec/exec_plan.h:427: error: The following parameter of arrow::compute::DeclarationToTable(Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR) is not documented:
  parameter 'declaration' (warning treated as error, aborting now)

I think this is the issue.
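(The fix presumably amounts to adding the missing \param line to that doxygen block in exec_plan.h; the surrounding context lines below are paraphrased rather than quoted from the header.)

 /// \brief Utility method to run a declaration and collect the results into a table
 ///
+/// \param declaration a declaration describing the plan to run
 /// \param use_threads if true, the CPU thread pool is used for CPU-bound work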

I have created an issue here: #33649

I will work on this.

cc @westonpace @js8544

vibhatha (Collaborator):

I created the PR here: #33651

Let's wait for the CIs.

ursabot commented Jan 14, 2023:

Benchmark runs are scheduled for baseline = 97998d8 and contender = e1027dc. e1027dc is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.99% ⬆️0.15%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.5% ⬆️0.09%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] e1027dc7 ec2-t3-xlarge-us-east-2
[Finished] e1027dc7 test-mac-arm
[Finished] e1027dc7 ursa-i9-9960x
[Finished] e1027dc7 ursa-thinkcentre-m75q
[Finished] 97998d83 ec2-t3-xlarge-us-east-2
[Finished] 97998d83 test-mac-arm
[Finished] 97998d83 ursa-i9-9960x
[Finished] 97998d83 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Successfully merging this pull request may close these issues: [Python][C++] Add use_threads to run_substrait_query