Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] Support udf through RPC #7519

Merged
merged 1 commit into from
Feb 8, 2022
Merged

[feature] Support udf through RPC #7519

merged 1 commit into from
Feb 8, 2022

Conversation

yangzhg
Copy link
Member

@yangzhg yangzhg commented Dec 29, 2021

Proposed changes

Support UDF is implemented through GRPC protocol. This brings several benefits:

  1. The udf implementation language is not limited to c++, users can use any familiar language to implement udf
  2. UDF is decoupled from Doris, udf will not be doris coredump together, udf computing resources are separated from doris, and doris services are not affected

But RPC's UDF has a fixed overhead, so its performance is much slower than C++ UDF, especially when the amount of data is large.

Create function like

CREATE FUNCTION rpc_add(INT, INT) RETURNS INT PROPERTIES (
  "SYMBOL"="add_int",
  "OBJECT_FILE"="127.0.0.1:9999",
  "TYPE"="RPC"
);

function service need to implements check_fn and fn_call methods
#7502 #7578

the rpc udf call is a little slower than the native function call in vectorized mode:

image
image
But it will much slow in non-vectorized mode, and if using rpc udf in vectorized mode the query time is equal to native function call in non-vectorized mode
image

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Code refactor (Modify the code structure, format the code, etc...)
  • Optimization. Including functional usability improvements and performance improvements.
  • Dependency. Such as changes related to third-party components.
  • Other.

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have created an issue on (Fix [Feature] Support UDF through RPC  #7520) and described the bug/feature there in detail
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • If these changes need document changes, I have updated the document
  • Any dependent changes have been merged

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@yangzhg yangzhg changed the title [feature] Support udf through RPC Dec 29, 2021
@morningman morningman self-assigned this Jan 5, 2022
@stalary
Copy link
Contributor

stalary commented Jan 5, 2022

Let me see

@morningman morningman assigned morningman and stalary and unassigned morningman Jan 5, 2022
@morningman morningman added area/udf Issues or PRs related to the UDF kind/feature Categorizes issue or PR as related to a new feature. labels Jan 5, 2022
@stalary
Copy link
Contributor

stalary commented Jan 8, 2022

Whether RPC addresses support load balance?

@stalary
Copy link
Contributor

stalary commented Jan 8, 2022

_brpc_stub_cache split _internal_client_cache& _function_client_cache Whether compatibility problems exist

_node_info.brpc_port)) {
ExecEnv::GetInstance()->brpc_stub_cache()->erase(_open_closure->cntl.remote_side());
}
ExecEnv::GetInstance()->brpc_internal_client_cache()->erase(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it no longer necessary to judge unavailabe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not cost so much when create a new stub

ptype->set_type(PParameterType::NULL_TYPE);
continue;
}
switch (_children[i]->type().type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether need to support TYPE_STRING?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I will add later


if (_client == nullptr) {
return Status::InternalError(
fmt::format("rpc env init error: {}/{}", _fn.hdfs_location, _rpc_function_symbol));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hdfs_location?

be/src/exprs/rpc_fn_call.cpp Outdated Show resolved Hide resolved
Status st = _eval_children(context, row, &response);
if (!st.ok() || response.status().status_code() != 0 ||
(response.result().has_null() && response.result().null_map(0))) {
res_val.is_null = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to return null if RPC failed?

PFunctionCallResponse response;
request.set_function_name(_symbol);
int64_t name_hash = 0;
murmur_hash3_x64_64(_symbol.c_str(), _symbol.size(), 21217891, &name_hash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does murmur_hash3_x64_64 work for ARM64?

@@ -63,3 +67,150 @@ message PUniqueId {
required int64 lo = 2;
};

message PGenericType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a same datatype define in data.proto, how about unify them?
I will do this in PR #7939

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, but I think the datatype define in data.proto can not describe complex types very well, PGenericType + PList + PMap + PField + PStruct + PDecimal may be better. And BTW, datatype should not to complex, it should be used more easier between languages

gensrc/proto/types.proto Outdated Show resolved Hide resolved
gensrc/proto/types.proto Outdated Show resolved Hide resolved
@yangzhg yangzhg force-pushed the rpc_udf branch 6 times, most recently from 452dd4c to 8754929 Compare February 7, 2022 07:03
…ocol. This brings several benefits:

1. The udf implementation language is not limited to c++, users can use any familiar language to implement udf
2. UDF is decoupled from Doris, udf will not be doris coredump together, udf computing resources are separated from doris, and doris services are not affected

But RPC's UDF has a fixed overhead, so its performance is much slower than C++ UDF, especially when the amount of data is large.

Create function like

```
CREATE FUNCTION rpc_add(INT, INT) RETURNS INT PROPERTIES (
  "SYMBOL"="add_int",
  "OBJECT_FILE"="127.0.0.1:9999",
  "TYPE"="RPC"
);
```
function service need to implements `check_fn` and `fn_call` methods
Note:
THIS IS AN EXPERIMENTAL FEATRUE, THE INTERFACE AND DATA STRUCTURE MAY CHANGED IN FUTURE !!
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 7, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Feb 7, 2022

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 7, 2022

PR approved by anyone and no changes requested.

@morningman
Copy link
Contributor

Document need to be added later

@yangzhg yangzhg merged commit f8d086d into apache:master Feb 8, 2022
@yangzhg yangzhg deleted the rpc_udf branch February 14, 2022 01:48
@yangzhg yangzhg mentioned this pull request Mar 3, 2022
@morningman morningman mentioned this pull request Mar 8, 2022
3 tasks
@wzymumon wzymumon mentioned this pull request May 31, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. area/udf Issues or PRs related to the UDF kind/feature Categorizes issue or PR as related to a new feature. reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Support UDF through RPC
3 participants