Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-8162] Encode keys as NESTED for flink keyselector #9464

Merged
merged 1 commit into from
Sep 5, 2019

Conversation

sunjincheng121
Copy link
Member

This is a hotfix, the changes is encode keys as OUTER for flink key supplier.
This PR blocked the PR: #9296

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@dmvk
Copy link
Member

dmvk commented Sep 3, 2019

Run Java PreCommit

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, thanks for the help! How did you find this?

The patch makes total sense. I'm only concerned that this may break other runners (how can we make sure that they implement this correctly??). Maybe we should go the opposite way and make the flink side coder NESTED instead?

Also I'm wondering if we can create a simple @ValidatesRunner test that would make sure that both FnApi and Runner use the same coder across all runners 🤔

@robertwb You've deprecated coder contexts few years back, what is the current status of this effort?

PreCommit test failures seem to be unrelated (CassandraIO again).

@dmvk
Copy link
Member

dmvk commented Sep 3, 2019

cc @kennknowles

@kennknowles
Copy link
Member

Adding a couple of folks who would know the answer better. I am not sure where the conversation ended about portability and "outer" vs "nested" encodings.

@robertwb
Copy link
Contributor

robertwb commented Sep 3, 2019

IIRC, we were moving to simply use nested coders everywhere at the portability layer. That seems the better fix.

@sunjincheng121
Copy link
Member Author

@robertwb thanks a lot for the feedbacks. I agree that for making our framework as simply as possible is a good idea.

What I concerned is it will introduce some overhead for each key if we use NESTED here.
Considering the performance problem, should we make an exception here, I meant that we using OUTER for group keys case is simple and efficient.

What to do you think?

@sunjincheng121
Copy link
Member Author

Hi @dmvk I have debug in my local many times, finally find this bug, and great thanks for helping double check the PreCommit test failures is unrelated this PR.

@kennknowles
Copy link
Member

OK, this matches what I thought. I think "nested" means "self-delimiting". So for things like fixed-width twos complement integers there is no performance cost. The performance cost is only for non-fixed-width encodings like strings. The only unusual case is protobuf-style "varint" and related encodings.

@dmvk
Copy link
Member

dmvk commented Sep 4, 2019

@kennknowles great, can you think about any @ValidatesRunner test case, that would ensure all runners use this correctly?

@robertwb
Copy link
Contributor

robertwb commented Sep 4, 2019

Is there a measurable performance difference/overhead is for using nested in this context? If not, I'd rather keep thing simple over introducing unnecessary optimization.

@robertwb
Copy link
Contributor

robertwb commented Sep 4, 2019

CC: @lukecwik

@lukecwik
Copy link
Member

lukecwik commented Sep 4, 2019

The proto specifically says to use NESTED:

We currently use NESTED everywhere in the Beam portability APIs since NESTED vs OUTER becomes quite complicated and extremely error prone. Until there are benchmarks that show that this is a noticeable area of performance concern, we should stick with a simpler API specification.

Mental note to self is to update the proto descriptions to mention self-delimiting everywhere instead of NESTED/OUTER.

@sunjincheng121 sunjincheng121 changed the title [hotfix] Encode keys as OUTER for flink key supplier [hotfix] Encode keys as NESTED for flink keyselector Sep 5, 2019
@sunjincheng121
Copy link
Member Author

Thanks for all of your feedback! @kennknowles @robertwb @lukecwik @dmvk

Frankly speaking, I did not do any performance testing, just an intuitive theoretical speculation. If we need to use NESTED from an architectural perspective, I agree with this decision.

I have update the PR which change the OUTER to NESTED.

Best, Jincheng

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thanks for the contribution!

@dmvk dmvk merged commit ce4cd3b into apache:master Sep 5, 2019
Copy link
Contributor

@mxm mxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this PR. A couple of comments:

  1. Why is there no JIRA issue? "Hotfixes" are not in line with the Beam development guide. This is quite an important change which effects backwards-compatibility.
  2. There is no proper explanation why this is necessary.
  3. The original OUTER context was chosen for a reason because there is no need to use NESTED for the Flink key. If tests are failing, I think we should look at the root cause and not change the entire encoding of Flink keys.

In my opinion, without further explanation and analysis, this warrants a revert.

@mxm
Copy link
Contributor

mxm commented Sep 5, 2019

It looks to me the test issues were coming from the timer key passed from the portability layer. We could easily change the encoding context there.

@dmvk
Copy link
Member

dmvk commented Sep 5, 2019

@mxm reverting this until your comments are resolved

@sunjincheng121 can you please create a JIRA for the issue?

@mxm
Copy link
Contributor

mxm commented Sep 5, 2019

Just to be clear, fixing the underlying cause is a one liner. The following line needs to be commented out:

edit: We explicitly need to change the key encoding context here:

// Move from NESTED transport (Beam) to OUTER context (Flink internal)
K bytes = CoderUtils.decodeFromByteArray(keyCoder, key.toByteArray(), Coder.Context.NESTED);
ByteBuffer encodedKey = ByteBuffer.wrap(bytes.toByteArray());

The simplification that comes from the NESTED context is worthwhile to think about, but I think we should stick to the current serialization format to not break backwards-compatibility.

@tweise
Copy link
Contributor

tweise commented Sep 5, 2019

+1 for reverting this

I find it helpful to annotate the source, which would not only provide relevant context but also suitable author(s) to ask questions.

@sunjincheng121 please see https://beam.apache.org/contribute/
@dmvk https://beam.apache.org/contribute/committer-guide/

@sunjincheng121
Copy link
Member Author

Thank you very much for all of your feedback!

There are three solutions to this problem in my thoughts (we have already mentioned it above):

  Approach 1: Use OUTER in both keyselector and portability level.
     Benefit: no need to add length to the key, very simple solution
     Disadvantages: The architecture does not meet Luke's reference to the "use NESTED everywhere in the Beam portability" principle.

  Approach 2: NESTED is used in both keselector and portability level.
     Benefits: In line with Luke's reference to the "use NESTED everywhere in the Beam portability" principle.
     Disadvantages: Added length on the key(but there are currently no clear performance issues found).

  Approach 3: Use OUTER in the keyselector and NESTED in the portability level.
    Benefits: small code modification
    Disadvantages:
      - Decode the key every time when read/write/clean the state, so the performance is not prefect.

I have did a simple test for prepareStateBackend:
case1:
1000 characters in length, loop 10000 times, call prepareStateBackend
With decode: Time used: 84ms
Without decode: Time used: 46ms

case2:
10000 characters in length, loop 10000 times, call prepareStateBackend
With decode: Time used: 227ms
Without decode: Time used: 116ms

Hi @mxm, regarding you said: "break backwards-compatibility.". Do you meant that the break state backwards-compatibility, i.e., when change the OUTER to NESTED, the key-group-id will be changed, then user can not get the old state when upgrade the job. If so, #3 also encounter same problem. because we chanege the current key of state, user also cannot get the old state when upgrade the job. If I understand correctly, all of the approaches should have the backwards-compatibility problem. So, please correct me If there anything I misunderstand.

So, for now, I still prefer #2.

What do you think? @kennknowles @robertwb @dmvk @mxm @lukecwik @tweise

Best,
Jincheng

@dmvk
Copy link
Member

dmvk commented Sep 6, 2019

Personally. I lean towards 2) as well, because it's way more error prone. These things are super hard to test and it makes the runner code hard to understand (at least for me, because I'm not that familiar with the runner code-base).

Also coder contexts are marked as deprecated, is the deprecation still valid?

@lukecwik
Copy link
Member

lukecwik commented Sep 6, 2019

Yes, the deprecation on coder contexts is valid. Originally coders were used to read/write from/to files in addition to encoding/decoding records that are being processed by the pipeline and that reading/writing is what really needed an OUTER context.

@tweise
Copy link
Contributor

tweise commented Sep 6, 2019

@sunjincheng121 thanks for the follow-up. I think this discussion would be worth having on the mailing list though :)

@sunjincheng121
Copy link
Member Author

Thanks for the reply @dmvk @lukecwik @tweise
I think It's good idea for bring up the discussion in mailling list @tweise
I would like to do it and let's try to find out the final solution for this problem as soon as possible.(After all, @dmvk 's PR has been opened for a long time.)

@tweise
Copy link
Contributor

tweise commented Sep 6, 2019

@sunjincheng121 that sounds good.

There is another PR that will unblock the 1.9 runner: #9484

The 1.9 runner PR has open issues on its own also.

@sunjincheng121 sunjincheng121 changed the title [hotfix] Encode keys as NESTED for flink keyselector [BEAM-8162] Encode keys as NESTED for flink keyselector Sep 6, 2019
@mxm
Copy link
Contributor

mxm commented Sep 6, 2019

(After all, @dmvk 's PR has been opened for a long time.)

It may have been open for a long time but it was only considered ready-to-review with tests passing yesterday. I too want this to be part of 1.9 but we can't force it. Things need enough time to be reviewed to ensure the quality of the next Beam release.

@sunjincheng121
Copy link
Member Author

Yes @mxm we already do your best try to make it available. and I can see you already do more effort for that PR. 👍
Best, Jincheng

@mxm
Copy link
Contributor

mxm commented Sep 6, 2019

To comment on your summary, only approach #2 and #3 are possible because we cannot change the SDKs internal encoding. The encoding should be transparent to the Runner and independent of the key encoding a Runner chooses. The state request handler which caused the issues happens to leak the internal encoding of the SDK.

I agree that it would be convenient to just use NESTED context. I'm not against it. However, I think it is also valid to keep the existing behavior, as the state requests is the only interface that exposes the raw key. In any case, it makes sense to think about this, test this, and ask the people who are most familiar with the code.

@mxm
Copy link
Contributor

mxm commented Sep 6, 2019

I can see you already do more effort for that PR.

I don't consider a response time of one day to be long. Please keep in mind that people may be occupied with other tasks.

@sunjincheng121
Copy link
Member Author

Hi @mxm @tweise I think you are much familiar with this code. So I just share my thoughts, and looking forward your decision :)

@sunjincheng121
Copy link
Member Author

I can see you already do more effort for that PR.

I don't consider a response time of one day to be long. Please keep in mind that people may be occupied with other tasks.

Sorry, maybe I didn't express my meaning correctly. I meant that, we already do our best to fix it. :)

@sunjincheng121
Copy link
Member Author

Is there any progress here? @mxm

@dmvk
Copy link
Member

dmvk commented Sep 12, 2019

@sunjincheng121 #9484

@sunjincheng121
Copy link
Member Author

Great thanks for your remind @dmvk!

I seriously looked at the changes in #9484(already merged). I found that the changes to the Java side are essentially the same as my current PR, so until now I still think there is no problem with the solution in my PR. Merging this PR is no problem. We can always improve or fix some code if reviewer have any concern. So, I think that it is not necessary to revert this PR. I will add more comments on the #9484.

Best,
Jincheng

@mxm
Copy link
Contributor

mxm commented Sep 27, 2019

I very much appreciate your work on this @sunjincheng121. Let it be said, your contributions are recognized. I also have to repeat that your solution in this PR was incomplete. Not only did it not adhere to our contribution standard because it was originally without a JIRA ("hotfix"), but it was also missing the Python part which would have broken the Python support, as our tests showed. It required careful investigation to find out the root of the problem and to not break any other SDK as the result of a quick fix. We could not iterate on your PR because it was already merged, so reverting it was the only option.

I know that getting one's commit reverted is not a good feeling. It happens to everyone eventually, and it is not to diminish your contributors in any way. Keep up the great contributions. I know that everyone in the Beam community greatly appreciates your work.

@sunjincheng121
Copy link
Member Author

Hi @mxm, thanks for your reply. I am glad to share my thoughts as follows:

  1. We can create a JIRA if there is no JIRA number. This is also mentioned by @dmvk.
    Actually I find that there are many commits without JIRA numbers in Beam, so I don't think this is a strong reason to revert a commit.
    (BTW, this PR is helping @dmvk to fix BEAM-7730's PR issue: [BEAM-7730] Introduce Flink 1.9 Runner #9296 (comment)).

  2. Regarding to breaking the Python part, the tests already passed when merging this PR and so we didn't find this problem.Besides, as this commit doesn't break the master tests and will not block other contributors, so we have enough time to fix the Python part if we find that it is broken. So we have other better choices instead of reverting the commit.

  3. Regarding to the changes to fix the Python problem on top of this PR, I think less changes are needed and only need add encode_nested and decode_nested in coder.py and bundle_processor.py for using the encode_nested(It's very clear in your PR). In this way, we can not only solve this problem, but also encourage contributors and committers.

BTW, not all the new contributors know the rules of Beam well. We should have a better way to correct the mistakes of new contributors. Reverting is not the only way, nor the best way.

I believe we have the same goal: to find a better way to make contributors happy to contribute to the Beam community.

So I just share my thoughts not only for this PR but also for how to deal with same kinds of problems in the future.

Great thanks for affirmation of my contribution. Feel free to correct me if there are anything incorrect.

Best,
Jincheng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants