[BEAM-8162] Encode keys as NESTED for flink keyselector #9464

sunjincheng121 · 2019-09-03T05:49:20Z

This is a hotfix, the changes is encode keys as OUTER for flink key supplier.
This PR blocked the PR: #9296

Post-Commit Tests Status (on master branch)

Lang	SDK	Apex	Dataflow	Gearpump	Samza	Spark
Go		---	---	---	---
Java
Python		---		---	---
XLang	---	---	---	---	---	---

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website
Non-portable
Portable	---		---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

dmvk · 2019-09-03T08:09:38Z

Run Java PreCommit

dmvk

Wow, thanks for the help! How did you find this?

The patch makes total sense. I'm only concerned that this may break other runners (how can we make sure that they implement this correctly??). Maybe we should go the opposite way and make the flink side coder NESTED instead?

Also I'm wondering if we can create a simple @ValidatesRunner test that would make sure that both FnApi and Runner use the same coder across all runners 🤔

@robertwb You've deprecated coder contexts few years back, what is the current status of this effort?

PreCommit test failures seem to be unrelated (CassandraIO again).

dmvk · 2019-09-03T15:59:07Z

cc @kennknowles

kennknowles · 2019-09-03T19:05:00Z

Adding a couple of folks who would know the answer better. I am not sure where the conversation ended about portability and "outer" vs "nested" encodings.

robertwb · 2019-09-03T21:10:48Z

IIRC, we were moving to simply use nested coders everywhere at the portability layer. That seems the better fix.

sunjincheng121 · 2019-09-04T04:42:29Z

@robertwb thanks a lot for the feedbacks. I agree that for making our framework as simply as possible is a good idea.

What I concerned is it will introduce some overhead for each key if we use NESTED here.
Considering the performance problem, should we make an exception here, I meant that we using OUTER for group keys case is simple and efficient.

What to do you think?

sunjincheng121 · 2019-09-04T05:04:03Z

Hi @dmvk I have debug in my local many times, finally find this bug, and great thanks for helping double check the PreCommit test failures is unrelated this PR.

kennknowles · 2019-09-04T07:00:10Z

OK, this matches what I thought. I think "nested" means "self-delimiting". So for things like fixed-width twos complement integers there is no performance cost. The performance cost is only for non-fixed-width encodings like strings. The only unusual case is protobuf-style "varint" and related encodings.

dmvk · 2019-09-04T08:01:08Z

@kennknowles great, can you think about any @ValidatesRunner test case, that would ensure all runners use this correctly?

robertwb · 2019-09-04T20:06:49Z

Is there a measurable performance difference/overhead is for using nested in this context? If not, I'd rather keep thing simple over introducing unnecessary optimization.

robertwb · 2019-09-04T20:06:59Z

CC: @lukecwik

lukecwik · 2019-09-04T22:52:31Z

The proto specifically says to use NESTED:

beam/model/fn-execution/src/main/proto/beam_fn_api.proto

Line 677 in 964b964

// nested context.

We currently use NESTED everywhere in the Beam portability APIs since NESTED vs OUTER becomes quite complicated and extremely error prone. Until there are benchmarks that show that this is a noticeable area of performance concern, we should stick with a simpler API specification.

Mental note to self is to update the proto descriptions to mention self-delimiting everywhere instead of NESTED/OUTER.

sunjincheng121 · 2019-09-05T03:54:38Z

Thanks for all of your feedback! @kennknowles @robertwb @lukecwik @dmvk

Frankly speaking, I did not do any performance testing, just an intuitive theoretical speculation. If we need to use NESTED from an architectural perspective, I agree with this decision.

I have update the PR which change the OUTER to NESTED.

Best, Jincheng

dmvk

LGTM 👍 Thanks for the contribution!

mxm

I missed this PR. A couple of comments:

Why is there no JIRA issue? "Hotfixes" are not in line with the Beam development guide. This is quite an important change which effects backwards-compatibility.
There is no proper explanation why this is necessary.
The original OUTER context was chosen for a reason because there is no need to use NESTED for the Flink key. If tests are failing, I think we should look at the root cause and not change the entire encoding of Flink keys.

In my opinion, without further explanation and analysis, this warrants a revert.

mxm · 2019-09-05T12:17:24Z

It looks to me the test issues were coming from the timer key passed from the portability layer. We could easily change the encoding context there.

dmvk · 2019-09-05T13:10:28Z

@mxm reverting this until your comments are resolved

@sunjincheng121 can you please create a JIRA for the issue?

mxm · 2019-09-05T14:16:46Z

Just to be clear, fixing the underlying cause is a one liner. The following line needs to be commented out:

beam/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/ExecutableStageDoFnOperator.java

Line 349 in 8869fce

// final ByteBuffer encodedKey = FlinkKeyUtils.encodeKey(key, keyCoder);

edit: We explicitly need to change the key encoding context here:

// Move from NESTED transport (Beam) to OUTER context (Flink internal)
K bytes = CoderUtils.decodeFromByteArray(keyCoder, key.toByteArray(), Coder.Context.NESTED);
ByteBuffer encodedKey = ByteBuffer.wrap(bytes.toByteArray());

The simplification that comes from the NESTED context is worthwhile to think about, but I think we should stick to the current serialization format to not break backwards-compatibility.

tweise · 2019-09-05T14:27:59Z

+1 for reverting this

I find it helpful to annotate the source, which would not only provide relevant context but also suitable author(s) to ask questions.

@sunjincheng121 please see https://beam.apache.org/contribute/
@dmvk https://beam.apache.org/contribute/committer-guide/

sunjincheng121 · 2019-09-06T09:38:02Z

Thank you very much for all of your feedback!

There are three solutions to this problem in my thoughts (we have already mentioned it above):

Approach 1: Use OUTER in both keyselector and portability level.
Benefit: no need to add length to the key, very simple solution
Disadvantages: The architecture does not meet Luke's reference to the "use NESTED everywhere in the Beam portability" principle.

Approach 2: NESTED is used in both keselector and portability level.
Benefits: In line with Luke's reference to the "use NESTED everywhere in the Beam portability" principle.
Disadvantages: Added length on the key(but there are currently no clear performance issues found).

Approach 3: Use OUTER in the keyselector and NESTED in the portability level.
    Benefits: small code modification
    Disadvantages:
      - Decode the key every time when read/write/clean the state, so the performance is not prefect.

I have did a simple test for prepareStateBackend:
case1:
1000 characters in length, loop 10000 times, call prepareStateBackend
With decode: Time used: 84ms
Without decode: Time used: 46ms

case2:
10000 characters in length, loop 10000 times, call prepareStateBackend
With decode: Time used: 227ms
Without decode: Time used: 116ms

Hi @mxm, regarding you said: "break backwards-compatibility.". Do you meant that the break state backwards-compatibility, i.e., when change the OUTER to NESTED, the key-group-id will be changed, then user can not get the old state when upgrade the job. If so, #3 also encounter same problem. because we chanege the current key of state, user also cannot get the old state when upgrade the job. If I understand correctly, all of the approaches should have the backwards-compatibility problem. So, please correct me If there anything I misunderstand.

So, for now, I still prefer #2.

What do you think? @kennknowles @robertwb @dmvk @mxm @lukecwik @tweise

Best,
Jincheng

dmvk · 2019-09-06T12:07:09Z

Personally. I lean towards 2) as well, because it's way more error prone. These things are super hard to test and it makes the runner code hard to understand (at least for me, because I'm not that familiar with the runner code-base).

Also coder contexts are marked as deprecated, is the deprecation still valid?

lukecwik · 2019-09-06T13:37:58Z

Yes, the deprecation on coder contexts is valid. Originally coders were used to read/write from/to files in addition to encoding/decoding records that are being processed by the pipeline and that reading/writing is what really needed an OUTER context.

tweise · 2019-09-06T14:45:49Z

@sunjincheng121 thanks for the follow-up. I think this discussion would be worth having on the mailing list though :)

sunjincheng121 · 2019-09-06T15:43:00Z

Thanks for the reply @dmvk @lukecwik @tweise
I think It's good idea for bring up the discussion in mailling list @tweise
I would like to do it and let's try to find out the final solution for this problem as soon as possible.(After all, @dmvk 's PR has been opened for a long time.）

tweise · 2019-09-06T15:48:19Z

@sunjincheng121 that sounds good.

There is another PR that will unblock the 1.9 runner: #9484

The 1.9 runner PR has open issues on its own also.

mxm · 2019-09-06T15:58:56Z

(After all, @dmvk 's PR has been opened for a long time.）

It may have been open for a long time but it was only considered ready-to-review with tests passing yesterday. I too want this to be part of 1.9 but we can't force it. Things need enough time to be reviewed to ensure the quality of the next Beam release.

sunjincheng121 · 2019-09-06T16:05:59Z

Yes @mxm we already do your best try to make it available. and I can see you already do more effort for that PR. 👍
Best, Jincheng

mxm · 2019-09-06T16:22:36Z

To comment on your summary, only approach #2 and #3 are possible because we cannot change the SDKs internal encoding. The encoding should be transparent to the Runner and independent of the key encoding a Runner chooses. The state request handler which caused the issues happens to leak the internal encoding of the SDK.

I agree that it would be convenient to just use NESTED context. I'm not against it. However, I think it is also valid to keep the existing behavior, as the state requests is the only interface that exposes the raw key. In any case, it makes sense to think about this, test this, and ask the people who are most familiar with the code.

mxm · 2019-09-06T16:25:34Z

I can see you already do more effort for that PR.

I don't consider a response time of one day to be long. Please keep in mind that people may be occupied with other tasks.

sunjincheng121 · 2019-09-06T17:18:59Z

Hi @mxm @tweise I think you are much familiar with this code. So I just share my thoughts, and looking forward your decision :)

sunjincheng121 · 2019-09-06T19:05:35Z

I can see you already do more effort for that PR.

I don't consider a response time of one day to be long. Please keep in mind that people may be occupied with other tasks.

Sorry, maybe I didn't express my meaning correctly. I meant that, we already do our best to fix it. :)

sunjincheng121 · 2019-09-12T08:57:35Z

Is there any progress here? @mxm

dmvk · 2019-09-12T09:03:53Z

@sunjincheng121 #9484

sunjincheng121 · 2019-09-27T09:02:52Z

Great thanks for your remind @dmvk!

I seriously looked at the changes in #9484(already merged). I found that the changes to the Java side are essentially the same as my current PR, so until now I still think there is no problem with the solution in my PR. Merging this PR is no problem. We can always improve or fix some code if reviewer have any concern. So, I think that it is not necessary to revert this PR. I will add more comments on the #9484.

Best,
Jincheng

mxm · 2019-09-27T11:32:30Z

I very much appreciate your work on this @sunjincheng121. Let it be said, your contributions are recognized. I also have to repeat that your solution in this PR was incomplete. Not only did it not adhere to our contribution standard because it was originally without a JIRA ("hotfix"), but it was also missing the Python part which would have broken the Python support, as our tests showed. It required careful investigation to find out the root of the problem and to not break any other SDK as the result of a quick fix. We could not iterate on your PR because it was already merged, so reverting it was the only option.

I know that getting one's commit reverted is not a good feeling. It happens to everyone eventually, and it is not to diminish your contributors in any way. Keep up the great contributions. I know that everyone in the Beam community greatly appreciates your work.

sunjincheng121 · 2019-09-29T10:48:38Z

Hi @mxm, thanks for your reply. I am glad to share my thoughts as follows:

We can create a JIRA if there is no JIRA number. This is also mentioned by @dmvk.
Actually I find that there are many commits without JIRA numbers in Beam, so I don't think this is a strong reason to revert a commit.
(BTW, this PR is helping @dmvk to fix BEAM-7730's PR issue: [BEAM-7730] Introduce Flink 1.9 Runner #9296 (comment)).
Regarding to breaking the Python part, the tests already passed when merging this PR and so we didn't find this problem.Besides, as this commit doesn't break the master tests and will not block other contributors, so we have enough time to fix the Python part if we find that it is broken. So we have other better choices instead of reverting the commit.
Regarding to the changes to fix the Python problem on top of this PR, I think less changes are needed and only need add encode_nested and decode_nested in coder.py and bundle_processor.py for using the encode_nested(It's very clear in your PR). In this way, we can not only solve this problem, but also encourage contributors and committers.

BTW, not all the new contributors know the rules of Beam well. We should have a better way to correct the mistakes of new contributors. Reverting is not the only way, nor the best way.

I believe we have the same goal: to find a better way to make contributors happy to contribute to the Beam community.

So I just share my thoughts not only for this PR but also for how to deal with same kinds of problems in the future.

Great thanks for affirmation of my contribution. Feel free to correct me if there are anything incorrect.

Best,
Jincheng

sunjincheng121 mentioned this pull request Sep 3, 2019

[BEAM-7730] Introduce Flink 1.9 Runner #9296

Merged

dmvk self-requested a review September 3, 2019 07:16

dmvk reviewed Sep 3, 2019

View reviewed changes

kennknowles requested review from lukecwik and boyuanzz September 3, 2019 19:03

[hotfix] Encode keys as NESTED for flink keyselector

4a0eaf3

sunjincheng121 force-pushed the hotfix branch from a856332 to 4a0eaf3 Compare September 5, 2019 03:50

sunjincheng121 changed the title ~~[hotfix] Encode keys as OUTER for flink key supplier~~ [hotfix] Encode keys as NESTED for flink keyselector Sep 5, 2019

dmvk approved these changes Sep 5, 2019

View reviewed changes

dmvk merged commit ce4cd3b into apache:master Sep 5, 2019

mxm reviewed Sep 5, 2019

View reviewed changes

dmvk mentioned this pull request Sep 5, 2019

Revert "[hotfix] Encode keys as NESTED for flink keyselector" #9483

Merged

sunjincheng121 changed the title ~~[hotfix] Encode keys as NESTED for flink keyselector~~ [BEAM-8162] Encode keys as NESTED for flink keyselector Sep 6, 2019

sunjincheng121 mentioned this pull request Sep 27, 2019

[BEAM-8157] Ensure key encoding for state requests is consistent across SDKs #9484

Merged

[BEAM-8162] Encode keys as NESTED for flink keyselector #9464

[BEAM-8162] Encode keys as NESTED for flink keyselector #9464

Conversation

sunjincheng121 commented Sep 3, 2019

Post-Commit Tests Status (on master branch)

Pre-Commit Tests Status (on master branch)

dmvk commented Sep 3, 2019

dmvk left a comment

Choose a reason for hiding this comment

dmvk commented Sep 3, 2019

kennknowles commented Sep 3, 2019

robertwb commented Sep 3, 2019

sunjincheng121 commented Sep 4, 2019

sunjincheng121 commented Sep 4, 2019

kennknowles commented Sep 4, 2019

dmvk commented Sep 4, 2019

robertwb commented Sep 4, 2019

robertwb commented Sep 4, 2019

lukecwik commented Sep 4, 2019

sunjincheng121 commented Sep 5, 2019

dmvk left a comment

Choose a reason for hiding this comment

mxm left a comment

Choose a reason for hiding this comment

mxm commented Sep 5, 2019

dmvk commented Sep 5, 2019

mxm commented Sep 5, 2019 • edited Loading

tweise commented Sep 5, 2019

sunjincheng121 commented Sep 6, 2019

dmvk commented Sep 6, 2019

lukecwik commented Sep 6, 2019

tweise commented Sep 6, 2019

sunjincheng121 commented Sep 6, 2019

tweise commented Sep 6, 2019

mxm commented Sep 6, 2019

sunjincheng121 commented Sep 6, 2019

mxm commented Sep 6, 2019

mxm commented Sep 6, 2019

sunjincheng121 commented Sep 6, 2019

sunjincheng121 commented Sep 6, 2019

sunjincheng121 commented Sep 12, 2019

dmvk commented Sep 12, 2019

sunjincheng121 commented Sep 27, 2019

mxm commented Sep 27, 2019

sunjincheng121 commented Sep 29, 2019

mxm commented Sep 5, 2019 •

edited

Loading