
[SPARK-27812][K8S] Bump K8S client version to 4.6.1 #26093

Closed
wants to merge 6 commits

Conversation

igorcalabria
Contributor

@igorcalabria igorcalabria commented Oct 11, 2019

What changes were proposed in this pull request?

Updated kubernetes client.

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785

Does this PR introduce any user-facing change?

Nope, it should be transparent to users

How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after main runs. This is the case for spark versions <= 2.4.0. On version 2.4.3, the JVM process hangs because there is a non-daemon thread running:

"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]

This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from spark <= 2.4.0 is restored and both processes terminate successfully
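
If you want to confirm this from inside PySpark itself, a minimal sketch follows (my own illustration, not part of the patch): it lists the driver JVM's non-daemon threads through the py4j gateway, and before the fix the lingering OkHttp WebSocket threads show up in this listing.

```python
# Illustration only: enumerate the driver JVM's non-daemon threads from
# PySpark via the py4j gateway. Before the client upgrade, the leaked
# "OkHttp WebSocket ..." threads appear here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm  # py4j view of the driver JVM

# java.lang.Thread.getAllStackTraces() returns Map<Thread, StackTraceElement[]>
for t in jvm.java.lang.Thread.getAllStackTraces().keySet().toArray():
    if not t.isDaemon():
        print(t.getName())
```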

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for making a PR, @igorcalabria.
Currently, our testing Jenkins server is down. I'll review this PR after Jenkins is back.

@ifilonenko
Contributor

@igorcalabria thank you for writing this patch

I have run this internally on my cluster and can confirm that non-daemon threads no longer block the JVM from exiting.

Against branch 2.4 and master:

I saw the following non-daemon threads when running spark-submit against this pyspark job:

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

After the Python process exited, the JVM process remained, and jstack showed the following:

"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]

Against this patch:

I no longer see those non-daemon threads, and the JVM is able to exit after the Python process completes.

@mccheah @erikerlandson @felixcheung @holdenk we should backport this fix to branch 2.4 as well and cut a new release with this patch.

Contributor

@ifilonenko ifilonenko left a comment

Changes are needed for consistency of the kubernetes-client version across modules, and okhttp should be bumped as well.

```diff
@@ -29,7 +29,7 @@
   <name>Spark Project Kubernetes</name>
   <properties>
     <sbt.project.name>kubernetes</sbt.project.name>
-    <kubernetes.client.version>4.4.2</kubernetes.client.version>
+    <kubernetes.client.version>4.6.0</kubernetes.client.version>
```
Contributor

You need to change the kubernetes.client.version in resource-managers/kubernetes/integration-tests/pom.xml as well.

and bump:

```diff
     <dependency>
       <groupId>com.squareup.okhttp3</groupId>
       <artifactId>okhttp</artifactId>
-      <version>3.8.1</version>
+      <version>3.12.0</version>
     </dependency>
```

as I believe there is a dependency on it; see https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.6.0

Contributor Author

OK. Is there a reason why okhttp is added as an explicit dependency (kubernetes-client already brings it in)? If we really need a specific version, shouldn't we at least exclude it from kubernetes-client? I'm asking because it's very easy to end up with an unexpected version on the classpath: kubernetes-client has used okhttp 3.12 since its version 4.1.2, but we're only updating it now.

Member

It may have been to make sure that another dependency on okhttp didn't win out and override to a lower version. It can just be updated here, or you can try removing it and examining the transitive dependencies afterwards. If it's already at 3.12.0, then it can likely be omitted.
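
For what it's worth, that examination can be scripted; a sketch below, assuming it runs from the Spark source root after a full build so the inter-module snapshots resolve (the module path and Maven flags are standard, not something this PR adds):

```python
# Sketch only: ask Maven which okhttp version ends up on the kubernetes
# module's classpath, e.g. after removing the explicit dependency.
import subprocess

result = subprocess.run(
    ["./build/mvn", "-pl", "resource-managers/kubernetes/core", "-am",
     "dependency:tree", "-Dincludes=com.squareup.okhttp3:okhttp"],
    capture_output=True, text=True, check=True,
)
# Print only the tree lines that mention okhttp.
for line in result.stdout.splitlines():
    if "okhttp" in line:
        print(line)
```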

Contributor

That is a good point. @mccheah, do we have an opinion on this? I am okay with removing the explicit dependency, since kubernetes-client already brings it in. If there was a reason to state it explicitly, I can't seem to remember it; should we exclude it instead?

Member

This one looks okay to me.

Contributor

Agreed, we should remove the explicit dep to avoid a mismatch failure in the future.

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 12, 2019

@igorcalabria. The following claim does not match your PR description.

Standard kubernetes tests should validate this patch.

Please describe, for the reviewer, how to test this for the following issues. (Technically, this PR needs a test case for that. Or, at least, the manual testing process should be described in the PR description, which becomes the commit log.)

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

@SparkQA

SparkQA commented Oct 13, 2019

Test build #111979 has finished for PR 26093 at commit 7f15bac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@igorcalabria
Contributor Author

@igorcalabria. The following claim does not match your PR description.

Standard kubernetes tests should validate this patch.

Please describe, for the reviewer, how to test this for the following issues. (Technically, this PR needs a test case for that. Or, at least, the manual testing process should be described in the PR description, which becomes the commit log.)

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

@dongjoon-hyun A manual test is described above by @ifilonenko. If you want, I can add an integration test for this specific case. I skipped it because other PRs updating the kubernetes-client version didn't add any tests, and the README in kubernetes/integration-tests states that the framework is under review and subject to major changes.

this matches the version used by kubernetes-client
@SparkQA

SparkQA commented Oct 14, 2019

Test build #112032 has finished for PR 26093 at commit 5d6bcd7.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 14, 2019

If it's easy to write a test for this, OK. If it's complex or would take a long time to run, I can see just verifying the small behavior change with a manual test.

@SparkQA

SparkQA commented Oct 14, 2019

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17039/

@SparkQA

SparkQA commented Oct 14, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17039/

@mccheah
Contributor

mccheah commented Oct 14, 2019

It's pretty difficult to cover all the code paths for changes in a downstream library. @dongjoon-hyun @srowen is there a standard protocol requiring an additional test for downstream dependencies? In this case, we're picking up a fix to an internal implementation detail in the downstream library, and I'd imagine we'd trust the downstream library to have already done its due diligence to test their fix. Regression tests exist to catch new problems introduced by the dependency bumps as we go along.

I'd agree that a manual test demonstrating the desired end behavior is sufficient here.

@SparkQA

SparkQA commented Oct 14, 2019

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17042/

@SparkQA

SparkQA commented Oct 14, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17042/

@dongjoon-hyun
Member

Hi, @igorcalabria and @mccheah.
I'm okay as long as the manual test description becomes the PR description and the commit log.

Please describe, for the reviewer, how to test this for the following issues. (Technically, this PR needs a test case for that. Or, at least, the manual testing process should be described in the PR description, which becomes the commit log.)

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112035 has finished for PR 26093 at commit b31b375.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ifilonenko
Contributor

+1 on my end; I have tested this manually. An integration test seems like overkill, although a simple jstack check in the test would be sufficient.
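
A minimal sketch of what such a jstack check could look like (illustration only: `jvm_pid` is a hypothetical parameter, and how the test harness obtains the driver JVM's pid is left open):

```python
# Sketch only: after the driver's Python process exits, dump the JVM's
# threads with jstack and fail if a leaked "OkHttp WebSocket" thread is
# still alive. Assumes jstack is on PATH and the JVM is still running.
import subprocess

def assert_no_okhttp_websocket_threads(jvm_pid: int) -> None:
    dump = subprocess.run(
        ["jstack", str(jvm_pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    assert "OkHttp WebSocket" not in dump, "leaked non-daemon OkHttp threads"
```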

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, All.

Upgrading versions is okay, but the problem with this PR is that it claims too much without concrete background. The PR doesn't provide any way to verify the claims in its description. Since the PR description will become a permanent commit log, we should not merge this PR as-is.

I'd recommend removing all the claims from the PR description and narrowing the focus to the bug fixes inside the client library itself. Then this PR can be considered a proper proposal.

@erikerlandson
Contributor

@mccheah @erikerlandson @felixcheung @holdenk we should backport this fix to branch 2.4 as well and cut a new release with this patch.

Will there be a 2.4.5?

@ifilonenko
Contributor

@erikerlandson I think there should be. This bug will cause pyspark resources to hang unless some external process does resource reaping.

@dongjoon-hyun I agree, the PR description should change

@srowen srowen changed the title [SPARK-27812][K8s] Bump client version [SPARK-27812][K8S] Bump K8S client version to 4.6.0 Oct 15, 2019
@srowen
Member

srowen commented Oct 15, 2019

There will be a 2.4.5 eventually, I'm sure. 2.4.x is a sort of "LTS" branch.

@dongjoon-hyun
Member

Of course, @erikerlandson. As @srowen said, it's the LTS branch.

Since 2.4.0 was released in November 2018, branch-2.4 should be maintained until at least May 2020 (18 months). After that, as an LTS branch, I believe branch-2.4 will be maintained for a while.

Since 2.4.4 was released in September 2019, 2.4.5 should arrive in early 2020, and so on. I hope for more regular releases, but the 2.4.x release schedule depends heavily on the patches landing on branch-2.4 and on a volunteer release manager.

@igorcalabria
Contributor Author

@dongjoon-hyun I have updated the description to include a manual test. I believe the fix claims for both JIRA issues are now supported by the manual test case provided in the description. Please let me know if you need anything else.

Cheers,

@dongjoon-hyun
Member

Thank you for updating, @igorcalabria! I'll follow the verification steps and try to land this.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Unfortunately, during review, I found that this version is already outdated. Could you update this PR to the latest version, @igorcalabria?

@dongjoon-hyun
Member

Also, cc @jiangxb1987 since he is a release manager for 3.0.0-preview.

@holdenk
Contributor

holdenk commented Oct 17, 2019

I'm +1 on backporting; even though it's a version change, I think we need this in 2.4.x.

Contributor

@holdenk holdenk left a comment

LGTM pending removal of the explicit dep & Jenkins.
Thanks for doing the manual verification @ifilonenko :)

@holdenk
Contributor

holdenk commented Oct 17, 2019

Actually, if we're going to update to 4.6.1 (and it looks like we might as well), @ifilonenko would you have the cycles to manually verify with that version as well?

kubernetes-client already brings it and we don't need a specific version
@igorcalabria
Contributor Author

@dongjoon-hyun bumped the version to 4.6.1. Following other reviewers' suggestions, I also removed the explicit okhttp dep.

@srowen srowen changed the title [SPARK-27812][K8S] Bump K8S client version to 4.6.0 [SPARK-27812][K8S] Bump K8S client version to 4.6.1 Oct 17, 2019
Member

@srowen srowen left a comment

Looks OK pending tests

@SparkQA

SparkQA commented Oct 17, 2019

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17209/

@SparkQA

SparkQA commented Oct 17, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17209/

@SparkQA

SparkQA commented Oct 17, 2019

Test build #112221 has finished for PR 26093 at commit a5eb702.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Is this tested with Python 3 on the master branch?

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to master.

@dongjoon-hyun
Member

Thank you so much everyone.
Especially, welcome, @igorcalabria. I added you to the Apache Spark contributor group.
Could you make a backporting PR against branch-2.4, @igorcalabria?

dongjoon-hyun pushed a commit that referenced this pull request Oct 18, 2019
### What changes were proposed in this pull request?

Backport of #26093 to `branch-2.4`

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after main runs. This is the case for spark versions <= 2.4.0. On version 2.4.3, the JVM process hangs because there is a non-daemon thread running:

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from spark <= 2.4.0 is restored and both processes terminate successfully

Closes #26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>