[SPARK-26019][PYSPARK] Allow insecure py4j gateways #23337
This requires the config to be set in the `SparkConf` object, instead of allowing it to be in a config file, right? I don't think the config file has been read at this point in the code. The issue I see with that is that it would require code changes to make this work, instead of being simply something to add to a config file in apps that currently pass in insecure gateways.
(But then, given that the config file is probably read by the JVM, you'd have a chicken-and-egg problem, so maybe this is ok...)
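For context, a minimal sketch of what the code-change route would mean for an end user (the conf key here is hypothetical, purely for illustration): the setting has to go on the `SparkConf` object in code, since a config file would not have been read yet.

```python
from pyspark import SparkConf, SparkContext

# Hypothetical conf key, for illustration only: it must be set
# programmatically on the SparkConf object, because spark-defaults.conf
# has not been read at the point where the Python-side check would run.
conf = SparkConf().set("spark.python.allowInsecurePy4jGateway", "true")
sc = SparkContext(conf=conf)
```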
Would it be better if we just controlled this via an environment variable instead? The gateway will already be on the driver side, so we could simply set it.
hmm, good point about not reading a file. I can switch to an environment variable -- is that really better for a use case like zeppelin? Neither solution seems ideal, but I would really like to keep some check here that forces users to opt in.
I think either way is possible, but I assume it's easier to handle an environment variable instead? No strong opinion on this ..
I thought about this more, and I realized that if we're requiring a code change from the end user, it sort of defeats the entire purpose of this change (e.g., users need this because they can't upgrade zeppelin). One thing I could do is delay this check until we get the full spark conf back from java:
(spark/python/pyspark/context.py, lines 169 to 171 in 5ed89ce)
then the check could be based on the real conf, which includes anything set in a file.
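Something like the following could then run once the JVM-side conf is available (a hedged sketch, not the referenced lines themselves; the helper name and conf key are hypothetical):

```python
def check_insecure_gateway_allowed(conf, gateway_secret):
    """Hypothetical check, run only after the JVM context exists.

    `conf` is the pyspark SparkConf rebuilt from the JVM-side conf, so it
    includes anything the JVM loaded from spark-defaults.conf;
    `gateway_secret` is None when the py4j gateway is insecure.
    """
    allowed = conf.get("spark.python.allowInsecurePy4jGateway", "false")
    if gateway_secret is None and allowed.lower() != "true":
        raise RuntimeError(
            "An insecure py4j gateway was passed in, but insecure gateways "
            "have not been explicitly allowed.")
```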
I just happen to know a bit from some related work lately .. not an expert, but both are indeed feasible, and IMHO env is easier and cleaner.
cc @zjffdu and maybe .. @Leemoonsoo as well .. for sure.
The argument here is basically which one is better on the Zeppelin PySpark interpreter side: the `SparkConf` object (instead of allowing it to be in a config file), or an environment variable. This conf or env might have to be set on the Zeppelin side to allow an insecure connection, if I am not mistaken. We're not sure which one is preferred and easier in Zeppelin. To me, env looks preferable.
I forgot to cc someone who's a committer on both projects :-) @felixcheung
scratch that.
Zeppelin is setting `SparkConf` before creating the SparkContext, here: https://github.com/apache/zeppelin/blob/master/spark/interpreter/src/main/java/org/apache/zeppelin/spark/NewSparkInterpreter.java#L97 so I don't think it matters.
either is fine, and perhaps `SparkConf` is more consistent with the other configs.
wait, let's be clear on the end goal here. The point is for an end user to be able to enable the opt-in without changing the code in zeppelin at all. If we're going to change the code in zeppelin, it should just be changed to do the right thing and create a secure gateway (as zeppelin already has in master, and I think even in v0.8.0, now that I look more closely).
so, looking at an old version of zeppelin, e.g.:
https://github.com/apache/zeppelin/blob/v0.7.3/spark/src/main/java/org/apache/zeppelin/spark/PySparkInterpreter.java#L205-L229
If I'm reading that correctly, it looks like environment variables the user has set when starting zeppelin will get passed through to the python command (I think that is what `EnvironmentUtils.getProcEnvironment()` does). But there isn't any way for the user to add additional confs to that command. I assume things other than zeppelin would work similarly.
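In rough Python terms, here is what that Java code appears to do (a sketch under my reading of it; `MY_OPT_IN_VAR` is a stand-in name):

```python
import os
import subprocess

# Rough analogue of EnvironmentUtils.getProcEnvironment(): the spawned
# python process inherits the parent's full environment, so any variable
# set before starting zeppelin reaches pyspark...
env = dict(os.environ)
# ...but the command line itself is fixed, so there is no hook for the
# user to append extra --conf arguments.
cmd = ["python", "-c", "import os; print(os.environ.get('MY_OPT_IN_VAR'))"]
subprocess.run(cmd, env=env, check=True)
```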
I updated this to require the env variable `PYSPARK_ALLOW_INSECURE_GATEWAY` to be set to 1 / true.
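The shape of that check on the Python side would be roughly as follows (a sketch, not the exact patch; the function name is mine):

```python
import os

def validate_gateway_security(gateway_secret):
    """Refuse a tokenless py4j gateway unless the user opted in."""
    # Opt-in flag added by this PR; anything other than 1/true refuses
    # a gateway that has no auth token.
    allow_insecure = os.environ.get(
        "PYSPARK_ALLOW_INSECURE_GATEWAY", "0").lower() in ("1", "true")
    if gateway_secret is None and not allow_insecure:
        raise RuntimeError(
            "A py4j gateway without an auth token was provided; set "
            "PYSPARK_ALLOW_INSECURE_GATEWAY=1 to allow it (this is insecure).")
```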
I assume we are intentionally not documenting this config, as we only expect zeppelin to use it?
yeah that was my thinking. I could go either way -- documenting it would give a spot to warn about the security hole it opens up.
+1. Honestly, I still think the insecure mode is a misuse of Spark and it should be removed. I'm going to merge this as an effort to make upgrading Spark easier in other projects like Zeppelin.
this part of the test really bothers me, so I'd like to explain it to reviewers. Without this, the test passes -- but it passes even without the changes to the main code! Or rather, it only passes when it's run as part of the entire suite; it would fail when run individually.
What's happening is that `SparkContext._gateway` and `SparkContext._jvm` don't get reset by most tests (e.g., they are not reset in `sc.stop()`), so a test running before this one will set those variables, and then this test will end up holding on to a gateway which does have the `auth_token` set, and so the accumulator server would still work.
Now that in itself sounds crazy to me, and seems like a problem for things like Zeppelin. I tried just adding these two lines into `sc.stop()`, but then when I ran all the tests, I got a lot of `java.io.IOException: error=23, Too many open files in system`. So maybe something else is not getting cleaned up properly in the pyspark tests? I was hoping somebody else might have some ideas about what is going on, or whether there is a better way to do this.
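For reference, "these two lines" amount to forcing a fresh gateway at the start of the test, something like:

```python
from pyspark import SparkContext

# Clear the class-level references cached by earlier tests; without this,
# the test silently reuses a gateway (with its auth_token) left over from
# a previous SparkContext, and passes even without the main-code changes.
SparkContext._gateway = None
SparkContext._jvm = None
```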
I have usually made separate classes in cases like this, where it's closely related to stop and start tho .. (for instance, https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/test_session.py#L27-L73 and https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/test_dataframe.py#L680-L682)
I don't think that's really answering my question. I don't have a problem calling start & stop; I'm wondering why `SparkContext._gateway` and `SparkContext._jvm` don't get reset in `sc.stop()`. This means that if you have multiple spark contexts in one python session (as we do in our tests), they all reuse the same gateway:
(spark/python/pyspark/context.py, lines 300 to 302 in 4d693ac)
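If memory serves, the referenced lines are the class-level caching in `SparkContext._ensure_initialized`, roughly:

```python
# Inside SparkContext._ensure_initialized: the gateway and jvm handles are
# stored on the class, so every later context in the same python session
# reuses them.
if not SparkContext._gateway:
    SparkContext._gateway = gateway or launch_gateway(conf)
    SparkContext._jvm = SparkContext._gateway.jvm
```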
For normal use of spark, that's not a problem; but it seems like it would be a problem (a) in our tests and (b) for systems like zeppelin, that might have multiple spark contexts over the lifetime of the python session (I assume, anyway ...)
Ah, gotcha. In that case, we could consider simply moving the test to the top in a separate class as well, but .. yes, I suspect tests depending on their order isn't a great idea either. I'm okay as long as the tests pass. I can take a separate look at this later.
Yeah, this logic is sort of rough. In spark-testing-base, for example, in between tests where folks do not intend to reuse the same Spark context, we also clear some extra properties (although we do reuse the gateway). I think environments where folks want multiple SparkContexts from Python on the same machine end up using multiple Python processes anyway.
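As an illustration of that pattern (hedged; this is in the spirit of spark-testing-base rather than a quote of it, and the property names are examples):

```python
from pyspark import SparkContext

def reset_between_tests(sc):
    """Stop a context between tests while reusing the shared gateway."""
    jvm = SparkContext._jvm  # keep a handle before stopping
    sc.stop()
    # Clear leftover JVM system properties that could leak into the next
    # test's context; SparkContext._gateway/_jvm are deliberately left
    # in place, which is why one gateway gets shared across contexts.
    for prop in ("spark.driver.port", "spark.hostPort"):
        jvm.System.clearProperty(prop)
```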
ok, it's good to get some confirmation of this weird behavior ... but I feel like I still don't understand why we don't reset `SparkContext._gateway` and `SparkContext._jvm` in `sc.stop()`, and why, when I tried to make that change, I hit all those errors. If nothing else, any chance this is related to the general flakiness in pyspark tests?