
[SPARK-1690] Tolerating empty elements when saving Python RDD to text files #644

Closed
wants to merge 2 commits

Conversation

kanzhang (Contributor) commented May 5, 2014

Tolerate empty strings in PythonRDD

AmplabJenkins commented

Can one of the admins verify this patch?

kanzhang (Contributor, Author) commented May 5, 2014

Manually verified the patch on files with empty lines at the beginning, middle, or end. Also tested an empty file and a file containing only empty lines.

mateiz (Contributor) commented May 5, 2014

Can you add a test case for this? What file was it breaking on?

kanzhang (Contributor, Author) commented May 5, 2014

Any text file with empty lines in it will break it, as Glenn reported in the JIRA: a file consisting of ["foo", "", "bar"]. Unfortunately, I don't see an easy way to separate out the stdoutIterator logic and test it, since it references readerException in the enclosing scope. Ideas welcome.
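
For context, a minimal reproduction of the failure described above might look like the following sketch (the master string, app name, and output path are illustrative, not taken from the PR):

```python
# Hypothetical reproduction of SPARK-1690 (names and path are
# illustrative). Before this patch, saving an RDD that contains an
# empty string back out as text files failed on the JVM side.
from pyspark import SparkContext

sc = SparkContext("local", "spark-1690-repro")
sc.parallelize(["foo", "", "bar"]).saveAsTextFile("/tmp/spark-1690-repro")
sc.stop()
```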

kanzhang (Contributor, Author) commented May 8, 2014

@mateiz just realized I could test it from the Python side. Added a doctest. This makes the Python API behave identically to the Scala API.
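
A sketch of what such a doctest could look like (an approximation, not a verbatim copy of the one added in this PR; it assumes `sc` is already in scope, as in PySpark's doctest harness):

```python
>>> from tempfile import NamedTemporaryFile
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()  # we only need a fresh path for the output directory
>>> sc.parallelize(["", "foo", "", "bar", ""]).saveAsTextFile(tempFile.name)
>>> from fileinput import input
>>> from glob import glob
>>> "".join(sorted(input(glob(tempFile.name + "/part-0000*"))))
'\n\n\nbar\nfoo\n'
```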

kanzhang changed the title from "[SPARK-1690] Allow empty lines in PythonRDD" to "[SPARK-1690] Tolerating empty elements when saving Python RDD to text files" on May 8, 2014
pwendell (Contributor) commented

Jenkins, test this please.

AmplabJenkins commented

Merged build triggered.

AmplabJenkins commented

Merged build started.

```diff
@@ -94,6 +94,7 @@ private[spark] class PythonRDD[T: ClassTag](
         val obj = new Array[Byte](length)
         stream.readFully(obj)
         obj
+      case 0 => Array.empty[Byte]
```
A contributor commented on this diff:

Looks good, though you could probably just change the `if length > 0` to `if length >= 0` above.

AmplabJenkins commented

Merged build finished. All automated tests passed.

AmplabJenkins commented

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14871/

pwendell (Contributor) commented

Okay I'll pull this in. Thanks!

asfgit pushed a commit that referenced this pull request on May 10, 2014

[SPARK-1690] Tolerating empty elements when saving Python RDD to text files

Tolerate empty strings in PythonRDD

Author: Kan Zhang <kzhang@apache.org>

Closes #644 from kanzhang/SPARK-1690 and squashes the following commits:

c62ad33 [Kan Zhang] Adding Python doctest
473ec4b [Kan Zhang] [SPARK-1690] Tolerating empty elements when saving Python RDD to text files
(cherry picked from commit 6c2691d)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
asfgit closed this in 6c2691d on May 10, 2014
kanzhang deleted the SPARK-1690 branch on May 10, 2014 at 23:37
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request on Jun 25, 2014

[SPARK-1690] Tolerating empty elements when saving Python RDD to text files

Tolerate empty strings in PythonRDD

Author: Kan Zhang <kzhang@apache.org>

Closes apache#644 from kanzhang/SPARK-1690 and squashes the following commits:

c62ad33 [Kan Zhang] Adding Python doctest
473ec4b [Kan Zhang] [SPARK-1690] Tolerating empty elements when saving Python RDD to text files
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
This surrounds the complete worker code in a try/except block so we catch any error that arrives. An example would be the depickling failing for some reason.

@JoshRosen

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes apache#644 from bouk/catch-depickling-errors and squashes the following commits:

f0f67cc [Bouke van der Bijl] Lol indentation
0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
(cherry picked from commit 12738c1)

Signed-off-by: Josh Rosen <joshrosen@apache.org>
rshkv pushed a commit to rshkv/spark that referenced this pull request Feb 27, 2020
Do not print the conda environment, as Spark doesn't have arg logging and the env can therefore contain unsafe information.
Agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
…che.kafka.clients.producer.ProducerConfig.getBoolean(Ljava/lang/String;)Ljava/lang/Boolean; (apache#644)
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
…che.kafka.clients.producer.ProducerConfig.getBoolean(Ljava/lang/String;)Ljava/lang/Boolean; (apache#644)