Updating to spark 3.0.1 and hadoop 3.2.1 #141
Conversation
force-pushed from 6aead20 to 5a31e40
force-pushed from 5a31e40 to 3e3a2b7
I would really like to; that depends mostly on how soon AWS EMR and the other cloud providers add support for Spark 3. As far as this particular issue goes, I will be updating all our Spark 3 related pull requests to use the 3.0 release version this week. I expect to run into other runtime issues and will investigate this along with everything else I find.
@heuermh Have you gotten a chance to take a look at this at all?
Thanks for the ping! Yeah, we have released ADAM and downstream cross-building with Scala 2.12 and Spark 3. For Disq, going forward I would be fine with only releasing against Spark 3. I have not had a chance to investigate this issue specifically.
@lbergelson One comment, otherwise looks good to me
            throws IOException {
        final FileSystem fileSystem = p.getFileSystem(conf);
        if (fileSystem instanceof LocalFileSystem) {
            return ((LocalFileSystem) fileSystem).getRawFileSystem();
Add a comment explaining this special casing of LocalFileSystem
(or, if we can't explain it, at least provide a comment with a reference for where the fix came from).
From the Travis failure log:
Whoops! I always forget to run the linter locally.
I'm going to merge this. If we ever come to understand it better, we should revisit it.
Thank you, @lbergelson!
Fixes #130, fixes #142
@heuermh Can you weigh in on this? I needed to make a weird change to use the RawLocalFileSystem in order to avoid a checksum issue, but I'm not sure why we're getting the checksum failure in the first place. I suspect the checksum isn't being recomputed correctly after some operation, but I don't know why.
If I don't force it to use the raw filesystem, we get:
There may be other mechanisms to avoid this check. A better solution would be to make the check pass, but I'm not sure why it's failing in the first place.
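The failure mode being worked around can be sketched with plain JDK classes. Hadoop's ChecksumFileSystem stores checksums in a hidden .crc sidecar file next to the data and verifies them on read; if the data file is rewritten without regenerating the sidecar, reads fail with a ChecksumException, while the raw filesystem skips verification entirely. The class below is illustrative only (Hadoop actually checksums fixed-size chunks, CRC32C by default, rather than the whole file):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Stand-in for the checksum Hadoop writes to the .crc sidecar file.
    public static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "record-v1".getBytes(StandardCharsets.UTF_8);
        // Sidecar checksum recorded when the file is first written.
        long storedCrc = checksum(original);

        // The data file is later rewritten (e.g. parts merged), but the
        // sidecar checksum is not regenerated.
        byte[] rewritten = "record-v1record-v2".getBytes(StandardCharsets.UTF_8);

        // Verification against the stale sidecar now fails; in Hadoop this
        // surfaces as a ChecksumException on read. Reading through the raw
        // filesystem bypasses this check, which is what the workaround does.
        boolean verifies = checksum(rewritten) == storedCrc;
        System.out.println(verifies); // prints "false"
    }
}
```

As an aside, one other mechanism that might avoid the check is Hadoop's FileSystem#setVerifyChecksum(false), though that disables verification for all reads through that FileSystem instance rather than just this path.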
Would you be in favor of dropping support for Spark 2 and Scala 2.11? I'm for it because it makes my life easier, but I'm not sure what versions you need to support.
This would close #130
@tomwhite If you happen to have any insight into the checksum issue, it would be valuable. I believe we had a similar problem back in the Hadoop-BAM days, but it went away in Disq.