Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-916] New strategy for writing header. #917

Merged
merged 1 commit into from
Jan 14, 2016

Conversation

fnothaft
Copy link
Member

Resolves #916. Not fully done yet; seems to work for BAM but not SAM. The SAM file header doesn't write correctly... Still debugging that. Needs more testing.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1047/

Build result: FAILURE

GitHub pull request #917 of commit 2c38bbd automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 8ff91b477e29e28d277a812a87673b43fbbff9a1 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 8ff91b477e29e28d277a812a87673b43fbbff9a1 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8ff91b477e29e28d277a812a87673b43fbbff9a1First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

This is good to go now. Ping for review and merge.

@fnothaft fnothaft added this to the 0.19.0 milestone Jan 12, 2016
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1048/

Build result: FAILURE

[...truncated 24 lines...]Triggering ADAM-prb ? 2.3.0,2.11,1.2.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.2.1,centosTriggering ADAM-prb ? 1.0.4,2.11,1.2.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.11,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.10,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.2.1,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

OK, it seems like on Spark 1.2 we got a merging order that is inconsistent with the ordering we're expecting. I was worried about that... Let me look into this more and see if there is a way we can ensure a consistent merging order.

@fnothaft
Copy link
Member Author

Rebased and somewhat fixed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1050/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Copy link
Member

heuermh commented Jan 13, 2016

Single unit test failure

Couldn't find any files matching /tmp/bqsr1442238865989322732/bqsr1.bam

@fnothaft
Copy link
Member Author

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1051/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

Jenkins, test this please.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1052/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

Jenkins, test this please.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1053/

Build result: FAILURE

GitHub pull request #917 of commit b8ec48e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0a51cd7242eb066b795ff1d414b695e3eff0c6b6 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0a51cd7242eb066b795ff1d414b695e3eff0c6b6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

Pushed a fix; tests should pass on this run. From my side, this PR is good to go (assuming tests pass on @AmplabJenkins).

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1054/

Build result: FAILURE

[...truncated 32 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.11,1.3.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.0,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.10,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.2.1,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.11,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

Good news: the concat method doesn't exist in org.apache.hadoop.fs.FileSystem in the Hadoop 1.x stream. I've added code that calls concat via reflection. I'm testing this on the cluster now.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1055/

Build result: FAILURE

[...truncated 32 lines...]Triggering ADAM-prb ? 2.3.0,2.10,1.3.1,centosTriggering ADAM-prb ? 2.3.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.3.0,2.10,1.5.2,centosTriggering ADAM-prb ? 2.3.0,2.11,1.3.1,centosADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.6.0,centos completed with result FAILUREADAM-prb ? 1.0.4,2.10,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.2.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.2.1,centos completed with result SUCCESSADAM-prb ? 1.0.4,2.11,1.2.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the write-separate-header branch 3 times, most recently from eb165a7 to e9cedc1 Compare January 14, 2016 01:21
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1057/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1058/

Build result: FAILURE

GitHub pull request #917 of commit e9cedc1 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 5dade159e8469006b153c4b151caacb23d6f6952 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 5dade159e8469006b153c4b151caacb23d6f6952 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 5dade159e8469006b153c4b151caacb23d6f6952First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

This all works, except for the concat operation, which fails on the cluster because the header doesn't take up an exact, full HDFS block. Anyways, we fall back on the single threaded writing code, which is not so hot from a performance perspective, but works correctly.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1059/

Build result: FAILURE

GitHub pull request #917 of commit 386f759 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains 0c8180369152d9447e08c7c89edf1068950c3a00 # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision 0c8180369152d9447e08c7c89edf1068950c3a00 (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 0c8180369152d9447e08c7c89edf1068950c3a00 > git rev-list 5dade159e8469006b153c4b151caacb23d6f6952 # timeout=10Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

Jenkins, retest this please.

@fnothaft
Copy link
Member Author

Also, even when we fall back on the single threaded copy code, it takes ~1sec per file to merge the files. E.g., if we have ~2000 partitions, then it takes about 2000 seconds to merge the whole file together. This is unfortunate, but is much better than the ~30hrs latency @beaunorgeot had been seeing when doing something like samtools cat on a single machine.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1060/
Test PASSed.

@ryan-williams
Copy link
Member

Thanks for tackling this @fnothaft; I'm missing state on this, can you answer a few high-level questions:

  • do you know how the missing header issue is currently happening? Last I looked at this (many months ago), we had what we thought was a high-confidence fix, and yet I still saw it happen once (but not reproducibly).
  • in this fix, you're writing out the header to a path that is specified in the hadoop job config, and having each partition read the header in from that path?
  • it seems like all the existing {add,clear}Header code is still present, but has been moved around; how does that interact with the new writing/reading-header-file stuff?

@fnothaft
Copy link
Member Author

OK, this succeeded on the cluster. Save time is slower than I'd like (~40 min to cat NA12878 together) but it worked. I'm running another test now that sorts then saves, then I'll pull the file out of HDFS and run samtools index locally. If that succeeds, then we are good to go. I'll update when that's done; I'm guessing it'll probably finish sometime late tonight.

@fnothaft
Copy link
Member Author

samtools index worked on the cluster, so we are both writing a concatenated file and keeping the sort order. Yippee!

@ryan-williams

do you know how the missing header issue is currently happening? Last I looked at this (many months ago), we had what we thought was a high-confidence fix, and yet I still saw it happen once (but not reproducibly).

God, I still have no idea what's causing it. I agree—I thought our fix was high confidence, but the log files are pretty unambiguous that we're getting a partition that tries to write to disk but the header is not set.

in this fix, you're writing out the header to a path that is specified in the hadoop job config, and having each partition read the header in from that path?

Exactly. It turns out that this is how Hadoop-BAM's CLI uses the Hadoop-BAM output formats. This isn't obvious though... I got the idea for this just by reading the Hadoop-BAM CLI source code.

it seems like all the existing {add,clear}Header code is still present, but has been moved around; how does that interact with the new writing/reading-header-file stuff?

I only touched the headerless output formats, so they're still vulnerable to the transient header disappearance issue. I'll need to propagate the fix to all output formats later. We'll probably have to do the same thing for VCF as well.

@fnothaft
Copy link
Member Author

Anyways, this is good to merge now.

@heuermh
Copy link
Member

heuermh commented Jan 14, 2016

@ryan-williams can you review?

I'll take a look and run a few extra tests this morning. We would like to get this merged as soon as possible.


// concat files together
// we need to do this through reflection because the concat method is
// NOT in hadoop 1.x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after this is merged, we should create an issue to revisit it once we drop Hadoop 1.x support

@ryan-williams
Copy link
Member

Unfortunately I don't have much time to get in to the weeds on this one, but my main thought is:

if we don't know why the previous fix didn't fix the issue, even though we all still believe it should have from everything we know about the code, then I don't like complicating matters in this way.

Ultimately, I'll defer to y'all since it sounds like the goal here is more to unblock specific user(s) than to have a fix that we understand, but obviously that doesn't seem like a great state of affairs longer-term, especially if this introduces a slow-down to boot.

Can we deterministically repro the crash? I agree that something is still wrong, @danvk (on #721) and I (undocumented?) have both seen the bug since the last "fix", IIRC.

@fnothaft
Copy link
Member Author

The perf is only bad when saving BAM as a single file, and "bad" is relative to the FileSystem 'concat' method. This is a correctness fix (block ordering + header attach) relative to the old 'FileUtil.copyMerge' approach. I agree that we need to harmonize the single file save/many file save approaches, but that's for later.

Resolves bigdatagenomics#916. Makes several modifications that should eliminate the header
attach issue when writing back to SAM/BAM:

* Writes the SAM/BAM header as a single file.
* Instead of trying to attach the SAM/BAM header to the output format via a
  singleton object, we pass the path to the SAM/BAM header file via the Hadoop
  configuration.
* The output format reads the header from HDFS when creating the record writer.
* At the end, once we've written the full RDD and the header file, we merge all
  via Hadoop's FsUtil.
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1061/

Build result: FAILURE

GitHub pull request #917 of commit 7d4b409 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/917/merge^{commit} # timeout=10 > git branch -a --contains b12af513d5cb95c54fe156912eef38bc2a37e23c # timeout=10 > git rev-parse remotes/origin/pr/917/merge^{commit} # timeout=10Checking out Revision b12af513d5cb95c54fe156912eef38bc2a37e23c (origin/pr/917/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f b12af513d5cb95c54fe156912eef38bc2a37e23c > git rev-list 0c8180369152d9447e08c7c89edf1068950c3a00 # timeout=10Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

This is good to go from my side. @heuermh were you able to complete your testing?

Just to follow up more on @ryan-williams' comments from earlier (sorry, I was on the run so I kept my comments brief), this fix only impacts the single file save code, and does not lead to a performance decrease over the current single file save code. This fix does three things:

  1. Writes the file header to a file in HDFS (or whichever underlying file system you are using; I'll use HDFS as a placeholder here for convenience).
  2. Adds code to the headerless output formats that ensures that they read the header from HDFS. This code will never fail unless writing the header to HDFS failed in the first place, in which case you've got bigger problems and I can't help you ;)
  3. Modifies the file merging code. This introduces a fast path to merging using the concat method, and a second path to merging that manually merges the files together. The second path (manual merge) is necessary anyways, because the prior method using FileUtil.copyMerge did not guarantee the ordering of files that were being merged, which meant that we'd occasionally get files where the header was at the end, etc.

The manual merge code path is much slower than the concat method, but the concat method has harsh, undocumented preconditions (specifically, all files being merged—except for the last one—must take up a full block in HDFS) and seems to throw a NotImplementedError on file systems that aren't HDFS, which is dandy. The manual merge code path is not slower than the FileUtil.copyMerge approach we were using previously.

As for the million dollar question: why does the header detach issue keep on happening, despite our best efforts to fix it? Frankly, I'm not entirely sure why it's failing, but the root cause is something along the lines of "when we set the header in the singleton object that accompanies the output format, it doesn't get consistently set across all of the JVMs where an instance of the actual output format will be created and executed." The new approach uses the Hadoop job configuration to pass the path to the header file. This conf is passed to the OutputFormat by Spark itself, so we can guarantee that the configuration seen by the OutputFormat will be consistent across the cluster. Since the header is written to the underlying file system (which again, if your underlying file system is not accessible across the whole cluster, then you're running some horrible frankencluster and I can't help you), we can guarantee that reading the header file from the path passed in the config will succeed.

I think a TL;DR for why this failed is that the obvious way of attaching the header (I call it obvious because the method is named setHeader) to Hadoop-BAM's OutputFormats is not guaranteed to work consistently in a distributed setting. I think that the Hadoop-BAM team knows this, because the Hadoop-BAM CLI wraps their own output formats to do this exact "write header to HDFS, set path in config, read header from config path when creating record reader" pattern.

@fnothaft
Copy link
Member Author

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1062/
Test PASSed.

heuermh added a commit that referenced this pull request Jan 14, 2016
[ADAM-916] New strategy for writing header.
@heuermh heuermh merged commit 49b614d into bigdatagenomics:master Jan 14, 2016
@heuermh
Copy link
Member

heuermh commented Jan 14, 2016

Tested ok for me, merged from the ski lift.

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants