
Simplify spark_eval scripts and improve documentation. #3580

Merged: 1 commit merged into master from tw_spark_end_to_end, Sep 27, 2017

Conversation

tomwhite (Contributor):

Should be straightforward, as the changes are very self-contained.

@codecov-io commented Sep 14, 2017:

Codecov Report

Merging #3580 into master will increase coverage by 0.005%.
The diff coverage is n/a.

```
@@              Coverage Diff               @@
##              master    #3580       +/-   ##
==============================================
+ Coverage     79.715%   79.72%   +0.005%
- Complexity     18188    18190        +2
==============================================
  Files           1223     1223
  Lines          66735    66735
  Branches       10426    10426
==============================================
+ Hits           53198    53201        +3
+ Misses          9320     9319        -1
+ Partials        4217     4215        -2
```
| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...e/hellbender/engine/spark/SparkContextFactory.java | 71.233% <0%> (-2.74%) | 11% <0%> (ø) |
| ...er/tools/spark/sv/discovery/AlignmentInterval.java | 89.352% <0%> (+0.926%) | 53% <0%> (+2%) ⬆️ |
| ...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 78.571% <0%> (+1.948%) | 39% <0%> (ø) ⬇️ |

```bash
fi

# Create cluster
gcloud dataproc clusters create "$GCS_CLUSTER" \
```
Member:

@tomwhite Let's make these auto-deleting clusters now that that's an option. You can do it by switching to `gcloud beta dataproc clusters create ...` and adding the new `--max-idle` and `--max-age` parameters.
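For reference, a minimal sketch of what that might look like; the duration values are illustrative assumptions, not taken from this PR:

```bash
# Sketch: auto-deleting Dataproc cluster (duration values are illustrative).
# --max-idle deletes the cluster after it has been idle for the given time;
# --max-age deletes it unconditionally once it reaches the given lifetime.
gcloud beta dataproc clusters create "$GCS_CLUSTER" \
    --max-idle 30m \
    --max-age 3h
```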

tomwhite (Author):

Done


```bash
# Copy small data to HDFS on GCS

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
```
Member:

Neat, I didn't know about this bashism. Alternatively, you can add `gatk-launch` to your path.
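For anyone else unfamiliar with it, a minimal illustration of the `${VAR:-default}` expansion used above:

```bash
# ${GATK_HOME:-../..} expands to $GATK_HOME when it is set and non-empty,
# and falls back to the literal ../.. otherwise.
GATK_HOME=/opt/gatk
echo "${GATK_HOME:-../..}/gatk-launch"   # prints /opt/gatk/gatk-launch
unset GATK_HOME
echo "${GATK_HOME:-../..}/gatk-launch"   # prints ../../gatk-launch
```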

```bash
hadoop fs -mkdir -p $TARGET_DIR

# Download exome BAM (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/)
gsutil cp gs://broad-spark-eval-test-data/data/NA12878.ga2.exome.maq.raw.bam - | hadoop fs -put - $TARGET_DIR/NA12878.ga2.exome.maq.raw.bam
```
Member:

Should we use the `-m` flag for parallel download? That seems to speed things up for me.

Member:

Oh, never mind, I didn't read far enough to see that it was being streamed into Hadoop.
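To make the distinction concrete, a brief sketch; the `-m` variant is hypothetical here, since the script never writes a local copy:

```bash
# gsutil -m parallelizes transfers, which helps when downloading many
# objects to local disk (hypothetical variant, not what the script does):
#   gsutil -m cp gs://broad-spark-eval-test-data/data/*.bam /local/dir/
#
# The script instead streams a single object to stdout and pipes it directly
# into HDFS, so no local copy is made and -m would not help:
gsutil cp gs://broad-spark-eval-test-data/data/NA12878.ga2.exome.maq.raw.bam - \
    | hadoop fs -put - "$TARGET_DIR/NA12878.ga2.exome.maq.raw.bam"
```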

```

For the whole genome data, run:

```bash
./prep_data_genome_gcs.sh
./genome_copy_hdfs.sh
```
Member:

These names don't seem to match the names of the copy scripts; I see `copy_genome_to_hdfs.sh`.

tomwhite (Author):

Fixed

```
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy exome data to HDFS on GCS
```
Member:

When do you want to use this and when do you want to use `copy_exome_to_hdfs.sh`?

tomwhite (Author):

Added some documentation to explain the difference.


```bash
# Copy exome data to HDFS on GCS

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
```
Member:

Should this be explicitly setting the number of executors?

tomwhite (Author):

I don't think it needs to; at least it works fine without.
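If pinning the executor count ever did prove necessary, something along these lines could work; the argument names and paths below are assumptions based on gatk-launch conventions of the time, not taken from this PR:

```bash
# Hypothetical: explicitly size the Spark executors (all values illustrative).
# Arguments after the lone "--" go to the Spark runner rather than the tool.
${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
    --inputGCSPath gs://example-bucket/exome-data/ \
    --outputHDFSDirectory hdfs:///user/$USER/exome_spark_eval \
    -- \
    --sparkRunner GCS --cluster "$GCS_CLUSTER" \
    --num-executors 4 --executor-cores 4 --executor-memory 8G
```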

lbergelson (Member) left a comment:

@tomwhite A few comments and questions. This looks great. Setting up data and getting all the misc arguments wrangled the right way is always a pain, so having these standardized will make testing so much easier.

@tomwhite merged commit 847048d into master on Sep 27, 2017.
@tomwhite deleted the tw_spark_end_to_end branch on Sep 27, 2017 at 09:02.
@lbergelson (Member) commented Sep 27, 2017:

@tomwhite Thank you!
