Simplify spark_eval scripts and improve documentation. #3580
Conversation
Codecov Report
```
@@             Coverage Diff              @@
##              master     #3580     +/-  ##
==============================================
+ Coverage     79.715%    79.72%  +0.005%
- Complexity     18188     18190       +2
==============================================
  Files           1223      1223
  Lines          66735     66735
  Branches       10426     10426
==============================================
+ Hits           53198     53201       +3
+ Misses          9320      9319       -1
+ Partials        4217      4215       -2
```
fi

# Create cluster
gcloud dataproc clusters create "$GCS_CLUSTER" \
@tomwhite Let's make these auto-deleting clusters now that that's an option. You can do it by switching to `gcloud beta dataproc clusters create ...` and adding the new `--max-idle` and `--max-age` parameters.
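A sketch of what the auto-deleting variant could look like. The cluster name and timeout values below are placeholders, not values from this PR; `--max-idle` and `--max-age` are the Dataproc flags named above.

```bash
# Illustrative sketch: create an auto-deleting Dataproc cluster.
# --max-idle: delete the cluster after 1 hour with no submitted jobs.
# --max-age:  delete it 8 hours after creation regardless of activity.
# ("$GCS_CLUSTER" and the timeouts here are placeholder values.)
gcloud beta dataproc clusters create "$GCS_CLUSTER" \
    --max-idle 1h \
    --max-age 8h
```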
Done
# Copy small data to HDFS on GCS

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
Neat, I didn't know about this bashism. Alternatively you can add gatk-launch to your path.
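The bashism in question is `${parameter:-default}` expansion: it substitutes a fallback when the variable is unset or empty. A minimal illustration (the paths are just examples):

```bash
#!/usr/bin/env bash
# ${VAR:-default} expands to $VAR if it is set and non-empty,
# otherwise to the literal default text. Here GATK_HOME falls back to ../..
unset GATK_HOME
echo "${GATK_HOME:-../..}/gatk-launch"   # -> ../../gatk-launch

GATK_HOME=/opt/gatk
echo "${GATK_HOME:-../..}/gatk-launch"   # -> /opt/gatk/gatk-launch
```

This is why the script runs from a checkout by default but still honors an explicitly exported `GATK_HOME`.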
hadoop fs -mkdir -p $TARGET_DIR

# Download exome BAM (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/)
gsutil cp gs://broad-spark-eval-test-data/data/NA12878.ga2.exome.maq.raw.bam - | hadoop fs -put - $TARGET_DIR/NA12878.ga2.exome.maq.raw.bam
Should we use the -m option for parallel download? That seems to speed things up for me.
Oh, nm, I didn't read far enough to see that it was being streamed into Hadoop....
scripts/spark_eval/README.md
```

For the whole genome data, run:

```bash
./prep_data_genome_gcs.sh
./genome_copy_hdfs.sh
These names don't seem to match the names of the copy scripts; I see copy_genome_to_hdfs.sh.
Fixed
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy exome data to HDFS on GCS
When do you want to use this, and when do you want to use copy_exome_to_hdfs.sh?
Added some documentation to explain the difference.
# Copy exome data to HDFS on GCS

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
Should this be explicitly setting the number of executors?
I don't think it needs to; at least it works fine without.
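For reference, if explicit executor sizing were ever needed, the standard Spark-on-YARN knobs are the `spark-submit` resource flags shown below. The values are illustrative, and how (or whether) `gatk-launch` forwards such arguments is not covered in this thread.

```bash
# Illustrative spark-submit resource settings (all values are placeholders).
spark-submit \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 8g \
    MyApp.jar
```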
@tomwhite A few comments and questions. This looks great. Setting up data and getting all the misc arguments wrangled the right way is always a pain, so having these standardized will make testing so much easier.
a40778e to ac0a0d6
@tomwhite Thank you!
Should be straightforward, as the changes are very self-contained.