
Simplify spark_eval scripts and improve documentation. #3580

Merged: 1 commit merged into master from tw_spark_end_to_end, Sep 27, 2017

Conversation

tomwhite (Contributor):

Should be straightforward, as the changes are very self-contained.

@codecov-io commented Sep 14, 2017:

Codecov Report

Merging #3580 into master will increase coverage by 0.005%.
The diff coverage is n/a.

```
@@              Coverage Diff               @@
##              master    #3580       +/-   ##
==============================================
+ Coverage     79.715%   79.72%   +0.005%
- Complexity     18188    18190        +2
==============================================
  Files           1223     1223
  Lines          66735    66735
  Branches       10426    10426
==============================================
+ Hits           53198    53201        +3
+ Misses          9320     9319        -1
+ Partials        4217     4215        -2
```
| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...e/hellbender/engine/spark/SparkContextFactory.java | 71.233% <0%> (-2.74%) | 11% <0%> (ø) |
| ...er/tools/spark/sv/discovery/AlignmentInterval.java | 89.352% <0%> (+0.926%) | 53% <0%> (+2%) ⬆️ |
| ...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 78.571% <0%> (+1.948%) | 39% <0%> (ø) ⬇️ |

```bash
fi

# Create cluster
gcloud dataproc clusters create "$GCS_CLUSTER" \
```
Member:

@tomwhite Let's make these auto-deleting clusters now that that's an option. You can do it by switching to `gcloud beta dataproc clusters create ...` and adding the new `--max-idle` and `--max-age` parameters.
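For reference, a minimal sketch of what that might look like; the duration values are illustrative assumptions, not taken from this PR:

```bash
# Sketch: auto-deleting Dataproc cluster (duration values are illustrative).
# --max-idle deletes the cluster after it has been idle for the given time;
# --max-age deletes it unconditionally once it reaches the given lifetime.
gcloud beta dataproc clusters create "$GCS_CLUSTER" \
    --max-idle 30m \
    --max-age 3h
```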

tomwhite (Author):

Done


```bash
# Copy small data to HDFS on GCS

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
```
Member:

Neat, I didn't know about this bashism. Alternatively, you can add `gatk-launch` to your path.
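For anyone else unfamiliar with it, a minimal illustration of the `${VAR:-default}` expansion used above:

```bash
# ${GATK_HOME:-../..} expands to $GATK_HOME when it is set and non-empty,
# and falls back to the literal ../.. otherwise.
GATK_HOME=/opt/gatk
echo "${GATK_HOME:-../..}/gatk-launch"   # prints /opt/gatk/gatk-launch
unset GATK_HOME
echo "${GATK_HOME:-../..}/gatk-launch"   # prints ../../gatk-launch
```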

```bash
hadoop fs -mkdir -p $TARGET_DIR

# Download exome BAM (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/)
gsutil cp gs://broad-spark-eval-test-data/data/NA12878.ga2.exome.maq.raw.bam - | hadoop fs -put - $TARGET_DIR/NA12878.ga2.exome.maq.raw.bam
```
Member:

Should we use the `-m` flag for parallel download? That seems to speed things up for me.

Member:

Oh, never mind, I didn't read far enough to see that it was being streamed into Hadoop.
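To make the distinction concrete, a brief sketch; the `-m` variant is hypothetical here, since the script never writes a local copy:

```bash
# gsutil -m parallelizes transfers, which helps when downloading many
# objects to local disk (hypothetical variant, not what the script does):
#   gsutil -m cp gs://broad-spark-eval-test-data/data/*.bam /local/dir/
#
# The script instead streams a single object to stdout and pipes it directly
# into HDFS, so no local copy is made and -m would not help:
gsutil cp gs://broad-spark-eval-test-data/data/NA12878.ga2.exome.maq.raw.bam - \
    | hadoop fs -put - "$TARGET_DIR/NA12878.ga2.exome.maq.raw.bam"
```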

```

For the whole genome data, run:

```bash
./prep_data_genome_gcs.sh
./genome_copy_hdfs.sh
```
Member:

These names don't seem to match the names of the copy scripts; I see `copy_genome_to_hdfs.sh`.

tomwhite (Author):

Fixed

```
@@ -0,0 +1,11 @@
#!/usr/bin/env bash

# Copy exome data to HDFS on GCS
```
Member:

When do you want to use this and when do you want to use `copy_exome_to_hdfs.sh`?

tomwhite (Author):

Added some documentation to explain the difference.


```bash
# Copy exome data to HDFS on GCS

${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
```
Member:

Should this be explicitly setting the number of executors?

tomwhite (Author):

I don't think it needs to; at least it works fine without.
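If pinning the executor count ever did prove necessary, something along these lines could work; the argument names and paths below are assumptions based on gatk-launch conventions of the time, not taken from this PR:

```bash
# Hypothetical: explicitly size the Spark executors (all values illustrative).
# Arguments after the lone "--" go to the Spark runner rather than the tool.
${GATK_HOME:-../..}/gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
    --inputGCSPath gs://example-bucket/exome-data/ \
    --outputHDFSDirectory hdfs:///user/$USER/exome_spark_eval \
    -- \
    --sparkRunner GCS --cluster "$GCS_CLUSTER" \
    --num-executors 4 --executor-cores 4 --executor-memory 8G
```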

lbergelson (Member) left a comment:

@tomwhite A few comments and questions. This looks great. Setting up data and getting all the misc arguments wrangled the right way is always a pain, so having these standardized will make testing so much easier.

@tomwhite merged commit 847048d into master on Sep 27, 2017.
@tomwhite deleted the tw_spark_end_to_end branch on Sep 27, 2017 at 09:02.
@lbergelson (Member) commented Sep 27, 2017:

@tomwhite Thank you!
