perf: Parallelize download reference #1065

Merged: 6 commits, Jan 23, 2023
69 changes: 42 additions & 27 deletions BALSAMIC/workflows/reference.smk
@@ -194,36 +194,51 @@ download_content = [reference_genome_url, dbsnp_url, hc_vcf_1kg_url,
 delly_mappability_findex_url, ascat_gccorrection_url, ascat_chryloci_url, clinvar_url,
 somalier_sites_url]

+download_dict = dict([(ref.get_output_file, ref) for ref in download_content])
+
+def download_reference_file(output_file):
+    import requests
+
+    ref = download_dict[output_file]
+    log_file = output_file + ".log"
+
+    if ref.url.scheme == "gs":
+        cmd = "export TMPDIR=/tmp; gsutil cp -L {} {} -".format(log_file,ref.url)
+    else:
+        cmd = "wget -a {} -O - {}".format(log_file,ref.url)
+
+    if ref.secret:
+        try:
+            response = requests.get(ref.url,headers={'Authorization': 'Basic %s' % ref.secret})
+            download_url = response.json()["url"]
+        except:
+            LOG.error("Unable to download {}".format(ref.url))
+            raise
+        cmd = "curl -o - '{}'".format(download_url)
+
+    if ref.gzip:
+        cmd += " | gunzip "
+
+    cmd += " > {}".format(output_file)
+    shell(cmd)
+    ref.write_md5
+
+ref_subdirs = set([ref.output_path for ref in download_content])
+ref_files = set([ref.output_file for ref in download_content])
+
+wildcard_constraints:
+    ref_subdir="|".join(ref_subdirs),
+    ref_file = "|".join(ref_files),
+
+
 rule download_reference:
     output:
-        expand("{output}", output=[ref.get_output_file for ref in download_content])
+        Path("{ref_subdir}","{ref_file}").as_posix(),
     run:
-        import requests
-
-        for ref in download_content:
-            output_file = ref.get_output_file
-            log_file = output_file + ".log"
-
-            if ref.url.scheme == "gs":
-                cmd = "export TMPDIR=/tmp; gsutil cp -L {} {} -".format(log_file, ref.url)
-            else:
-                cmd = "wget -a {} -O - {}".format(log_file, ref.url)
-
-            if ref.secret:
-                try:
-                    response = requests.get(ref.url, headers={'Authorization': 'Basic %s' % ref.secret })
-                    download_url = response.json()["url"]
-                except:
-                    LOG.error("Unable to download {}".format(ref.url))
-                    raise
-                cmd = "curl -o - '{}'".format(download_url)
-
-            if ref.gzip:
-                cmd += " | gunzip "
-
-            cmd += " > {}".format(output_file)
-            shell(cmd)
-            ref.write_md5
+        download_reference_file(output[0])
+
+
+

 ##########################################################
 # Preprocess refseq file by fetching relevant columns and
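
The command assembly in `download_reference_file` can be read in isolation as a pure function. The following is a minimal sketch, not code from this PR: `build_download_command` and its parameters (including `presigned_url`, which stands in for the `download_url` fetched with `requests` when `ref.secret` is set) are hypothetical names chosen for illustration, and the function only returns the string instead of passing it to `shell`.

```python
from typing import Optional


def build_download_command(
    url: str,
    scheme: str,
    output_file: str,
    gzip: bool = True,
    presigned_url: Optional[str] = None,
) -> str:
    """Return the shell pipeline for one reference file (hypothetical helper)."""
    log_file = output_file + ".log"

    if scheme == "gs":
        # Google Cloud Storage URL: stream the object to stdout with gsutil
        cmd = "export TMPDIR=/tmp; gsutil cp -L {} {} -".format(log_file, url)
    else:
        # Any other scheme: stream to stdout with wget, appending to the log file
        cmd = "wget -a {} -O - {}".format(log_file, url)

    if presigned_url is not None:
        # Secret-protected file: the earlier command is replaced by a curl call
        cmd = "curl -o - '{}'".format(presigned_url)

    if gzip:
        # Decompress on the fly before writing to disk
        cmd += " | gunzip "

    # Redirect the stream into the final output file
    cmd += " > {}".format(output_file)
    return cmd


if __name__ == "__main__":
    # Example URL and path are placeholders, not values from the repository
    print(build_download_command("https://example.org/ref.fa.gz", "https", "genome/ref.fa"))
```

Keeping the string construction separate from execution, as in this sketch, would make the three branches (gsutil, wget, curl) testable without network access; the PR itself keeps both steps in one function.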
5 changes: 5 additions & 0 deletions CHANGELOG.rst
@@ -5,6 +5,11 @@ Added:
 ^^^^^^
 * Added somalier integration and relatedness check: https://github.com/Clinical-Genomics/BALSAMIC/pull/1017

+Changed:
+^^^^^^^^
+* Parallelize download of reference files https://github.com/Clinical-Genomics/BALSAMIC/pull/1065
+
+
 Fixed:
 ^^^^^^
 * test_write_json failing locally https://github.com/Clinical-Genomics/BALSAMIC/pull/1063
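
The parallelization named in the changelog entry comes from changing the rule's output from one expanded list to a single `{ref_subdir}/{ref_file}` pattern, so each requested file resolves to its own job. The sketch below imitates that resolution in plain Python under stated assumptions: the paths are invented placeholders for `ref.get_output_file` values, `resolve_job` is a hypothetical helper, and the named regex groups only approximate how constrained wildcards are matched. The `re.escape` calls are an addition of this sketch; the diff joins the raw strings.

```python
import re
from pathlib import Path

# Placeholder output paths standing in for ref.get_output_file values
output_files = [
    "reference/genome/human_g1k_v37.fasta",
    "reference/variants/dbsnp.vcf",
    "reference/variants/clinvar.vcf",
]

# Mirror of ref_subdirs / ref_files from the diff, derived from the paths
ref_subdirs = sorted({Path(p).parent.as_posix() for p in output_files})
ref_files = sorted({Path(p).name for p in output_files})

# Named groups play the role of the {ref_subdir} and {ref_file} wildcards,
# each restricted to the alternatives listed in the constraints
target_pattern = re.compile(
    "(?P<ref_subdir>{})/(?P<ref_file>{})".format(
        "|".join(re.escape(s) for s in ref_subdirs),
        "|".join(re.escape(f) for f in ref_files),
    )
)


def resolve_job(target: str) -> dict:
    """Return the wildcard values for one target, or fail if nothing matches."""
    match = target_pattern.fullmatch(target)
    if match is None:
        raise ValueError("no rule produces {}".format(target))
    return match.groupdict()


# One independent job per file, instead of one job looping over all files
jobs = [resolve_job(p) for p in output_files]

if __name__ == "__main__":
    for job in jobs:
        print(job)
```

Because every job has exactly one output, the scheduler is free to run several downloads at once when more than one core is available. How the individual files are requested as targets is not shown in this diff.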