
Fixed concatenation of many VCF files #309

Merged
merged 3 commits into from
Dec 9, 2020


@szymonwieloch szymonwieloch commented Nov 26, 2020

PR checklist

  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repo
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Make sure your code lints (nf-core lint .).
  • Documentation in docs is updated
  • CHANGELOG.md is updated
  • README.md is updated

Description

Fix for a race condition (hazard) that occurs during VCF file concatenation. The issue is quite subtle, so please let me elaborate on it. These are the two lines causing the problem:

set -euo pipefail
FIRSTVCF=$(ls *.vcf | head -n 1)

The hazard comes from the fact that sometimes head exits before ls and sometimes it's the other way around. When you have a small number of *.vcf files, ls usually prints everything to stdout quickly and exits. Then head reads the content from stdin (piped from ls) and exits too. No error is reported.

However, things work differently when you have many *.vcf files. In this case it takes some time for ls to print everything to stdout. head starts reading earlier, prints the first line and exits. head is optimized to exit as soon as its work is complete, so that when you run head on a 100 GB file it only reads the beginning and not the whole file. Unlike most other applications, it does not wait for its stdin to be closed before exiting. As soon as head gets one line from stdin and prints it to stdout, it exits, which closes its stdin.

However, ls is still trying to write to its stdout, which is piped to the now-closed stdin of head. The write fails: ls is terminated by SIGPIPE, which the shell reports as exit code 141 (128 + 13, broken pipe). Because of set -o pipefail the whole pipeline then returns that non-zero status, and set -e aborts the script.
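The failure mode described above can be reproduced without thousands of VCF files. The sketch below is only an illustration: it swaps ls for yes, which writes forever and is therefore always still writing when head exits, and the file name sample.vcf is a made-up placeholder.

```shell
#!/usr/bin/env bash
# Illustration only: `yes` stands in for an `ls` that is still writing
# when `head` exits. "sample.vcf" is a hypothetical placeholder string.
set -uo pipefail   # deliberately no -e, so the status can be inspected

first_line=$(yes "sample.vcf" | head -n 1)
status=$?          # with pipefail: status of the failing writer, not of head

echo "first line: ${first_line}"
# Typically 141 (128 + SIGPIPE = 13); non-zero in any case.
echo "pipeline status: ${status}"
```

head delivers the correct first line, yet the pipeline as a whole reports a failure. With set -e enabled, the assignment would abort the script at that point.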

This hazard clearly depends on the number of *.vcf files and probably also on implementation details of your OS, ls and head. At my company we had a huge sample that produced many VCF files, and concatenation failed consistently. When you run

ls *.vcf | head -n1

a broken pipe is expected and is not a problem. The script should not exit in this situation, as it currently does.
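One way to pick the first file without any pipe, and therefore without any SIGPIPE, is to expand the glob into a bash array. This is a minimal sketch under stated assumptions, not necessarily the change made in this PR. The temporary directory and the file names a.vcf, b.vcf and c.vcf are hypothetical and exist only to make the example runnable.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical test data: a scratch directory with three empty VCF files.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
touch "$workdir/b.vcf" "$workdir/a.vcf" "$workdir/c.vcf"

# Expand the glob into an array. Glob results are sorted, and nothing is
# piped, so no process can be killed by SIGPIPE.
shopt -s nullglob              # an unmatched glob expands to nothing
vcfs=("$workdir"/*.vcf)

# Fail with a clear message if there is nothing to concatenate.
if (( ${#vcfs[@]} == 0 )); then
    echo "no VCF files found in $workdir" >&2
    exit 1
fi

FIRSTVCF=${vcfs[0]}
echo "$FIRSTVCF"
```

In the original script the glob would simply be *.vcf in the working directory. A smaller change is to keep the pipeline and append || true inside the command substitution, but that also hides genuine ls failures, so the array form is arguably the safer of the two.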

@maxulysse maxulysse merged commit e83e674 into nf-core:dev Dec 9, 2020