
Failed on fastqs having identical filename but different path #431

Open
yunhailuo opened this issue Nov 3, 2023 · 3 comments

@yunhailuo

Hi @leepc12 , first, thank you for the great pipeline here 🙏

Describe the bug

The failure is partly due to my input fastqs having identical file names, differing only by their parent directories (paths). For example,

        "/redacted/path/R9921/results/sample_7_R1.fastq.gz",
        "/redacted/path/R9935/results/sample_7_R1.fastq.gz"

These two technical replicates, when processed by cutadapt and merged, get overwritten and merged on top of each other. This produces bad fastqs which trigger an error later on.

OS/Platform

  • OS/Platform: Slurm HPC with singularity (converted from the docker image)
  • Conda version: N/A
  • Pipeline version: v2.2.2
  • Caper version: N/A

Input JSON file

Paste contents of your input JSON file.

{
    "atac.enable_idr": "True",
    "atac.fastqs_rep1_R1": [
        "/redacted/path/R9921/results/sample_7_R1.fastq.gz",
        "/redacted/path/R9935/results/sample_7_R1.fastq.gz"
    ],
    "atac.fastqs_rep1_R2": [
        "/redacted/path/R9921/results/sample_7_R2.fastq.gz",
        "/redacted/path/R9935/results/sample_7_R2.fastq.gz"
    ],
    "atac.fastqs_rep2_R1": [
        "/redacted/path/R9921/results/sample_8_R1.fastq.gz",
        "/redacted/path/R9935/results/sample_8_R1.fastq.gz"
    ],
    "atac.fastqs_rep2_R2": [
        "/redacted/path/R9921/results/sample_8_R2.fastq.gz",
        "/redacted/path/R9935/results/sample_8_R2.fastq.gz"
    ],
    "atac.fastqs_rep3_R1": [
        "/redacted/path/R9921/results/sample_9_R1.fastq.gz",
        "/redacted/path/R9935/results/sample_9_R1.fastq.gz"
    ],
    "atac.fastqs_rep3_R2": [
        "/redacted/path/R9921/results/sample_9_R2.fastq.gz",
        "/redacted/path/R9935/results/sample_9_R2.fastq.gz"
    ],
    "atac.idr_thresh": "0.05",
    "atac.paired_end": true,
    "atac.genome_tsv": "/somewhere/hg38.tsv",
    "atac.pval_thresh": "0.01",
    "atac.sample_name": "redacted_sample_name",
    "atac.smooth_win": "150"
}
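Not part of the pipeline, but a hypothetical pre-flight check on an input JSON like the one above could catch such basename collisions before submitting a run (the function name and behavior here are my own sketch, not anything the pipeline provides):

```python
import json
import os
from collections import Counter

def find_basename_collisions(input_json: str) -> dict:
    """Scan atac.fastqs_* lists in a pipeline input JSON for entries
    that share a basename. Returns {key: [duplicated basenames]}.
    Hypothetical pre-flight helper, not part of the pipeline."""
    with open(input_json) as fh:
        cfg = json.load(fh)
    collisions = {}
    for key, val in cfg.items():
        if key.startswith("atac.fastqs_") and isinstance(val, list):
            counts = Counter(os.path.basename(p) for p in val)
            dups = [b for b, n in counts.items() if n > 1]
            if dups:
                collisions[key] = dups
    return collisions
```

Running this on the JSON above would flag every `atac.fastqs_rep*` key, since each list's two entries share a basename.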

Troubleshooting result

The failure is on filter task with the following direct complaint:

Traceback (most recent call last):
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 438, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 344, in main
    args.nth, args.mem_gb, args.out_dir)
  File "/software/atac-seq-pipeline/src/encode_task_filter.py", line 178, in rm_unmapped_lowq_reads_pe
    'No reads found aftering filtering "samtools fixmate"d PE BAM with '
ValueError: No reads found aftering filtering "samtools fixmate"d PE BAM with "samtools view -F 1804 -f 2". Reads are not properly paired even though mapping rate is good?
...

Honestly, it's still not clear to me how this ends up with bad read pairing. But I noticed that the reads are messed up: 1) there are duplicated reads and 2) there are stretches where the PE mates are mis-aligned. I managed to trace it back to the align task, at encode_task_trim_adapter.py. My reading of that script is that it first runs cutadapt on each fastq and then merges the trimmed fastqs from the same biological replicate, using shell commands to do so.

One important thing there is that the outputs supplied to the shell command are bare filenames with no particular path, i.e. all under the current working directory. Because my input (tech-rep) fastqs from the same bio-rep have identical filenames and differ only by parent directory (path), their cutadapt results all get the same file name and thus, I think, overwrite each other. And during the merge, where fastqs are supplied as a list, a single file (because of the same filename) gets passed three times and three identical copies are merged together. I think this is where things get messed up.
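To illustrate the overwrite I'm describing, here is a minimal stand-in for the trim step (the paths and file contents are made up; the writes simulate cutadapt outputs named by basename only):

```python
import os
import tempfile

# Two hypothetical tech-rep inputs that differ only in parent directory.
inputs = [
    "/data/R9921/results/sample_7_R1.fastq.gz",
    "/data/R9935/results/sample_7_R1.fastq.gz",
]

workdir = tempfile.mkdtemp()
for i, path in enumerate(inputs):
    # The trim step writes its output under the basename only, so the
    # second write lands on the same file as the first.
    out = os.path.join(workdir, os.path.basename(path))
    with open(out, "w") as fh:
        fh.write(f"trimmed reads from input #{i}\n")

print(os.listdir(workdir))  # one file left, holding only the last input's data
```

After the loop there is a single `sample_7_R1.fastq.gz` in the working directory, and it contains only the data from the last input written.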

If this makes sense and the root-cause analysis is correct, I'm wondering if it's possible to improve the handling of intermediate results by giving cutadapt outputs temporary unique names? To be clear, Cromwell handles this problem fine, since it localizes each file into its own unique directory (so it's fine even when the original name is kept).
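One possible shape for such a fix (my own sketch, not the pipeline's code): derive the intermediate output name from the input's parent directory as well as its basename, e.g. with a short hash prefix.

```python
import hashlib
import os

def unique_output_name(fastq_path: str) -> str:
    """Derive a collision-free output name for a trimmed fastq by
    prefixing a short hash of the input's parent directory.
    Illustrative sketch only; not the pipeline's actual code."""
    parent = os.path.dirname(os.path.abspath(fastq_path))
    tag = hashlib.md5(parent.encode()).hexdigest()[:8]
    return f"{tag}.{os.path.basename(fastq_path)}"

paths = [
    "/data/R9921/results/sample_7_R1.fastq.gz",
    "/data/R9935/results/sample_7_R1.fastq.gz",
]
names = [unique_output_name(p) for p in paths]
assert len(set(names)) == len(names)  # same basenames no longer collide
```

Since the prefix depends only on the parent directory, two inputs with identical basenames but different paths always map to distinct intermediate files, while the original basename stays visible for debugging.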

@leepc12
Contributor

leepc12 commented Nov 6, 2023

Yeah, this looks like an issue of conflicting file names between R1 and R2 fastqs.
cutadapt in encode_task_trim_adapter.py will try to write to the same output filename for R1 and R2.
Please rename the R2 fastq and try again.

@yunhailuo
Author

> Yeah, this looks like an issue of conflicting file names between R1 and R2 fastqs. cutadapt in encode_task_trim_adapter.py will try to write to the same output filename for R1 and R2. Please rename the R2 fastq and try again.

Hi @leepc12 , thank you for the quick check and sorry for my delay 😄 Renaming works fine. To clarify, the reason I raised this issue is to see if there is any interest in updating the pipeline to handle such cases so that renaming inputs isn't needed. It's a fixed pattern from our data source, so the renaming adds a little extra work before WDL for us. I'm wondering whether handling this inside the ATAC pipeline would be appropriate.

@kloot

kloot commented Dec 31, 2023

Have also observed this issue. FYI, when running on Slurm this resulted in an OOM error (for encode_task_filter.py), so the problem wasn't obvious until I ran the same fastq files interactively and didn't get the OOM error.
The renaming workaround worked fine.

I second @yunhailuo's suggestion to handle this in the pipeline.

Thanks for the great tool!
