rmdup: dup-num-file not created if no duplicated reads #436

Closed
12 tasks done
fgvieira opened this issue Jan 23, 2024 · 5 comments

fgvieira commented Jan 23, 2024

Please check the items below before submitting an issue.
They help to improve the communication efficiency between us.
Thanks!

Prerequisites

  • Make sure you've installed the correct executable binary file.
    For Mac users, please download:
    • seqkit_darwin_amd64.tar.gz for Mac with Intel CPUs.
    • seqkit_darwin_arm64.tar.gz for Mac with M series CPUs.
  • Make sure you are using the latest version by running seqkit version -u.
  • Read the usage and examples for the specific subcommand.

Describe your issue in detail

  • Please copy and paste the command you ran and the error information if reported.
  • It would be more helpful to provide as much information as you can:
    • Are you running on a personal computer or a server?
    • What's the operating system, and how much RAM (memory) is available?
    • Show the types and sizes of input files with file xxx and ls -lh xxx.
    • Show some lines of input files with head -n 5 xxx or zcat xxx.gz | head -n 5.
  • Provide a reproducible example.
    • Has this problem happened many times?
    • Or did it only fail with this input file and/or these commands/parameters?

I am running seqkit on a RedHat server:

seqkit rmdup --threads 10  --dup-num-file dup.tsv --ignore-case --by-seq  --out-file collapsed.fastq.gz collapsed.rmdup.fastq.gz

But seqkit rmdup does not create the dup-num-file (dup.tsv) file if there are no duplicated reads in the input file.

Input file is a FASTQ:

$ zcat collapsed.fastq.gz | head -n 8
@T0_RID60_S1_CM000682.2_ngsngs:13496936-13497014_length:79_mod0000 F2 R1 merged_79_0
TAAGGAAGCAGTGGAAAAAGAATAAATGCTGTAGATGAGGACAAGAAATTAGTTGAACTTTAATAAACTTCAAATGACT
+
CCCGGGGGG=GGGJJJGJGJJGJJJJJJGJCJJC=GJJJJJJGG1JGGGJJCGJJJG=JGGCGJCCJJGJJJGJGCCJG
@T2_RID60_S1_CM000666.2_ngsngs:130549431-130549518_length:88_mod0000 F3 R1 merged_88_0
TTTGCTCATATTTTGTGAAGTATTTTTATATCTGTATTCATGAATGATATTGCCATGCAATTGTCTTTTATTTTAATAATCTTGTCTT
+
CC8G=GGGGGGGGJJGJJJJJJJJJGCGGJJGJCJJJJJGJG8J1J=GJCJGJJJJJ(GGJGJGGJGGGGJJJJGJGGGCJJCJGGCJ

shenwei356 (Owner) commented

> seqkit rmdup does not create the dup-num-file (dup.tsv) file if there are no duplicated reads in the input file.

Oh, yes, it's designed to act like this. https://github.com/shenwei356/seqkit/blob/master/seqkit/cmd/rmdup.go#L181

fgvieira (Author) commented

I can see the logic of not creating the file if there are no duplicated reads, but when using seqkit in a Snakemake workflow, the workflow sometimes fails because the expected file is not created.
An alternative would be to create an empty file if there are no duplicates.
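
Until the tool itself creates the file, a workflow-side workaround could look like the minimal sketch below. It assumes Python is available (as it is wherever Snakemake runs). `ensure_exists` and `run_rmdup` are hypothetical helper names, not part of seqkit, and the flags are simply the ones from the command in the report above.

```python
import subprocess
from pathlib import Path


def ensure_exists(path: str) -> None:
    """Create an empty file at `path` if it is missing.

    An existing file keeps its content (only its timestamp is updated).
    """
    Path(path).touch(exist_ok=True)


def run_rmdup(in_file: str, out_file: str, dup_num_file: str) -> None:
    """Hypothetical wrapper: run seqkit rmdup, then guarantee the report exists."""
    subprocess.run(
        [
            "seqkit", "rmdup",
            "--by-seq", "--ignore-case",
            "--dup-num-file", dup_num_file,
            "--out-file", out_file,
            in_file,
        ],
        check=True,  # raise if seqkit exits with a non-zero status
    )
    # If no duplicates were found, seqkit may not have written the report,
    # so create an empty placeholder for the workflow to pick up.
    ensure_exists(dup_num_file)
```

With this, a rule that declares dup.tsv as an output always finds the file, and an empty file means "no duplicates".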

shenwei356 (Owner) commented

Well, I can change the behaviour. But it should be easy to detect if a file exists in snakemake with something like os.path.exists. If the file does not exist, just skip the downstream steps.
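
The existence check suggested here might be sketched as follows. This is only an illustration: `has_duplicates` is a hypothetical helper name, and how it is wired into a Snakemake workflow is left open. It treats both a missing file and an empty file as "no duplicates", so it should work whether or not the report is created in that case.

```python
import os


def has_duplicates(dup_num_file: str) -> bool:
    """Return True only if the report exists and is non-empty."""
    return os.path.exists(dup_num_file) and os.path.getsize(dup_num_file) > 0


if __name__ == "__main__":
    # "dup.tsv" is the report name used in the command above.
    if has_duplicates("dup.tsv"):
        print("duplicates reported; run downstream steps")
    else:
        print("no duplicates reported; skip downstream steps")
```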

shenwei356 (Owner) commented

fgvieira (Author) commented

Nice! It seems to work!
Thanks.
