Seqkit sana fails on valid FASTQ #429
Comments
This was also reported here: #408. I just checked the code (written by @botond-sipos) and found that the method did not consider cases where the separator line (`+`) is followed by an identifier. I might rewrite it if I have time (not now, sorry).
Thanks! If you point me to the relevant code, I can have a quick look to see how it might be fixed. I see where the error is thrown, but where is the matching/checking code?
Note also that Go FASTQ parsers exist, e.g. https://pkg.go.dev/github.com/TimothyStiles/poly@v0.29.2/io/fastq
The code is here: https://github.com/shenwei356/seqkit/blob/v2.6.1/seqkit/cmd/sana.go#L274-L279

Other subcommands in seqkit use my parser here: https://github.com/shenwei356/bio/blob/master/seqio/fastx/reader.go

It looks complicated because I want to seamlessly parse both FASTA and FASTQ with single- or multiple-line sequence/quality.
@mw55309 My apologies for the delayed response! Unfortunately, sana currently doesn't support FASTQ files with identifiers in the separator lines. I'll update the help message in a PR to clarify this. As @shenwei356 mentioned, it's difficult to reliably distinguish separator lines with identifiers from quality lines within sana's parsing system. While other Golang FASTQ parsers exist, they typically stop at the first error. Sana's parser is designed to be more forgiving, skipping malformed lines and continuing to process the rest of the file. I hope this explains sana's current limitations!
I think the logic that makes sense to me is:
So you can fix the problem by keeping a record of what the previous line's category is?
Yes. We have to think about it. It's feasible.
@mw55309 I started working on parts of the suggested logic. It is a quick hack, but I implemented the lookback to decide the state of the separator lines. You can find it here; I would appreciate it if you tested it and let me know whether it meets expectations. If all goes well, I will submit a PR after implementing some tests.

What we definitely cannot support in the current parsing framework is multi-line sequences and qualities. The parser architecture is complicated by the fact that it is an "online" parser capable of streaming records from files that are still being written. This feature is used in the scat subcommand.
Fixed in v2.8.1
e.g. this is valid FASTQ from the SRA:
output:
However, remove any characters after the + sign, and it validates fine:
output:
If you want the actual files, they are here:
https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmwgsqc/v2/SRS1041116.tar.bz2