Issue with sequence of length 1 and quality '+' #408
Comments
@botond-sipos might help. I tried but failed to understand the code logic.
This is an unfortunate edge case. The parser does not rely on the 4-line structure of FASTQ files, so it needs a way to classify the input lines (see here).
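To illustrate why classifying lines by content is ambiguous for this input, here is a minimal sketch. It is not seqkit's actual logic: the function `possible_roles` and the `SEQ_CHARS` alphabet are hypothetical names made up for this example, and the rules are deliberately simplified.

```python
# Hypothetical, simplified nucleotide alphabet (IUPAC letters only).
SEQ_CHARS = set("ACGTUNRYKMSWBDHVacgtunrykmswbdhv")


def possible_roles(line):
    """Return every FASTQ role this line could play when judged by content alone."""
    roles = set()
    if line.startswith("@"):
        roles.add("header")
    if line.startswith("+"):
        roles.add("separator")
    if line and all(ch in SEQ_CHARS for ch in line):
        roles.add("sequence")
    # Quality strings may use any printable ASCII character from '!' (33) to '~' (126).
    if line and all(33 <= ord(ch) <= 126 for ch in line):
        roles.add("quality")
    return roles


for line in ["@seq", "A", "+", "?"]:
    print(repr(line), sorted(possible_roles(line)))
```

In this sketch a lone `+` fits both the separator and the quality role, while `?` fits only quality, so a parser that ignores line position needs extra context to tell the two `+` lines apart. How seqkit resolves this internally is not shown here.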
Thanks for your explanation. This is not a big issue in itself, but I suspect it could be related to the output of seqkit sana and seqkit seq -m being corrupted in some cases. In some instances these operations produced output that seqkit stats rejected as invalid FASTX; for the same input, the output was fine when I used seqtk, but seqkit seq -m somehow corrupted it. In all of these cases the input FASTQ contained sequences of length 1. Unfortunately, I don't have time to investigate this further at the moment, and I cannot rule out that the cause is something else. Anyway, I hope this information is useful if others ever observe something similar.

Edit: I realized that my fasta file is getting corrupted when it is copied from one storage system to another, which makes it more likely that the issue described above is not due to seqkit. You probably don't have to worry about it.
Fixed in v2.8.1.
Describe your issue
Problem:
seqkit sana flags a sequence of length 1 with the quality string '+' as problematic for some reason. This does not happen with the other valid quality values I have tried, or when the sequence is longer than 1 bp.
Example:
echo -e '@seq\nA\n+\n+\n' | seqkit sana
[INFO] File: - Discarded line: Invalid line states! 1: @seq
[INFO] File: - Discarded line: Invalid line states! 2: A
[INFO] File: - Discarded line: Invalid line states! 3: +
[INFO] File: - Discarded line: Invalid line states! 4: +
[INFO] File: - Pass records: 0 Discarded lines: 4
echo -e '@seq\nA\n+\n?\n' | seqkit sana
[INFO] File: - Pass records: 1 Discarded lines: 0
@seq
A
+
?
echo -e '@seq\nAA\n+\n++\n' | seqkit sana
[INFO] File: - Pass records: 1 Discarded lines: 0
@seq
AA
+
++
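For comparison, the three inputs above can be checked with a strictly positional reader. This is a rough sketch that assumes exactly four lines per record with no wrapped sequences; `parse_strict_fastq` is a hypothetical helper written for this example and is not part of seqkit.

```python
def parse_strict_fastq(text):
    """Parse FASTQ assuming exactly four lines per record (no line wrapping)."""
    lines = text.split("\n")
    # Drop trailing empty lines, such as the extra newline added by echo.
    while lines and lines[-1] == "":
        lines.pop()
    if len(lines) % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1}: header must start with '@'")
        if not sep.startswith("+"):
            raise ValueError(f"line {i + 3}: separator must start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"line {i + 4}: quality length differs from sequence length")
        records.append((header[1:], seq, qual))
    return records


for text in ("@seq\nA\n+\n+\n", "@seq\nA\n+\n?\n", "@seq\nAA\n+\n++\n"):
    print(parse_strict_fastq(text))
```

Read by position, all three inputs yield one record each, including the one with sequence 'A' and quality '+'. The trade-off is that a positional reader like this cannot recover from a file with missing or extra lines.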