Issue with sequence of length 1 and quality '+' #408

Closed
4 tasks done
dehui333 opened this issue Sep 18, 2023 · 5 comments

Comments

@dehui333

Prerequisites

  • make sure you are using the latest version by running seqkit version
  • read the usage

Describe your issue

  • describe the problem
  • provide a reproducible example

Problem:

seqkit sana flags a record whose sequence has length 1 and whose quality string is '+' as problematic, for some reason. This does not happen with the other valid quality values I have tried, or when the sequence is longer than 1 bp.

Example:

echo -e '@seq\nA\n+\n+\n' | seqkit sana
[INFO] File: - Discarded line: Invalid line states! 1: @seq
[INFO] File: - Discarded line: Invalid line states! 2: A
[INFO] File: - Discarded line: Invalid line states! 3: +
[INFO] File: - Discarded line: Invalid line states! 4: +
[INFO] File: - Pass records: 0 Discarded lines: 4

echo -e '@seq\nA\n+\n?\n' | seqkit sana
[INFO] File: - Pass records: 1 Discarded lines: 0
@seq
A
+
?

echo -e '@seq\nAA\n+\n++\n' | seqkit sana
[INFO] File: - Pass records: 1 Discarded lines: 0
@seq
AA
+
++
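
For reference, the first record above is well-formed when its lines are interpreted purely by position. Below is a minimal Python sketch of such a check, assuming strict four-line records (multi-line FASTQ is not handled). The function name is hypothetical and this is not how seqkit parses its input.

```python
def is_valid_four_line_record(lines):
    """Check one FASTQ record using line position only (hypothetical helper)."""
    if len(lines) != 4:
        return False
    header, seq, sep, qual = lines
    return (
        header.startswith("@")        # line 1: header
        and len(seq) > 0              # line 2: non-empty sequence
        and sep.startswith("+")       # line 3: separator
        and len(qual) == len(seq)     # line 4: one quality char per base
    )


# The record from the failing example: 1 bp sequence, quality '+'.
record = "@seq\nA\n+\n+\n".splitlines()
print(is_valid_four_line_record(record))  # True
```

Because the fourth line is read as quality by position, its content being '+' does not matter here.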

@shenwei356
Owner

@botond-sipos might help. I tried but failed to understand the code logic.

@botond-sipos
Contributor

This is an unfortunate edge case. The parser does not rely on the 4-line structure of the fastq files, hence it needs a way to classify the input lines (see here).
In the case described in this thread, a line containing only a single '+' is classified as a separator line, so the record ends up with two consecutive separator lines, which is invalid. Unfortunately, I cannot fix this, as there is little else to rely on when distinguishing separator lines from quality lines.
Please consider this a known bug.
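
The ambiguity described above can be illustrated with a small Python sketch. It assumes a simplified classifier that labels each line from its content alone. The rules and function names here are hypothetical and are not taken from seqkit's source.

```python
def classify(line: str) -> str:
    """Guess a FASTQ line's role from its content only (hypothetical rules)."""
    if line.startswith("@"):
        return "header"
    if line == "+":
        # A lone '+' looks exactly like a separator line.
        return "separator"
    if all(c in "ACGTNacgtn" for c in line):
        return "sequence"
    return "quality"


def classify_record(text: str) -> list:
    """Classify every non-empty line of a text block."""
    return [classify(line) for line in text.splitlines() if line]


print(classify_record("@seq\nA\n+\n+\n"))
# ['header', 'sequence', 'separator', 'separator']
print(classify_record("@seq\nA\n+\n?\n"))
# ['header', 'sequence', 'separator', 'quality']
```

Under these assumed rules, a 1 bp record with quality '+' yields two consecutive separator lines, while '?' or '++' is recognised as quality, which matches the behaviour reported in the examples.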

@dehui333
Author

dehui333 commented Sep 25, 2023

Thanks for the explanation. This is certainly not a big issue in itself, but I suspect it could be related to the output of seqkit sana and seqkit seq -m being corrupted in some cases.

There were instances where the above operations produced output that seqkit stats rejected as invalid fastx. For the same input, the output was fine when I used seqtk, but seqkit seq -m somehow corrupted it. I noticed that in all these cases the input fastq had sequences of length 1.

Unfortunately, I don't have time at the moment to investigate this further, and I cannot rule out the possibility that it's due to something else. Anyway, I hope this information will be useful if others ever observe something similar.

Edit: I realized my fasta file was getting corrupted when copied from one storage system to another, which makes it more likely that the above-mentioned issue is not due to seqkit. You probably don't have to worry about it.

@shenwei356
Owner

It's fixed by @botond-sipos.

@shenwei356
Owner

Fixed in v2.8.1.

3 participants