-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The warning and behaviour regarding illegal characters in taxa names could be improved #90
Comments
Also, apologies if this has been fixed in a more recent version. |
Thanks for your feedback, @terrycojones ! First of all, I totally agree that developers shouldn't be lazy such that users can be :) Now regarding your specific situation: Just before the error message you posted above, you should see the lines like these:
Although I do not print the exact offset of the offending characters here, it should be quite obvious that FASTA comment was considered as part of taxon name, and by reading the NOTE message below, it's straightforward that spaces are the problem. Moreover, RAxML_NG will create an MSA file with fixed taxon names for you ( Regarding FASTA comment treatment, I'm aware of the issue. Unfortunately, the underlying problem is the lack of (strict!) standards for bionforfmatic file formats (be it FASTA, PHYLIP or NEWICK). As a result, there are multiple "flavors" of each format, since every tool (or a custom perl script...) will print whatever they please, introducing extra spaces, newlines, weird symbols. comments etc. In case of FASTA, the line after ">" is called "description". I could not find any "official" source which states that what goes after the first space is a comment (see eg. https://en.wikipedia.org/wiki/FASTA_format or https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp). So the following FASTA is perfectly valid:
Given all that, I would rather make this description cutting behavior optional, something like "--msa-format FASTA+comment". Ideally, there should be a heuristic to autodetect "FASTA" vs. "FASTA+comment" format. Any ideas how this might work? |
Hey - thanks, you gave me a smile :-) I think I would just go for something simpler than an Another option in generating your error output would be to print out the offending ids with a line under each to show the location of the issue. So:
That would handle most cases (by which I mean it wouldn't handle unprintable characters). At least in Python I would print using the OK, enough words... Thanks for the reply. It's super nice to see people being careful in writing bioinformatics code :-) Terry |
I've also come across this issue.
The page https://www.ncbi.nlm.nih.gov/blast/fasta.shtml is not an official description of the FASTA format - it only lists what formats BLAST accepts, FASTA is one of them. https://www.ncbi.nlm.nih.gov/genbank/fastaformat/ would be slightly more to the point although it refers to a subformat of FASTA used for genbank submissions - it explicitly mentions that Here is a sample of just the first pages from a google search:
|
The way the definition line works is not an unfortunate convention; it is the way that Bill Pearson (the author of the FASTA alignment program) created it decades ago, apparently (I did not search for the original docs, but it's been like that since at least the 1990s). And it is extremely useful and informative. Nearly all bioinformatics software I've come across respects that format that considers the first "word" after > as the identifier and everything else as a free-form "comment" that can be used any way the user wants. The old RAxML certainly did work fine with any FASTA file like that. Now I came here to check why RAxML-NG was choking on my file that works everywhere else... It is true that there is an annoying amount of FASTA "flavors" out there, but I think the simple >id [optional comments] format is by far the most common one (also, ignoring spaces in the sequence area). Considering the whole line as the ID is definitely bad and very seldom seen, in my experience. |
…. Alternatively, spaces can be converted to underscores with `--msa-format fasta_longlabels`.
Thanks for your feedback! Starting with raxml-ng 1.2.0, the default behavior is to truncate FASTA sequence names after the first space. It is also possible to replace spaces with underscores by adding |
Hi all
I just got this in the raxml-ng (version 0.7.0) output:
There are a couple of things that could be improved. You could tell the user which character is illegal in the name and give its offset. That would allow people to instantly find and fix these things themselves without having to puzzle over which of the chars is causing the error and where it is.
I know that will probably sound trivial and unnecessary, but I always think the programmer should go the extra yard to give the user the exact reason why something has failed.
The reason this caused me some pause is that I was given a "standard" FASTA file to run just now and it failed as above. I didn't see why because what I consider the 'id' in the sequences are the characters up to the first space. That (unfortunate) behaviour is standard in the bioinformatics world. I think it would make more sense for RAxML-NG to follow this approach too, and drop everything from the first space onwards. Of course this could result in duplicates and that can be checked for.
I personally think the convention of separating things at the first space is really unfortunate. But one has to live with the FASTA produced by other tools. RAxML-NG should probably do the same, IMO.
Anyway, just some friendly feedback! Thanks :-)
The text was updated successfully, but these errors were encountered: