Philosopher 4.1.1 generates empty files due to the incompatibility of the fasta file #537

Closed
fazeliniah opened this issue Nov 23, 2021 · 23 comments
@fazeliniah

Dear developer team,
I am trying to use the FASTA database generated in this paper: https://www.nature.com/articles/s41587-021-01021-3. In brief, the new FASTA file contains an additional ~320K proteins derived from different RNA sequencing data.
The v.16 of MSFragger handled these data very nicely. Unfortunately, I can't replicate the analysis in v.17.1, and I don't know why. The job finished without any errors, but the peptide and protein lists are empty. For validation I am using PeptideProphet (for a nonspecific search).
I have put the output of the analysis in here: https://www.dropbox.com/sh/ciq36i6shg79d6z/AAA1l-4QX1t5ZjJjXtbPmp91a?dl=0
Thank you again as always for your great program and support.

@fcyu fcyu transferred this issue from Nesvilab/MSFragger Nov 23, 2021
@fcyu
Member

fcyu commented Nov 23, 2021

Everything looks good except that there are no entries in the tsv files.

Felipe @prvst, can you take a look? They said that FragPipe 16, which implies Philosopher 4.0.0, worked well.

Thanks,

Fengchao

@prvst

prvst commented Nov 25, 2021

@fazeliniah are you running Philosopher v4.1.1?

@fazeliniah
Author

I am using Philosopher version 4.1.0

@fazeliniah
Author

I just tried the v.4.1.1 and got the same issue.

@prvst

prvst commented Dec 3, 2021

@fazeliniah your issue is somewhat related to a different situation reported a few weeks ago by a different person. Because someone was searching a database containing the same protein with slightly different headers, I had to include the protein description in the method that fetches information from the database annotation. The reason you see a problem with your search is that PeptideProphet also parses the protein description and replaces some characters with spaces. I included the same rule in Philosopher, and the update will be available in the upcoming release that I'm planning for next week.

INFO[16:00:38] 1+ Charge profile                             decoy=29 target=265
INFO[16:00:38] 2+ Charge profile                             decoy=146 target=3776
INFO[16:00:38] 3+ Charge profile                             decoy=114 target=3031
INFO[16:00:38] 4+ Charge profile                             decoy=25 target=628
INFO[16:00:38] 5+ Charge profile                             decoy=0 target=0
INFO[16:00:38] 6+ Charge profile                             decoy=0 target=0
INFO[16:00:38] Database search results                       ions=6159 peptides=5252 psms=8014
INFO[16:00:38] Converged to 1.00 % FDR with 6546 PSMs        decoy=66 threshold=0.7128 total=6612
INFO[16:00:38] Converged to 1.00 % FDR with 4081 Peptides    decoy=41 threshold=0.8093 total=4122
INFO[16:00:38] Converged to 0.99 % FDR with 4927 Ions        decoy=49 threshold=0.7409 total=4976
INFO[16:00:39] Protein inference results                     decoy=297 target=3657
INFO[16:00:39] Converged to 1.04 % FDR with 1819 Proteins    decoy=19 threshold=0.9706 total=1838
INFO[16:00:39] Applying sequential FDR estimation            ions=4733 peptides=3975 psms=6298
INFO[16:00:39] Converged to 0.39 % FDR with 6273 PSMs        decoy=25 threshold=0.7132 total=6298
INFO[16:00:39] Converged to 0.48 % FDR with 3956 Peptides    decoy=19 threshold=0.7132 total=3975
INFO[16:00:39] Converged to 0.40 % FDR with 4714 Ions        decoy=19 threshold=0.7132 total=4733
INFO[16:00:40] Post processing identifications              
INFO[16:00:43] Assigning protein identifications to layers  
INFO[16:00:46] Processing protein inference                 
INFO[16:02:26] Synchronizing PSMs and proteins              
INFO[16:02:26] Total report numbers after FDR filtering, and post-processing  ions=4613 peptides=3869 proteins=1742 psms=6150

Thanks for reporting this.

Added to v4.1.2

@prvst prvst closed this as completed Dec 3, 2021
@fcyu fcyu changed the title from "MF17.1 and HLA analysis" to "Philosopher 4.1.1 generates empty files due to the incompatibility of the fasta file" Dec 3, 2021
@fcyu fcyu pinned this issue Dec 3, 2021
@fcyu
Member

fcyu commented Dec 3, 2021

Hi Felipe @prvst ,

The interact-*.pep.xml from Percolator won't have such replacement. Will your changes break the Percolator related workflows?

BTW, what kind of characters does PeptideProphet replace?

Best,

Fengchao

@fcyu fcyu reopened this Dec 3, 2021
@prvst

prvst commented Dec 3, 2021

Sorry, I forgot about Percolator. If the parsing rules are different, then yes, it will break the logic. PeptideProphet replaces the pipe character ( | ) with a space. This is the only one I'm aware of at the moment; I don't know if the same thing happens with other special characters.
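
To make the failure mode concrete, here is a minimal Python sketch (not Philosopher's actual code, which is written in Go). It assumes a lookup keyed on the full header string and a single rewrite rule, pipe to space, applied to the description only. The header, the sequence, and the function name `rewrite_description` are all hypothetical.

```python
# Hypothetical FASTA header and sequence, made up for illustration.
fasta_header = "sp|P12345|TEST_HUMAN Example protein|variant 1"
sequence = "MKTAYIAK"


def rewrite_description(header: str) -> str:
    """Stand-in for a downstream tool that rewrites the description.

    Assumption: only the text after the first space is touched,
    and the only rule is pipe -> space.
    """
    protein_id, _, description = header.partition(" ")
    return f"{protein_id} {description.replace('|', ' ')}"


# Index keyed on the full, unmodified header as read from the FASTA file.
index = {fasta_header: sequence}

# Header as it would come back from the tool that rewrote the description.
reported = rewrite_description(fasta_header)

# Exact-match lookup fails because the two strings now differ.
found = index.get(reported)
```

Here `found` is `None`: the key written by the downstream tool no longer matches the key read from the FASTA file, which would be consistent with the empty report tables described in this issue.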

@fcyu
Member

fcyu commented Dec 3, 2021

Thanks for the info. I think we can do the same for Percolator. Let me see if I can find the code in PeptideProphet to get all of the characters to be replaced.

Best,

Fengchao

@fazeliniah
Author

Hi Fengchao and team,
Hope you all had a wonderful holiday.
I am just wondering if there is an update for this issue.
Thanks

@fcyu
Member

fcyu commented Jan 3, 2022

I guess you need to check with Felipe @prvst about the fixed Philosopher.

Best,

Fengchao

@prvst

prvst commented Jan 3, 2022

Please refer to my previous reply. PeptideProphet is replacing some special characters, like the pipe, with a space. You might want to avoid them, or use a standard format.

@fcyu You mentioned above that you would look at the PeptideProphet source code to find the special characters that are replaced. Did you make any progress on that?

@fcyu
Member

fcyu commented Jan 3, 2022

Hi Felipe @prvst ,

Yes, please check the code here https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Common/util.cpp#l533. The XMLEscape(const string& s) function is used by the RefreshParser.cpp: https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Parsers/RefreshParser/RefreshParser.cpp#l1660

However, I could not find any code replacing | with space, can you confirm that it is replaced?

BTW, I think it might not be a good idea to use the protein description as part of the ID. There are tools that modify or truncate the protein description in different ways when writing the results. You will not be able to map proteins back to the fasta file.
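
One way to follow this suggestion is to key lookups on the protein ID alone, that is, the first whitespace-delimited token of the header. The sketch below is an illustration in Python with made-up headers; `protein_id` is a hypothetical helper, not a Philosopher function.

```python
def protein_id(header: str) -> str:
    """Return the first whitespace-delimited token of a FASTA header line."""
    parts = header.lstrip(">").split(None, 1)
    return parts[0] if parts else ""


# Hypothetical header as stored in the FASTA file.
original = ">sp|P12345|TEST_HUMAN Example protein|variant 1 OS=Homo sapiens"

# The same entry after a tool has modified and truncated the description.
modified = "sp|P12345|TEST_HUMAN Example protein variant"

# Both forms reduce to the same key, so the mapping survives the rewrite.
same_key = protein_id(original) == protein_id(modified)
```

This only works when IDs are unique within the database. The earlier report mentioned in this thread, where the same protein appeared with slightly different headers, is the case where an ID-only key may not be enough by itself.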

Best,

Fengchao

@prvst

prvst commented Jan 3, 2022

However, I could not find any code replacing | with space, can you confirm that it is replaced?

Yes, the description is modified

@fcyu
Member

fcyu commented Jan 3, 2022

OK, actually, PeptideProphet does not replace |:

image

But ProteinProphet does:

image

I will read the ProteinProphet code then.

Best,

Fengchao

@guoci
Member

guoci commented Jan 3, 2022

It is here https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Validation/ProteinProphet/ProteinProphet.cpp#l7256

@fcyu
Member

fcyu commented Jan 3, 2022

Thanks @guoci, it does have more rules than replacing | with a space. I can add them to MSFragger so that downstream tools will not need to make any changes.

Best,

Fengchao

@fcyu
Member

fcyu commented Jan 3, 2022

Hi @fazeliniah ,

Can you re-analyze your data using this MSFragger (https://www.dropbox.com/s/xggvogvbqq7nmhf/MSFragger-3.5-rc8.zip?dl=0)? It will clean the protein description according to the rules used by ProteinProphet, which should avoid triggering Philosopher's bug.

Best,

Fengchao

@fcyu fcyu unpinned this issue Jan 3, 2022
@fcyu
Member

fcyu commented Jan 3, 2022

Sorry that I forgot one more thing.

With this change in MSFragger, we don't need to change Percolator and other tools because the protein descriptions have already been cleaned up at the very beginning (ProteinProphet won't change the protein descriptions anymore).

But Philosopher still needs to apply the same clean-up rules when loading the fasta file; otherwise, it will not be able to map the proteins in pep.xml back to the fasta file.
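
The idea of applying one set of rules on both sides can be sketched as follows. This is a Python illustration under stated assumptions: the rule table contains only the pipe-to-space replacement mentioned earlier in this thread, whereas the real list lives in ProteinProphet's cleanUpProteinDescription and is not reproduced here. The names `clean_description` and `header_key` and the headers are hypothetical.

```python
# Placeholder rule table. The full set of rules is in ProteinProphet.cpp
# and is NOT reproduced here; pipe -> space is the only one assumed.
RULES = {"|": " "}


def clean_description(description: str) -> str:
    """Apply every replacement rule to a protein description."""
    for old, new in RULES.items():
        description = description.replace(old, new)
    return description


def header_key(header: str):
    """Split a header into (protein ID, cleaned description)."""
    pid, _, desc = header.lstrip(">").partition(" ")
    return pid, clean_description(desc)


# Hypothetical FASTA header; the index is built with the cleaning applied.
fasta_headers = [">tx|0001 novel ORF|chr1"]
index = {header_key(h): h for h in fasta_headers}

# Header as an upstream tool might write it after its own cleaning step.
reported = "tx|0001 novel ORF chr1"

# The same function is applied to the reported header, so the keys agree.
match = index.get(header_key(reported))
```

Because both the FASTA loader and the result parser go through `header_key`, it does not matter whether the description arrives raw or already cleaned, as long as the rules are idempotent like the one assumed here.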

Felipe @prvst , can you make the changes according to the cleanUpProteinDescription function pointed out by Guo Ci, and send the fixed Philosopher?

Thanks,

Fengchao

@fazeliniah
Author

Hi Fengchao,
I tested MSFragger 3.4 and 3.5, and they both work nicely with our HLA peptidome project. The issue was related to our RNA-derived fasta database. The presence of new characters in the header (e.g. +, -, *, ~) and some duplicate sequences were the main issues. Thank you again for all your help.
Thanks
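
For anyone who wants to check a custom database for the two things named above, here is a rough Python sketch. It is only a diagnostic idea, not part of MSFragger or Philosopher; the character set is taken from the examples in this comment, and `read_fasta`, `audit`, and the sample records are hypothetical.

```python
from collections import defaultdict

# Characters listed as examples in the comment above.
SUSPECT_CHARS = set("+-*~")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def audit(records):
    """Return (headers with suspect characters, duplicated sequences)."""
    suspect = {}
    by_sequence = defaultdict(list)
    for header, seq in records:
        hits = sorted(SUSPECT_CHARS & set(header))
        if hits:
            suspect[header] = hits
        by_sequence[seq].append(header)
    duplicates = {s: h for s, h in by_sequence.items() if len(h) > 1}
    return suspect, duplicates


# Made-up records; pass an open file handle instead to audit a real database.
example = """>tx1 gene+ strand
MKT
AAA
>tx2 plain
MKTAAA
>tx3 other~
GGG
""".splitlines()

suspect, duplicates = audit(read_fasta(example))
```

A flagged header is not necessarily a problem: a hyphen, for instance, also appears in many standard headers. The output is meant as a list of entries to look at, not a verdict.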

@anesvi
Collaborator

anesvi commented Jan 24, 2022

I observed the same issue with standard search using GenCode database. Need to fix for the next release

@fcyu
Member

fcyu commented Dec 12, 2022

I observed the same issue with standard search using GenCode database. Need to fix for the next release

@prvst @anesvi Is it fixed?

Best,

Fengchao

@prvst

prvst commented Dec 12, 2022

fixed

@fcyu
Member

fcyu commented Dec 12, 2022

Thanks.

@fcyu fcyu closed this as completed Dec 12, 2022