Gptmd approach update #2419

Open
wants to merge 28 commits into
base: master

Conversation

trishorts
Contributor

@trishorts trishorts commented Sep 18, 2024

GPTMD is promiscuous in adding potential modifications to the XML database. This PR restricts the candidate modifications that are added to those that produce the highest score for each possible PTM. At a high level, the new algorithm is as follows:

  1. Perform a notch-based search.
  2. Find the modifications that match each notch.
  3. Find the motifs for each modification and create a corresponding PeptideWithSetModifications.
  4. Fragment each PeptideWithSetModifications and compute its MetaMorpheus score.
  5. Choose the subset of PeptideWithSetModifications with the highest score and add those localized modifications to the new XML.
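
Steps 2 through 5 above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `knownMods`, `MatchNotch`, `FindCandidates`, and `SelectBest` are hypothetical, motifs are simplified to a single residue, the scores are stand-in numbers rather than real MetaMorpheus scores, and keeping only the single top score (with ties) is an assumption about how the subset is chosen.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy modification list: (name, single-residue motif, monoisotopic mass shift in Da).
var knownMods = new List<(string Name, char Motif, double Mass)>
{
    ("Phospho", 'S', 79.96633),
    ("Phospho", 'T', 79.96633),
    ("Sulfo", 'Y', 79.95682),
    ("Acetyl", 'K', 42.01057),
};

// Step 2: modifications whose mass matches the observed mass shift (the notch) within a tolerance.
List<(string Name, char Motif, double Mass)> MatchNotch(double massShift, double toleranceDa) =>
    knownMods.Where(m => Math.Abs(m.Mass - massShift) <= toleranceDa).ToList();

// Step 3: every residue that satisfies a matching modification's motif becomes a candidate site.
List<(string Name, int OneBasedSite)> FindCandidates(string peptide, double massShift, double toleranceDa)
{
    var found = new List<(string Name, int OneBasedSite)>();
    foreach (var mod in MatchNotch(massShift, toleranceDa))
        for (int i = 0; i < peptide.Length; i++)
            if (peptide[i] == mod.Motif)
                found.Add((mod.Name, i + 1));
    return found;
}

// Steps 4 and 5: score each candidate with the supplied function and keep only the top scorers.
// Ties are kept (an assumption of this sketch).
List<(string Name, int OneBasedSite)> SelectBest(
    IEnumerable<(string Name, int OneBasedSite)> pool,
    Func<(string Name, int OneBasedSite), double> score)
{
    var scored = pool.Select(c => (Candidate: c, Score: score(c))).ToList();
    if (scored.Count == 0)
        return new List<(string Name, int OneBasedSite)>();
    double best = scored.Max(s => s.Score);
    return scored.Where(s => s.Score == best).Select(s => s.Candidate).ToList();
}

// Stand-in scores; a real score would come from matching theoretical fragments to the spectrum.
var toyScores = new Dictionary<(string, int), double>
{
    [("Phospho", 3)] = 12.1,
    [("Phospho", 5)] = 9.4,
};

var toyCandidates = FindCandidates("AGSLTK", 79.966, 0.005);
var kept = SelectBest(toyCandidates, c => toyScores[c]);
Console.WriteLine(string.Join(", ", kept.Select(k => $"{k.Name}@{k.OneBasedSite}")));
```

With these stand-in scores, both the serine (position 3) and threonine (position 5) are candidates for the phospho-sized notch, but only the serine site is kept, so one localized modification would be written to the database instead of two.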

For bottom-up:
six Mann A549 files with the human FASTA.
old method added 200513 mods; new method added 128449 mods
old method 102324 PSMs; new 103546
old 39283 peptides; new 39277
old 6042 proteins; new 6012

For top-down:
14 fractions x 2 technical replicates of Jurkat top-down files from the Sean Dai paper.
old method added 19188 mods; new method added 11013 mods
old method 23688 PSMs; new 24022
old 904 proteoforms; new 899
old 279 proteins; new 273
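
As a quick check of the counts above, the relative changes can be computed directly. The helper `PercentChange` is a hypothetical name used only for this illustration. The result is roughly 36% fewer mods added for bottom-up and 43% fewer for top-down, while PSM counts rise by a little over 1% in both cases.

```csharp
using System;

// Relative change from the old method to the new one, in percent.
double PercentChange(double oldValue, double newValue) =>
    (newValue - oldValue) / oldValue * 100.0;

// Counts copied from the bottom-up and top-down comparisons in the PR description.
double bottomUpMods = PercentChange(200513, 128449);
double bottomUpPsms = PercentChange(102324, 103546);
double topDownMods = PercentChange(19188, 11013);
double topDownPsms = PercentChange(23688, 24022);

Console.WriteLine($"Bottom-up: mods {bottomUpMods:F1}%, PSMs {bottomUpPsms:F1}%");
Console.WriteLine($"Top-down:  mods {topDownMods:F1}%, PSMs {topDownPsms:F1}%");
```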

Additional updates:

  1. Eliminated output of candidate PSMs in the .psmtsv
  2. The GPTMD database is now created with parallelization
  3. Eliminated PEP from the FDR analysis


codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.89%. Comparing base (8c8fe5f) to head (51b1c58).


@@            Coverage Diff             @@
##           master    #2419      +/-   ##
==========================================
+ Coverage   93.67%   93.89%   +0.21%     
==========================================
  Files         141      141              
  Lines       21925    22034     +109     
  Branches     3007     3020      +13     
==========================================
+ Hits        20539    20688     +149     
+ Misses        934      902      -32     
+ Partials      452      444       -8     
| Files with missing lines | Coverage Δ |
|---|---|
| MetaMorpheus/EngineLayer/Gptmd/GptmdEngine.cs | 97.19% <100.00%> (+7.85%) ⬆️ |
| MetaMorpheus/EngineLayer/MetaMorpheusEngine.cs | 92.26% <ø> (+4.22%) ⬆️ |
| ...ModificationAnalysis/ModificationAnalysisEngine.cs | 100.00% <100.00%> (+4.76%) ⬆️ |
| MetaMorpheus/TaskLayer/GPTMDTask/GPTMDTask.cs | 93.67% <100.00%> (+11.85%) ⬆️ |

... and 1 file with indirect coverage changes

@@ -62,18 +79,13 @@ protected override MyTaskResults RunSpecific(string OutputFolder, List<DbForTask
ProseCreatedWhileRunning.Append("precursor mass tolerance(s) = {" + tempSearchMode.ToProseString() + "}; ");

ProseCreatedWhileRunning.Append("product mass tolerance = " + CommonParameters.ProductMassTolerance + ". ");
ProseCreatedWhileRunning.Append("The combined search database contained " + proteinList.Count(p => !p.IsDecoy) + " non-decoy protein entries including " + proteinList.Where(p => p.IsContaminant).Count() + " contaminant sequences. ");
Contributor


Is there a reason we got rid of this prose line?


// if a variant protein and the mod is on the variant, index to the variant protein sequence
if (modIsOnVariant)
if (CommonParameters.DissociationType == DissociationType.Autodetect)
Contributor


In this line, if the dissociation type is Autodetect, we fragment with all possible fragmentation types. On line 120, if it is Autodetect, we fragment with the type indicated in the scan header. What is the reason for the two different approaches?

@@ -217,13 +217,13 @@ public static void TestModificationInfoListInProteinGroupsOutput()
int totalNumberOfMods = proteins.Sum(p => p.OneBasedPossibleLocalizedModifications.Count + p.SequenceVariations.Sum(sv => sv.OneBasedModifications.Count));

//tests that modifications are being done correctly
Assert.AreEqual(0, totalNumberOfMods);
Contributor


Changing an expected value directly under the comment "tests that modifications are being done correctly" is worrisome.

2 participants