Transcriptomics Digestion and Fragmentation (#801)
* Added in base classes

* Implemented all tests

* Made initial tests pass

* Removed unnecessary namespaces

* Expanded test coverage

* Responded to Alex's comments

* Add RNA support: loading, parsing, and decoy generation

Introduced support for handling RNA data within the UsefulProteomicsDatabases project. Key changes include:

- Added `Transcriptomics\TestData` folder to `Test.csproj`.
- Changed access modifiers in `ProteinDbLoader.cs` to internal.
- Added `using` directives for `Transcriptomics` in `ProteinXmlEntry.cs`.
- Introduced methods `ParseRnaEndElement` and `ParseRnaEntryEndElement` in `ProteinXmlEntry.cs`.
- Modified `ParseAnnotatedMods` to check for RNA modifications.
- Added project reference to `Transcriptomics.csproj` in `UsefulProteomicsDatabases.csproj`.
- Added `ClassExtensions.cs` with `CreateNew` method for nucleic acids.
- Added `RnaDbLoader.cs` for RNA database loading.
- Added `RnaDecoyGenerator.cs` for generating decoy RNA sequences.

* Add new properties and caching to oligo digestion

Updated `using` directives in `TestDigestion.cs` and `OligoWithSetMods.cs` to include necessary namespaces. Added assertions in `TestDigestion.cs` for `SequenceWithChemicalFormulas` and `FullSequenceWithMassShift`. Changed `namespace` in `OligoWithSetMods.cs` to `Transcriptomics.Digestion`. Implemented and cached `SequenceWithChemicalFormulas` property in `OligoWithSetMods.cs`.

* Add RNA sequence and database handling and related test cases

- Added new files `ModomicsUnmodifiedTrimmed.fasta` and `ModomicsUnmodifiedTrimmed.fasta.gz` to `Test.csproj` with `CopyToOutputDirectory` set to `PreserveNewest`.
- Removed the `Transcriptomics\TestData` folder from `Test.csproj`.
- Introduced `Transcribe` method in `ClassExtensions.cs` for DNA to RNA transcription.
- Added summary comment to `NucleolyticOligo` class in `NucleolyticOligo.cs`.
- Added `ApplyRegex` method in `FastaHeaderFieldRegex.cs`.
- Introduced `ProteinDbWriter` class in `ProteinDbWriter.cs` for writing protein and nucleic acid databases.
- Modified `GetModsForThisProtein` to `GetModsForThisBioPolymer` in `ProteinDbWriter.cs`.
- Added `RnaDbLoader` class in `RnaDbLoader.cs` for RNA FASTA header detection and sequence loading.
- Updated user dictionary in `mzLib.sln.DotSettings` with new terms.
- Added test cases in `TestDbLoader.cs` for RNA database loading and header detection.
- Introduced `TestDecoyGeneration` class in `TestDecoyGenerator.cs` for RNA decoy generation tests.
- Added RNA sequence file `ModomicsUnmodifiedTrimmed.fasta` and its compressed version.

* Refactor and enhance RNA and oligo handling in tests

- Added `using` directives for `Transcriptomics.Digestion` and `UsefulProteomicsDatabases.Transcriptomics` in `TestDecoyGenerator.cs`.
- Introduced `TestCreateNew` in `TestDecoyGenerator.cs` to verify RNA and oligo creation.
- Added `using` directive for `MzLibUtil` in `TestDigestion.cs`.
- Added a test in `TestDigestion.cs` for exception handling with invalid sequences.
- Added `using` directives for `Omics` and related namespaces in `TestFragmentation.cs`.
- Modified `TestFragmentation_Modified` in `TestFragmentation.cs` to use `OligoWithSetMods` directly and added assertions.
- Updated `ClassExtensions.cs` to allow setting `isDecoy` in new `RNA` objects.
- Refactored `OligoWithSetMods.cs` to return a dictionary from `GetModsAfterDeserialization`.
- Updated `OligoWithSetMods.cs` to initialize `_allModsOneIsNterminus` using the returned dictionary.

* Broke out TerminusSpecificProductTypes class and removed unnecessary namespaces

* Update ProteinXmlEntry.cs

* Added gene name to RNA constructor

* Added gene name to RNA constructor

* Refactor and enhance exception handling and tests

Refactored constructors, improved exception handling, and added comprehensive tests across multiple files. Key changes include:

- `MzLibException.cs`: Updated constructor to include `innerException`.
- `TestDecoyGenerator.cs`: Added assertions for `CreateNew` method.
- `TestDigestion.cs`: Added assertions and new test for RNA digestion exception.
- Refactored modification lists and added various tests for modifications.
- `TestNucleicAcid.cs`: Refactored methods, adjusted precision, and updated terminus assignments.
- `NucleolyticOligo.cs`: Changed parameter types, updated comments, and improved variable names.
- `OligoWithSetMods.cs`: Enhanced exception messages and updated modification location checks.
- `NucleicAcid.cs`: Added `using` directive, changed exception type, and refactored methods.
- `mzLib.sln.DotSettings`: Updated user dictionary entries.

* Add test data files and methods for RNA sequence handling

Added new test data files (`20mer1.fasta`, `20mer1.fasta.gz`, `20mer1.xml`, `20mer1.xml.gz`) to the `Transcriptomics\TestData` directory in the `Test.csproj` file, ensuring they are copied to the output directory. Introduced `TestDbReadingDifferentExtensions` in `TestDbLoader.cs` to verify RNA database reading from various formats. Added `TestDigestionMaxIsoforms` in `TestDigestion.cs` to test RNA sequence digestion with max isoforms. Updated `WriteNucleicAcidXmlDatabase` in `ProteinDbWriter.cs` with remarks for future implementation. Added a TODO in `RnaDecoyGenerator.cs` regarding palindromic sequences' impact on fragment ions. Included new RNA sequence data in test files for validation.

* Added test coverage to the localize method within BioPolymerWithSetMods

---------

Co-authored-by: Nic Bollis <nbollis@wisc.edu>
nbollis and Nic Bollis authored Oct 15, 2024
1 parent 983c3b0 commit 6c18e9f
Showing 40 changed files with 4,678 additions and 62 deletions.
1 change: 1 addition & 0 deletions mzLib/Chemistry/ClassExtensions.cs
@@ -48,6 +48,7 @@ public static double ToMass(this double massToChargeRatio, int charge)
             return Math.Abs(charge) * massToChargeRatio - charge * Constants.ProtonMass;
         }
 
+        public static double? RoundedDouble(this double myNumber, int places = 9) => RoundedDouble(myNumber as double?, places);
         public static double? RoundedDouble(this double? myNumber, int places = 9)
         {
             if (myNumber != null)
5 changes: 5 additions & 0 deletions mzLib/MassSpectrometry/Enums/DissociationType.cs
@@ -109,6 +109,11 @@ public enum DissociationType
         /// </summary>
         LowCID,
 
+        /// <summary>
+        /// activated ion electron photo detachment dissociation
+        /// </summary>
+        aEPD,
+
         Unknown,
         AnyActivationType,
         Custom,
9 changes: 2 additions & 7 deletions mzLib/MzLibUtil/MzLibException.cs
@@ -3,11 +3,6 @@
 namespace MzLibUtil
 {
     [Serializable]
-    public class MzLibException : Exception
-    {
-        public MzLibException(string message)
-            : base(message)
-        {
-        }
-    }
+    public class MzLibException(string message, Exception innerException = null)
+        : Exception(message, innerException);
 }
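The optional `innerException` parameter lets a caller wrap a lower-level failure without discarding it. The following is a minimal sketch, assuming C# 12 primary constructors as used in the hunk above. The `ParseLength` helper and its message text are hypothetical and not part of mzLib, and the exception class is redeclared here only so the sketch compiles on its own.

```csharp
using System;

try
{
    ParseLength("twenty");
}
catch (MzLibException e)
{
    // The original FormatException stays reachable through InnerException.
    Console.WriteLine($"{e.Message} (caused by {e.InnerException?.GetType().Name})");
}

// Hypothetical helper: wraps a parsing failure in an MzLibException.
static int ParseLength(string text)
{
    try
    {
        return int.Parse(text);
    }
    catch (FormatException e)
    {
        throw new MzLibException($"Could not parse sequence length '{text}'", e);
    }
}

// Stand-in mirroring the class from the diff so the sketch is self-contained.
public class MzLibException(string message, Exception innerException = null)
    : Exception(message, innerException);
```

Because the parameter defaults to `null`, existing call sites that pass only a message should keep compiling unchanged.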
19 changes: 6 additions & 13 deletions mzLib/Omics/Fragmentation/FragmentationTerminus.cs
@@ -1,19 +1,12 @@
-using System;
-using System.Collections.Generic;
-using System.Linq;
-using System.Text;
-using System.Threading.Tasks;
-
-namespace Omics.Fragmentation
+namespace Omics.Fragmentation
 {
     public enum FragmentationTerminus
-    {
-        Both, //N- and C-terminus
-        N, //N-terminus only
-        C, //C-terminus only
+    {
+        Both, //N- and C-terminus
+        N, //N-terminus only
+        C, //C-terminus only
         None, //used for internal fragments, could be used for top down intact mass?
         FivePrime, // 5' for NucleicAcids
         ThreePrime, // 3' for NucleicAcids
     }
-
-}
+}
162 changes: 161 additions & 1 deletion mzLib/Omics/Fragmentation/Oligo/DissociationTypeCollection.cs

Large diffs are not rendered by default.

141 changes: 141 additions & 0 deletions mzLib/Omics/Fragmentation/Oligo/TerminusSpecificProductTypes.cs
@@ -0,0 +1,141 @@
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Omics.Fragmentation.Oligo
{
public static class TerminusSpecificProductTypes
{
public static List<ProductType> GetRnaTerminusSpecificProductTypes(
this FragmentationTerminus fragmentationTerminus)
{
return ProductIonTypesFromSpecifiedTerminus[fragmentationTerminus];
}

/// <summary>
/// The types of ions that can be generated from an oligo fragment, based on the terminus of the fragment
/// </summary>
public static Dictionary<FragmentationTerminus, List<ProductType>> ProductIonTypesFromSpecifiedTerminus = new Dictionary<FragmentationTerminus, List<ProductType>>
{
{
FragmentationTerminus.FivePrime, new List<ProductType>
{
ProductType.a, ProductType.aWaterLoss, ProductType.aBaseLoss,
ProductType.b, ProductType.bWaterLoss, ProductType.bBaseLoss,
ProductType.c, ProductType.cWaterLoss, ProductType.cBaseLoss,
ProductType.d, ProductType.dWaterLoss, ProductType.dBaseLoss,
}
},
{
FragmentationTerminus.ThreePrime, new List<ProductType>
{
ProductType.w, ProductType.wWaterLoss, ProductType.wBaseLoss,
ProductType.x, ProductType.xWaterLoss, ProductType.xBaseLoss,
ProductType.y, ProductType.yWaterLoss, ProductType.yBaseLoss,
ProductType.z, ProductType.zWaterLoss, ProductType.zBaseLoss,
}
},
{
FragmentationTerminus.Both, new List<ProductType>
{

ProductType.a, ProductType.aWaterLoss, ProductType.aBaseLoss,
ProductType.b, ProductType.bWaterLoss, ProductType.bBaseLoss,
ProductType.c, ProductType.cWaterLoss, ProductType.cBaseLoss,
ProductType.d, ProductType.dWaterLoss, ProductType.dBaseLoss,
ProductType.w, ProductType.wWaterLoss, ProductType.wBaseLoss,
ProductType.x, ProductType.xWaterLoss, ProductType.xBaseLoss,
ProductType.y, ProductType.yWaterLoss, ProductType.yBaseLoss,
ProductType.z, ProductType.zWaterLoss, ProductType.zBaseLoss,
ProductType.M
}

},
{
FragmentationTerminus.None, new List<ProductType>()
}
};


public static FragmentationTerminus GetRnaTerminusType(this ProductType fragmentType)
{
switch (fragmentType)
{
case ProductType.a:
case ProductType.aWaterLoss:
case ProductType.aBaseLoss:
case ProductType.b:
case ProductType.bWaterLoss:
case ProductType.bBaseLoss:
case ProductType.c:
case ProductType.cWaterLoss:
case ProductType.cBaseLoss:
case ProductType.d:
case ProductType.dWaterLoss:
case ProductType.dBaseLoss:
case ProductType.w:
case ProductType.wWaterLoss:
case ProductType.wBaseLoss:
case ProductType.x:
case ProductType.xWaterLoss:
case ProductType.xBaseLoss:
case ProductType.y:
case ProductType.yWaterLoss:
case ProductType.yBaseLoss:
case ProductType.z:
case ProductType.zWaterLoss:
case ProductType.zBaseLoss:
case ProductType.M:
return ProductTypeToFragmentationTerminus[fragmentType];

case ProductType.aStar:
case ProductType.aDegree:
case ProductType.bAmmoniaLoss:
case ProductType.yAmmoniaLoss:
case ProductType.zPlusOne:
case ProductType.D:
case ProductType.Ycore:
case ProductType.Y:
default:
throw new ArgumentOutOfRangeException(nameof(fragmentType), fragmentType, null);
}
}


/// <summary>
/// The terminus of the oligo fragment that the product ion is generated from
/// </summary>
public static Dictionary<ProductType, FragmentationTerminus> ProductTypeToFragmentationTerminus = new Dictionary<ProductType, FragmentationTerminus>
{
{ ProductType.a, FragmentationTerminus.FivePrime },
{ ProductType.aWaterLoss, FragmentationTerminus.FivePrime },
{ ProductType.aBaseLoss, FragmentationTerminus.FivePrime },
{ ProductType.b, FragmentationTerminus.FivePrime },
{ ProductType.bWaterLoss, FragmentationTerminus.FivePrime },
{ ProductType.bBaseLoss, FragmentationTerminus.FivePrime },
{ ProductType.c, FragmentationTerminus.FivePrime },
{ ProductType.cWaterLoss, FragmentationTerminus.FivePrime },
{ ProductType.cBaseLoss, FragmentationTerminus.FivePrime },
{ ProductType.d, FragmentationTerminus.FivePrime },
{ ProductType.dWaterLoss, FragmentationTerminus.FivePrime },
{ ProductType.dBaseLoss, FragmentationTerminus.FivePrime },

{ ProductType.w, FragmentationTerminus.ThreePrime },
{ ProductType.wWaterLoss, FragmentationTerminus.ThreePrime },
{ ProductType.wBaseLoss, FragmentationTerminus.ThreePrime },
{ ProductType.x, FragmentationTerminus.ThreePrime },
{ ProductType.xWaterLoss, FragmentationTerminus.ThreePrime },
{ ProductType.xBaseLoss, FragmentationTerminus.ThreePrime },
{ ProductType.y, FragmentationTerminus.ThreePrime },
{ ProductType.yWaterLoss, FragmentationTerminus.ThreePrime },
{ ProductType.yBaseLoss, FragmentationTerminus.ThreePrime },
{ ProductType.z, FragmentationTerminus.ThreePrime },
{ ProductType.zWaterLoss, FragmentationTerminus.ThreePrime },
{ ProductType.zBaseLoss, FragmentationTerminus.ThreePrime },

{ ProductType.M, FragmentationTerminus.Both }
};
}
}
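The two lookups in this file run in opposite directions: one maps a terminus to the ion types it can produce, the other maps an ion type back to its terminus. The sketch below shows how both extension methods might be called. It uses trimmed stand-in enums and a hypothetical `TerminusLookupSketch` class so it can compile without mzLib; the real enums have many more members, and the real reverse lookup uses a `switch` in front of the dictionary.

```csharp
using System;
using System.Collections.Generic;

// Reverse lookup: which terminus does a w ion come from?
Console.WriteLine(ProductType.w.GetRnaTerminusType());

// Forward lookup: which ion types can a 5' fragment produce?
foreach (var type in FragmentationTerminus.FivePrime.GetRnaTerminusSpecificProductTypes())
    Console.WriteLine(type);

// Trimmed stand-ins for the mzLib enums (assumption: only a few members shown).
public enum FragmentationTerminus { Both, None, FivePrime, ThreePrime }
public enum ProductType { a, d, w, y, M }

// Hypothetical class with the same shape as TerminusSpecificProductTypes.
public static class TerminusLookupSketch
{
    private static readonly Dictionary<FragmentationTerminus, List<ProductType>> TypesByTerminus = new()
    {
        { FragmentationTerminus.FivePrime, new List<ProductType> { ProductType.a, ProductType.d } },
        { FragmentationTerminus.ThreePrime, new List<ProductType> { ProductType.w, ProductType.y } },
        { FragmentationTerminus.None, new List<ProductType>() },
    };

    private static readonly Dictionary<ProductType, FragmentationTerminus> TerminusByType = new()
    {
        { ProductType.a, FragmentationTerminus.FivePrime },
        { ProductType.d, FragmentationTerminus.FivePrime },
        { ProductType.w, FragmentationTerminus.ThreePrime },
        { ProductType.y, FragmentationTerminus.ThreePrime },
        { ProductType.M, FragmentationTerminus.Both },
    };

    public static List<ProductType> GetRnaTerminusSpecificProductTypes(this FragmentationTerminus terminus)
        => TypesByTerminus[terminus];

    // Unknown ion types are rejected, as in the original method.
    public static FragmentationTerminus GetRnaTerminusType(this ProductType type)
        => TerminusByType.TryGetValue(type, out var terminus)
            ? terminus
            : throw new ArgumentOutOfRangeException(nameof(type), type, null);
}
```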
11 changes: 10 additions & 1 deletion mzLib/Omics/IBioPolymerWithSetMods.cs
@@ -50,7 +50,16 @@ public void Fragment(DissociationType dissociationType, FragmentationTerminus fr
         public void FragmentInternally(DissociationType dissociationType, int minLengthOfFragments,
             List<Product> products);
 
-        public IBioPolymerWithSetMods Localize(int j, double massToLocalize);
+        /// <summary>
+        /// Outputs a duplicate IBioPolymerWithSetMods with a localized mass shift, replacing a modification when present
+        /// <remarks>
+        /// Used to localize an unknown mass shift in the MetaMorpheus Localization Engine
+        /// </remarks>
+        /// </summary>
+        /// <param name="indexOfMass">The index of the modification in the AllModOneIsNTerminus Dictionary - 2 (idk why -2)</param>
+        /// <param name="massToLocalize">The mass to add to the BioPolymer</param>
+        /// <returns></returns>
+        public IBioPolymerWithSetMods Localize(int indexOfMass, double massToLocalize);
 
         public static string GetBaseSequenceFromFullSequence(string fullSequence)
         {
@@ -613,17 +613,17 @@ public void FragmentInternally(DissociationType dissociationType, int minLengthO
             }
         }
 
-        public IBioPolymerWithSetMods Localize(int j, double massToLocalize)
+        public IBioPolymerWithSetMods Localize(int indexOfMass, double massToLocalize)
         {
             var dictWithLocalizedMass = new Dictionary<int, Modification>(AllModsOneIsNterminus);
             double massOfExistingMod = 0;
-            if (dictWithLocalizedMass.TryGetValue(j + 2, out Modification modToReplace))
+            if (dictWithLocalizedMass.TryGetValue(indexOfMass + 2, out Modification modToReplace))
             {
                 massOfExistingMod = (double)modToReplace.MonoisotopicMass;
-                dictWithLocalizedMass.Remove(j + 2);
+                dictWithLocalizedMass.Remove(indexOfMass + 2);
             }
 
-            dictWithLocalizedMass.Add(j + 2, new Modification(_locationRestriction: "Anywhere.", _monoisotopicMass: massToLocalize + massOfExistingMod));
+            dictWithLocalizedMass.Add(indexOfMass + 2, new Modification(_locationRestriction: "Anywhere.", _monoisotopicMass: massToLocalize + massOfExistingMod));
 
             var peptideWithLocalizedMass = new PeptideWithSetModifications(Protein, _digestionParams, OneBasedStartResidueInProtein, OneBasedEndResidueInProtein,
                 CleavageSpecificityForFdrCategory, PeptideDescription, MissedCleavages, dictWithLocalizedMass, NumFixedMods);
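The `+ 2` offset in `Localize` is easier to follow with the dictionary keys written out. The sketch below is a simplified stand-in, not the mzLib implementation: it stores plain masses instead of `Modification` objects, and it assumes the key convention suggested by the name `AllModsOneIsNterminus` (key 1 is the N-terminus, key 2 the first residue), so that a zero-based residue index maps to `index + 2`.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for AllModsOneIsNterminus: key -> modification mass.
// Assumed convention: 1 = N-terminus, 2 = first residue, 3 = second residue, ...
var mods = new Dictionary<int, double> { { 3, 15.9949 } };

// Localize 10.0 Da onto the second residue (zero-based index 1, dictionary key 3).
var localized = Localize(mods, indexOfMass: 1, massToLocalize: 10.0);
Console.WriteLine(localized[3]);   // existing mass and localized mass, combined
Console.WriteLine(mods[3]);        // the input dictionary is left untouched

// Hypothetical helper mirroring the logic of the hunk above.
static Dictionary<int, double> Localize(
    Dictionary<int, double> existing, int indexOfMass, double massToLocalize)
{
    var copy = new Dictionary<int, double>(existing);
    int key = indexOfMass + 2;

    // If a modification already sits at this position, fold its mass into the new entry.
    double massOfExistingMod = 0;
    if (copy.TryGetValue(key, out double existingMass))
    {
        massOfExistingMod = existingMass;
        copy.Remove(key);
    }

    copy.Add(key, massToLocalize + massOfExistingMod);
    return copy;
}
```

Working on a copy matches the documented behaviour of returning a duplicate rather than mutating the original.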
18 changes: 18 additions & 0 deletions mzLib/Test/Test.csproj
@@ -494,6 +494,24 @@
     </None>
     <None Update="FileReadingTests\SearchResults\VariantCrossTest.psmtsv">
       <CopyToOutputDirectory>Always</CopyToOutputDirectory>
     </None>
+    <None Update="Transcriptomics\TestData\20mer1.fasta">
+      <CopyToOutputDirectory>Always</CopyToOutputDirectory>
+    </None>
+    <None Update="Transcriptomics\TestData\20mer1.fasta.gz">
+      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
+    </None>
+    <None Update="Transcriptomics\TestData\20mer1.xml">
+      <CopyToOutputDirectory>Always</CopyToOutputDirectory>
+    </None>
+    <None Update="Transcriptomics\TestData\20mer1.xml.gz">
+      <CopyToOutputDirectory>Always</CopyToOutputDirectory>
+    </None>
+    <None Update="Transcriptomics\TestData\ModomicsUnmodifiedTrimmed.fasta">
+      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
+    </None>
+    <None Update="Transcriptomics\TestData\ModomicsUnmodifiedTrimmed.fasta.gz">
+      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
+    </None>
     <None Update="DataFiles\centroid_1x_MS1_4x_autoMS2.d\**">
       <CopyToOutputDirectory>Always</CopyToOutputDirectory>
2 changes: 2 additions & 0 deletions mzLib/Test/Transcriptomics/TestData/20mer1.fasta
@@ -0,0 +1,2 @@
>id:2|Name:20mer1|SOterm:20mer1|Type:tRNA|Subtype:Ala|Feature:VGC|Cellular_Localization:freezer|Species:standard
GUACUGCCUCUAGUGAAGCA
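The header line in this test file is a pipe-delimited list of `key:value` fields. As a rough illustration of how such a header could be split, here is a minimal sketch; the `ParseHeader` helper is hypothetical and is not how `RnaDbLoader` does it (per the commit message, the loader relies on `FastaHeaderFieldRegex`).

```csharp
using System;
using System.Collections.Generic;

var header = ">id:2|Name:20mer1|SOterm:20mer1|Type:tRNA|Subtype:Ala|Feature:VGC|Cellular_Localization:freezer|Species:standard";

foreach (var (key, value) in ParseHeader(header))
    Console.WriteLine($"{key} = {value}");

// Hypothetical helper: splits a header of this shape into a field dictionary.
static Dictionary<string, string> ParseHeader(string headerLine)
{
    var fields = new Dictionary<string, string>();
    foreach (var field in headerLine.TrimStart('>').Split('|'))
    {
        // Split on the first colon only, since values such as "SO:0000254" contain colons.
        var parts = field.Split(':', 2);
        if (parts.Length == 2)
            fields[parts[0]] = parts[1];
    }
    return fields;
}
```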
Binary file added mzLib/Test/Transcriptomics/TestData/20mer1.fasta.gz
Binary file not shown.
17 changes: 17 additions & 0 deletions mzLib/Test/Transcriptomics/TestData/20mer1.xml
@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="utf-8"?>
<mzLibProteinDb>
<entry>
<accession>20mer1</accession>
<name>20mer1</name>
<protein>
<recommendedName>
<fullName>20mer1</fullName>
</recommendedName>
</protein>
<gene />
<organism>
<name type="scientific">standard</name>
</organism>
<sequence length="20">GUACUGCCUCUAGUGAAGCA</sequence>
</entry>
</mzLibProteinDb>
Binary file added mzLib/Test/Transcriptomics/TestData/20mer1.xml.gz
Binary file not shown.
@@ -0,0 +1,10 @@
>id:1|Name:tdbR00000010|SOterm:SO:0000254|Type:tRNA|Subtype:Ala|Feature:VGC|Cellular_Localization:prokaryotic cytosol|Species:Escherichia coli
GGGGCUAUAGCUCAGCUGGGAGAGCGCCUGCUUUGCACGCAGGAGGUCUGCGGUUCGAUCCCGCAUAGCUCCACCA
>id:2|Name:tdbR00000008|SOterm:SO:0000254|Type:tRNA|Subtype:Ala|Feature:GGC|Cellular_Localization:prokaryotic cytosol|Species:Escherichia coli
GGGGCUAUAGCUCAGCUGGGAGAGCGCUUGCAUGGCAUGCAAGAGGUCAGCGGUUCGAUCCCGCUUAGCUCCACCA
>id:3|Name:tdbR00000356|SOterm:SO:0001036|Type:tRNA|Subtype:Arg|Feature:ICG|Cellular_Localization:prokaryotic cytosol|Species:Escherichia coli
GCAUCCGUAGCUCAGCUGGAUAGAGUACUCGGCUACGAACCGAGCGGUCGGAGGUUCGAAUCCUCCCGGAUGCACCA
>id:4|Name:tdbR00000359|SOterm:SO:0001036|Type:tRNA|Subtype:Arg|Feature:CCG|Cellular_Localization:prokaryotic cytosol|Species:Escherichia coli
GCGCCCGUAGCUCAGCUGGAUAGAGCGCUGCCCUCCGGAGGCAGAGGUCUCAGGUUCGAAUCCUGUCGGGCGCGCCA
>id:5|Name:tdbR00000358|SOterm:SO:0001036|Type:tRNA|Subtype:Arg|Feature:UCU|Cellular_Localization:prokaryotic cytosol|Species:Escherichia coli
GCGCCCUUAGCUCAGUUGGAUAGAGCAACGACCUUCUAAGUCGUGGGCCGCAGGUUCGAAUCCUGCAGGGCGCGCCA
Binary file not shown.