External db refactor (#48)

* renamed
* INFO
* renamed from_ncbi to `get_id`
* renamed
* removed unused arg in get_ncbi_entry
* removed get_ncbi_entry()
* cleaning
* cleaning
* working tax fetcher
* removed old
* added mocker as dev dependency
* added tests for NCBITaxonomyFetcher
* finished
* introduced fetch method
* introduced fetcher
* cleaned
* working
* refactored NCBIProteinFetcher
* deleted old test notebooks
* added fetchers to structure
* typos and new syntax
* added MMSEQS2 to ToolImages
* added mmseqs2
* fixed wrong logging level
* added deprecation
* to_faster writes file
* updated notebooks
PyEED · Mar 11, 2024 · 2635094
1 parent bb7498e

Showing 33 changed files with 5,114 additions and 4,885 deletions.
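
A rename such as `from_ncbi` to `get_id` is often paired with a deprecated alias so that existing callers keep working while being pointed to the new name. The commit message mentions an added deprecation, but the diff shown here does not reveal how PyEED implements it. The following is only a minimal sketch of the general pattern: `SequenceRecord`, its `_FAKE_DB` lookup table, and the placeholder sequence are hypothetical stand-ins, not PyEED's `ProteinInfo` or real NCBI data.

```python
import warnings


class SequenceRecord:
    """Hypothetical stand-in for a sequence class (not PyEED's ProteinInfo)."""

    # Hypothetical in-memory lookup so the sketch runs offline.
    # The sequence is a placeholder, not the real entry for this accession.
    _FAKE_DB = {"QGC48744.1": "MTEITAAMVK"}

    def __init__(self, source_id, sequence):
        self.source_id = source_id
        self.sequence = sequence

    @classmethod
    def get_id(cls, accession):
        """New entry point: build one record from a single accession."""
        return cls(accession, cls._FAKE_DB[accession])

    @classmethod
    def get_ids(cls, accessions):
        """New entry point for a list of accessions."""
        return [cls.get_id(accession) for accession in accessions]

    @classmethod
    def from_ncbi(cls, accession):
        """Old name, kept as a thin alias that warns and forwards."""
        warnings.warn(
            "from_ncbi() is deprecated; use get_id() instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return cls.get_id(accession)
```

With this pattern, old call sites still return the same object as `get_id()`, and the `DeprecationWarning` shows up in test runs so the remaining uses can be migrated gradually.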
diff --git a/.github/workflows/integration.yaml b/.github/workflows/integration.yaml
@@ -1,6 +1,6 @@
 name: Integration Tests (MySQL)
 
-on: [push]
+on: [release]
 
 jobs:
   test:

diff --git a/docs/quick_start/alignments.md b/docs/quick_start/alignments.md
@@ -14,14 +14,14 @@ Both alignment objects contain the following attributes:
 Besides the `Alignment` object, PyEED also provides a `PairwiseAlignment` object, containing the alignment score, identity, similarity, gaps, and mismatches of a pairwise alignment.
 
 
-Before running the alignment, an `Alignment` needs to be created. This can be done by passing a list of `ProteinInfo` or `DNAInfo` objects to the constructor. The alignment can then be run by calling the `align()` method, passing the alignment method as an argument. The method returns the alignment object, containing the aligned sequences.
+Before running the alignment, an `Alignment` needs to be created. This can be done by passing a list of `ProteinInfo` or `DNAInfo` objects to the constructor. The alignment can then be run by calling the `align()` method and passing the alignment method as an argument. The method returns the alignment object, containing the aligned sequences.
 
 ``` py
 from pyeed.core import ProteinInfo, Alignment
 
 # Get two ProteinInfo objects
-tem1 = ProteinInfo.from_ncbi("QGC48744.1")
-tem109 = ProteinInfo.from_ncbi("AAT46413.1")
+tem1 = ProteinInfo.get_id("QGC48744.1")
+tem109 = ProteinInfo.get_id("AAT46413.1")
 
 # Create an Alignment
 alignment = Alignment(input_sequences=[tem1, tem109])
@@ -33,8 +33,8 @@ Alternatively, the `from_sequences()` class method can be used to create an alig
 from pyeed.core import ProteinInfo, Alignment
 
 # Get two ProteinInfo objects
-tem1 = ProteinInfo.from_ncbi("QGC48744.1")
-tem109 = ProteinInfo.from_ncbi("AAT46413.1")
+tem1 = ProteinInfo.get_id("QGC48744.1")
+tem109 = ProteinInfo.get_id("AAT46413.1")
 list_of_sequences = [tem1, tem109]
 
 # Create an Alignment
@@ -50,8 +50,8 @@ alignment = Alignment.from_sequences(list_of_sequences)
     from pyeed.aligners import PairwiseAligner
 
     # Get two ProteinInfo objects
-    tem1 = ProteinInfo.from_ncbi("QGC48744.1")
-    tem109 = ProteinInfo.from_ncbi("AAT46413.1")
+    tem1 = ProteinInfo.get_id("QGC48744.1")
+    tem109 = ProteinInfo.get_id("AAT46413.1")
 
     # Create and run alignment
     alignment = PairwiseAlignment([tem1, tem109], aligner=PairwiseAligner, mode="local")
@@ -63,8 +63,8 @@ alignment = Alignment.from_sequences(list_of_sequences)
     from pyeed.aligners import PairwiseAligner
 
     # Get two ProteinInfo objects
-    tem1 = ProteinInfo.from_ncbi("QGC48744.1")
-    tem109 = ProteinInfo.from_ncbi("AAT46413.1")
+    tem1 = ProteinInfo.get_id("QGC48744.1")
+    tem109 = ProteinInfo.get_id("AAT46413.1")
 
     # Create and run alignment
     alignment = PairwiseAlignment([tem1, tem109], aligner=PairwiseAligner, mode="global")
@@ -80,7 +80,7 @@ alignment = Alignment.from_sequences(list_of_sequences)
 
     # Get sequences
     ncbi_accessions = ["QGC48744.1", "AAT46413.1", "AAT46414.1", "AAT46415.1"]
-    sequences = ProteinInfo.from_ncbi(ncbi_accessions)
+    sequences = ProteinInfo.get_ids(ncbi_accessions)
 
     # Create and run alignment
     alignment = Alignment.from_sequences(sequences, aligner=PairwiseAligner, mode="global")
@@ -99,7 +99,7 @@ Most sequence alignment tools are implemented as command line tools, which need
 
     # Get sequences
     ncbi_accessions = ["QGC48744.1", "AAT46413.1", "AAT46414.1", "AAT46415.1"]
-    sequences = ProteinInfo.from_ncbi(ncbi_accessions)
+    sequences = ProteinInfo.get_ids(ncbi_accessions)
 
     # Create and run alignment
     alignment = Alignment.from_sequences(sequences, aligner=ClustalOmega)

diff --git a/docs/quick_start/basics.md b/docs/quick_start/basics.md
@@ -7,34 +7,34 @@ A sequence object can be created by passing a sequence string to the constructor
 === "Protein"
 
     ``` py
-    from pyEED.core import ProteinInfo
+    from pyeed.core import ProteinInfo
 
     protein = ProteinInfo(sequence="MTEITAAMVKELREDKAVQLLREKGLGK")
     ```
 
 === "DNA"
 
     ``` py
-    from pyEED.core import DNAInfo
+    from pyeed.core import DNAInfo
 
     dna = DNAInfo(sequence="ATGCGTACGTCGATCGATCGATCGATCGATCGATCGATCGATCGTAGTC")
     ```
 
 
 ## 🔎 Search for a sequence
 
-Besides adding sequence information manually, PyEED also allows to search for sequences in the NCBI and UniProt databases. Therefore, the `from_db()` method can be used. In addition to the sequence itself, the method also returns the sequence's annotations and maps them to the corresponding attributes of the sequence object.
+Besides adding sequence information manually, PyEED can also retrieve sequences from the NCBI and UniProt databases using the `get_id()` method. In addition to the sequence itself, the method retrieves the sequence's annotations and maps them to the corresponding attributes of the sequence object.
 
 === "Protein"
 
     ``` py
-    protein = ProteinInfo.from_db("UCS38941.1")
+    protein = ProteinInfo.get_id("UCS38941.1")
     ```
 
 === "DNA"
 
     ``` py
-    dna = DNAInfo.from_db("NC_000913.3")
+    dna = DNAInfo.get_id("NC_000913.3")
     ```
 
 Alternatively, the sequence can be initiated from a sequence string, triggering a BLAST search in the NCBI database. If the sequence is found, the sequence object is filled with the corresponding information.
@@ -55,7 +55,8 @@ Alternatively, the sequence can be initiated from a sequence string, triggering
 
 ### To file
 
-The sequence can be stored in a `FASTA`, `JSON`, `YAML`, or `XML`file format. Therefore, the respective method can be used.
+The sequence can be stored in `FASTA`, `JSON`, `YAML`, or `XML` format using the respective method.
+The file path is passed as an argument to the method.
 
 === "FASTA"
 

diff --git a/docs/quick_start/blast.md b/docs/quick_start/blast.md
@@ -1,14 +1,13 @@
 # Using BLAST
 
 ## Using NCBI BLAST
-
-NCBI serves offer a web interface for blasting. With PyEED this can be programmatically accessed. A BLAST search can be initiated by calling the `ncbi_blast()` method on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.
+NCBI offers a web interface for BLAST searches, which PyEED can access programmatically. A BLAST search can be initiated by calling the `ncbi_blast()` method on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.
 
 ``` py
 from pyeed.core import ProteinInfo
 
 # Create a ProteinInfo object
-protein = ProteinInfo.from_db("UCS38941.1")
+protein = ProteinInfo.get_id("UCS38941.1")
 
 # Perform a BLAST search
 blast_results = protein.ncbi_blast()
@@ -22,6 +21,13 @@ blast_results = protein.ncbi_blast()
 
 ## Using BLAST with a local database
 
-Building a local BLAST database is a good way to speed up BLAST searches. PyEED allows to perform BLAST searches on local databases. The `local_blast()` method can be called on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.
+Building a local BLAST database is a good way to speed up BLAST searches. PyEED allows BLAST searches against local databases. The `blastp()` method can be called on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.
+
+``` py
 
-``` py
+    blast_results = protein.blastp(
+        db_path="/PATH/TO/LOCAL/BLAST/DB",
+        n_hits=200,
+        e_value=0.001,
+        word_size=3,
+    )
diff --git a/docs/quick_start/networks.md b/docs/quick_start/networks.md
@@ -22,7 +22,7 @@ A `SequenceNetwork` is created using a list of `PairwiseAlignment` objects, and
         "WP_048165429.1",
         "ACS90033.1",
     ]
-    mats = ProteinInfo.from_ncbi(mat_accessions)
+    mats = ProteinInfo.get_ids(mat_accessions)
 
     # Create pairwise alignments between all sequences
     alignments = Alignment.from_sequences(mats, aligner=PairwiseAligner)
@@ -59,7 +59,7 @@ A `SequenceNetwork` is created using a list of `PairwiseAlignment` objects, and
         "WP_048165429.1",
         "ACS90033.1",
     ]
-    mats = ProteinInfo.from_ncbi(mat_accessions)
+    mats = ProteinInfo.get_ids(mat_accessions)
 
     # Create pairwise alignments between all sequences
     alignments = Alignment.from_sequences(mats, aligner=PairwiseAligner)