External db refactor (#48)
* renamed

* INFO

* renamed from_ncbi to `get_id`

* renamed

* removed unused arg in get_ncbi_entry

* removed get_ncbi_entry()

* cleaning

* cleaning

* working tax fetcher

* removed old

* added mocker as dev dependency

* added tests for NCBITaxonomyFetcher

* finished

* introduced fetch method

* introduced fetcher

* cleaned

* working

* refactored NCBIProteinFetcher

* deleted old test notebooks

* added fetchers to structure

* typos and new syntax

* added MMSEQS2 to ToolImages

* added mmseqs2

* fixed wrong logging level

* added deprecation

* to_faster writes file

* updated notebooks
haeussma authored Mar 11, 2024
1 parent bb7498e commit 2635094
Showing 33 changed files with 5,114 additions and 4,885 deletions.
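
The commit message mentions both the rename of `from_ncbi` to `get_id` and an added deprecation. As a rough illustration of how such a rename can stay backward compatible, the following is a minimal sketch, not the PyEED implementation: the `ProteinInfo` class here is a hypothetical stand-in with a single `source_id` attribute, and the real methods query external databases instead of just storing the ID. The only parts taken from the diff are the method names `from_ncbi`, `get_id`, and `get_ids`, and the fact that the old `from_ncbi` was called with either one accession or a list.

```python
import warnings
from typing import List, Sequence, Union


class ProteinInfo:
    """Hypothetical stand-in for pyeed.core.ProteinInfo (not the real class)."""

    def __init__(self, source_id: str):
        self.source_id = source_id

    @classmethod
    def get_id(cls, protein_id: str) -> "ProteinInfo":
        # Assumption: the real method fetches the entry from a database.
        # This stub only records the requested ID.
        return cls(source_id=protein_id)

    @classmethod
    def get_ids(cls, protein_ids: Sequence[str]) -> List["ProteinInfo"]:
        # Batch variant: one object per accession, in input order.
        return [cls.get_id(pid) for pid in protein_ids]

    @classmethod
    def from_ncbi(
        cls, accession: Union[str, Sequence[str]]
    ) -> Union["ProteinInfo", List["ProteinInfo"]]:
        # Deprecated alias: warn, then forward to the new methods.
        warnings.warn(
            "from_ncbi() is deprecated; use get_id() or get_ids() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if isinstance(accession, str):
            return cls.get_id(accession)
        return cls.get_ids(list(accession))
```

With a shim like this, existing calls such as `ProteinInfo.from_ncbi("QGC48744.1")` would keep working while emitting a `DeprecationWarning`, which gives callers time to migrate to `get_id()` and `get_ids()`.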
2 changes: 1 addition & 1 deletion .github/workflows/integration.yaml
@@ -1,6 +1,6 @@
name: Integration Tests (MySQL)

on: [push]
on: [release]

jobs:
test:
22 changes: 11 additions & 11 deletions docs/quick_start/alignments.md
@@ -14,14 +14,14 @@ Both alignment objects contain the following attributes:
Besides the `Alignment` object, PyEED also provides a `PairwiseAlignment` object, containing the alignment score, identity, similarity, gaps, and mismatches of a pairwise alignment.


Before running the alignment, an `Alignment` needs to be created. This can be done by passing a list of `ProteinInfo` or `DNAInfo` objects to the constructor. The alignment can then be run by calling the `align()` method, passing the alignment method as an argument. The method returns the alignment object, containing the aligned sequences.
Before running the alignment, an `Alignment` needs to be created. This can be done by passing a list of `ProteinInfo` or `DNAInfo` objects to the constructor. The alignment can then be run by calling the `align()` method and passing the alignment method as an argument. The method returns the alignment object, containing the aligned sequences.

``` py
from pyeed.core import ProteinInfo, Alignment

# Get two ProteinInfo objects
tem1 = ProteinInfo.from_ncbi("QGC48744.1")
tem109 = ProteinInfo.from_ncbi("AAT46413.1")
tem1 = ProteinInfo.get_id("QGC48744.1")
tem109 = ProteinInfo.get_id("AAT46413.1")

# Create an Alignment
alignment = Alignment(input_sequences=[tem1, tem109])
@@ -33,8 +33,8 @@ Alternatively, the `from_sequences()` class method can be used to create an alig
from pyeed.core import ProteinInfo, Alignment

# Get two ProteinInfo objects
tem1 = ProteinInfo.from_ncbi("QGC48744.1")
tem109 = ProteinInfo.from_ncbi("AAT46413.1")
tem1 = ProteinInfo.get_id("QGC48744.1")
tem109 = ProteinInfo.get_id("AAT46413.1")
list_of_sequences = [tem1, tem109]

# Create an Alignment
@@ -50,8 +50,8 @@ alignment = Alignment.from_sequences(list_of_sequences)
from pyeed.aligners import PairwiseAligner

# Get two ProteinInfo objects
tem1 = ProteinInfo.from_ncbi("QGC48744.1")
tem109 = ProteinInfo.from_ncbi("AAT46413.1")
tem1 = ProteinInfo.get_id("QGC48744.1")
tem109 = ProteinInfo.get_id("AAT46413.1")

# Create and run alignment
alignment = PairwiseAlignment([tem1, tem109], aligner=PairwiseAligner, mode="local")
@@ -63,8 +63,8 @@ alignment = Alignment.from_sequences(list_of_sequences)
from pyeed.aligners import PairwiseAligner

# Get two ProteinInfo objects
tem1 = ProteinInfo.from_ncbi("QGC48744.1")
tem109 = ProteinInfo.from_ncbi("AAT46413.1")
tem1 = ProteinInfo.get_id("QGC48744.1")
tem109 = ProteinInfo.get_id("AAT46413.1")

# Create and run alignment
alignment = PairwiseAlignment([tem1, tem109], aligner=PairwiseAligner, mode="global")
@@ -80,7 +80,7 @@ alignment = Alignment.from_sequences(list_of_sequences)

# Get sequences
ncbi_accessions = ["QGC48744.1", "AAT46413.1", "AAT46414.1", "AAT46415.1"]
sequences = ProteinInfo.from_ncbi(ncbi_accessions)
sequences = ProteinInfo.get_ids(ncbi_accessions)

# Create and run alignment
alignment = Alignment.from_sequences(sequences, aligner=PairwiseAligner, mode="global")
@@ -99,7 +99,7 @@ Most sequence alignment tools are implemented as command line tools, which need

# Get sequences
ncbi_accessions = ["QGC48744.1", "AAT46413.1", "AAT46414.1", "AAT46415.1"]
sequences = ProteinInfo.from_ncbi(ncbi_accessions)
sequences = ProteinInfo.get_ids(ncbi_accessions)

# Create and run alignment
alignment = Alignment.from_sequences(sequences, aligner=ClustalOmega)
13 changes: 7 additions & 6 deletions docs/quick_start/basics.md
@@ -7,34 +7,34 @@ A sequence object can be created by passing a sequence string to the constructor
=== "Protein"

``` py
from pyEED.core import ProteinInfo
from pyeed.core import ProteinInfo

protein = ProteinInfo(sequence="MTEITAAMVKELREDKAVQLLREKGLGK")
```

=== "DNA"

``` py
from pyEED.core import DNAInfo
from pyeed.core import DNAInfo

dna = DNAInfo(sequence="ATGCGTACGTCGATCGATCGATCGATCGATCGATCGATCGATCGTAGTC")
```


## 🔎 Search for a sequence

Besides adding sequence information manually, PyEED also allows to search for sequences in the NCBI and UniProt databases. Therefore, the `from_db()` method can be used. In addition to the sequence itself, the method also returns the sequence's annotations and maps them to the corresponding attributes of the sequence object.
Besides adding sequence information manually, PyEED also allows searching for sequences in the NCBI and UniProt databases using the `get_id()` method. In addition to the sequence itself, the method also returns the sequence's annotations and maps them to the corresponding attributes of the sequence object.

=== "Protein"

``` py
protein = ProteinInfo.from_db("UCS38941.1")
protein = ProteinInfo.get_id("UCS38941.1")
```

=== "DNA"

``` py
dna = DNAInfo.from_db("NC_000913.3")
dna = DNAInfo.get_id("NC_000913.3")
```

Alternatively, the sequence can be initiated from a sequence string, triggering a BLAST search in the NCBI database. If the sequence is found, the sequence object is filled with the corresponding information.
@@ -55,7 +55,8 @@ Alternatively, the sequence can be initiated from a sequence string, triggering

### To file

The sequence can be stored in a `FASTA`, `JSON`, `YAML`, or `XML`file format. Therefore, the respective method can be used.
The sequence can be stored in `FASTA`, `JSON`, `YAML`, or `XML` format by calling the respective method.
The file path is passed as an argument to the method.

=== "FASTA"

16 changes: 11 additions & 5 deletions docs/quick_start/blast.md
@@ -1,14 +1,13 @@
# Using BLAST

## Using NCBI BLAST

NCBI serves offer a web interface for blasting. With PyEED this can be programmatically accessed. A BLAST search can be initiated by calling the `ncbi_blast()` method on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.
NCBI offers a web interface for blasting. With PyEED this can be programmatically accessed. A BLAST search can be initiated by calling the `ncbi_blast()` method on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.

``` py
from pyeed.core import ProteinInfo

# Create a ProteinInfo object
protein = ProteinInfo.from_db("UCS38941.1")
protein = ProteinInfo.get_id("UCS38941.1")

# Perform a BLAST search
blast_results = protein.ncbi_blast()
@@ -22,6 +21,13 @@ blast_results = protein.ncbi_blast()

## Using BLAST with a local database

Building a local BLAST database is a good way to speed up BLAST searches. PyEED allows to perform BLAST searches on local databases. The `local_blast()` method can be called on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.
Building a local BLAST database is a good way to speed up BLAST searches. PyEED allows BLAST searches against local databases. The `blastp()` method can be called on a `ProteinInfo` object. The method returns the found sequences as a list of `ProteinInfo` objects.

``` py

``` py
blast_results = protein.blastp(
    db_path="/PATH/TO/LOCAL/BLAST/DB",
    n_hits=200,
    e_value=0.001,
    word_size=3,
)
4 changes: 2 additions & 2 deletions docs/quick_start/networks.md
@@ -22,7 +22,7 @@ A `SequenceNetwork` is created using a list of `PairwiseAlignment` objects, and
"WP_048165429.1",
"ACS90033.1",
]
mats = ProteinInfo.from_ncbi(mat_accessions)
mats = ProteinInfo.get_ids(mat_accessions)

# Create pairwise alignments between all sequences
alignments = Alignment.from_sequences(mats, aligner=PairwiseAligner)
@@ -59,7 +59,7 @@ A `SequenceNetwork` is created using a list of `PairwiseAlignment` objects, and
"WP_048165429.1",
"ACS90033.1",
]
mats = ProteinInfo.from_ncbi(mat_accessions)
mats = ProteinInfo.get_ids(mat_accessions)

# Create pairwise alignments between all sequences
alignments = Alignment.from_sequences(mats, aligner=PairwiseAligner)