Commit
PDBManager - Bug fixes, adding necessary changes to export only first PDB model, and merging-in latest updates from `master` (#311)

* add PDB manager #270

* add download method

* add clustering utilities

* `PDBManager` - Bug fixes, adding necessary changes to export only first PDB model, and merging-in latest updates from `master` (#309)

* Fix graph sequence (atomistic graphs in `initialise_graph_with_metadata` had duplicated residues)  (#268)

* Fix param name typo in function docstring

* fix: atomistic graph only has sequence residues for CA atom in `initialise_graph_with_metadata`

* fix: avoid changing dataframe when extracting rows

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add: test sequence feature in graphs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix graph sequence feature (#268)

* fix matplotlib deprecation

* fix test bug

* change build to ubuntu-latest

* remove unnecessary selection

---------

Co-authored-by: Cam <73625486+cimranm@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>

* Add dataset splits functionality and add new documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve merge conflicts with remote

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused test

* Address lingering SonarCloud concerns

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add deposition date parsing

* remove pdb.py

* add chain extraction util

* add chain writing method

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* After fixing merge conflicts, add more filters and add time-based splits

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix up SonarCloud concerns

* Improve verbiage surrounding PDB resolutions

* Simplify code and improve variable names

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Track names of splits in df_splits

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix column naming during merging of DataFrame splits

* add additional properties

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor clustering to allow file caching and overwriting

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add description to assert statements

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add extra documentation around clustering function, and address small formatting issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add method to write selection to CSV

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* improve from_fasta documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Enable code reuse for length filters

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor documentation changes to FASTA write-out function

* Add ability to perform most API calls for a subset of splits

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update .gitignore

* Fix missing download call, and add more documentation to download functions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix small bug when merging different splits together

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug in length filtering functions, fix print bugs in utils, and add ability to write-out PDB files after selecting a subset of chains to include in them

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix string formatting

* Update PDB write-out logic and documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add PDB download workaround for PDBs that can no longer be downloaded

* Make exception more specific

* Add TQDM for data split exporting

* Add improved error message for non standard node funcs #274 (#275)

* Add improved error message for non standard node funcs #274

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up unused files and move docs from root (#276)

* clean up unused files and move docs from root

* remove setup.cfg

* prelim path support #269 (#277)

* prelim path support #269

* fix import error

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Switch to miniconda for build (#278)

* switch to miniconda for build

* update docker build

* switch to checkout v3

* Improve altloc handling (#263)

* Fix bug in `add_k_nn_edges`.

`kneighbors_graph(X=dist_mat, ...)` is wrong since `X` may not be a distance matrix. This leads to incorrect results that can look deceptively similar to correct ones.

* Extend `add_k_nn_edges`.

* Add types to docstring

* Update changelog

* Add `kind_name` argument

* Test `filter_distmat`

* Set default value of `long_interaction_threshold` to 0

* Fix filtering bug in `add_k_nn_edges`

* Test `add_k_nn_edges`

* Refactor with `add_edge`

* Fix bug for empty `edges_to_excl`

* Improve `convert_nx_to_pyg`

* Fix bug in `plot_pyg_data`

* Test `convert_nx_to_pyg` on multimers

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update `CHANGELOG.md`

* Fix version in `CHANGELOG.md`

* Handle corner cases

* Handle NaNs in coordinates

* Add PyG install to CI

* typo in CI config

* bump torch versions in CI

* make pyg-related tests conditional on pyg installation

* Try fixing graph attributes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix typo and extend amino acid 3to1, 1to3 mappings

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Adapt imports of amino acid codes

* add semicolon to version

* remove wildcard version number for pyyaml

* fix typo

* fix additional typos

* Extend aggregation to vectors

* Implement `aggregate_feature_over_residues`

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add docstring and aggregation type

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* import literal from typing extensions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add missing `median` in exception message

* Fix `nullcontext`

* fix dataset test

* fix division by zero errors in edge colouring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

* Separate and improve `remove_alt_locs`

Removal of alt_locs is separated from removal of insertions. Additionally, alt_locs with higher occupancies are now kept.

* Test `remove_alt_locs`

* Rename test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Set `insertions=True` by default

* Make `alt_locs` configurable (TODO `include` case)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use typing_extensions literal for 3.7 compatibility

* use typing extensions literal for 3.7 compatibility

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* improve hbond donor/acceptor assignment robustness

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* replace trailing ":" in insertions

* fix test and hbond granularity inference

* Add altloc identifier to node ID

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test

* fix test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* actually fix test

* update changelog

* Fix typo

---------

Co-authored-by: Arian Jamasb <arjamasb@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Df processing #216 (#222)

* docstrings and df processing funcs #216

* docstrings

* add test

* lint test

* fix test

* fix typo in test

* Update changelog

* fix typo in test

* fix broken test

* fix broken test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add hetatm removal to test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use atomic granularity

* fix syntax error

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bugs in test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test

* typo

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Minor patch `convert_nx_to_pyg` #280 (#281)

* nx_to_pyg bug fix #280

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>

* changes for 1.6.0 (#279)

* changes for 1.6.0

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Enable PDBManager root to be set to an arbitrary location

* add initial tests

* update changelog

* add tutorial notebook

* Allow all chains in a complex to be exported together

* add module-level import

* Remove old, unused PDBManager prototype file

* add parsing & checks for unavailable PDB structures

* fix download checker

* actually fix download checker

* add availability filter

* FoldComp ML Datasets (#284)

* add foldcomp dataset util

* clean up

* add import warnings

* add foldcomp dataset extra dependencies

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* exclude foldcomp from notebook tests. download too big :(

* update changelog

* add lightning datamodule wrapper

* add transform functionality

* docs: add new module to API reference

* update notebook

* fix: fix paths issue on setup

* add foldcomp dataset tutorial to docs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add stage param to setup

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Default to export model 1's chains only in PDBManager, and clean-up notebook and utilities

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add tutorial nblink

* add tutorial to datasets sections

* mv pdb data to ml API

* rm pyg dataset import

* rm unused code

* fix annotation

* add MMTF download format

* refactor dependency utils

* refactor graphein.utils.utils.import_message

* refactor graphein.protein.utils.is_tool

* update .gitignore

* ignore cif too

* ignore cif too

* ignore foldcomp files

* catch straggling erroneous imports

* ignore mol2

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update folding utils

* add max batch option

* add foldcomp utils

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add notebook updates [WIP]

* move manager class into graphein.ml

* remove datasets init

* fix import util refactor I didn't catch

* add PDBmanager to __init__

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix oligomeric filtering

* update notebook

* fix dataset init

* fix protein.coord renaming in tensor module

* add try/except to pyg-related datasets

* add try/except to pyg-related datasets

* add mmseqs to CI build

* rollback dssp install to conda

* ignore pdb manager notebook in minimal tests

* fix code smell

* fix metrics

* shorten line lengths

* add minimum scipy version

* remove python 3.7 from CI

* Add Torch 2.0.0 to CI

* add note about multiple split strategies

* add torch cluster install to CI

* update dockerfile to torch 2.0

* switch docker pytorch 1.13 for VMD python version conflict

* switch out torchtyping for jaxtyping

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update tensor shape syntax for jaxtyping

* remove torch-dependent tests from minimal install testing

* update test ignores

* install dssp from apt, rather than conda in docker

* update typing extensions version

* Update citation (#287)

* update citation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Support MMTF & rename pdb_path to path throughout (#293)

* rename pdb_path to path throughout

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* install from biopandas bleeding edge

* fix bleeding edge biopandas install

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update to bleeding edge biopandas

* [pre-commit.ci] pre-commit autoupdate (#294)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/psf/black: 23.1.0 → 23.3.0](psf/black@23.1.0...23.3.0)

* pin pandas to <2.0.0

* Bump AF2 version

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>

* update path in notebooks

* Add missing import #296 (#297)

* update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Prep for 1.7.0 release (#292)

* update version string

* update readme

* update doc version

* update changelog

* Add autopublish workflow (#298)

* Add autopublish workflow

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update version for 1.7.0

* update workflow version

* remove rogue print statement (#302)

* Consistent conversion to undirected graphs (#301)

* Fix `convert_nx_to_pyg` to return undirected graph

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix symmetrization of edges of different kinds

* Clean

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix case when `edge_index` is not desired

* Test directed/undirected conversion consistency

* Update contributors

* Update CHANGELOG.md

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add graphein install to tutorial notebook #306

* Tensor fixes (#307)

* add PSW to nonstandard residues

* improve insertion and non-standard residue handling

* refactor chain selection

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused verbosity arg

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix chain selection in tests

* fix chain selection in tutorial notebook

* fix notebook chain selection

* fix chain selection typehint

* Update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add NLW as a nonstandard residue

* Export only first model of each downloaded PDB file, and typecast model_id column to str to avoid to_pdb() errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Track split names for edge cases in dataset splitting

* Add fix for scenario where downloaded PDB files do not contain ATOMs for an entry's listed chains

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Cam <73625486+kamurani@users.noreply.github.com>
Co-authored-by: Cam <73625486+cimranm@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>
Co-authored-by: Anton Bushuiev <67932762+anton-bushuiev@users.noreply.github.com>
Co-authored-by: Ryan Greenhalgh <35999546+rg314@users.noreply.github.com>

* Add structure format parameter to allow mmtf manipulation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

---------

Co-authored-by: Alex Morehead <acmwhb@missouri.edu>
Co-authored-by: Cam <73625486+kamurani@users.noreply.github.com>
Co-authored-by: Cam <73625486+cimranm@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Anton Bushuiev <67932762+anton-bushuiev@users.noreply.github.com>
Co-authored-by: Ryan Greenhalgh <35999546+rg314@users.noreply.github.com>
7 people authored Apr 28, 2023
1 parent af2b2e0 commit e982aa1
Showing 3 changed files with 97 additions and 14 deletions.
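
Before the diff, a minimal sketch of the headline change: keeping only the first model of a multi-model structure and string-typing `model_id` before write-out. The plain-pandas stand-in below is illustrative only; the actual commit operates on a BioPandas `PandasPdb` object via `get_models([1])`.

```python
# Hypothetical plain-pandas stand-in for the first-model export logic:
# multi-model (e.g. NMR) entries carry a model number on each atom record.
import pandas as pd

atoms = pd.DataFrame({
    "model_id": [1, 1, 2, 2],
    "atom_name": ["N", "CA", "N", "CA"],
})

# Keep only model 1, mirroring `get_models([1])` in the commit.
first_model = atoms[atoms["model_id"] == 1].copy()

# Cast model_id to str, mirroring the workaround for the BioPandas
# int-typing bug that surfaces when calling `to_pdb()`.
first_model["model_id"] = first_model["model_id"].astype(str)

print(first_model["model_id"].tolist())  # ['1', '1']
```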
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -6,6 +6,8 @@

#### Bugfixes
* Adds missing `stage` parameter to `graphein.ml.datasets.foldcomp_data.FoldCompDataModule.setup()`. [#310](https://github.com/a-r-j/graphein/pull/310)
* Ensures exporting groups of PDB chains with PDBManager selects the first model for multi-model structures. [#311](https://github.com/a-r-j/graphein/pull/311)
* Fixes bug with exporting PDBs with only one splitting strategy in PDBManager [#311](https://github.com/a-r-j/graphein/pull/311)
* Fixes incorrect jaxtyping syntax for variable size dimensions [#312](https://github.com/a-r-j/graphein/pull/312)

#### Other Changes
@@ -41,7 +43,6 @@
* Missing `os` import fixed in [#297](https://github.com/a-r-j/graphein/pull/297). Fixes [#296](https://github.com/a-r-j/graphein/issues/296)



### 1.6.0 - 18/03/2023

#### New Features
83 changes: 71 additions & 12 deletions graphein/ml/datasets/pdb_data.py
@@ -16,6 +16,7 @@
from tqdm import tqdm

from graphein.protein.utils import (
cast_pdb_column_to_type,
download_pdb_multiprocessing,
extract_chains_to_file,
read_fasta,
@@ -29,6 +30,7 @@ class PDBManager:
def __init__(
self,
root_dir: str = ".",
structure_format: str = "pdb",
splits: Optional[List[str]] = None,
split_ratios: Optional[List[float]] = None,
split_time_frames: Optional[List[np.datetime64]] = None,
@@ -39,6 +41,9 @@ def __init__(
:param root_dir: The directory in which to store all PDB entries,
defaults to ``"."``.
:type root_dir: str, optional
:param structure_format: Whether to use ``.pdb`` or ``.mmtf`` files.
Defaults to ``"pdb"``.
:type structure_format: str, optional
:param splits: A list of names corresponding to each dataset split,
defaults to ``None``.
:type splits: Optional[List[str]], optional
@@ -81,6 +86,8 @@ def __init__(
if not os.path.exists(self.pdb_dir):
os.makedirs(self.pdb_dir)

self.structure_format = structure_format

self.pdb_seqres_archive_filename = Path(self.pdb_sequences_url).name
self.pdb_seqres_filename = Path(self.pdb_seqres_archive_filename).stem
self.ligand_map_filename = Path(self.ligand_map_url).name
@@ -1210,6 +1217,7 @@ def split_clusters(
self.df_splits[split], df_split, split
)
else:
df_split.split = split
self.df_splits[split] = df_split
df_splits[split] = self.df_splits[split]

@@ -1341,6 +1349,7 @@ def split_df_into_time_frames(
(df.deposition_date >= start_datetime)
& (df.deposition_date < end_datetime)
]
df_split.split = split
df_splits[split] = df_split
start_datetime = end_datetime

@@ -1412,6 +1421,7 @@ def split_by_deposition_date(
self.df_splits[split], df_split, split
)
else:
df_split.split = split
self.df_splits[split] = df_split
df_splits[split] = self.df_splits[split]

@@ -1528,9 +1538,15 @@ def write_chains(

# Check we have all source PDB files
downloaded = os.listdir(self.pdb_dir)
downloaded = [f for f in downloaded if f.endswith(".pdb")]
downloaded = [
f for f in downloaded if f.endswith(f".{self.structure_format}")
]

to_download = [k for k in df.keys() if f"{k}.pdb" not in downloaded]
to_download = [
k
for k in df.keys()
if f"{k}.{self.structure_format}" not in downloaded
]
if len(to_download) > 0:
log.info(f"Downloading {len(to_download)} PDB files...")
download_pdb_multiprocessing(
@@ -1542,7 +1558,9 @@ def write_chains(
log.info("Extracting chains...")
paths = []
for k, v in tqdm(df.items()):
in_file = os.path.join(self.pdb_dir, f"{k}.pdb")
in_file = os.path.join(
self.pdb_dir, f"{k}.{self.structure_format}"
)
paths.append(
extract_chains_to_file(
in_file, v, out_dir=self.pdb_dir, models=models
@@ -1708,7 +1726,9 @@ def write_out_pdb_chain_groups(
out_dir: str,
split: str,
merge_fn: Callable,
atom_df_name: str = "ATOM",
max_num_chains_per_pdb_code: int = 1,
models: List[int] = [1],
):
"""Record groups of PDB codes and associated chains
as collated PDB files.
Expand All @@ -1724,9 +1744,15 @@ def write_out_pdb_chain_groups(
:type split: str
:param merge_fn: The PDB code-chain grouping function to use.
:type merge_fn: Callable
:param atom_df_name: Name of the DataFrame by which to access
ATOM entries within a PandasPdb object.
:type atom_df_name: str, defaults to ``ATOM``
:param max_num_chains_per_pdb_code: Maximum number of chains
to collate into a matching PDB file.
:type max_num_chains_per_pdb_code: int, optional
:param models: List of indices of models from which to extract chains,
defaults to ``[1]``.
:type models: List[int], optional
"""
if len(df) > 0:
split_dir = Path(out_dir) / split
@@ -1737,27 +1763,49 @@
df_merged = df_merged.reset_index(drop=True)

for _, entry in tqdm(df_merged.iterrows()):
pdb_code, chains = entry["pdb"], entry["chain"]
chains = (
chains
if max_num_chains_per_pdb_code == -1
else chains[:max_num_chains_per_pdb_code]
)
entry_pdb_code, entry_chains = entry["pdb"], entry["chain"]

input_pdb_filepath = Path(pdb_dir) / f"{pdb_code}.pdb"
output_pdb_filepath = split_dir / f"{pdb_code}.pdb"
input_pdb_filepath = (
Path(pdb_dir) / f"{entry_pdb_code}.{self.structure_format}"
)
output_pdb_filepath = (
split_dir / f"{entry_pdb_code}.{self.structure_format}"
)

if not os.path.exists(str(output_pdb_filepath)):
try:
pdb = PandasPdb().read_pdb(str(input_pdb_filepath))
pdb = (
PandasPdb()
.read_pdb(str(input_pdb_filepath))
.get_models(models)
)
except FileNotFoundError:
log.info(
f"Failed to load {str(input_pdb_filepath)}. Perhaps it is no longer available to download from the PDB?"
)
continue
# work around int-typing bug for `model_id` within version `0.5.0.dev0` of BioPandas -> appears when calling `to_pdb()`
cast_pdb_column_to_type(
pdb, column_name="model_id", type=str
)
# select only from chains available in the PDB file
pdb_atom_chains = (
pdb.df[atom_df_name].chain_id.unique().tolist()
)
chains = [
chain
for chain in entry_chains
if chain in pdb_atom_chains
]
chains = (
chains
if max_num_chains_per_pdb_code == -1
else chains[:max_num_chains_per_pdb_code]
)
pdb_chains = self.select_pdb_by_criterion(
pdb, "chain_id", chains
)
# export selected chains within the same PDB file
pdb_chains.to_pdb(str(output_pdb_filepath))

def write_df_pdbs(
Expand All @@ -1767,6 +1815,7 @@ def write_df_pdbs(
out_dir: str = "collated_pdb",
splits: Optional[List[str]] = None,
max_num_chains_per_pdb_code: int = 1,
models: List[int] = [1],
):
"""Write the given selection as a collection of PDB files.
@@ -1784,6 +1833,9 @@
:param max_num_chains_per_pdb_code: Maximum number of chains
to collate into a matching PDB file.
:type max_num_chains_per_pdb_code: int, optional
:param models: List of indices of models from which to extract chains,
defaults to ``[1]``.
:type models: List[int], optional
"""
out_dir = Path(pdb_dir) / out_dir
os.makedirs(out_dir, exist_ok=True)
@@ -1798,6 +1850,7 @@
split=split,
merge_fn=self.merge_pdb_chain_groups,
max_num_chains_per_pdb_code=max_num_chains_per_pdb_code,
models=models,
)
else:
self.write_out_pdb_chain_groups(
@@ -1807,13 +1860,15 @@
split="full",
merge_fn=self.merge_pdb_chain_groups,
max_num_chains_per_pdb_code=max_num_chains_per_pdb_code,
models=models,
)

def export_pdbs(
self,
pdb_dir: str,
splits: Optional[List[str]] = None,
max_num_chains_per_pdb_code: int = 1,
models: List[int] = [1],
force: bool = False,
):
"""Write the selection as a collection of PDB files.
@@ -1826,6 +1881,9 @@
:param max_num_chains_per_pdb_code: Maximum number of chains
to collate into a matching PDB file.
:type max_num_chains_per_pdb_code: int, optional
:param models: List of indices of models from which to extract chains,
defaults to ``[1]``.
:type models: List[int], optional
:param force: Whether to raise an error if the download selection
contains PDBs which are not available in PDB format.
"""
@@ -1841,5 +1899,6 @@
split_dfs,
splits=splits,
max_num_chains_per_pdb_code=max_num_chains_per_pdb_code,
models=models,
)
log.info("Done writing selection of PDB chains")
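
The chain-selection rules added to `write_out_pdb_chain_groups` above can be sketched in isolation (the function name and signature below are illustrative, not part of the library API): requested chains are first intersected with the chains actually present in the file's ATOM records, then capped at `max_num_chains_per_pdb_code`, with `-1` meaning no cap.

```python
from typing import List

def select_entry_chains(
    entry_chains: List[str],
    pdb_atom_chains: List[str],
    max_num_chains_per_pdb_code: int = 1,
) -> List[str]:
    # Keep only chains the downloaded file actually contains...
    chains = [c for c in entry_chains if c in pdb_atom_chains]
    # ...then cap the count, where -1 disables the cap.
    if max_num_chains_per_pdb_code == -1:
        return chains
    return chains[:max_num_chains_per_pdb_code]

print(select_entry_chains(["A", "B", "C"], ["A", "C"], -1))  # ['A', 'C']
print(select_entry_chains(["A", "B", "C"], ["A", "C"], 1))   # ['A']
```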
25 changes: 24 additions & 1 deletion graphein/protein/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from functools import lru_cache, partial
from pathlib import Path
from shutil import which
from typing import Any, Dict, List, Optional, Tuple, Union
from typing import Any, Dict, List, Optional, Tuple, Type, Union
from urllib.error import HTTPError
from urllib.request import urlopen

@@ -516,6 +516,27 @@ def esmfold(
f.write(cif)


def cast_pdb_column_to_type(
pdb: PandasPdb, column_name: str, type: Type
) -> PandasPdb:
"""Casts a specified column within a PandasPdb object to a given type
and returns the typecasted PandasPdb object.
:param pdb: Input PandasPdb object.
:type pdb: PandasPdb
:param column_name: Name of column to typecast.
:type column_name: str
:param type: Type to which to cast the specified column.
:type type: Type
:return: Typecasted PandasPdb object.
:rtype: PandasPdb
"""
for key in pdb.df:
if column_name in pdb.df[key]:
pdb.df[key][column_name] = pdb.df[key][column_name].apply(type)
return pdb


def extract_chains_to_file(
pdb_file: str, chains: List[str], out_dir: str, models: List[int] = [1]
) -> List[str]:
@@ -544,6 +565,8 @@ def extract_chains_to_file(
fname = fname.split(".")[0]

ppdb = PandasPdb().read_pdb(pdb_file).get_models(models)
# work around int-typing bug for `model_id` within version `0.5.0.dev0` of BioPandas -> appears when calling `to_pdb()`
cast_pdb_column_to_type(ppdb, column_name="model_id", type=str)

out_files = []
for chain in chains:
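
The casting loop in `cast_pdb_column_to_type` can be exercised without BioPandas: `PandasPdb.df` is simply a dict of DataFrames keyed by record type, so a plain dict reproduces the behaviour. This stand-in is for illustration only.

```python
import pandas as pd

def cast_column_in_record_dfs(dfs, column_name, to_type):
    # Same loop as cast_pdb_column_to_type: cast the column in every
    # record DataFrame ("ATOM", "HETATM", ...) that actually has it.
    for key in dfs:
        if column_name in dfs[key]:
            dfs[key][column_name] = dfs[key][column_name].apply(to_type)
    return dfs

records = {
    "ATOM": pd.DataFrame({"model_id": [1, 1, 2]}),
    "OTHERS": pd.DataFrame({"record_name": ["TER"]}),  # no model_id column
}
records = cast_column_in_record_dfs(records, "model_id", str)
print(records["ATOM"]["model_id"].tolist())  # ['1', '1', '2']
```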
