Commit
PDBManager - Bug fixes, adding necessary changes to export only first PDB model, and merging-in latest updates from `master` (#311)

* add PDB manager #270

* add download method

* add clustering utilities

* `PDBManager` - Bug fixes, adding necessary changes to export only first PDB model, and merging-in latest updates from `master` (#309)

* Fix graph sequence (atomistic graphs in `initialise_graph_with_metadata` had duplicated residues)  (#268)

* Fix param name typo in function docstring

* fix: atomistic graph only has sequence residues for CA atom in `initialise_graph_with_metadata`

* fix: avoid changing dataframe when extracting rows

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add: test sequence feature in graphs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix graph sequence feature (#268)

* fix matplotlib deprecation

* fix test bug

* change build to ubuntu-latest

* remove unnecessary selection

---------

Co-authored-by: Cam <73625486+cimranm@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>

* Add dataset splits functionality and add new documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve merge conflicts with remote

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused test

* Address lingering SonarCloud concerns

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add deposition date parsing

* remove pdb.py

* add chain extraction util

* add chain writing method

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* After fixing merge conflicts, add more filters and add time-based splits

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix up SonarCloud concerns

* Improve verbiage surrounding PDB resolutions

* Simplify code and improve variable names

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Track names of splits in df_splits

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix column naming during merging of DataFrame splits

* add additional properties

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor clustering to allow file caching and overwriting

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add description to assert statements

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add extra documentation around clustering function, and address small formatting issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add method to write selection to CSV

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* improve from_fasta documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Enable code reuse for length filters

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor documentation changes to FASTA write-out function

* Add ability to perform most API calls for a subset of splits

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update .gitignore

* Fix missing download call, and add more documentation to download functions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix small bug when merging different splits together

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug in length filtering functions, fix print bugs in utils, and add ability to write-out PDB files after selecting a subset of chains to include in them

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix string formatting

* Update PDB write-out logic and documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add PDB download workaround for PDBs that can no longer be downloaded

* Make exception more specific

* Add TQDM for data split exporting

* Add improved error message for non standard node funcs #274 (#275)

* Add improved error message for non standard node funcs #274

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up unused files and move docs from root (#276)

* clean up unused files and move docs from root

* remove setup.cfg

* prelim path support #269 (#277)

* prelim path support #269

* fix import error

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Switch to miniconda for build (#278)

* switch to miniconda for build

* update docker build

* switch to checkout v3

* Improve altloc handling (#263)

* Fix bug in `add_k_nn_edges`.

`kneighbors_graph(X=dist_mat, ...)` is wrong since `X` may not be a distance matrix. This leads to incorrect results that can look deceptively similar to correct ones.

* Extend `add_k_nn_edges`.

* Add types to docstring

* Update changelog

* Add `kind_name` argument

* Test `filter_distmat`

* Set default value of `long_interaction_threshold` to 0

* Fix filtering bug in `add_k_nn_edges`

* Test `add_k_nn_edges`

* Refactor with `add_edge`

* Fix bug for empty `edges_to_excl`

* Improve `convert_nx_to_pyg`

* Fix bug in `plot_pyg_data`

* Test `convert_nx_to_pyg` on multimers

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update `CHANGELOG.md`

* Fix version in `CHANGELOG.md`

* Handle corner cases

* Handle NaNs in coordinates

* Add PyG install to CI

* typo in CI config

* bump torch versions in CI

* make pyg-related tests conditional on pyg installation

* Try fixing graph attributes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix typo and extend amino acid 3to1, 1to3 mappings

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Adapt imports of amino acid codes

* add semicolon to version

* remove wildcard version number for pyyaml

* fix typo

* fix additional typos

* Extend aggregation to vectors

* Implement `aggregate_feature_over_residues`

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add docstring and aggregation type

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* import literal from typing extensions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add missing `median` in exception message

* Fix `nullcontext`

* fix dataset test

* fix division by zero errors in edge colouring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

* Separate and improve `remove_alt_locs`

Removal of alt_locs is separated from removal of insertions. Additionally, alt_locs with higher occupancies are now kept.

* Test `remove_alt_locs`

* Rename test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Set `insertions=True` by default

* Make `alt_locs` configurable (TODO `include` case)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use typing_extensions literal for 3.7 compatibility

* use typing extensions literal for 3.7 compatibility

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* improve hbond donor/acceptor assignment robustness

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* replace trailing ":" in insertions

* fix test and hbond granularity inference

* Add altloc identifier to node ID

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test

* fix test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* actually fix test

* update changelog

* Fix typo

---------

Co-authored-by: Arian Jamasb <arjamasb@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Df processing #216 (#222)

* docstrings and df processing funcs #216

* docstrings

* add test

* lint test

* fix test

* fix typo in test

* Update changelog

* fix typo in test

* fix broken test

* fix broken test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add hetatm removal to test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use atomic granularity

* fix syntax error

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bugs in test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test

* typo

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Minor patch `convert_nx_to_pyg` #280 (#281)

* nx_to_pyg bug fix #280

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>

* changes for 1.6.0 (#279)

* changes for 1.6.0

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Enable PDBManager root to be set to an arbitrary location

* add initial tests

* update changelog

* add tutorial notebook

* Allow all chains in a complex to be exported together

* add module-level import

* Remove old, unused PDBManager prototype file

* add parsing & checks for unavailable PDB structures

* fix download checker

* actually fix download checker

* add availability filter

* FoldComp ML Datasets (#284)

* add foldcomp dataset util

* clean up

* add import warnings

* add foldcomp dataset extra dependencies

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* exclude foldcomp from notebook tests. download too big :(

* update changelog

* add lightning datamodule wrapper

* add transform functionality

* docs: add new module to API reference

* update notebook

* fix: fix paths issue on setup

* add foldcomp dataset tutorial to docs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add stage param to setup

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Default to export model 1's chains only in PDBManager, and clean-up notebook and utilities

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add tutorial nblink

* add tutorial to datasets sections

* mv pdb data to ml API

* rm pyg dataset import

* rm unused code

* fix annotation

* add MMTF download format

* refactor dependency utils

* refactor graphein.utils.utils.import_message

* refactor graphein.protein.utils.is_tool

* update .gitignore

* ignore cif too

* ignore cif too

* ignore foldcomp files

* catch straggling erroneous imports

* ignore mol2

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update folding utils

* add max batch option

* add foldcomp utils

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add notebook updates [WIP]

* move manager class into graphein.ml

* remove datasets init

* fix import util refactor I didn't catch

* add PDBmanager to __init__

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix oligomeric filtering

* update notebook

* fix dataset init

* fix protein.coord renaming in tensor module

* add try/except to pyg-related datasets

* add try/except to pyg-related datasets

* add mmseqs to CI build

* rollback dssp install to conda

* ignore pdb manager notebook in minimal tests

* fix code smell

* fix metrics

* shorten line lengths

* add minimum scipy version

* remove python 3.7 from CI

* Add Torch 2.0.0 to CI

* add note about multiple split strategies

* add torch cluster install to CI

* update dockerfile to torch 2.0

* switch docker pytorch 1.13 for VMD python version conflict

* switch out torchtyping for jaxtyping

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update tensor shape syntax for jaxtyping

* remove torch-dependent tests from minimal install testing

* update test ignores

* install dssp from apt, rather than conda in docker

* update typing extensions version

* Update citation (#287)

* update citation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Support MMTF & rename pdb_path to path throughout (#293)

* rename pdb_path to path throughout

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* install from biopandas bleeding edge

* fix bleeding edge biopandas install

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update to bleeding edge biopandas

* [pre-commit.ci] pre-commit autoupdate (#294)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/psf/black: 23.1.0 → 23.3.0](psf/black@23.1.0...23.3.0)

* pin pandas to <2.0.0

* Bump AF2 version

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>

* update path in notebooks

* Add missing import #296 (#297)

* update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Prep for 1.7.0 release (#292)

* update version string

* update readme

* update doc version

* update changelog

* Add autopublish workflow (#298)

* Add autopublish workflow

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update version for 1.7.0

* update workflow version

* remove rogue print statement (#302)

* Consistent conversion to undirected graphs (#301)

* Fix `convert_nx_to_pyg` to return undirected graph

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix symmetrization of edges of different kinds

* Clean

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix case when `edge_index` is not desired

* Test directed/undirected conversion consistency

* Update contributors

* Update CHANGELOG.md

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add graphein install to tutorial notebook #306

* Tensor fixes (#307)

* add PSW to nonstandard residues

* improve insertion and non-standard residue handling

* refactor chain selection

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused verbosity arg

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix chain selection in tests

* fix chain selection in tutorial notebook

* fix notebook chain selection

* fix chain selection typehint

* Update changelog

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add NLW as a nonstandard residue

* Export only first model of each downloaded PDB file, and typecast model_id column to str to avoid to_pdb() errors

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Track split names for edge cases in dataset splitting

* Add fix for scenario where downloaded PDB files do not contain ATOMs for an entry's listed chains

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Cam <73625486+kamurani@users.noreply.github.com>
Co-authored-by: Cam <73625486+cimranm@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Arian Jamasb <arjamasb@gmail.com>
Co-authored-by: Anton Bushuiev <67932762+anton-bushuiev@users.noreply.github.com>
Co-authored-by: Ryan Greenhalgh <35999546+rg314@users.noreply.github.com>

* Add structure format parameter to allow mmtf manipulation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update changelog

---------

Co-authored-by: Alex Morehead <acmwhb@missouri.edu>
Co-authored-by: Cam <73625486+kamurani@users.noreply.github.com>
Co-authored-by: Cam <73625486+cimranm@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Anton Bushuiev <67932762+anton-bushuiev@users.noreply.github.com>
Co-authored-by: Ryan Greenhalgh <35999546+rg314@users.noreply.github.com>
7 people authored Apr 28, 2023
1 parent af2b2e0 commit e982aa1
Showing 3 changed files with 97 additions and 14 deletions.
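
Before the diff, a minimal sketch of the headline change: keeping only the first model of a multi-model structure and string-typing `model_id` before write-out. The plain-pandas stand-in below is illustrative only; the actual commit operates on a BioPandas `PandasPdb` object via `get_models([1])`.

```python
# Hypothetical plain-pandas stand-in for the first-model export logic:
# multi-model (e.g. NMR) entries carry a model number on each atom record.
import pandas as pd

atoms = pd.DataFrame({
    "model_id": [1, 1, 2, 2],
    "atom_name": ["N", "CA", "N", "CA"],
})

# Keep only model 1, mirroring `get_models([1])` in the commit.
first_model = atoms[atoms["model_id"] == 1].copy()

# Cast model_id to str, mirroring the workaround for the BioPandas
# int-typing bug that surfaces when calling `to_pdb()`.
first_model["model_id"] = first_model["model_id"].astype(str)

print(first_model["model_id"].tolist())  # ['1', '1']
```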
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -6,6 +6,8 @@

#### Bugfixes
* Adds missing `stage` parameter to `graphein.ml.datasets.foldcomp_data.FoldCompDataModule.setup()`. [#310](https://github.com/a-r-j/graphein/pull/310)
* Ensures exporting groups of PDB chains with PDBManager selects the first model for multi-model structures. [#311](https://github.com/a-r-j/graphein/pull/311)
* Fixes bug with exporting PDBs with only one splitting strategy in PDBManager [#311](https://github.com/a-r-j/graphein/pull/311)
* Fixes incorrect jaxtyping syntax for variable size dimensions [#312](https://github.com/a-r-j/graphein/pull/312)

#### Other Changes
@@ -41,7 +43,6 @@
* Missing `os` import fixed in [#297](https://github.com/a-r-j/graphein/pull/297). Fixes [#296](https://github.com/a-r-j/graphein/issues/296)



### 1.6.0 - 18/03/2023

#### New Features
83 changes: 71 additions & 12 deletions graphein/ml/datasets/pdb_data.py
@@ -16,6 +16,7 @@
from tqdm import tqdm

from graphein.protein.utils import (
cast_pdb_column_to_type,
download_pdb_multiprocessing,
extract_chains_to_file,
read_fasta,
@@ -29,6 +30,7 @@ class PDBManager:
def __init__(
self,
root_dir: str = ".",
structure_format: str = "pdb",
splits: Optional[List[str]] = None,
split_ratios: Optional[List[float]] = None,
split_time_frames: Optional[List[np.datetime64]] = None,
@@ -39,6 +41,9 @@ def __init__(
:param root_dir: The directory in which to store all PDB entries,
defaults to ``"."``.
:type root_dir: str, optional
:param structure_format: Whether to use ``.pdb`` or ``.mmtf`` files.
Defaults to ``"pdb"``.
:type structure_format: str, optional
:param splits: A list of names corresponding to each dataset split,
defaults to ``None``.
:type splits: Optional[List[str]], optional
@@ -81,6 +86,8 @@ def __init__(
if not os.path.exists(self.pdb_dir):
os.makedirs(self.pdb_dir)

self.structure_format = structure_format

self.pdb_seqres_archive_filename = Path(self.pdb_sequences_url).name
self.pdb_seqres_filename = Path(self.pdb_seqres_archive_filename).stem
self.ligand_map_filename = Path(self.ligand_map_url).name
@@ -1210,6 +1217,7 @@ def split_clusters(
self.df_splits[split], df_split, split
)
else:
df_split.split = split
self.df_splits[split] = df_split
df_splits[split] = self.df_splits[split]

@@ -1341,6 +1349,7 @@ def split_df_into_time_frames(
(df.deposition_date >= start_datetime)
& (df.deposition_date < end_datetime)
]
df_split.split = split
df_splits[split] = df_split
start_datetime = end_datetime

@@ -1412,6 +1421,7 @@ def split_by_deposition_date(
self.df_splits[split], df_split, split
)
else:
df_split.split = split
self.df_splits[split] = df_split
df_splits[split] = self.df_splits[split]

@@ -1528,9 +1538,15 @@ def write_chains(

# Check we have all source PDB files
downloaded = os.listdir(self.pdb_dir)
downloaded = [f for f in downloaded if f.endswith(".pdb")]
downloaded = [
f for f in downloaded if f.endswith(f".{self.structure_format}")
]

to_download = [k for k in df.keys() if f"{k}.pdb" not in downloaded]
to_download = [
k
for k in df.keys()
if f"{k}.{self.structure_format}" not in downloaded
]
if len(to_download) > 0:
log.info(f"Downloading {len(to_download)} PDB files...")
download_pdb_multiprocessing(
@@ -1542,7 +1558,9 @@ def write_chains(
log.info("Extracting chains...")
paths = []
for k, v in tqdm(df.items()):
in_file = os.path.join(self.pdb_dir, f"{k}.pdb")
in_file = os.path.join(
self.pdb_dir, f"{k}.{self.structure_format}"
)
paths.append(
extract_chains_to_file(
in_file, v, out_dir=self.pdb_dir, models=models
@@ -1708,7 +1726,9 @@ def write_out_pdb_chain_groups(
out_dir: str,
split: str,
merge_fn: Callable,
atom_df_name: str = "ATOM",
max_num_chains_per_pdb_code: int = 1,
models: List[int] = [1],
):
"""Record groups of PDB codes and associated chains
as collated PDB files.
Expand All @@ -1724,9 +1744,15 @@ def write_out_pdb_chain_groups(
:type split: str
:param merge_fn: The PDB code-chain grouping function to use.
:type merge_fn: Callable
:param atom_df_name: Name of the DataFrame by which to access
ATOM entries within a PandasPdb object.
:type atom_df_name: str, defaults to ``ATOM``
:param max_num_chains_per_pdb_code: Maximum number of chains
to collate into a matching PDB file.
:type max_num_chains_per_pdb_code: int, optional
:param models: List of indices of models from which to extract chains,
defaults to ``[1]``.
:type models: List[int], optional
"""
if len(df) > 0:
split_dir = Path(out_dir) / split
@@ -1737,27 +1763,49 @@
df_merged = df_merged.reset_index(drop=True)

for _, entry in tqdm(df_merged.iterrows()):
pdb_code, chains = entry["pdb"], entry["chain"]
chains = (
chains
if max_num_chains_per_pdb_code == -1
else chains[:max_num_chains_per_pdb_code]
)
entry_pdb_code, entry_chains = entry["pdb"], entry["chain"]

input_pdb_filepath = Path(pdb_dir) / f"{pdb_code}.pdb"
output_pdb_filepath = split_dir / f"{pdb_code}.pdb"
input_pdb_filepath = (
Path(pdb_dir) / f"{entry_pdb_code}.{self.structure_format}"
)
output_pdb_filepath = (
split_dir / f"{entry_pdb_code}.{self.structure_format}"
)

if not os.path.exists(str(output_pdb_filepath)):
try:
pdb = PandasPdb().read_pdb(str(input_pdb_filepath))
pdb = (
PandasPdb()
.read_pdb(str(input_pdb_filepath))
.get_models(models)
)
except FileNotFoundError:
log.info(
f"Failed to load {str(input_pdb_filepath)}. Perhaps it is no longer available to download from the PDB?"
)
continue
# work around int-typing bug for `model_id` within version `0.5.0.dev0` of BioPandas -> appears when calling `to_pdb()`
cast_pdb_column_to_type(
pdb, column_name="model_id", type=str
)
# select only from chains available in the PDB file
pdb_atom_chains = (
pdb.df[atom_df_name].chain_id.unique().tolist()
)
chains = [
chain
for chain in entry_chains
if chain in pdb_atom_chains
]
chains = (
chains
if max_num_chains_per_pdb_code == -1
else chains[:max_num_chains_per_pdb_code]
)
pdb_chains = self.select_pdb_by_criterion(
pdb, "chain_id", chains
)
# export selected chains within the same PDB file
pdb_chains.to_pdb(str(output_pdb_filepath))

def write_df_pdbs(
Expand All @@ -1767,6 +1815,7 @@ def write_df_pdbs(
out_dir: str = "collated_pdb",
splits: Optional[List[str]] = None,
max_num_chains_per_pdb_code: int = 1,
models: List[int] = [1],
):
"""Write the given selection as a collection of PDB files.
@@ -1784,6 +1833,9 @@
:param max_num_chains_per_pdb_code: Maximum number of chains
to collate into a matching PDB file.
:type max_num_chains_per_pdb_code: int, optional
:param models: List of indices of models from which to extract chains,
defaults to ``[1]``.
:type models: List[int], optional
"""
out_dir = Path(pdb_dir) / out_dir
os.makedirs(out_dir, exist_ok=True)
@@ -1798,6 +1850,7 @@
split=split,
merge_fn=self.merge_pdb_chain_groups,
max_num_chains_per_pdb_code=max_num_chains_per_pdb_code,
models=models,
)
else:
self.write_out_pdb_chain_groups(
@@ -1807,13 +1860,15 @@
split="full",
merge_fn=self.merge_pdb_chain_groups,
max_num_chains_per_pdb_code=max_num_chains_per_pdb_code,
models=models,
)

def export_pdbs(
self,
pdb_dir: str,
splits: Optional[List[str]] = None,
max_num_chains_per_pdb_code: int = 1,
models: List[int] = [1],
force: bool = False,
):
"""Write the selection as a collection of PDB files.
@@ -1826,6 +1881,9 @@
:param max_num_chains_per_pdb_code: Maximum number of chains
to collate into a matching PDB file.
:type max_num_chains_per_pdb_code: int, optional
:param models: List of indices of models from which to extract chains,
defaults to ``[1]``.
:type models: List[int], optional
:param force: Whether to raise an error if the download selection
contains PDBs which are not available in PDB format.
"""
@@ -1841,5 +1899,6 @@
split_dfs,
splits=splits,
max_num_chains_per_pdb_code=max_num_chains_per_pdb_code,
models=models,
)
log.info("Done writing selection of PDB chains")
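
The chain-selection rules added to `write_out_pdb_chain_groups` above can be sketched in isolation (the function name and signature below are illustrative, not part of the library API): requested chains are first intersected with the chains actually present in the file's ATOM records, then capped at `max_num_chains_per_pdb_code`, with `-1` meaning no cap.

```python
from typing import List

def select_entry_chains(
    entry_chains: List[str],
    pdb_atom_chains: List[str],
    max_num_chains_per_pdb_code: int = 1,
) -> List[str]:
    # Keep only chains the downloaded file actually contains...
    chains = [c for c in entry_chains if c in pdb_atom_chains]
    # ...then cap the count, where -1 disables the cap.
    if max_num_chains_per_pdb_code == -1:
        return chains
    return chains[:max_num_chains_per_pdb_code]

print(select_entry_chains(["A", "B", "C"], ["A", "C"], -1))  # ['A', 'C']
print(select_entry_chains(["A", "B", "C"], ["A", "C"], 1))   # ['A']
```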
25 changes: 24 additions & 1 deletion graphein/protein/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from functools import lru_cache, partial
from pathlib import Path
from shutil import which
from typing import Any, Dict, List, Optional, Tuple, Union
from typing import Any, Dict, List, Optional, Tuple, Type, Union
from urllib.error import HTTPError
from urllib.request import urlopen

@@ -516,6 +516,27 @@ def esmfold(
f.write(cif)


def cast_pdb_column_to_type(
pdb: PandasPdb, column_name: str, type: Type
) -> PandasPdb:
"""Casts a specified column within a PandasPdb object to a given type
and returns the typecasted PandasPdb object.
:param pdb: Input PandasPdb object.
:type pdb: PandasPdb
:param column_name: Name of column to typecast.
:type column_name: str
:param type: Type to which to cast the specified column.
:type type: Type
:return: Typecasted PandasPdb object.
:rtype: PandasPdb
"""
for key in pdb.df:
if column_name in pdb.df[key]:
pdb.df[key][column_name] = pdb.df[key][column_name].apply(type)
return pdb


def extract_chains_to_file(
pdb_file: str, chains: List[str], out_dir: str, models: List[int] = [1]
) -> List[str]:
@@ -544,6 +565,8 @@ def extract_chains_to_file(
fname = fname.split(".")[0]

ppdb = PandasPdb().read_pdb(pdb_file).get_models(models)
# work around int-typing bug for `model_id` within version `0.5.0.dev0` of BioPandas -> appears when calling `to_pdb()`
cast_pdb_column_to_type(ppdb, column_name="model_id", type=str)

out_files = []
for chain in chains:
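
The casting loop in `cast_pdb_column_to_type` can be exercised without BioPandas: `PandasPdb.df` is simply a dict of DataFrames keyed by record type, so a plain dict reproduces the behaviour. This stand-in is for illustration only.

```python
import pandas as pd

def cast_column_in_record_dfs(dfs, column_name, to_type):
    # Same loop as cast_pdb_column_to_type: cast the column in every
    # record DataFrame ("ATOM", "HETATM", ...) that actually has it.
    for key in dfs:
        if column_name in dfs[key]:
            dfs[key][column_name] = dfs[key][column_name].apply(to_type)
    return dfs

records = {
    "ATOM": pd.DataFrame({"model_id": [1, 1, 2]}),
    "OTHERS": pd.DataFrame({"record_name": ["TER"]}),  # no model_id column
}
records = cast_column_in_record_dfs(records, "model_id", str)
print(records["ATOM"]["model_id"].tolist())  # ['1', '1', '2']
```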
