From 910372d45aa6c34974a79430abf525c377be73af Mon Sep 17 00:00:00 2001
From: Richard Harris
Date: Mon, 13 Mar 2023 13:12:35 +0000
Subject: [PATCH] 0.1.11 changelog

---
 CHANGELOG.md          |   2 +-
 docs/docs/api.rst     |   1 -
 docs/docs/install.rst |  20 +++----
 docs/docs/tools.rst   | 118 ++++++++++++++++++++++++------------------
 4 files changed, 78 insertions(+), 63 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5a6e86e..8013760 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,7 +4,7 @@ All notable changes, updates, and fixes to pod5 will be documented here
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## Unreleased
+## [0.1.11] 2023-03-13
 - Python API supports python 3.11
 - Changed python API dependency version `pyarrow~=11.0.0` from `8.0.0` to support python 3.11
 - Changed python API dependency version `hdf5~=8.0.0` from `v7.0.0` to support python 3.11
diff --git a/docs/docs/api.rst b/docs/docs/api.rst
index 79c1633..cb781cb 100644
--- a/docs/docs/api.rst
+++ b/docs/docs/api.rst
@@ -13,7 +13,6 @@ on how to install the pod5-format packages.
 Reading POD5 Files
 ========================
 
-
 To use the module to open a POD5 file, create a :class:`~pod5.reader.Reader`.
 It is strongly recommended that users use python's
 `with statement `_
diff --git a/docs/docs/install.rst b/docs/docs/install.rst
index e12cea4..54f3570 100644
--- a/docs/docs/install.rst
+++ b/docs/docs/install.rst
@@ -27,8 +27,8 @@ Developing with Conan
 ---------------------
 
 For this development process you will need `conan `_ installed.
-You can install `conan` using `pip` or your platforms' package manager (e.g. `brew`)
-You will also need the python `setuptools_scm` package to manage pod5 versioning:
+You can install ``conan`` using ``pip`` or your platforms' package manager (e.g. ``brew``)
+You will also need the python ``setuptools_scm`` package to manage pod5 versioning:
 
 .. code-block:: console
 
@@ -49,7 +49,7 @@ You will also need the python `setuptools_scm` package to manage pod5 versioning:
     # Creates _version.py and cmake/POD5Version.cmake which are ignored by git
     $ pip install setuptools_scm
     $ python -m setuptools_scm
-    $ python -m pod5_make_version.py
+    $ python -m pod5_make_version
 
 .. code-block:: console
 
@@ -108,17 +108,17 @@ Python Development Setup
 ========================
 
 After completing the required :ref:`build ` stages above,
-to create a Python virtual environment for development use the `Makefile` in
-the `python` directory to install the two pod5 python packages including all `dev`
-dependencies such as `pre-commit` and `black`.
+to create a Python virtual environment for development use the ``Makefile`` in
+the ``python`` directory to install the two pod5 python packages including all ``dev``
+dependencies such as ``pre-commit`` and ``black``.
 
 .. code-block:: console
 
     $ cd python
     $ make install
 
-Note this will completely `clean` the existing python virtual environment. If one only
-requires an update to the python environment entry-points (e.g. `pod5 inspect`) simply
+Note this will completely ``clean`` the existing python virtual environment. If one only
+requires an update to the python environment entry-points (e.g. ``pod5 inspect``) simply
 run:
 
 .. code-block:: console
@@ -129,8 +129,8 @@ run:
 Installing Pre-commit Hooks
 ---------------------------
 
-The project uses `pre-commit` to ensure code is consistently formatted, you can set this
-up using `pip` but if you chose to use the supplied `Makefile` this will be done
+The project uses ``pre-commit`` to ensure code is consistently formatted, you can set this
+up using ``pip`` but if you chose to use the supplied ``Makefile`` this will be done
 automatically as :ref:`detailed above `
 
 .. code-block:: console
diff --git a/docs/docs/tools.rst b/docs/docs/tools.rst
index 0c6fa5a..0abb0bd 100644
--- a/docs/docs/tools.rst
+++ b/docs/docs/tools.rst
@@ -4,7 +4,7 @@ Tools
 
 
 The ``pod5`` package provides the following tools for inspecting and manipulating
-POD5 files as well as converting between `.pod5` and `.fast5` file formats.
+POD5 files as well as converting between ``.pod5`` and ``.fast5`` file formats.
 
 .. contents:: Entry-Points
     :local:
@@ -14,8 +14,8 @@ POD5 files as well as converting between `.pod5` and `.fast5` file formats.
 Pod5 inspect
 ============
 
-The `pod5 inspect` tool can be used to extract details and summaries of
-the contents of `.pod5` files. There are two programs for users within `pod5 inspect`
+The ``pod5 inspect`` tool can be used to extract details and summaries of
+the contents of ``.pod5`` files. There are two programs for users within ``pod5 inspect``
 and these are read and reads
 
 .. code-block:: console
@@ -27,7 +27,7 @@ and these are read and reads
 pod5 inspect reads
 ------------------
 
-Inspect all reads and print a csv table of the details of all reads in the given `.pod5` files.
+Inspect all reads and print a csv table of the details of all reads in the given ``.pod5`` files.
 
 .. code-block:: console
 
@@ -78,11 +78,10 @@ Inspect the pod5 file, find a specific read and print its details.
     ...
 
 
-
 pod5 merge
 ==========
 
-`pod5 merge` is a tool for merging multiple `.pod5` files into one monolithic pod5 file.
+``pod5 merge`` is a tool for merging multiple ``.pod5`` files into one monolithic pod5 file.
 
 The contents of the input files are checked for duplicate read_ids to avoid
 accidentally merging identical reads. To override this check set the argument
@@ -106,40 +105,57 @@ accidentally merging identical reads. To override this check set the argument
 pod5 filter
 ===========
 
-`pod5 filter` is an alternative to `pod5 subset` where reads are subset from
-one or more input `.pod5` files using a list of read ids provided using the `--ids` argument.
+``pod5 filter`` is a simpler alternative to ``pod5 subset`` where reads are subset from
+one or more input ``.pod5`` files using a list of read ids provided using the ``--ids`` argument
+and writing those reads to a *single* ``--output`` file.
 
-An important difference between `pod5 subset` and `pod5 filter` is that `--output`
-specifies a directory in `subset` but a filepath in `filter`. This is because there is
-only one output file in `pod5 filter`.
+See ``pod5 subset`` for more advanced subsetting.
 
 .. code-block:: console
 
     pod5 filter example.pod5 --output filtered.pod5 --ids read_ids.txt
 
-The `--ids` filtering text file must be a simple list of valid UUID read_ids with
+The ``--ids`` filtering text file must be a simple list of valid UUID read_ids with
 one read_id per line. The only valid exceptions are:
 
 - Empty lines
 - Trailing / Leading whitespace
-- Lines beginning with a `#` (hash / pound symbol) to allow for comments
-- The text `read_id` to allow for the header from `pod5 inspect reads`
+- Lines beginning with a ``#`` (hash / pound symbol) to allow for comments
+- The text ``read_id`` to allow for the header from ``pod5 inspect reads``
+
+
+.. note::
+
+    The ``filter`` and ``subset`` tool will assert that any requested read_ids are
+    present in the inputs. If a requested read_id is missing from the inputs
+    then the tool will issue the following error:
+
+    .. code-block::
+
+        POD5 has encountered an error: 'Missing read_ids from inputs but --missing_ok not set'
+
+    To disable this warning then set the '-M / --missing_ok' flag.
+
+.. warning::
+    When supplying multiple input files to 'filter' or 'subset', the tools is
+    effectively performing a ``merge`` operation. The 'merge' tool is better suited
+    for handling very large numbers of input files.
 
 
 pod5 subset
 ===========
 
-`pod5 subset` is a tool for subsetting reads in `.pod5` files into one or more
-output `.pod5` files. See also `pod5 filter`
+``pod5 subset`` is a tool for subsetting reads in ``.pod5`` files into one or more
+output ``.pod5`` files. See also ``pod5 filter``
 
-The `pod5 subset` tool requires a *mapping* which defines which read_ids should be
+The ``pod5 subset`` tool requires a *mapping* which defines which read_ids should be
 written to which output. There are multiple ways of specifying this mapping which are
-defined in either a `.csv` or `.json` file or by using a `--table` (csv or tsv)
+defined in either a ``.csv`` or ``.json`` file or by using a ``--table`` (csv or tsv)
 and instructions on how to interpret it.
 
-`pod5 subset` aims to be a generic tool to subset from multiple inputs to multiple outputs.
-If your use-case is to `filter` read_ids from one or more inputs into a single output
-then `pod5 filter` might be a more appropriate tool as the only input is a list of read_ids.
+``pod5 subset`` aims to be a generic tool to subset from multiple inputs to multiple outputs.
+If your use-case is to ``filter`` read_ids from one or more inputs into a single output
+then ``pod5 filter`` might be a more appropriate tool as the only input is a list of read_ids.
 
 .. code-block:: console
 
@@ -155,21 +171,21 @@ then `pod5 filter` might be a more appropriate tool as the only input is a list
 
 .. important::
 
-    Care should be taken to ensure that when providing multiple input `.pod5` files to `pod5 subset`
+    Care should be taken to ensure that when providing multiple input ``.pod5`` files to ``pod5 subset``
     that there are no read_id UUID clashes. If a duplicate read_id is detected an exception
-    will be raised unless the `--duplicate_ok` argument is set. If `--duplicate_ok` is
+    will be raised unless the ``--duplicate_ok`` argument is set. If ``--duplicate_ok`` is
     set then both reads will be written to the output, although this is not recommended.
 
 Creating a Subset Mapping
 ------------------------------
 
-The `.csv` or `.json` inputs should define a mapping of destination filename to an array
+The ``.csv`` or ``.json`` inputs should define a mapping of destination filename to an array
 of read_ids which will be written to the destination.
 
 Subset Mapping (.csv)
 +++++++++++++++++++++++
 
-The example below shows a `.csv` subset mapping. Note that the output filename can be
+The example below shows a ``.csv`` subset mapping. Note that the output filename can be
 specified on multiple lines. This allows multi-line specifications to avoid excessively long lines.
 
 .. code-block:: text
@@ -182,9 +198,9 @@ specified on multiple lines. This allows multi-line specifications to avoid exce
 Subset Mapping (.json)
 +++++++++++++++++++++++++++
 
-See below an example of a `.json` subset mapping. This file must of course be well-formatted
-`json` in addition to the formatting standard required by the tool. The formatting requirements
-for the `.json` mapping are that keys should be unique filenames mapped to an array
+See below an example of a ``.json`` subset mapping. This file must of course be well-formatted
+``json`` in addition to the formatting standard required by the tool. The formatting requirements
+for the ``.json`` mapping are that keys should be unique filenames mapped to an array
 of read_id strings.
 
 .. code-block:: json
@@ -203,16 +219,16 @@ of read_id strings.
 Subset Mapping from Table
 ++++++++++++++++++++++++++++++++
 
-`pod5 subset` can dynamically generate output targets and collect associated reads
-based on a text file containing a table (csv or tsv) parsible by `pandas`.
-This table file could be the output from `pod5 inspect reads` or from a sequencing summary.
+``pod5 subset`` can dynamically generate output targets and collect associated reads
+based on a text file containing a table (csv or tsv) parsible by ``pandas``.
+This table file could be the output from ``pod5 inspect reads`` or from a sequencing summary.
 The table must contain a header row and a series of columns on which to group unique
 collections of values. Internally this process uses the
 `pandas.Dataframe.groupby `_
-function where the `by` parameter is the sequence of column names specified with
-the `--columns` argument.
+function where the ``by`` parameter is the sequence of column names specified with
+the ``--columns`` argument.
 
-Given the following example `--table` file, observe the resultant outputs given various
+Given the following example ``--table`` file, observe the resultant outputs given various
 arguments:
 
 .. code-block:: text
@@ -248,9 +264,9 @@ Output Filename Templating
 
 When subsetting using a table the output filename is generated from a template
 string. The automatically generated template is the sequential concatenation of
-`column_name-column_value` followed by the `.pod5` file extension.
+``column_name-column_value`` followed by the ``.pod5`` file extension.
 
-The user can set their own filename template using the `--template` argument.
+The user can set their own filename template using the ``--template`` argument.
 This argument accepts a string in the `Python f-string style `_
 where the subsetting variables are used for keyword placeholder substitution.
 Keywords should be placed within curly-braces. For example:
@@ -269,29 +285,29 @@ Keywords should be placed within curly-braces. For example:
     barcode_b.subset.pod5 # Contains: read_b, read_c
     barcode_c.subset.pod5 # Contains: read_d
 
-Example subsetting from `pod5 inspect reads`
+Example subsetting from ``pod5 inspect reads``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The `pod5 inspect reads` tool will output a csv table summarising the content of the
-specified `.pod5` file which can be used for subsetting. The example below shows
-how to split a `.pod5` file by the well field.
+The ``pod5 inspect reads`` tool will output a csv table summarising the content of the
+specified ``.pod5`` file which can be used for subsetting. The example below shows
+how to split a ``.pod5`` file by the well field.
 
 .. code-block:: console
 
-    # Create the csv table from inspect reads, skipping the first line (File: ...)
-    $ pod5 inspect reads example.pod5 | awk 'NR>1' > table.csv
+    # Create the csv table from inspect reads
+    $ pod5 inspect reads example.pod5 > table.csv
     $ pod5 subset example.pod5 --table table.csv --columns well
 
 Miscellaneous
 ~~~~~~~~~~~~~~
 
 To disable the `tqdm `_ progress bar set the environment
-variable `POD5_PBAR=0`.
+variable ``POD5_PBAR=0``.
 
 pod5 repack
 ===========
 
-`pod5 repack` will simply repack `.pod5` files into one-for-one output files of the same name.
+``pod5 repack`` will simply repack ``.pod5`` files into one-for-one output files of the same name.
 
 .. code-block:: console
 
@@ -301,14 +317,14 @@ pod5 repack
 pod5 convert fast5
 =======================
 
-The `pod5 convert fast5` tool takes one or more `.fast5` files and converts them
-to one or more `.pod5` files.
+The ``pod5 convert fast5`` tool takes one or more ``.fast5`` files and converts them
+to one or more ``.pod5`` files.
 
 .. warning::
 
-    Some content previously stored in `.fast5` files is **not** compatible with the POD5
+    Some content previously stored in ``.fast5`` files is **not** compatible with the POD5
     format and will not be converted. This includes all analyses stored in the
-    `.fast5` file.
+    ``.fast5`` file.
 
 .. important::
 
@@ -354,9 +370,9 @@ to one or more `.pod5` files.
 pod5 convert to_fast5
 =====================
 
-The `pod5 convert to_fast5` tool takes one or more `.pod5` files and converts them
-to multiple `.fast5` files. The default behaviour is to write 4000 reads per output file
-but this can be controlled with the `--file-read-count` argument.
+The ``pod5 convert to_fast5`` tool takes one or more ``.pod5`` files and converts them
+to multiple ``.fast5`` files. The default behaviour is to write 4000 reads per output file
+but this can be controlled with the ``--file-read-count`` argument.
 
 .. code-block:: console