Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Clump #211

Merged
merged 26 commits into from
May 2, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
1455887
Fixed docs and logic in simgenotype
mlamkin7 Jan 31, 2023
91aa7cc
Updated formatting
mlamkin7 Jan 31, 2023
84f3938
Merge branch 'main' of https://github.com/CAST-genomics/haptools into…
mlamkin7 Feb 8, 2023
e3f6160
Added new subcommand clump
mlamkin7 Feb 8, 2023
eee218e
Preliminary clump updates for SNPs
mlamkin7 Feb 16, 2023
fc979b4
Interim updates
mlamkin7 Feb 23, 2023
a11a6e3
Merge branch 'main' of https://github.com/CAST-genomics/haptools into…
mlamkin7 Feb 23, 2023
12de9b0
Completed SNP clumping
mlamkin7 Feb 28, 2023
2b29939
SNPs completed
mlamkin7 Mar 6, 2023
708c470
Completed SNP clumping for MLS and pearson R
mlamkin7 Mar 15, 2023
276572c
Merge branch 'main' of https://github.com/CAST-genomics/haptools into…
mlamkin7 Mar 15, 2023
27fef95
Clump STR implementation
mlamkin7 Mar 20, 2023
0875e72
Merge branch 'main' of https://github.com/CAST-genomics/haptools into…
mlamkin7 Mar 20, 2023
0da6c79
Changed MLS solution to exact solution
mlamkin7 Mar 20, 2023
6ce082c
Fixed comments
mlamkin7 Mar 22, 2023
dc224e7
Merge branch 'main' of https://github.com/CAST-genomics/haptools into…
mlamkin7 Mar 22, 2023
dd561db
Merge branch 'main' of https://github.com/CAST-genomics/haptools into…
mlamkin7 Mar 27, 2023
8fdfa3b
Merge branch 'main' of https://github.com/CAST-genomics/haptools into…
mlamkin7 Apr 3, 2023
d9ba53f
Merged simphenotypeTR
mlamkin7 Apr 25, 2023
46a6264
Completed Tests
mlamkin7 May 2, 2023
fd91011
Removed extraneous debug message
mlamkin7 May 2, 2023
6c62b75
Added clump to Docs
mlamkin7 May 2, 2023
684b5b3
Removed debug line in docs
mlamkin7 May 2, 2023
53bad78
Fixed Black formatting
mlamkin7 May 2, 2023
852a435
Removed scipy and used numpy instead
mlamkin7 May 2, 2023
9d84d91
Added clump to toctree
mlamkin7 May 2, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions docs/api/haptools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -129,3 +129,11 @@ haptools.index module
:members:
:undoc-members:
:show-inheritance:

haptools.clump module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.clump
:members:
:undoc-members:
:show-inheritance:
108 changes: 108 additions & 0 deletions docs/commands/clump.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
.. _commands-clump:


clump
=====

Clump a set of variants specified as a :doc:`.hap file </formats/genotypes>`.

The ``clump`` command creates a clump file joining SNPs or STRs in LD with one another.

Usage
~~~~~
.. code-block:: bash

haptools clump \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
--summstats-snps PATH \
--gts-snps PATH \
--summstats-strs PATH \
--gts-strs PATH \
--clump-field TEXT \
--clump-id-field TEXT \
--clump-chrom-field TEXT \
--clump-pos-field TEXT \
--clump-p1 FLOAT \
--clump-p2 FLOAT \
--clump-r2 FLOAT \
--clump-kb FLOAT \
--ld [Exact|Pearson] \
--out PATH

Examples
~~~~~~~~
.. code-block:: bash

haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--out test_snps.clump

You can use ``--ld [Exact|Pearson]`` to indicate which type of LD calculation you'd like to perform. ``Exact`` utilizes an exact cubic solution adopted from `CubeX <https://github.com/t0mrg/cubex>`_ whereas ``Pearson`` utilizes a Pearson R calculation. Note ``Exact`` only works on SNPs and not any other variant type eg. STRs. The default value is ``Pearson``.

.. code-block:: bash

haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Exact \
--out test_snps.clump

You can modify thresholds and values used in the clumping process. ``--clump-p1`` is the largest value of a p-value to consider being an index variant for a clump. ``--clump-p2`` dictates the maximum p-value any variant can have to be considered when clumping. ``--clump-r2`` is the R squared threshold where being greater than this value implies the candidate variant is in LD with the index variant. ``--clump-kb`` is the maximum distance upstream or downstream from the index variant to collect candidate variants for LD comparison. For example, ``--clump-kb 100`` implies all variants 100 Kb upstream and 100 Kb downstream from the variant will be considered.

.. code-block:: bash

haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--clump-p1 0.001 \
--clump-p2 0.05 \
--clump-r2 0.7 \
--clump-kb 200.5 \
--out test_snps.clump

You can also input STRs when calculating clumps. They can be used together with SNPs or alone.

.. code-block:: bash

haptools clump \
--summstats-strs tests/data/test_strstats.linear \
--gts-strs tests/data/simple_tr.vcf \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Exact \
--out test_snps.clump

.. code-block:: bash

haptools clump \
--summstats-strs tests/data/test_strstats.linear \
--gts-strs tests/data/simple_tr.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Exact \
--out test_snps.clump

All files used in these examples are described :doc:`here </project_info/example_files>`.


Detailed Usage
~~~~~~~~~~~~~~

.. click:: haptools.__main__:main
:prog: haptools
:show-nested:
:commands: clump
3 changes: 3 additions & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@ Commands

* :doc:`haptools index </commands/index>`: Sort, compress, and index our custom file format for haplotypes.

* :doc:`haptools clump </commands/clump>`: Convert variants in LD with one another into clumps.

* :doc:`haptools ld </commands/ld>`: Compute Pearson's correlation coefficient between a target haplotype and a set of haplotypes.

.. figure:: https://drive.google.com/uc?id=1c0i_Hjms7579s24zRsKp5yMs7BxNHed_
Expand Down Expand Up @@ -91,6 +93,7 @@ There is an option to *Cite this repository* on the right sidebar of `the reposi
commands/karyogram.rst
commands/transform.rst
commands/index.rst
commands/clump.rst
commands/ld.rst

.. toctree::
Expand Down
124 changes: 124 additions & 0 deletions haptools/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -883,6 +883,130 @@ def index(
index_haps(haplotypes, sort, output, log)


@main.command(short_help="Clump summary stat files.")
@click.option(
"--summstats-snps",
type=click.Path(path_type=Path),
help="File to load snps summary statistics",
)
@click.option(
"--summstats-strs",
type=click.Path(path_type=Path),
help="File to load strs summary statistics",
)
@click.option(
"--gts-snps", type=click.Path(path_type=Path), help="SNP genotypes (VCF or PGEN)"
)
@click.option("--gts-strs", type=click.Path(path_type=Path), help="STR genotypes (VCF)")
@click.option(
"--clump-p1", type=float, default=0.0001, help="Max pval to start a new clump"
)
@click.option(
"--clump-p2", type=float, default=0.01, help="Filter for pvalue less than"
)
@click.option(
"--clump-id-field", type=str, default="SNP", help="Column header of the variant ID"
)
@click.option(
"--clump-field", type=str, default="P", help="Column header of the p-values"
)
@click.option(
"--clump-chrom-field",
type=str,
default="CHR",
help="Column header of the chromosome",
)
@click.option(
"--clump-pos-field", type=str, default="POS", help="Column header of the position"
)
@click.option(
"--clump-kb",
type=float,
default=250,
help="clump kb radius",
)
@click.option(
"--clump-r2",
type=float,
default=0.5,
help="r^2 threshold",
)
@click.option(
"--ld",
type=click.Choice(["Exact", "Pearson"]),
default="Pearson",
show_default=True,
help=(
"Calculation type to infer LD, Exact Solution or "
"Pearson R. (Exact|Pearson). Note the Exact Solution "
"works best when all three genotypes are present (0,1,2) in "
"the variants being compared."
),
)
@click.option(
"--out",
type=click.Path(path_type=Path),
required=True,
help="Output filename",
)
@click.option(
"-v",
"--verbosity",
type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]),
default="INFO",
show_default=True,
help="The level of verbosity desired",
)
def clump(
summstats_snps: Path,
summstats_strs: Path,
gts_snps: Path,
gts_strs: Path,
clump_p1: float,
clump_p2: float,
clump_id_field: str,
clump_field: str,
clump_chrom_field: str,
clump_pos_field: str,
clump_kb: float,
clump_r2: float,
ld: str,
out: Path,
verbosity: str = "CRITICAL",
):
"""
Performs clumping on datasets with SNPs, SNPs and STRs, and STRs.
Clumping is the process of identifying SNPs or STRs that are highly
correlated with one another and concatenating them all together into
a single "clump" in order to not repeat the same effect size due to
LD.
"""
from .logging import getLogger
from .clump import clumpstr

log = getLogger(name="clump", level=verbosity)
log.debug(f"Loading SNPs from {summstats_snps} {gts_snps}")
log.debug(f"Loading STRs from {summstats_strs} {gts_strs}")

clumpstr(
summstats_snps,
summstats_strs,
gts_snps,
gts_strs,
clump_p1,
clump_p2,
clump_id_field,
clump_field,
clump_chrom_field,
clump_pos_field,
clump_kb,
clump_r2,
ld,
out,
log,
)


if __name__ == "__main__":
# run the CLI if someone tries 'python -m haptools' on the command line
main(prog_name="haptools")
Loading