Skip to content

Commit

Permalink
feat!: update VCFAnnotator + AlleleTranslator (#265)
Browse files Browse the repository at this point in the history
* Updates `VCFAnnotator` and `AlleleTranslator` with changes done in the `main` branch. 
  *  `VCFAnnotator` changes: 
    * Use click
    * Allow for additional info fields: `VRS_Allele_IDs`, `VRS_Starts`, `VRS_Ends`, `VRS_States`
    * Add docs for how to use the tool
  * `AlleleTranslator` changes
      * Update `gnomad_re` pattern
      * Sets default `fmt` in `translate_from`
      * Validates `_from_gnomad` actual reference sequence matches expected reference sequence
      * Use valid examples in `translate_from` methods docstring
      * `assembly_name` can now be set via kwargs
 * Removes pylint check in GH Actions. Will add back in #262
  • Loading branch information
korikuzma committed Sep 26, 2023
1 parent 1c76bb7 commit 65aee0a
Show file tree
Hide file tree
Showing 32 changed files with 13,767 additions and 807 deletions.
3 changes: 0 additions & 3 deletions .github/workflows/python-cqa.yml
Original file line number Diff line number Diff line change
Expand Up @@ -37,9 +37,6 @@ jobs:
- name: Install dependencies
run: |
pip install -e .[dev,extras]
- name: Pylint static analysis
run: |
pylint src/ga4gh/{core,vrs}
- name: Test with pytest
run: |
python -m pytest
54 changes: 54 additions & 0 deletions docs/extras/vcf_annotator.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# VCF Annotator

The [VCF Annotator tool](../../src/ga4gh/vrs/extras/vcf_annotation.py) provides utility for annotating VCF's with VRS Allele IDs.

## How to use

*Note:\
The examples run from the root of the vrs-python directory and assumes that `input.vcf.gz` lives in the current directory*

To see the help page:

```commandline
python3 -m src.ga4gh.vrs.extras.vcf_annotation --help
```

### Use local SeqRepo Data Proxy with default root directory

The tool uses a SeqRepo data proxy. By default, the local instance at `/usr/local/share/seqrepo/latest` is used.

Example of how to run:

```commandline
python3 -m src.ga4gh.vrs.extras.vcf_annotation --vcf_in input.vcf.gz --vcf_out output.vcf.gz --vrs_pickle_out vrs_objects.pkl
```

`--vcf_in` specifies the path of the input VCF file to annotate. `--vcf_out` specifies the path of the output annotated VCF file. The `--vrs_pickle_out` specifies the path of the output pickle file containing VRS data.

### Use local SeqRepo Data Proxy with different

You can change the root directory of SeqRepo by using `seqrepo_root_dir`.

To use the local SeqRepo data proxy with SeqRepo root directory at `vrs-python/seqrepo/latest`:

```commandline
python3 -m src.ga4gh.vrs.extras.vcf_annotation --vcf_in input.vcf.gz --vcf_out output.vcf.gz --vrs_pickle_out vrs_objects.pkl --seqrepo_root_dir vrs-python/seqrepo/latest
```

### Use the REST SeqRepo Data Proxy with default base url

You can change the data proxy type by using: `--seqrepo_dp_type` (options are `local` or `rest`).

To use the REST SeqRepo data proxy at default url: `http://localhost:5000/seqrepo`:

```commandline
python3 -m src.ga4gh.vrs.extras.vcf_annotation --vcf_in input.vcf.gz --vcf_out output.vcf.gz --vrs_pickle_out vrs_objects.pkl --seqrepo_dp_type rest
```

### Use the REST SeqRepo Data Proxy with different base url
You can change the SeqRepo REST base url by using: `--seqrepo_base_url`.

To use the REST SeqRepo data proxy, at custom url: `http://custom.url:5000/seqrepo`:
```commandline
python3 -m src.ga4gh.vrs.extras.vcf_annotation --vcf_in input.vcf.gz --vcf_out output.vcf.gz --vrs_pickle_out vrs_objects.pkl --seqrepo_dp_type rest --seqrepo_base_url http://custom.url:5000/seqrepo
```
1 change: 1 addition & 0 deletions setup.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,7 @@ extras =
hgvs>=1.4
requests
dill~=0.3.7
click
notebooks =
ipython
jupyter
Expand Down
91 changes: 70 additions & 21 deletions src/ga4gh/vrs/extras/translator.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
"""

from collections.abc import Mapping
from typing import Union
from typing import Union, Tuple
import logging
import re

Expand All @@ -26,6 +26,10 @@
_logger = logging.getLogger(__name__)


class ValidationError(Exception):
"""Class for validation errors during translation"""


class Translator:
"""Translates various variation formats to and from GA4GH VRS models
Expand All @@ -38,7 +42,7 @@ class Translator:
"""

beacon_re = re.compile(r"(?P<chr>[^-]+)\s*:\s*(?P<pos>\d+)\s*(?P<ref>\w+)\s*>\s*(?P<alt>\w+)")
gnomad_re = re.compile(r"(?P<chr>[^-]+)-(?P<pos>\d+)-(?P<ref>\w+)-(?P<alt>\w+)")
gnomad_re = re.compile(r"(?P<chr>[^-]+)-(?P<pos>\d+)-(?P<ref>[ACGTN]+)-(?P<alt>[ACGTN]+|\*|\.)", re.IGNORECASE)
hgvs_re = re.compile(r"[^:]+:[cgnpr]\.")
spdi_re = re.compile(r"(?P<ac>[^:]+):(?P<pos>\d+):(?P<del_len_or_seq>\w*):(?P<ins_seq>\w*)")

Expand Down Expand Up @@ -90,7 +94,7 @@ def _ir_stype(a):
return "g"
return None

def translate_from(self, var, fmt, **kwargs):
def translate_from(self, var, fmt=None, **kwargs):
"""Translate variation `var` to VRS object
If `fmt` is None, guess the appropriate format and return the variant.
Expand All @@ -100,10 +104,17 @@ def translate_from(self, var, fmt, **kwargs):
kwargs:
For CnvTranslator
copies: The number of copies to use. If provided will return a
copies(int): The number of copies to use. If provided will return a
CopyNumberCount
copy_change: Copy change. If not provided, default is efo:0030067 for
deletions and efo:0030070 for duplications
copy_change(models.CopyChange): Copy change. If not provided, default is
efo:0030067 for deletions and efo:0030070 for duplications
For AlleleTranslator
assembly_name (str): Assembly used for `var`. Defaults to the
`default_assembly_name`. Only used for beacon and gnomad.
require_validation (bool): If `True` then validation checks must pass in
order to return a VRS object. A `ValidationError` will be raised if
validation checks fail. If `False` then VRS object will be returned
even if validation checks fail. Defaults to `True`.
"""
if fmt:
try:
Expand Down Expand Up @@ -145,7 +156,7 @@ def _get_refget_accession(self, alias):
alias, namespace="ga4gh"
)
except KeyError:
pass
_logger.error("KeyError when getting refget accession: %s", alias)
else:
if aliases:
refget_accession = aliases[0].split("ga4gh:")[-1]
Expand Down Expand Up @@ -173,6 +184,31 @@ def _from_vrs(self, var):
return None
return model(**var)

def _is_valid_ref_seq(
self, sequence_id: str, start_pos: int, end_pos: int, ref: str
) -> Tuple[bool, str]:
"""Return wether or not the expected reference sequence matches the actual
reference sequence
:param sequence_id: Sequence ID to use
:param start_pos: Start pos (inter-residue) on the sequence_id
:param end_pos: End pos (inter-residue) on the sequence_id
:param ref: The expected reference sequence on the sequence_id given the
start_pos and end_pos
:return: Tuple containing whether or not actual reference sequence matches
the expected reference sequence and error message if mismatch
"""
actual_ref = self.data_proxy.get_sequence(sequence_id, start_pos, end_pos)
is_valid = actual_ref == ref
err_msg = ""
if not is_valid:
err_msg = (
f"Expected reference sequence {ref} on {sequence_id} at positions "
f"({start_pos}, {end_pos}) but found {actual_ref}"
)
_logger.warning(err_msg)
return is_valid, err_msg

class AlleleTranslator(Translator):
"""Class for translating formats to and from VRS Alleles"""

Expand All @@ -195,18 +231,21 @@ def __init__(
"spdi": self._to_spdi,
}

def _from_beacon(self, beacon_expr, assembly_name=None, **kwargs):
def _from_beacon(self, beacon_expr, **kwargs):
"""Parse beacon expression into VRS Allele
#>>> a = tlr.from_beacon("13 : 32936732 G > C")
kwargs:
assembly_name (str): Assembly used for `beacon_expr`.
#>>> a = tlr.from_beacon("19 : 44908822 C > T")
#>>> a.model_dump()
{
'location': {
'end': 32936732,
'start': 32936731,,
'end': 44908822,
'start': 44908821,
'sequenceReference': {
'type': 'SequenceReference',
'refgetAccession': 'SQ._0wi-qoDrvram155UmcSC-zA5ZK4fpLT'
'refgetAccession': 'SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'
},
'type': 'SequenceLocation'
},
Expand All @@ -226,8 +265,7 @@ def _from_beacon(self, beacon_expr, assembly_name=None, **kwargs):
return None

g = m.groupdict()
if assembly_name is None:
assembly_name = self.default_assembly_name
assembly_name = kwargs.get("assembly_name", self.default_assembly_name)
sequence = assembly_name + ":" + g["chr"]
refget_accession = self._get_refget_accession(sequence)
if not refget_accession:
Expand All @@ -247,9 +285,16 @@ def _from_beacon(self, beacon_expr, assembly_name=None, **kwargs):
return allele


def _from_gnomad(self, gnomad_expr, assembly_name=None, **kwargs):
def _from_gnomad(self, gnomad_expr, **kwargs):
"""Parse gnomAD-style VCF expression into VRS Allele
kwargs:
assembly_name (str): Assembly used for `gnomad_expr`.
require_validation (bool): If `True` then validation checks must pass in
order to return a VRS object. A `ValidationError` will be raised if
validation checks fail. If `False` then VRS object will be returned even
if validation checks fail. Defaults to `True`.
#>>> a = tlr.from_gnomad("1-55516888-G-GA")
#>>> a.model_dump()
{
Expand Down Expand Up @@ -278,19 +323,23 @@ def _from_gnomad(self, gnomad_expr, assembly_name=None, **kwargs):
return None

g = m.groupdict()
if assembly_name is None:
assembly_name = self.default_assembly_name
assembly_name = kwargs.get("assembly_name", self.default_assembly_name)
sequence = assembly_name + ":" + g["chr"]
refget_accession = self._get_refget_accession(sequence)
if not refget_accession:
return None

start = int(g["pos"]) - 1
ref = g["ref"]
alt = g["alt"]
ref = g["ref"].upper()
alt = g["alt"].upper()
end = start + len(ref)
ins_seq = alt

# validation checks
valid_ref_seq, err_msg = self._is_valid_ref_seq(sequence, start, end, ref)
if kwargs.get("require_validation", True) and not valid_ref_seq:
raise ValidationError(err_msg)

seq_ref = models.SequenceReference(refgetAccession=refget_accession)
location = models.SequenceLocation(sequenceReference=seq_ref, start=start, end=end)
sstate = models.LiteralSequenceExpression(sequence=ins_seq)
Expand Down Expand Up @@ -627,8 +676,8 @@ def _post_process_imported_cnv(self, copy_number):
expressions = [
"bogus",
"1-55516888-G-GA",
"13 : 32936732 G > C",
"NC_000013.11:g.32936732G>C",
"19 : 44908822 C > T",
"NC_000019.10:g.44908822C>T",
"NM_000551.3:21:1:T", {
"location": {
"end": 22,
Expand Down
Loading

0 comments on commit 65aee0a

Please sign in to comment.