Variation Normalization

Services and guidelines for normalizing variation terms to VRS-compatible representations.

Public OpenAPI endpoint: https://normalize.cancervariants.org/variation

Installing with pip:

pip install variation-normalizer

The variation-normalization repo depends on VRS models, and therefore each variation-normalizer package on PyPI uses a particular version of VRS. The correspondences between packages may be summarized as:

| variation-normalization branch | variation-normalizer version | gene-normalizer version | VRS version |
| --- | --- | --- | --- |
| main | 0.6.X | 0.1.X | 1.X.X |
| staging | >=0.8.X | >=0.3.X | 2.0-alpha |

About

Variation Normalization works in four main steps: tokenization, classification, validation, and translation. During tokenization, we split the input string on whitespace and parse each piece to determine its token type. During classification, we match the ordered tokens against the token orders that each classification allows. During validation, we run checks such as ensuring that the reference nucleotide or amino acid matches the expected value and that the position exists on the given transcript. During translation, we return a VRS Allele object.
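The first two steps can be illustrated with a minimal sketch. This is not the package's implementation: the `Token` class, the `PROTEIN_SUB` pattern, the `known_genes` argument, and the `allowed_orders` table are all hypothetical names introduced here, and validation and translation are omitted because they need the backend data sources described under Backend Services.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical pattern for a protein substitution token such as "V600E":
# reference amino acid, position, alternate amino acid.
PROTEIN_SUB = re.compile(r"^(?P<ref>[A-Z])(?P<pos>\d+)(?P<alt>[A-Z])$")


@dataclass
class Token:
    text: str
    kind: str  # "gene_symbol", "protein_substitution", or "unknown"


def tokenize(query: str, known_genes: set[str]) -> list[Token]:
    """Split the query on whitespace and label each piece with a token type."""
    tokens = []
    for piece in query.split():
        if piece.upper() in known_genes:
            kind = "gene_symbol"
        elif PROTEIN_SUB.match(piece):
            kind = "protein_substitution"
        else:
            kind = "unknown"
        tokens.append(Token(piece, kind))
    return tokens


def classify(tokens: list[Token]) -> str | None:
    """Look up the ordered token types in a table of allowed token orders."""
    # Illustrative table with a single entry; a real table would have many.
    allowed_orders = {
        ("gene_symbol", "protein_substitution"): "protein substitution",
    }
    return allowed_orders.get(tuple(t.kind for t in tokens))


tokens = tokenize("BRAF V600E", known_genes={"BRAF"})
classification = classify(tokens)
```

A query whose token order is not in the table, such as two unknown tokens, gets no classification and would not proceed to validation.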

Variation Normalization is limited to the following types of variants:

HGVS expressions and text representations (ex: BRAF V600E):
- protein (p.): substitution, deletion, insertion, deletion-insertion
- coding DNA (c.): substitution, deletion, insertion, deletion-insertion
- genomic (g.): substitution, deletion, ambiguous deletion, insertion, deletion-insertion, duplication
gnomAD-style VCF (chr-pos-ref-alt, ex: 7-140753336-A-T)
- genomic (g.): substitution, deletion, insertion
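As an illustration of the gnomAD-style VCF format (chr-pos-ref-alt), the sketch below splits such a string into its fields. The `GNOMAD_VCF` pattern and `parse_gnomad_vcf` function are hypothetical and written for this example only. The rule that infers the variant type from the allele lengths is a simplifying assumption based on the usual VCF convention of a shared leading base, not the normalizer's own logic.

```python
import re

# Hypothetical pattern: optional "chr" prefix, chromosome 1-22, X or Y,
# a position, then reference and alternate alleles.
GNOMAD_VCF = re.compile(
    r"^(?:chr)?(?P<chrom>[1-9]|1\d|2[0-2]|X|Y)"
    r"-(?P<pos>\d+)-(?P<ref>[ACGT]+)-(?P<alt>[ACGT]+)$",
    re.IGNORECASE,
)


def parse_gnomad_vcf(query):
    """Return the parsed fields as a dict, or None if the format does not match."""
    match = GNOMAD_VCF.match(query.strip())
    if match is None:
        return None
    ref = match["ref"].upper()
    alt = match["alt"].upper()
    # Assumption: infer the variant type from allele lengths alone.
    if len(ref) == 1 and len(alt) == 1:
        kind = "substitution"
    elif len(ref) > len(alt) and ref.startswith(alt):
        kind = "deletion"
    elif len(alt) > len(ref) and alt.startswith(ref):
        kind = "insertion"
    else:
        kind = "other"
    return {
        "chrom": match["chrom"].upper(),
        "pos": int(match["pos"]),
        "ref": ref,
        "alt": alt,
        "kind": kind,
    }


parsed = parse_gnomad_vcf("7-140753336-A-T")
```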

Variation Normalizer accepts input from GRCh37 or GRCh38 assemblies.

We are working towards adding more types of variations, coordinates, and representations.

Endpoints

`/to_vrs`

Returns a list of validated VRS Variations.

`/normalize`

Returns a VRS Variation aligned to the prioritized transcript. The Variation Normalizer relies on Common Operations On Lots-of Sequences Tool (cool-seq-tool) for retrieving the prioritized transcript data. More information on the transcript selection algorithm can be found here.

If a genomic variation query includes a gene (e.g., BRAF g.140753336A>T), the associated cDNA representation will be returned. This is because the gene provides additional strand context. If a genomic variation query does not include a gene, the GRCh38 representation will be returned.
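A request URL for either endpoint can be built with the standard library, as in the sketch below. The base URL is the public endpoint listed at the top of this README. The query parameter name `q` is an assumption, so check the OpenAPI docs for the exact parameter names. `build_query_url` is a hypothetical helper, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://normalize.cancervariants.org/variation"


def build_query_url(endpoint, variation):
    """Build a GET URL for an endpoint such as "to_vrs" or "normalize"."""
    # ASSUMPTION: the variation string is passed in a parameter named "q".
    query_string = urlencode({"q": variation}, quote_via=quote)
    return f"{BASE_URL}/{endpoint.strip('/')}?{query_string}"


normalize_url = build_query_url("normalize", "BRAF V600E")
to_vrs_url = build_query_url("/to_vrs", "7-140753336-A-T")
```

The resulting URL can be opened in a browser or passed to any HTTP client.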

Developer Instructions

Clone the repo:

git clone https://github.com/cancervariants/variation-normalization.git
cd variation-normalization

For a development install, we recommend using Pipenv. See the pipenv docs for directions on installing pipenv in your compute environment.

Once installed, from the project root dir, just run:

pipenv shell
pipenv update && pipenv install --dev

Backend Services

Variation Normalization relies on some local data caches which you will need to set up. It uses pipenv to manage its environment, which you will also need to install.

Gene Normalizer

Variation Normalization relies on data from Gene Normalization. You must load all sources and merged concepts.

You must also have Gene Normalization's DynamoDB running in a separate terminal for the application to work.

For more information about the gene-normalizer and how to load the database, visit the README.

SeqRepo

Variation Normalization relies on seqrepo, which you must download yourself.

Variation Normalizer uses seqrepo to retrieve sequences at given positions on a transcript.

From the root directory:

pip install seqrepo
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2021-01-29  # Replace with latest version using `seqrepo list-remote-instances` if outdated

If you get an error similar to the one below:

PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29'

Move the temporary directory into place manually. The temporary suffix (._fkuefgd above) may differ, so use the path from your own error message:

sudo mv /usr/local/share/seqrepo/2021-01-29._fkuefgd /usr/local/share/seqrepo/2021-01-29
exit

Use the SEQREPO_ROOT_DIR environment variable to set the path of an already existing SeqRepo directory. The default is /usr/local/share/seqrepo/latest.
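The lookup described above amounts to reading an environment variable with a fallback. The sketch below shows one way to do that. `resolve_seqrepo_root` is a hypothetical helper written for this example, not a function from the package.

```python
import os
from pathlib import Path

# Default path stated in this README.
DEFAULT_SEQREPO_ROOT = "/usr/local/share/seqrepo/latest"


def resolve_seqrepo_root(environ=None):
    """Return SEQREPO_ROOT_DIR if it is set, otherwise the default path."""
    env = os.environ if environ is None else environ
    return Path(env.get("SEQREPO_ROOT_DIR", DEFAULT_SEQREPO_ROOT))


# Passing a mapping explicitly keeps the example independent of the real environment.
default_root = resolve_seqrepo_root({})
custom_root = resolve_seqrepo_root({"SEQREPO_ROOT_DIR": "/data/seqrepo/2021-01-29"})
```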

UTA

Variation Normalizer also uses Common Operations On Lots-of Sequences Tool (cool-seq-tool) which uses UTA as the underlying PostgreSQL database.

The following commands will likely need modification appropriate for the installation environment.

Install PostgreSQL

Create user and database.

createuser -U postgres uta_admin
createuser -U postgres anonymous
createdb -U postgres -O uta_admin uta

To install locally, from the variation/data directory:

export UTA_VERSION=uta_20210129.pgd.gz
curl -O http://dl.biocommons.org/uta/$UTA_VERSION
gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433

UTA Installation Issues

If you have trouble installing UTA, you can visit these two READMEs.

Connecting to the UTA database

To connect to the UTA database, you can use the default URL (postgresql://uta_admin@localhost:5433/uta/uta_20210129). If you do not wish to use the default, you must set the environment variable UTA_DB_URL, which has the format driver://user:pass@host:port/database/schema.
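To check that a UTA_DB_URL value follows the driver://user:pass@host:port/database/schema format, it can be split into its parts with the standard library. The sketch below uses a hypothetical `parse_uta_db_url` helper. It only illustrates the URL layout and is not how cool-seq-tool itself reads the setting.

```python
from urllib.parse import urlsplit


def parse_uta_db_url(url):
    """Split driver://user:pass@host:port/database/schema into named parts."""
    parts = urlsplit(url)
    path_parts = parts.path.strip("/").split("/")
    # The path must contain exactly two non-empty segments: database and schema.
    if len(path_parts) != 2 or not all(path_parts):
        raise ValueError("expected a path of the form /database/schema")
    database, schema = path_parts
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "password": parts.password,  # None when the URL carries no password
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "schema": schema,
    }


default_parts = parse_uta_db_url("postgresql://uta_admin@localhost:5433/uta/uta_20210129")
```

A URL that leaves out the schema segment raises `ValueError`.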

Starting the Variation Normalization Service Locally

Gene Normalizer's DynamoDB and the UTA database must be running.

To start the service, run the following:

uvicorn variation.main:app --reload

Next, view the OpenAPI docs on your local machine: http://127.0.0.1:8000/variation

Init coding style tests

Code style is managed by Ruff and checked prior to commit.

Check style with ruff:

python3 -m ruff format . && python3 -m ruff check --fix .

We use pre-commit to run conformance tests.

This ensures:

- Style correctness
- No large files
- No AWS credentials are committed
- No private keys are committed

Pre-commit must be installed before your first commit. Use the following command:

pre-commit install

Testing

From the root directory of the repository:

pytest tests/
