SPDI normalization with refseq files #76

Open: wants to merge 3 commits into `main`
1 change: 1 addition & 0 deletions .env
@@ -1,2 +1,3 @@
UTILITIES_DATA_VERSION=113c119
UTA_DATABASE_SCHEMA=uta_20240523b
PYARD_DATABASE_VERSION=3580
1 change: 1 addition & 0 deletions .github/workflows/cicd.yml
@@ -33,6 +33,7 @@ jobs:
run: ./fetch_utilities_data.sh && python -m pytest
env:
MONGODB_READONLY_PASSWORD: ${{ secrets.MONGODB_READONLY_PASSWORD }}
UTA_DATABASE_URL: ${{ secrets.UTA_DATABASE_URL }}

deploy:
name: Deploy to dev
8 changes: 4 additions & 4 deletions .gitignore
@@ -4,8 +4,8 @@
.pytest_cache
__pycache__
.venv
utilities/FASTA
utilities/mongo_utilities.py
/data
secrets.env
app/temp.py
/data
/seqrepo
/tmp
/utilities/FASTA
3 changes: 0 additions & 3 deletions .vscode/settings.json
@@ -10,9 +10,6 @@
"python.testing.pytestArgs": [
"."
],
"[python]": {
"editor.defaultFormatter": "ms-python.autopep8",
},
"autopep8.args": [
"--max-line-length=200"
],
155 changes: 147 additions & 8 deletions README.md
@@ -42,17 +42,156 @@ The operations return the following status codes:

## Testing

For local development, you will have to create a `secrets.env` file in the root of the repo and add in it the MongoDB
password and the UTA Postgres database connection string (see the UTA section below for details):

```
MONGODB_READONLY_PASSWORD=...
UTA_DATABASE_URL=...
```
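As a sketch of how these secrets might be consumed (a hypothetical helper, not part of the repo), the app could fail fast when a required variable is missing from the environment:

```python
import os


def require_env(name: str) -> str:
    """Return the value of a required environment variable or fail loudly."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Example (assumes secrets.env has been sourced into the environment):
# mongo_password = require_env("MONGODB_READONLY_PASSWORD")
# uta_url = require_env("UTA_DATABASE_URL")
```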

Then, you will need to run `fetch_utilities_data.sh` in a terminal to fetch the required data files:

```shell
$ ./fetch_utilities_data.sh
```

To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code
Testing functionality which should discover them automatically. You can also run `python3 -m pytest` from the terminal
to execute them all.

Additionally, since the tests run against the MongoDB database, if you need to update the test data in this repo, you
can run `OVERWRITE_TEST_EXPECTED_DATA=true python3 -m pytest` from the terminal and then create a pull request with the
changes.

## Heroku Deployment

Currently, there are two environments running in Heroku:
- Dev: <https://fhir-gen-ops-dev-ca42373833b6.herokuapp.com/>
- Prod: <https://fhir-gen-ops.herokuapp.com/>

Pull requests will trigger a deployment to the dev environment automatically after being merged.

The ["Manual Deployment"](https://github.com/FHIR/genomics-operations/actions/workflows/manual_deployment.yml) workflow
can be used to deploy code to either the `dev` or `prod` environments. To do so, please select "Run workflow", ignore
the "Use workflow from" dropdown which lists the branches in the current repo (it cannot be disabled or removed) and
then select the environment, the branch and the repository. By default, the `https://github.com/FHIR/genomics-operations`
repo is specified, but you can replace it with any fork.

Deployments to the prod environment can only be triggered manually from the `main` branch of the repo using the Manual
Deployment workflow.

### Heroku Stack

Make sure that the Python version under [`runtime.txt`](./runtime.txt) is
[supported](https://devcenter.heroku.com/articles/python-support#supported-runtimes) by the
[Heroku stack](https://devcenter.heroku.com/articles/stack) that is currently running in each environment.
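A quick local sanity check (hypothetical snippet; assumes the usual `python-X.Y.Z` format in `runtime.txt`) could compare the declared runtime against the interpreter you are using:

```python
import sys


def matches_runtime(runtime_line: str) -> bool:
    """True when the interpreter matches a runtime.txt line like 'python-3.11.9'."""
    declared = runtime_line.strip().removeprefix("python-")
    current = ".".join(str(part) for part in sys.version_info[:3])
    return declared == current


# with open("runtime.txt") as fh:
#     print(matches_runtime(fh.readline()))
```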

## UTA Database

The Biocommons [hgvs](https://github.com/biocommons/hgvs) library, which we use for variant parsing, validation and
normalisation, requires access to a copy of the [UTA](https://github.com/biocommons/uta) Postgres database.

We have provisioned a Heroku Postgres instance in the Prod environment which contains the imported data from a database
dump as described [here](https://github.com/biocommons/uta#installing-from-database-dumps).

We define a `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file which contains the name of the
currently imported database schema.
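The hgvs UTA data provider expects the schema as the final path component of the connection URL (e.g. `postgresql://user:pass@host/database/schema`). A minimal sketch, assuming the app joins `UTA_DATABASE_URL` and `UTA_DATABASE_SCHEMA` this way (the helper name is hypothetical):

```python
def build_uta_url(base_url: str, schema: str) -> str:
    """Append the UTA schema name to the base Postgres connection string."""
    return f"{base_url.rstrip('/')}/{schema}"


# build_uta_url("postgresql://anonymous:anonymous@uta.biocommons.org/uta",
#               "uta_20240523b")
# → "postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20240523b"
```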

### Database import procedure (takes about 30 minutes)

- Go to the UTA dump download site (http://dl.biocommons.org/uta/) and get the latest `<UTA_SCHEMA>.pgd.gz` file.
- Go to https://dashboard.heroku.com/apps/fhir-gen-ops/resources and click on the "Heroku Postgres" instance (it will
open a new window)
- Go to the Settings tab
- Click "View Credentials"
- Use the fields from this window to fill in the variables below

```shell
$ POSTGRES_HOST="<Heroku Postgres Host>"
$ POSTGRES_DATABASE="<Heroku Postgres Database>"
$ POSTGRES_USER="<Heroku Postgres User>"
$ PGPASSWORD="<Heroku Postgres Password>"
$ UTA_SCHEMA="<UTA Schema>" # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
$ gzip -cdq ${UTA_SCHEMA}.pgd.gz | grep -v '^GRANT USAGE ON SCHEMA .* TO anonymous;$' | grep -v '^ALTER .* OWNER TO uta_admin;$' | psql -U ${POSTGRES_USER} -1 -v ON_ERROR_STOP=1 -d ${POSTGRES_DATABASE} -h ${POSTGRES_HOST} -Eae
```

Note: The `grep -v` commands are required because the Heroku Postgres instance doesn't allow us to create a new role.

Once complete, make sure you update the `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file and commit
it.

### Connection string

The connection string for this database can be found in the same Heroku Postgres Settings tab under "View Credentials".
It is pre-populated in the Heroku runtime under the `UTA_DATABASE_URL` environment variable. Additionally, we set the
same `UTA_DATABASE_URL` environment variable in GitHub so the CI can use this database when running the tests.

For local development, set `UTA_DATABASE_URL` to the Heroku Postgres connection string in the `secrets.env` file.
Alternatively, you can set it to `postgresql://anonymous:anonymous@uta.biocommons.org/uta` if you'd like to use the
public UTA instance.

### Testing the database

```shell
$ source secrets.env
$ pgcli "${UTA_DATABASE_URL}"
> set schema '<UTA Schema>'; # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
> select count(*) from alembic_version
union select count(*) from associated_accessions
union select count(*) from exon
union select count(*) from exon_aln
union select count(*) from exon_set
union select count(*) from gene
union select count(*) from meta
union select count(*) from origin
union select count(*) from seq
union select count(*) from seq_anno
union select count(*) from transcript
union select count(*) from translation_exception;
```

### Update utilities data

The RefSeq metadata from the UTA database needs to be in sync with the RefSeq data which is available for the Seqfetcher
Utility endpoint. Currently, this is stored in GitHub as release artifacts. Similarly, the PyARD SQLite database is also
stored as a release artifact.

To update the RefSeq data and PyARD database, you will have to run `./utilities/pack_utilities_data.py`. Here is a
step-by-step guide on how to do this:

```shell
$ mkdir seqrepo
$ cd seqrepo
$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install setuptools==75.7.0
$ pip install biocommons.seqrepo==0.6.9
$ # See https://github.com/biocommons/biocommons.seqrepo/issues/171 for a bug caused by the builtin
$ # rsync on OSX. This step is OSX-specific; on Linux the standard package managers provide rsync.
$ brew install rsync
$ # Fetch seqrepo data (should take about 16 minutes)
$ seqrepo --rsync-exe /opt/homebrew/bin/rsync -r . pull --update-latest
$ # If you get a "Permission denied" error, you can run the following command (using the temp directory
$ # which got created):
$ # > chmod +w 2024-02-20.r4521u5y && mv 2024-02-20.r4521u5y 2024-02-20 && ln -s 2024-02-20 latest
$
$ # Exit venv and cd to genomics-operations repo.
$
$ # Pack the utilities data (should take about 25 minutes)
$ python ./utilities/pack_utilities_data.py
```
You should see a warning in the output log if the current `PYARD_DATABASE_VERSION` is outdated, and you can change
`PYARD_DATABASE_VERSION` in the `.env` file if you wish to switch to the latest version printed in this log.
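To check the current value without opening the file, a hypothetical helper that reads a variable out of `.env`-style text might look like this:

```python
from typing import Optional


def read_env_var(text: str, name: str) -> Optional[str]:
    """Return the value of a NAME=value line in .env-style text, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(f"{name}="):
            return line.split("=", 1)[1]
    return None


# with open(".env") as fh:
#     print(read_env_var(fh.read(), "PYARD_DATABASE_VERSION"))
```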

Now set a new value for `UTILITIES_DATA_VERSION` in the `.env` file, create a new branch and commit this change to it.
Then create a git tag for this commit with the `UTILITIES_DATA_VERSION` value and push it to GitHub along with the
branch. You can now use this tag to create a new [release](https://github.com/FHIR/genomics-operations/releases).
In this release, attach all the `*.tar.gz` files from the `./tmp` folder which was created after
`pack_utilities_data.py` ran successfully.

Once the release is published, create a PR from this new branch and merge it.

Finally, to validate the new release locally, run `fetch_utilities_data.sh` to recreate the `data` directory (delete it
first if you already have it).
7 changes: 7 additions & 0 deletions app/__init__.py
@@ -3,6 +3,13 @@
from flask_cors import CORS
import os

import hgvs
# Disable the hgvs LRU cache to avoid blowing up memory
# TODO: Revisit this, since this caching might not use a ton of memory.
hgvs.global_config.lru_cache.maxsize = 0
# Disable HGVS strict bounds checks as a workaround for liftover failures: https://github.com/biocommons/hgvs/issues/717
hgvs.global_config.mapping.strict_bounds = False


def create_app():
# App and API
58 changes: 58 additions & 0 deletions app/api_spec.yml
@@ -1330,6 +1330,64 @@ paths:
type: string
example: "NM_001127510.3:c.145A>T"

/utilities/normalize-variant-hgvs:
get:
summary: "Normalize Variant HGVS"
operationId: "app.utilities_endpoints.normalize_variant_hgvs"
tags:
- "Operations Utilities (not part of balloted HL7 Operations)"
responses:
"200":
description: "Returns a normalized variant in both GRCh37 and GRCh38."
content:
application/json:
schema:
type: object
parameters:
- name: variant
in: query
required: true
description: "Variant."
schema:
type: string
example: "NM_021960.4:c.740C>T"

/utilities/seqfetcher/1/sequence/{acc}:
get:
summary: "Seqfetcher"
operationId: "app.utilities_endpoints.seqfetcher"
tags:
- "Operations Utilities (not part of balloted HL7 Operations)"
responses:
"200":
description: "Returns RefSeq subsequence"
content:
text/plain:
schema:
type: string
parameters:
- name: acc
in: path
required: true
description: Accession
schema:
type: string
example: "NC_000001.10"
- name: start
in: query
required: true
description: Subsequence start index
schema:
type: integer
example: 10000
- name: end
in: query
required: true
description: Subsequence end index
schema:
type: integer
example: 10010

/utilities/normalize-hla:
get:
description: >