SPDI normalization with refseq files #76

Open: wants to merge 3 commits into `main`
1 change: 1 addition & 0 deletions .env
@@ -1,2 +1,3 @@
UTILITIES_DATA_VERSION=113c119
UTA_DATABASE_SCHEMA=uta_20240523b
PYARD_DATABASE_VERSION=3580
1 change: 1 addition & 0 deletions .github/workflows/cicd.yml
@@ -33,6 +33,7 @@ jobs:
run: ./fetch_utilities_data.sh && python -m pytest
env:
MONGODB_READONLY_PASSWORD: ${{ secrets.MONGODB_READONLY_PASSWORD }}
UTA_DATABASE_URL: ${{ secrets.UTA_DATABASE_URL }}

deploy:
name: Deploy to dev
8 changes: 4 additions & 4 deletions .gitignore
@@ -4,8 +4,8 @@
.pytest_cache
__pycache__
.venv
utilities/FASTA
utilities/mongo_utilities.py
/data
secrets.env
app/temp.py
/data
/seqrepo
/tmp
/utilities/FASTA
3 changes: 0 additions & 3 deletions .vscode/settings.json
@@ -10,9 +10,6 @@
"python.testing.pytestArgs": [
"."
],
"[python]": {
"editor.defaultFormatter": "ms-python.autopep8",
},
"autopep8.args": [
"--max-line-length=200"
],
155 changes: 147 additions & 8 deletions README.md
@@ -42,17 +42,156 @@ The operations return the following status codes:

## Testing

For local development, you will have to create a `secrets.env` file in the root of the repo and add in it the MongoDB
password and the UTA Postgres database connection string (see the UTA section below for details):

```
MONGODB_READONLY_PASSWORD=...
UTA_DATABASE_URL=...
```
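As a sketch of how these secrets might be consumed (a hypothetical helper, not part of the repo), the app could fail fast when a required variable is missing from the environment:

```python
import os


def require_env(name: str) -> str:
    """Return the value of a required environment variable or fail loudly."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Example (assumes secrets.env has been sourced into the environment):
# mongo_password = require_env("MONGODB_READONLY_PASSWORD")
# uta_url = require_env("UTA_DATABASE_URL")
```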

Then, you will need to run `fetch_utilities_data.sh` in a terminal to fetch the required data files:

```shell
$ ./fetch_utilities_data.sh
```

To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code
Testing functionality which should discover them automatically. You can also run `python3 -m pytest` from the terminal
to execute them all.

Additionally, since the tests run against the MongoDB database, if you need to update the test data in this repo, you
can run `OVERWRITE_TEST_EXPECTED_DATA=true python3 -m pytest` from the terminal and then create a pull request with the
changes.

## Heroku Deployment

Currently, there are two environments running in Heroku:
- Dev: <https://fhir-gen-ops-dev-ca42373833b6.herokuapp.com/>
- Prod: <https://fhir-gen-ops.herokuapp.com/>

Pull requests will trigger a deployment to the dev environment automatically after being merged.

The ["Manual Deployment"](https://github.com/FHIR/genomics-operations/actions/workflows/manual_deployment.yml) workflow
can be used to deploy code to either the `dev` or `prod` environments. To do so, please select "Run workflow", ignore
the "Use workflow from" dropdown which lists the branches in the current repo (it cannot be disabled or removed) and
then select the environment, the branch and the repository. By default, the `https://github.com/FHIR/genomics-operations`
repo is specified, but you can replace it with any fork.

Deployments to the prod environment can only be triggered manually from the `main` branch of the repo using the Manual
Deployment workflow.

### Heroku Stack

Make sure that the Python version under [`runtime.txt`](./runtime.txt) is
[supported](https://devcenter.heroku.com/articles/python-support#supported-runtimes) by the
[Heroku stack](https://devcenter.heroku.com/articles/stack) that is currently running in each environment.
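A quick local sanity check (hypothetical snippet; assumes the usual `python-X.Y.Z` format in `runtime.txt`) could compare the declared runtime against the interpreter you are using:

```python
import sys


def matches_runtime(runtime_line: str) -> bool:
    """True when the interpreter matches a runtime.txt line like 'python-3.11.9'."""
    declared = runtime_line.strip().removeprefix("python-")
    current = ".".join(str(part) for part in sys.version_info[:3])
    return declared == current


# with open("runtime.txt") as fh:
#     print(matches_runtime(fh.readline()))
```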

## UTA Database

The Biocommons [hgvs](https://github.com/biocommons/hgvs) library, which we use for variant parsing, validation and
normalisation, requires access to a copy of the [UTA](https://github.com/biocommons/uta) Postgres database.

We have provisioned a Heroku Postgres instance in the Prod environment which contains the imported data from a database
dump as described [here](https://github.com/biocommons/uta#installing-from-database-dumps).

We define a `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file which contains the name of the
currently imported database schema.
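The hgvs UTA data provider expects the schema as the final path component of the connection URL (e.g. `postgresql://user:pass@host/database/schema`). A minimal sketch, assuming the app joins `UTA_DATABASE_URL` and `UTA_DATABASE_SCHEMA` this way (the helper name is hypothetical):

```python
def build_uta_url(base_url: str, schema: str) -> str:
    """Append the UTA schema name to the base Postgres connection string."""
    return f"{base_url.rstrip('/')}/{schema}"


# build_uta_url("postgresql://anonymous:anonymous@uta.biocommons.org/uta",
#               "uta_20240523b")
# → "postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20240523b"
```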

### Database import procedure (takes about 30 minutes)

- Go to the UTA dump download site (http://dl.biocommons.org/uta/) and get the latest `<UTA_SCHEMA>.pgd.gz` file.
- Go to https://dashboard.heroku.com/apps/fhir-gen-ops/resources and click on the "Heroku Postgres" instance (it will
open a new window)
- Go to the Settings tab
- Click "View Credentials"
- Use the fields from this window to fill in the variables below

```shell
$ POSTGRES_HOST="<Heroku Postgres Host>"
$ POSTGRES_DATABASE="<Heroku Postgres Database>"
$ POSTGRES_USER="<Heroku Postgres User>"
$ PGPASSWORD="<Heroku Postgres Password>"
$ UTA_SCHEMA="<UTA Schema>" # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
$ gzip -cdq ${UTA_SCHEMA}.pgd.gz | grep -v '^GRANT USAGE ON SCHEMA .* TO anonymous;$' | grep -v '^ALTER .* OWNER TO uta_admin;$' | psql -U ${POSTGRES_USER} -1 -v ON_ERROR_STOP=1 -d ${POSTGRES_DATABASE} -h ${POSTGRES_HOST} -Eae
```

Note: The `grep -v` commands are required because the Heroku Postgres instance doesn't allow us to create a new role.

Once complete, make sure you update the `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file and commit
it.

### Connection string

The connection string for this database can be found in the same Heroku Postgres Settings tab under "View Credentials".
It is pre-populated in the Heroku runtime under the `UTA_DATABASE_URL` environment variable. Additionally, we set the
same `UTA_DATABASE_URL` environment variable in GitHub so the CI can use this database when running the tests.

For local development, set `UTA_DATABASE_URL` to the Heroku Postgres connection string in the `secrets.env` file.
Alternatively, you can set it to `postgresql://anonymous:anonymous@uta.biocommons.org/uta` if you'd like to use the
public UTA instance.

### Testing the database

```shell
$ source secrets.env
$ pgcli "${UTA_DATABASE_URL}"
> set schema '<UTA Schema>'; # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
> select count(*) from alembic_version
union select count(*) from associated_accessions
union select count(*) from exon
union select count(*) from exon_aln
union select count(*) from exon_set
union select count(*) from gene
union select count(*) from meta
union select count(*) from origin
union select count(*) from seq
union select count(*) from seq_anno
union select count(*) from transcript
union select count(*) from translation_exception;
```

### Update utilities data

The RefSeq metadata from the UTA database needs to be in sync with the RefSeq data which is available for the Seqfetcher
Utility endpoint. Currently, this is stored in GitHub as release artifacts. Similarly, the PyARD SQLite database is also
stored as a release artifact.

To update the RefSeq data and PyARD database, you will have to run `./utilities/pack_utilities_data.py`. Here is a
step-by-step guide on how to do this:

```shell
$ mkdir seqrepo
$ cd seqrepo
$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install setuptools==75.7.0
$ pip install biocommons.seqrepo==0.6.9
$ # See https://github.com/biocommons/biocommons.seqrepo/issues/171 for a bug caused by the builtin
$ # rsync on OSX. This step is OSX-specific; on Linux the standard package managers provide rsync.
$ brew install rsync
$ # Fetch seqrepo data (should take about 16 minutes)
$ seqrepo --rsync-exe /opt/homebrew/bin/rsync -r . pull --update-latest
$ # If you get a "Permission denied" error, you can run the following command (using the temp directory
$ # which got created):
$ # > chmod +w 2024-02-20.r4521u5y && mv 2024-02-20.r4521u5y 2024-02-20 && ln -s 2024-02-20 latest
$
$ # Exit venv and cd to genomics-operations repo.
$
$ # Pack the utilities data (should take about 25 minutes)
$ python ./utilities/pack_utilities_data.py
```
You should see a warning in the output log if the current `PYARD_DATABASE_VERSION` is outdated, and you can change
`PYARD_DATABASE_VERSION` in the `.env` file if you wish to switch to the latest version printed in this log.
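To check the current value without opening the file, a hypothetical helper that reads a variable out of `.env`-style text might look like this:

```python
from typing import Optional


def read_env_var(text: str, name: str) -> Optional[str]:
    """Return the value of a NAME=value line in .env-style text, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(f"{name}="):
            return line.split("=", 1)[1]
    return None


# with open(".env") as fh:
#     print(read_env_var(fh.read(), "PYARD_DATABASE_VERSION"))
```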

Now set a new value for `UTILITIES_DATA_VERSION` in the `.env` file, create a new branch and commit this change to it.
Then create a git tag for this commit with the `UTILITIES_DATA_VERSION` value and push it to GitHub along with the
branch. You can now use this tag to create a new [release](https://github.com/FHIR/genomics-operations/releases).
In this release, attach all the `*.tar.gz` files from the `./tmp` folder which was created after
`pack_utilities_data.py` ran successfully.

Once the release is published, create a PR from this new branch and merge it.

Finally, to validate the new release locally, run `fetch_utilities_data.sh` to recreate the `data` directory (delete it
first if you already have it).
7 changes: 7 additions & 0 deletions app/__init__.py
@@ -3,6 +3,13 @@
from flask_cors import CORS
import os

import hgvs
# Disable the hgvs LRU cache to avoid blowing up memory
# TODO: Revisit this, since this caching might not use a ton of memory.
hgvs.global_config.lru_cache.maxsize = 0
# Disable HGVS strict bounds checks as a workaround for liftover failures: https://github.com/biocommons/hgvs/issues/717
hgvs.global_config.mapping.strict_bounds = False


def create_app():
# App and API
58 changes: 58 additions & 0 deletions app/api_spec.yml
@@ -1330,6 +1330,64 @@ paths:
type: string
example: "NM_001127510.3:c.145A>T"

/utilities/normalize-variant-hgvs:
get:
summary: "Normalize Variant HGVS"
operationId: "app.utilities_endpoints.normalize_variant_hgvs"
tags:
- "Operations Utilities (not part of balloted HL7 Operations)"
responses:
"200":
description: "Returns a normalized variant in both GRCh37 and GRCh38."
content:
application/json:
schema:
type: object
parameters:
- name: variant
in: query
required: true
description: "Variant."
schema:
type: string
example: "NM_021960.4:c.740C>T"

/utilities/seqfetcher/1/sequence/{acc}:
get:
summary: "Seqfetcher"
operationId: "app.utilities_endpoints.seqfetcher"
tags:
- "Operations Utilities (not part of balloted HL7 Operations)"
responses:
"200":
description: "Returns RefSeq subsequence"
content:
text/plain:
schema:
type: string
parameters:
- name: acc
in: path
required: true
description: Accession
schema:
type: string
example: "NC_000001.10"
- name: start
in: query
required: true
description: Subsequence start index
schema:
type: integer
example: 10000
- name: end
in: query
required: true
description: Subsequence end index
schema:
type: integer
example: 10010

/utilities/normalize-hla:
get:
description: >