Added Portal Athena CLI use case
- Moved SQL scripts into Athena doc directory
- Added programmatic access pointers
victorskl committed Aug 11, 2022
1 parent b5c7c4d commit 6ec6ec1
Showing 11 changed files with 250 additions and 4 deletions.
File renamed without changes.
59 changes: 55 additions & 4 deletions docs/athena/README.md
@@ -34,17 +34,68 @@ You need to be part of AWS Power User group i.e. able to assume role `ProdOperat

### Step 6

- At the Database dropdown, select "data_portal"
- Next, highlight the query and click "Run"
![athena6.png](img/athena6.png)

### Step 7

- You can save and load "Named Queries" from the _Saved Queries_ tab.
- Highlight each query statement block (if it spans multiple lines) and click "Run".
- You can download the results via the "Download Results" button.
![athena7.png](img/athena7.png)

## Next

Please have a look some `*.sql` in [Example](../examples) folder.
Please have a look at some of the `*.sql` scripts in this directory.

Most of these scripts should also be available in the Athena Console as saved Named Queries (i.e. under the `Saved queries` tab).

## Data Model

See [../model/data_portal_model.pdf](../model/data_portal_model.pdf)

## CLI

See [README_CLI.md](README_CLI.md)

## Programmatic

### R

- cloudyr
- https://github.com/cloudyr/aws.athena
- Paws
- https://paws-r.github.io/docs/athena/
- RAthena
- https://github.com/DyfanJones/RAthena
- https://www.r-bloggers.com/2019/09/athena-and-r-there-is-another-way/
- noctua
- https://github.com/DyfanJones/noctua
- RStudio/ODBC
- https://db.rstudio.com/databases/athena/
- https://docs.aws.amazon.com/athena/latest/ug/connect-with-odbc.html
- rJava/RJDBC
- https://github.com/s-u/rJava
- https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
- https://developer.ibm.com/tutorials/athena-goes-r-how-to-handle-s3-object-storage-queries-from-r/
- DBI
- https://dbi.r-dbi.org

### Python

- Boto3
- https://aws-data-wrangler.readthedocs.io/en/stable/index.html
- https://awstip.com/how-to-query-your-s3-data-lake-using-athena-within-an-aws-glue-python-shell-job-491c00af8867

### GA4GH Data Connect API

_i.e. querying in a more genomics-specific, standardised way_

- WIP, see https://github.com/umccr/data-portal-apis/issues/452


## Notes

- Athena is based on [PrestoDB](https://prestodb.io)/[Trino](https://trino.io) query. Not 100% native SQL.
- Athena is Serverless AWS managed service. If no use, no charges. Otherwise, it price at [$5 per TB data scan](https://aws.amazon.com/athena/pricing/?nc=sn&loc=3).
- Athena is based on the [PrestoDB](https://prestodb.io)/[Trino](https://trino.io) query engine. It is not 100% native SQL.
- Athena is a serverless, AWS-managed service. If it is not used, there are no charges; otherwise, it is priced at [$5 per TB of data scanned](https://aws.amazon.com/athena/pricing/?nc=sn&loc=3).
- There are some feature parity gaps between Athena and PrestoSQL, but these mainly affect advanced use cases; most typical analytic queries are supported.
101 changes: 101 additions & 0 deletions docs/athena/README_CLI.md
@@ -0,0 +1,101 @@
# Athena CLI

_Athena query using AWS CLI_

> CONCEPT: Athena is a distributed query engine. Hence, unlike a conventional SQL server request/response, query results are dispatched asynchronously. This is typical for big data query processing. There are 3 parts to the process, from making a query to getting the result, as follows.

See also: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/athena/index.html

## 0. Prelude

```
aws athena list-work-groups --profile prodops
```
```
aws athena list-data-catalogs --profile prodops
```
```
aws athena list-databases --catalog-name data_portal --profile prodops
```
```
aws athena get-database --catalog-name data_portal --database-name data_portal --profile prodops
```
```
aws athena list-named-queries --work-group data_portal --profile prodops
```
```
aws athena get-named-query --named-query-id 1e9c43b3-02e3-4822-9108-f461ac3b42d4 --profile prodops
```

## 1. Start Query Execution

> Consider a use case where we simply wish to extract some data points from the [Portal Pipeline database](../model/data_portal_model.pdf) through an Athena SQL query.

- Here is an example bash script, [athena_cli_start.sh](athena_cli_start.sh), that wraps the AWS CLI Athena subcommand to start a query.

```
sh athena_cli_start.sh > athena_cli_start.json
cat athena_cli_start.json
{
"QueryExecutionId": "f5da9315-76f1-47ee-9c38-bec1b44e60e9"
}
```
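
- As a minimal sketch, you can also capture the `QueryExecutionId` straight into a shell variable for the next step; the trivial `select 1` query string below is just a placeholder for your own SQL (or the heredoc from the script).

```
# Capture the QueryExecutionId into a variable for the polling step (sketch)
QUERY_EXECUTION_ID=$(aws athena start-query-execution \
  --region ap-southeast-2 \
  --profile prodops \
  --query-string "select 1" \
  --work-group data_portal \
  --query-execution-context Database=data_portal,Catalog=data_portal \
  --query 'QueryExecutionId' \
  --output text)

echo "$QUERY_EXECUTION_ID"
```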

## 2. Poll Query Execution Status

```
aws athena get-query-execution --query-execution-id "f5da9315-76f1-47ee-9c38-bec1b44e60e9" --profile prodops
```

- Depending on the query execution status, you may need to poll again. The state in the example output below is `SUCCEEDED`, but it could also be `QUEUED`, `RUNNING` or `FAILED` (see the polling sketch after the example output).

```
{
"QueryExecution": {
"QueryExecutionId": "f5da9315-76f1-47ee-9c38-bec1b44e60e9",
"Query": "<SNIP>",
"StatementType": "DML",
"ResultConfiguration": {
"OutputLocation": "s3://umccr-data-portal-build-prod/athena-query-results/f5da9315-76f1-47ee-9c38-bec1b44e60e9.csv",
"EncryptionConfiguration": {
"EncryptionOption": "SSE_S3"
}
},
"QueryExecutionContext": {
"Database": "data_portal",
"Catalog": "data_portal"
},
"Status": {
"State": "SUCCEEDED",
"SubmissionDateTime": "2022-08-11T01:16:29.701000+10:00",
"CompletionDateTime": "2022-08-11T01:16:37.291000+10:00"
},
"Statistics": {
"EngineExecutionTimeInMillis": 6786,
"DataScannedInBytes": 118029,
"TotalExecutionTimeInMillis": 7590,
"QueryQueueTimeInMillis": 521,
"ServiceProcessingTimeInMillis": 283
},
"WorkGroup": "data_portal",
"EngineVersion": {
"SelectedEngineVersion": "AUTO",
"EffectiveEngineVersion": "Athena engine version 2"
}
}
}
```
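
- If you would rather script the polling than re-run the command by hand, here is a minimal bash sketch (the 2-second sleep interval is an arbitrary choice).

```
# Poll until the query leaves the QUEUED/RUNNING states (sketch)
QUERY_EXECUTION_ID="f5da9315-76f1-47ee-9c38-bec1b44e60e9"

while true; do
  state=$(aws athena get-query-execution \
    --query-execution-id "$QUERY_EXECUTION_ID" \
    --profile prodops \
    --query 'QueryExecution.Status.State' \
    --output text)
  echo "Current state: $state"
  if [ "$state" != "QUEUED" ] && [ "$state" != "RUNNING" ]; then
    break
  fi
  sleep 2
done
```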

## 3. Download Query Result

```
aws s3 cp s3://umccr-data-portal-build-prod/athena-query-results/f5da9315-76f1-47ee-9c38-bec1b44e60e9.csv . --profile prodops
```

- Please note that the Athena result bucket has an S3 lifecycle policy that routinely purges older query results. Please treat it as an ephemeral data store. If this is of concern, please reach out.
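
- Alternatively, for small result sets you can fetch rows directly through the Athena API instead of copying the CSV from S3; a sketch follows (note that `get-query-results` returns paginated JSON, not CSV).

```
# Fetch result rows as paginated JSON via the Athena API (sketch)
aws athena get-query-results \
  --query-execution-id "f5da9315-76f1-47ee-9c38-bec1b44e60e9" \
  --max-results 50 \
  --profile prodops
```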


## Next

See the [Programmatic](./README.md) section in the main README for more advanced usage.
3 changes: 3 additions & 0 deletions docs/athena/athena_cli_start.json
@@ -0,0 +1,3 @@
{
"QueryExecutionId": "f5da9315-76f1-47ee-9c38-bec1b44e60e9"
}
47 changes: 47 additions & 0 deletions docs/athena/athena_cli_start.sh
@@ -0,0 +1,47 @@
#!/usr/bin/env bash

# https://awscli.amazonaws.com/v2/documentation/api/latest/reference/athena/start-query-execution.html#examples

# Usage:
# sh athena_cli_start.sh

AWS_REGION=ap-southeast-2
AWS_PROFILE=prodops
DATA_PORTAL=data_portal

# Paste your Athena SQL query between the EOF markers (bash heredoc syntax)
read -rd '' sqlquery << EOF
select
wfl.id,
wfl.wfr_name,
-- wfl.sample_name,
wfl.type_name,
wfl.wfr_id,
wfl.wfl_id,
wfl.wfv_id,
wfl.version,
-- wfl.input,
wfl.start,
-- wfl.output,
"wfl"."end",
wfl.end_status,
wfl.notified,
wfl.sequence_run_id,
wfl.batch_run_id,
wfl.portal_run_id
from
data_portal.data_portal_workflow as wfl
where
wfl.type_name in ('bcl_convert', 'BCL_CONVERT')
-- and wfl.end_status in ('Succeeded', 'Failed', 'Aborted', 'Started', 'Deleted', 'Deleted;;issue475')
order by id desc;
EOF

# echo "$sqlquery"

aws athena start-query-execution \
--region "$AWS_REGION" \
--profile "$AWS_PROFILE" \
--query-string "$sqlquery" \
--work-group "$DATA_PORTAL" \
--query-execution-context Database="$DATA_PORTAL",Catalog="$DATA_PORTAL"
File renamed without changes.
File renamed without changes.
Binary file added docs/athena/img/athena7.png
File renamed without changes.
File renamed without changes.
44 changes: 44 additions & 0 deletions docs/athena/stats.sql
@@ -0,0 +1,44 @@
-- stats

-- shape
describe data_portal.data_portal_workflow;

-- size
select count(1) as total_wfl_runs from data_portal.data_portal_workflow;

-- all possible distinct workflow run types
select distinct(wfl.type_name) from data_portal.data_portal_workflow as wfl;

-- all possible distinct workflow run end statuses
select distinct(wfl.end_status) from data_portal.data_portal_workflow as wfl;

-- total bcl_convert runs
select count(1) as total_bcl_convert_wfl_runs from data_portal.data_portal_workflow as wfl where wfl.type_name in ('bcl_convert', 'BCL_CONVERT');

-- extract all bcl conversion workflow runs, sorted by id descending
select
wfl.id,
wfl.wfr_name,
wfl.sample_name,
wfl.type_name,
wfl.wfr_id,
wfl.wfl_id,
wfl.wfv_id,
wfl.version,
-- wfl.input,
wfl.start,
-- wfl.output,
"wfl"."end",
wfl.end_status,
wfl.notified,
wfl.sequence_run_id,
wfl.batch_run_id,
wfl.portal_run_id
from
data_portal.data_portal_workflow as wfl
where
wfl.type_name in ('bcl_convert', 'BCL_CONVERT')
-- and wfl.end_status in ('Succeeded', 'Failed', 'Aborted', 'Started', 'Deleted', 'Deleted;;issue475')
order by id desc
-- limit 10
;
