Added Portal Athena CLI use case
- Moved SQL scripts into Athena doc directory
- Added programmatic access pointers
victorskl committed Aug 11, 2022
1 parent b5c7c4d commit 6ec6ec1
Showing 11 changed files with 250 additions and 4 deletions.
File renamed without changes.
59 changes: 55 additions & 4 deletions docs/athena/README.md
@@ -34,17 +34,68 @@ You need to be part of AWS Power User group i.e. able to assume role `ProdOperat

### Step 6

- At the Database dropdown, select "data_portal"
- Next, highlight the query and click "Run"
![athena6.png](img/athena6.png)

### Step 7

- You can save and load "Named Queries" from the _Saved Queries_ tab.
- Highlight each query statement block (if it spans multiple lines) and click "Run".
- You can download the results via the "Download Results" button.
![athena7.png](img/athena7.png)

## Next

Please have a look some `*.sql` in [Example](../examples) folder.
Please have a look at some of the `*.sql` scripts in this directory.

Most of these scripts should also be available in the Athena Console as saved Named Queries (i.e. under the `Saved queries` tab).

## Data Model

See [../model/data_portal_model.pdf](../model/data_portal_model.pdf)

## CLI

See [README_CLI.md](README_CLI.md)

## Programmatic

### R

- cloudyr
- https://github.com/cloudyr/aws.athena
- Paws
- https://paws-r.github.io/docs/athena/
- RAthena
- https://github.com/DyfanJones/RAthena
- https://www.r-bloggers.com/2019/09/athena-and-r-there-is-another-way/
- noctua
- https://github.com/DyfanJones/noctua
- RStudio/ODBC
- https://db.rstudio.com/databases/athena/
- https://docs.aws.amazon.com/athena/latest/ug/connect-with-odbc.html
- rJava/RJDBC
- https://github.com/s-u/rJava
- https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/
- https://developer.ibm.com/tutorials/athena-goes-r-how-to-handle-s3-object-storage-queries-from-r/
- DBI
- https://dbi.r-dbi.org

### Python

- Boto3
- https://aws-data-wrangler.readthedocs.io/en/stable/index.html
- https://awstip.com/how-to-query-your-s3-data-lake-using-athena-within-an-aws-glue-python-shell-job-491c00af8867

### GA4GH Data Connect API

_i.e. querying in a more genomics-specific, standardised way_

- WIP, see https://github.com/umccr/data-portal-apis/issues/452


## Notes

- Athena is based on [PrestoDB](https://prestodb.io)/[Trino](https://trino.io) query. Not 100% native SQL.
- Athena is Serverless AWS managed service. If no use, no charges. Otherwise, it price at [$5 per TB data scan](https://aws.amazon.com/athena/pricing/?nc=sn&loc=3).
- Athena is based on the [PrestoDB](https://prestodb.io)/[Trino](https://trino.io) query engine. It is not 100% native SQL.
- Athena is a serverless, AWS-managed service. If it is not used, there are no charges; otherwise, it is priced at [$5 per TB of data scanned](https://aws.amazon.com/athena/pricing/?nc=sn&loc=3).
- There are some feature parity gaps between Athena and PrestoSQL, but these mainly affect advanced use cases; most typical analytic queries are supported.
101 changes: 101 additions & 0 deletions docs/athena/README_CLI.md
@@ -0,0 +1,101 @@
# Athena CLI

_Athena query using AWS CLI_

> CONCEPT: Athena is a distributed query engine. Hence, unlike a conventional SQL server request/response, query results are dispatched asynchronously. This is typical for big data query processing. There are 3 parts to the process, from making a query to getting the result, as follows.

See also: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/athena/index.html

## 0. Prelude

```
aws athena list-work-groups --profile prodops
```
```
aws athena list-data-catalogs --profile prodops
```
```
aws athena list-databases --catalog-name data_portal --profile prodops
```
```
aws athena get-database --catalog-name data_portal --database-name data_portal --profile prodops
```
```
aws athena list-named-queries --work-group data_portal --profile prodops
```
```
aws athena get-named-query --named-query-id 1e9c43b3-02e3-4822-9108-f461ac3b42d4 --profile prodops
```

## 1. Start Query Execution

> Consider a use case where we simply wish to extract some data points from the [Portal Pipeline database](../model/data_portal_model.pdf) through an Athena SQL query.

- Here is an example bash script, [athena_cli_start.sh](athena_cli_start.sh), that wraps the AWS CLI Athena subcommand to start a query.

```
sh athena_cli_start.sh > athena_cli_start.json
cat athena_cli_start.json
{
"QueryExecutionId": "f5da9315-76f1-47ee-9c38-bec1b44e60e9"
}
```
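
- As a minimal sketch, you can also capture the `QueryExecutionId` straight into a shell variable for the next step; the trivial `select 1` query string below is just a placeholder for your own SQL (or the heredoc from the script).

```
# Capture the QueryExecutionId into a variable for the polling step (sketch)
QUERY_EXECUTION_ID=$(aws athena start-query-execution \
  --region ap-southeast-2 \
  --profile prodops \
  --query-string "select 1" \
  --work-group data_portal \
  --query-execution-context Database=data_portal,Catalog=data_portal \
  --query 'QueryExecutionId' \
  --output text)

echo "$QUERY_EXECUTION_ID"
```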

## 2. Poll Query Execution Status

```
aws athena get-query-execution --query-execution-id "f5da9315-76f1-47ee-9c38-bec1b44e60e9" --profile prodops
```

- Depending on the query execution status, you may need to poll again. The state in the example output below is `SUCCEEDED`, but it could also be `QUEUED`, `RUNNING` or `FAILED` (see the polling sketch after the example output).

```
{
"QueryExecution": {
"QueryExecutionId": "f5da9315-76f1-47ee-9c38-bec1b44e60e9",
"Query": "<SNIP>",
"StatementType": "DML",
"ResultConfiguration": {
"OutputLocation": "s3://umccr-data-portal-build-prod/athena-query-results/f5da9315-76f1-47ee-9c38-bec1b44e60e9.csv",
"EncryptionConfiguration": {
"EncryptionOption": "SSE_S3"
}
},
"QueryExecutionContext": {
"Database": "data_portal",
"Catalog": "data_portal"
},
"Status": {
"State": "SUCCEEDED",
"SubmissionDateTime": "2022-08-11T01:16:29.701000+10:00",
"CompletionDateTime": "2022-08-11T01:16:37.291000+10:00"
},
"Statistics": {
"EngineExecutionTimeInMillis": 6786,
"DataScannedInBytes": 118029,
"TotalExecutionTimeInMillis": 7590,
"QueryQueueTimeInMillis": 521,
"ServiceProcessingTimeInMillis": 283
},
"WorkGroup": "data_portal",
"EngineVersion": {
"SelectedEngineVersion": "AUTO",
"EffectiveEngineVersion": "Athena engine version 2"
}
}
}
```
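
- If you would rather script the polling than re-run the command by hand, here is a minimal bash sketch (the 2-second sleep interval is an arbitrary choice).

```
# Poll until the query leaves the QUEUED/RUNNING states (sketch)
QUERY_EXECUTION_ID="f5da9315-76f1-47ee-9c38-bec1b44e60e9"

while true; do
  state=$(aws athena get-query-execution \
    --query-execution-id "$QUERY_EXECUTION_ID" \
    --profile prodops \
    --query 'QueryExecution.Status.State' \
    --output text)
  echo "Current state: $state"
  if [ "$state" != "QUEUED" ] && [ "$state" != "RUNNING" ]; then
    break
  fi
  sleep 2
done
```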

## 3. Download Query Result

```
aws s3 cp s3://umccr-data-portal-build-prod/athena-query-results/f5da9315-76f1-47ee-9c38-bec1b44e60e9.csv . --profile prodops
```

- Please note that the Athena result bucket has an S3 lifecycle policy that routinely purges older query results. Please treat it as an ephemeral data store. If this is of concern, please reach out.
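
- Alternatively, for small result sets you can fetch rows directly through the Athena API instead of copying the CSV from S3; a sketch follows (note that `get-query-results` returns paginated JSON, not CSV).

```
# Fetch result rows as paginated JSON via the Athena API (sketch)
aws athena get-query-results \
  --query-execution-id "f5da9315-76f1-47ee-9c38-bec1b44e60e9" \
  --max-results 50 \
  --profile prodops
```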


## Next

See the [Programmatic](./README.md) section in the main README for more advanced usage.
3 changes: 3 additions & 0 deletions docs/athena/athena_cli_start.json
@@ -0,0 +1,3 @@
{
"QueryExecutionId": "f5da9315-76f1-47ee-9c38-bec1b44e60e9"
}
47 changes: 47 additions & 0 deletions docs/athena/athena_cli_start.sh
@@ -0,0 +1,47 @@
#!/usr/bin/env bash

# https://awscli.amazonaws.com/v2/documentation/api/latest/reference/athena/start-query-execution.html#examples

# Usage:
# sh athena_cli_start.sh

AWS_REGION=ap-southeast-2
AWS_PROFILE=prodops
DATA_PORTAL=data_portal

# Paste your Athena SQL query between the EOF markers (bash heredoc syntax)
read -rd '' sqlquery << EOF
select
wfl.id,
wfl.wfr_name,
-- wfl.sample_name,
wfl.type_name,
wfl.wfr_id,
wfl.wfl_id,
wfl.wfv_id,
wfl.version,
-- wfl.input,
wfl.start,
-- wfl.output,
"wfl"."end",
wfl.end_status,
wfl.notified,
wfl.sequence_run_id,
wfl.batch_run_id,
wfl.portal_run_id
from
data_portal.data_portal_workflow as wfl
where
wfl.type_name in ('bcl_convert', 'BCL_CONVERT')
-- and wfl.end_status in ('Succeeded', 'Failed', 'Aborted', 'Started', 'Deleted', 'Deleted;;issue475')
order by id desc;
EOF

# echo "$sqlquery"

aws athena start-query-execution \
--region "$AWS_REGION" \
--profile "$AWS_PROFILE" \
--query-string "$sqlquery" \
--work-group "$DATA_PORTAL" \
--query-execution-context Database="$DATA_PORTAL",Catalog="$DATA_PORTAL"
File renamed without changes.
File renamed without changes.
Binary file added docs/athena/img/athena7.png
File renamed without changes.
File renamed without changes.
44 changes: 44 additions & 0 deletions docs/athena/stats.sql
@@ -0,0 +1,44 @@
-- stats

-- shape
describe data_portal.data_portal_workflow;

-- size
select count(1) as total_wfl_runs from data_portal.data_portal_workflow;

-- all possible distinct workflow run types
select distinct(wfl.type_name) from data_portal.data_portal_workflow as wfl;

-- all possible distinct workflow run end statuses
select distinct(wfl.end_status) from data_portal.data_portal_workflow as wfl;

-- total bcl_convert runs
select count(1) as total_bcl_convert_wfl_runs from data_portal.data_portal_workflow as wfl where wfl.type_name in ('bcl_convert', 'BCL_CONVERT');

-- extract all bcl conversion workflow runs, sorted by id descending
select
wfl.id,
wfl.wfr_name,
wfl.sample_name,
wfl.type_name,
wfl.wfr_id,
wfl.wfl_id,
wfl.wfv_id,
wfl.version,
-- wfl.input,
wfl.start,
-- wfl.output,
"wfl"."end",
wfl.end_status,
wfl.notified,
wfl.sequence_run_id,
wfl.batch_run_id,
wfl.portal_run_id
from
data_portal.data_portal_workflow as wfl
where
wfl.type_name in ('bcl_convert', 'BCL_CONVERT')
-- and wfl.end_status in ('Succeeded', 'Failed', 'Aborted', 'Started', 'Deleted', 'Deleted;;issue475')
order by id desc
-- limit 10
;
