Using Airflow to implement our ETL pipelines
[TOC]
When these three jobs should be enabled requires manual confirmation (we have to trouble that year's team lead); in principle, we should test them and then enable them before ticket sales start.
gcloud beta compute ssh --zone asia-east1-b data-team --project pycontw-225217 -- -NL 8080:localhost:8080
- Navigate to http://localhost:8080/admin/
- Enable the following Airflow tasks:
KKTIX_DISCORD_BOT_FOR_TEAM_REGISTRATION
: Sends the daily ticket-sales numbers to Discord for the registration team.

KKTIX_TICKET_ORDERS_V3
: Fetches the latest KKTIX data into BigQuery every five minutes; the related Metabase dashboards refresh after each update. It can be turned off once ticket sales end. (A minimal scheduling sketch follows this list.)

KLAVIYO_SEND_MAIL_V3
: Enabled each year when the pre-event e-mails go out; users who have already bought tickets receive the e-mail in batches, and new buyers are also sent it in batches as they purchase.
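For reference, the five-minute cadence mentioned above is just an Airflow schedule_interval. The sketch below is illustrative only, assuming the Airflow 1.10 image used in this repo; the DAG id, task, and callable are placeholders, not the production code.

```python
# Illustrative sketch only: a DAG scheduled every five minutes, similar in spirit to
# KKTIX_TICKET_ORDERS_V3. The task body is a hypothetical placeholder.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import


def _pull_kktix_orders(**context):
    # Placeholder: fetch the latest KKTIX orders and load them into BigQuery.
    print("fetching KKTIX orders...")


with DAG(
    dag_id="KKTIX_TICKET_ORDERS_SKETCH",        # hypothetical id, not the real DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/5 * * * *",            # run every five minutes
    default_args={"retries": 1, "retry_delay": timedelta(minutes=1)},
    catchup=False,
) as dag:
    PythonOperator(
        task_id="pull_kktix_orders",
        python_callable=_pull_kktix_orders,
        provide_context=True,                   # Airflow 1.10 needs this to pass **context
    )
```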
- For the DAG naming convention, please see this article: 阿里巴巴大數據實戰
- Please refer to this article for the naming guideline
- Examples:

ods/opening_crawler
: Crawlers written by @Rain. Those openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy.

ods/survey_cake
: A manually triggered uploader that uploads questionnaires to BigQuery. The uploader should be invoked after we receive the SurveyCake questionnaire. (A minimal sketch of such a manually triggered DAG follows this list.)
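As a rough illustration of what a manually triggered uploader like ods/survey_cake looks like, here is a minimal sketch; the file path, destination table, and DAG id below are hypothetical, and the real DAG in this repo may be structured differently.

```python
# Minimal sketch of a manually triggered uploader, assuming a CSV export and a
# hypothetical destination table; not the actual ods/survey_cake implementation.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from google.cloud import bigquery


def _upload_questionnaire(**context):
    client = bigquery.Client(project="pycontw-225217")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    # Hypothetical file path and table name, for illustration only.
    with open("/tmp/survey_cake.csv", "rb") as f:
        client.load_table_from_file(
            f, "pycontw-225217.ods.ods_survey_cake_example", job_config=job_config
        ).result()  # wait for the load job to finish


with DAG(
    dag_id="SURVEY_CAKE_UPLOADER_SKETCH",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,   # no schedule: runs only when triggered manually
    catchup=False,
) as dag:
    PythonOperator(
        task_id="upload_questionnaire",
        python_callable=_upload_questionnaire,
        provide_context=True,
    )
```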
- Pull the Airflow base image:
docker pull docker.io/apache/airflow:1.10.13-python3.8
- Python dependencies:
virtualenv venv
. venv/bin/activate
pip install poetry
poetry install
- npm dependencies for the linter, formatter, and commit linter (optional):
brew install npm
npm ci
git add <files>
npm run check
: Apply all the linters and formatters

npm run commit
Please use GitLab Flow; otherwise, you cannot pass the Docker Hub CI.
. venv/bin/activate
. ./.env.sh
cd contrib
- Check the corresponding command in contrib/README.md
python xxx.py
Find @davidtnfsh if you don't have those secrets.
⚠ WARNING: About .env
Please don't use the .env for local development, or it might screw up the production tables.
- Build the Docker images:
- Build a production image (for production):
docker build -t davidtnfsh/pycon_etl:prod --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile .
If you want to build the dev/test image, you also need to build this production image first, because the dev/test image is built on top of it. See below.
- Build a dev/test image (for dev/test):
docker build -t davidtnfsh/pycon_etl:test --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile.test .
- Fill in some secrets:
- Run cp .env.template .env.staging for dev/test, or cp .env.template .env.production if you are going to start a production instance.
- Follow the instructions in .env.<staging|production> and fill in your secrets. If you are running the staging instance for development as a sandbox and are not going to access any specific third-party service, leaving .env.staging as-is should be fine.
- Start the Airflow server:
- production:
docker run --log-opt max-size=1m -p 8080:8080 --name airflow -v $(pwd)/dags:/usr/local/airflow/dags -v $(pwd)/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.production davidtnfsh/pycon_etl:prod webserver
- dev/test:
docker run -p 8080:8080 --name airflow -v $(pwd)/dags:/usr/local/airflow/dags -v $(pwd)/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.staging davidtnfsh/pycon_etl:test webserver
- Note that the only differences are the env file name and the image tag.
- Port-forward the compute instance to your local machine and then navigate to http://localhost:8080/admin/:
gcloud beta compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217" -- -NL 8080:localhost:8080
- If port 8080 is already in use, you need to stop the service occupying port 8080 on your local machine first.
- Set up Airflow's Variables and Connections (see the "Add Credentials" steps below).
For Windows users: do not use Windows PowerShell; please use Command Prompt instead. The steps below are the Windows equivalents of the Docker steps above.
Find @davidtnfsh if you don't have those secrets.
⚠ WARNING: About .env
Please don't use the .env for local development, or it might screw up the production tables.
- Build the Docker images:
- Build a production image (for production):
docker build -t davidtnfsh/pycon_etl:prod --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile .
If you want to build the dev/test image, you also need to build this production image first, because the dev/test image is built on top of it. See below.
- Build a dev/test image (for dev/test):
docker build -t davidtnfsh/pycon_etl:test --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile.test .
- Fill in some secrets:
- Run copy .env.template .env.staging for dev/test, or copy .env.template .env.production if you are going to start a production instance.
- Follow the instructions in .env.<staging|production> and fill in your secrets. If you are running the staging instance for development as a sandbox and are not going to access any specific third-party service, leaving .env.staging as-is should be fine.
- Start the Airflow server:
- production:
docker run -p 8080:8080 --name airflow -v "/$(pwd)"/dags:/usr/local/airflow/dags -v "/$(pwd)"/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.production davidtnfsh/pycon_etl:prod webserver
- dev/test:
docker run -p 8080:8080 --name airflow -v "/$(pwd)"/dags:/usr/local/airflow/dags -v "/$(pwd)"/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.staging davidtnfsh/pycon_etl:test webserver
- Note that the only differences are the env file name and the image tag.
- Port-forward the compute instance to your local machine and then navigate to http://localhost:8080/admin/:
gcloud beta compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217" -- -N -L 8080:localhost:8080
- If port 8080 is already in use, you need to stop the service occupying port 8080 on your local machine first.
- Set up GCP authentication: https://googleapis.dev/python/google-api-core/latest/auth.html
- After invoking gcloud auth application-default login, you'll get a credentials JSON that resides in $HOME/.config/gcloud/application_default_credentials.json. Invoke export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json" if you have a key file. (A short Python check follows this list.)
- service-account.json: Please contact @david30907d via email, Telegram, or Discord. Don't worry about this JSON if you are running the sandbox staging instance for development.
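To verify that the credentials are picked up, a quick Python check along these lines should work; the service-account path below is only an example location.

```python
# Quick authentication check. Either rely on Application Default Credentials
# (gcloud auth application-default login / GOOGLE_APPLICATION_CREDENTIALS), or load
# a service-account key explicitly. The key path below is an example, not a fixed location.
from google.cloud import bigquery
from google.oauth2 import service_account

# Option 1: Application Default Credentials.
client = bigquery.Client(project="pycontw-225217")

# Option 2: explicit service-account key file.
credentials = service_account.Credentials.from_service_account_file("./service-account.json")
client = bigquery.Client(project="pycontw-225217", credentials=credentials)

print(list(client.list_datasets()))  # raises an auth error if the credentials are wrong
```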
- Give Toy-Examples a try
- Log in to the data team's server:
gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"
- Services:
- ETL: /home/zhangtaiwei/pycon-etl
- BTW, Metabase is located here: /mnt/disks/data-team-additional-disk/pycontw-infra-scripts/data_team/metabase_server
- Pull the latest codebase to this server:
sudo git pull
- Add Credentials (this only needs to be done once):
- Airflow Connections:
- kktix_api: conn_id=kktix_api; host and extra (header) are confidential since it's KKTIX's private endpoint. Please DM @GTB or the data team's members for these credentials (see the sketch after this list for how a DAG reads them).
- extra: {"Authorization": "bearer xxx"}
- klaviyo_api: conn_id=klaviyo_api; host is https://a.klaviyo.com/api
- Airflow Variables:
- KLAVIYO_KEY: Create from https://www.klaviyo.com/account#api-keys-tab
- KLAVIYO_LIST_ID: Create from https://www.klaviyo.com/lists
- KLAVIYO_CAMPAIGN_ID: Create from https://www.klaviyo.com/campaigns
- kktix_events_endpoint: URL path of KKTIX's hosting_events; ask @GTB for details!
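For reference, DAG code can read the connection and variables above roughly like this. This is a sketch assuming Airflow 1.10-style imports; the connection id and variable names match the list above, but the surrounding code is illustrative.

```python
# Sketch of reading the credentials configured above from inside a DAG or task.
from airflow.hooks.base_hook import BaseHook   # Airflow 1.10-style import
from airflow.models import Variable

kktix = BaseHook.get_connection("kktix_api")
kktix_base_url = kktix.host                    # confidential KKTIX endpoint
kktix_headers = kktix.extra_dejson             # e.g. {"Authorization": "bearer xxx"}

klaviyo_key = Variable.get("KLAVIYO_KEY")
klaviyo_list_id = Variable.get("KLAVIYO_LIST_ID")
klaviyo_campaign_id = Variable.get("KLAVIYO_CAMPAIGN_ID")
kktix_events_endpoint = Variable.get("kktix_events_endpoint")
```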
Please check .github/workflows for details
BigQuery Example:
from google.cloud import bigquery
client = bigquery.Client(project='pycontw-225217')
# Perform a query.
QUERY = '''
SELECT scenario.day2checkin.attr.diet FROM `pycontw-225217.ods.ods_opass_attendee_timestamp`
'''
query_job = client.query(QUERY) # API request
rows = query_job.result() # Waits for query to finish
for row in rows:
    print(row.diet)
- Find the top-k biggest folders:
du -a /var/lib/docker/overlay2 | sort -n -r | head -n 20
- Show the folder size:
du -hs xxxx
- Delete those pretty big folders.
- Check the remaining disk usage:
df -h