This repository contains a data engineering solution for analyzing drug mentions across scientific publications, including PubMed articles and clinical trials. The project processes multiple data sources to generate a comprehensive graph showing relationships between drugs, publications, and journals.
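The output is easiest to picture as one JSON entry per drug, linking it to the publications and journals that mention it. A minimal sketch of what one entry might look like (field names and values are illustrative, not the exact schema):

```python
# Illustrative shape of one output graph entry (hypothetical field names).
drug_graph_entry = {
    "drug": "tetracycline",
    "mentions": [
        {
            "source": "pubmed",
            "title": "Tetracycline Resistance Patterns in Clinical Isolates",
            "journal": "Journal of Antimicrobial Chemotherapy",
            "date": "2020-01-01",
        },
        {
            "source": "clinical_trials",
            "title": "Use of Tetracycline in Acne Vulgaris",
            "journal": "Journal of Dermatology",
            "date": "2020-01-02",
        },
    ],
}
```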
```
.
├── README.md                       # This file
├── .github/                        # GitHub Actions workflows
│   └── workflows/
│       ├── build-push.yml          # Build and push Docker image
│       └── deploy-dag.yml          # Deploy DAG to Cloud Composer
├── dags/                           # Airflow DAG definitions
│   └── dag_servier_drug_graph.py
├── drugs_graph/                    # Main application package
│   ├── app/                        # Application source code
│   ├── Dockerfile                  # Container definition
│   ├── poetry.lock                 # Dependencies lock file
│   └── pyproject.toml              # Project configuration
└── sql/                            # SQL analysis queries
    ├── sales_by_day.sql
    └── sales_by_product_type.sql
```
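The DAG in `dags/dag_servier_drug_graph.py` presumably schedules the containerized pipeline built from `drugs_graph/Dockerfile`. A minimal sketch of such a DAG, where the operator choice, image path, and schedule are all assumptions rather than the actual implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical image path; the real image is built and pushed by build-push.yml.
IMAGE = "europe-west1-docker.pkg.dev/my-project/drugs-graph/drugs-graph:latest"

with DAG(
    dag_id="dag_servier_drug_graph",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Run the containerized drugs_graph pipeline as a single pod task.
    build_graph = KubernetesPodOperator(
        task_id="build_drug_graph",
        name="build-drug-graph",
        image=IMAGE,
    )
```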
The project uses GitHub Actions for automated CI/CD pipelines, integrating with Google Cloud Platform services.
GCP Workload Identity Federation is assumed to be already configured in the GCP projects, and the GitHub repository secrets are assumed to be properly set up with the following variables:
```
GCP_PROJECT_ID              # Google Cloud Project ID
WORKLOAD_IDENTITY_PROVIDER  # GCP Workload Identity Provider
SERVICE_ACCOUNT_EMAIL       # GCP Service Account Email
COMPOSER_DAG_BUCKET         # Cloud Composer DAG Bucket
```
- Build and Push to Artifact Registry: triggered by a push to the `main` branch (excluding `.md` files) or manually.
- Deploy DAG to Cloud Composer: triggered by a push to the `main` branch touching `dags/**` files, or manually; the upload step is sketched below.
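The deploy step boils down to copying the DAG file into the Composer DAG bucket. A sketch of the equivalent upload in Python with the `google-cloud-storage` client (the actual workflow may use `gcloud` or `gsutil` instead; the bucket name comes from the `COMPOSER_DAG_BUCKET` secret):

```python
import os

from google.cloud import storage


def deploy_dag(dag_path: str = "dags/dag_servier_drug_graph.py") -> None:
    """Upload the DAG file to the Cloud Composer DAG bucket."""
    bucket_name = os.environ["COMPOSER_DAG_BUCKET"]  # set from the repo secret
    bucket = storage.Client().bucket(bucket_name)
    # Composer picks up DAGs from the bucket's dags/ prefix.
    blob = bucket.blob(f"dags/{os.path.basename(dag_path)}")
    blob.upload_from_filename(dag_path)


if __name__ == "__main__":
    deploy_dag()
```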
- Segregated environment configurations:
  - `.env_dev` for development
  - `.env_stg` for staging
  - `.env_prd` for production
- Each environment has isolated resources and configurations (see the loading sketch below)
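A minimal sketch of how the application might select the right file at startup, assuming `python-dotenv` and an `ENV` variable (both are assumptions, not confirmed by this repository):

```python
import os

from dotenv import load_dotenv

# Map a short environment name to its dedicated config file.
ENV_FILES = {"dev": ".env_dev", "stg": ".env_stg", "prd": ".env_prd"}

env = os.environ.get("ENV", "dev")
load_dotenv(ENV_FILES[env])  # loads variables without overriding existing ones
```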
Deployment strategies:

- `develop` → Development Environment
- `staging` → Staging Environment
- `prod` → Production Environment

- Merge Requests trigger deployments to their respective environments
- Automated testing at each stage
- Manual validation gates between environments
```
main branch:
├── Merge Request → Dev Environment
├── Merge         → Staging Environment
└── Tag           → Production Environment
```
- Tags follow Semantic Versioning (MAJOR.MINOR.PATCH)
- Automated regression testing
- Security scanning at each stage
- Automated test suites:
  - Unit tests (example below)
  - Integration tests
  - End-to-end tests
- Security scanning:
  - Dependency vulnerabilities
  - Code quality metrics
  - Container scanning
- Performance testing at scale
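As an example of the unit-test level, a pytest-style sketch checking drug-mention matching; the `find_mentions` helper is hypothetical and named here only for illustration:

```python
def find_mentions(drug: str, titles: list[str]) -> list[str]:
    """Hypothetical helper: return the titles that mention the drug."""
    return [t for t in titles if drug.lower() in t.lower()]


def test_find_mentions_is_case_insensitive():
    titles = ["Tetracycline Resistance Patterns", "Unrelated study"]
    assert find_mentions("TETRACYCLINE", titles) == ["Tetracycline Resistance Patterns"]


def test_find_mentions_returns_empty_when_absent():
    assert find_mentions("aspirin", ["Unrelated study"]) == []
```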
PySpark is suitable for computation-intensive processing:

- Benefits:
  - Distributed data processing
  - Memory-efficient operations
  - Native scalability
- Implementation:
  - Convert pandas operations to PySpark (see the sketch below)
  - Deploy on high-capacity virtual machines
  - Configure resource allocation
  - Implement a partitioning strategy
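For example, a pandas group-by translates almost mechanically to the PySpark DataFrame API, and the partitioning strategy can be expressed at write time. A sketch assuming illustrative column names and paths:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("drugs-graph").getOrCreate()

# Hypothetical input path; pandas equivalent:
#   mentions.groupby("drug")["journal"].nunique()
mentions = spark.read.parquet("gs://my-bucket/mentions/")

journals_per_drug = mentions.groupBy("drug").agg(
    F.countDistinct("journal").alias("n_journals")
)

# Partitioning strategy: repartition on the grouping key before heavy stages,
# and write partitioned output so downstream reads can prune files.
(
    journals_per_drug.repartition("drug")
    .write.mode("overwrite")
    .partitionBy("drug")
    .parquet("gs://my-bucket/journals_per_drug/")
)
```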
BigQuery (with dbt) is ideal for large-scale data warehousing:

- Architecture (load step sketched below):

  ```
  Google Cloud Storage
  └── Raw Data
      ├── Input Processing (BigQuery)
      ├── Transformation (dbt)
      └── Output Storage (GCS)
  ```

- Features:
  - Serverless processing
  - Cost-effective storage
  - Automated scaling
  - Built-in monitoring
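A sketch of the input-processing step with the `google-cloud-bigquery` client, loading raw files from GCS into a staging table that dbt models can then transform (project, dataset, table, and URI are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Load raw CSV files from GCS into a staging table (placeholder names throughout).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/pubmed/*.csv",
    "my-project.drugs_graph_staging.pubmed",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```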
- Extended test coverage:
  - Data quality validation (example below)
  - Schema evolution testing
  - Performance benchmarking
  - Scalability testing
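Data quality validation, for instance, can start with simple invariant checks on the pipeline output (column names and invariants are illustrative):

```python
import pandas as pd


def validate_mentions(df: pd.DataFrame) -> None:
    """Raise if the mentions table violates basic quality invariants."""
    if df.empty:
        raise ValueError("mentions table is empty")
    if df["drug"].isna().any():
        raise ValueError("null drug names found")
    if df.duplicated(["drug", "title", "journal"]).any():
        raise ValueError("duplicate mentions found")
```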
- Monitoring (a logging sketch follows below):
  - Processing metrics
  - Resource utilization
  - Error rates
  - Data quality metrics
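These can start as structured log lines emitted by the pipeline itself, which log-based alerting can then consume. A minimal sketch with illustrative metric names:

```python
import json
import logging
import time

logger = logging.getLogger("drugs_graph.metrics")


def log_run_metrics(rows_in: int, rows_out: int, errors: int, started: float) -> None:
    """Emit one structured metrics record per pipeline run."""
    logger.info(json.dumps({
        "rows_in": rows_in,
        "rows_out": rows_out,
        "error_rate": errors / max(rows_in, 1),
        "duration_s": round(time.time() - started, 2),
    }))
```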
- Resource Management:
  - Dynamic scaling
  - Resource quotas
  - Cost optimization
  - Performance monitoring
- Data Lifecycle:
  - Retention policies (see the sketch below)
  - Archival strategy
  - Backup procedures
  - Disaster recovery
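Retention and archival policies can be enforced directly on the GCS buckets. A sketch with the `google-cloud-storage` client (bucket name and ages are illustrative):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("drugs-graph-output")  # hypothetical bucket

# Move objects to cold storage after 90 days and delete them after 365 days
# (illustrative retention and archival policy).
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=90)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the new lifecycle rules
```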