Objective: The goal of this test is to assess your skills in consuming data from an API, transforming and persisting it into a data lake following the medallion architecture with three layers: raw data, curated data partitioned by location, and an analytical aggregated layer.
Instructions:
- API: Use the Open Brewery DB API to fetch data. The API has an endpoint for listing breweries: https://api.openbrewerydb.org/breweries
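  A minimal fetch sketch in Python (a sketch, not the required solution; it assumes the `requests` library and the endpoint's `page`/`per_page` pagination parameters, and that newer API versions serve the same listing under `/v1/breweries`):

  ```python
  import requests

  BASE_URL = "https://api.openbrewerydb.org/breweries"  # newer versions: /v1/breweries

  def fetch_all_breweries(per_page: int = 200) -> list[dict]:
      """Page through the listing endpoint until an empty page comes back."""
      breweries, page = [], 1
      while True:
          resp = requests.get(
              BASE_URL, params={"page": page, "per_page": per_page}, timeout=30
          )
          resp.raise_for_status()  # surface HTTP errors so the orchestrator can retry
          batch = resp.json()
          if not batch:
              break
          breweries.extend(batch)
          page += 1
      return breweries
  ```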
- Orchestration Tool: Choose the orchestration tool of your preference (Airflow, Luigi, Mage, etc.) to build a data pipeline. We're interested in seeing your ability to handle scheduling, retries, and error handling in the pipeline.
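  For example, a minimal Airflow 2.x DAG sketch with daily scheduling, retries, and a failure hook (the task bodies are hypothetical placeholders for your own functions):

  ```python
  from datetime import datetime, timedelta

  from airflow.decorators import dag, task

  @dag(
      schedule="@daily",
      start_date=datetime(2024, 1, 1),
      catchup=False,
      default_args={
          "retries": 3,                         # retry transient API/storage failures
          "retry_delay": timedelta(minutes=5),
          "email_on_failure": True,             # one possible hook for alerting
      },
  )
  def brewery_pipeline():
      @task
      def bronze():
          ...  # fetch raw JSON from the API and persist it as-is

      @task
      def silver():
          ...  # clean, cast, and write parquet partitioned by location

      @task
      def gold():
          ...  # aggregate breweries per type and location

      bronze() >> silver() >> gold()

  brewery_pipeline()
  ```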
- Language: Use the language of your preference for the requests and data transformation. Please include test cases for your code. Python and PySpark are preferred but not mandatory.
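  A sketch of what a test case could look like with pytest (`flatten_brewery` is a hypothetical stand-in for one of your own transformations):

  ```python
  import pytest

  def flatten_brewery(record: dict) -> dict:
      """Hypothetical transformation: keep only the columns the silver layer needs."""
      return {
          "id": record["id"],
          "name": record["name"],
          "brewery_type": record["brewery_type"],
          "state": record["state"],
      }

  def test_flatten_brewery_keeps_expected_columns():
      raw = {"id": "1", "name": "Dojo", "brewery_type": "micro",
             "state": "Texas", "phone": "555-0100"}
      assert flatten_brewery(raw) == {
          "id": "1", "name": "Dojo", "brewery_type": "micro", "state": "Texas",
      }

  def test_flatten_brewery_raises_on_missing_column():
      with pytest.raises(KeyError):
          flatten_brewery({"id": "1"})
  ```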
- Containerization: If you use Docker or Kubernetes for modularization, you'll earn extra points.
- Data Lake Architecture: Your data lake must follow the medallion architecture, with bronze, silver, and gold layers:
  - Bronze Layer: Persist the raw data from the API in its native format, or in any format you find suitable.
  - Silver Layer: Transform the data to a columnar storage format such as Parquet or Delta, and partition it by brewery location. Please explain any other transformations you perform.
  - Gold Layer: Create an aggregated view with the quantity of breweries per type and location.
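  One way the three layers could look in PySpark (a sketch; the lake paths and the choice of `state` as the location column are assumptions, not requirements):

  ```python
  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("breweries-medallion").getOrCreate()

  # Bronze: raw API payload persisted as-is (JSON in this sketch).
  bronze = spark.read.json("lake/bronze/breweries/")

  # Silver: columnar format partitioned by location; this sketch also
  # deduplicates on id and trims the partition column.
  silver = bronze.dropDuplicates(["id"]).withColumn("state", F.trim(F.col("state")))
  silver.write.mode("overwrite").partitionBy("state").parquet("lake/silver/breweries/")

  # Gold: quantity of breweries per type and location.
  gold = (
      spark.read.parquet("lake/silver/breweries/")
           .groupBy("state", "brewery_type")
           .agg(F.count("*").alias("brewery_count"))
  )
  gold.write.mode("overwrite").parquet("lake/gold/breweries_per_type_location/")
  ```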
- Monitoring/Alerting: Describe how you would implement a monitoring and alerting process for this pipeline. Consider data quality issues, pipeline failures, and other potential problems in your response.
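  As one starting point, a data-quality gate could run as its own pipeline task and feed the alerting channel; in this sketch, `send_alert` and the specific checks are hypothetical:

  ```python
  from pyspark.sql import DataFrame, SparkSession

  def check_silver_quality(df: DataFrame) -> list[str]:
      """Return a list of human-readable data-quality violations."""
      issues = []
      total = df.count()
      if total == 0:
          issues.append("silver layer is empty")
      null_ids = df.filter(df["id"].isNull()).count()
      if null_ids:
          issues.append(f"{null_ids} rows with null id")
      dupes = total - df.dropDuplicates(["id"]).count()
      if dupes:
          issues.append(f"{dupes} duplicate ids")
      return issues

  def send_alert(message: str) -> None:
      ...  # hypothetical hook: Slack webhook, e-mail, PagerDuty, etc.

  if __name__ == "__main__":
      spark = SparkSession.builder.appName("dq-check").getOrCreate()
      issues = check_silver_quality(spark.read.parquet("lake/silver/breweries/"))
      if issues:
          send_alert("; ".join(issues))
          # Raising fails the task, so the orchestrator's retry/alert path kicks in.
          raise ValueError(f"Data-quality check failed: {issues}")
  ```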
- Repository: Create a public repository on GitHub with your solution. Document your design choices, trade-offs, and provide clear instructions on how to run your application.
- Cloud Services: If your solution requires any cloud services, please provide instructions on how to set them up. Please do not post any credentials or access keys in your public repository.
- To prepare the environment for this test, create an Azure account or log in to an existing one (https://learn.microsoft.com/en-us/dotnet/azure/create-azure-account).
- In the Azure portal, create a resource group from a template, following https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/template-tutorial-create-first-template?tabs=azure-powershell.
- Use the JSON template files located in this repository (`conf/*.json`).
- Afterward, the test proceeds through the following resources:
- Azure Data Factory (fetches data from the Open Brewery DB REST API and stores it in the Storage Account)
- Azure Data Lake Storage Gen2 (landing zone for raw data)
- Azure Databricks (used to build the medallion architecture: bronze on DBFS, silver as a Delta table, gold as an aggregated view; a sketch follows below)
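  A sketch of that Databricks step (assuming a notebook where `spark` is predefined; the ADLS path, table name, and view name are illustrative):

  ```python
  # Read the raw JSON that Azure Data Factory landed in the storage account.
  raw = spark.read.json(
      "abfss://landing@<storage-account>.dfs.core.windows.net/breweries/"
  )

  # Silver: deduplicated Delta table partitioned by location.
  (raw.dropDuplicates(["id"])
      .write.format("delta")
      .mode("overwrite")
      .partitionBy("state")
      .saveAsTable("silver_breweries"))

  # Gold: aggregated view with the quantity of breweries per type and location.
  spark.sql("""
      CREATE OR REPLACE VIEW gold_breweries_per_type_location AS
      SELECT state, brewery_type, COUNT(*) AS brewery_count
      FROM silver_breweries
      GROUP BY state, brewery_type
  """)
  ```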