This project leverages several technologies to achieve its objectives:
- Python: Used for writing the Directed Acyclic Graphs (DAGs) that orchestrate the data pipeline.
- Airflow: Utilized for creating and managing data pipelines, ensuring automation and scheduling of tasks.
- Hadoop HDFS: Employed for storing and managing large datasets, including the extracted job vacancies.
- Scala Spark: Utilized for processing and analyzing the data from Hadoop HDFS, providing a scalable, distributed data processing framework.
- Scala Akka: Employed for implementing actors and asynchronous requests, enabling efficient, concurrent data processing (a minimal sketch follows this list).
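Since Akka handles the concurrent, asynchronous side of extraction, here is a minimal, hypothetical sketch of how an Akka Typed actor could fan out page-fetching work (assuming the akka-actor-typed dependency). The actor name VacancyFetcher, the FetchPage message, and the page range are illustrative assumptions, not code from this repository:

```scala
import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// Hypothetical actor: each FetchPage message asks for one page of vacancies.
object VacancyFetcher {
  final case class FetchPage(page: Int)

  def apply(): Behavior[FetchPage] = Behaviors.receive { (context, message) =>
    // A real implementation would issue an asynchronous HTTP request to the
    // hh.ru API here and pass the response on for storage in HDFS.
    context.log.info(s"Fetching vacancies page ${message.page}")
    Behaviors.same
  }
}

object FetcherApp extends App {
  val system: ActorSystem[VacancyFetcher.FetchPage] =
    ActorSystem(VacancyFetcher(), "vacancy-fetcher")

  // Sending is non-blocking; the actor processes the messages asynchronously.
  (1 to 5).foreach(page => system ! VacancyFetcher.FetchPage(page))

  // system.terminate() would shut the actor system down once the work is done.
}
```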
The project is organized into the following components and directories:
- docker-hadoop: A custom Docker image for running Hadoop. You can find it at docker-hadoop on GitHub.
- docker-airflow: A customized Docker image for running Apache Airflow. For detailed information, refer to docker-airflow on GitHub.
- src/dags/vacancies_extract_dag.py: Python code for the Airflow DAG responsible for extracting job vacancies from the hh.ru API and storing them in Hadoop HDFS.
- src/scala/CurrencyConverter.scala: Scala code for a data processing task, possibly related to currency conversion (a hypothetical sketch follows this list).
- src/scala/HRActivityAnalysis.scala: Scala code for another data processing task, possibly related to HR activity analysis.
- src/scala/build.sbt: The SBT (Scala Build Tool) configuration file for the Scala project, defining dependencies and build settings.
- src/scala/project/build.properties: Configuration file that pins the SBT version used to build the Scala project.
- src/scala/project/plugins.sbt: Configuration file for the SBT plugins used in the Scala project.
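To give a concrete feel for the Scala side, below is a minimal, hypothetical sketch of what a Spark job along the lines of CurrencyConverter.scala could look like. The HDFS paths, the column names salary_from and salary_currency, and the fixed USD-to-RUB rate are assumptions made for illustration; the real job's schema and rate source may differ:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CurrencyConverter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CurrencyConverter")
      .getOrCreate()

    // Assumed layout: vacancies extracted by the Airflow DAG land in HDFS as JSON.
    val vacancies = spark.read.json("hdfs://namenode:9000/user/airflow/vacancies")

    // Illustrative fixed rate; a real job would more likely load exchange rates
    // from a reference table or an external source.
    val usdToRub = 90.0

    // Normalize salaries to a single currency (RUB) for downstream analysis.
    val normalized = vacancies.withColumn(
      "salary_from_rub",
      when(col("salary_currency") === "USD", col("salary_from") * usdToRub)
        .otherwise(col("salary_from"))
    )

    normalized.write.mode("overwrite")
      .parquet("hdfs://namenode:9000/user/airflow/vacancies_normalized")

    spark.stop()
  }
}
```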
To run this project, follow these steps:
- In the project's root directory, start the Docker containers for Airflow and Hadoop:

      docker-compose -f docker-compose.yaml up -d   # for Airflow
      docker-compose -f docker-compose.yaml up -d   # for Hadoop
The primary objective of this project is to utilize the hh.ru API to extract job vacancies and automate the data processing workflow. Here's a high-level overview of the project's process:
- Job vacancies are retrieved from the hh.ru API.
- The extracted data is stored in Hadoop HDFS for further processing.
- Scala Spark is employed to process and analyze the data, which can include tasks like currency conversion and HR activity analysis.
- A data mart is created, likely for reporting and data visualization purposes (a hypothetical sketch of this step follows this list).
- The data processing and visualization tasks are orchestrated and automated using Apache Airflow, ensuring a streamlined and scheduled workflow.
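As a sketch of the data mart step, the following hypothetical Spark job aggregates the processed vacancies into a small reporting table. The input and output paths and the grouping columns employer_name and area_name are assumptions, not the project's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object VacancyDataMart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("VacancyDataMart")
      .getOrCreate()

    // Assumed input: salary-normalized vacancies produced by the previous Spark step.
    val vacancies = spark.read
      .parquet("hdfs://namenode:9000/user/airflow/vacancies_normalized")

    // Aggregate into a compact, reporting-friendly table (the data mart).
    val mart = vacancies
      .groupBy(col("employer_name"), col("area_name"))
      .agg(
        count(lit(1)).as("vacancy_count"),
        avg("salary_from_rub").as("avg_salary_rub")
      )

    mart.write.mode("overwrite")
      .parquet("hdfs://namenode:9000/user/airflow/vacancy_mart")

    spark.stop()
  }
}
```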