Automating Data Scrapers Apache Airflow

Overview

This project focuses on scraping house rental data from the website "www.buyrentkenya.com". The goal is to extract various attributes such as titles, locations, number of bedrooms, number of bathrooms, descriptions, and prices of available rental properties. The project includes both a web scraping script and an automated workflow using Apache Airflow.

Project Structure

  • main.py: This Python script contains the code for scraping the rental data from the website.
  • buyrentkenya.py: This Python script defines the Apache Airflow DAG that automates the scraper and stores the data in a SQLite database.
  • buy_rent_kenya.csv: The CSV file where the data scraped by main.py is stored.
  • buyrentkenya.db: SQLite database file where the scraped data is stored by the Apache Airflow DAG.
  • README.md: This file provides an overview of the project, its objectives, and how to execute the script.

Web Scraping Script Description

  • The script sends a request to the specified URL using the requests library.
  • It parses the HTML content of the response using BeautifulSoup.
  • The script extracts relevant information such as titles, locations, number of bedrooms, number of bathrooms, descriptions, and prices of rental properties.
  • Data from multiple pages is collected by iterating through page numbers.
  • The scraped data is stored in a Pandas DataFrame and then saved to a CSV file named buy_rent_kenya.csv (a sketch of this flow follows the list).
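
Below is a minimal sketch of that flow, not the exact main.py: the listing URL path, the number of result pages, and the CSS selectors (listing-card, listing-title, and so on) are placeholder assumptions, so inspect the live HTML for the real values before running it.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.buyrentkenya.com/houses-for-rent"  # assumed listing path


def text_of(card, selector):
    # Return the stripped text of the first match inside a listing card, or None.
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else None


rows = []
for page in range(1, 6):  # iterate through page numbers (page count is an assumption)
    resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # "listing-card" and the selectors below are placeholder class names.
    for card in soup.select("div.listing-card"):
        rows.append({
            "title": text_of(card, "h2.listing-title"),
            "location": text_of(card, "p.listing-location"),
            "bedrooms": text_of(card, "span.bedrooms"),
            "bathrooms": text_of(card, "span.bathrooms"),
            "description": text_of(card, "p.description"),
            "price": text_of(card, "p.listing-price"),
        })

# Collect the listings into a DataFrame and save them to the project's CSV file.
pd.DataFrame(rows).to_csv("buy_rent_kenya.csv", index=False)
```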

Apache Airflow DAG Description

  • The Apache Airflow DAG automates the execution of the web scraping script at specified intervals.
  • The DAG is scheduled to run daily at 12:05 PM.
  • It starts by creating a SQLite table if it does not exist already.
  • The web scraping task collects rental data from the website and stores it in the SQLite database.
  • The DAG uses task dependencies to ensure that the tasks are executed sequentially (see the sketch after this list).
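
A minimal sketch of how such a DAG can be wired together is shown below. It is not the exact buyrentkenya.py: the DAG id, task ids, the rentals table schema, and the scrape_listings() stub are illustrative assumptions.

```python
import sqlite3
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

DB_PATH = "buyrentkenya.db"  # SQLite file named in the project structure


def create_table():
    # Create the target table if it does not exist already.
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS rentals (
                   title TEXT, location TEXT, bedrooms TEXT,
                   bathrooms TEXT, description TEXT, price TEXT
               )"""
        )


def scrape_listings():
    # Placeholder: in the real DAG this runs the same requests/BeautifulSoup
    # scrape as main.py and returns a list of dicts, one per listing.
    return []


def scrape_and_store():
    # Scrape the listings and insert them into the SQLite database.
    rows = scrape_listings()
    with sqlite3.connect(DB_PATH) as conn:
        conn.executemany(
            "INSERT INTO rentals VALUES (?, ?, ?, ?, ?, ?)",
            [
                (r["title"], r["location"], r["bedrooms"],
                 r["bathrooms"], r["description"], r["price"])
                for r in rows
            ],
        )


with DAG(
    dag_id="buy_rent_kenya",            # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="5 12 * * *",     # daily at 12:05 PM
    catchup=False,
) as dag:
    create = PythonOperator(task_id="create_table", python_callable=create_table)
    scrape = PythonOperator(task_id="scrape_and_store", python_callable=scrape_and_store)

    create >> scrape  # sequential dependency: create the table, then scrape
```

The create >> scrape dependency at the end is what enforces the sequential ordering described above.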

Dependencies

  • Python 3.x
  • Pandas
  • Requests
  • BeautifulSoup4
  • Apache Airflow

Execution

To run the web scraping script:

  • Ensure that you have Python installed on your system.
  • Install the required libraries by running pip install pandas requests beautifulsoup4.
  • Execute the script main.py.

To run the Apache Airflow DAG:

  • Install Apache Airflow and initialize the Airflow database.
  • Place the buyrentkenya.py file in the Airflow DAGs folder.
  • Start the Airflow scheduler and webserver.
  • The DAG will then run automatically according to the specified schedule.

Collaborators

  1. Brian Chacha
  2. Mutai Gilbert
  3. Allan Silver

Acknowledgments

Special thanks to Data Science East Africa, ALX_Kenya and Lux Academy for organizing the Data Science Hackathon.
