Weaby is a program that can collect data from multiple websites. It is written in Python and extracts data from websites using Selenium.

arman-bd/weaby-the-extractor

Weaby the Extractor

Weaby is a program that collects data from multiple websites. It is built with FastAPI and extracts data from websites using Selenium. The project is containerized with Docker Compose. The Undetected Chrome Driver used by the project downloads the most recent version of Chrome Driver so that it supports the Chrome version installed in the Docker image.

Installation

Docker

To install the project, you need Python 3, Docker, and Docker Compose installed on your machine. Each can be downloaded from its official website.

Project

After installing Docker and Docker Compose, you can clone the project by running the following command:

git clone https://github.com/arman-bd/weaby-the-extractor.git

Environment

After cloning the project, you need to create a .env file in the project directory. You can copy the .env.example file and rename it to .env.

cp .env.example .env

You may adjust the .env file to suit your needs: open it with a text editor and change the values of the variables.
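
Before editing, it can help to see which variables the file currently defines. The following is a minimal sketch, assuming the .env file uses plain KEY=VALUE lines with # comments. The `read_env` helper is hypothetical and not part of the project, and the variable names it prints depend entirely on your own .env file.

```python
from pathlib import Path


def read_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        for key, value in read_env(str(env_file)).items():
            print(f"{key}={value}")
```

Run it from the project directory after copying .env.example to .env. It only reads the file and never modifies it.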

Usage

Run the following command to start the project:

docker compose up --build -d

After running the command, you can access the project by visiting http://localhost:8081 in your browser.
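
The containers can take a moment to start, so the page may not load immediately. As a rough sketch, the check below reports whether anything answers on the address given above. The `is_reachable` helper is hypothetical, and it treats any HTTP response (even an error status) as a sign that the server is up.

```python
import urllib.error
import urllib.request


def is_reachable(url: str = "http://localhost:8081", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at url, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded with an error status (e.g. 404), so it is up.
        return True
    except OSError:
        # Connection refused, timeout, or DNS failure (URLError is an OSError).
        return False


if __name__ == "__main__":
    print("up" if is_reachable() else "not reachable yet")
```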

Weaby in Action

Supported Websites

Currently, Weaby supports the following websites for data extraction:

Adding Support for a Website

To add support for a website, you need to follow the steps below:

  1. Create a Service Method in app/services/extract.py.
# Imports needed at the top of the file, if not already present.
import asyncio
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

async def website_data(driver: uc.Chrome, id: str, wait: int = 5):
    driver.get(f"https://YOUR_WEBSITE_HERE/{id}")
    # Give the page time to render without blocking the event loop.
    await asyncio.sleep(wait)
    # Replace these XPaths with the ones that match your target page.
    title = driver.find_element(By.XPATH, "/html/body/div[3]/h1/span").text
    description = driver.find_element(By.XPATH, "/html/body/div[3]/div[3]/div[5]/div[1]/p[2]").text
    return {
        "title": title,
        "description": description
    }
  2. Create a Controller Method in app/controllers/extract.py.
async def website_data(id: str):
    try:
        driver = wd.create_driver()
        return await ExtractService.website_data(driver, id, 5)
    except Exception as e:
        return {"error": str(e)}
  3. Create a Router Method in app/routers/extract.py.
@router.get("/website/{id}", response_model=WebsiteData)
async def website_data(id: str):
    return await ExtractController.website_data(id)
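
The service method in step 1 pauses for a fixed number of seconds before reading the page. An alternative is to poll until the content is actually available, which is the idea behind Selenium's WebDriverWait. The sketch below illustrates that idea with the standard library only. The `wait_until` helper is hypothetical and is not part of the project or of Selenium.

```python
import asyncio
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


async def wait_until(probe: Callable[[], Optional[T]], timeout: float = 5.0,
                     interval: float = 0.25) -> T:
    """Call probe() until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Yield to the event loop between attempts instead of blocking it.
        await asyncio.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a page that becomes ready on the third check.
    attempts = iter([None, None, "ready"])
    print(asyncio.run(wait_until(lambda: next(attempts), timeout=2.0, interval=0.01)))
```

In a service method, the probe could, for example, call driver.find_elements(...) and return the first match or None, so the method returns as soon as the element appears rather than always waiting the full interval.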

Now you can access the data from the website by sending a GET request to http://localhost:8081/extract/website/{id}.
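
The request can also be sent from a script. The following is a minimal sketch using only the standard library, assuming the service runs at http://localhost:8081 and the route is mounted as shown above. The `build_url` and `fetch` helpers and the `example-id` value are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote


def build_url(item_id: str, base: str = "http://localhost:8081") -> str:
    """Build the endpoint URL, percent-encoding the id path segment."""
    return f"{base.rstrip('/')}/extract/website/{quote(item_id, safe='')}"


def fetch(item_id: str, timeout: float = 60.0) -> dict:
    """Send the GET request and decode the JSON body."""
    with urllib.request.urlopen(build_url(item_id), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Prints the URL only; call fetch("example-id") once the service is running.
    print(build_url("example-id"))
```

The generous default timeout reflects that each request drives a real browser, so responses can take several seconds.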

Disclaimer

The project is still in development and is not ready for production. It has not been tested thoroughly and may contain bugs. It is designed for educational purposes only: its purpose is to demonstrate how to use Selenium to interact with websites. Use at your own risk. I am not responsible for any misuse of this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments
