Financial-news-collectors

A financial news collector and aggregator service that not only collects the data, but processes and enriches it, while providing a simple UI for basic analytics.

Main design
Overview of architectural design.

Introduction

Even though most financial news outlets provide API services for data consumption, most of them charge very high fees. This repository tries to create a cheaper alternative while providing some custom analytics.

Frontend view and database table content view
Overview of the frontend and the database.

How to use it

Follow these steps:

  • Prepare the database:

    • Open your favourite SQL database and create the database and the table using the setup.sql helper script.
    • Open the config.py file and modify SQLALCHEMY_DATABASE_URL to point to the database you just created (see the sketch below).
  • Prepare the data-backend service:

    • Install the required Python packages via pip install -r requirements.txt. (It is recommended to use a virtual environment to avoid conflicting version issues.)
    • Schedule the data-backend to run every 30 minutes: open a terminal window, type crontab -e, then add the following line at the end of the file and save: */30 * * * * bash {absolute/path/to/script.sh}.
    • You can list the scheduled job by typing crontab -l.
Running the data-backend service
Running the data-backend service.
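For reference, here is a minimal sketch of what the database URL in config.py might look like. The driver, credentials, host and database name below are placeholder assumptions (a local PostgreSQL setup), not the repository's actual values:

```python
# config.py (sketch; adjust the driver, credentials, host and database name to your setup)
# SQLAlchemy connection URL for an assumed local PostgreSQL database named "financial_news".
SQLALCHEMY_DATABASE_URL = "postgresql://user:password@localhost:5432/financial_news"
```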
  • Prepare the backend service:
    • Install the required Python packages via pip install -r requirements.txt. (Again, a virtual environment is recommended.)
    • Open a terminal window and type uvicorn main:app --reload to launch the application.
    • (Optional) Open a browser and type http://127.0.0.1:8000/docs in the address bar to open an interactive view of the backend service.
Running the backend service
Running the backend service.
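For illustration, a minimal FastAPI application of the shape served by the uvicorn command above might look as follows. The /headlines endpoint and the Headline fields are hypothetical, not the repository's actual API:

```python
# main.py (sketch; the endpoint and fields are illustrative, not the repo's actual schema)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Headline(BaseModel):
    title: str
    source: str
    published_at: str

@app.get("/headlines")
def list_headlines() -> list[Headline]:
    # The real service would query the SQL database (e.g. via SQLAlchemy) here.
    return [Headline(title="Example headline", source="example.com", published_at="2024-01-01")]
```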
  • Prepare the frontend:
    • Run index.html in a live server and open the provided URL. (I recommend VS Code's Live Server plug-in for its ease of use; for non-local deployments you will need a more advanced web server such as Apache HTTP Server.)
Running the frontend
Running the frontend.
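If you prefer not to install a plug-in, Python's built-in static file server is one local alternative (my suggestion, not the repository's documented setup). Run it from the directory containing index.html:

```python
# serve_frontend.py (sketch) -- equivalent to running: python -m http.server 8080
import http.server
import socketserver

# Serves the current working directory, so run this next to index.html.
with socketserver.TCPServer(("", 8080), http.server.SimpleHTTPRequestHandler) as httpd:
    print("Serving the frontend at http://localhost:8080")
    httpd.serve_forever()
```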

Technical details

Back-of-the-envelope calculations

Below is a rough estimate of the database storage requirements.

Assumptions:

  • Size per webpage: 0.65 MB
  • Webpages downloaded per run: 20 pages
  • Number of downloads per day: 48 (one run every 30 minutes, i.e. 2 per hour)
  • Number of days to run: 5 * 365 (5 years)

Database size:

  • Required storage: 0.65 (MB/webpage) * 20 (webpages/download) * 48 (downloads/day) * 5 * 365 (days) = 1,138,800 MB ≈ 1,139 GB (about 1.1 TB)
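The same arithmetic as a quick sanity check:

```python
# Back-of-the-envelope storage estimate, using the assumptions above.
mb_per_page = 0.65            # MB per downloaded webpage
pages_per_download = 20       # webpages fetched per run
downloads_per_day = 48        # one run every 30 minutes
days = 5 * 365                # five years of operation

total_mb = mb_per_page * pages_per_download * downloads_per_day * days
print(f"{total_mb:,.0f} MB = {total_mb / 1000:,.1f} GB")  # 1,138,800 MB = 1,138.8 GB
```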

Technology stack

  • Vanilla JS, jQuery and d3.js were selected due to the simple nature of the frontend application.
  • FastAPI was selected for its performance and strong community adoption.
  • SQL was selected for its robustness and the structured nature of the data being received.
  • Selenium was selected for its power and wide range of capabilities.

How it works

The repository is divided into three parts:

  • Data-backend: Contains the code that collects news webpages, extracts and enriches headline-related data, and places it into a database. It can store the downloaded webpages in two ways: local storage or MongoDB. Having multiple storage options makes it easy to deploy the code on simple machines while still allowing more advanced (and costly) options; a sketch of this choice follows below.
  • Frontend: Contains the code for the UI that provides the basic analytics functionality.
  • Backend: Contains the code through which the frontend consumes the data. It is an intermediary API service that facilitates data fetching.

Note that each service should be run in its own environment, as they work independently of each other and communicate only via the API and the database.
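A minimal sketch of that dual storage choice (the function names, folder and collection names here are illustrative assumptions, not the repository's actual code):

```python
# storage.py (sketch; names and layout are assumptions, not the repo's actual implementation)
from pathlib import Path

def store_page_locally(html: str, name: str, folder: str = "pages") -> None:
    """Write the raw webpage to local disk, for simple machines."""
    Path(folder).mkdir(exist_ok=True)
    Path(folder, f"{name}.html").write_text(html, encoding="utf-8")

def store_page_in_mongodb(html: str, name: str, uri: str = "mongodb://localhost:27017") -> None:
    """Store the raw webpage in MongoDB, for more advanced deployments."""
    from pymongo import MongoClient  # requires the pymongo package
    with MongoClient(uri) as client:
        client["news"]["pages"].insert_one({"name": name, "html": html})
```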


Future functionalities/Considerations

  • Every time the script runs, all the files are reloaded in search of headlines, including the ones already analyzed. This is quite a waste of resources, as no new data will be extracted from them unless the ETL process changes. To avoid this behaviour, two solutions could be implemented:

    • Move the already-analyzed files to another location (or flag them as analyzed if using MongoDB).
    • Create a table storing the files already analyzed, and only analyze the files not in that list.
  • Due to Bloomberg's anti-bot measures (which block plain GET requests to its URLs), it was decided to open and close a browser every time a URL is visited. Although not very performant, this circumvents the issue and allows the pages to be retrieved (see the sketch below).
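For illustration, the open/close-per-URL pattern might look like the following Selenium sketch (Chrome and the default driver options are assumptions; the repository's actual fetcher may differ):

```python
# fetcher.py (sketch; the driver choice and options are assumptions, not the repo's actual code)
from selenium import webdriver

def fetch_page(url: str) -> str:
    """Open a fresh browser, load the URL, capture the HTML, then close the browser."""
    driver = webdriver.Chrome()  # a brand-new browser instance per URL
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the page load fails
```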
