Financial-news-collectors

A financial news collector and aggregator service that not only collects the data, but processes and enriches it, while providing a simple UI for basic analytics.

Main design
Overview of architectural design.

Introduction

Even though most financial news outlets provide API services for data consumption, most of them charge very high fees. This repository tries to create a cheaper alternative while providing some custom analytics.

Frontend view and database table content view
Overview of the frontend and the database.

How to use it

Follow these steps:

  • Prepare the database:

    • Open your favourite SQL database and create the database and the table using the setup.sql helper script.
    • Open the config.py file and modify SQLALCHEMY_DATABASE_URL to point to the database you just created (see the sketch below).
  • Prepare the data-backend service:

    • Install the required Python packages via pip install -r requirements.txt. (It is recommended to use a virtual environment to avoid conflicting version issues.)
    • Schedule the data-backend to run every 30 minutes: open a terminal window, type crontab -e, then add the following line at the end of the file and save: */30 * * * * bash {absolute/path/to/script.sh}.
    • You can list the scheduled job by typing crontab -l.
Running the data-backend service
Running the data-backend service.
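For reference, here is a minimal sketch of what the database URL in config.py might look like. The driver, credentials, host and database name below are placeholder assumptions (a local PostgreSQL setup), not the repository's actual values:

```python
# config.py (sketch; adjust the driver, credentials, host and database name to your setup)
# SQLAlchemy connection URL for an assumed local PostgreSQL database named "financial_news".
SQLALCHEMY_DATABASE_URL = "postgresql://user:password@localhost:5432/financial_news"
```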
  • Prepare the backend service:
    • Install the required Python packages via pip install -r requirements.txt. (Again, a virtual environment is recommended.)
    • Open a terminal window and type uvicorn main:app --reload to launch the application.
    • (Optional) Open a browser and type http://127.0.0.1:8000/docs in the address bar to open an interactive view of the backend service.
Running the backend service
Running the backend service.
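For illustration, a minimal FastAPI application of the shape served by the uvicorn command above might look as follows. The /headlines endpoint and the Headline fields are hypothetical, not the repository's actual API:

```python
# main.py (sketch; the endpoint and fields are illustrative, not the repo's actual schema)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Headline(BaseModel):
    title: str
    source: str
    published_at: str

@app.get("/headlines")
def list_headlines() -> list[Headline]:
    # The real service would query the SQL database (e.g. via SQLAlchemy) here.
    return [Headline(title="Example headline", source="example.com", published_at="2024-01-01")]
```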
  • Prepare the frontend:
    • Run index.html in a live server and open the provided URL. (I recommend VS Code's Live Server plug-in for its ease of use; for non-local deployments you will need a more advanced web server such as Apache HTTP Server.)
Running the frontend
Running the frontend.
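If you prefer not to install a plug-in, Python's built-in static file server is one local alternative (my suggestion, not the repository's documented setup). Run it from the directory containing index.html:

```python
# serve_frontend.py (sketch) -- equivalent to running: python -m http.server 8080
import http.server
import socketserver

# Serves the current working directory, so run this next to index.html.
with socketserver.TCPServer(("", 8080), http.server.SimpleHTTPRequestHandler) as httpd:
    print("Serving the frontend at http://localhost:8080")
    httpd.serve_forever()
```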

Technical details

Back-of-the-envelope calculations

Below is a rough estimate of the database storage requirements.

Assumptions:

  • Size per webpage: 0.65 MB
  • Webpages downloaded per run: 20 pages
  • Number of downloads per day: 48 (one run every 30 minutes, i.e. 2 per hour)
  • Number of days to run: 5 * 365 (5 years)

Database size:

  • Required storage: 0.65 (MB/webpage) * 20 (webpages/download) * 48 (downloads/day) * 5 * 365 (days) = 1,138,800 MB ≈ 1,139 GB (about 1.1 TB)
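The same arithmetic as a quick sanity check:

```python
# Back-of-the-envelope storage estimate, using the assumptions above.
mb_per_page = 0.65            # MB per downloaded webpage
pages_per_download = 20       # webpages fetched per run
downloads_per_day = 48        # one run every 30 minutes
days = 5 * 365                # five years of operation

total_mb = mb_per_page * pages_per_download * downloads_per_day * days
print(f"{total_mb:,.0f} MB = {total_mb / 1000:,.1f} GB")  # 1,138,800 MB = 1,138.8 GB
```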

Technology stack

  • Vanilla JS, jQuery and d3.js were selected due to the simple nature of the frontend application.
  • FastAPI was selected for its performance and strong community adoption.
  • SQL was selected for its robustness and the structured nature of the data being received.
  • Selenium was selected for its power and wide range of capabilities.

How it works

The repository is divided into three parts:

  • Data-backend: Contains the code that collects news webpages, extracts and enriches headline-related data, and places it into a database. It can store the downloaded webpages in two ways: local storage or MongoDB. Having multiple storage options makes it easy to deploy the code on simple machines while still allowing more advanced (and costly) options; a sketch of this choice follows below.
  • Frontend: Contains the code for the UI that provides the basic analytics functionality.
  • Backend: Contains the code through which the frontend consumes the data. It is an intermediary API service that facilitates data fetching.

Note that each service should be run in its own environment, as they work independently of each other and communicate only via the API and the database.
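A minimal sketch of that dual storage choice (the function names, folder and collection names here are illustrative assumptions, not the repository's actual code):

```python
# storage.py (sketch; names and layout are assumptions, not the repo's actual implementation)
from pathlib import Path

def store_page_locally(html: str, name: str, folder: str = "pages") -> None:
    """Write the raw webpage to local disk, for simple machines."""
    Path(folder).mkdir(exist_ok=True)
    Path(folder, f"{name}.html").write_text(html, encoding="utf-8")

def store_page_in_mongodb(html: str, name: str, uri: str = "mongodb://localhost:27017") -> None:
    """Store the raw webpage in MongoDB, for more advanced deployments."""
    from pymongo import MongoClient  # requires the pymongo package
    with MongoClient(uri) as client:
        client["news"]["pages"].insert_one({"name": name, "html": html})
```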


Future functionalities/Considerations

  • Every time the script runs, all the files are reloaded in search of headlines, including the ones already analyzed. This is quite a waste of resources, as no new data will be extracted from them unless the ETL process changes. To avoid this behaviour, two solutions could be implemented:

    • Move the already-analyzed files to another location (or flag them as analyzed if using MongoDB).
    • Create a table storing the files already analyzed, and only analyze the files not in that list.
  • Due to Bloomberg's anti-bot measures (which block plain GET requests to its URLs), it was decided to open and close a browser every time a URL is visited. Although not very performant, this circumvents the issue and allows the pages to be retrieved (see the sketch below).
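For illustration, the open/close-per-URL pattern might look like the following Selenium sketch (Chrome and the default driver options are assumptions; the repository's actual fetcher may differ):

```python
# fetcher.py (sketch; the driver choice and options are assumptions, not the repo's actual code)
from selenium import webdriver

def fetch_page(url: str) -> str:
    """Open a fresh browser, load the URL, capture the HTML, then close the browser."""
    driver = webdriver.Chrome()  # a brand-new browser instance per URL
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the page load fails
```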
