The Financial DataWarehouse project aims to provide an efficient solution for storing and managing financial and commodities data using Cassandra. It includes a REST API built with FastAPI for accessing and manipulating the data, and uses Docker Compose to deploy a multi-node Cassandra cluster for data redundancy and fault tolerance.

liviuxyz-ctrl/Financial-DataWarehouse

Financial DataWarehouse Project

Table of Contents
  1. Introduction
  2. System Architecture
  3. Getting Started
  4. Project Structure
  5. Docker Architecture
  6. API Documentation
  7. License

Introduction

This data warehouse project handles large volumes of financial and commodities data. It is written in Python, uses an ORM layer for database interactions, and exposes the data through a RESTful API.

System Architecture

The architecture is built around Python and Cassandra, with Docker handling container management. A Python ORM layer simplifies database interactions by expressing CQL queries (Cassandra's query language) as Python code, which improves maintainability.
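
To illustrate what an ORM layer does, here is a minimal sketch of object-to-CQL mapping using only the standard library. The `PriceRow` class, its fields, the `price_data` table name, and the sample values are all hypothetical; the project's actual models are defined in src/data/models.py and may look quite different.

```python
from dataclasses import astuple, dataclass, fields
from datetime import datetime, timezone


@dataclass
class PriceRow:
    """Hypothetical row shape, used only for illustration."""
    asset_id: str
    ts: datetime
    close: float


def insert_statement(row, table):
    """Build a parameterized CQL INSERT and its values from a dataclass instance."""
    columns = [f.name for f in fields(row)]
    placeholders = ", ".join("%s" for _ in columns)
    cql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return cql, astuple(row)


# Dummy values, not real market data.
row = PriceRow("AAPL", datetime(2024, 1, 2, tzinfo=timezone.utc), 100.0)
cql, params = insert_statement(row, "price_data")
print(cql)
print(params)
```

The point of the sketch is that calling code works with Python objects while the query text is generated in one place; a real ORM adds schema management, type validation, and query building on top of this idea.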

Getting Started

Prerequisites

  • Python 3.10 or later
  • Docker and Docker Compose
  • Cassandra
  • Virtualenv or any environment management tool

Installation

  1. Clone the repository:

    git clone https://github.com/liviuxyz-ctrl/Financial-DataWarehouse.git
    cd Financial-DataWarehouse
  2. Set up the virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install dependencies:

    pip install -r requirements.txt
  4. Launch Docker containers:

    docker-compose up -d
  5. Initialize the database: run the scripts in src/init_scripts/ (listed under Project Structure) to configure the database schema and seed it with initial data.
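
Because the Cassandra nodes take time to boot (the compose file delays the second and third nodes on first start), it can help to wait until the CQL ports accept connections before running the initialization scripts. The following is a minimal sketch, assuming the host ports 9042, 9043, and 9044 published by docker-compose.yml. `wait_for_port` is a hypothetical helper, not part of the project, and an open port is only a rough readiness signal.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait briefly and try again.
            time.sleep(interval)
    return False


def unreachable_nodes(ports=(9042, 9043, 9044), host="127.0.0.1", **kwargs):
    """Return the subset of host ports that never became reachable."""
    return [port for port in ports if not wait_for_port(host, port, **kwargs)]
```

Calling `unreachable_nodes()` after `docker-compose up -d` blocks until each node answers or the timeout expires; an empty list suggests all three nodes are accepting connections.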

Project Structure

  • src/: Contains all source files.
    • clients/: API clients for data sources.
      • commodities_api_client.py: Retrieves commodities data.
      • nasdaq_api_client.py: Fetches NASDAQ data.
    • config/: Application configurations.
      • settings.py: Central config file.
    • data/: Handles database operations.
      • database.py: Manages database connections.
      • models.py: Defines ORM models.
    • ingestion/: Manages data loading and processing.
      • load.py: Ingests data into the database.
      • transform.py: Transforms data as needed.
    • init_scripts/: Database initialization scripts.
      • populate_commodities_data.py: Seeds commodities data.
      • populate_sp500_data.py: Seeds S&P 500 data.
    • utils/: Utility scripts.
      • log_helper.py: Provides logging functions.

Docker Architecture

This project uses Docker to containerize and manage the Cassandra cluster, keeping the development and deployment environments consistent. The setup is defined in the `docker-compose.yml` file, which configures a three-node Cassandra cluster along with Portainer for container management.

Docker Compose File

The docker-compose.yml file defines the services and their configurations as follows:

version: '3'

services:
  # Node 1 Configuration
  DC1N1:
    image: cassandra:3.10
    command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 0; fi && /docker-entrypoint.sh cassandra -f'
    networks:
      - dc1ring
    volumes:
      - ./n1data:/var/lib/cassandra
    environment:
      - CASSANDRA_CLUSTER_NAME=dev_cluster
      - CASSANDRA_SEEDS=DC1N1
    expose:
      - 7000  # Cluster communication
      - 7001  # SSL Cluster communication
      - 7199  # JMX
      - 9042  # CQL
      - 9160  # Thrift service
    ports:
      - "9042:9042"
    ulimits:
      memlock: -1
      nproc: 32768
      nofile: 100000

  # Node 2 Configuration
  DC1N2:
    image: cassandra:3.10
    command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 60; fi && /docker-entrypoint.sh cassandra -f'
    networks:
      - dc1ring
    volumes:
      - ./n2data:/var/lib/cassandra
    environment:
      - CASSANDRA_CLUSTER_NAME=dev_cluster
      - CASSANDRA_SEEDS=DC1N1
    depends_on:
      - DC1N1
    expose:
      - 7000
      - 7001
      - 7199
      - 9042
      - 9160
    ports:
      - "9043:9042"
    ulimits:
      memlock: -1
      nproc: 32768
      nofile: 100000

  # Node 3 Configuration
  DC1N3:
    image: cassandra:3.10
    command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 120; fi && /docker-entrypoint.sh cassandra -f'
    networks:
      - dc1ring
    volumes:
      - ./n3data:/var/lib/cassandra
    environment:
      - CASSANDRA_CLUSTER_NAME=dev_cluster
      - CASSANDRA_SEEDS=DC1N1
    depends_on:
      - DC1N1
    expose:
      - 7000
      - 7001
      - 7199
      - 9042
      - 9160
    ports:
      - "9044:9042"
    ulimits:
      memlock: -1
      nproc: 32768
      nofile: 100000

  # Portainer Configuration
  portainer:
    image: portainer/portainer
    networks:
      - dc1ring
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./portainer-data:/data
    ports:
      - "9000:9000"

networks:
  dc1ring: { }

Explanation

  1. Cassandra Nodes:

    • DC1N1, DC1N2, DC1N3:
      • Each service represents a Cassandra node in the cluster.
      • The image specifies the Docker image used.
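      • The command staggers first-time startup: if the data directory is empty, the node sleeps before starting Cassandra (0 s for DC1N1, 60 s for DC1N2, 120 s for DC1N3), so the nodes join the ring one at a time.
      • networks attaches the node to the internal dc1ring network.
      • volumes maps a host directory (./n1data, ./n2data, ./n3data) to /var/lib/cassandra in the container for persistent storage.
      • environment sets CASSANDRA_CLUSTER_NAME and CASSANDRA_SEEDS; every node uses DC1N1 as its seed.
      • expose lists the ports reachable inside the network (cluster communication, JMX, CQL, Thrift), while ports publishes each node's CQL port 9042 on the host as 9042, 9043, and 9044.
      • ulimits sets the memlock, nproc, and nofile limits for the container.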
  2. Portainer:

    • The Portainer service provides a web-based interface for managing Docker containers.
    • It is configured to use the same dc1ring network and has access to the Docker socket for control.
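
The three Cassandra services differ only in their name, first-boot delay, data directory, and published host port. To make that pattern explicit, here is a small sketch that rebuilds the per-node settings as Python dictionaries. `cassandra_service` is a hypothetical helper written for this explanation; it omits the expose and ulimits entries for brevity and is not how the project generates its compose file.

```python
def cassandra_service(index, seed="DC1N1", delay_step=60, base_port=9042):
    """Return (name, settings) for the index-th node (1-based)."""
    name = f"DC1N{index}"
    delay = (index - 1) * delay_step  # 0, 60, 120 seconds on first boot
    service = {
        "image": "cassandra:3.10",
        "command": (
            "bash -c 'if [ -z \"$$(ls -A /var/lib/cassandra/)\" ] ; "
            f"then sleep {delay}; fi && /docker-entrypoint.sh cassandra -f'"
        ),
        "networks": ["dc1ring"],
        "volumes": [f"./n{index}data:/var/lib/cassandra"],
        "environment": [
            "CASSANDRA_CLUSTER_NAME=dev_cluster",
            f"CASSANDRA_SEEDS={seed}",
        ],
        # Container port 9042 (CQL) is published on a distinct host port.
        "ports": [f"{base_port + index - 1}:9042"],
    }
    if name != seed:
        # Non-seed nodes are started after the seed node.
        service["depends_on"] = [seed]
    return name, service


services = dict(cassandra_service(i) for i in range(1, 4))
for name, svc in services.items():
    print(name, svc["ports"][0], svc["volumes"][0])
```

Seen this way, adding a fourth node would mean one more service with a 180-second delay, an ./n4data volume, and host port 9045, assuming the same conventions are kept.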

API Documentation

The API is organized around resources for financial data, commodities, assets, and data sources. It retrieves data by identifier and supports pagination through the limit and offset query parameters.

For easier use, a Postman collection is provided. You can download it here.

Endpoint Details

Financial Data Endpoints

  • GET /api/v1/data/{asset_id}
    • Retrieves financial data for a specified asset.
    • Parameters:
      • asset_id: Identifier of the asset (the example below uses the ticker symbol AAPL).
      • limit: Number of records to return.
      • offset: Pagination offset.
    • Example: http://127.0.0.1:8000/api/v1/data/AAPL?limit=20&offset=0

Commodity Data Endpoints

  • GET /api/v1/commodities/{commodity_id}
    • Fetches commodity data.
    • Parameters:
      • commodity_id: Identifier for the commodity.
      • limit: Controls the size of the returned data set.
      • offset: Specifies the pagination offset.
    • Example: http://127.0.0.1:8000/api/v1/commodities/brent?limit=20&offset=0

Asset Endpoints

  • GET /api/v1/assets
    • Retrieves a list of asset names.
    • Parameters:
      • offset: The number of records to skip from the beginning.
      • limit: The number of records to return.
    • Example: http://127.0.0.1:8000/api/v1/assets?offset=0&limit=20

Data Source Endpoints

  • GET /api/v1/data_sources

    • Retrieves a list of all data sources.
    • Example: http://127.0.0.1:8000/api/v1/data_sources
  • GET /api/v1/data_sources/{source_id}

    • Retrieves details of a specific data source.
    • Parameters:
      • source_id: UUID of the data source.
    • Example: http://127.0.0.1:8000/api/v1/data_sources/{source_id}
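
All of the endpoints above share one URL shape: a resource path, an optional identifier, and optional limit and offset query parameters. A minimal sketch of composing these URLs in Python follows. `build_url` and `BASE_URL` are hypothetical names, and the base address is simply the one used in the examples.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://127.0.0.1:8000/api/v1"  # host and port taken from the examples


def build_url(resource, item_id=None, **params):
    """Compose an endpoint URL such as .../data/AAPL?limit=20&offset=0."""
    path = f"{BASE_URL}/{resource}"
    if item_id is not None:
        # Percent-encode the identifier so it is safe as a path segment.
        path += "/" + quote(str(item_id), safe="")
    query = urlencode(params)
    return f"{path}?{query}" if query else path


print(build_url("data", "AAPL", limit=20, offset=0))
print(build_url("commodities", "brent", limit=20, offset=0))
print(build_url("assets", offset=0, limit=20))
print(build_url("data_sources"))
```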

Examples

# Fetch financial data for a specific asset
curl -X GET "http://localhost:8000/api/v1/data/AAPL?limit=10&offset=0"

# Retrieve commodity data
curl -X GET "http://localhost:8000/api/v1/commodities/brent?limit=5&offset=0"

# Get a list of assets
curl -X GET "http://localhost:8000/api/v1/assets?offset=0&limit=20"

# Get a list of data sources
curl -X GET "http://localhost:8000/api/v1/data_sources"

# Get details of a specific data source
curl -X GET "http://localhost:8000/api/v1/data_sources/{source_id}"
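
The limit and offset parameters can also be used to walk through a full result set from Python. The sketch below is written under two assumptions that the documentation above does not confirm: each response body is a JSON array of records, and a page shorter than limit marks the end of the data. `fetch_json` and `iter_records` are hypothetical helpers.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def iter_records(url_template, limit=20, fetch=fetch_json):
    """Yield records page by page until a page comes back shorter than limit.

    url_template must contain {limit} and {offset} placeholders.
    Assumes each response is a JSON array (an assumption, see above).
    """
    offset = 0
    while True:
        page = fetch(url_template.format(limit=limit, offset=offset))
        yield from page
        if len(page) < limit:
            break
        offset += limit
```

With the API running locally, `iter_records("http://localhost:8000/api/v1/data/AAPL?limit={limit}&offset={offset}")` would request successive pages on demand. The fetch argument exists so the loop can be exercised with a stand-in function instead of a live server.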

License

Licensed under the MIT License. See [LICENSE.md](LICENSE) for more details.
