PYFLINK-POC

Streamlining data flow with PyFlink power!


Built with the tools and technologies:

GNU Bash · Python · AIOHTTP · pandas · Apache Kafka · Apache Flink


Table of Contents

  • 📍 Overview
  • 👾 Features
  • 📁 Project Structure
  • 📂 Project Index
  • 🚀 Getting Started
    • ☑️ Prerequisites
    • ⚙️ Installation
    • 🤖 Usage
    • 🧪 Testing
  • 📌 Project Roadmap
  • 🔰 Contributing
  • 🎗 License
  • 🙌 Acknowledgments

📍 Overview

Pyflink-poc enables real-time data streaming and processing by integrating Apache Flink and Apache Kafka. It handles data efficiently with Pandas and asyncio, emphasizing scalability and performance, and is well suited to developers who need lightweight, responsive architectures for managing large datasets and concurrent operations.


👾 Features

⚙️ Architecture
  • Combines PyFlink, Apache Kafka, and asyncio for real-time data streaming and processing.
  • Uses a lightweight, responsive architecture for efficient, scalable data handling.
  • Implements batch processing for anomaly detection and alert handling.
🔩 Code Quality
  • Well-structured codebase with clear separation of concerns.
  • Follows PEP 8 guidelines for Python code style.
  • Includes unit tests for critical components ensuring robustness.
📄 Documentation
  • Comprehensive documentation across the project's txt, py, sh, yaml, and toml files.
  • Includes setup instructions, usage guidelines, and project overview.
  • Enhances project management and onboarding of new contributors.
🔌 Integrations
  • Integrates seamlessly with Apache Kafka for data ingestion and aiohttp for asynchronous HTTP requests.
  • Utilizes PyFlink for distributed data processing.
  • Facilitates integration with external systems through APIs.
🧩 Modularity
  • Follows modular design principles for easy extensibility and maintenance.
  • Encapsulates functionalities into separate modules for better organization.
  • Promotes reusability of components across the codebase.
🧪 Testing
  • Includes unit tests using pytest to ensure functionality and reliability.
  • Tests cover critical paths, edge cases, and integration scenarios.
  • Facilitates continuous integration and deployment pipelines.
⚡️ Performance
  • Emphasizes performance optimization for handling large datasets.
  • Utilizes asynchronous processing with asyncio for improved efficiency.
  • Fine-tunes configurations for optimal resource allocation and fault tolerance.
🛡️ Security
  • Implements secure data transmission with serialization using Apache Avro.
  • Follows best practices for handling sensitive data and authentication.
  • Ensures data integrity and confidentiality in communication with external systems.
📦 Dependencies
  • Utilizes dependencies like PyFlink, Apache Kafka, Pandas, and aiohttp for core functionality.
  • Manages dependencies using pip via requirements.txt for easy installation.
  • Ensures compatibility and version consistency across dependencies.

📁 Project Structure

└── pyflink-poc/
    ├── README.md
    ├── conf
    │   ├── conf.toml
    │   └── flink-config.yaml
    ├── data
    │   └── data.csv
    ├── requirements.txt
    ├── scripts
    │   ├── clean.sh
    │   └── run.sh
    ├── setup
    │   └── setup.sh
    ├── setup.py
    └── src
        ├── alerts_handler.py
        ├── consumer.py
        └── logger.py

📂 Project Index

PYFLINK-POC/
__root__
requirements.txt - Enables real-time data streaming and processing by integrating Apache Flink and Apache Kafka with asynchronous HTTP requests. Facilitates efficient data handling and manipulation with the Pandas library while leveraging asyncio for concurrent operations; the dependency set emphasizes scalability and performance for large datasets.
setup.py - Configures project dependencies and packaging for STREAM-ON. Also sets up packages for documentation, style checking, and testing, improving overall project structure and management.
setup
setup.sh - Facilitates setup and configuration of project dependencies and environment variables. Installs Java 11, Python 3.7, and PyFlink, sets environment variables, and creates zsh aliases, so PyFlink runs seamlessly in the development environment.
scripts
run.sh - Starts the Flink cluster, executes the PyFlink job, and shuts the cluster down.
clean.sh - Removes backup files, Python caches, build artifacts, Jupyter notebook checkpoints, and the pytest cache from the project directory, keeping the repository clutter-free during development and maintenance.
conf
flink-config.yaml - Defines Flink cluster configuration parameters for resource allocation, fault tolerance, and scalability in distributed data processing.
conf.toml - Defines project-wide configuration constants for the Kafka and Flink services (a loading sketch follows this index).
src
alerts_handler.py - Sends alerts to an API in batches using asyncio and aiohttp. Alerts are serialized with Apache Avro before being posted to the endpoint, and a buffer collects them until a batch threshold triggers a send (sketched below).
logger.py - The Logger class provides structured logging with multiple log levels and colored output, aiding debugging and monitoring across the system (sketched below).
consumer.py - Implements data stream processing with Apache Flink, orchestrating streaming and batch data comparisons for anomaly detection. Manages state and fault tolerance through checkpointing and processes flagged records to trigger alerts (sketched below).
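The conf.toml constants might be consumed along the lines of the sketch below, which assumes a third-party toml parser and hypothetical [kafka] and [flink] tables; none of the key names are taken from the repository:

```python
# Illustrative loader for conf/conf.toml. The table and key names here are
# assumptions; check conf.toml itself for the real constants.
import toml  # third-party 'toml' package (tomllib is stdlib on Python 3.11+)

config = toml.load("conf/conf.toml")

KAFKA_SERVERS = config["kafka"]["bootstrap_servers"]                # e.g. "localhost:9092"
CHECKPOINT_INTERVAL_MS = config["flink"]["checkpoint_interval_ms"]  # e.g. 60000
```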
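A minimal sketch of alerts_handler.py's buffer-and-batch flow, assuming a fastavro-based serializer; the endpoint, threshold, and schema are illustrative placeholders rather than values from the repository:

```python
# Hypothetical sketch of the batched alert flow. API_ENDPOINT, BATCH_SIZE,
# and ALERT_SCHEMA are placeholders, not the repo's actual values.
import io

import aiohttp
from fastavro import parse_schema, schemaless_writer

API_ENDPOINT = "https://example.com/alerts"  # placeholder endpoint
BATCH_SIZE = 10                              # flush threshold (assumption)

ALERT_SCHEMA = parse_schema({
    "type": "record",
    "name": "Alert",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "message", "type": "string"},
    ],
})

_buffer = []  # alerts waiting to be sent


def _serialize(alert: dict) -> bytes:
    """Encode one alert record as schemaless Avro bytes."""
    out = io.BytesIO()
    schemaless_writer(out, ALERT_SCHEMA, alert)
    return out.getvalue()


async def buffer_alert(alert: dict) -> None:
    """Buffer an alert and flush the batch once the threshold is reached."""
    _buffer.append(alert)
    if len(_buffer) >= BATCH_SIZE:
        await flush_alerts()


async def flush_alerts() -> None:
    """POST every buffered alert to the API, then clear the buffer."""
    batch = list(_buffer)
    _buffer.clear()
    async with aiohttp.ClientSession() as session:
        for alert in batch:
            async with session.post(API_ENDPOINT, data=_serialize(alert)) as resp:
                resp.raise_for_status()
```

Stream-processing code would simply await buffer_alert(...) per flagged record; schemaless Avro keeps payloads compact, with the trade-off that the receiver must know the schema out of band.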
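The colored, level-aware logging described for logger.py could look roughly like this; the class and helper names are assumptions, not the module's actual API:

```python
# Minimal colored-logging sketch in the spirit of src/logger.py; the ANSI
# codes and get_logger helper are illustrative, not the repo's actual API.
import logging
import sys

COLORS = {
    "DEBUG": "\033[36m",     # cyan
    "INFO": "\033[32m",      # green
    "WARNING": "\033[33m",   # yellow
    "ERROR": "\033[31m",     # red
    "CRITICAL": "\033[35m",  # magenta
}
RESET = "\033[0m"


class ColorFormatter(logging.Formatter):
    """Wrap each record in an ANSI color chosen by its level."""

    def format(self, record: logging.LogRecord) -> str:
        color = COLORS.get(record.levelname, "")
        return f"{color}{super().format(record)}{RESET}"


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with a single colored stdout handler."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(ColorFormatter("%(asctime)s | %(levelname)s | %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```

Usage would be e.g. get_logger(__name__).warning("lag detected"), which prints the line in yellow.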
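Finally, a rough outline of the consumer's Flink wiring, assuming the legacy FlinkKafkaConsumer connector (newer PyFlink releases replace it with KafkaSource); the topic, broker, group id, and checkpoint interval are placeholders:

```python
# Sketch of the checkpointed Kafka-to-Flink pipeline described above.
# Topic, broker, group id, and checkpoint interval are placeholder values.
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60 s for fault tolerance

consumer = FlinkKafkaConsumer(
    topics="events",  # placeholder topic
    deserialization_schema=SimpleStringSchema(),
    properties={"bootstrap.servers": "localhost:9092", "group.id": "pyflink-poc"},
)

stream = env.add_source(consumer)
# Compare incoming records against a batch baseline here, flag anomalies,
# and hand flagged records to the alert handler.
stream.print()
env.execute("pyflink-poc-consumer")
```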

🚀 Getting Started

☑️ Prerequisites

Before getting started with pyflink-poc, ensure your runtime environment meets the following requirements:

  • Programming Language: Python
  • Package Manager: Pip

⚙️ Installation

Install pyflink-poc using one of the following methods:

Build from source:

  1. Clone the pyflink-poc repository:
     ❯ git clone https://github.com/eli64s/pyflink-poc
  2. Navigate to the project directory:
     ❯ cd pyflink-poc
  3. Install the project dependencies using pip:
     ❯ pip install -r requirements.txt

🤖 Usage

Run pyflink-poc using the following command:

❯ python {entrypoint}

Alternatively, use scripts/run.sh to start the Flink cluster, submit the PyFlink job, and shut the cluster down afterwards:

❯ bash scripts/run.sh

🧪 Testing

Run the test suite using the following command:

❯ pytest

📌 Project Roadmap

  • Task 1: Implement feature one.
  • Task 2: Implement feature two.
  • Task 3: Implement feature three.

🔰 Contributing

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client.
    git clone https://github.com/eli64s/pyflink-poc
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!

🎗 License

This project is licensed under the SELECT-A-LICENSE license. For more details, refer to the LICENSE file.


🙌 Acknowledgments

  • List any resources, contributors, inspiration, etc. here.