# Hazard Detection and Captioning System

## Table of Contents

- Overview
- Project Structure
- Installation
- Usage
- Outputs
- Core Components
- Our Key Contributions
- Testing
- Future Work
- Acknowledgments
- License
## Overview

This project implements a hazard detection and captioning system for driver monitoring using videos. The system combines YOLO for object detection, BLIP-based models for caption generation, and a custom state change detection algorithm to evaluate driver behavior. It processes videos frame-by-frame, identifies hazards, generates descriptions for them, and outputs results in a CSV format compatible with competition scoring requirements.
## Project Structure

```
.
├── data
│   ├── annotations
│   │   └── annotations.pkl          # Annotations for videos
│   └── videos
│       ├── video_0001.mp4           # Sample video file
│       ├── ...
│       └── video_0200.mp4           # Additional video files
├── models
│   ├── blip-image-captioning-base   # BLIP captioning model files
│   │   ├── config.json
│   │   ├── preprocessor_config.json
│   │   ├── pytorch_model.bin
│   │   ├── README.md
│   │   └── ...
│   └── YOLO_models                  # Pre-trained YOLO models
│       ├── yolo11n.pt
│       └── yolov8n.pt
├── pics                             # Sample images for testing
│   └── dog.png
├── README.md                        # Documentation file (this file)
├── requirements.txt                 # Dependencies for the project
├── results
│   └── results.csv                  # Output file for detection results
├── src                              # Source code
│   ├── __init__.py
│   ├── main.py                      # Main script for video processing
│   └── utils                        # Utility modules
│       ├── captioning_utils.py      # Captioning-related utilities
│       ├── detection_utils.py       # Detection-related utilities
│       ├── state_change_utils.py    # State change detection logic
│       └── video_utils.py           # Video handling utilities
```
## Installation

Prerequisites:

- Python 3.9 or later.
- A machine with GPU support for efficient video processing (optional but recommended).

1. Clone the repository:

   ```bash
   git clone git@github.com:Ebimsv/Hazard-Detection-and-Captioning-System.git
   cd Hazard-Detection-and-Captioning-System
   ```

2. Install dependencies:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```
3. Using pre-trained models:

   - YOLO models:
     - Download and place pre-trained YOLO models (e.g., `yolov8n.pt` or `yolo11n.pt`) in the `models/YOLO_models` directory.
     - Alternatively, you can set `model_name = "yolo11n.pt"` or `model_name = "yolov8n.pt"` in `detection_utils.py` to reference the downloaded models.
   - BLIP captioning model:
     - Place the BLIP captioning model in `models/blip-image-captioning-base`.
     - Alternatively, set `model_name = "Salesforce/blip-image-captioning-base"` in `captioning_utils.py` to use the model from the Hugging Face repository.
4. Prepare the data:

   - Annotations: Ensure `annotations.pkl` is in `data/annotations`.
   - Videos: Place video files in `data/videos`.
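To sanity-check that the model files are found and load correctly, the snippet below is a minimal sketch using the standard `ultralytics` and `transformers` APIs; the `model_name` variables inside `detection_utils.py` and `captioning_utils.py` may be wired differently.

```python
# Quick check that the pre-trained weights load (not part of the pipeline itself).
from transformers import BlipForConditionalGeneration, BlipProcessor
from ultralytics import YOLO

# YOLO: either a local file in models/YOLO_models or a model name that
# ultralytics will download automatically.
yolo_model = YOLO("models/YOLO_models/yolov8n.pt")
print("YOLO classes:", len(yolo_model.names))

# BLIP: either the local folder or the Hugging Face repository ID.
blip_path = "models/blip-image-captioning-base"  # or "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(blip_path)
blip_model = BlipForConditionalGeneration.from_pretrained(blip_path)
print("BLIP model loaded:", blip_model.config.model_type)
```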
## Usage

Execute the main script to process videos and generate results:

```bash
python src/main.py --annotations data/annotations/annotations.pkl --video_root data/videos --caption_model blip_base
```
Command-line arguments:

- `--annotations`: Path to the annotations file.
- `--video_root`: Directory containing video files.
- `--caption_model`: Captioning model (`blip_base`, `instruct_blip`, or `vit_g`).
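For reference, these flags could be parsed roughly as follows; this is an illustrative argparse sketch, not necessarily the exact parser used in `src/main.py`.

```python
import argparse

def parse_args():
    # CLI flags matching the usage shown above; defaults are illustrative.
    parser = argparse.ArgumentParser(description="Hazard detection and captioning")
    parser.add_argument("--annotations", default="data/annotations/annotations.pkl",
                        help="Path to the annotations file")
    parser.add_argument("--video_root", default="data/videos",
                        help="Directory containing video files")
    parser.add_argument("--caption_model", default="blip_base",
                        choices=["blip_base", "instruct_blip", "vit_g"],
                        help="Captioning model to use")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.annotations, args.video_root, args.caption_model)
```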
## Outputs

The system generates a `results.csv` file in the `results` directory. It contains:

- `ID`: Frame identifier (e.g., `video_0001_0` for the first frame of `video_0001.mp4`).
- `Driver_State_Changed`: Boolean flag (`True`/`False`) for state change detection.
- `Hazard_Track_X` and `Hazard_Name_X`: Tracks and descriptions of detected hazards, up to 22 slots.
Example:

```csv
ID,Driver_State_Changed,Hazard_Track_1,Hazard_Name_1,...,Hazard_Track_22,Hazard_Name_22
video_0001_0,False,1,"car detected",,,,,,,,,,,,,,,,,
video_0001_28,True,3,"bicycle detected",,,,,,,,,,,,,,,,,
```
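As an illustration of the format, a row with the fixed 22-slot layout can be assembled with Python's `csv` module; the helper below is a hedged sketch, not the project's actual writer.

```python
import csv

NUM_SLOTS = 22
HEADER = ["ID", "Driver_State_Changed"] + [
    col for i in range(1, NUM_SLOTS + 1) for col in (f"Hazard_Track_{i}", f"Hazard_Name_{i}")
]

def make_row(frame_id, state_changed, hazards):
    """hazards: list of (track_id, description) pairs, at most NUM_SLOTS long."""
    row = [frame_id, state_changed]
    for track, name in hazards[:NUM_SLOTS]:
        row += [track, name]
    # Pad unused slots so every row has the same number of columns.
    row += [""] * (2 * NUM_SLOTS - 2 * min(len(hazards), NUM_SLOTS))
    return row

with open("results/results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(HEADER)
    writer.writerow(make_row("video_0001_0", False, [(1, "car detected")]))
    writer.writerow(make_row("video_0001_28", True, [(3, "bicycle detected")]))
```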
## Core Components

1. Driver State Change Detection
   - Purpose: Analyze driver behavior to identify changes in state (e.g., slowing down, reacting to hazards).
   - Implementation: A robust algorithm evaluates movement trends in detected hazards using median distances across frames. A custom threshold mechanism determines state changes (see the sketch after this list).
   - Contribution:
     - Improved detection logic for accurately identifying driver state changes.
     - Integration of temporal filtering to minimize noise and false positives.

2. Hazard Detection and Description
   - Purpose: Identify hazards in each video frame and provide meaningful descriptions.
   - Implementation:
     - Object Detection: Utilizes the YOLOv8 model for bounding box detection, class identification, and confidence scoring.
     - Captioning: Employs state-of-the-art captioning models (e.g., BLIP) to generate descriptions of detected objects.
     - Dynamic Class Filtering: Introduced a filtering mechanism for relevant YOLO classes (e.g., cars, pedestrians) to focus on impactful hazards.
   - Contribution:
     - Dynamically retrieved YOLO class names, removing the need for hardcoded labels.
     - Combined class names with captions for improved hazard descriptions.

3. CSV Output Formatting
   - Purpose: Ensure output adheres to competition guidelines with consistent and structured data.
   - Implementation:
     - Automatically initializes a `results.csv` file with appropriate headers.
     - Records up to 22 hazards per frame, ensuring correct alignment of `Hazard_Track` and `Hazard_Name`.
     - Handles frames with no hazards gracefully by filling empty slots.
   - Contribution:
     - Streamlined hazard recording with unique identifiers and descriptive captions.
     - Adherence to competition requirements for structured CSV output.

4. Video Processing Pipeline
   - Purpose: Efficiently process multiple videos for hazard detection and driver state analysis.
   - Implementation:
     - Frame-by-frame analysis using OpenCV.
     - Skips non-relevant frames for performance optimization.
     - Detects and tracks hazards dynamically across frames.
   - Contribution:
     - Developed a modular and extensible pipeline for large-scale video processing.
     - Optimized performance by integrating filtering and temporal consistency checks.
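To make the median-distance idea in component 1 concrete, here is an illustrative sketch; the threshold, cool-down, and hazard-matching logic are placeholder assumptions, and the real implementation lives in `state_change_utils.py`.

```python
import numpy as np

def detect_state_change(prev_centers, curr_centers,
                        threshold=15.0, cooldown_frames=30, frames_since_last=999):
    """Illustrative check: flag a state change when the median displacement of
    tracked hazard centers between two frames exceeds a threshold, with a
    cool-down to avoid rapid toggling. Values are placeholders, not the
    project's actual tuning."""
    if frames_since_last < cooldown_frames or not prev_centers or not curr_centers:
        return False
    n = min(len(prev_centers), len(curr_centers))
    distances = [
        float(np.hypot(cx - px, cy - py))
        for (px, py), (cx, cy) in zip(prev_centers[:n], curr_centers[:n])
    ]
    return float(np.median(distances)) > threshold

# Example: hazards shift noticeably between consecutive frames.
prev = [(100, 200), (400, 220)]
curr = [(140, 210), (460, 235)]
print(detect_state_change(prev, curr))  # True with these placeholder numbers
```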
## Our Key Contributions

1. Enhanced Hazard Detection
   - Implemented a class filtering mechanism to prioritize relevant YOLO classes.
   - Introduced a global counter for generating unique `Hazard_Track` values, ensuring meaningful hazard tracking.

2. Dynamic Caption Generation
   - Combined YOLO-detected class names with state-of-the-art captioning models to produce accurate and interpretable hazard descriptions.
   - Improved interpretability of results by dynamically retrieving class names from the model.

3. Robust Driver State Detection
   - Developed a novel approach to detect driver state changes using median movement trends.
   - Added a cool-down mechanism to prevent rapid toggling of state changes.

4. Flexible and Modular Design
   - Designed a `YOLODetector` class that dynamically retrieves class names and processes detections efficiently (see the sketch after this list).
   - Modularized the pipeline into distinct components (video processing, detection, captioning), making it adaptable for future improvements.

5. Competition-Compliant Output
   - Ensured `results.csv` aligns with the competition format, including up to 22 hazards per frame with structured track-name pairs.
   - Addressed issues with missing or redundant hazards by integrating validation checks.
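As an illustration of the `YOLODetector` idea and the class-name-plus-caption combination, the sketch below uses the public `ultralytics` and `transformers` APIs; the relevance filter, method names, and paths are assumptions and may differ from `detection_utils.py`.

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor
from ultralytics import YOLO

# Assumed relevance filter; the project's actual class list may differ.
RELEVANT_CLASSES = {"car", "truck", "bus", "bicycle", "motorcycle", "person"}

class YOLODetector:
    def __init__(self, weights="models/YOLO_models/yolov8n.pt"):
        self.model = YOLO(weights)
        self.names = self.model.names  # class names retrieved from the model, not hardcoded

    def detect(self, image_path, conf=0.4):
        results = self.model(image_path, conf=conf)[0]
        detections = []
        for box in results.boxes:
            name = self.names[int(box.cls)]
            if name in RELEVANT_CLASSES:
                detections.append((name, box.xyxy[0].tolist(), float(box.conf)))
        return detections

# Combine each detected class name with a BLIP caption of the frame.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("pics/dog.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)

# Note: a street-scene frame would yield car/person detections;
# pics/dog.png may yield none after class filtering.
for name, bbox, score in YOLODetector().detect("pics/dog.png"):
    print(f"{name} ({score:.2f}): {caption}")
```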
## Testing

Run the YOLO detection module on a sample image:

```bash
python src/utils/detection_utils.py --image pics/dog.png --model YOLO_models/yolov8n.pt
```
Generate captions for an image:

```python
from PIL import Image

from src.utils.captioning_utils import get_captioner

# Load a sample image and caption it with the BLIP-based captioner.
image = Image.open("pics/dog.png")
captioner = get_captioner("blip_base")
caption = captioner.get_caption(image)
print("Generated Caption:", caption)
```
## Future Work

1. Support for Advanced Models
   - Incorporate cutting-edge detection models (e.g., YOLO11x, Detectron2) for enhanced accuracy and robustness.
   - ⚠️ High Computational Requirement: These models may require significant GPU memory for inference or fine-tuning. In my case, a laptop with only 2 GB of VRAM makes training or real-time use infeasible without a high-performance GPU.

2. Improved Driver State Detection
   - Use advanced temporal algorithms, such as LSTMs or Transformer-based models, to better capture driver reactions and state changes.
   - ⚠️ High Computational Requirement: Training Transformer-based models on large datasets is GPU-intensive.

3. Real-Time Hazard Monitoring
   - Adapt the system for real-time hazard detection, ensuring low latency for live monitoring applications.
   - Optimize for edge devices using lightweight models (e.g., YOLO-Nano, MobileNet).

4. Hazard Severity Estimation
   - Implement a scoring mechanism to prioritize detected hazards based on their proximity, size, and potential impact (a toy example follows this list).

5. Multi-Hazard Scenarios
   - Enhance the pipeline to handle complex scenes with multiple overlapping hazards using advanced tracking or segmentation techniques (e.g., Mask R-CNN, Panoptic Segmentation).
   - ⚠️ High Computational Requirement: Segmentation models like Mask R-CNN are resource-intensive, requiring high VRAM for training.
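As a purely illustrative starting point for hazard severity estimation, a simple geometric score could combine box size and position; the weights below are arbitrary placeholders and not part of the current system.

```python
def hazard_severity(bbox, frame_width, frame_height):
    """Toy severity score in [0, 1]: larger boxes lower and more central in the
    frame (i.e. nearer the ego vehicle) score higher. Weights are placeholders."""
    x1, y1, x2, y2 = bbox
    area_ratio = ((x2 - x1) * (y2 - y1)) / (frame_width * frame_height)      # proxy for size
    bottom_ratio = y2 / frame_height                                         # proxy for proximity
    center_offset = abs(((x1 + x2) / 2) - frame_width / 2) / (frame_width / 2)
    return min(1.0, 0.5 * area_ratio + 0.4 * bottom_ratio + 0.1 * (1 - center_offset))

# A large box low in the frame scores higher than a small distant one.
print(hazard_severity((500, 300, 900, 700), 1280, 720))   # ≈ 0.57
print(hazard_severity((100, 100, 150, 140), 1280, 720))   # ≈ 0.10
```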
## Acknowledgments

- YOLO Models: Powered by Ultralytics YOLO.
- BLIP Captioning Models: Provided by Hugging Face Transformers.
- Thanks to contributors and open-source communities for their tools and resources.
## License

This project is open-source and licensed under the MIT License.