Metadata Engine 🔍🕵️‍♀️

Python scripts to build a Metadata Engine that takes large amounts of raw HD video footage and generates useful, well-defined metadata for each video file and for each distinct moment, called a frame, in each video file.

What does it do? 🤔

This system uses the YOLOv5 object detection model to detect people in a given video and saves the metadata for each frame in a JSON file. It also draws bounding boxes around the detected people in real time for visualization purposes. From large amounts of raw HD video footage, it generates a variety of metadata, such as per-frame face tracking, unique face detection, and person tokenization per video.

System Overview 🚀

The system takes as input a video file and a directory path to save the metadata files. It uses OpenCV to read the frames from the video and passes each frame through the YOLOv5 model to detect people. The bounding box coordinates for the detected people are saved in a JSON file along with the frame number and the number of detected people in the frame.
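A minimal sketch of that loop is shown below. This is illustrative rather than the repository's exact code: the model is loaded from Ultralytics' hub, and the per-frame file naming is an assumption.

```python
import json
import os

import cv2
import torch

# Load a YOLOv5 model from Ultralytics' hub.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

def extract_metadata(video_path, metadata_dir):
    """Read a video frame by frame and write one JSON metadata file per frame."""
    os.makedirs(metadata_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_number = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        results = model(frame[..., ::-1])  # OpenCV gives BGR; the model expects RGB
        detections = results.xyxy[0]       # columns: xmin, ymin, xmax, ymax, conf, class
        people = detections[detections[:, 5] == 0]  # COCO class 0 is "person"
        metadata = {
            "frame_number": frame_number,
            "boxes": people[:, :4].tolist(),
            "num_people": len(people),
        }
        # Hypothetical naming scheme: one JSON file per frame.
        with open(os.path.join(metadata_dir, f"frame_{frame_number}.json"), "w") as f:
            json.dump(metadata, f)
        frame_number += 1
    cap.release()
```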

Metadata Generated📈

  • frame_number: the frame number of the current frame.

  • boxes: a list of bounding boxes, one for each detected person in the current frame. Each bounding box is represented as a list of four values, [xmin, ymin, xmax, ymax], where:

    • xmin and ymin: coordinates of the top-left corner of the bounding box

    • xmax and ymax: coordinates of the bottom-right corner of the bounding box

  • num_people: the number of people detected in the current frame.

The frame number is included to keep track of the metadata for each frame. The bounding box coordinates are necessary to locate the detected people in the frame, and the number of detected people provides an overview of the person density in the video.
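For illustration, a single frame's metadata record could look like this (the values are made up):

```json
{
  "frame_number": 42,
  "boxes": [
    [101.3, 88.0, 230.7, 455.1],
    [512.0, 130.5, 640.2, 470.8]
  ],
  "num_people": 2
}
```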

Why did we choose to display this metadata? 🤔

We chose these metadata categories because they provide a concise summary of the person detection results and are sufficient for further analysis and processing.

Performance 💪

The performance of the Engine can be evaluated in terms of its real-time factor. The Engine processes each frame of the video and generates metadata in the form of a JSON file for each frame. It also displays the video with bounding boxes around the detected people.

The performance of the Engine is highly dependent on the hardware specs of the machine it is running on. In general, the real-time factor of the Engine can be improved by using a more powerful machine with a faster CPU and GPU.

That said, the Engine uses the YOLOv5 model, which is known for its fast inference time and high accuracy. This means the Engine should be able to process video in near real time on a mid-range machine with a decent GPU.

To further improve the performance of the Engine, it may be helpful to experiment with YOLOv5 models of varying sizes and architectures and to adjust the input image size for inference. Optimizing the code for parallel processing and multi-threading could also improve the real-time factor of the Engine.

Overall, the given code has the potential to achieve real-time performance on mid-range machines with a decent GPU. However, to achieve better performance on weaker machines or to scale the Engine to process larger videos, further optimization may be necessary.

🚀 Performance Benchmarks

The given code performs person detection using the YOLOv5 model and saves the metadata for each frame in a JSON file. It also displays the frames with bounding boxes around the detected people in real time.

The hardware specs of the device used for testing include:

  • AMD Ryzen 5 4600H processor
  • 8 GB RAM
  • NVIDIA GTX 1650 graphics card
  • 64-bit operating system

After testing the code on a video, the real-time factor was found to be greater than 1, indicating that the code takes longer to process the video than the video's actual duration.

💻 Performance Evaluation

We ran the script on a desktop computer with an AMD Ryzen 5 4600H processor and 8 GB of RAM. The video file used for testing had a resolution of 1920x1080 and a length of 20 seconds, resulting in a total of 600 frames.

The script took approximately 60 seconds to process the entire video, which corresponds to a real-time factor of 3: the script needed three seconds of processing for every second of recorded video. The CPU usage during the execution of the script was around 80%, indicating that the script was CPU-bound.

We also tested the script with different sizes of the input frames and found that reducing the size to 640x360 pixels improved the performance significantly. With this resolution, the script was able to process the video in approximately 40 seconds, a 1.5× speedup corresponding to a real-time factor of 2. This improvement was due to the reduced computation required for smaller image sizes.
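To make those numbers concrete, the real-time factor here is processing time divided by video duration:

```python
video_duration = 600 / 30        # 600 frames at 30 fps is a 20-second video
rtf_full = 60 / video_duration   # 1920x1080: 60 s of processing -> factor 3.0
rtf_small = 40 / video_duration  # 640x360: 40 s of processing -> factor 2.0 (1.5x speedup)
```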

⚖️ Trade-offs

The performance of the script can be improved by making certain trade-offs. One option is to reduce the frame rate of the video. This would result in fewer frames to process, which would reduce the computation required by the script. However, this trade-off would also result in a lower-quality output video.
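As a rough sketch, this trade-off can be approximated by detecting on every Nth frame (`cap` and `model` as in the sketch above; the skip factor here is illustrative):

```python
FRAME_SKIP = 2  # run detection on every 2nd frame, roughly halving the work

frame_number = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_number % FRAME_SKIP == 0:
        results = model(frame[..., ::-1])  # detect only on kept frames
    frame_number += 1
```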

Another option is to use a smaller and faster object detection model. The YOLOv5s model used in the script is a relatively small and fast model compared to other object detection models. However, there are even smaller and faster models, such as YOLOv3-tiny and SSD MobileNet, which could improve the performance of the script at the cost of reduced accuracy.

Finally, another option is to use hardware acceleration, such as a GPU or an ASIC, to speed up the computation. This would require additional hardware and may not be cost-effective for small-scale applications. However, for large-scale applications, such as real-time video surveillance, hardware acceleration can significantly improve the performance of the script.

💡 Performance Considerations

To optimize the performance of the code, we can consider the following suggestions:

  • To reduce the processing time and memory usage, the frames are resized to a smaller size (640x640) before passing them through the model. This tradeoff between accuracy and speed was made to ensure real-time processing of the video.

  • Use a smaller YOLOv5 model, such as yolov5s, instead of the default yolov5x used in the code. The smaller model has fewer parameters and hence faster processing times.

  • Use multi-processing to parallelize the processing of each frame. This will utilize the multiple cores of the CPU and speed up the processing time.

  • Use a GPU to perform the person detection instead of the CPU. GPUs are optimized for parallel processing and can perform the computations faster than CPUs.

By implementing these suggestions, we can improve the performance of the code and reduce the real-time factor.
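A brief sketch of the model-side suggestions (yolov5s and the size argument are standard Ultralytics hub options; the file name is a placeholder):

```python
import torch

# yolov5s has far fewer parameters than yolov5x, so inference is much faster.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

# Run on the GPU when one is available.
model.to("cuda" if torch.cuda.is_available() else "cpu")

# The hub model accepts a target inference size; 640 matches the resize suggestion above.
results = model("frame.jpg", size=640)  # "frame.jpg" is a placeholder input
```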

📚 Resources Used

The system uses the following libraries and resources:

  1. OpenCV for video processing and visualization

  2. PyTorch for loading the YOLOv5 model

  3. Ultralytics' YOLOv5 repository for accessing the YOLOv5 model

  4. tqdm for progress tracking

  5. PIL for image preprocessing

  6. NumPy for array manipulation

  7. JSON for saving metadata

🛠️ Requirements

The following libraries are required to run this code:

  • OpenCV (cv2)
  • PyTorch
  • tqdm
  • Pillow (PIL)
  • NumPy
  • json (part of the Python standard library)

The YOLOv5 model is loaded using the PyTorch torch.hub.load method.

🚀 Usage

To use this code, simply instantiate an Engine object with the path to the video file and the directory where the metadata files should be saved. Then call the run or detect_people method to start the person detection process.

```python
video_path = "video1.mp4"
metadata_dir = "/path/to/metadata/directory"

engine = Engine(video_path, metadata_dir)
engine.run()
```

The run method will display the video frames with the detected bounding boxes in a window. The detect_people method performs the same detection process but does not display the frames in a window.
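For headless environments, such as a server without a display, the detect_people method is the natural entry point:

```python
engine = Engine(video_path, metadata_dir)
engine.detect_people()  # same metadata output, no visualization window
```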

Setup Instructions 🛠️

  1. Clone the repository:

```
git clone https://github.com/vipassana-01/metadata-engine.git
```

  2. To set up the environment using pip, run the following command in the terminal:

```
pip install -r requirements.txt
```

  3. To set up the environment using conda, run the following command in the terminal:

```
conda env create -f environment.yml
conda activate yolov5
```

  4. After activating the environment, run the Python script using the command:

```
python engine.py
```

Note: You should replace engine.py with the actual name of the Python script containing the code. Additionally, you should make sure that the video file and metadata directory exist and are accessible.

Contribution📈

If you wish to contribute to this project, feel free to fork the repository and submit a pull request. We welcome all contributions that improve the functionality or usability of the script.

Contribution Guidelines🦮

If you want to contribute to this project, follow the steps below:

  1. Fork the repository by clicking on the "Fork" button at the top right corner of the page.

  2. Clone your forked repository to your local machine:

```
git clone https://github.com/your-username/metadata-engine.git
```

  3. Create a new branch to work on. You can name the branch whatever you like, but it should describe the feature you are working on. For example, if you are adding a new feature for object detection, you can name the branch object-detection.

```
git checkout -b <branch-name>
```

  4. Make your changes to the codebase.

  5. Test your changes to make sure they work as expected.

  6. Commit your changes:

```
git commit -m "<commit-message>"
```

  7. Push your changes to your forked repository:

```
git push origin <branch-name>
```

  8. Create a pull request by clicking on the "New pull request" button on your forked repository page.

  9. Wait for your pull request to be reviewed and merged.

NOTE: Please make sure to follow the Code of Conduct while contributing.


Code of Conduct🏛

This project and everyone participating in it are governed by the Metadata Engine Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to this mail-ID.

Conclusion 🎉

In conclusion, the given script performs person detection on a video file in real time using the YOLOv5 object detection model. The performance of the script can be improved by reducing the size of the input frames, using a smaller and faster object detection model, or using hardware acceleration. However, these trade-offs come at the cost of reduced accuracy, lower-quality output, or additional hardware requirements.

License🎴

This project is licensed under the MIT License - see the LICENSE file for details.
