As the sun sets on my 16-week journey with Google Summer of Code (GSoC) 2024, I'm thrilled to share the fruits of my labor: VisionGuard, an innovative desktop application designed to combat eye strain and promote healthier computing habits. This project, developed under the mentorship of the OpenVINO Toolkit team, represents a significant step forward in leveraging advanced computer vision technology for personal well-being.
VisionGuard is a privacy-focused screen time management tool that uses your computer's webcam to monitor your gaze and encourage healthy viewing habits. By operating entirely locally and supporting inference on AI PCs' Neural Processing Units (NPUs), VisionGuard offers a unique blend of functionality, performance, and data security.
During the GSoC period, I successfully implemented the following features:
- Real-time Eye Gaze Tracking: Integrated a gaze detection engine built on OpenVINO models to accurately track user gaze without compromising privacy.
- Customizable Break Notifications: Developed a smart alert system that reminds users to take breaks based on the 20-20-20 rule.
- Comprehensive Statistics: Built a statistics calculator to provide daily and weekly screen time insights.
- Flexible Device Support: Enabled seamless switching between CPU, GPU, and NPU for inference, optimizing performance across hardware configurations.
- Multi-Camera Compatibility: Supported up to five camera devices for enhanced flexibility.
- Aesthetic Customization: Designed both dark and light themes for user preference.
- Resource Optimization: Integrated a system resource monitor and frame processing limits to ensure efficient performance.
- System Tray Integration: Developed a system tray application for quick access to key features without desktop clutter.
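The break notifications above follow the 20-20-20 rule: every 20 minutes of screen time, look at something 20 feet away for 20 seconds. A minimal sketch of that timer logic, with illustrative names and thresholds rather than VisionGuard's actual implementation:

```cpp
#include <chrono>

// Hypothetical sketch of a 20-20-20 rule check: once 20 minutes of screen
// time have accumulated, a break notification is due and the counter resets.
struct BreakTimer {
    std::chrono::seconds screenTime{0};
    static constexpr std::chrono::seconds interval{20 * 60}; // 20 minutes

    // Accumulate gaze-on-screen time; returns true when a break is due.
    bool addScreenTime(std::chrono::seconds delta) {
        screenTime += delta;
        if (screenTime >= interval) {
            screenTime = std::chrono::seconds{0}; // reset after notifying
            return true;
        }
        return false;
    }
};
```

In the real application the accumulated time would come from the gaze tracker rather than a fixed delta, and the notification itself would be raised through the system tray.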
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#ffffff' }}}%%
graph TD
    subgraph Client
        UI[User Interface]
        GVD[Gaze Vector Display]
        GCW[Calibration Window]
        STW[Screen Time Widget]
        STS[Statistics Window]
        CPR[Camera Permission Request]
        RCK[Run-Time Control Keys]
    end
    subgraph Backend
        CL[Core Logic]
        GDM[Gaze Detection Engine]
        GVC[Gaze Vector Calibration]
        EGT[Eye Gaze Time Tracker]
        BNS[Break Notification System]
        SC[Statistics Calculator]
        MC[Metric Calculator]
        PC[Performance Calculator]
    end
    subgraph Data
        UM[Usage Metrics]
    end
    UI <-->|Input/Output| CL
    CPR -->|Permission Status| CL
    RCK -->|Control Commands| CL
    CL <--> UM
    CL <--> GDM
    CL <--> GVC
    CL <--> EGT
    CL --> BNS
    CL <--> SC
    CL <--> MC
    CL <--> PC
    BNS --> UI
    CL --> GVD
    CL --> GCW
    CL --> STW
    SC --> STS
    PC --> UI
    style UI fill:#f0f9ff,stroke:#0275d8,stroke-width:2px
    style GVD fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
    style GCW fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
    style STW fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
    style STS fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
    style CPR fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
    style RCK fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
    style CL fill:#fff3cd,stroke:#ffb22b,stroke-width:2px
    style GDM fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
    style GVC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
    style EGT fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
    style BNS fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
    style SC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
    style MC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
    style PC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
    style UM fill:#f2dede,stroke:#d9534f,stroke-width:1px
```
For a detailed architectural overview of each component, please refer to the Detailed Component Architecture document.
The client consists of two main components:
- Main Window Application: Provides the primary user interface.
- System Tray Application: Runs in the background within the OS system tray.
The heart of VisionGuard is its gaze detection engine, leveraging several models from the OpenVINO model zoo:
- Face Detection: `face-detection-retail-0005`
- Head Pose Estimation: `head-pose-estimation-adas-0001`
- Facial Landmark Detection: `facial-landmarks-35-adas-0002`
- Eye State Estimation: `open-closed-eye-0001`
- Gaze Estimation: `gaze-estimation-adas-0002`
These models work together to create a robust gaze detection pipeline.
```mermaid
graph TD
    A[Image Input] --> B[Face Detection]
    B --> |Face Image| C[Facial Landmark Detection]
    B --> |Face Image| D[Head Pose Estimation]
    C --> E[Eye State Estimation]
    D --> |Head Pose Angles| F[Gaze Estimation]
    C --> |Eye Image| F
    E --> |Eye State| F
    F --> |Gaze Vector| G[Gaze Time Estimation]
    G --> H[Accumulate Screen Gaze Time]
    style B fill:#FFDDC1,stroke:#333,stroke-width:2px
    style C fill:#FFDDC1,stroke:#333,stroke-width:2px
    style D fill:#FFDDC1,stroke:#333,stroke-width:2px
    style E fill:#FFDDC1,stroke:#333,stroke-width:2px
    style F fill:#FFDDC1,stroke:#333,stroke-width:2px
    style G fill:#FFDDC1,stroke:#333,stroke-width:2px
    style A fill:#C1E1FF,stroke:#333,stroke-width:2px
    style H fill:#C1E1FF,stroke:#333,stroke-width:2px
```
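The stage ordering in this pipeline can be sketched as plain function composition. All types and functions below are illustrative stand-ins, not OpenVINO's actual API; the bodies are stubs whose only purpose is to show how each stage's output feeds the next:

```cpp
// Illustrative stand-ins for each stage's output; the real engine runs the
// OpenVINO model-zoo networks named above on the webcam frame.
struct FaceBox { int x, y, w, h; };
struct HeadPose { double yaw, pitch, roll; };
struct EyeLandmarks { bool valid; };
struct GazeVector { double x, y, z; };

// Stub stages: signatures mirror the data flow in the diagram.
FaceBox detectFace(/* frame */) { return {100, 80, 200, 200}; }
HeadPose estimateHeadPose(const FaceBox&) { return {0.0, -5.0, 0.0}; }
EyeLandmarks detectLandmarks(const FaceBox&) { return {true}; }
bool eyesOpen(const EyeLandmarks&) { return true; }
GazeVector estimateGaze(const HeadPose&, const EyeLandmarks&) {
    return {0.1, -0.2, -1.0}; // dummy vector pointing toward the screen
}
```

A single frame then flows through the stages in diagram order: face detection first, landmarks and head pose from the face crop, then eye state and the final gaze vector.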
One of the most critical aspects of ensuring VisionGuard’s accuracy is the calibration process. Accurate calibration is essential for precise gaze tracking, as it directly influences how well the application can detect and respond to where the user is looking on the screen. The calibration process I developed is both user-friendly and technically robust, designed to adapt to various screen sizes and user positions.
The calibration process begins with a Four-Point Gaze Capture. Users are prompted to focus on four green dots that appear sequentially in the corners of the screen. This step is crucial for gathering data on the user's gaze behavior from different angles. The process ensures that multiple gaze points are captured for each corner, improving the overall accuracy of the calibration.
Figure: A screen with four green dots representing the four-point calibration process.
Once the gaze data is captured, the next step is the Convex Hull Calculation. The system takes all the captured gaze points and computes the smallest polygon that can enclose these points, known as the convex hull. This polygon represents the boundary within which the user's gaze is expected to fall.
Figure: Visualization of the convex hull enclosing the captured gaze points.
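One standard way to compute such a hull is Andrew's monotone chain algorithm. The source does not specify which hull algorithm VisionGuard uses, so the following is a generic sketch:

```cpp
#include <algorithm>
#include <vector>

struct Point { double x, y; };

// Cross product of (b - a) x (c - a); positive for a counter-clockwise turn.
double cross(const Point& a, const Point& b, const Point& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Convex hull via Andrew's monotone chain: build the lower and upper
// chains over the x-sorted points, dropping any clockwise turns.
std::vector<Point> convexHull(std::vector<Point> pts) {
    std::sort(pts.begin(), pts.end(), [](const Point& a, const Point& b) {
        return a.x < b.x || (a.x == b.x && a.y < b.y);
    });
    if (pts.size() < 3) return pts;
    std::vector<Point> lower, upper;
    for (const Point& p : pts) {
        while (lower.size() >= 2 &&
               cross(lower[lower.size() - 2], lower.back(), p) <= 0)
            lower.pop_back();
        lower.push_back(p);
    }
    for (auto it = pts.rbegin(); it != pts.rend(); ++it) {
        while (upper.size() >= 2 &&
               cross(upper[upper.size() - 2], upper.back(), *it) <= 0)
            upper.pop_back();
        upper.push_back(*it);
    }
    lower.pop_back();  // each chain's endpoint starts the other chain
    upper.pop_back();
    lower.insert(lower.end(), upper.begin(), upper.end());
    return lower;  // hull vertices in counter-clockwise order
}
```

Gaze points captured inside the hull are discarded by construction, so only the outermost samples define the screen boundary.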
To account for potential inaccuracies in gaze detection, an Error Margin Application is performed. The convex hull is expanded by a predetermined margin (typically 150 pixels) to create a buffer zone. This extension ensures that slight deviations in gaze tracking won’t lead to incorrect detections.
Figure: The error margin applied to the convex hull to account for tracking inaccuracies.
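A simple way to approximate this buffer zone is to push each hull vertex away from the polygon's centroid by the margin; VisionGuard's exact offsetting method may differ:

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Expand a convex polygon by moving each vertex `margin` pixels outward
// from the centroid. This is an illustrative approximation of the buffer
// zone described above, not necessarily the application's exact method.
std::vector<Point> expandPolygon(const std::vector<Point>& poly, double margin) {
    Point c{0, 0};
    for (const Point& p : poly) { c.x += p.x; c.y += p.y; }
    c.x /= poly.size();
    c.y /= poly.size();

    std::vector<Point> out;
    out.reserve(poly.size());
    for (const Point& p : poly) {
        double dx = p.x - c.x, dy = p.y - c.y;
        double len = std::hypot(dx, dy);
        if (len == 0) { out.push_back(p); continue; }  // vertex at centroid
        out.push_back({p.x + margin * dx / len, p.y + margin * dy / len});
    }
    return out;
}
```

With the 150-pixel margin mentioned above, small gaze-estimation errors near the screen edges still land inside the expanded polygon.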
The final step in the calibration process is determining the Final Calibration Points. The extended convex hull is intersected with the screen boundaries, and the resulting points form the final calibration set. These points are crucial for accurate gaze tracking, ensuring that the system can reliably detect whether the user is looking at the screen.
Figure: The final calibration points determined after applying the error margin.
This comprehensive calibration process not only improves accuracy but also enhances the user experience by making the setup process straightforward and reliable.
At the core of VisionGuard’s functionality is its ability to process video frames in real time and update gaze-related metrics. This system works by analyzing each frame captured by the webcam to determine the user’s gaze direction and then updating the screen time metrics accordingly.
The process starts with Face and Gaze Detection. Using models from the OpenVINO toolkit, the application first detects the user’s face and then estimates their gaze direction. This step is critical as it forms the basis for all subsequent calculations.
Figure: The face and gaze detection process, which identifies the user’s gaze direction.
Once the gaze direction is estimated, the next step is to calculate the Gaze Screen Intersection. Here, the 3D gaze vector is projected onto the 2D screen plane. This conversion is essential to determine whether the user is looking at the screen and, if so, where on the screen their gaze is focused.
Figure: Illustration of the gaze vector intersecting with the 2D screen plane.
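A minimal sketch of this projection, assuming a coordinate system in which the screen lies in the z = 0 plane and the eye sits at positive z (the actual axis conventions in VisionGuard may differ):

```cpp
#include <optional>

struct Vec3 { double x, y, z; };
struct Point2D { double x, y; };

// Intersect the gaze ray (origin `eye`, direction `gaze`) with the screen
// plane z = 0. Returns nothing when the ray points away from the screen.
std::optional<Point2D> gazeOnScreen(const Vec3& eye, const Vec3& gaze) {
    if (gaze.z >= 0) return std::nullopt;   // looking away from the screen
    double t = -eye.z / gaze.z;             // ray parameter where z reaches 0
    return Point2D{eye.x + t * gaze.x, eye.y + t * gaze.y};
}
```

The resulting 2D point is then tested against the calibrated screen polygon to decide whether the gaze is on screen.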
Based on the intersection point, the system then performs a Gaze Time Update. If the user’s gaze is on the screen and their eyes are open, the application accumulates screen time. Conversely, if the gaze is off the screen or the eyes are closed, the system updates the gaze lost duration. If this duration exceeds a specified threshold, the accumulated screen time is reset.
Figure: Flowchart showing how gaze time is updated based on user behavior.
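The update rule can be sketched as follows (member names and the threshold value are illustrative, not taken from the VisionGuard source):

```cpp
// Per-frame gaze-time bookkeeping: accumulate screen time while the user
// looks at the screen with eyes open; otherwise grow the gaze-lost
// duration, and reset the accumulated time once the user has looked away
// long enough to count as a completed break.
struct GazeTimeTracker {
    double screenTime = 0;      // seconds of accumulated on-screen gaze
    double gazeLost = 0;        // seconds since the gaze left the screen
    double lostThreshold = 20;  // reset screen time after this long away

    void update(bool onScreen, bool eyesOpen, double dt) {
        if (onScreen && eyesOpen) {
            screenTime += dt;
            gazeLost = 0;
        } else {
            gazeLost += dt;
            if (gazeLost > lostThreshold)
                screenTime = 0;  // user took a sufficient break
        }
    }
};
```

Here `dt` is the elapsed time since the previous processed frame, so the tracker stays accurate even when the frame rate varies.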
Finally, the system provides Visual Feedback by marking detected facial features and displaying the current gaze time and lost duration on the frame. Alongside this, Performance Metrics such as CPU utilization, memory usage, and frame processing speed are tracked to ensure that VisionGuard runs efficiently.
Figure: Example of visual feedback provided by VisionGuard, along with performance metrics.
Determining whether the user's gaze is within the screen boundaries is a critical task that VisionGuard accomplishes using a Point-in-Polygon algorithm. Specifically, VisionGuard employs ray casting, a widely used technique for solving this problem.
The algorithm works by casting a ray from the gaze point and counting the number of intersections this ray has with the edges of the polygon representing the screen area. If the number of intersections is odd, the point lies inside the polygon; if even, it lies outside. This method is effective for both convex and concave polygons, making it highly adaptable.
Figure: Diagram illustrating how the ray-casting algorithm determines if a point is inside a polygon.
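A compact implementation of the ray-casting test described above (a generic sketch of the technique, not VisionGuard's exact code):

```cpp
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Ray-casting point-in-polygon test: cast a horizontal ray from `p` to the
// right and count how many polygon edges it crosses. An odd count means
// the point is inside. Works for convex and concave polygons alike.
bool insidePolygon(const Point& p, const std::vector<Point>& poly) {
    bool inside = false;
    std::size_t n = poly.size();
    for (std::size_t i = 0, j = n - 1; i < n; j = i++) {
        const Point& a = poly[i];
        const Point& b = poly[j];
        // Edge (a, b) is crossed when it straddles the horizontal line
        // through p and the intersection lies to the right of p.
        bool straddles = (a.y > p.y) != (b.y > p.y);
        if (straddles &&
            p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x)
            inside = !inside;
    }
    return inside;
}
```

In VisionGuard's case the polygon is the calibrated (and margin-expanded) screen boundary, and `p` is the projected gaze point for the current frame.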
The Point-in-Polygon algorithm is particularly well suited to VisionGuard's needs: once the 3D gaze vector has been projected onto the 2D screen plane, it reliably determines whether the resulting point falls within the calibrated screen area. This check underpins screen time tracking and ensures that users receive timely notifications to take breaks.
Throughout the GSoC experience, I encountered numerous challenges that significantly contributed to my learning and growth as a developer:
- Cross-platform C++ Development: Developing a cross-platform C++ application presented unique challenges, particularly in ensuring compatibility across different operating systems like macOS, Windows, and Linux. I faced difficulties with different compilers, such as issues compiling OpenVINO's Model Zoo demo with MSVC 2022, which required troubleshooting and problem-solving to ensure smooth builds across platforms.
- Understanding and Implementing CMake: CMake, a powerful build system, required me to deepen my understanding of build configurations and dependencies. This knowledge was essential in managing the complexity of a cross-platform project like VisionGuard.
- Low-Level Design Issues: Navigating C++'s low-level design complexities, particularly with object-oriented principles (OOP), was challenging. Implementing robust design patterns while maintaining performance required careful consideration of memory management and efficiency.
- Screen Calibration for Accurate Gaze Detection: One of the more technically demanding tasks was calibrating the screen to accurately detect if the user was gazing at the screen. This required developing a reliable and user-friendly calibration process that could adapt to different screen sizes and user positions.
- Adhering to C++ Development Standards: Ensuring that VisionGuard adhered to modern C++ development standards was vital for the project's long-term maintainability. I had to revise my approach to permissions and data storage, moving from storing stats in the current working directory to using appropriate libraries for handling resources securely and efficiently.
These challenges not only helped me improve VisionGuard but also significantly enhanced my problem-solving skills and understanding of cross-platform development.
While I'm proud of what I've accomplished during the GSoC period, there's always room for improvement. Some areas for future development include:
- Implementing comprehensive unit tests to ensure reliability and maintainability
- Developing GitHub workflows for automated building, testing, and linting
- Adding support for multi-monitor setups and multi-user environments
- Enhancing the statistics and reporting features for more detailed insights
My GSoC journey with OpenVINO and VisionGuard has been an incredible learning experience. I've had the opportunity to work with cutting-edge technology, collaborate with talented mentors, and create a tool that I believe can make a real difference in people's lives.
I want to express my heartfelt gratitude to my mentors, Dmitriy Pastushenkov and Ria Cheruvu, for their guidance and support throughout this journey. I also want to thank the entire OpenVINO Toolkit community for their invaluable resources and assistance.
If you're interested in trying out VisionGuard or contributing to its development, please check out our GitHub repository. Your feedback and contributions are always welcome!