The system interprets requests and dynamically generates Python code to interact with applications, using LangChain with OpenAI models.
Note: This project is a proof of concept demonstrating the automation capabilities described below. Be patient. It is not intended for production use without further validation, security measures, and adaptations.
- General Description
- Key Features
- Architecture and Components
- Prerequisites
- Installation
- Configuration
- Usage
- Screen Analysis and OCR
- Speech Recognition and Text-to-Speech
- Task Interpretation and LLM Pipeline
- Persistent Memory and Application Registry
- UI Interaction
- Logging and Debugging
- Error Handling and Exceptions
- Limitations and Usage Tips
- Contribution
- License
This project is an intelligent automation system designed to interact with Windows applications through an LLM: analyze the screen, recognize speech, speak responses, execute planned tasks, and interact with user interfaces in an automated manner. It leverages advanced language models (via the OpenAI API), LangChain, speech recognition, text-to-speech, screen capture, text recognition (OCR), and application interaction through tools like pywinauto and pyautogui.
As a proof of concept, it demonstrates how various components can be integrated to automate different tasks. It serves as an example and inspiration, but further study and hardening are needed before using it in a real-world environment.
- Launch and interact with Windows applications via `pywinauto` and `pyautogui`.
- Screen analysis: uses `mss` for screenshots, `pytesseract` for OCR, and OpenCV for detecting UI elements.
- Speech recognition: lets the user give voice commands (via `speech_recognition` and the Google Speech API).
- Text-to-speech: uses `pyttsx3` to provide spoken responses.
- Dynamic Python code generation: uses an LLM (ChatOpenAI) to produce UI interaction code from natural-language descriptions.
- Persistent memory: stores state and the application registry in `memory.json`.
- Detailed logging: actions, errors, and debug info are logged to `automation.log`.
Example requests:

- "Navigate to wikipedia.com with Chrome and find the definition of AI"
- "Write me a text on automation"
- "Draw me a cat"

Sample generated text:

> In the whispering woods, where the tall trees sway,
> Nature's beauty unfolds in a magical display.
> With petals like silk and leaves like a song,
> The chorus of life hums, inviting us along.
> The rivers dance lightly, reflecting the sky,
> While mountains stand guard, reaching up high.
> Each sunrise a canvas, each sunset a dream,
> In the heart of the forest, life flows like a stream.
> The fragrance of flowers, the warmth of the sun,
> In nature's embrace, we find joy, we are one.
> So let us wander where the wild things grow,
> In the splendor of nature, let our spirits flow.
The system can also accept pre-prepared tasks as JSON input, for example when it is connected to an external utility that supplies the steps. In this POC, simply type "sample".
```
TaskPlan = {
    "steps": [
        {
            "action": "open",
            "application": "notepad"
        },
        {
            "action": "interact",
            "details": {
                "process_name": "notepad.exe",
                "action_description": "Create a new file and type 'Hello, World!'"
            }
        }
    ]
}
```
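A minimal sketch of how such a plan could be validated before execution (illustrative only; the helper name and the exact checks are assumptions, not the project's actual code):

```python
import json

# Required keys per action type, mirroring the example plan above.
REQUIRED_KEYS = {"open": ("application",), "interact": ("details",)}

def validate_plan(raw: str) -> dict:
    """Parse the LLM's JSON output and reject malformed steps early."""
    plan = json.loads(raw)
    for step in plan.get("steps", []):
        action = step.get("action")
        if action not in REQUIRED_KEYS:
            raise ValueError(f"Unknown action: {action!r}")
        for key in REQUIRED_KEYS[action]:
            if key not in step:
                raise ValueError(f"Step missing required key: {key!r}")
    return plan
```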
- Main file: the script (within the `if __name__ == "__main__":` block) is the program's entry point.
- Key classes:
  - `PersistentMemory`: manages persistent state (read/write from `memory.json`).
  - `ApplicationRegistry`: maintains a registry of discovered tools/applications and usage statistics.
  - `ScreenAnalyzer`: captures and analyzes the screen, performs OCR, and detects UI elements.
  - `WindowLocator`: finds windows associated with a given process or application.
  - `InteractionStrategies`: logic to interact with a window automatically (executes code generated by the LLM).
  - `TaskInterpreter`: interprets user requests into a task plan (uses the LLM).
  - `TaskExecutor`: executes the task plan step by step (open app, interact, capture screen, etc.).
- Operating System: Windows (required for pywinauto, win32gui, etc.).
- Python 3.8+ recommended.
- An OpenAI API key.
- Microphone and speakers (for voice recognition and text-to-speech).
- Tesseract OCR installed on the machine (and accessible in the PATH). Installer available here: https://github.com/UB-Mannheim/tesseract/wiki.
- Internet connection (for Google Speech Recognition and OpenAI API).
- Clone the repository:

  ```
  git clone https://github.com/JacquesGariepy/CommanderAI.git
  cd CommanderAI
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```
Dependencies include (not exhaustive):

- `pyautogui`, `psutil`, `pywinauto`, `win32gui`, `win32process`
- `mss`, `pytesseract`, `numpy`, `Pillow`, `opencv-python-headless`
- `speech_recognition`, `pyttsx3`
- `langchain`, `openai`

Plus standard libraries: `ast`, `locale`, `os`, `json`, `asyncio`, `logging`, `re`, `time`, `pathlib`, `typing`, `subprocess`, `sys`.
- Install Tesseract:
  - Download and install Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
  - Ensure `tesseract.exe` is in your PATH.
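If `tesseract.exe` is installed but not on the PATH, `pytesseract` can be pointed at it directly (the path shown is the default installer location; adjust it to your machine):

```python
import pytesseract

# Only needed when tesseract.exe is not discoverable via PATH.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```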
- OpenAI API key: set the `OPENAI_API_KEY` environment variable with your key in `.env`.
- System language: the code detects your system language via `locale.getdefaultlocale()[0]`.
- Tesseract settings: default is `TESSERACT_CONFIG = '--oem 3 --psm 6'` and `lang='fra'`. Adjust if needed.
- Memory file: `MEMORY_FILE = "memory.json"`. Ensure it's writable.
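Taken together, the configuration values above could be loaded like this (a sketch; the constant names mirror this README, and the use of `python-dotenv` for the `.env` file is an assumption):

```python
import locale
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # pulls OPENAI_API_KEY from .env into the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

SYSTEM_LANGUAGE = locale.getdefaultlocale()[0]  # e.g. 'en_US' or 'fr_FR'
TESSERACT_CONFIG = '--oem 3 --psm 6'
MEMORY_FILE = "memory.json"
```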
Run the script:

```
python app.py
```
The program:
- Initializes components.
- Asks if you want to use voice or text mode.
- You can then say or type a command like "Open Notepad" or "Take a screenshot".
- The program interprets your command, generates a task plan, and executes it.
- Results are displayed and logged in `automation.log`.
- Text Mode: Type commands directly.
- Voice Mode: The script prompts you to speak. It listens, transcribes, and executes your commands.
- Open an application:
- Text mode: type "open notepad".
- Voice mode: say "open notepad".
- List available applications: say or type "list".
- Help: say or type "help".
`ScreenAnalyzer`:

- Uses `mss` for screenshots.
- Converts and thresholds the image, then runs OCR with Tesseract.
- Detects UI elements (buttons, text fields) via OpenCV.

Screen analysis results can validate UI states after actions.
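The pipeline roughly corresponds to the following sketch (assumed wiring, not the project's exact code):

```python
import cv2
import mss
import numpy as np
import pytesseract

# Capture the primary monitor and drop the alpha channel (BGRA -> BGR).
with mss.mss() as sct:
    shot = sct.grab(sct.monitors[1])
frame = np.array(shot)[:, :, :3]

# Grayscale + Otsu threshold to clean the image up for OCR.
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

text = pytesseract.image_to_string(binary, lang='fra', config='--oem 3 --psm 6')

# Contours approximate candidate UI elements (buttons, text fields).
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]
```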
- Speech recognition: `speech_recognition` + the Google Speech API (requires Internet).
- Text-to-speech: `pyttsx3` speaks the program's responses.

Users can interact entirely by voice if desired.
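A minimal voice round-trip with these two libraries might look like this (assumed wiring, not the project's exact loop):

```python
import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()
engine = pyttsx3.init()

# Listen once on the default microphone.
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)

try:
    command = recognizer.recognize_google(audio)  # requires Internet
    print(f"You said: {command}")
except sr.UnknownValueError:
    engine.say("Sorry, I did not understand.")
    engine.runAndWait()
```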
- `TaskInterpreter` sends user requests to the LLM (ChatOpenAI via LangChain).
- The LLM returns a JSON task plan.
- `TaskExecutor` executes these steps.
- For interaction steps, code is dynamically generated by the LLM, validated, and executed.

This integration showcases how LLMs can aid in automation scenarios.
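The interpret step amounts to something like the sketch below (the prompt and model settings are assumptions, and the import path varies across LangChain versions):

```python
import json

from langchain.chat_models import ChatOpenAI  # langchain_openai.ChatOpenAI in newer releases

llm = ChatOpenAI(temperature=0)  # reads OPENAI_API_KEY from the environment

prompt = (
    "Convert this request into a JSON task plan with a 'steps' list of "
    "objects shaped like {'action': 'open' | 'interact', ...}. "
    "Request: open notepad and type 'Hello, World!'"
)
response = llm.invoke(prompt)
plan = json.loads(response.content)  # then handed to the executor
```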
- `PersistentMemory` reads/writes `memory.json` to persist state across sessions.
- `ApplicationRegistry` maintains a list of discovered apps and usage stats.
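In essence, `PersistentMemory` behaves like this small sketch (illustrative; the project's actual class may expose different methods):

```python
import json
from pathlib import Path

class PersistentMemory:
    """Tiny key-value store persisted to memory.json between sessions."""

    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, key, default=None):
        return self.data.get(key, default)

    def set(self, key, value):
        self.data[key] = value
        self.path.write_text(json.dumps(self.data, indent=2))
```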
- `WindowLocator` finds application windows.
- `pywinauto` and `pyautogui` handle window interactions (clicking, typing, selecting menus).
- `InteractionStrategies` dynamically generates interaction code from action descriptions.
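For instance, the generated code for a Notepad step might resemble this (an illustrative `pywinauto` snippet, not actual LLM output):

```python
from pywinauto import Application

# Launch Notepad, wait for its window to be ready, then type into it.
app = Application(backend="uia").start("notepad.exe")
window = app.window(title_re=".*Notepad.*")
window.wait("ready", timeout=10)
window.type_keys("Hello, World!", with_spaces=True)
```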
- All steps and errors are logged in `automation.log`.
- Adjust the `logging` level (e.g., `logging.DEBUG`) for detailed diagnostics.
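A setup consistent with that behavior (assumed; the project's exact format string may differ):

```python
import logging

logging.basicConfig(
    filename="automation.log",
    level=logging.INFO,  # switch to logging.DEBUG for detailed diagnostics
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
```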
- Critical operations (OCR, app launch, speech recognition) are in try/except blocks.
- On error, a message is logged (and possibly spoken), and the script attempts to continue where possible.
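The pattern looks roughly like this (a sketch with an injected executor; the real code wraps OCR, app launch, and speech recognition similarly):

```python
import logging

import pyttsx3

engine = pyttsx3.init()

def run_step(step, executor):
    """Run one plan step; on failure, log it, speak it, and carry on."""
    try:
        executor(step)
    except Exception as exc:
        logging.error("Step %r failed: %s", step, exc)
        engine.say(f"Step failed: {exc}")
        engine.runAndWait()
        # Deliberately no re-raise: the script continues with the next step.
```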
- Limitations:
  - Speech recognition depends on microphone quality and Internet connectivity.
  - OCR and UI detection depend on the visual quality and context of the screen.
  - The LLM can generate imperfect code; this proof of concept attempts to handle retries.
- Tips:
  - Test with simple apps (Notepad, Calculator).
  - Verify the Tesseract installation.
  - If voice recognition fails, use text mode.
  - Ensure a valid OpenAI key.
Contributions are welcome to:
- Fix bugs.
- Add features.
- Improve code robustness and validation.
- Adapt this proof of concept to real-world environments.
Submit a pull request with your changes and a clear description.
This proof of concept is not explicitly licensed. For use beyond experimental purposes, contact the author or define an appropriate license.