In our project, we demonstrate a prototype of smart glasses that helps blind people navigate and identify objects more easily.
components:
- smart glasses: stream video and audio to the internet by connecting to a wifi network. For portability, they can connect to a mobile hotspot instead.
- backend / server / computer: receives the video and audio streams, analyzes the audio stream for voice commands, and sends the video data to several AI models that convert it into useful text.
- in the prototype: the text is read aloud by the computer and the output audio is streamed to the blind person's phone through the internet (a text-to-speech sketch follows this list)
- ideally: the text is sent to a custom app on the blind person's phone, which reads it aloud
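For the prototype's read-aloud step, here is a minimal text-to-speech sketch using the gtts and playsound packages from the pip list further down; the output filename and the example sentence are placeholders:

```python
# Minimal sketch of the prototype's read-aloud step (not the final design,
# where the custom phone app would do the speaking).
from gtts import gTTS
from playsound import playsound

def speak(text: str, filename: str = "tts_output.mp3") -> None:
    """Convert text to speech with Google TTS and play it on the computer."""
    gTTS(text=text, lang="en").save(filename)
    playsound(filename)

speak("There is a chair about two meters ahead, slightly to your left.")
```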
note on image / video:
- in the prototype: an image is taken from the video stream every 10 seconds and sent to several single-image machine learning models (see the frame-sampling sketch below)
- ideally: the video stream itself is taken as the input to more sophisticated machine learning models in order to take advantage of the temporal relations between frames
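For illustration, a rough sketch of that frame-sampling step, pulling one frame from the glasses' video stream every 10 seconds. It assumes opencv-python (an extra dependency not in the pip list below) and a placeholder stream URL; the current prototype takes screenshots of the screen instead, as noted further down.

```python
# Rough sketch: grab one frame from the glasses' video stream every 10 seconds.
# Assumes opencv-python (extra dependency) and a placeholder stream URL.
import time
import cv2

STREAM_URL = "http://192.168.0.42:8080/video"  # placeholder address for the glasses' stream

def sample_frames(interval_s: int = 10) -> None:
    cap = cv2.VideoCapture(STREAM_URL)
    try:
        while True:
            ok, frame = cap.read()
            if ok:
                # hand the frame to the single-image models (detection, captioning, ...)
                # a real implementation would also need to flush stale buffered frames
                cv2.imwrite("latest_frame.jpg", frame)
            time.sleep(interval_s)
    finally:
        cap.release()
```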
voice commands:
- [nothing]: every 10 seconds, reads aloud the objects it detected according to the flowchart.
- "help": reads aloud available commands
- "start": turns on
- "stop": turns off
- "remind": saves the recording of what the user said
- "reminder": plays the most recent "remind" recording
- "describe": sends the image to a natural languge processing model to generate a coherent description of the image.
- "picture": saves the image and in the prototype saves it to a google drive for future use of the blind person; ideally the image will be sent to the custom app
- "crop" (todo): crops image to a rectangular object such as a poster or a paper the user is holding; describes the cropped image and saves it as well
- "point" (todo): reads aloud what object the user is pointing at with the blind cane
the prototype literally screenshots the screen every cycle and sends it to the AI models, so open your camera app in fullscreen to test (https://webcamtests.com/)
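A minimal sketch of that screenshot path, combining pyautogui with the Hugging Face captioning model linked at the bottom of this page; this is roughly what backs the periodic read-aloud and the "describe" command. The 10-second interval matches the notes above; everything else is illustrative.

```python
# Sketch of the prototype's image path: screenshot the screen every 10 seconds
# and caption it with the Hugging Face model linked at the bottom of this page.
import time
import pyautogui
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def describe_screen() -> str:
    image = pyautogui.screenshot()   # PIL Image of the full screen
    result = captioner(image)        # e.g. [{"generated_text": "a cat lying on a bed"}]
    return result[0]["generated_text"]

if __name__ == "__main__":
    while True:
        print(describe_screen())     # the prototype would pass this text to text-to-speech
        time.sleep(10)
```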
install libraries (for huggingface):
pip install sounddevice soundfile gtts transformers pyautogui pillow speechrecognition scipy torch timm playsound
for gcloud:
- https://cloud.google.com/sdk/docs/install-sdk
- https://googleapis.dev/python/google-api-core/latest/auth.html
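Once the SDK and application-default credentials from the two links above are configured, a label-detection call looks roughly like this (the google-cloud-vision package is an extra pip install, and the frame filename is a placeholder):

```python
# Rough sketch of a Google Cloud Vision label-detection call.
# Assumes `pip install google-cloud-vision` and credentials set up via the links above.
from google.cloud import vision

def label_image(path: str = "latest_frame.jpg") -> None:  # placeholder filename
    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(label.description, round(label.score, 2))
```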
for imageai:
- download the 3 .pt files on this webpage (scroll down a bit) https://github.com/OlafenwaMoses/ImageAI/tree/master/imageai/Detection
- put them in the same folder as the python scripts
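With the weights in place, detection with ImageAI looks roughly like the sketch below (it assumes the yolov3.pt file from that page and a placeholder frame filename):

```python
# Rough sketch of object detection with ImageAI, assuming the yolov3.pt weights
# downloaded from the page above sit next to the python scripts.
from imageai.Detection import ObjectDetection

detector = ObjectDetection()
detector.setModelTypeAsYOLOv3()
detector.setModelPath("yolov3.pt")
detector.loadModel()

detections = detector.detectObjectsFromImage(
    input_image="latest_frame.jpg",             # placeholder frame
    output_image_path="latest_frame_boxes.jpg"  # copy with boxes drawn on it
)
for det in detections:
    print(det["name"], det["percentage_probability"], det["box_points"])
```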
for huggingface:
- no setup required; we will use this as the main one
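For this main path, object detection can go through the transformers object-detection pipeline; a minimal sketch below. The facebook/detr-resnet-50 model is an assumed choice (it may be why timm appears in the pip list), as is the score threshold.

```python
# Minimal sketch of the main (Hugging Face) detection path.
# facebook/detr-resnet-50 is an assumed model choice; it requires timm (in the pip list).
from PIL import Image
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def detect_objects(path: str = "latest_frame.jpg") -> list:
    image = Image.open(path)
    names = []
    for det in detector(image):
        # each det looks like {"label": "chair", "score": 0.97, "box": {"xmin": ..., ...}}
        if det["score"] > 0.7:   # arbitrary confidence threshold
            names.append(det["label"])
    return names
```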
IMPORTANT: press Ctrl+C to stop the program; please stop it after a few tries, because every Google Cloud call uses money from my account (currently running on free credits given by Google)
job division:
- kohei: detect whether an object is to the left or right (and maybe whether it is on the ground, to prevent tripping) (see the sketch after this list)
- ngoni: detect the cane (off-the-shelf object detection can't detect it; not sure how to yet, maybe by detecting its white pixels) (if possible, detect which bounding boxes overlap with the cane so we can say what the blind person is pointing at)
- lucy: add depth estimation to objects (https://github.com/nianticlabs/monodepth2)
- toshi: add text OCR
- ambitious: train a custom object detector for the white cane (https://manivannan-ai.medium.com/how-to-train-yolov2-to-detect-custom-objects-9010df784f36, https://manivannan-ai.medium.com/how-to-train-yolov3-to-detect-custom-objects-ccbcafeb13d2)
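As referenced in the first job above, deciding whether a detected object is to the left or right can be done from its bounding box alone; a minimal sketch (the box format follows the Hugging Face detector output sketched earlier, and the one-third split of the frame is an arbitrary choice):

```python
# Sketch for the left/right job: place a detected object from its bounding box.
# Box format follows the Hugging Face detector output above; splitting the frame
# into thirds (left / ahead / right) is an arbitrary choice.
def horizontal_position(box: dict, image_width: int) -> str:
    center_x = (box["xmin"] + box["xmax"]) / 2
    if center_x < image_width / 3:
        return "left"
    if center_x > 2 * image_width / 3:
        return "right"
    return "ahead"

# e.g. horizontal_position({"xmin": 40, "ymin": 80, "xmax": 160, "ymax": 300}, 640) -> "left"
```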
models / references:
- https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
- google cloud vision
- https://github.com/OlafenwaMoses/ImageAI
- https://arxiv.org/abs/1708.02002
- https://arxiv.org/abs/1804.02767
- https://github.com/NVlabs/SegFormer
- https://arxiv.org/abs/2105.15203
- https://github.com/nianticlabs/monodepth2