This lab will show you how to run pretrained models to locate people and objects in images, optimize their latency, and train on custom classes.
- Train your Model
- Upload your Model to the Pi
- Test the Model
- Train a Custom Model
- Run a Custom Model
- Next Steps
Just like in Lab 2, we'll be training our models on a cloud server, using
Google's free Colab service. Open the notebook
and follow the directions until you have completed the Download a Model
step and have the yolov8n_int8.tflite
file downloaded.
If you aren't able to run the training notebook, you can still download example models and try the rest of the Pi commands by running this:
cd ~/labs/lab3
./download_models.sh
You should be remotely connected to your Pi through VS Code, with this
repository open. In the file explorer, open up the models
folder, and drag
the YOLO model file into it, as you did with the flower model in the previous
lab.
This lab includes an example image of a bus with people in front of it, which we can use to test the model with this command:
cd ~/labs/lab3
python locate_objects.py \
--model_file=../models/yolov8n_int8.tflite \
--label_file=../models/yolov8_labels.txt \
--image=../images/bus.jpg
The three arguments are the path to the model file, the path to a labels file, and the image file to use as the input.
You should see something like this, where each line lists a detected class, its confidence score, and the position and size of its bounding box, with the final line showing how long the detection took:
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
person: 0.83 (39, 135) 51x108
bus: 0.82 (113, 100) 218x107
person: 0.72 (79, 130) 33x93
person: 0.66 (205, 123) 37x117
time: 24.521ms
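Under the hood, a script like locate_objects.py loads the .tflite file with the TensorFlow Lite interpreter (that's where the XNNPACK log line comes from) and runs one inference per image. Here's a minimal sketch of that core step, leaving out the YOLO output decoding and non-max suppression the real script also has to do, and assuming the tflite_runtime and Pillow packages are installed:

import numpy as np
from PIL import Image
from tflite_runtime.interpreter import Interpreter  # tf.lite.Interpreter also works

# Load the model and allocate its input/output buffers.
interpreter = Interpreter(model_path="../models/yolov8n_int8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Resize the image to the size the model expects. A quantized or float model may
# also need the pixel values rescaled; the real script handles that.
_, height, width, _ = input_details[0]["shape"]
image = Image.open("../images/bus.jpg").resize((width, height))
input_data = np.expand_dims(np.array(image, dtype=input_details[0]["dtype"]), axis=0)

# Run one inference and fetch the raw output tensor, which still needs to be
# decoded into boxes, scores, and class labels.
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
raw_output = interpreter.get_tensor(output_details[0]["index"])
print(raw_output.shape)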
To have some fun with it, you can also take live camera input and output the results to an image you can view in VS Code, with this command:
python locate_objects.py \
--model_file=../models/yolov8n_int8.tflite \
--label_file=../models/yolov8_labels.txt \
--camera=0 \
--save_output=output.png
You'll need to find the output.png
image in the lab3
folder in VS Code's
file explorer. Once you select that, you should see the camera input with
bounding boxes and labels overlaid.
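For reference, drawing that kind of overlay yourself is straightforward with OpenCV. This isn't necessarily how locate_objects.py implements it, just a sketch of grabbing one frame from camera 0, drawing a labeled box, and writing output.png; it assumes opencv-python is installed, and the detection values are placeholders:

import cv2

# Grab a single frame from the first camera (the same device as --camera=0).
capture = cv2.VideoCapture(0)
ok, frame = capture.read()
capture.release()
if not ok:
    raise RuntimeError("Couldn't read a frame from the camera")

# Hypothetical detection: label, confidence, top-left corner, width and height.
label, score, (x, y, w, h) = "person", 0.83, (39, 135, 51, 108)

# Draw the bounding box and its label, then save an image VS Code can display.
cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(frame, f"{label}: {score:.2f}", (x, y - 5),
            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
cv2.imwrite("output.png", frame)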
The default YOLOv8 model is trained on the COCO 2017 dataset, which contains 80 different categories of objects. It's likely that a real-world application will need to recognize other classes, though, so to complete this lab, return to the Colab to learn how to retrain a model to recognize custom categories.
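As a preview of what the Colab walks through, the Ultralytics Python API handles both the retraining and the TFLite export. This is only a rough sketch, assuming the ultralytics package and its bundled african-wildlife.yaml dataset config; the epoch count and image size are just examples, and the notebook's exact arguments may differ:

from ultralytics import YOLO

# Start from the pretrained nano checkpoint and fine-tune on a custom dataset.
model = YOLO("yolov8n.pt")
model.train(data="african-wildlife.yaml", epochs=50, imgsz=320)

# Export the fine-tuned model as an int8-quantized TFLite file for the Pi.
model.export(format="tflite", int8=True, imgsz=320)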
Once you've completed the Colab, you should have an african_wildlife.tflite
file in the models folder. To test it out, run:
python locate_objects.py \
--model_file=../models/african_wildlife.tflite \
--label_file=../models/african_wildlife_labels.txt \
--image=../images/zebra.jpeg
There are a lot of other kinds of information you can retrieve from images using computer vision. Here are some bonus examples that show a few of them.
Ultralytics includes an excellent pose detection model as part of their YOLOv8 collection. I've included a pretrained model in this repository which you can try out with this command:
python locate_objects.py \
--model_file=../models/yolov8n-pose_int8.tflite \
--output_format=yolov8_pose \
--label_file=../models/yolov8-pose_labels.txt \
--image=../images/bus.jpg \
--save_output=output.png
You'll see that it shows a rough skeleton of each person it detects. To run it on a live camera, you can use this:
python locate_objects.py \
--model_file=../models/yolov8n-pose_int8.tflite \
--output_format=yolov8_pose \
--label_file=../models/yolov8-pose_labels.txt \
--camera=0 \
--save_output=output.png
Most of the extra complexity for this model comes from its use of keypoints to
define the important parts of the skeletal model it displays. These need to be
handled by the non-max suppression process which takes the raw output of the
model (which consists of a lot of overlapping boxes and their keypoints) and
merges them into a few clean boxes and poses. You can read more about this NMS
process in Non-Max Suppressions - How do they Work?.
This is also why we have to supply the --output_format=yolov8_pose flag, so that the code can interpret the output of the model correctly.
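If you want to see the idea in code, here's a minimal sketch of greedy, IoU-based non-max suppression over plain boxes (the pose version also has to carry each box's keypoints along, but the merging logic is the same). This is a simplified illustration, not the exact implementation used by the lab script:

def iou(a, b):
    # Boxes are (x1, y1, x2, y2); IoU is intersection area over union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(detections, iou_threshold=0.45):
    # detections: list of (score, box) pairs from the raw model output.
    # Keep the highest-scoring box, drop anything that overlaps it too much, repeat.
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in detections:
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept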
So far we've been using the "nano" version of Ultralytics' YOLOv8 models,
because it's the smallest and fastest. You can also try the other model sizes,
which offer increased accuracy at the cost of higher latency. Another tradeoff
you can look at is the input image size, set by the imgsz
parameter when
exporting. You can increase the image size to capture finer detail, but the model
will take longer to process the larger input.
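Both of those knobs are set when you export the model. As a rough sketch, again assuming the ultralytics package (the particular size choices here are just examples):

from ultralytics import YOLO

# "s" (small) trades speed for accuracy compared with the "n" (nano) model;
# "m", "l", and "x" continue up the size/latency curve.
model = YOLO("yolov8s.pt")

# A larger imgsz captures finer detail but makes each inference slower.
model.export(format="tflite", int8=True, imgsz=640)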
The Mediapipe project from Google is another great source of pretrained models for a lot of different tasks, including face and hand detection, gesture recognition, and segmentation. You can find an example for the Raspberry Pi here.
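To give a taste of what the MediaPipe Tasks API looks like, here's a rough sketch of running its face detector on a still image. It assumes you've installed the mediapipe package and downloaded one of Google's face detection .tflite models to the path shown (the filename is just a placeholder):

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Point the task at a downloaded face detection model (placeholder filename).
base_options = python.BaseOptions(model_asset_path="face_detector.tflite")
detector = vision.FaceDetector.create_from_options(
    vision.FaceDetectorOptions(base_options=base_options))

# Run detection on a still image and print each face's bounding box.
image = mp.Image.create_from_file("../images/bus.jpg")
result = detector.detect(image)
for detection in result.detections:
    box = detection.bounding_box
    print(box.origin_x, box.origin_y, box.width, box.height)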
Finding text in a natural scene and reading it is known as "scene OCR", and you
can give it a try for yourself by installing EasyOCR with pip install --break-system-packages easyocr
and then running python ./read_text.py
.
Unfortunately this takes over 30 seconds per frame, even on a Pi 5, so it's not useful for real-time applications. I hope that we'll see smaller models emerge for this task in the future.
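If you want to call EasyOCR directly rather than going through read_text.py, the core API is very small. Here's a minimal sketch; the language list and image path are just examples:

import easyocr

# Build a reader for English text; the model files are downloaded on first use.
reader = easyocr.Reader(["en"])

# readtext returns a list of (bounding box, text, confidence) tuples.
for box, text, confidence in reader.readtext("../images/bus.jpg"):
    print(f"{confidence:.2f}: {text}")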