This repo is forked and modified from Ultralytics' YOLOv5 PyTorch implementation. Visit here for the original README.
The project is structured as follows:
<root>
|-- adr (architectural decision records)
|-- archive
|-- classify (classification)
|-- data
| |-- hyps (hyperparam YAMLs)
| | |-- hyp.*.yaml
| | |-- ...
| |
| |-- images (images folder)
| |-- scripts (bash scripts folder)
| |-- pricetag.yaml (dataset config for this project)
| |-- *.yaml (configs for other datasets)
|
|-- models (model configs)
| |-- hub
| |-- segment
| |-- *.py
| |-- yolov5*.yaml (YOLOv5 configs)
|
|-- runs
| |-- detect
| | |-- exp* (folder containing detection results)
|
|-- segment (Python files for segmentation)
|-- utils (utility functions)
|-- detect.py (run this to perform inference)
|-- train.py (run this to train your model)
|-- *.py (other Python files)
|-- requirements.txt
Note: Some files are omitted in the project tree above for the sake of brevity.
This section outlines the step-by-step procedure to replicate this project.
- Spin up EC2 instances with Ubuntu AMI 20.04. Configure them as follows:
- Instance 1:
- Name: YOLO-CentralMgt
- Type: t2.micro
- Instance 2:
- Name: YOLO-SubmHost
- Type: t2.micro
- Instance 3:
- Name: YOLO-Executor
- Type: t2.micro (Can be upgraded to t2.large if needed)
- Group all instances into a single security group. A private key pair (.pem) is used to log in throughout this project.
- Access each instance from a terminal using the command shown under the Connect button of that instance.
$ cd .ssh
$ ssh -i "private_key.pem" ubuntu@ec2-11-111-11-111.compute-1.amazonaws.com
- Install HTCondor on all three (3) instances.
- Update the package index on every instance via sudo apt-get update.
- Install HTCondor using this guide on the respective machines as follows (note that the variables marked with $ should be replaced with user-defined values):
- YOLO-CentralMgt:
$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager $central_manager_name
- YOLO-SubmHost:
$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit $central_manager_name
- YOLO-Executor:
$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute $central_manager_name
- At YOLO-SubmHost, run condor_status to see the execute machines in the pool, condor_submit to submit jobs, and condor_q to monitor the submitted jobs.
- On each instance, update /etc/hosts with the IP addresses and machine names of the pool members.
$ sudo nano /etc/hosts
Add the CentralMgt, SubmHost and Executor IP addresses and machine names as shown below:
127.0.0.1 localhost
172.31.92.114 CentralMgt
172.31.91.11 SubmHost
172.31.88.90 Executor

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
...
- Edit the inbound rules of the security group to allow all traffic to pass within the pool.
- In the sidebar, go to Network & Security and select Security Groups.
- Choose the security group that applies to the pool.
- Under Inbound rules, select Edit inbound rules, then Add rule.
- Choose All traffic for Type and Custom for Source, then select the security group in the box next to Source. Then select Save rules.
- Test HTCondor using this sleep.sh example.
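For orientation, a test job of this kind could look roughly like the sketch below, modelled on the sleep example in the HTCondor quick start guide. The scratch directory, the 6-second wait and the output file names are assumptions made for illustration; the linked example remains the reference. The script only writes the two files and calls condor_submit if HTCondor is installed on the machine.

```shell
#!/usr/bin/env bash
# Sketch: create a tiny HTCondor test job (file names are illustrative).
set -euo pipefail

workdir="$(mktemp -d)"   # scratch directory; any writable directory works
cd "$workdir"

# The job itself: print a message, then sleep for a few seconds.
cat > sleep.sh <<'EOF'
#!/bin/bash
TIMETOWAIT="6"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep "$TIMETOWAIT"
EOF
chmod +x sleep.sh

# The submit description file that tells HTCondor how to run the job.
cat > sleep.sub <<'EOF'
executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = yes
when_to_transfer_output = ON_EXIT
queue
EOF

# Submit only when HTCondor is available on this machine.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit sleep.sub
    condor_q
else
    echo "condor_submit not found; wrote sleep.sh and sleep.sub to $workdir"
fi
```

Run it on YOLO-SubmHost; once the job leaves the queue, the message printed by sleep.sh should appear in outfile.txt.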
The following setup procedure can be found here.
- Install the NFS Server on SubmHost, and then start it.
$ sudo apt install nfs-kernel-server
$ sudo systemctl start nfs-kernel-server.service
- Create a yolo folder at /home/ubuntu on both the SubmHost and Executor instances.
$ mkdir /home/ubuntu/yolo
- Add the yolo folder to the /etc/exports file.
$ sudo nano /etc/exports
Add /home/ubuntu/yolo *(rw,sync,no_subtree_check) to the file as shown below:
# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
...
#
/home/ubuntu/yolo *(rw,sync,no_subtree_check)
Then apply the new config via
$ sudo exportfs -a
- On the Executor instance, install the NFS Client, check its status, and start it if it is not active.
$ sudo apt install nfs-common
$ sudo systemctl status nfs-common.service
$ sudo systemctl start nfs-common.service
- Still on Executor, mount the exported directory from the NFS Server onto the local /home/ubuntu/yolo directory.
$ sudo mount SubmHost:/home/ubuntu/yolo /home/ubuntu/yolo
Test by creating a file on SubmHost and printing its content on the Executor side.
# At SubmHost
$ cd /home/ubuntu/yolo
$ echo "Hello" > testfile.txt
# At Executor
$ cd /home/ubuntu/yolo
$ cat testfile.txt
See Common Issues if you run into any issues during NFS setup.
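Besides the testfile check, it can help to confirm from the Executor that /home/ubuntu/yolo is really served over NFS rather than being a plain local folder. The sketch below reads the kernel mount table; is_nfs_mounted is a hypothetical helper written for this example, and the optional second argument exists only so the function can be tried against a sample file.

```shell
#!/usr/bin/env bash
# Sketch: check whether a directory is currently an NFS mount.
# is_nfs_mounted is a hypothetical helper, not part of NFS or HTCondor.

# Usage: is_nfs_mounted <dir> [mounts_file]
# Reads the mount table (default /proc/mounts), whose fields are:
#   device mountpoint fstype options dump pass
is_nfs_mounted() {
    local dir="$1" mounts="${2:-/proc/mounts}"
    awk -v d="$dir" '$2 == d && $3 ~ /^nfs/ { found = 1 } END { exit !found }' "$mounts"
}

if is_nfs_mounted /home/ubuntu/yolo; then
    echo "/home/ubuntu/yolo is mounted over NFS"
else
    echo "/home/ubuntu/yolo is NOT an NFS mount; re-run the mount command"
fi
```

If the second message appears after an instance restart, the mount has been lost and needs to be repeated.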
- Clone this repository to Executor because there are several Ubuntu-based packages needed to run YOLO model. Note that Git and Python 3.8 come pre-installed with Ubuntu 20.04 AMI.
$ git clone https://github.com/yuenherny/um-wqd7008-pdc-yolov5.git
- Install the required Ubuntu packages for OpenCV and venv.
$ sudo apt install python3-opencv
$ sudo apt install python3.8-venv
- In the cloned local repository, create and activate a Python virtual environment.
$ cd um-wqd7008-pdc-yolov5
$ python3 -m venv venv
$ source venv/bin/activate
Check that the environment is activated. You should see a list of pre-installed packages.
$ pip list
- Before installing the other dependencies from requirements.txt, install the torch and torchvision packages using the command from the official PyTorch docs, as downloads via the PyPI wheel can be slow on AWS.
- At the START LOCALLY section, choose:
- PyTorch Build: Stable
- Your OS: Linux
- Package: Pip
- Language: Python
- Compute Platform: CPU
- Then copy the command with torchaudio removed.
$ pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu
- Amend requirements.txt to comment out thop>=0.1.1, then save.
$ nano requirements.txt
Then, install the dependencies.
$ pip install -r requirements.txt
- Now that the required dependencies are installed, check that the model runs normally when invoked from the terminal.
$ python3 detect.py --weights yolov5s.pt --source data/images/zidane.jpg
This downloads the YOLOv5s weights and performs inference on the data/images/zidane.jpg input source. You should see something like the output below (this sample was captured with data/images/bus.jpg as the source):
detect: weights=['yolov5s.pt'], source=data/images/bus.jpg, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
Fusing layers...
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
image 1/1 /home/ubuntu/yolo/um-wqd7008-pdc-yolov5/data/images/bus.jpg: 640x480 4 persons, 1 bus, 309.3ms
Speed: 2.9ms pre-process, 309.3ms inference, 3.7ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp
This means inference was successful and the result was saved to the runs/detect/exp folder.
- At SubmHost, create a bash file (see yolo.sh).
$ nano yolo.sh
Then paste this into the bash file, and save:
#!/usr/bin/bash
# file name: yolo.sh

# check python version
echo "$(python3 --version)"

# change to directory
cd /home/ubuntu/yolo/um-wqd7008-pdc-yolov5
echo "Directory changed"

# activate python env
source venv/bin/activate
echo "Python env activated"

# run detect.py on an image
echo "Execution start"
python3 detect.py --weights yolov5s.pt --source data/images/bus.jpg
echo "Execution complete"

# deactivate python env
deactivate
echo "Python env deactivated"
Without the shebang (the first line in the bash file, #!/usr/bin/bash), you might get the following error:
...
007 (019.000.000) 2023-01-09 10:23:53 Shadow exception!
    Error from slot1@ip-172-31-88-90.ec2.internal: Failed to execute '/var/lib/condor/execute/dir_1257/condor_exec.exe': (errno=8: 'Exec format error')
    0 - Run Bytes Sent By Job
    298 - Run Bytes Received By Job
...
012 (019.000.000) 2023-01-09 10:23:53 Job was held.
    Error from slot1@ip-172-31-88-90.ec2.internal: Failed to execute '/var/lib/condor/execute/dir_1257/condor_exec.exe': (errno=8: 'Exec format error')
    Code 6 Subcode 8
...
This shebang-induced error is documented here and here.
- Still at SubmHost, create an HTCondor submit file (see yolo.sub).
$ nano yolo.sub
Then paste the following into it, and save:
# YOLO detection on an image
executable = yolo.sh
output = yolo.out
error = yolo.err
log = yolo.log
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
queue
- Still at SubmHost, submit yolo.sh as a job using the yolo.sub submit file.
$ condor_submit yolo.sub
If there were no errors, you should see your result in the Executor instance's um-wqd7008-pdc-yolov5/runs/detect/exp directory. However, you might get the following output in yolo.err:
Traceback (most recent call last):
File "detect.py", line 261, in <module>
main(opt)
File "detect.py", line 256, in main
run(**vars(opt))
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "detect.py", line 98, in run
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/models/common.py", line 345, in __init__
model = attempt_load(weights if isinstance(weights, list) else w, device=device, inplace=True, fuse=fuse)
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/models/experimental.py", line 79, in attempt_load
ckpt = torch.load(attempt_download(w), map_location='cpu') # load
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
PermissionError: [Errno 13] Permission denied: 'yolov5s.pt'
or:
Traceback (most recent call last):
File "detect.py", line 261, in <module>
main(opt)
File "detect.py", line 256, in main
run(**vars(opt))
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "detect.py", line 94, in run
(save_dir / 'labels' if save_txt else save_dir).mkdir(parents=True, exist_ok=True) # make dir
File "/usr/lib/python3.8/pathlib.py", line 1288, in mkdir
self._accessor.mkdir(self, mode)
PermissionError: [Errno 13] Permission denied: 'runs/detect/exp2'
The [Errno 13] error means that the program tried to create a new file or folder but was denied because it lacked write permission. We need to grant other users write access to both the project root and the runs/detect/ directory. At the parent directory of um-wqd7008-pdc-yolov5:
$ chmod 777 um-wqd7008-pdc-yolov5
$ chmod 777 um-wqd7008-pdc-yolov5/runs
chmod is not recursive by default, so change permissions for other files and directories (for example um-wqd7008-pdc-yolov5/runs/detect) if needed. Read more about chmod here.
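To see what these commands do and do not change, the sketch below repeats them on a throwaway directory and prints the resulting modes. It assumes GNU stat (as shipped with Ubuntu) and fixes the umask to 022 so the demo is reproducible; demo_dir is a stand-in for the project folder, not a path from this project.

```shell
#!/usr/bin/env bash
# Sketch: show the effect of "chmod 777" on a throwaway directory tree.

umask 022                              # fixed umask so new folders start as 755
demo_dir="$(mktemp -d)"                # stands in for um-wqd7008-pdc-yolov5
mkdir -p "$demo_dir/runs/detect"

# Same two commands as above, applied to the stand-in tree.
chmod 777 "$demo_dir"
chmod 777 "$demo_dir/runs"

runs_mode="$(stat -c '%a' "$demo_dir/runs")"
detect_mode="$(stat -c '%a' "$demo_dir/runs/detect")"

echo "runs:        $runs_mode"
echo "runs/detect: $detect_mode"       # unchanged, because chmod did not recurse
```

If runs/detect still blocks the job, chmod it explicitly as well.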
To support uploading to Google Drive, we will be using gdrive.
Before proceeding, please create Google OAuth Client credentials by following this guide.
Due to a limitation of this package which requires a web browser, you are required to perform this setup on a GUI-supported local machine. After the setup, we will then export the account and import it on any remote server.
- Download the latest release from GitHub (v3.1.0 as of Jan 10, 2023).
- Unzip the archive.
- Open a terminal where gdrive is located and add your Google account to gdrive.
$ ./gdrive account add
- This will prompt you for your Google Client ID and Client Secret.
- Next, you will be presented with a URL.
- Open the URL in your browser and give approval for gdrive to access your Google Drive.
- You will be redirected to http://localhost:8085 (gdrive starts a temporary web server), which completes the setup.
- gdrive is now ready to use!
- Test by uploading a file.
$ ./gdrive files upload <FILE_PATH>
- Export your gdrive account (e.g. student_id@siswa.um.edu.my).
$ ./gdrive account export <ACCOUNT_NAME>
- Copy the exported archive to the remote server(s) using the command below:
$ scp <Options> <PATH/ON/LOCAL> <SERVER_NAME>@<HOST>:<PATH/ON/SERVER>
Example:
$ scp -i "nicholasleezt-7008.pem" ~/Downloads/gdrive_export-s2132376_siswa_um_edu_my.tar ubuntu@ec2-18-234-241-246.compute-1.amazonaws.com:/home/ubuntu
- On the remote server, download the latest release from GitHub (v3.1.0 as of Jan 10, 2023).
$ wget https://github.com/glotlabs/gdrive/releases/download/3.1.0/gdrive_linux-x64.tar.gz
- Unzip the archive.
$ tar -xvf gdrive_linux-x64.tar.gz
- Put the gdrive binary on your PATH (e.g. /usr/local/bin).
$ sudo mv /home/ubuntu/gdrive /usr/local/bin
- Import the gdrive account.
$ gdrive account import <EXPORTED_ACCOUNT>
- Test by uploading a file.
$ gdrive files upload <PATH/TO/FILE>
- Alternatively, to upload to a specific folder in your Google Drive, run:
$ gdrive files list
to get the folder ID. You will see something like:
Id                                  Name     Type    Size  Created
1N6Hs1Tx1f0ubfKphI4cfGyZCGsjJCEz4   Folder1  folder        2022-12-31 09:24:37
1i7uFBCbqgw176TftRsws6z0BTr64zB1m   Folder2  folder        2022-11-07 06:43:56
1GNCqI5lG-6wd_XQnYz0Z1z3ReUTGUQUY   Folder3  folder        2022-07-20 12:48:50
then do:
$ gdrive files upload <PATH/TO/FILE> --parent <FOLDER_ID>
Note that as of 15 Jan 2023, the gdrive package can only upload one file at a time. Uploading folders or multiple files in a single call is not supported.
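Given the single-file limitation, one workaround is to loop over a results folder and call gdrive once per file. The sketch below uses only the files upload and --parent options shown above; upload_dir and the GDRIVE_BIN variable are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Sketch: upload every file in a results directory, one file per call.
# upload_dir and GDRIVE_BIN are hypothetical names used for this example.

# Usage: upload_dir <results_dir> <folder_id>
# GDRIVE_BIN can point at a different binary (handy for a dry run).
upload_dir() {
    local dir="$1" folder_id="$2" gdrive_bin="${GDRIVE_BIN:-gdrive}" file
    for file in "$dir"/*; do
        [ -f "$file" ] || continue     # skip sub-folders and unmatched globs
        "$gdrive_bin" files upload "$file" --parent "$folder_id" || return 1
    done
}

# Example (replace <FOLDER_ID> with an Id from "gdrive files list"):
#   upload_dir runs/detect/exp <FOLDER_ID>
# Dry run that only prints the arguments instead of uploading:
#   GDRIVE_BIN=echo upload_dir runs/detect/exp <FOLDER_ID>
```

The loop stops at the first failed upload, so a missing account or a network error does not go unnoticed.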
The DAGMan workflow yolo-gdrive.dag aims to:
- Perform YOLO inference on images in parallel - see yolo.sh and yolo.sub
- Upload the results to a Google Drive folder - see gdrive.sh and gdrive.sub
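As a rough illustration of how two such steps are chained, the sketch below writes a two-node DAGMan input file and submits it when HTCondor is present. The node names and the example file name are assumptions made for this sketch; the actual yolo-gdrive.dag in the repository may be laid out differently.

```shell
#!/usr/bin/env bash
# Sketch: write a two-node DAGMan input file and submit it.
# Node names (yolo, gdrive) and the file name are illustrative assumptions.

dag_file="${DAG_FILE:-yolo-gdrive.example.dag}"

cat > "$dag_file" <<'EOF'
# Node definitions: JOB <node name> <submit file>
JOB yolo   yolo.sub
JOB gdrive gdrive.sub

# Run the upload only after inference has finished successfully.
PARENT yolo CHILD gdrive
EOF

# Submit only when HTCondor is available on this machine.
if command -v condor_submit_dag >/dev/null 2>&1; then
    condor_submit_dag "$dag_file"
else
    echo "condor_submit_dag not found; wrote $dag_file"
fi
```

The PARENT/CHILD line is what enforces the ordering: DAGMan starts the gdrive node only once the yolo node has completed.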
Unfortunately, the shell script for gdrive execution was unable to complete due to permission issues:
Error: Failed to create directory '/nonexistent/.config/gdrive3': Permission denied (os error 13)
Error: No account has been selected
Use `gdrive account list` to show all accounts.
Use `gdrive account switch` to select an account.
...
Error: No account has been selected
Use `gdrive account list` to show all accounts.
Use `gdrive account switch` to select an account.
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
This could happen when you stop and then start the instance again after completing the NFS Server and Client setup (for example, you took a break and later resumed this project).
Solution: Start the NFS Client service on Executor.
- At Executor, start the NFS Client:
$ sudo systemctl start nfs-common.service
- At Executor, perform mounting:
$ sudo mount <NFS_SERVER_IP_ADDRESS_OR_MACHINE_NAME>:<DIR_ON_NFS_SERVER> <DIR_ON_NFS_CLIENT>
This could happen when your NFS Client service unit file was symlinked to /dev/null.
Solution: Remove the symlink, unmask and start the service.
- At Executor, navigate to /lib/systemd/system/ and check if the service unit file was symlinked to /dev/null.
$ file /lib/systemd/system/nfs-common.service
OR
$ file /etc/systemd/system/nfs-common.service
It should return:
/lib/systemd/system/nfs-common.service: symbolic link to /dev/null
- Delete the symlink.
$ sudo rm /lib/systemd/system/nfs-common.service
- Reload the systemd daemon:
$ sudo systemctl daemon-reload
- Unmask, start and check the service:
$ sudo systemctl unmask nfs-common.service
$ sudo systemctl start nfs-common.service
$ sudo systemctl status nfs-common.service
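The manual check with file can also be scripted, which is convenient when both candidate locations need to be inspected. The sketch below is a minimal version; is_masked_unit is a hypothetical helper written for this example.

```shell
#!/usr/bin/env bash
# Sketch: detect whether a systemd unit file is masked (symlinked to /dev/null).
# is_masked_unit is a hypothetical helper written for this example.

# Usage: is_masked_unit <path to unit file>
is_masked_unit() {
    [ -L "$1" ] && [ "$(readlink "$1")" = "/dev/null" ]
}

for unit in /lib/systemd/system/nfs-common.service /etc/systemd/system/nfs-common.service; do
    if is_masked_unit "$unit"; then
        echo "$unit is masked (symlink to /dev/null)"
    else
        echo "$unit is not a symlink to /dev/null"
    fi
done
```

A path reported as masked is the one to delete before reloading the systemd daemon.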