Low mAP score on pycocotools #71

Closed
ydixon opened this issue Jan 7, 2019 · 23 comments

Comments

@ydixon

ydixon commented Jan 7, 2019

@glenn-jocher Apologies for bringing up the same topic again. I've noticed there are a lot of threads about mAP and I have read them, but none of them has code to test. So I modified your detect/test code a little to run pycocotools.

I added load_images_v2 in datasets.py. Running eval_map.py compares the detections against the ground truth file coco_valid.json. The ground truth file has been tested against results generated from the darknet repo, with matching mAP. If you want to generate it yourself, you can go here.
map_coco

I am only running the model with official yolov3 weights. Any ideas on improving the score?

@glenn-jocher
Member

glenn-jocher commented Jan 7, 2019

Hmm, I'm not familiar with pycocotools. Is that the official COCO mAP code? Inference is about identical between this repo and darknet (training differences abound though...), so mAP on the official weights should also be the same, though test.py computes mAP slightly differently than the official COCO code.

I noticed your local version is a bit out of date with the current repo. The current test.py conf_thres is 0.30, which shows improved results compared to the 0.50 you are using. BTW, 0.20 also works better. I'm not sure exactly where the sweet spot is; you could tune this if you have time.

yolov3/test.py

Line 136 in 2dd2564

parser.add_argument('--conf-thres', type=float, default=0.3, help='object confidence threshold')

@ydixon
Author

ydixon commented Jan 7, 2019

Yeah, I'm running https://github.com/cocodataset/cocoapi as is, so I didn't touch anything there. I've already tested conf_thres at 0.001, 0.005, 0.05, 0.4 and 0.5. I'm going to try 0.30 or 0.20 later as you suggested, but I doubt it will make a huge impact on the score.

@glenn-jocher
Member

Ah it sounds like you tried several values. I think < 0.10 is too low, and > 0.30 is too high. You should get a pretty big improvement going from 0.5 to 0.3, perhaps 10% better mAP (i.e. from 0.40 to 0.50 mAP).

@ydixon
Author

ydixon commented Jan 7, 2019

I modified eval_map.py and datasets.py to match the more recent style of the repo.

Here are the results. The reason I tried thresholds below 0.10 is that the precision-recall curve should be built from all probability thresholds, starting from a score of 0.

Thresh: 0.001  mAP@0.5: 0.338
Thresh: 0.005  mAP@0.5: 0.376
Thresh: 0.05   mAP@0.5: 0.425
Thresh: 0.3    mAP@0.5: 0.423
Thresh: 0.4    mAP@0.5: 0.411
Thresh: 0.5    mAP@0.5: 0.398
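
To illustrate why the confidence threshold interacts with AP at all, here is a minimal sketch of average precision for a single class on toy data. It uses all-point interpolation, which is a simplification (pycocotools samples precision at 101 recall points), and the function names and the toy scores are assumptions made up for this example, not code from the repo. Cutting detections at a threshold truncates the precision-recall curve, so the recall reached by the discarded low-score detections contributes nothing to the area.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by descending score
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / n_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Sentinel points at both ends of the curve.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    # Make precision monotonically non-increasing from left to right.
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum the rectangle areas wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

def ap_with_threshold(scores, is_tp, n_gt, conf_thres):
    """AP computed only from detections scoring at least conf_thres."""
    scores = np.asarray(scores, dtype=float)
    is_tp = np.asarray(is_tp, dtype=bool)
    keep = scores >= conf_thres
    if not keep.any():
        return 0.0
    return average_precision(scores[keep], is_tp[keep], n_gt)

# Toy detections for one class (hypothetical values), 4 ground truth objects.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
is_tp = [True, True, False, True, False, True]
n_gt = 4

ap_low = ap_with_threshold(scores, is_tp, n_gt, conf_thres=0.01)
ap_mid = ap_with_threshold(scores, is_tp, n_gt, conf_thres=0.30)
ap_high = ap_with_threshold(scores, is_tp, n_gt, conf_thres=0.50)
print(ap_low, ap_mid, ap_high)
```

On this toy data the AP drops as the threshold rises, because the true positives with low scores are removed before the curve is integrated.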

Console log:

(fastai) root@4c990753b224:/deep_learning/ultralytics-yolov3# python eval_map.py --weights weights/yolov3.weights --conf-thres 0.001
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_config='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpus=0, nms_thres=0.45, weights='weights/yolov3.weights')

Using device: "cuda:0"
Compute mAP...
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
Loading and preparing results...
DONE (t=3.10s)
creating index...
index created!
Images: 5000
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=87.17s).
Accumulating evaluation results...
DONE (t=8.16s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.187
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.338
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.185
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.044
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.165
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.229
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.368
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.418
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.182
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.422
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.525
(fastai) root@4c990753b224:/deep_learning/ultralytics-yolov3# python eval_map.py --weights weights/yolov3.weights --conf-thres 0.005
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.005, data_config='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpus=0, nms_thres=0.45, weights='weights/yolov3.weights')

Using device: "cuda:0"
Compute mAP...
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
Loading and preparing results...
DONE (t=1.45s)
creating index...
index created!
Images: 5000
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=55.59s).
Accumulating evaluation results...
DONE (t=5.09s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.206
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.376
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.202
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.051
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.184
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.356
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.239
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.371
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.407
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.172
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.406
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.516
(fastai) root@4c990753b224:/deep_learning/ultralytics-yolov3# python eval_map.py --weights weights/yolov3.weights --conf-thres 0.05
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.05, data_config='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpus=0, nms_thres=0.45, weights='weights/yolov3.weights')

Using device: "cuda:0"
Compute mAP...
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.42s)
creating index...
index created!
Images: 5000
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=26.84s).
Accumulating evaluation results...
DONE (t=2.96s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.232
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.425
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.227
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.058
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.214
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.374
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.243
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.355
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.371
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.133
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.361
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.487
(fastai) root@4c990753b224:/deep_learning/ultralytics-yolov3# python eval_map.py --weights weights/yolov3.weights --conf-thres 0.3
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.3, data_config='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpus=0, nms_thres=0.45, weights='weights/yolov3.weights')

Using device: "cuda:0"
Compute mAP...
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.16s)
creating index...
index created!
Images: 5000
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=16.91s).
Accumulating evaluation results...
DONE (t=2.35s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.238
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.423
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.241
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.058
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.217
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.363
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.230
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.312
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.315
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.089
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.292
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.438
(fastai) root@4c990753b224:/deep_learning/ultralytics-yolov3# python eval_map.py --weights weights/yolov3.weights --conf-thres 0.4
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.4, data_config='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpus=0, nms_thres=0.45, weights='weights/yolov3.weights')

Using device: "cuda:0"
Compute mAP...
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.15s)
creating index...
index created!
Images: 5000
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=15.75s).
Accumulating evaluation results...
DONE (t=2.28s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.411
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.241
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.056
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.212
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.225
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.299
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.302
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.079
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.276
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.426
(fastai) root@4c990753b224:/deep_learning/ultralytics-yolov3# python eval_map.py --weights weights/yolov3.weights --conf-thres 0.5
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.5, data_config='cfg/coco.data', img_size=416, iou_thres=0.5, n_cpus=0, nms_thres=0.45, weights='weights/yolov3.weights')

Using device: "cuda:0"
Compute mAP...
loading annotations into memory...
Done (t=0.15s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.13s)
creating index...
index created!
Images: 5000
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=14.42s).
Accumulating evaluation results...
DONE (t=2.13s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.231
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.398
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.239
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.051
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.206
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.352
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.220
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.288
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.289
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.068
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.261
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
(fastai) root@4c990753b224:/deep_learning/ultralytics-yolov3# 

@glenn-jocher
Member

Hmmm, well then I don't understand the discrepancy. The last official COCO SDK results from test.py were by @nirbenz in #2 (comment), showing 0.543 mAP@0.5 at 416 pixels. Nothing significant should have changed for inference in the repo since then. I'm not sure what to say, other than to try to ask @nirbenz for a PR for his SDK export code.

Recent results from detect.py also look the exact same as darknet's, i.e. #16 (comment)

@ydixon
Author

ydixon commented Jan 8, 2019

Understood. I will continue testing and see if I did something wrong with eval_map. I'll let you know if I find something.

@AndOneDay

> Understood. I will continue testing and see if I did something wrong with eval_map. I'll let you know if I find something.

Hi, I see your repo reaches 0.547 mAP with pycocotools. Can you tell me how you solved this? I ran into the same issue.

@glenn-jocher
Member

It looks like it would be beneficial for test.py to output a JSON file in the format that https://github.com/cocodataset/cocoapi wants, so we could generate mAP directly from cocoapi. I think the relevant JSON format is here. Do any of you have code ready-made for a PR that already does this?
https://github.com/cocodataset/cocoapi/blob/master/results/instances_val2014_fakebbox100_results.json
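
As a rough illustration of the results format mentioned above, the sketch below builds one detection record and round-trips it through JSON. The file is a flat list of records, one per detection, with the keys image_id, category_id, bbox and score. The helper name make_result and the rounding to three decimals are assumptions for this example; the sample values are the ones shown in the test.py comment later in this thread.

```python
import json

REQUIRED_KEYS = {"image_id", "category_id", "bbox", "score"}

def make_result(image_id, category_id, bbox_xywh, score):
    """Build one detection record (hypothetical helper, not part of cocoapi)."""
    return {
        "image_id": int(image_id),
        "category_id": int(category_id),
        # bbox is [x, y, width, height] with (x, y) the top-left corner
        "bbox": [round(float(v), 3) for v in bbox_xywh],
        "score": round(float(score), 3),
    }

# A results file is simply a JSON list of such records.
results = [make_result(42, 18, [258.15, 41.29, 348.26, 243.78], 0.236)]
payload = json.dumps(results)
decoded = json.loads(payload)
print(decoded[0])
```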

@ydixon
Author

ydixon commented Feb 26, 2019

> Understood. I will continue testing and see if I did something wrong with eval_map. I'll let you know if I find something.
>
> hi,i see your repo map is 0.547 use pycocotools, can you tell me how to solve this? I met the same issue.

The NMS scheme in the original darknet repo is different from the one in this repo. I suggest you take a look at the NMS code and see where they differ. A while back I tried making those changes in this repo and was able to push the mAP to 0.49, but then I got dragged into working on something else.

@ydixon
Author

ydixon commented Feb 26, 2019

> It looks like it would be beneficial for test.py to output a JSON file in the format that https://github.com/cocodataset/cocoapi wants, so we could generate mAP directly from cocoapi. I think the relevant JSON format is here. Do any of you have code ready-made for a PR that already does this?
> https://github.com/cocodataset/cocoapi/blob/master/results/instances_val2014_fakebbox100_results.json

I could make a simple PR; the code is pretty straightforward, as shown in eval_map.py above. However, you will still need to generate the ground truth JSON for the 5k dataset, as well as for any other custom dataset, if you want to use the COCO API properly. I don't know how you would want that included in the code.

@okanlv

okanlv commented Feb 26, 2019

@ydixon Actually, you could write the image IDs (imgIds) from "5k.txt" into a list and use that to filter the ground truth labels in the default cocoeval code. I could upload it if anyone needs it.
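
A minimal sketch of that idea, assuming "5k.txt" holds one image path per line and that the numeric image ID is the last underscore-separated part of the file name (the same parsing that test.py uses later in this thread). The function name and the sample paths are hypothetical.

```python
from pathlib import Path

def image_ids_from_list(lines):
    """Extract integer COCO image IDs from a list of image paths."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # e.g. COCO_val2014_000000000164.jpg -> 164
        ids.append(int(Path(line).stem.split('_')[-1]))
    return ids

# Hypothetical sample of what the lines of 5k.txt might look like.
sample_lines = [
    "/data/coco/images/val2014/COCO_val2014_000000000164.jpg\n",
    "/data/coco/images/val2014/COCO_val2014_000000000192.jpg\n",
    "\n",
]
img_ids = image_ids_from_list(sample_lines)
print(img_ids)

# With pycocotools installed, the list can then restrict the evaluation:
#   cocoEval.params.imgIds = img_ids
```

Setting params.imgIds this way avoids building a separate ground truth file for the subset.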

@glenn-jocher
Member

@ydixon @okanlv @AndOneDay, I updated test.py with a --save-json argument, which outputs a COCO json and evaluates it using pycocotools. There are a few adjustments to the data going into the json:

  1. The COCO json boxes are xywh, but xy is top left corner, not centered. Image origin is top left.
  2. COCO json uses the 91 original COCO paper classes, so I created a function to translate between the COCO2014/17 classes and the COCO paper classes coco80_to_coco91_class().
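
The two adjustments can be sketched in isolation as follows. This is an illustrative NumPy version, not the repo's code: the function name is hypothetical, and the set of category IDs skipped by the 80-class label set is written from memory of the COCO labels, so it should be checked against coco80_to_coco91_class() before being relied on.

```python
import numpy as np

def center_xywh_to_topleft_xywh(boxes):
    """Convert [x_center, y_center, w, h] rows to [x_min, y_min, w, h]."""
    boxes = np.asarray(boxes, dtype=float).copy()
    boxes[:, :2] -= boxes[:, 2:4] / 2  # shift the center to the top-left corner
    return boxes

# Category IDs from the 91-class paper list that have no 80-class counterpart
# (assumption: listed from memory, verify against the repo's mapping).
MISSING_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}
COCO80_TO_COCO91 = [i for i in range(1, 91) if i not in MISSING_IDS]

# A box centered at (100, 50) with width 40 and height 20.
converted = center_xywh_to_topleft_xywh([[100.0, 50.0, 40.0, 20.0]])
print(converted.tolist())    # top-left corner is (80, 40)
print(COCO80_TO_COCO91[11])  # 80-class index 11 maps to paper ID 13
```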

Code to compile the json dict:

yolov3/test.py

Lines 67 to 81 in eb6a4b5

if save_json:
    # [{"image_id": 42, "category_id": 18, "bbox": [258.15, 41.29, 348.26, 243.78], "score": 0.236}, ...
    box = torch.from_numpy(detections[:, :4]).clone()  # xyxy
    scale_coords(img_size, box, shapes[si])  # to original shape
    box = xyxy2xywh(box)  # xywh
    box[:, :2] -= box[:, 2:] / 2  # xy center to top-left corner
    # add to json dictionary
    for di, d in enumerate(detections):
        jdict.append({
            'image_id': int(Path(paths[si]).stem.split('_')[-1]),
            'category_id': coco91class[int(d[6])],
            'bbox': [float3(x) for x in box[di]],
            'score': float3(d[4] * d[5])
        })

Code to evaluate the json with pycocotools:

yolov3/test.py

Lines 141 to 157 in eb6a4b5

if save_json:
    imgIds = [int(Path(x).stem.split('_')[-1]) for x in dataloader.img_files]
    with open('results.json', 'w') as file:
        json.dump(jdict, file)
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval
    # https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb
    cocoGt = COCO('../coco/annotations/instances_val2014.json')  # initialize COCO ground truth api
    cocoDt = cocoGt.loadRes('results.json')  # initialize COCO detections api
    cocoEval = COCOeval(cocoGt, cocoDt, 'bbox')
    cocoEval.params.imgIds = imgIds  # [:32]  # only evaluate these images
    cocoEval.evaluate()
    cocoEval.accumulate()
    cocoEval.summarize()

Output mAP is low using yolov3.weights though, so it may not be constructing the json correctly, or the test.py hyperparameters may not be properly aligned with darknet.

sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
sudo rm -rf cocoapi && git clone https://github.com/cocodataset/cocoapi && cd cocoapi/PythonAPI && make && cd ../.. && cp -r cocoapi/PythonAPI/pycocotools yolov3
cd yolov3 && python3 test.py --save-json
...
       5000       5000      0.633      0.598      0.589
      Image      Total          P          R        mAP
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.271
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.460
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.285
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.106
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.295
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.415
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.236
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.317
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.320
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.123
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.492

@ydixon
Author

ydixon commented Feb 26, 2019

In the nms function, comment out the first line below and replace it with the second:

# v = ((pred[:, 4] > conf_thres) & (class_prob > .4))  # TODO examine arbitrary 0.4 thres here
v = (pred[:, 4] * class_prob > conf_thres)

Run the test with a conf_thres of 0.005. I think you might want to rename it to something else.

DONE (t=4.46s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.313
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.143
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.339
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.447
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.266
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.396
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.415
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.223
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.452
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.570
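
The effect of the one-line change above can be seen on a few toy candidates. This is a standalone NumPy sketch, not the repo's nms function; the function names and the example values are assumptions. The old filter thresholds objectness and class probability separately (with the fixed 0.4 class cutoff), while the new one thresholds their product, which is the same quantity that is written out as the score in the JSON.

```python
import numpy as np

def old_mask(obj_conf, class_prob, conf_thres, class_thres=0.4):
    """Original filter: two independent thresholds."""
    return (obj_conf > conf_thres) & (class_prob > class_thres)

def new_mask(obj_conf, class_prob, conf_thres):
    """Suggested filter: a single threshold on the combined score."""
    return obj_conf * class_prob > conf_thres

# Hypothetical candidates: objectness and best class probability.
obj_conf = np.array([0.90, 0.60, 0.02, 0.30])
class_prob = np.array([0.95, 0.30, 0.90, 0.10])

old_keep = old_mask(obj_conf, class_prob, conf_thres=0.005)
new_keep = new_mask(obj_conf, class_prob, conf_thres=0.005)
print(old_keep.tolist(), new_keep.tolist())
```

With conf_thres at 0.005, the old filter still drops every candidate whose class probability is below 0.4, so those detections never reach the precision-recall curve. The new filter keeps them with a low combined score.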

Please also let me know how your model performs under the COCO API. There are some interesting design choices in this repo that differ from the original darknet, and I would really like to know how well they do.

@ydixon
Author

ydixon commented Feb 26, 2019

> @ydixon Actually, you could write the imgsIDs from "5k.txt" into a list and use that to filter ground truth labels in default cocoeval code. I could upload it if anyone needs.

Oh, I thought cocoDt.imgIds would automatically select the overlapping set. When it didn't work as expected, I ended up creating the ground truth annotations myself. :D

@glenn-jocher
Member

glenn-jocher commented Feb 27, 2019

@ydixon @okanlv @AndOneDay it worked, pycocotools mAP is 0.550 (416) and 0.579 (608) with yolov3.weights in the latest commit!! I simply applied the changes @ydixon recommended. Unfortunately performance swapped between pycocotools mAP and our own in-house mAP, which shows about 0.40 mAP now, will investigate, and also run on our scratch-trained model.

sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
sudo rm -rf cocoapi && git clone https://github.com/cocodataset/cocoapi && cd cocoapi/PythonAPI && make && cd ../.. && cp -r cocoapi/PythonAPI/pycocotools yolov3
cd yolov3
...
python3 test.py --save-json --conf-thres 0.005
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.550
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.313
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.143
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.339
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.448
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.266
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.398
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.417
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.226
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.456
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.572
...
python3 test.py --save-json --conf-thres 0.005 --img-size 608 --batch-size 16
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.328
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.579
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.341
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.196
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.359
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.425
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.279
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.423
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.444
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.293
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.472
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.557

@ydixon
Author

ydixon commented Feb 27, 2019

@glenn-jocher Thanks! Now I'm more incentivized to try the unique anchor loss layer approach. (GPU resource is expensive!)

@glenn-jocher
Member

@ydixon yeah of course! I'm left disenfranchised by the mAP metric now. The lower I set conf_thres the better the test.py results. If I set conf_thres = 0.001 then pycocotools mAP rises to 0.554 at 416. But real world results require higher thresholds, around 0.5. So mAP appears to be a bad metric for real-world usability. Anyway, yes it is great to finally be able to output official pycocotools results directly!

python3 detect.py --conf_thres 0.005
[images: zidane, dog]
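
One way to reconcile the two needs is to evaluate mAP at a very low threshold and pick the deployment threshold separately from the same ranked detections. The sketch below picks the score that maximizes F1 on toy data. The F1 criterion, the function name and the numbers are all assumptions for illustration; nothing in this thread says the repo does this.

```python
import numpy as np

def best_f1_threshold(scores, is_tp, n_gt):
    """Return (threshold, precision, recall, f1) at the best-F1 operating point."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # rank by descending score
    scores = scores[order]
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / n_gt
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    best = int(np.argmax(f1))
    # Keeping every detection with score >= scores[best] gives this point.
    return (float(scores[best]), float(precision[best]),
            float(recall[best]), float(f1[best]))

# Toy detections for one class (hypothetical), 5 ground truth objects.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20, 0.10, 0.05]
is_tp = [True, True, True, False, True, False, False, False]

thres, p, r, f1 = best_f1_threshold(scores, is_tp, n_gt=5)
print(thres, p, r, f1)
```

Here the detections below 0.40 are all false positives, so they lower precision at a fixed operating point even though they are harmless to the ranked AP computation.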

@simaiden

simaiden commented Mar 30, 2020

> @ydixon yeah of course! I'm left disenfranchised by the mAP metric now. The lower I set conf_thres the better the test.py results. If I set conf_thres = 0.001 then pycocotools mAP rises to 0.554 at 416. But real world results require higher thresholds, around 0.5. So mAP appears to be a bad metric for real-world usability. Anyway, yes it is great to finally be able to output official pycocotools results directly!
>
> python3 detect.py --conf_thres 0.005

Have you found the reason for this behavior? The intuition is that the lower the confidence threshold, the more false positives there are, so precision should be lower. So it's a little confusing.

@glenn-jocher
Member

@simaiden the original problems referenced in this issue have been corrected. mAP is correctly reported now, along with P and R.

@simaiden

> @simaiden the original problems referenced in this issue have been corrected. mAP is correctly reported now, along with P and R.

But what about the lower mAP at a high confidence threshold? This happens when I use the COCO API. Do you mean that it doesn't happen with this repo?

Thanks!

@glenn-jocher
Member

@simaiden I don't understand. mAP should be computed at 0.001 or 0.0001 confidence threshold. Everything is working correctly in this repo in regards to mAP computation.

@simaiden

> @simaiden I don't understand. mAP should be computed at 0.001 or 0.0001 confidence threshold. Everything is working correctly in this repo in regards to mAP computation.

Thanks for your reply.

OK, that is the way to calculate mAP, but do you have any idea why? And why does mAP decrease when I increase the confidence threshold? I'm not asking about this repo in particular but in general; sorry if my question is not about the repo itself.

@glenn-jocher
Member

@simaiden search online, we can't help you with this.
