[SSD] Small object detection #3196

Open
Tsuihao opened this issue Jan 18, 2018 · 96 comments
Labels
models:research models that come under research directory stat:community support

Comments

@Tsuihao

Tsuihao commented Jan 18, 2018

Hi all,

I have a question regarding the configuration of SSD.
An interesting task for me is to fine-tune SSD_mobilenet_v1_coco_2017_11_17 on the Bosch Small Traffic Lights dataset.

However, the default setting is to resize the image into 300 x 300 (image_resizer).
Here is the total loss during training.
The loss stays around 6. (Please ignore the overlap at 5000 steps; it is due to relaunching the training process.)
image

I think the trend of the total loss is okay.
However, when I stopped at around 12k steps and ran the model on the test dataset (around 90 images for a quick try), nothing was detected.

image

Personally, I have some doubts about this issue:

  1. Maybe the small traffic lights are too small for SSD?
  2. If so, why does the total loss curve still show a correct "learning" process?

Can I simply change the image size in the config to 512 x 512 or an even larger value (1000 x 1000)?
Will this work correctly as well?

Regards,
Hao

@oneTimePad

oneTimePad commented Jan 19, 2018

Did you try taking 300x300 crops from the images?

You could try training it on smaller images and feed in overlapping crops of size 300x300 that tile the original image, which could be bigger. I was able to train it on 1000x600 images, and it worked on my test set which was also 1000x600. This might be slightly hard since your original set is not 300x300, but if instead you could form a dataset out of random crops of size 300x300 from your original set then maybe...

The images I am actually working with are around 12MP, and I am feeding in crops of size 1000x600. However, with 1000x600, SSD is struggling to learn the classes, but the localization error is very low.
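
The tiling idea above could be sketched roughly as follows. This is a minimal sketch, not the commenter's actual code: the function name `tile_windows` and the `overlap` value are assumptions made for illustration, and the function only computes crop windows without loading any image.

```python
def tile_windows(img_w, img_h, tile=300, overlap=50):
    """Return (x_min, y_min, x_max, y_max) windows that cover the image.

    Windows are `tile` pixels wide and high and overlap their neighbours by
    at least `overlap` pixels. The last window in each row and column is
    shifted back so that it ends exactly at the image border.
    """
    if tile > img_w or tile > img_h:
        raise ValueError("tile must fit inside the image")
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap

    def starts(length):
        # Regular grid of start offsets, plus one final offset flush with the border.
        offsets = list(range(0, length - tile + 1, stride))
        if offsets[-1] != length - tile:
            offsets.append(length - tile)
        return offsets

    return [(x, y, x + tile, y + tile)
            for y in starts(img_h)
            for x in starts(img_w)]


# Hypothetical usage on a 1280 x 720 frame (the size discussed in this thread).
windows = tile_windows(1280, 720)
print(len(windows), windows[0], windows[-1])
```

The overlap is there so that an object cut by one tile border is likely to appear whole in a neighbouring tile; how much overlap is needed depends on the largest object size.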

@Tsuihao
Author

Tsuihao commented Jan 20, 2018

Hi @oneTimePad,

Thanks for your reply.

I have thought about this approach too.
However, in this case, I need to take care of the annotations too, right?

Did you first annotate all the images and then convert the annotations to the corresponding cropped images (with some Python script, I assume)?

Or did you crop them first and then annotate the 300 x 300 images manually?

@Luonic

Luonic commented Jan 20, 2018

@Tsuihao you crop the already annotated images. SSD has issues detecting small objects, but Faster-RCNN is much better at this.

@Tsuihao
Author

Tsuihao commented Jan 20, 2018

Hi @Luonic,

Yes, I had successfully trained faster rcnn and obtained an accurate result.
As shown:
image

However, it is too slow for my use case.
That is why I want to try the fastest SSD mobilenet model :)

I have some concerns regarding the annotated information.
When you crop the annotated images, how do you "update" the information in the original annotation?
Let's say:
Original image 1280 x 720 and the annotated traffic light is :
boxes: {label: Green, occluded: false, x_max: 752.25, x_min: 749.0, y_max: 355.125, y_min: 345.125}

When you crop it to 300 x 300, the annotation's image coordinate system needs to be updated.
Did you re-annotate manually, or is there some image cropping tool that can do this?

Regards,
Hao

@oneTimePad

oneTimePad commented Jan 20, 2018

Ah, yes. I completely forgot about the annotation. In my case I have a program that generates all of my training data, so I can easily change the training image size (which then changes the annotations). However, yes, you could write a program that converts the bounding box coordinates as you described. That said, I am still struggling to get the classification accuracy up.
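
A program of that kind could look roughly like this. It is only a sketch under stated assumptions: the function name `remap_box_to_crop`, the `min_visible` threshold, and the crop window are hypothetical; only the box values come from the annotation quoted earlier in the thread.

```python
def remap_box_to_crop(box, crop, min_visible=0.5):
    """Translate a pixel-space box into the coordinate system of a crop.

    box, crop: (x_min, y_min, x_max, y_max) in original-image pixels.
    Returns the clipped box in crop pixels, or None when less than
    `min_visible` of the box area falls inside the crop.
    """
    bx0, by0, bx1, by1 = box
    cx0, cy0, cx1, cy1 = crop
    # Intersection of the box with the crop window.
    ix0, iy0 = max(bx0, cx0), max(by0, cy0)
    ix1, iy1 = min(bx1, cx1), min(by1, cy1)
    if ix1 <= ix0 or iy1 <= iy0:
        return None
    box_area = (bx1 - bx0) * (by1 - by0)
    visible = (ix1 - ix0) * (iy1 - iy0) / box_area
    if visible < min_visible:
        return None
    # Shift so that the crop's top-left corner becomes the origin.
    return (ix0 - cx0, iy0 - cy0, ix1 - cx0, iy1 - cy0)


# Traffic light box quoted earlier in the thread (1280 x 720 image).
box = (749.0, 345.125, 752.25, 355.125)
crop = (600, 200, 900, 500)  # hypothetical 300 x 300 window
print(remap_box_to_crop(box, crop))  # (149.0, 145.125, 152.25, 155.125)
```

Boxes that are mostly outside the crop are dropped rather than clipped to a sliver, which is a design choice; the right threshold depends on the dataset.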

An idea I had was to first train the mobilenet base network, fine-tuning from the checkpoint trained on the COCO dataset or from a classification checkpoint, to just classify small crops of the objects of interest. In your case, that means crops of traffic lights, classified by color. Then go back to SSD and fine-tune the model from these classification weights. I haven't tried this yet, but it might help, mostly with the classification accuracy.

You mentioned mobilenet(s); have you tried a different base network?

@Tsuihao
Author

Tsuihao commented Jan 20, 2018

Hi @oneTimePad,

Thanks for the reply.
So one thing I could do is crop the traffic light images and then re-annotate all of them.
I was trying to avoid this, since cropping and re-annotating manually will take a few days, I assume :p.

In my case, I also used the SSD mobilenet pre-trained on the COCO dataset and fine-tuned it with the traffic light dataset.

There are two assumptions I made (please correct me if I am wrong):

  1. During the image_resize to 300 x 300, TensorFlow also resizes the annotations in the "tf.record" data. In my case it does not work simply because, when the original 1280 x 720 images are resized to 300 x 300, the small traffic lights nearly vanish. I suspect that is why I could not get a correct result.

  2. I assume that the released TensorFlow SSD mobilenet uses the SSD300 architecture, not SSD500. This is why I tried changing the image_resizer to a larger value (512 x 512); however, it still did not work.
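
To put a number on the first assumption, here is a small worked calculation using the traffic light box quoted earlier in the thread. The helper name `resized_box_size` is hypothetical; it only applies the per-axis scale factors of a fixed-shape resize.

```python
def resized_box_size(box, src_size, dst_size):
    """Width and height of a pixel-space box after a fixed-shape resize."""
    x_min, y_min, x_max, y_max = box
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return ((x_max - x_min) * dst_w / src_w,
            (y_max - y_min) * dst_h / src_h)


box = (749.0, 345.125, 752.25, 355.125)  # 3.25 x 10 px in the 1280 x 720 image
for side in (300, 512):
    w, h = resized_box_size(box, (1280, 720), (side, side))
    print(f"{side}x{side}: {w:.2f} x {h:.2f} px")
# 300x300: 0.76 x 4.17 px
# 512x512: 1.30 x 7.11 px
```

At 300 x 300 the light is less than one pixel wide, which is consistent with the suspicion that it nearly vanishes.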

Maybe the last resort really is what you said: crop and re-annotate everything. That will be a lot of overhead.

@izzrak

izzrak commented Jan 22, 2018

If you want to train an SSD512 model, you need to start from scratch. The pre-trained model can only be fine-tuned as SSD300 model.

@augre

augre commented Feb 7, 2018

Hi @Tsuihao Did you successfully train the SSD model on small objects? If so how did you get around it?

My original images are 512x512. I am thinking about cropping them to 300x300 around the areas of interest and creating the TFRecords file from the cropped ones. Would this be OK?

@Tsuihao
Author

Tsuihao commented Feb 9, 2018

Hi @augre,

I have not tried it yet.
I am also thinking about the same approach you described and will try it as soon as I have time.

I am not sure how well training on cropped images will perform.
Maybe you can share your experience later :)

@chanyoungjung

Hi @Tsuihao

Could you share your trained model(faster-rcnn)?

And what framework did you use for training, caffe or tensorflow?

Thanks

@jhagege

jhagege commented Mar 26, 2018

@Tsuihao Any progress on this method ?
I'm having the same issue, do you have any interesting findings that you remember you could share ?
Thanks !

@sapjunior

sapjunior commented Apr 1, 2018

Try this paper
S3FD: Single Shot Scale-invariant Face Detector
https://arxiv.org/abs/1708.05237 They modified SSD's OHEM and IoU criterion to be more sensitive to small objects like faces

@paolomanchisi

Hi, I'm interested in training ssd500 mobilenet from scratch, can someone give me some hints?
Thank you.

@elifbykl

Hi @Tsuihao

I have a problem with ssd_mobilenet_v2_coco. My images are 600x600, but the config file resizes them to 300x300. Is it possible to work at 600x600 in this case? Do my training images have to be 300x300? How did you solve the small object problem?

@abhishekvahadane

abhishekvahadane commented Apr 18, 2018

@sapjunior : Have you used the implementation on some application other than faces?

@Tsuihao
Author

Tsuihao commented Apr 18, 2018

@jungchan1 sorry I could not provide my trained work. I was using TensorFlow

@cyberjoac Nope, I did not go further on this topic; however, I am still looking forward to seeing whether anyone in this community can share their experience :)

@elifbykl Resizing 600x600 to 300x300 sounds acceptable to me; however, it also depends on the relative size of the objects you are working with. Based on the above discussion, your training images will be resized to 300 x 300 because of the fixed SSD architecture provided by TensorFlow. I have still not solved small object detection with SSD.

@eumicro

eumicro commented Apr 23, 2018

I trained a model capable of recognizing 78 German traffic signs. I used TensorFlow's Object Detection API for the training. The model can recognize the signs at a distance of about 15 meters. Here you can download the model and try it out.

Model: http://eugen-lange.de/download/ssd-4-traffic-sign-detection-frozen_inpherence_graph-pb/

@julianklumpers

julianklumpers commented Apr 28, 2018

@Tsuihao I had a similar problem and needed to slice the images into smaller tiles/crops. However, I had already labelled my dataset and was not sure what tile size was suitable for training.
So I wrote a Python script that slices an image into tiles of a given size and recalculates the annotations for you, in a separate .xml file for each tile it creates.

Here is the code. It's far from perfect, but I needed a quick solution.
https://github.com/julianklumpers/slice_image_with_annotations/blob/master/slice_image_with_annotations.py

It uses OpenCV rather than PIL because I tested both and OpenCV was much quicker at slicing and saving the images.
Each tile is named after its coordinates in the original image, so I can stitch the image back together.
Feel free to adjust it to your needs. I will probably make a library some day.

The size argument is the number of rows and columns. So if you have an image that is 1000x1000 and you need 500x500 tiles, you just pass size=(2,2), since 1000 / 2 = 500.

@willSapgreen

Hello @Tsuihao,

have you tried the stock SSD_mobilenet_v1_coco_2017_11_17 without training and looked at the result visually?

In my case, the stock SSD_inception_v2_coco_2017_11_17 performs better on car detection than my model trained on KITTI.

I am still working on this and hopefully can get back to you ASAP.

Best,

@Tsuihao
Author

Tsuihao commented May 7, 2018

Hi @willSapgreen,

Yes, I have tried the stock SSD_mobilenet_v1_coco_2017_11_17 for traffic light detection, and the result is better than my SSD trained on the traffic light dataset.

However, this result is to be expected, since SSD_mobilenet_v1_coco_2017_11_17 was trained on the COCO dataset. In my case, I need more detail about the detected traffic lights, e.g. red, green, yellow, red left, etc.

In your case, you want to detect cars. I believe a car in the image is much bigger than a traffic light; therefore, you should not have the same issue as me (the traffic light being too small).
I suggest that you:

  1. Check your TensorBoard report (to see whether the training result is good or bad)
  2. Switch to a different model, e.g. faster_rcnn (to see whether your data/labels are valid)

Regards,

@lozuwa

lozuwa commented May 11, 2018

Hey, I read that you struggled with resizing/cropping and then labeling again. I had the same problem so I made some scripts that I am trying to turn into a library. Why don't you check them https://github.com/lozuwa/impy

There is a method called reduceDatasetByRois() that takes in an offset and produces images of size (offset)X(offset) which contain the annotations of the original image.

@simonegrazioso

I'm having several problems getting good detection of small objects.
My images are 640x480 and the object sizes are typically around 70x35 - 120x60.

I'm using the typical ssd_mobilenet config file, and I train from the ssd_mobilenet_v2 pretrained model. I'm interested in good accuracy at high speed, so I need the SSD architecture. Maybe it is better to move to SSD inception v2? Or can I change some parameters, like the anchors and fixed_shape_resizer (but... how?)

Thank you for any advice,

@eumicro how did you edit the config file to obtain that good detection?

@hengshan123

hengshan123 commented Jun 15, 2018

Hi, I have a related problem, but it's a little different. I want to train a model to detect my hand (yes, only one class) and run the model on my phone. But it is a little slow, about 400 ms. If I resize the image to a smaller size like 100*100, it is much faster, but the precision is very bad. I guess I need to train the SSD from scratch, is that right? @izzrak

@Luonic

Luonic commented Jun 15, 2018 via email

@hengshan123

OK i will try 224*224

@AliceDinh

I have the same problem with detecting small objects. My input is 660x420 and the objects are about 25x35. I consider my objects medium-sized, but SSD mobilenet v1 gives low accuracy and the training time is long. I tried making my input 660x660 (width:height = 1:1), as recommended by @oneTimePad, to see whether it improves the result after SSD's resizing step to 300x300. The answer is yes, but not much.

@simonegrazioso

@AliceDinh, what do you mean by long training time? How many steps? Which learning rate? Did you change the anchor values?

@AliceDinh

@simonegrazioso

  1. By long training time I mean that reaching loss ~= 1.0 takes more than 200K steps. (With FasterRCNN, I get loss ~= 0.02 after 2K steps.)
  2. Where do I check the learning rate? Is that in TensorBoard? I trained on a server without Internet access, so I could not launch TensorBoard from there.
  3. Change the anchor values? Which specific values should I change?

@Ekko1992

Ekko1992 commented Apr 1, 2019

@Deep-Sek
Wouldn't it be better to use some other tricks to distinguish between the different types of those similar cars? For example, using OCR techniques to read the letters and decide whether it is a "C" series car or an "S" series car.

I have some experience with classifying similar classes, though, e.g. different types of cars (different brand, year, etc.) and different birds. It is indeed a hard problem, and I think you can have a look at papers in this domain, such as:
http://openaccess.thecvf.com/content_cvpr_2017/papers/Fu_Look_Closer_to_CVPR_2017_paper.pdf

@CrackedDS

CrackedDS commented Apr 1, 2019

@Ekko1992 I skipped OCR techniques altogether because I thought that, since this is "OCR in the wild" where we don't control the environment, the performance would not be good. I'll give it a try asap and keep everyone updated on how it works out. Maybe I can do some affine transformations and control the text density and structure a bit.

Also, will take a look at the paper and try that too. Thanks a lot for the resources. I'll provide an update as soon as I can.

I did try this: http://vis-www.cs.umass.edu/bcnn/docs/bcnn_iccv15.pdf
Basically, took this network architecture idea as a feature extractor and replicated it using MobileNet with bilinear connection and then plugged in the regular SSD for detection network after. Can you tell me what you think of that paper? The idea sounds like it should give amazing results.

But I'm not sure I had enough data to justify training this huge network with double the parameters. It just took way too long to converge. I'll probably re-attempt it at a later time, after trying out your suggestions.

@whuzs

whuzs commented Apr 6, 2019

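Hello @oneTimePad,

Thanks for the reply.
So one thing I could do is crop the traffic light images and then re-annotate all of them.
I was trying to avoid this, since cropping and re-annotating manually will take a few days, I assume :p.

In my case, I also used the SSD mobilenet pre-trained on the COCO dataset and fine-tuned it with the traffic light dataset.

There are two assumptions I made (please correct me if I am wrong):

  1. During the image_resize to 300 x 300, TensorFlow also resizes the annotations in the "tf.record" data. In my case it does not work simply because, when the original 1280 x 720 images are resized to 300 x 300, the small traffic lights nearly vanish. I suspect that is why I could not get a correct result.
  2. I assume that the released TensorFlow SSD mobilenet uses the SSD300 architecture, not SSD500. This is why I tried changing the image_resizer to a larger value (512 x 512); however, it still did not work.

Maybe the last resort really is what you said: crop and re-annotate everything. That will be a lot of overhead.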

Even if the image is cropped and re-annotated during training, the image is still so large when detected that cropping seems to be of little use.

@sky5media

Hello,

I am also facing a problem of recognizing small objects on the image.
In my case I need to be able to detect multiple numbers (0-9) as well as tiny logos on the image. Let's say we have an advertisement billboard of a more or less standard shape which contains 3-4 lines of small logos with digits in front.
For example:
DHL - 1248265
UPS - 7623652
FedEx - 3726565

The real size of a billboard is pretty big, but we need to detect numbers from a distance, so the numbers would actually become small, although you could still easily recognize them on the phone screen.
I am wondering if the following approach would work with SSD mobilenet V1/V2 models:

I will create a dataset consisting of individual numbers, logos and the whole billboard.
Then we will detect the whole billboard first, since it's pretty large relative to the image. After getting its bounding box, I will crop the image based on that, maybe enlarge it a bit, and then feed the result back to the model to detect logos and numbers.

So we would actually run the detector twice on the same image. I assume this would be anyway faster than running ResNet or Faster-RCNN on mobile device.
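
The coordinate bookkeeping for such a two-pass approach could be sketched like this. It is a minimal sketch, not a tested pipeline: the helper names, the `margin` value, and all box values are hypothetical, and the detector itself is left out.

```python
def expand_box(box, img_w, img_h, margin=0.1):
    """Grow a pixel-space (x_min, y_min, x_max, y_max) box by `margin` of its
    size on every side, clipped to the image bounds."""
    x0, y0, x1, y1 = box
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return (max(0, x0 - dx), max(0, y0 - dy),
            min(img_w, x1 + dx), min(img_h, y1 + dy))


def to_full_image(box_in_crop, crop):
    """Map a box detected inside `crop` back to full-image pixel coordinates."""
    x0, y0, x1, y1 = box_in_crop
    cx0, cy0 = crop[0], crop[1]
    return (x0 + cx0, y0 + cy0, x1 + cx0, y1 + cy0)


# Hypothetical first-pass detection of the billboard in a 1280 x 960 frame.
billboard = (400, 300, 800, 500)
crop = expand_box(billboard, 1280, 960)
# Hypothetical second-pass detection of a digit, in crop coordinates.
digit_in_crop = (50, 40, 70, 70)
print(crop, to_full_image(digit_in_crop, crop))
```

If the crop is also resized before the second pass, the second-pass boxes have to be scaled back by the same factor before calling `to_full_image`.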

Does anyone know if that would make any improvements for detecting process with SSD mobilenet?

@jamessmith90

Tensorflow is crap and below-par piece of shitty library written for the benefit of Google cloud.

Thank you.

@dexception

dexception commented Jun 4, 2019

For those who are visiting... let me break down the entire story for you.
Comment out the following in your pipeline.config file. There are bugs depending on which version of TensorFlow you're using, so if you're working with a newer version this problem should not get in your way. For the old version:

#data_augmentation_options {
#random_horizontal_flip {
#}
#}
#data_augmentation_options {
#ssd_random_crop {
#}
#}

@qraleq

qraleq commented Jun 27, 2019

@dexception Which version of TensorFlow are you referring to as the old version? And in which version was this bug fixed?

Thanks.

@Gmrevo

Gmrevo commented Aug 14, 2019

OK i will try 224*224

@hengshanji Did training MobilenetSSD V2 with 224*224 solve the issue?

@tcrockett

tcrockett commented Sep 6, 2019

Here is something I tried that I haven't seen anyone else try here.
My problem is that my camera input is 1280x960 and I'm looking for small labels. To keep the height from becoming too distorted when the image is fit into the 300x300 input space, I kept the aspect ratio but fit the image into the same number of pixels, e.g.

300 * 300 = 90e3
Y = X * 960/1280
90e3 = X * Y = X^2 * 960/1280
X = sqrt(90e3 * 1280/960) = 346.41
Y = 259.81

Rounding X and Y to integers while keeping X * Y < 90e3 gives an optimal new size of 346x260, which wastes only 40 pixels (40 * 3 bytes). img.shape = (260,346,3)

image_resizer {
  fixed_shape_resizer {
    height: 260
    width: 346
  }
}
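
The arithmetic above can be reproduced with a short helper. This is a sketch only: the function name `fit_to_area` and the floor/ceil search are assumptions for illustration, not something from the Object Detection API.

```python
import math


def fit_to_area(src_w, src_h, area_budget=300 * 300):
    """Largest integer (width, height) close to the source aspect ratio whose
    pixel count does not exceed `area_budget`."""
    exact_w = math.sqrt(area_budget * src_w / src_h)
    exact_h = exact_w * src_h / src_w
    # Try floor/ceil of both sides and keep the largest size within budget.
    candidates = [(w, h)
                  for w in (math.floor(exact_w), math.ceil(exact_w))
                  for h in (math.floor(exact_h), math.ceil(exact_h))
                  if w * h <= area_budget]
    return max(candidates, key=lambda wh: wh[0] * wh[1])


print(fit_to_area(1280, 960))  # (346, 260): 346 * 260 = 89960 <= 90000
```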

Retraining a SSD with inception v2, I should keep the meat of what the model has learned with minimal trouble.
This converged to a loss of 1.8 after 86000 steps.

@CrackedDS

@tcrockett Preserving the aspect ratio should not really affect your training in any way. If your camera input is 4:3 (1280x960) and you resize your input image to 1:1 (300x300), it shouldn't matter as long as you are always consistent. For example, if you train your network by resizing your pictures from 4:3 to 1:1 and do the same at inference time (converting your camera input from 4:3 to 1:1), the distortion you apply to the image is consistent and the neural network doesn't care much about it. I can see the network having trouble with detections if you captured raw data at a different aspect ratio (before resizing) and then resized that to 1:1. But preserving the aspect ratio by itself doesn't really do anything.

In SSD, the prior boxes have different aspect ratios which is why the aspect ratio of the input image doesn't really matter because the prior boxes will pick up the aspect ratio variation of the objects.
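
As a rough illustration of prior boxes with different aspect ratios, the sketch below uses the width and height rule from the SSD paper (width = scale * sqrt(ratio), height = scale / sqrt(ratio)). The scale of 0.2 and the ratio set are assumed values, not taken from any config in this thread.

```python
import math


def anchor_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Normalized (width, height) of SSD-style prior boxes at one scale.

    Every box has the same area (scale ** 2) but a different width:height ratio.
    """
    return [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]


for w, h in anchor_shapes(0.2):
    print(f"w={w:.3f} h={h:.3f} ratio={w / h:.2f}")
```

Note that a non-uniform resize changes the apparent aspect ratio of every object, so the ratio set still has to cover the objects as they appear after resizing.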

@tcrockett

logoCmpare
left is 300x300, right is 260x346
Without aspect ratio adaption the width of the logo will be represented in the 300x300 space by fewer pixels reducing the horizontal detail.

@Siggi1988

Hallo Tsuihao,

is the loss in your graph for the traffic light detection in percent, or must I multiply the values by 100?
I have the same problem, because I get values between 1 and 2.

Thanks for your answer
Sigg

@arvindchandel

Can anyone suggest something about retraining an object detection model? For example:
Suppose I train TensorFlow's faster Rcnn_inception on custom data with 10 classes like ball, bottle, Coca, etc., and it performs quite well. Later I get new data for 10 more classes like Paperboat, Thums up, etc., and I want my model to be trained on these too. Is there any method to retrain my existing model on these 10 new classes, upgrading it to 20 classes, rather than starting training from scratch?

@lorenzolightsgdwarf

lorenzolightsgdwarf commented Feb 4, 2020

Hi guys, here are my 2 cents: in my scenario I want to detect UI elements (buttons, checkbox, etc) from screenshots of 800x800 using ssd_mobile_net_v2. The dimensions of the objects range from 80px to 400px.

  • The inputs are 800x800 images and the preprocessing step is fixed_shape_resizer set to 800x800.

  • I found it extremely useful to modify the ssd_anchor_generator min_scale and max_scale based on the dimensions of the objects (0.1 and 0.5).

  • Another improvement was to modify the file ssd_mobilenet_v2_feature_extractor.py to use layer_15/expansion_output as the first feature map, with all the remaining feature maps being new layers (no more layer_19).
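
To see why min_scale 0.1 and max_scale 0.5 match objects of 80px to 400px in an 800x800 input, here is a small sketch. It assumes that the anchor scales are spaced linearly between min_scale and max_scale, one per feature map, and that there are 6 feature maps; both are assumptions about the generator, so check them against your own config.

```python
def ssd_anchor_scales(min_scale, max_scale, num_layers=6):
    """Linearly spaced anchor scales, one per feature map (assumed spacing)."""
    step = (max_scale - min_scale) / (num_layers - 1)
    return [min_scale + step * i for i in range(num_layers)]


image_side = 800
for s in ssd_anchor_scales(0.1, 0.5):
    print(f"scale {s:.2f} -> base anchor of about {s * image_side:.0f} px")
```

Under these assumptions the smallest base anchor is about 80 px and the largest about 400 px, which is the object size range given above.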

Lastly, in my case I also need an augmentation that creates a zoom-in/zoom-out effect to simulate objects at different scales and positions. For this I modified the preprocessor as in pull request #8043 and used the configuration

data_augmentation_options {
    ssd_random_crop_pad_fixed_aspect_ratio{
         aspect_ratio: 1.0
         min_padded_size_ratio: [0.5,0.5]
         max_padded_size_ratio: [2, 2]        
         operations {
            random_coef: 0.5
            overlap_thresh: 1.0 
            clip_boxes: false 
            min_object_covered: 1.0  
            min_aspect_ratio: 0.25
            max_aspect_ratio: 4
            min_area: 0.1
            max_area: 1.0
        }
   }
}

On Stack Overflow someone explained how to test the augmentation. This is the adapted script to visualize the effect of the above operation

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
 
import functools
import os
import cv2
from absl.testing import parameterized
 
import numpy as np
import tensorflow as tf
from scipy.misc import imsave, imread
 
from object_detection import inputs
from object_detection.core import preprocessor
from object_detection.core import standard_fields as fields
from object_detection.utils import config_util
from object_detection.utils import test_case
 
FLAGS = tf.flags.FLAGS
tf.disable_eager_execution()
class DataAugmentationFnTest(test_case.TestCase):
 
  def test_apply_image_and_box_augmentation(self):
    # Put here your augmentation
    data_augmentation_options = [
        (preprocessor.ssd_random_crop_pad_fixed_aspect_ratio, {
                'min_object_covered': [1.0],
                            'aspect_ratio': 1.0,
                            'aspect_ratio_range': [(0.25, 4)],
                            'area_range': [(0.1, 1.0)],
                            'overlap_thresh': [0.999999],
                            'clip_boxes': [False],
                            'random_coef': [0.0],
                            'min_padded_size_ratio': (0.25, 0.25),
                            'max_padded_size_ratio': (2, 2)})
    ]
    data_augmentation_fn = functools.partial(
        inputs.augment_input_data,
        data_augmentation_options=data_augmentation_options)
    tensor_dict = {
        fields.InputDataFields.image:
            # lena.png is the image reference
            tf.constant(imread('lena.png').astype(np.float32)),
        fields.InputDataFields.groundtruth_boxes:
            # just a ground truth box element in normalized coordinates [y1,x1,y2,x2]
            tf.constant(np.array([[ 0.5, 0.5,  0.53 , 0.53]], np.float32)),
        fields.InputDataFields.groundtruth_classes:
            tf.constant(np.array([1.0], np.float32))
    }
    # This is the size of the resizer
    final_image_size= (800, 800)
 
    augmented_tensor_dict = data_augmentation_fn(tensor_dict=tensor_dict)
    with self.session() as sess:
        for x in range(100):
            augmented_tensor_dict_out = sess.run(augmented_tensor_dict)
            final_image_shape=augmented_tensor_dict_out[fields.InputDataFields.image].shape
            print("Final Shape "+ str(x) + ": ", final_image_shape)
            print("Final Boxes "+ str(x) + ": ", augmented_tensor_dict_out[fields.InputDataFields.groundtruth_boxes])
            final_image=augmented_tensor_dict_out[fields.InputDataFields.image]
            if augmented_tensor_dict_out[fields.InputDataFields.groundtruth_boxes].shape[0] > 0:
                point_x=augmented_tensor_dict_out[fields.InputDataFields.groundtruth_boxes][0][1]
                point_y=augmented_tensor_dict_out[fields.InputDataFields.groundtruth_boxes][0][0]
                point_x2=augmented_tensor_dict_out[fields.InputDataFields.groundtruth_boxes][0][3]
                point_y2=augmented_tensor_dict_out[fields.InputDataFields.groundtruth_boxes][0][2]
                final_image = cv2.rectangle(final_image, (int(point_x*final_image_shape[1]),int(point_y*final_image_shape[0])), (int(point_x2*final_image_shape[1]),int(point_y2*final_image_shape[0])), (255,0,0), 2)
            else:
                print("Boxes is empty")
            imsave('test/lena_out'+str(x)+'.jpeg',cv2.resize(final_image,final_image_size))
 
if __name__ == '__main__':
  tf.test.main()

@sainisanjay

sainisanjay commented Apr 3, 2020

@eumicro what model and how did you fine-tune the model to get accurate prediction?

Hi, sorry my English is not that good. I described how I fine tuned and trained the SSD MobileNet here (only in German, sorry): http://eugen-lange.de/german-traffic-sign-detection/

the main "tuning steps" are:

  • generated my own data set (see my homepage for more details), I think it was the most important "step" ^^...
  • removed 2 first layers from the MobileNet
  • used grayscale pictures

From which file did you remove the first two layers?

@synergy178

@sky5media have you been able to solve your issue? If yes, how? I am also trying to use object detection for OCR, but I have 14 classes and can only detect 9 of them with model_main. With train.py the loss does something weird: it does great for the first epoch and then grows exponentially into the billions.

@sainisanjay

sainisanjay commented Apr 10, 2020

I am facing much the same issue with the ssd_mobilenet_v2_coco_2018_03_29 pre-trained model. The localisation loss fluctuates and the loss is quite high even after 50K steps. I am trying to train the model with 7 classes (Pedestrian;Truck;Car;Van;Bus;MotorBike;Bicycle). I know the same classes are already available in the pre-trained model, but I am feeding in my own images. Any idea what's wrong?

trainingloss

@synergy178

@sainisanjay Your learning rate (LR) is too high, I guess. Try setting a scheduled decay of the LR.

Check whether your objects are correctly annotated and easy to distinguish from the background.

Check the exif orientation of your pictures as well.

@sainisanjay

@synergy178, I have following parameters:

initial_learning_rate: 0.001
    decay_steps: 40000
    decay_factor: 0.95
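
For reference, here is what those three values do to the learning rate over time. This is a sketch of plain exponential decay; whether the decay is applied in steps (staircase) or continuously is an assumption here, so both variants are supported.

```python
def exponential_decay(initial_lr, decay_steps, decay_factor, step, staircase=True):
    """Learning rate after `step` training steps of exponential decay."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_factor ** exponent


for step in (0, 50_000, 200_000):
    lr = exponential_decay(0.001, 40_000, 0.95, step)
    print(step, f"{lr:.6f}")
# 0 0.001000
# 50000 0.000950
# 200000 0.000774
```

With these settings the rate has dropped by only 5% after 50K steps, so the schedule is close to a constant learning rate over that range.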

I am not really sure how to check the EXIF orientation of my pictures. But I have visualised my TF records with tfrecord-viewer. This tool gives me the same results as the original annotations, as can be seen in the attached image.

5

@sainisanjay

sainisanjay commented Apr 15, 2020

Further, I have checked the image orientation with the following two options. Both gave me the same orientation:
Option 1: Example from exif

import matplotlib.pyplot as plt
import image_to_numpy
img = image_to_numpy.load_image_file("my_file.jpg")
plt.imshow(img)
plt.show()

Option 2: Normal matplotlib lib.

from matplotlib import image
from matplotlib import pyplot
image = image.imread("my_file.jpg")
print(image.dtype)
print(image.shape)
pyplot.imshow(image)
pyplot.show()

exif
matplot
Since both libraries give the same orientation, I assume the orientation of the images is correct. Is the problem something else?

@sky5media

@synergy178 unfortunately no, I couldn't solve it.

@preronamajumder

Here is something I tried that I haven't seen anyone else try here.
My problem is that my camera input is 1280x960 and I'm looking for small labels. To keep the height from becoming too distorted when the image is fit into the 300x300 input space, I kept the aspect ratio but fit the image into the same number of pixels, e.g.

300 * 300 = 90e3
Y = X * 960/1280
90e3 = X * Y = X^2 * 960/1280
X = sqrt(90e3 * 1280/960) = 346.41
Y = 259.81

Rounding X and Y to integers while keeping X * Y < 90e3 gives an optimal new size of 346x260, which wastes only 40 pixels (40 * 3 bytes). img.shape = (260,346,3)

image_resizer {
  fixed_shape_resizer {
    height: 260
    width: 346
  }
}

Retraining a SSD with inception v2, I should keep the meat of what the model has learned with minimal trouble.
This converged to a loss of 1.8 after 86000 steps.

It is not a good idea to use different height and width values for the image resizer if you want to convert the model to UFF to run on edge devices, because you need to put the ratios into the UFF config file manually, and the function used to calculate the ratios takes only one variable as input. So for 300x300, the ratios are calculated for 300. But in your 260x346 case, whether you input 260 or 346, the bounding boxes generated by the TensorRT model on the edge device will differ from the ones generated by the TensorFlow model on your PC.

@sainisanjay

@preronamajumder Did you use transfer learning, or did you train the model from scratch? I believe that if you change the height and width, you cannot use the pre-trained model (300x300) for weight initialization.

@HUI11126

https://github.com/DetectionTeamUCAS/FPN_Tensorflow
This project is based on Faster R-CNN + FPN, which is accurate at detecting small objects. But I was not able to deploy the project on OpenVINO, since the merge function in "fusion_two_layer" is limited on OpenVINO.

@bhavyaj12

Hi all.
I'm trying to train an SSD on a custom barcode detection task. The issue is that the dataset images are all different sizes and the keep_aspect_ratio_resizer doesn't seem to work with ssd resnet 50. Do the input images have to be the same size, in a 1:1 ratio, as with the fixed resizer?

@preronamajumder

@preronamajumder Did you use transfer learning, or did you train the model from scratch? I believe that if you change the height and width, you cannot use the pre-trained model (300x300) for weight initialization.

I used transfer learning with ssd_mobilenet_v2_coco. The fixed image resizer can be changed, but I started by setting it to 300x300.

@preronamajumder

Hi all.
I'm trying to train an SSD on a custom barcode detection task. The issue is that the dataset images are all different sizes and the keep_aspect_ratio_resizer doesn't seem to work with ssd resnet 50. Do the input images have to be the same size, in a 1:1 ratio, as with the fixed resizer?

Why don't you try padding the images? It will maintain the aspect ratio of the ground truth boxes and will also give the size required by the detection model.
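
A minimal sketch of the box bookkeeping that padding requires, assuming the padding is added on the right and bottom only and that boxes are normalized [y_min, x_min, y_max, x_max]. The function name `pad_to_square` is hypothetical and the image pixels themselves are not touched here.

```python
def pad_to_square(img_w, img_h, boxes):
    """Pad the shorter side (right/bottom) to make the image square and
    re-normalize boxes given as (y_min, x_min, y_max, x_max) in [0, 1]."""
    side = max(img_w, img_h)
    # Fraction of the padded canvas that the original image occupies per axis.
    sx, sy = img_w / side, img_h / side
    padded = [(y0 * sy, x0 * sx, y1 * sy, x1 * sx) for y0, x0, y1, x1 in boxes]
    return side, padded


side, boxes = pad_to_square(640, 480, [(0.5, 0.5, 1.0, 1.0)])
print(side, boxes)  # 640 [(0.375, 0.5, 0.75, 1.0)]
```

Because the padding sits on the right and bottom, the origin does not move and only the normalization changes; centered padding would also need an offset.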

@NickosKal

Hi @Luonic,

Yes, I had successfully trained faster rcnn and obtained an accurate result.
As shown:
image

However, it is too slow for my use case.
That is why I want to try the fastest SSD mobilenet model :)

I have some concerns regarding the annotated information.
When you crop the annotated images, how do you "update" the information in the original annotation?
Let's say:
Original image 1280 x 720 and the annotated traffic light is :
boxes: {label: Green, occluded: false, x_max: 752.25, x_min: 749.0, y_max: 355.125, y_min: 345.125}

When you crop it to 300 x 300, the annotation's image coordinate system needs to be updated.
Did you re-annotate manually, or is there some image cropping tool that can do this?

Regards,
Hao

Hey @Tsuihao could you share the repo you use for the faster-RCNN please? Thanks in advance!

@jvishnuvardhan jvishnuvardhan added the models:research models that come under research directory label Jul 19, 2022
@Petros626

Petros626 commented Jun 29, 2023

I'm having several problems getting good detection of small objects. My images are 640x480 and the object sizes are typically around 70x35 - 120x60.

I'm using the typical ssd_mobilenet config file, and I train from the ssd_mobilenet_v2 pretrained model. I'm interested in good accuracy at high speed, so I need the SSD architecture. Maybe it is better to move to SSD inception v2? Or can I change some parameters, like the anchors and fixed_shape_resizer (but... how?)

Thank you for any advice,

@eumicro how did you edit the config file to obtain that good detection?

@darkdrake88 @sainisanjay

In my opinion, he removed the first two layers of the architecture.
I thought about it a bit and I'm sure these are the layers that were excluded:

scientific paper (https://arxiv.org/abs/1801.04381):
224x224x3 conv2d, output_channels=32, stride=2
112x112x32 bottleneck, output_channels=16, stride=1

TF OD API (https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v2.py):

op(slim.conv2d, stride=2, num_outputs=32, kernel_size=[3, 3])
op(ops.expanded_conv, expansion_size=expand_input(1, divisible_by=1),num_outputs=16)

Questions about it:

  1. Is it necessary to rerun the protoc command (refer to the TensorFlow installation guide), or can I just comment out these two lines and start training?
  2. Why does this change increase the model's ability to detect smaller objects that are farther away?
