inaccurate zero-shot inference #5
Sorry for the delayed reply. I have two messages to share with you:
I think the core of the problem is that, when monocular input is given, the image features output by the model are not passed through the linear layer, while the model was trained with that layer (see model.py for details). As a result, during zero-shot inference the image features are not well aligned with the text features. So in addition to the method mentioned above, you can also try using only the monocular features, but this requires a small rewrite of the code as below. This is a flaw in our design: for the downstream linear-probing and fine-tuning tasks, whether or not the image features pass through the linear layer makes no difference to the final result, so we did not take this into account. Sorry.
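To make the alignment issue concrete, here is a minimal NumPy sketch of the idea. All names (`W_proj`, the shapes, the random features) are hypothetical stand-ins, not the repo's actual model.py code: `W_proj` plays the role of the trained linear layer, and the point is that zero-shot scoring only works if the image features pass through it before being compared to the text features.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, proj_dim, n_img, n_cls = 512, 256, 5, 10

# Hypothetical stand-in for the trained linear layer (the one in model.py).
W_proj = rng.standard_normal((embed_dim, proj_dim)) / np.sqrt(embed_dim)
img_feats = rng.standard_normal((n_img, embed_dim))  # monocular encoder output
text_emb = rng.standard_normal((n_cls, proj_dim))    # class-prompt text embeddings

def normalize(x):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Zero-shot scoring: project the image features first so they live in the
# same space as the text features, then compare by cosine similarity.
# Skipping W_proj (comparing img_feats to text_emb directly) is the bug:
# the dimensions and the learned alignment no longer match.
img_emb = normalize(img_feats @ W_proj)
logits = img_emb @ normalize(text_emb).T  # shape (n_img, n_cls)
pred = logits.argmax(axis=-1)             # best-matching class per image
print(logits.shape, pred.shape)
```

The same fix in the repo would amount to routing the monocular features through the model's projection layer before the similarity computation, exactly as the training path does.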
synfundus-sel.zip
I tested the 5 images in the zip archive above. They were selected from the SynFundus-1M synthetic dataset (https://github.com/parap1uie-s/SynFundus-1M), and each image comes with a caption.
The corresponding meta-information of the 5 images is:
The label file is:
However, all 5 images were best matched with "Myopia 近视". This is strange, because the pathological-myopia label (is_pm) is 0 for all 5 images, and the captions do not mention myopia.