Potential bug in calc_square_dist() #2178

Closed
chyohoo opened this issue Aug 5, 2022 · 8 comments · Fixed by #2356

@chyohoo

chyohoo commented Aug 5, 2022

def calc_square_dist(point_feat_a: Tensor,

This function can return NaN, because the computed squared distance can come out negative in some cases.

For example:

a = torch.tensor([[[0.0000, 0.0000, 0.2188, 0.0000, 0.0000]]])
b = torch.tensor([[[0.0000, 0.0000, 0.2189, 0.0000, 0.0000]]])
calc_square_dist(a,b)

@zhouzaida
Collaborator

Please @ZCMax have a look.

@ZCMax
Contributor

ZCMax commented Aug 5, 2022

I can not reproduce the NaN result using your provided example. It calculates the square dist correctly.

@chyohoo
Author

chyohoo commented Aug 5, 2022

> I can not reproduce the NaN result using your provided example. It calculates the square dist correctly.

This is very strange. I cannot reproduce it either, but I did encounter NaN while using it.

@zhouzaida
Collaborator

zhouzaida commented Aug 9, 2022

Hi, what is your mmcv version? I can not reproduce the NaN result either.


@chyohoo
Author

chyohoo commented Aug 10, 2022

nan_tensor.zip
I saved two tensors that cause NaN in the zip file.
My mmcv version is:

MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.0

zhouzaida assigned Tai-Wang and unassigned ice-tong Aug 16, 2022
@Tai-Wang
Member

This seems to be caused by a subtle numerical problem. The result of dist = a_square + b_square - 2 * corr_matrix can contain small negative values around zero, so computing sqrt(dist) can produce NaN. A simple workaround is to add a small epsilon to dist before taking its square root. @ZCMax, please check whether this modification has other side effects or affects any of the current related models.
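To make the failure mode concrete, here is a minimal sketch of the expanded-form distance described above, with a guard before the square root. It is not mmcv's actual implementation: the helper names `expanded_square_dist` and `safe_dist`, the assumed (B, N, C) / (B, M, C) input shapes, and the default `eps` value are all assumptions made for illustration. The clamp is an addition beyond the epsilon suggested above.

```python
import torch


def expanded_square_dist(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Squared distances via |a|^2 + |b|^2 - 2 a.b (hypothetical helper).

    a: (B, N, C), b: (B, M, C) -> (B, N, M). Rounding error can make
    entries slightly negative when two points are nearly identical.
    """
    a_square = a.pow(2).sum(dim=-1, keepdim=True)      # (B, N, 1)
    b_square = b.pow(2).sum(dim=-1).unsqueeze(1)       # (B, 1, M)
    corr_matrix = torch.matmul(a, b.transpose(1, 2))   # (B, N, M)
    return a_square + b_square - 2 * corr_matrix


def safe_dist(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Distance with a guard against negative squared distances (assumed fix)."""
    sq = expanded_square_dist(a, b)
    # Clamp negatives to zero first, then add eps before the square root.
    return torch.sqrt(sq.clamp(min=0) + eps)


a = torch.tensor([[[0.0000, 0.0000, 0.2188, 0.0000, 0.0000]]])
b = torch.tensor([[[0.0000, 0.0000, 0.2189, 0.0000, 0.0000]]])
print(safe_dist(a, b))
```

Adding epsilon alone only helps when the negative error is smaller in magnitude than epsilon, which is why this sketch clamps first.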

@chyohoo
Author

chyohoo commented Aug 17, 2022

> This seems to be caused by a subtle numerical problem. Because there can be negative values around zero in the results of dist = a_square + b_square - 2 * corr_matrix, i.e., in dist, nan can be produced by computing sqrt(dist). A simple workaround is to add an small number epsilon to dist when computing its square root. Please @ZCMax have a look at whether this modification has other influence or has any effect on current related models.

Yep. After switching to torch.cdist, I have not seen NaN so far.

@ZCMax
Contributor

ZCMax commented Aug 17, 2022

Great, torch.cdist would be a better solution for this situation. Contributions (a PR) are welcome if you have time, after checking the performance impact of this modification.
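For reference, a rough sketch of what a torch.cdist-based version could look like. The helper name `cdist_square_dist` and the assumed (B, N, C) / (B, M, C) shapes are hypothetical, and this is not meant as a drop-in replacement for calc_square_dist or as the change that was eventually merged. The `compute_mode` shown asks torch.cdist not to use the matrix-multiplication path for Euclidean distances.

```python
import torch


def cdist_square_dist(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Squared Euclidean distances via torch.cdist (hypothetical helper).

    a: (B, N, C), b: (B, M, C) -> (B, N, M).
    """
    dist = torch.cdist(
        a, b, p=2.0,
        # Request the direct computation rather than the matmul-based one.
        compute_mode='donot_use_mm_for_euclid_dist',
    )
    # cdist returns distances; square them to get squared distances.
    return dist.pow(2)


a = torch.tensor([[[0.0000, 0.0000, 0.2188, 0.0000, 0.0000]]])
b = torch.tensor([[[0.0000, 0.0000, 0.2189, 0.0000, 0.0000]]])
print(cdist_square_dist(a, b))
```

Because the result is a square of a real-valued distance, it cannot be negative, but the speed and memory cost relative to the expanded form would still need to be measured, as noted above.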
