Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Hierarchical clustering: Fix visualisations for small values #4310

Closed
wants to merge 1 commit into from

Conversation

PrimozGodec
Copy link
Contributor

@PrimozGodec PrimozGodec commented Jan 6, 2020

Issue

Fixes (partially) #2650

The problem here is that distances in a case when we are using cosine distance are very small since values are not normalized and dates are huge numbers (seconds from January 1, 1970) comparing to other values. Why we do not normalize values for the cosine distance? @lanzagar In this case, it would solve the problem.

Description of changes

This PR currently removes lines at the right of the visualization.

Lining lines still persist. They are caused since qt somehow randomly drops elements in QPainterPath when numbers are really small. Here is the example:

>>> from AnyQt.QtGui import QPainterPath
>>> p = QPainterPath()
>>> p.moveTo(4.996003610813204e-16, 0.5)
>>> p.elementCount()
1
>>> p.lineTo(0.0, 0.5)
>>> p.elementCount()
1
>>> p.lineTo(0.0, 2.375)
>>> p.elementCount()
2
>>> p.lineTo(3.885780586188048e-16, 2.375)
>>> p.elementCount()
2

Four elements should be in QPainterPath but only 2 persists.

This can be solved with scaling values for example in the range from 0 to 1. @ales-erjavec is there any better and easier solution for that? Sine if we do scaling the widget would need to be partially redesigned.

And there is another potential issue. After finding out that I cannot compute the cosine distance with normalization I was trying to normalize with the Preprocess widget. It also does not normalize the time variable since by default normalize_datetime=False. In my opinion, when one has a time variable among attributes and does the preprocessing he also wants to preprocess/normalize it together with other variables. Is there a specific reason that it is disabled? @lanzagar @janezd

Includes
  • Code changes
  • Tests
  • Documentation

@PrimozGodec PrimozGodec added the needs discussion Core developers need to discuss the issue label Jan 6, 2020
@codecov
Copy link

codecov bot commented Jan 6, 2020

Codecov Report

Merging #4310 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #4310      +/-   ##
=========================================
+ Coverage    86.7%   86.7%   +<.01%     
=========================================
  Files         396     396              
  Lines       71618   71618              
=========================================
+ Hits        62095   62098       +3     
+ Misses       9523    9520       -3

@janezd
Copy link
Contributor

janezd commented Jan 6, 2020

I haven't yet worked with time variables. @ajdapretnar?

Normalizing them doesn't seem to have sense on the first glance (if you multiply a year by 0.218 you get --- what?), but so doesn't normalizing meters and liters (while normalizing feet and pounds doesn't hurt because they already don't make much sense in their original units). I hence see no arguments against normalization.

@janezd
Copy link
Contributor

janezd commented Jan 6, 2020

I changed "Partially fixes #2650" to "Fixes (partially) #2650", otherwise merging this PR will close #2650.

@PrimozGodec
Copy link
Contributor Author

Hm, agree with you. I cannot find reason for normalizing the time variable too. But it makes sense in this case with the distances. When we compute distances with data provided in the issue the time variable has so big values that cosine distances become very small (around 1e-16), which is I think more computational error than real distance. If values would be normalized the cosine distance would be valid.

What is exactly the reason that we do not allow normalization before computing cosine distances?

@ajdapretnar
Copy link
Contributor

I agree that normalizing time variables doesn't make sense in general, but when computing distances I think it doesn't hurt. I mean, without normalization distance matrix becomes highly skewed.

@ales-erjavec
Copy link
Contributor

This can be solved with scaling values for example in the range from 0 to 1. @ales-erjavec is there any better and easier solution for that?

No. I think (pre)scaling would be the only correct solution.

@PrimozGodec
Copy link
Contributor Author

I am closing this. Error fixed in #4322

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs discussion Core developers need to discuss the issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants