fix: warning messages for invalid data in test_dataset.py #442
Conversation
@gcroci2 : How do we feel about users being able to make transformations that result in invalid data and what concretely happens in these cases to the data? Does it return a NaN or a 0 or a complex number or what?
I am surprised that numpy, or whichever package does the transformation, throws warnings instead of errors. Are we ok with warnings in our case, or do we want to throw an error if an invalid data point is detected?
Apart from that, if we keep it like this, I would be in favor of suppressing the warning in the test module. You can check in PR #249 how I did that there.
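To make the question above concrete, here is a quick probe (values are made up) of what NumPy actually does with invalid inputs: it emits a `RuntimeWarning` and silently returns `nan` or `-inf`, it does not raise.

```python
import warnings

import numpy as np

x = np.array([4.0, -1.0, 0.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sqrt_x = np.sqrt(x)  # -1.0 is invalid -> nan
    log_x = np.log(x)    # -1.0 -> nan, 0.0 -> -inf (divide-by-zero warning)

print(sqrt_x)  # values: 2.0, nan, 0.0
print(log_x)   # values: log(4), nan, -inf
print([str(w.message) for w in caught])  # the captured RuntimeWarning texts
```

So no complex numbers appear: the invalid entries simply become `nan` (or `-inf`), and the run continues unless the warning is explicitly escalated.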
Apologies, I accidentally pushed something unrelated onto this branch, but I have now reverted it again.
We're not "fixing" the warning messages, we're only suppressing them. We were rushing a bit for the experiments, so we stopped trying to figure this out. But I do think it would be better to understand what is going on: why those messages appear even when not expected, and, as Dani was saying, what exactly happens to the data when they are invalid (do we get NaNs, complex numbers?).
What I'd do to start understanding what is going on would be to look at the features before and after the transformations, using the testing data that we are using in the transformations tests, and see
- which value is problematic
- why is that
- what it becomes if we apply the transformation anyway
You can retrieve the np.ndarray from the dataset easily, and then apply the transformation apart and see what is going on.
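A minimal sketch of that inspection loop (the array and the chosen transform below are made up; in practice they would be the feature array retrieved from the dataset and the transformation under test):

```python
import numpy as np

features = np.array([0.5, -2.3, 7.1, 0.0])  # stand-in for a real feature array
transform = np.sqrt                          # the transformation under test

with np.errstate(invalid="ignore"):          # silence the warning while probing
    transformed = transform(features)

bad = ~np.isfinite(transformed)              # entries that became nan or inf
for value, result in zip(features[bad], transformed[bad]):
    print(f"problematic value {value} becomes {result}")
```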
I tried to print out the data array directly (passing ...). I also searched a bit on the Internet; it may be because of the issue mentioned here: https://www.geeksforgeeks.org/how-to-fix-runtimewarning-invalid-value-encountered-in-double_scalars/ (scroll down to Method 2). The reason is that the NumPy library cannot handle such a large number in so complex a structure, so it gives an invalid-value warning for it. I can show you the script, and maybe we can discuss it together if my explanation is not clear enough.
There seem to be negative values; I tested this line (...). At this point, my main concern is not so much whether/which transformations are valid and which aren't. Cube roots instead of square roots work, because they can handle negative values. That is fine for the test not throwing a warning, but it means that if a user in a real-world case uses the square root, they will have problems and maybe not even realize it (if they ignore the warning).

What I think should happen, irrespective of whatever functions we use as default, is that if a transformation leads to "problematic" values, an error is raised instead of a warning. Otherwise the data is transformed into unintuitive/problematic values and/or values are dropped (I'm actually not sure what happens here); see an implementation of this here (...).

It's actually a good thing that the warnings are popping up: it shows us the fringe effects of our code and informs us that weird things are happening that we need to deal with.
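One way to promote these warnings to hard errors (a sketch of the general idea, not the linked implementation) is NumPy's `errstate` context manager, which can turn invalid floating-point operations into `FloatingPointError` exceptions:

```python
import numpy as np

x = np.array([1.0, -4.0])

try:
    # "invalid" covers operations like sqrt of a negative number.
    with np.errstate(invalid="raise"):
        y = np.sqrt(x)
except FloatingPointError as err:
    print(f"transformation rejected: {err}")
```

With this in place, a bad data point stops the run immediately instead of silently propagating a `nan` into training.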
See my comment and let me know if you need help implementing that :)
This PR is stale because it has been open for 14 days with no activity.
Shall we meet and discuss the best way to resolve this, @gcroci2 and @joyceljy? I think last time we decided to leave it open until we discover the cause of the warning messages.
I discussed the next steps a bit with Giulia yesterday, since she is going to be on holiday for three days next week. I can start implementing the function and unit tests while she is away, and maybe we can have a follow-up meeting after she comes back. But there can still be a meeting between the two of us to confirm more details if you like! @DaniBodor
I think we had a discussion about whether we want to raise an error and stop the entire run, or only drop (and list) the entries that have NaNs and continue with the rest. At the time, we had said let's first check the source of the warnings and then decide how to proceed.
Another option that we could consider (not sure whether you already have) is to first standardize and then apply the transformation. Standardized data should not contain any negative values, and we could create a rule that 0 always transforms to 0, even if the chosen transformation cannot handle 0s (e.g. log).
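A hypothetical sketch of the "0 always transforms to 0" rule proposed above, using `np.where` so that `np.log` never actually sees a zero (the inner `np.where` substitutes a harmless 1.0 at the zero positions):

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0])

# Exact zeros map to 0 by convention; everything else gets the transformation.
y = np.where(x == 0.0, 0.0, np.log(np.where(x == 0.0, 1.0, x)))
```

Whether this is sound statistically is a separate question (see the reply below about standardization and negative values); this only shows that the rule itself is easy to implement.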
I just checked online to see whether we can do standardization first, before applying the transformation. It seems the common approach is to do standardization afterward. Also, standardization is an optional step in the project: the user can choose to do the transformation only, without standardization (which they can specify in the ...).
Indeed, the reason why we introduced transformations was to make the data more normal in order to then apply standardization, because standardization should (ideally) be applied to normal data only. Also, standardization centers data around 0 with a standard deviation of 1, which doesn't mean at all that there aren't negative values after standardizing the data.
Indeed, an update about that: there was a bug in the code that was causing a transformation to be applied to all the features even if only a few were specified in the dict (because of the ...). Now that is fixed with this PR. What I suggested to Joyce, and what I think is best to do at this stage, is to raise an error in the get() methods if warnings within the transformations are caught, and print those warnings' messages. The next step could be handling NaNs and filling them automatically, or giving the user the possibility to decide the NaN-filling values, but I am not sure it is worth implementing. I have many reasons to advise against it, but we can discuss it together and decide what to do @DaniBodor
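A hedged sketch of that suggestion (the helper name and signature are hypothetical, not the project's actual code): record any warnings raised while the transformation runs, and re-raise them as a `ValueError` that includes the original messages.

```python
import warnings

import numpy as np


def apply_transform(values: np.ndarray, transform, feat: str) -> np.ndarray:
    """Apply `transform` to `values`, escalating any warning to a ValueError."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        out = transform(values)
    if caught:
        messages = "; ".join(str(w.message) for w in caught)
        raise ValueError(
            f"Invalid value when applying {getattr(transform, '__name__', transform)} "
            f"for feature {feat}: {messages}"
        )
    return out
```

Called from a dataset's get() method, `apply_transform(np.array([-1.0]), np.sqrt, "electrostatic")` would raise, while valid inputs pass through unchanged.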
Again, lots of work ended up going into what looked to be a quick and simple PR.
Thanks @joyceljy for taking care of this!
deeprankcore/dataset.py
Outdated
raise ValueError(f"Invalid value occurs when applying {transform} for feature {feat}. "
                 f"Please change the transformation function for {feat}.")
I am currently unsure whether this will crash the entire run or just the entry.
Either way, I think it would be nice to state which entry this is occurring on, so that users can troubleshoot potential problems in the data rather than in the transformation.
Right now it crashes the entire run. Are we okay with that? I modified the message as you suggested.
Fixing the warning error caused by applying the transformation function to features.

For `test_only_transform_graphdataset`, `test_transform_standardize_graphdataset`, and `test_features_transform_logic_graphdataset`, I changed the transformation function applied to the features `electrostatic` and `all` to `np.cbrt()`, and now it can handle both negative and positive data without throwing a warning message.

For `test_transform_standardize_graphdataset`, I didn't see any warning messages from my side.

For `test_only_transform_all_graphdataset`, this will throw a warning message because some types of features get invalid values when the log transformation is applied. But in fact, we have a feature dictionary in our real training part that assigns a suitable transformation method to each feature. In this test, we want to check whether the `all` option works, so we just give it an arbitrary transformation method (which is `np.log(t+10)`). So it makes sense that it causes a warning message. I will keep this one without changing it :)
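As a sanity check on the cube-root choice: `np.cbrt` is defined for negative inputs, so, unlike `np.sqrt`, it transforms them without any warning.

```python
import numpy as np

# Cube root handles negatives cleanly: cbrt(-8) is -2, cbrt(27) is 3.
print(np.cbrt(np.array([-8.0, 27.0])))
```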