fix: warning messages for invalid data in test_dataset.py #442

Merged: 17 commits merged into main on Sep 4, 2023

Conversation


@joyceljy joyceljy commented Jun 2, 2023

This PR fixes the warnings raised when applying transformation functions to features.

  • For test_only_transform_graphdataset, test_transform_standardize_graphdataset, and test_features_transform_logic_graphdataset
    I changed the transformation function applied to the features electrostatic and all to np.cbrt(), which handles both negative and positive data without raising a warning.

  • For test_transform_standardize_graphdataset
    I did not see any warning messages on my side.

  • For test_only_transform_all_graphdataset
    This test still raises a warning because some feature types take invalid values under the log transformation.
    In the real training setup, however, a feature dictionary assigns a suitable transformation method to each feature. This test only checks that the all option works, so it applies an arbitrary transformation (np.log(t+10)) to every feature; the warning is therefore expected, and I will keep this test unchanged. :)
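The difference in warning behaviour between these transforms can be reproduced in isolation (a small sketch with made-up feature values):

```python
import warnings
import numpy as np

# Hypothetical feature values, including entries below -10.
arr = np.array([-12.0, -3.5, 0.0, 4.2])

# np.cbrt is defined for all real inputs, so it raises no warning even
# if every warning is escalated to an error.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    cbrt_result = np.cbrt(arr)

# np.log(t + 10) is undefined for t <= -10: NumPy emits a RuntimeWarning
# and produces NaN for those entries.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    log_result = np.log(arr + 10)

print(np.isnan(cbrt_result).any())  # False
print(np.isnan(log_result).any())   # True
print(len(caught) > 0)              # True
```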

@joyceljy joyceljy self-assigned this Jun 2, 2023
@joyceljy joyceljy linked an issue Jun 2, 2023 that may be closed by this pull request
@joyceljy joyceljy changed the title Fix warning messages for invalid data in test_dataset.py fix: warning messages for invalid data in test_dataset.py Jun 2, 2023
@joyceljy joyceljy requested review from gcroci2 and DaniBodor June 2, 2023 15:15

@DaniBodor DaniBodor left a comment

@gcroci2: How do we feel about users being able to apply transformations that produce invalid data, and what concretely happens to the data in these cases? Does it return a NaN, a 0, a complex number, or what?
I am surprised that numpy, or whichever package does the transformation, throws warnings instead of errors. Are we OK with warnings in our case, or do we want to raise an error when an invalid data point is detected?

Apart from that, if we keep it like this, I would be in favor of suppressing the warning in the test module. You can check in PR #249 how I did that there.
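For reference, one common way to suppress a known, expected warning inside a test (the exact mechanism used in PR #249 may differ) is the `warnings` context manager:

```python
import warnings
import numpy as np

# Silence the expected RuntimeWarning only inside this block; warnings
# elsewhere in the test module are unaffected.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    result = np.log(np.array([-5.0, 2.0]) + 4.0)  # log(-1) would normally warn

print(np.isnan(result[0]), np.isnan(result[1]))  # True False
```

Note that the invalid entry still silently becomes NaN, which is exactly the concern raised above about users missing the problem.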

@DaniBodor

Apologies, I accidentally pushed something unrelated onto this branch, but have now reverted it.


@gcroci2 gcroci2 left a comment

We're not "fixing" the warning messages here, we're only suppressing them. We were rushing a bit for the experiments, so we stopped trying to figure this out, but I do think it would be better to understand what is going on: why do these messages appear even when not expected, and, as Dani was saying, what happens to the data when values are invalid (do we get NaNs, complex numbers)?

What I'd do to start understanding what is going on is to look at the features before and after the transformations, using the testing data from the transformation tests, and check

  • which value is problematic
  • why that is
  • what it becomes if we apply the transformation anyway

You can easily retrieve the np.ndarray from the dataset, apply the transformation separately, and see what happens.

@joyceljy

joyceljy commented Jun 13, 2023

I tried printing the data array directly (passing features_transform to the _compute_features_manually function in the test_dataset.py file) and printing the result, e.g. print(arr + 10), before applying the np.log function.
Every value seems fine (no 0s, no infs, no NaNs), and print(np.log(arr + 10)) also runs without any warnings.
But if I use the original code, which passes features_transform to the GraphDataset object (which in turn calls the hdf5_to_pandas function to compute the dataset), the invalid-value warning appears, which is really strange.

I searched around a bit; it may be the issue described at https://www.geeksforgeeks.org/how-to-fix-runtimewarning-invalid-value-encountered-in-double_scalars/ (scroll down to Method 2): NumPy emits an invalid-value warning when a computation is not representable in the given structure. Maybe our hdf5_to_pandas function triggers this because it goes through a pandas DataFrame, whereas _compute_features_manually works fine because it operates on a plain data array?

I can show you the script and maybe discuss it together if my explanation is not clear enough.
But now I will suppress the warnings using the method Dani mentioned.
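The manual inspection described above could be sketched like this (hypothetical helper and values, not the project's actual code):

```python
import numpy as np

def inspect_before_log(arr: np.ndarray, shift: float = 10.0) -> dict:
    """Hypothetical helper: sanity-check a feature array before np.log(arr + shift)."""
    shifted = arr + shift
    return {
        "has_zero": bool((shifted == 0).any()),                 # log(0) -> -inf
        "has_inf": bool(np.isinf(arr).any()),
        "has_nan": bool(np.isnan(arr).any()),
        "has_negative_after_shift": bool((shifted < 0).any()),  # log(<0) -> NaN
    }

print(inspect_before_log(np.array([1.0, 2.0, 3.0])))
```

If every flag is False, applying np.log(arr + 10) to that array cannot produce a warning, which supports the idea that the warning comes from somewhere else in the pipeline.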

@joyceljy joyceljy requested review from DaniBodor and gcroci2 June 13, 2023 16:19
@DaniBodor DaniBodor requested review from DaniBodor and removed request for DaniBodor June 26, 2023 13:52
@DaniBodor

DaniBodor commented Jun 28, 2023

Every value seems fine (no 0s, no infs, no NaNs), and print(np.log(arr + 10)) also runs without any warnings.

There seem to be negative values. I tested the line features_transform = {'all': {'transform': lambda t: np.log(abs(t+10))}}, i.e. running the log function on an absolute value, and then it doesn't throw the warning. Did you also check for NaNs or missing values in the output (see example here)?
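A quick sketch of that variant (made-up values; abs() keeps the log argument positive):

```python
import numpy as np

# The lambda tested above; values are hypothetical, one entry below -10.
features_transform = {'all': {'transform': lambda t: np.log(abs(t + 10))}}

arr = np.array([-12.0, 5.0])
out = features_transform['all']['transform'](arr)

# abs() keeps the log argument positive, so no warning and no NaN here;
# note that t == -10 would still give log(0) == -inf.
print(np.isnan(out).any())  # False
```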

At this point, my main concern is not so much whether or which transformations are valid. Cube roots work where square roots don't because they can handle negative values; that keeps the test from throwing a warning, but it means that a user who applies a square root in a real-world case will have problems and may not even realize it (if they ignore the warning).

What I think should happen, irrespective of which functions we use as defaults, is that an error is raised instead of a warning whenever a transformation leads to "problematic" values. Otherwise the data is silently transformed into unintuitive/problematic values and/or values are dropped (I'm actually not sure which happens here). See an implementation of this here.

It's actually a good thing that the warnings are popping up: they expose the fringe effects of our code and tell us that strange things are happening that we need to deal with.


@DaniBodor DaniBodor left a comment

See my comment and let me know if you need help implementing that :)

@github-actions

This PR is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale issue not touched from too much time label Jul 14, 2023
@joyceljy

joyceljy commented Jul 27, 2023

  1. The RuntimeWarning: invalid value encountered in log is caused by passing negative values to the log function, which produces NaN and therefore triggers the warning.

For example, features_transform = {'electrostatic': {'transform': lambda t: np.log(t+10)}} was applied to the dataset. In the first graph (residue-ppi-BA-278809:M-P) of the train.hdf5 file, which contains the electrostatic feature, multiple edge_attr values are less than -10, which yields NaN when applying the lambda function np.log(t+10). (See screenshot below.)

[screenshot of edge_attr values omitted]

  2. The RuntimeWarning: invalid value encountered in sqrt has the same cause: negative values passed to the square root function produce NaN and trigger the warning.
    For example, features_transform = {'electrostatic': {'transform': lambda t: np.sqrt(t+50)}} was applied to the dataset. In the first graph (residue-ppi-BA-278809:M-P) of the train.hdf5 file, one edge_attr value is less than -50, which yields NaN when applying np.sqrt(t+50). (See screenshot below.)

[screenshot of edge_attr values omitted]
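The diagnosis can be reproduced in a few lines (hypothetical edge_attr values standing in for the real graph data):

```python
import warnings
import numpy as np

# Made-up edge_attr values; two entries are below -10, as described above
# for the residue-ppi-BA-278809:M-P graph.
edge_attr = np.array([-13.2, -11.7, 0.5])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = np.log(edge_attr + 10)

print(np.isnan(out[:2]).all())  # True: entries below -10 become NaN
print(any("invalid value" in str(w.message) for w in caught))  # True
```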

Improvements:
I will add a check for NaNs in each data point after the feature transformation is applied.
If any are found, an error will be raised telling the user which feature produced the NaN values.
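A minimal sketch of such a check (names and message are illustrative, not the final implementation):

```python
import numpy as np

def check_transformed(feat: str, transformed: np.ndarray) -> None:
    """Hypothetical sketch of the proposed check; names are illustrative."""
    if np.isnan(transformed).any():
        raise ValueError(
            f"Invalid value occurs when applying the transformation for feature {feat}. "
            f"Please change the transformation function for {feat}."
        )

# A transform that handles negatives passes silently...
check_transformed("electrostatic", np.cbrt(np.array([-12.0, 4.0])))

# ...while output containing NaN raises, naming the offending feature.
try:
    check_transformed("electrostatic", np.array([np.nan, 1.3]))
except ValueError as err:
    print("electrostatic" in str(err))  # True
```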

@github-actions github-actions bot removed the stale issue not touched from too much time label Jul 28, 2023
@DaniBodor

Improvements:
I will add a check for NaNs in each data point after the feature transformation is applied.
If any are found, an error will be raised telling the user which feature produced the NaN values.

Shall we meet to discuss the best way to resolve this, @gcroci2 and @joyceljy? I think last time we decided to leave it open until we discovered the cause of the warning messages.

@joyceljy

Improvements:
I will add a check for NaNs in each data point after the feature transformation is applied.
If any are found, an error will be raised telling the user which feature produced the NaN values.

Shall we meet to discuss the best way to resolve this, @gcroci2 and @joyceljy? I think last time we decided to leave it open until we discovered the cause of the warning messages.

I discussed the next steps with Giulia yesterday, since she is going to be on holiday for three days next week. I can start implementing the function and unit tests while she is away, and maybe we can have a follow-up meeting after she is back. But we can still meet, just the two of us, to confirm more details if you like! @DaniBodor

@DaniBodor

I think we had a discussion about whether we want to raise an error and stop the entire run, or only drop (and list) the entries that have NaNs and continue with the rest. At the time we said let's first check the source of the warnings and then decide how to proceed.
I'm not sure what you have now decided.

@DaniBodor

Another option we could consider (not sure whether you already have) is to first standardize and then apply the transformation. Standardized data should not contain any negative values, and we could create a rule that 0 always transforms to 0, even if the chosen transformation cannot handle 0s (e.g. log).
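A sketch of that "0 always maps to 0" rule (hypothetical helper, not part of the project's API):

```python
import numpy as np

def transform_with_zero_rule(t: np.ndarray, fn) -> np.ndarray:
    """Apply fn only to nonzero entries; zeros are mapped to zero explicitly."""
    out = np.zeros_like(t, dtype=float)
    nonzero = t != 0
    out[nonzero] = fn(t[nonzero])
    return out

vals = np.array([0.0, 1.0, np.e])
print(transform_with_zero_rule(vals, np.log))  # 0 stays 0; log applies elsewhere
```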

@joyceljy

joyceljy commented Aug 1, 2023

Another option we could consider (not sure whether you already have) is to first standardize and then apply the transformation. Standardized data should not contain any negative values, and we could create a rule that 0 always transforms to 0, even if the chosen transformation cannot handle 0s (e.g. log).

I just checked online whether standardization can be done before applying the transformation; the common approach is said to be standardizing afterward. Also, standardization is optional in this project: the user can choose transformation only, without standardization (this can be specified via the features_transform parameter).

@gcroci2

gcroci2 commented Aug 4, 2023

Another option we could consider (not sure whether you already have) is to first standardize and then apply the transformation. Standardized data should not contain any negative values, and we could create a rule that 0 always transforms to 0, even if the chosen transformation cannot handle 0s (e.g. log).

I just checked online whether standardization can be done before applying the transformation; the common approach is said to be standardizing afterward. Also, standardization is optional in this project: the user can choose transformation only, without standardization (this can be specified via the features_transform parameter).

Indeed, the reason we introduced transformations was to make the data more normal so that we could then apply standardization, since standardization should (ideally) be applied to normally distributed data only. Also, standardization centers the data around 0 with a standard deviation of 1, which by no means implies there are no negative values after standardizing.
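This is easy to verify numerically (made-up values):

```python
import numpy as np

# Standardization centers data at 0 with unit variance, so negative values
# are expected in the output; it does not make data safe for log/sqrt.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = (x - x.mean()) / x.std()

print(abs(z.mean()) < 1e-12)  # True: centered at ~0
print((z < 0).any())          # True: negative values remain
```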

@gcroci2

gcroci2 commented Aug 4, 2023

I think we had a discussion about whether we want to raise an error and stop the entire run, or only drop (and list) the entries that have NaNs and continue with the rest. At the time we said let's first check the source of the warnings and then decide how to proceed. I'm not sure what you have now decided.

Indeed, an update on that: there was a bug in the code causing a transformation to be applied to all features even if only a few were specified in the dict (because the transform flag was left True). This raised warnings from other features that did contain invalid values. That explains why we were confused by the unexpected warnings: they came not from the feature we wanted to modify (which contained only valid values), but from other features that contained invalid values and that the transformation was never supposed to touch.

That is now fixed in this PR. What I suggested to Joyce, and what I think is best at this stage, is to raise an error in the get() methods if warnings are caught during the transformations, and to print those warnings' messages.

The next step could be handling NaNs by filling them automatically, or by letting the user choose the fill values, but I am not sure that is worth implementing. I have many reasons to advise against it; let's discuss it together and decide what to do @DaniBodor.
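A sketch of the error-raising behaviour suggested above (a hypothetical wrapper, not the actual GraphDataset.get() code):

```python
import warnings
import numpy as np

def apply_transform_strict(feat: str, transform, values: np.ndarray) -> np.ndarray:
    """Escalate RuntimeWarnings from a transformation into errors,
    surfacing the original warning message to the user."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=RuntimeWarning)
        try:
            return transform(values)
        except RuntimeWarning as warn:
            raise ValueError(f"Transformation of feature {feat} failed: {warn}") from warn

# Works silently for a transform that is valid on the data...
print(apply_transform_strict("electrostatic", np.cbrt, np.array([-8.0, 8.0])))
```

Calling it with `np.log` on negative values would instead raise a ValueError whose message includes NumPy's "invalid value encountered in log" text.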

@gcroci2 gcroci2 requested a review from DaniBodor August 4, 2023 14:05

@DaniBodor DaniBodor left a comment

Again, a lot of work ended up going into what looked like a quick and simple PR.
Thanks @joyceljy for taking care of this!

Comment on lines 806 to 807
raise ValueError(f"Invalid value occurs when applying {transform} for feature {feat}. "
                 f"Please change the transformation function for {feat}.")

I am currently unsure whether this will crash the entire run or just skip that entry.
Either way, I think it would be nice to state which entry this occurs on, so that users can troubleshoot potential problems in the data rather than in the transformation.


Right now it crashes the entire run. Are we okay with that? I modified the message as you suggested.

deeprankcore/dataset.py — outdated review thread, resolved
tests/test_dataset.py — outdated review thread, resolved
@gcroci2 gcroci2 requested a review from DaniBodor August 11, 2023 10:21
@github-actions

This PR is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale issue not touched from too much time label Aug 28, 2023
@gcroci2 gcroci2 merged commit cad0824 into main Sep 4, 2023
@gcroci2 gcroci2 deleted the 441_Warning_test_dataset.py_joyceljy branch September 4, 2023 14:08
Successfully merging this pull request may close these issues.

Warnings thrown during standardization
3 participants