fix: warning messages for invalid data in test_dataset.py #442

Merged: 17 commits merged into main on Sep 4, 2023

Conversation


@joyceljy joyceljy commented Jun 2, 2023

This PR fixes the warnings raised when applying transformation functions to features.

  • For test_only_transform_graphdataset, test_transform_standardize_graphdataset, and test_features_transform_logic_graphdataset
    I changed the transformation function applied to the features electrostatic and all to np.cbrt(), which handles both negative and positive data without raising a warning.

  • For test_transform_standardize_graphdataset
    I did not see any warning messages on my side.

  • For test_only_transform_all_graphdataset
    This test still raises a warning because some feature types take invalid values under the log transformation.
    In the real training setup, however, a feature dictionary assigns a suitable transformation method to each feature. This test only checks that the all option works, so it applies an arbitrary transformation (np.log(t+10)) to every feature; the warning is therefore expected, and I will keep this test unchanged. :)
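The difference in warning behaviour between these transforms can be reproduced in isolation (a small sketch with made-up feature values):

```python
import warnings
import numpy as np

# Hypothetical feature values, including entries below -10.
arr = np.array([-12.0, -3.5, 0.0, 4.2])

# np.cbrt is defined for all real inputs, so it raises no warning even
# if every warning is escalated to an error.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    cbrt_result = np.cbrt(arr)

# np.log(t + 10) is undefined for t <= -10: NumPy emits a RuntimeWarning
# and produces NaN for those entries.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    log_result = np.log(arr + 10)

print(np.isnan(cbrt_result).any())  # False
print(np.isnan(log_result).any())   # True
print(len(caught) > 0)              # True
```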

@joyceljy joyceljy self-assigned this Jun 2, 2023
@joyceljy joyceljy linked an issue Jun 2, 2023 that may be closed by this pull request
@joyceljy joyceljy changed the title Fix warning messages for invalid data in test_dataset.py fix: warning messages for invalid data in test_dataset.py Jun 2, 2023
@joyceljy joyceljy requested review from gcroci2 and DaniBodor June 2, 2023 15:15

@DaniBodor DaniBodor left a comment

@gcroci2: How do we feel about users being able to apply transformations that produce invalid data, and what concretely happens to the data in these cases? Does it return a NaN, a 0, a complex number, or what?
I am surprised that numpy, or whichever package does the transformation, throws warnings instead of errors. Are we OK with warnings in our case, or do we want to raise an error when an invalid data point is detected?

Apart from that, if we keep it like this, I would be in favor of suppressing the warning in the test module. You can check in PR #249 how I did that there.
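For reference, one common way to suppress a known, expected warning inside a test (the exact mechanism used in PR #249 may differ) is the `warnings` context manager:

```python
import warnings
import numpy as np

# Silence the expected RuntimeWarning only inside this block; warnings
# elsewhere in the test module are unaffected.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    result = np.log(np.array([-5.0, 2.0]) + 4.0)  # log(-1) would normally warn

print(np.isnan(result[0]), np.isnan(result[1]))  # True False
```

Note that the invalid entry still silently becomes NaN, which is exactly the concern raised above about users missing the problem.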

@DaniBodor

Apologies, I accidentally pushed something unrelated onto this branch, but have now reverted it.


@gcroci2 gcroci2 left a comment

We're not "fixing" the warning messages here, we're only suppressing them. We were rushing a bit for the experiments, so we stopped trying to figure this out, but I do think it would be better to understand what is going on: why do these messages appear even when not expected, and, as Dani was saying, what happens to the data when values are invalid (do we get NaNs, complex numbers)?

What I'd do to start understanding what is going on is to look at the features before and after the transformations, using the testing data from the transformation tests, and check

  • which value is problematic
  • why that is
  • what it becomes if we apply the transformation anyway

You can easily retrieve the np.ndarray from the dataset, apply the transformation separately, and see what happens.

@joyceljy

joyceljy commented Jun 13, 2023

I tried printing the data array directly (passing features_transform to the _compute_features_manually function in the test_dataset.py file) and printing the result, e.g. print(arr + 10), before applying the np.log function.
Every value seems fine (no 0s, no infs, no NaNs), and print(np.log(arr + 10)) also runs without any warnings.
But if I use the original code, which passes features_transform to the GraphDataset object (which in turn calls the hdf5_to_pandas function to compute the dataset), the invalid-value warning appears, which is really strange.

I searched around a bit; it may be the issue described at https://www.geeksforgeeks.org/how-to-fix-runtimewarning-invalid-value-encountered-in-double_scalars/ (scroll down to Method 2): NumPy emits an invalid-value warning when a computation is not representable in the given structure. Maybe our hdf5_to_pandas function triggers this because it goes through a pandas DataFrame, whereas _compute_features_manually works fine because it operates on a plain data array?

I can show you the script and maybe discuss it together if my explanation is not clear enough.
But now I will suppress the warnings using the method Dani mentioned.
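The manual inspection described above could be sketched like this (hypothetical helper and values, not the project's actual code):

```python
import numpy as np

def inspect_before_log(arr: np.ndarray, shift: float = 10.0) -> dict:
    """Hypothetical helper: sanity-check a feature array before np.log(arr + shift)."""
    shifted = arr + shift
    return {
        "has_zero": bool((shifted == 0).any()),                 # log(0) -> -inf
        "has_inf": bool(np.isinf(arr).any()),
        "has_nan": bool(np.isnan(arr).any()),
        "has_negative_after_shift": bool((shifted < 0).any()),  # log(<0) -> NaN
    }

print(inspect_before_log(np.array([1.0, 2.0, 3.0])))
```

If every flag is False, applying np.log(arr + 10) to that array cannot produce a warning, which supports the idea that the warning comes from somewhere else in the pipeline.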

@joyceljy joyceljy requested review from DaniBodor and gcroci2 June 13, 2023 16:19
@DaniBodor DaniBodor requested review from DaniBodor and removed request for DaniBodor June 26, 2023 13:52
@DaniBodor

DaniBodor commented Jun 28, 2023

Every value seems fine (no 0s, no infs, no NaNs), and print(np.log(arr + 10)) also runs without any warnings.

There seem to be negative values. I tested the line features_transform = {'all': {'transform': lambda t: np.log(abs(t+10))}}, i.e. running the log function on an absolute value, and then it doesn't throw the warning. Did you also check for NaNs or missing values in the output (see example here)?
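A quick sketch of that variant (made-up values; abs() keeps the log argument positive):

```python
import numpy as np

# The lambda tested above; values are hypothetical, one entry below -10.
features_transform = {'all': {'transform': lambda t: np.log(abs(t + 10))}}

arr = np.array([-12.0, 5.0])
out = features_transform['all']['transform'](arr)

# abs() keeps the log argument positive, so no warning and no NaN here;
# note that t == -10 would still give log(0) == -inf.
print(np.isnan(out).any())  # False
```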

At this point, my main concern is not so much whether or which transformations are valid. Cube roots work where square roots don't because they can handle negative values; that keeps the test from throwing a warning, but it means that a user who applies a square root in a real-world case will have problems and may not even realize it (if they ignore the warning).

What I think should happen, irrespective of which functions we use as defaults, is that an error is raised instead of a warning whenever a transformation leads to "problematic" values. Otherwise the data is silently transformed into unintuitive/problematic values and/or values are dropped (I'm actually not sure which happens here). See an implementation of this here.

It's actually a good thing that the warnings are popping up: they expose the fringe effects of our code and tell us that strange things are happening that we need to deal with.


@DaniBodor DaniBodor left a comment

See my comment and let me know if you need help implementing that :)

@github-actions

This PR is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale issue not touched from too much time label Jul 14, 2023
@joyceljy

joyceljy commented Jul 27, 2023

  1. The RuntimeWarning: invalid value encountered in log is caused by passing negative values to the log function, which produces NaN and therefore triggers the warning.

For example, features_transform = {'electrostatic': {'transform': lambda t: np.log(t+10)}} was applied to the dataset. In the first graph (residue-ppi-BA-278809:M-P) of the train.hdf5 file, which contains the electrostatic feature, multiple edge_attr values are less than -10, which yields NaN when applying the lambda function np.log(t+10). (See screenshot below.)

[screenshot of edge_attr values omitted]

  2. The RuntimeWarning: invalid value encountered in sqrt has the same cause: negative values passed to the square root function produce NaN and trigger the warning.
    For example, features_transform = {'electrostatic': {'transform': lambda t: np.sqrt(t+50)}} was applied to the dataset. In the first graph (residue-ppi-BA-278809:M-P) of the train.hdf5 file, one edge_attr value is less than -50, which yields NaN when applying np.sqrt(t+50). (See screenshot below.)

[screenshot of edge_attr values omitted]
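The diagnosis can be reproduced in a few lines (hypothetical edge_attr values standing in for the real graph data):

```python
import warnings
import numpy as np

# Made-up edge_attr values; two entries are below -10, as described above
# for the residue-ppi-BA-278809:M-P graph.
edge_attr = np.array([-13.2, -11.7, 0.5])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = np.log(edge_attr + 10)

print(np.isnan(out[:2]).all())  # True: entries below -10 become NaN
print(any("invalid value" in str(w.message) for w in caught))  # True
```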

Improvements:
I will add a check for NaNs in each data point after the feature transformation is applied.
If any are found, an error will be raised telling the user which feature produced the NaN values.
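A minimal sketch of such a check (names and message are illustrative, not the final implementation):

```python
import numpy as np

def check_transformed(feat: str, transformed: np.ndarray) -> None:
    """Hypothetical sketch of the proposed check; names are illustrative."""
    if np.isnan(transformed).any():
        raise ValueError(
            f"Invalid value occurs when applying the transformation for feature {feat}. "
            f"Please change the transformation function for {feat}."
        )

# A transform that handles negatives passes silently...
check_transformed("electrostatic", np.cbrt(np.array([-12.0, 4.0])))

# ...while output containing NaN raises, naming the offending feature.
try:
    check_transformed("electrostatic", np.array([np.nan, 1.3]))
except ValueError as err:
    print("electrostatic" in str(err))  # True
```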

@github-actions github-actions bot removed the stale issue not touched from too much time label Jul 28, 2023
@DaniBodor

Improvements:
I will add a check for NaNs in each data point after the feature transformation is applied.
If any are found, an error will be raised telling the user which feature produced the NaN values.

Shall we meet to discuss the best way to resolve this, @gcroci2 and @joyceljy? I think last time we decided to leave it open until we discovered the cause of the warning messages.

@joyceljy

Improvements:
I will add a check for NaNs in each data point after the feature transformation is applied.
If any are found, an error will be raised telling the user which feature produced the NaN values.

Shall we meet to discuss the best way to resolve this, @gcroci2 and @joyceljy? I think last time we decided to leave it open until we discovered the cause of the warning messages.

I discussed the next steps with Giulia yesterday, since she is going to be on holiday for three days next week. I can start implementing the function and unit tests while she is away, and maybe we can have a follow-up meeting after she is back. But we can still meet, just the two of us, to confirm more details if you like! @DaniBodor

@DaniBodor

I think we had a discussion about whether we want to raise an error and stop the entire run, or only drop (and list) the entries that have NaNs and continue with the rest. At the time we said let's first check the source of the warnings and then decide how to proceed.
I'm not sure what you have now decided.

@DaniBodor

Another option we could consider (not sure whether you already have) is to first standardize and then apply the transformation. Standardized data should not contain any negative values, and we could create a rule that 0 always transforms to 0, even if the chosen transformation cannot handle 0s (e.g. log).
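A sketch of that "0 always maps to 0" rule (hypothetical helper, not part of the project's API):

```python
import numpy as np

def transform_with_zero_rule(t: np.ndarray, fn) -> np.ndarray:
    """Apply fn only to nonzero entries; zeros are mapped to zero explicitly."""
    out = np.zeros_like(t, dtype=float)
    nonzero = t != 0
    out[nonzero] = fn(t[nonzero])
    return out

vals = np.array([0.0, 1.0, np.e])
print(transform_with_zero_rule(vals, np.log))  # 0 stays 0; log applies elsewhere
```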

@joyceljy

joyceljy commented Aug 1, 2023

Another option we could consider (not sure whether you already have) is to first standardize and then apply the transformation. Standardized data should not contain any negative values, and we could create a rule that 0 always transforms to 0, even if the chosen transformation cannot handle 0s (e.g. log).

I just checked online whether standardization can be done before applying the transformation; the common approach is said to be standardizing afterward. Also, standardization is optional in this project: the user can choose transformation only, without standardization (this can be specified via the features_transform parameter).

@gcroci2

gcroci2 commented Aug 4, 2023

Another option we could consider (not sure whether you already have) is to first standardize and then apply the transformation. Standardized data should not contain any negative values, and we could create a rule that 0 always transforms to 0, even if the chosen transformation cannot handle 0s (e.g. log).

I just checked online whether standardization can be done before applying the transformation; the common approach is said to be standardizing afterward. Also, standardization is optional in this project: the user can choose transformation only, without standardization (this can be specified via the features_transform parameter).

Indeed, the reason we introduced transformations was to make the data more normal so that we could then apply standardization, since standardization should (ideally) be applied to normally distributed data only. Also, standardization centers the data around 0 with a standard deviation of 1, which by no means implies there are no negative values after standardizing.
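This is easy to verify numerically (made-up values):

```python
import numpy as np

# Standardization centers data at 0 with unit variance, so negative values
# are expected in the output; it does not make data safe for log/sqrt.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = (x - x.mean()) / x.std()

print(abs(z.mean()) < 1e-12)  # True: centered at ~0
print((z < 0).any())          # True: negative values remain
```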

@gcroci2

gcroci2 commented Aug 4, 2023

I think we had a discussion about whether we want to raise an error and stop the entire run, or only drop (and list) the entries that have NaNs and continue with the rest. At the time we said let's first check the source of the warnings and then decide how to proceed. I'm not sure what you have now decided.

Indeed, an update on that: there was a bug in the code causing a transformation to be applied to all features even if only a few were specified in the dict (because the transform flag was left True). This raised warnings from other features that did contain invalid values. That explains why we were confused by the unexpected warnings: they came not from the feature we wanted to modify (which contained only valid values), but from other features that contained invalid values and that the transformation was never supposed to touch.

That is now fixed in this PR. What I suggested to Joyce, and what I think is best at this stage, is to raise an error in the get() methods if warnings are caught during the transformations, and to print those warnings' messages.

The next step could be handling NaNs by filling them automatically, or by letting the user choose the fill values, but I am not sure that is worth implementing. I have many reasons to advise against it; let's discuss it together and decide what to do @DaniBodor.
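A sketch of the error-raising behaviour suggested above (a hypothetical wrapper, not the actual GraphDataset.get() code):

```python
import warnings
import numpy as np

def apply_transform_strict(feat: str, transform, values: np.ndarray) -> np.ndarray:
    """Escalate RuntimeWarnings from a transformation into errors,
    surfacing the original warning message to the user."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=RuntimeWarning)
        try:
            return transform(values)
        except RuntimeWarning as warn:
            raise ValueError(f"Transformation of feature {feat} failed: {warn}") from warn

# Works silently for a transform that is valid on the data...
print(apply_transform_strict("electrostatic", np.cbrt, np.array([-8.0, 8.0])))
```

Calling it with `np.log` on negative values would instead raise a ValueError whose message includes NumPy's "invalid value encountered in log" text.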

@gcroci2 gcroci2 requested a review from DaniBodor August 4, 2023 14:05

@DaniBodor DaniBodor left a comment

Again, a lot of work ended up going into what looked like a quick and simple PR.
Thanks @joyceljy for taking care of this!

Comment on lines 806 to 807
raise ValueError(f"Invalid value occurs when applying {transform} for feature {feat}. "
                 f"Please change the transformation function for {feat}.")

I am currently unsure whether this will crash the entire run or just skip that entry.
Either way, I think it would be nice to state which entry this occurs on, so that users can troubleshoot potential problems in the data rather than in the transformation.


Right now it crashes the entire run. Are we okay with that? I modified the message as you suggested.

deeprankcore/dataset.py — outdated review thread, resolved
tests/test_dataset.py — outdated review thread, resolved
@gcroci2 gcroci2 requested a review from DaniBodor August 11, 2023 10:21
@github-actions

This PR is stale because it has been open for 14 days with no activity.

@github-actions github-actions bot added the stale issue not touched from too much time label Aug 28, 2023
@gcroci2 gcroci2 merged commit cad0824 into main Sep 4, 2023
@gcroci2 gcroci2 deleted the 441_Warning_test_dataset.py_joyceljy branch September 4, 2023 14:08
Successfully merging this pull request may close these issues.

Warnings thrown during standardization
3 participants