Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Standardize method applied on individual variables in OTC/dOTC #1896

Merged
merged 7 commits into from
Sep 4, 2024

Conversation

coxipi
Copy link
Contributor

@coxipi coxipi commented Aug 29, 2024

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • Tests for the changes have been added (for bug fixes / features)
    • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • CHANGELOG.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

  • The mean and std in the 'standardize' normalization method of OTC/dOTC now is correctly applied on each variable individually
  • While we're at it, I suggest changing the argument transform to normalization which I feel is more specific and informative. Also, in optimal_transport, we should follow the snake convention, numIterMax -> num_iter_max (I guess this was done to imitate the signature of ot.emd, but I think we need to stick to a standard in xclim functions)

Does this PR introduce a breaking change?

  • transform -> normalization in input arguments.

Other information:

@github-actions github-actions bot added the sdba Issues concerning the sdba submodule. label Aug 29, 2024
@coxipi coxipi changed the title FIX: Standardize method applied on individual variables FIX: Standardize method applied on individual variables in dOTC Aug 29, 2024
@coxipi coxipi changed the title FIX: Standardize method applied on individual variables in dOTC FIX: Standardize method applied on individual variables in OTC/dOTC Aug 29, 2024
Copy link
Collaborator

@Zeitsperre Zeitsperre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a fine change. The transform signature is very new, so no need for a deprecation warning message. Just need to update the CHANGELOG.

@github-actions github-actions bot added the approved Approved for additional tests label Aug 29, 2024
@coxipi
Copy link
Contributor Author

coxipi commented Aug 29, 2024

@SarahG-579462 Thoughts on these changes? Especially transform -> normalization which is a matter of taste.

@coveralls
Copy link

coveralls commented Aug 29, 2024

Coverage Status

coverage: 89.393%. remained the same
when pulling 1999981 on fix_dotc_standardization
into 32cb8f5 on main.

@coxipi
Copy link
Contributor Author

coxipi commented Aug 29, 2024

It's a step in the right direction, but I realze that's not even what I had in mind for the standardization. Right now, we just remove the mean/std of the cells, but those are simply evenly distributed. So, it doesn't say much about the distributions at hand, we don't have the information of the probability of each cell (or the count). Perhaps a proper standardization would not work well, but this remains to be seen

Copy link
Contributor

@SarahG-579462 SarahG-579462 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Did you want to include your work with the time step vs grid cell motion, or do you want that in a different PR?

@coxipi
Copy link
Contributor Author

coxipi commented Aug 30, 2024

Maybe the dOTC evolution through cells (more fidel to the paper) can wait another PR.

But, as I was refering in my last comment, I don't think what we were doing previously is what I had in mind for standardization. Both are interesting in their own way, and don't yield very different results as far as I can see. The new use muX to weight over different cells, in this way we truly standardize the time series X and not its grid (which are just the position of the bins without their count, this is why I need to use muX to obtain a proper weighting)

    def _standardize(grid, mu):
        mu0 = mu.reshape(len(mu), 1)
        mean = (mu0*grid).sum(axis=0)
        std =  np.sqrt((mu0*(grid**2-mean**2)).sum(axis=0))
        return (grid-mean)/std
   
    # this actually computes  `( X - mean(X) ) / std(X)`
    if normalization == "standardize":
        gridX = _standardize(gridX, muX)
        gridY = _standardize(gridY, muY)

    # this is the old way, it computes `(gridX - mean(gridX) ) / std(gridX)`
    elif normalization  == "standardize_cells":
        gridX = (gridX - gridX.mean(axis=0)) / gridX.std(axis=0)
        gridY = (gridY - gridY.mean(axis=0)) / gridY.std(axis=0)
        

The differences are not huge though:

image

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@github-actions github-actions bot added the docs Improvements to documenation label Sep 4, 2024
@coxipi
Copy link
Contributor Author

coxipi commented Sep 4, 2024

I'll just merge this PR now, other changes can be in another PR.

@coxipi coxipi merged commit 4b23e12 into main Sep 4, 2024
21 checks passed
@coxipi coxipi deleted the fix_dotc_standardization branch September 4, 2024 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Approved for additional tests docs Improvements to documenation sdba Issues concerning the sdba submodule.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants