Learning explanations that are hard to vary

This repo contains the code to implement the methods from the paper Learning explanations that are hard to vary (arxiv.org/abs/2009.00329).

Interested in trying it on a new dataset?

In our experience, these are the most important hyperparameters to tune, and we would include all of them in a wide hyperparameter search:

  • learning rate: potentially higher than usual. In the hyperparameter search we usually sample it from a log_range(1e-4, 1e-0).
  • weight decay:
    • Potentially much higher than usual; we usually search over a log_range(1e-5, 1e-0).
    • (already the default) It is applied after the mask, so it affects the weights even for masked features.
  • inverse scaling: rescales the remaining gradients by the ratio of entries that survived the mask in each layer. (This is a pretty extreme re-scaling; we haven't tried any other so far.) We add scale_grad_inverse_sparsity as a boolean hyperparameter in the search. See the sketch after this list.
  • geom mean: in some cases (e.g. if there is some noise and few environments, as is the case for the notebook) the and_mask approximation is worse, and it's best to go for the geometric mean (the downside is that gradients get even smaller). We also just set this as an option (method) in the hyperparameter search.
  • optimizer: Adam or SGD. Adam rescales gradients, so the two can behave quite differently.
  • agreement_threshold: 1 might work best in some synthetic environments, but might be too strict for real-life environments. Definitely search over this too.

Let us know what ends up working best, so that hopefully over time we can make this list and these ranges shorter =)
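
For reference, here is a minimal sketch of how these pieces fit together: the AND-mask with its agreement_threshold, the inverse-sparsity rescaling, and the geometric-mean variant, all combining per-environment gradients for a single parameter tensor. The function name and tensor layout are our own for illustration, not the repo's exact code; see the and_mask package for the actual implementation.

import torch

def combine_env_gradients(env_grads, agreement_threshold=1.0,
                          scale_grad_inverse_sparsity=True, method="and_mask"):
    # env_grads: tensor of shape (n_envs, *param_shape), one gradient per environment.
    signs = torch.sign(env_grads)
    # Sign agreement across environments, in [0, 1]: 1 means every environment
    # agrees on the sign, 0 means the signs cancel out exactly.
    agreement = signs.mean(dim=0).abs()
    mask = (agreement >= agreement_threshold).float()

    if method == "and_mask":
        combined = env_grads.mean(dim=0)
    else:
        # "geom_mean": sign-consistent geometric mean of the per-environment magnitudes.
        avg_sign = torch.sign(signs.sum(dim=0))
        combined = avg_sign * torch.exp(torch.log(env_grads.abs() + 1e-10).mean(dim=0))

    combined = combined * mask

    if scale_grad_inverse_sparsity:
        # Rescale surviving entries by the inverse of the fraction that survived
        # the mask in this tensor (the "extreme" rescaling mentioned above).
        combined = combined / mask.mean().clamp(min=1e-10)

    return combined

With agreement_threshold=0 the mask keeps everything and this reduces to plain gradient averaging, which is why that setting recovers the standard SGD baseline below.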

Instructions

To run the baseline (standard SGD), use method='and_mask' and agreement_threshold=0.

There are two command-line examples:

Synthetic dataset

python -m and_mask.run_synthetic \
        --method=and_mask \
        --agreement_threshold=1. \
        --n_train_envs=16 \
        --n_agreement_envs=16 \
        --batch_size=256 \
        --n_dims=16 \
        --scale_grad_inverse_sparsity=1 \
        --use_cuda=1 \
        --n_hidden_units=256

CIFAR-10

python -m and_mask.run_cifar \
        --random_labels_fraction 1.0 \
        --agreement_threshold 0.2 \
        --method and_mask \
        --epochs 80 \
        --weight_decay 1e-06 \
        --scale_grad_inverse_sparsity 1 \
        --init_lr 0.0005 \
        --weight_decay_order before \
        --output_dir /tmp/

A simple linear regression example

See the notebook folder.
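
If you want to apply the idea outside the notebook, a training step might look roughly like the following. This reuses the combine_env_gradients sketch from above; the model, loss, and per-environment batches are placeholders, not the notebook's actual code.

import torch

# Hypothetical setup: a linear model and one (x, y) batch per environment.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

def masked_step(env_batches, agreement_threshold=1.0):
    # Collect one gradient per environment for every parameter.
    per_env_grads = []
    for x, y in env_batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        per_env_grads.append([p.grad.detach().clone() for p in model.parameters()])

    # Replace each parameter's gradient with the masked combination, then step.
    optimizer.zero_grad()
    for param, grads in zip(model.parameters(), zip(*per_env_grads)):
        stacked = torch.stack(list(grads), dim=0)  # (n_envs, *param_shape)
        param.grad = combine_env_gradients(stacked, agreement_threshold)
    optimizer.step()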

BibTeX

@misc{parascandolo2020learning,
      title={Learning explanations that are hard to vary}, 
      author={Giambattista Parascandolo and Alexander Neitz and Antonio Orvieto and Luigi Gresele and Bernhard Schölkopf},
      year={2020},
      eprint={2009.00329},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
