Learning explanations that are hard to vary

This repo contains the code to implement the methods from the paper Learning explanations that are hard to vary (arxiv.org/abs/2009.00329).

Interested in trying it on a new dataset?

In our experience, these are the most important hyperparameters to tune, and we would include all of them in a wide hyperparameter search:

  • learning rate: potentially higher than usual. In the hyperparameter search we usually sample it from a log_range(1e-4, 1e-0).
  • weight decay:
    • Potentially much higher than usual; we usually search over a log_range(1e-5, 1e-0).
    • (already the default) It is applied after the mask, so it affects the weights even for masked features.
  • inverse scaling: rescales the remaining gradients by the ratio of entries that survived the mask in each layer. (This is a pretty extreme re-scaling; we haven't tried any other so far.) We add scale_grad_inverse_sparsity as a boolean hyperparameter in the search. See the sketch after this list.
  • geom mean: in some cases (e.g. if there is some noise and few environments, as is the case for the notebook) the and_mask approximation is worse, and it's best to go for the geometric mean (the downside is that gradients get even smaller). We also just set this as an option (method) in the hyperparameter search.
  • optimizer: Adam or SGD. Adam rescales gradients, so the two can behave quite differently.
  • agreement_threshold: 1 might work best in some synthetic environments, but might be too strict for real-life environments. Definitely search over this too.

Let us know what ends up working best, so that hopefully over time we can make this list and these ranges shorter =)
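
For reference, here is a minimal sketch of how these pieces fit together: the AND-mask with its agreement_threshold, the inverse-sparsity rescaling, and the geometric-mean variant, all combining per-environment gradients for a single parameter tensor. The function name and tensor layout are our own for illustration, not the repo's exact code; see the and_mask package for the actual implementation.

import torch

def combine_env_gradients(env_grads, agreement_threshold=1.0,
                          scale_grad_inverse_sparsity=True, method="and_mask"):
    # env_grads: tensor of shape (n_envs, *param_shape), one gradient per environment.
    signs = torch.sign(env_grads)
    # Sign agreement across environments, in [0, 1]: 1 means every environment
    # agrees on the sign, 0 means the signs cancel out exactly.
    agreement = signs.mean(dim=0).abs()
    mask = (agreement >= agreement_threshold).float()

    if method == "and_mask":
        combined = env_grads.mean(dim=0)
    else:
        # "geom_mean": sign-consistent geometric mean of the per-environment magnitudes.
        avg_sign = torch.sign(signs.sum(dim=0))
        combined = avg_sign * torch.exp(torch.log(env_grads.abs() + 1e-10).mean(dim=0))

    combined = combined * mask

    if scale_grad_inverse_sparsity:
        # Rescale surviving entries by the inverse of the fraction that survived
        # the mask in this tensor (the "extreme" rescaling mentioned above).
        combined = combined / mask.mean().clamp(min=1e-10)

    return combined

With agreement_threshold=0 the mask keeps everything and this reduces to plain gradient averaging, which is why that setting recovers the standard SGD baseline below.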

Instructions

To run the baseline (standard SGD), use method='and_mask' and agreement_threshold=0.

There are two command-line examples:

Synthetic dataset

python -m and_mask.run_synthetic \
        --method=and_mask \
        --agreement_threshold=1. \
        --n_train_envs=16 \
        --n_agreement_envs=16 \
        --batch_size=256 \
        --n_dims=16 \
        --scale_grad_inverse_sparsity=1 \
        --use_cuda=1 \
        --n_hidden_units=256

CIFAR-10

python -m and_mask.run_cifar \
        --random_labels_fraction 1.0 \
        --agreement_threshold 0.2 \
        --method and_mask \
        --epochs 80 \
        --weight_decay 1e-06 \
        --scale_grad_inverse_sparsity 1 \
        --init_lr 0.0005 \
        --weight_decay_order before \
        --output_dir /tmp/

A simple linear regression example

See the notebook folder.
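
If you want to apply the idea outside the notebook, a training step might look roughly like the following. This reuses the combine_env_gradients sketch from above; the model, loss, and per-environment batches are placeholders, not the notebook's actual code.

import torch

# Hypothetical setup: a linear model and one (x, y) batch per environment.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

def masked_step(env_batches, agreement_threshold=1.0):
    # Collect one gradient per environment for every parameter.
    per_env_grads = []
    for x, y in env_batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        per_env_grads.append([p.grad.detach().clone() for p in model.parameters()])

    # Replace each parameter's gradient with the masked combination, then step.
    optimizer.zero_grad()
    for param, grads in zip(model.parameters(), zip(*per_env_grads)):
        stacked = torch.stack(list(grads), dim=0)  # (n_envs, *param_shape)
        param.grad = combine_env_gradients(stacked, agreement_threshold)
    optimizer.step()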

BibTeX

@misc{parascandolo2020learning,
      title={Learning explanations that are hard to vary}, 
      author={Giambattista Parascandolo and Alexander Neitz and Antonio Orvieto and Luigi Gresele and Bernhard Schölkopf},
      year={2020},
      eprint={2009.00329},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
