Commit

feature mind the spikes
MoHawastaken committed Nov 11, 2024
1 parent ec0a8dd commit a79e985
Showing 3 changed files with 39 additions and 16 deletions.
content/publication/featurelearning-ssms/index.md (2 changes: 1 addition & 1 deletion)
```diff
@@ -7,7 +7,7 @@ title: 'On Feature Learning in Structured State Space Models'
 authors:
 - Leena Chennuru Vankadara*
 - Jin Xu*
-- admin
+- Moritz Haas
 - Volkan Cevher
 
 # Author notes (optional)
```
content/publication/haas-2023-mindthespikes/index.md (43 changes: 30 additions & 13 deletions)
```diff
@@ -1,16 +1,33 @@
 ---
-title: 'Mind the spikes: Benign overfitting of kernels and neural networks in fixed
-  dimension'
+title: 'Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension'
 
 authors:
-- Moritz Haas
-- David Holzmüller
-- Ulrike Luxburg
-- Ingo Steinwart
-date: '2023-05-01'
-publishDate: '2023-05-01'
-publication_types:
-- paper-conference
-publication: '*Advances in Neural Information Processing Systems*'
-url_pdf:
-  https://proceedings.neurips.cc/paper_files/paper/2023/file/421f83663c02cdaec8c3c38337709989-Paper-Conference.pdf
+- Moritz Haas*
+- David Holzmüller*
+- Ulrike Luxburg
+- Ingo Steinwart
+
+
+date: '2023-05-23T00:00:00Z'
+doi: ''
+
+publishDate: '2023-05-23T00:00:00Z'
+
+publication: In NeurIPS 2023.
+
+featured: true
+
+url_pdf: 'https://arxiv.org/pdf/2305.14077'
+url_code: 'https://github.com/moritzhaas/mind-the-spikes'
+url_dataset: ''
+url_poster: ''
+url_project: ''
+url_slides: ''
+url_source: ''
+url_video: ''
+
+---
+
+When can kernel and neural network models that overfit noisy data generalize nearly optimally?
+
+Previous literature had suggested that kernel methods can only exhibit such 'benign overfitting' if the input dimension grows with the number of data points. We show that, while overfitting leads to inconsistency with common estimators, adequately designed spiky-smooth estimators can achieve benign overfitting in arbitrary fixed dimension. For neural networks in the NTK parametrization, it suffices to add tiny fluctuations to the activation function. It remains to study whether a similar adaptation of the activation function, or some other inductive bias towards spiky-smooth functions, can also lead to benign overfitting with feature-learning neural architectures and complex datasets.
```
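The abstract's "tiny fluctuations" recipe lends itself to a short illustration: a standard smooth activation plus a low-amplitude, high-frequency oscillation. The sketch below is a minimal illustration under stated assumptions, not code from the linked repository; the function name and the amplitude/frequency values are illustrative choices, not the constants derived in the paper.

```python
import numpy as np

def spiky_smooth_activation(x, amplitude=1e-2, frequency=1e3):
    """ReLU plus a tiny high-frequency oscillation (illustrative only).

    The small `amplitude` leaves the smooth ReLU component almost
    unchanged, while the high `frequency` contributes the spiky
    component that can interpolate label noise without distorting
    the smooth fit. Both default values are assumptions chosen for
    illustration, not the paper's tuned constants.
    """
    return np.maximum(x, 0.0) + amplitude * np.sin(frequency * x)

# The perturbation is uniformly small: |f(x) - relu(x)| <= amplitude.
x = np.linspace(-2.0, 2.0, 9)
print(np.round(spiky_smooth_activation(x), 4))
```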
content/publication/haas-2024-effective/index.md (10 changes: 8 additions & 2 deletions)
```diff
@@ -5,7 +5,10 @@ title: 'mup^2: Effective Sharpness Aware Minimization Requires Layerwise Perturb
 # If you created a profile for a user (e.g. the default `admin` user), write the username (folder name) here
 # and it will be replaced with their full name and linked to their profile.
 authors:
-- admin
+- Moritz Haas
+- Jin Xu
+- Volkan Cevher
+- Leena Chennuru Vankadara
 
 # Author notes (optional)
 #author_notes:
```
```diff
@@ -75,4 +78,7 @@ projects:
 #slides: example
 ---
 
-Naively scaling standard neural network architectures and optimization algorithms loses desirable properties such as feature learning in large models (see the Tensor Program series by Greg Yang et al.). We show the same for sharpness aware minimization (SAM) algorithms: There exists a unique nontrivial width-dependent and layerwise perturbation scaling for SAM that effectively perturbs all layers and provides in width-independent dynamics. A crucial practical benefit is transfer of optimal learning rate and perturbation radius jointly across model scales. In a second paper, we show that for the popular Mamba architecture, the maximal update parameterization and its related spectral scaling condition fail to induce the correct scaling properties, due to Mambas structured Hippo matrix and its selection mechanism. We derive the correct scaling using random matrix theory that necessarily goes beyond the Tensor Programs framework.
+Naively scaling standard neural network architectures and optimization algorithms loses desirable properties such as feature learning in large models (see the Tensor Programs series by Greg Yang et al.). We show the same for sharpness aware minimization (SAM) algorithms: there exists a unique nontrivial width-dependent and layerwise perturbation scaling for SAM that effectively perturbs all layers and provides width-independent dynamics. A crucial practical benefit is the joint transfer of the optimal learning rate and perturbation radius across model scales.
+
+
+In a [second paper](https://mohawastaken.github.io/publication/featurelearning-ssms/), we show that for the popular Mamba architecture, the maximal update parameterization and its related spectral scaling condition fail to induce the correct scaling properties, due to Mamba's structured HiPPO matrix and its selection mechanism. We derive the correct scaling using random matrix theory that necessarily goes beyond the Tensor Programs framework.
```
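To make the layerwise-scaling idea concrete, here is a minimal SAM step with a per-layer, width-dependent perturbation radius. This is a sketch under stated assumptions, not the authors' implementation: the inverse-square-root width rule in `layer_rho` and the use of the leading tensor dimension as a width proxy are illustrative stand-ins for the scaling the paper derives.

```python
import torch

def sam_step(model, loss_fn, data, target, base_rho=0.05, lr=0.1):
    """One SAM update with layerwise perturbation radii (illustrative)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # First forward/backward pass: gradients at the current weights.
    loss = loss_fn(model(data), target)
    grads = torch.autograd.grad(loss, params)

    # Perturb each layer with its own width-dependent radius.
    eps = []
    with torch.no_grad():
        for p, g in zip(params, grads):
            width = p.shape[0]  # crude width proxy, an assumption
            layer_rho = base_rho / width ** 0.5  # illustrative exponent
            e = layer_rho * g / (g.norm() + 1e-12)
            p.add_(e)
            eps.append(e)

    # Second pass: gradients at the perturbed weights.
    loss_adv = loss_fn(model(data), target)
    grads_adv = torch.autograd.grad(loss_adv, params)

    # Undo the perturbation and apply a plain SGD step with the
    # gradients taken at the perturbed point.
    with torch.no_grad():
        for p, e, g in zip(params, eps, grads_adv):
            p.sub_(e)
            p.sub_(lr * g)
```

Note the design choice this sketch highlights: each layer gets its own radius instead of the single global radius of standard SAM, which is what allows the perturbation to remain effective in every layer as width grows.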
