Commit

feature mind the spikes
MoHawastaken committed Nov 11, 2024
1 parent ec0a8dd commit a79e985
Showing 3 changed files with 39 additions and 16 deletions.
content/publication/featurelearning-ssms/index.md (2 changes: 1 addition & 1 deletion)
```diff
@@ -7,7 +7,7 @@ title: 'On Feature Learning in Structured State Space Models'
 authors:
 - Leena Chennuru Vankadara*
 - Jin Xu*
-- admin
+- Moritz Haas
 - Volkan Cevher
 
 # Author notes (optional)
```
content/publication/haas-2023-mindthespikes/index.md (43 changes: 30 additions & 13 deletions)
```diff
@@ -1,16 +1,33 @@
 ---
-title: 'Mind the spikes: Benign overfitting of kernels and neural networks in fixed
-  dimension'
+title: 'Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension'
 
 authors:
-- Moritz Haas
-- David Holzmüller
-- Ulrike Luxburg
-- Ingo Steinwart
-date: '2023-05-01'
-publishDate: '2023-05-01'
-publication_types:
-- paper-conference
-publication: '*Advances in Neural Information Processing Systems*'
-url_pdf:
-  https://proceedings.neurips.cc/paper_files/paper/2023/file/421f83663c02cdaec8c3c38337709989-Paper-Conference.pdf
+- Moritz Haas*
+- David Holzmüller*
+- Ulrike Luxburg
+- Ingo Steinwart
+
+
+date: '2023-05-23T00:00:00Z'
+doi: ''
+
+publishDate: '2023-05-23T00:00:00Z'
+
+publication: In NeurIPS 2023.
+
+featured: true
+
+url_pdf: 'https://arxiv.org/pdf/2305.14077'
+url_code: 'https://github.com/moritzhaas/mind-the-spikes'
+url_dataset: ''
+url_poster: ''
+url_project: ''
+url_slides: ''
+url_source: ''
+url_video: ''
+
+---
+
+When can kernel and neural network models that overfit noisy data generalize nearly optimally?
+
+Previous literature had suggested that kernel methods can only exhibit such 'benign overfitting' if the input dimension grows with the number of data points. We show that, while overfitting leads to inconsistency with common estimators, adequately designed spiky-smooth estimators can achieve benign overfitting in arbitrary fixed dimension. For neural networks in the NTK parametrization, it suffices to add tiny fluctuations to the activation function. It remains to study whether a similar adaptation of the activation function, or some other inductive bias towards spiky-smooth functions, can also lead to benign overfitting with feature-learning neural architectures and complex datasets.
```
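The abstract's "tiny fluctuations" recipe lends itself to a short illustration: a standard smooth activation plus a low-amplitude, high-frequency oscillation. The sketch below is a minimal illustration under stated assumptions, not code from the linked repository; the function name and the amplitude/frequency values are illustrative choices, not the constants derived in the paper.

```python
import numpy as np

def spiky_smooth_activation(x, amplitude=1e-2, frequency=1e3):
    """ReLU plus a tiny high-frequency oscillation (illustrative only).

    The small `amplitude` leaves the smooth ReLU component almost
    unchanged, while the high `frequency` contributes the spiky
    component that can interpolate label noise without distorting
    the smooth fit. Both default values are assumptions chosen for
    illustration, not the paper's tuned constants.
    """
    return np.maximum(x, 0.0) + amplitude * np.sin(frequency * x)

# The perturbation is uniformly small: |f(x) - relu(x)| <= amplitude.
x = np.linspace(-2.0, 2.0, 9)
print(np.round(spiky_smooth_activation(x), 4))
```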
content/publication/haas-2024-effective/index.md (10 changes: 8 additions & 2 deletions)
```diff
@@ -5,7 +5,10 @@ title: 'mup^2: Effective Sharpness Aware Minimization Requires Layerwise Perturb
 # If you created a profile for a user (e.g. the default `admin` user), write the username (folder name) here
 # and it will be replaced with their full name and linked to their profile.
 authors:
-- admin
+- Moritz Haas
+- Jin Xu
+- Volkan Cevher
+- Leena Chennuru Vankadara
 
 # Author notes (optional)
 #author_notes:
```
```diff
@@ -75,4 +78,7 @@ projects:
 #slides: example
 ---
 
-Naively scaling standard neural network architectures and optimization algorithms loses desirable properties such as feature learning in large models (see the Tensor Program series by Greg Yang et al.). We show the same for sharpness aware minimization (SAM) algorithms: There exists a unique nontrivial width-dependent and layerwise perturbation scaling for SAM that effectively perturbs all layers and provides in width-independent dynamics. A crucial practical benefit is transfer of optimal learning rate and perturbation radius jointly across model scales. In a second paper, we show that for the popular Mamba architecture, the maximal update parameterization and its related spectral scaling condition fail to induce the correct scaling properties, due to Mambas structured Hippo matrix and its selection mechanism. We derive the correct scaling using random matrix theory that necessarily goes beyond the Tensor Programs framework.
+Naively scaling standard neural network architectures and optimization algorithms loses desirable properties such as feature learning in large models (see the Tensor Programs series by Greg Yang et al.). We show the same for sharpness aware minimization (SAM) algorithms: there exists a unique nontrivial width-dependent and layerwise perturbation scaling for SAM that effectively perturbs all layers and provides width-independent dynamics. A crucial practical benefit is the joint transfer of the optimal learning rate and perturbation radius across model scales.
+
+
+In a [second paper](https://mohawastaken.github.io/publication/featurelearning-ssms/), we show that for the popular Mamba architecture, the maximal update parameterization and its related spectral scaling condition fail to induce the correct scaling properties, due to Mamba's structured HiPPO matrix and its selection mechanism. We derive the correct scaling using random matrix theory that necessarily goes beyond the Tensor Programs framework.
```
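To make the layerwise-scaling idea concrete, here is a minimal SAM step with a per-layer, width-dependent perturbation radius. This is a sketch under stated assumptions, not the authors' implementation: the inverse-square-root width rule in `layer_rho` and the use of the leading tensor dimension as a width proxy are illustrative stand-ins for the scaling the paper derives.

```python
import torch

def sam_step(model, loss_fn, data, target, base_rho=0.05, lr=0.1):
    """One SAM update with layerwise perturbation radii (illustrative)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # First forward/backward pass: gradients at the current weights.
    loss = loss_fn(model(data), target)
    grads = torch.autograd.grad(loss, params)

    # Perturb each layer with its own width-dependent radius.
    eps = []
    with torch.no_grad():
        for p, g in zip(params, grads):
            width = p.shape[0]  # crude width proxy, an assumption
            layer_rho = base_rho / width ** 0.5  # illustrative exponent
            e = layer_rho * g / (g.norm() + 1e-12)
            p.add_(e)
            eps.append(e)

    # Second pass: gradients at the perturbed weights.
    loss_adv = loss_fn(model(data), target)
    grads_adv = torch.autograd.grad(loss_adv, params)

    # Undo the perturbation and apply a plain SGD step with the
    # gradients taken at the perturbed point.
    with torch.no_grad():
        for p, e, g in zip(params, eps, grads_adv):
            p.sub_(e)
            p.sub_(lr * g)
```

Note the design choice this sketch highlights: each layer gets its own radius instead of the single global radius of standard SAM, which is what allows the perturbation to remain effective in every layer as width grows.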
