<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Matthew Finlayson's personal website.">
<title>Matt Fin</title>
<link rel="apple-touch-icon" sizes="180x180" href="img/fin-180.png">
<link rel="icon" type="image/png" sizes="32x32" href="img/fin-32.png">
<link rel="icon" type="image/png" sizes="16x16" href="img/fin-16.png">
<link rel="manifest" href="favicon/site.webmanifest">
<link rel="stylesheet" href="style/main.css">
</head>
<body>
<header>
<img src='img/profile3.jpg' alt='Matthew Finlayson'>
<h1 id="matthew-finlayson">Matthew Finlayson</h1>
</header>
<nav>
<ul>
<li><a href='feed.xml'>RSS</a></li>
<li><a href="files/cv.pdf">CV</a></li>
<li><a href="https://scholar.google.com/citations?user=37YtY2EAAAAJ&hl=en&oi=ao">Google Scholar</a></li>
<li><a href='https://bsky.app/profile/mattf1n.bsky.social'>Bluesky</a></li>
<!-- <li><a href='https://twitter.com/mattf1n'>Twitter</a></li> -->
<!-- <li><a href="https://www.semanticscholar.org/author/Matthew-Finlayson/1580418311">Semantic Scholar</a></li> -->
<li><a href='https://github.com/mattf1n'>GitHub</a></li>
</ul>
</nav>
<main>
<section>
<h2 id=About>About</h2>
<p>
Hello!
I am a PhD student at USC, advised by Swabha Swa­yam­dip­ta and Xiang Ren.
Previously, I was a Predoctoral Researcher at AI2,
and before that I studied computer science and linguistics at Harvard.
</p>
<p>
My current research focuses on improving language modeling, sampling, and interpretability methods
by building and exploiting our theoretical understanding of neural language models.
</p>
<p>You can reach me at <code>mattbnfin[at]gmail[dot]com</code>.</p>
</section>
<section>
<h2 id=News>News</h2>
<table>
<tr>
<td><time>Oct 2024</time></td>
<td>Decoding survey paper accepted to TMLR.</td>
</tr>
<tr>
<td><time>Sep 2024</time></td>
<td>Tutorial on decoding methods accepted to NeurIPS.</td>
</tr>
<tr>
<td><time>Jul 2024</time></td>
<td>Paper accepted to COLM.</td>
</tr>
<tr>
<td><time>Jun 2024</time></td>
<td>
Interning at Meta GenAI.
</td>
</tr>
<tr>
<td><time>Apr 2024</time></td>
<td>
Spoke at FAIR and USC ISI on stealing ChatGPT's hidden size.
</td>
</tr>
<tr>
<td><time>Jan 2024</time></td>
<td>
<a href="files/ccc.pdf">Spoke</a> at CMU LTI on decoding and the softmax bottleneck.
</td>
</tr>
<tr>
<td><time>Jan 2024</time></td>
<td>Paper accepted to ICLR.</td>
</tr>
<tr>
<td><time>Oct 2023</time></td>
<td>Paper accepted to EMNLP.</td>
</tr>
<tr>
<td><time>Aug 2023</time></td>
<td>Joined USC as a PhD student in NLP.</td>
</tr>
<tr>
<td><time>Mar 2023</time></td>
<td>Selected for NSF GRFP Honorable Mention.</td>
</tr>
<tr>
<td><time>Feb 2023</time></td>
<td><a href="files/math.pdf">Spoke</a> at IST/Unbabel on math reasoning evaluation.</td>
</tr>
<tr>
<td><time>Jan 2023</time></td><td><q>Decomposed Prompting</q> accepted to ICLR.</td>
</tr>
<tr>
<td><time>Nov 2022</time></td>
<td>
<a href="files/instructions.pdf">Spoke</a>
at <a href="https://flann.super.site">FLaNN</a>
on formal languages and instruction learning.
</td>
</tr>
<tr>
<td><time>Oct 2022</time></td><td>Two papers accepted to EMNLP.</td>
</tr>
<tr>
<td><time>Aug 2021</time></td><td>Joined AI2 as a pre-doctoral researcher.</td>
</tr>
</table>
</section>
<section>
<h2 id=posts>Posts</h2>
<ul>
<!-- <li><a href="apologies.html">Don't apologize.</a></li> -->
<li><a href="ensemble.html">The <q>right way</q> to ensemble language models.</a></li>
<li><a href="differentiable-binary-to-onehot.html">A differentiable function from binary to one-hot representations.</a></li>
<li><a href="deep-ba-sampling.html">Deep BA sampling (extending BAT).</a></li>
<li><a href="interest-demo.html">Research interest demos for working with me.</a></li>
<li><a href="openlogprobs.html">Obtaining logprobs from an LLM API.</a></li>
<li><a href="smislinear.html">The softmax function is linear.</a></li>
<li><a href="gallery.html">Visualizations</a></li>
</ul>
</section>
<section>
<h2 id=Software>Software</h2>
<ul>
<li><a href="https://github.com/justinchiu/openlogprobs">OpenLogProbs</a>: a library for obtaining logprobs from API-protected language models.</li>
<li><a href="https://github.com/mattf1n/ss">SS.py</a>: my personal command line tool for searching and citing academic papers via Semantic Scholar.</li>
</ul>
</section>
<section>
<h2 id=Publications>Preprints &amp; publications</h2>
<ol>
<li>
<a href="https://arxiv.org/abs/2406.16838"><h3>From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models</h3></a>
<p>
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui
</p>
<ul>
<li><cite>TMLR</cite> <time>2024</time></li>
<li>
<a href="https://arxiv.org/abs/2406.16838">Paper</a>
</li>
</ul>
<details>
<summary>Abstract</summary>
<p>
One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.
</p>
</details>
</li>
<li>
<h3><a href="https://arxiv.org/abs/2403.09539">Logits of API-Protected LLMs Leak Proprietary Information</a></h3>
<p>
Matthew Finlayson, Xiang Ren, and Swabha Swa­yam­dip­ta
</p>
<ul>
<li><cite>COLM</cite> <time>2024</time></li>
<li>
<a href="https://arxiv.org/abs/2403.09539">Paper</a>
</li>
<li><a href="files/lll.pdf">Slides</a></li>
<li><a href="https://www.youtube.com/watch?v=3U9nA-l2YAs">Video</a></li>
</ul>
<details>
<summary>Abstract</summary>
<p>The commercialization of large language models (LLMs) has led to the
common practice of high-level API-only access to proprietary models. In
this work, we show that even with a conservative assumption about the
model architecture, it is possible to learn a surprisingly large amount
of non-public information about an API-protected LLM from a relatively
small number of API queries (e.g., costing under $1,000 for OpenAI’s
gpt-3.5-turbo). Our findings are centered on one key observation: most
modern LLMs suffer from a softmax bottleneck, which restricts the model
outputs to a linear subspace of the full output space. We show that this
lends itself to a model image or a model signature which unlocks several
capabilities with affordable cost: efficiently discovering the LLM’s
hidden size, obtaining full-vocabulary outputs, detecting and
disambiguating different model updates, identifying the source LLM given
a single full LLM output, and even estimating the output layer
parameters. Our empirical investigations show the effectiveness of our
methods, which allow us to estimate the embedding size of OpenAI’s
gpt-3.5-turbo to be about 4,096. Lastly, we discuss ways that LLM
providers can guard against these attacks, as well as how these
capabilities can be viewed as a feature (rather than a bug) by allowing
for greater transparency and accountability.</p>
</details>
</li>
<li>
<h3><a href="http://arxiv.org/abs/2310.01693">Closing the Curious Case of Neural Text Degeneration</a></h3>
<p>
Matthew Finlayson, John Hewitt, Alexander Koller, Swabha Swa­yam­dip­ta, and Ashish Sabharwal
</p>
<ul>
<li><cite>ICLR</cite> <time>2024</time></li>
<li><a href="http://arxiv.org/abs/2310.01693">Paper</a></li>
<li><a href="files/ccc.pdf">Slides</a></li>
<li><a href="https://github.com/mattf1n/basis-aware-threshold">Code</a></li>
</ul>
<details>
<summary>Abstract</summary>
<p>Despite their ubiquity in language generation, it remains unknown why
truncation sampling heuristics like nucleus sampling are so effective.
We provide a theoretical explanation for the effectiveness of the
truncation sampling by proving that truncation methods that discard
tokens below some probability threshold (the most common type of
truncation) can guarantee that all sampled tokens have nonzero true
probability. However, thresholds are a coarse heuristic, and necessarily
discard some tokens with nonzero true probability as well. In pursuit of
a more precise sampling strategy, we show that we can leverage a known
source of model errors, the softmax bottleneck, to prove that certain
tokens have nonzero true probability, without relying on a threshold.
Based on our findings, we develop an experimental truncation strategy
and present pilot studies demonstrating the promise of this type of
algorithm. Our evaluations show that our method outperforms its
threshold-based counterparts under automatic and human evaluation
metrics for low-entropy (i.e., close to greedy) open-ended text
generation. Our theoretical findings and pilot experiments provide both
insight into why truncation sampling works, and make progress toward
more expressive sampling algorithms that better surface the generative
capabilities of large language models.</p>
</details>
</li>
<li>
<h3>Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy</h3>
<p>
Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord,
Peter Clark, and Ashish Sabharwal
</p>
<ul>
<li><cite>EMNLP</cite> <time>2023</time></li>
<li>
<a href="https://arxiv.org/abs/2305.14596">Paper</a>
</li>
<li>
<a href="https://github.com/allenai/revisiting_surface_form_competition">Code</a>
</li>
</ul>
<details>
<summary>Abstract</summary>
<p>When pretrained language models (LMs) are applied to discriminative
tasks such as multiple-choice questions, they place probability mass on
vocabulary tokens that aren’t among the given answer choices. Spreading
probability mass across multiple surface forms with identical meaning
(such as “bath” and “bathtub”) is thought to cause an underestimation of a
model’s true performance, referred to as the “surface form
competition” (SFC) hypothesis. This has motivated the introduction of
various probability normalization methods. However, many core questions
remain unanswered. How do we measure SFC? Are there direct ways of
reducing it, and does doing so improve task performance? We propose a
mathematical formalism for SFC which allows us to quantify and bound its
impact for the first time. We identify a simple method for reducing it –
namely, increasing probability mass on the given answer choices by a)
including them in the prompt and b) using in-context learning with even
just one example. We show this method eliminates the impact of SFC in
the majority of instances. Our experiments on three diverse datasets and
six LMs reveal several additional surprising findings. For example, both
normalization and prompting methods for reducing SFC can be ineffective
or even detrimental to task performance for some LMs. We conclude with
practical insights for effectively prompting LMs for multiple-choice
tasks.</p>
</details>
</li>
<li>
<h3>Decomposed Prompting: A Modular Approach for Solving Complex Tasks</h3>
<p>
Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu,
Kyle Richardson, Peter Clark, and Ashish Sabharwal
</p>
<ul>
<li><cite>ICLR</cite> <time>2023</time></li>
<li>
<a href="https://arxiv.org/abs/2210.02406">Paper</a>
</li>
<li>
<a href="https://github.com/allenai/DecomP">Code</a>
</li>
</ul>
<details>
<summary>Abstract</summary>
<p>Few-shot prompting is a surprisingly powerful way to use Large
Language Models (LLMs) to solve various tasks. However, this approach
struggles as the task complexity increases or when the individual
reasoning steps of the task themselves are hard to learn, especially
when embedded in more complex tasks. To address this, we propose
Decomposed Prompting, a new approach to solve complex tasks by
decomposing them (via prompting) into simpler sub-tasks that can be
delegated to a library of prompting-based LLMs dedicated to these
sub-tasks. This modular structure allows each prompt to be optimized for
its specific sub-task, further decomposed if necessary, and even easily
replaced with more effective prompts, trained models, or symbolic
functions if desired. We show that the flexibility and modularity of
Decomposed Prompting allows it to outperform prior work on few-shot
prompting using GPT3. On symbolic reasoning tasks, we can further
decompose sub-tasks that are hard for LLMs into even simpler solvable
sub-tasks. When the complexity comes from the input length, we can
recursively decompose the task into the same task but with smaller
inputs. We also evaluate our approach on textual multi-step reasoning
tasks: on long-context multi-hop QA task, we can more effectively teach
the sub-tasks via our separate sub-tasks prompts; and on open-domain
multi-hop QA, we can incorporate a symbolic information retrieval within
our decomposition framework, leading to improved performance on both
tasks.</p>
</details>
</li>
<li>
<h3>Līla: A Unified Benchmark for Mathematical Reasoning</h3>
<p>
{Matthew Finlayson, Swaroop Mishra,}
Pan Lu, Leonard Tang, Sean Welleck,
Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord,
Ashish Sabharwal, Peter Clark, and Ashwin Kalyan
</p>
<ul>
<li><cite>EMNLP</cite> <time>2022</time></li>
<li>
<a href="https://arxiv.org/abs/2210.17517">Paper</a>
</li>
<li> <a href="files/math.pdf">Slides</a> </li>
<li>
<a href="https://github.com/allenai/Lila">Data</a>
</li>
<li>
<a href="https://huggingface.co/allenai/bhaskara">Model</a>
</li>
<li>
<a href="https://lila.apps.allenai.org">Website</a>
</li>
</ul>
<details>
<summary>Abstract</summary>
<p>Mathematical reasoning skills are essential for general-purpose
intelligent systems to perform tasks from grocery shopping to climate
modeling. Towards evaluating and improving AI systems in this domain, we
propose LILA, a unified mathematical reasoning benchmark consisting of 23
diverse tasks along four dimensions: (i) mathematical abilities, e.g.,
arithmetic, calculus; (ii) language format, e.g., question-answering,
fill-in-the-blanks; (iii) language diversity, e.g., no language, simple
language; (iv) external knowledge, e.g., commonsense, physics. We
construct our benchmark by extending 20 existing datasets, collecting
task instructions and solutions in the form of Python programs, thereby
obtaining explainable solutions in addition to the correct answer. We
additionally introduce two evaluation datasets to measure
out-of-distribution performance and robustness to language
perturbation. Finally, we introduce BHASKARA, a general-purpose
mathematical reasoning model trained on LILA. Importantly, we find that
multi-tasking leads to significant improvements (average relative
improvement of 21.83% F1 score vs. single-task models), while the
best-performing model only obtains 60.40%, indicating room for improvement
in general mathematical reasoning and understanding.</p>
</details>
</li>
<li>
<h3>
What Makes Instruction Learning Hard?
An Investigation and a New Challenge in a Synthetic Environment
</h3>
<p>
Matthew Finlayson, Kyle Richardson, Ashish Sabharwal, and Peter Clark
</p>
<ul>
<li><cite>EMNLP</cite> <time>2022</time></li>
<li>
<a href="https://arxiv.org/abs/2204.09148">Paper</a>
</li>
<li><a href="files/instructions.pdf">Slides</a></li>
<li><a href="https://youtu.be/MhlzxbfIys4">Video</a></li>
<li>
<a href="https://github.com/allenai/RegSet">Code</a>
</li>
</ul>
<details>
<summary>Abstract</summary>
<p>The instruction learning paradigm—where a model learns to perform new
tasks from task descriptions alone—has become popular in research on
general-purpose models. The capabilities of large transformer models as
instruction learners, however, remain poorly understood. We use a
controlled synthetic environment to characterize such capabilities.
Specifically, we use the task of deciding whether a given string matches
a regular expression (viewed as an instruction) to identify properties
of tasks, instructions, and instances that make instruction learning
challenging. For instance, we find that our model, a fine-tuned T5-based
text2text transformer, struggles with large regular languages,
suggesting that less precise instructions are challenging for models.
Instruction executions that require tracking longer contexts of prior
steps are also difficult. We use our findings to systematically
construct a challenging instruction learning dataset, which we call Hard
RegSet. Fine-tuning on Hard RegSet, our large transformer learns to
correctly interpret (with at least 90% accuracy) only 65.6% of test
instructions, and 11%-24% of the instructions in out-of-distribution
generalization settings. We thus propose Hard RegSet as a challenging
instruction learning dataset, and a controlled environment for studying
instruction learning.</p>
</details>
</li>
<li>
<a href="https://aclanthology.org/2021.acl-long.144/">
<h3>Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models</h3>
</a>
<p>
{Matthew Finlayson, Aaron Mueller,}
Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov
</p>
<ul>
<li><cite>ACL</cite> <time>2021</time></li>
<li>
<a href="https://aclanthology.org/2021.acl-long.144/">Paper</a>
</li>
<li>
<a href="https://github.com/mattf1n/lm-intervention">Code</a>
</li>
</ul>
<details>
<summary>Abstract</summary>
<p>Targeted syntactic evaluations have demonstrated the ability of
language models to perform subject-verb agreement given difficult
contexts. To elucidate the mechanisms by which the models accomplish
this behavior, this study applies causal mediation analysis to
pre-trained neural language models. We investigate the magnitude of
models’ preferences for grammatical inflections, as well as whether
neurons process subject-verb agreement similarly across sentences with
different syntactic structures. We uncover similarities and differences
across architectures and model sizes—notably, that larger models do not
necessarily learn stronger preferences. We also observe two distinct
mechanisms for producing subject-verb agreement depending on the
syntactic structure of the input sentence. Finally, we find that
language models rely on similar sets of neurons when given sentences
with similar syntactic structure.</p>
</details>
</li>
</ol>
</section>
</main>
<footer><img src="img/fin.png" alt=""></footer>
</body>
</html>