Mega Card: Improve Analysis App in various ways to facilitate better interpretability analysis of the new models #44

jbloomAus · 2023-04-16T23:27:49Z

Analysis features

Static

Composition

Make composition maps
Replace composition scores with strip plots?
Create a meta-composition score. Something that measures total influence?
How do we check for composition between MLP_in and W_out? (seems expensive?, maybe tie to very specific hypotheses)
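For reference, a minimal sketch of how the composition maps could be computed, assuming the normalised Frobenius-norm score used in the "Mathematical Framework for Transformer Circuits" write-up. The function names, the row-vector convention (`x @ W`), and the idea of passing full `d_model x d_model` circuit matrices are assumptions for illustration, not what the app currently does.

```python
import numpy as np


def composition_score(w_a: np.ndarray, w_b: np.ndarray) -> float:
    """Score how strongly a later matrix w_b reads what an earlier matrix w_a writes.

    Assumes a row-vector convention, so the composed map is x @ w_a @ w_b.
    Returns ||w_a @ w_b||_F / (||w_a||_F * ||w_b||_F).
    """
    product = w_a @ w_b
    denom = np.linalg.norm(w_a) * np.linalg.norm(w_b)
    return float(np.linalg.norm(product) / denom)


def composition_map(writers, readers) -> np.ndarray:
    """Matrix of scores with one row per writer (e.g. OV circuits of earlier heads)
    and one column per reader (e.g. QK circuits of later heads)."""
    return np.array([[composition_score(w, r) for r in readers] for w in writers])
```

The rows or columns of the resulting matrix are what a strip plot would show, and summing a row would be one crude candidate for a "total influence" number.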

Dynamic

Logit Lens

By Layer
By Layer accumulated
By Head
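A rough sketch of the "by layer" and "by layer accumulated" variants, assuming we already have each component's contribution to the residual stream at the final position. `logit_lens`, `layer_outputs` and `w_u` are hypothetical names, and this ignores the final LayerNorm, which a real implementation would need to handle.

```python
import numpy as np


def logit_lens(layer_outputs: np.ndarray, w_u: np.ndarray, accumulate: bool = False) -> np.ndarray:
    """Project per-layer residual contributions onto the output logits.

    layer_outputs: (n_components, d_model), one row per layer (or per head).
    w_u: (d_model, n_actions) unembedding matrix.
    With accumulate=True, row i is the logits of the residual stream after component i.
    """
    contributions = np.asarray(layer_outputs, dtype=float)
    if accumulate:
        contributions = np.cumsum(contributions, axis=0)
    return contributions @ w_u
```

Passing per-head outputs instead of per-layer outputs gives the "by head" view with the same function.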

Attention Maps:

Make it easier to export a nice visualization of the attention map (cv is actually not great for that).
Make it possible to calculate the rank-k approximation to the attention map.
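The rank-k approximation could be done with a truncated SVD. A minimal sketch, assuming the attention map for one head is available as a 2D array (`rank_k_approximation` is a hypothetical helper name):

```python
import numpy as np


def rank_k_approximation(attn: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation (in the Frobenius norm) of a 2D attention map."""
    u, s, vt = np.linalg.svd(attn, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return (u[:, :k] * s[:k]) @ vt[:k, :]
```

Note the approximated rows will no longer sum exactly to 1, so the result is a low-rank summary rather than a valid attention pattern.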

Causal

Activation Patching (features)

RTG Scan

Switch to using t-lens for decomp
Provide more than one level of decomp
Add a clustergram to show heads which mediate a similar relationship between RTG and logits/logit diff

Congruence -> If features aren't in superposition, what effect do they have on the predictions?

Renew old features:

QK circuit visualizations for action and RTG embeddings

SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing.

QK Circuit SVD (see #69: SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing.)
OV Circuit SVD
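A sketch of the QK case, assuming per-head weights shaped `(d_model, d_head)` so that the full circuit is `W_Q @ W_K.T`. The helper name and `top_k` argument are made up for illustration; the OV case would be the same with `W_V @ W_O`.

```python
import numpy as np


def qk_circuit_svd(w_q: np.ndarray, w_k: np.ndarray, top_k: int = 3):
    """SVD of the full QK circuit for one head.

    w_q, w_k: (d_model, d_head). The circuit W_Q @ W_K.T is (d_model, d_model)
    but has rank at most d_head, so only the first d_head directions matter.
    Returns (query directions, singular values, key directions).
    """
    w_qk = w_q @ w_k.T
    u, s, vt = np.linalg.svd(w_qk)
    return u[:, :top_k], s[:top_k], vt[:top_k, :]
```

Projecting the action or RTG embeddings onto the returned query and key directions would then show which inputs each direction responds to.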

Cache Characterization?

Plot L2 norm of residual streams (along with mean and std)
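Something like the following would give the numbers to plot, assuming the cached residual stream can be stacked into an array of shape `(n_layers, n_positions, d_model)` (the function name and that layout are assumptions):

```python
import numpy as np


def residual_norm_stats(resid: np.ndarray):
    """Mean and std (over positions) of the L2 norm of the residual stream per layer.

    resid: (n_layers, n_positions, d_model).
    Returns two arrays of shape (n_layers,).
    """
    norms = np.linalg.norm(resid, axis=-1)  # (n_layers, n_positions)
    return norms.mean(axis=1), norms.std(axis=1)
```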

Advanced

Implement Path Patching

Understand Callum's code.

Implement AVEC

Reread post to see if we can find.

Several things I feel are missing which are required for exploratory analysis to be more complete:

visualise dot product of time embeddings with each other
visualise dot product of positional embeddings with each other
Use Jay's head type analysis but write specific patterns for attending to RTG, attending to positive RTG, attending to states, and attending to actions.
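For the embedding dot products, a minimal sketch assuming the time or positional embeddings are available as a `(n_positions, d_model)` matrix. `embedding_similarity` is a hypothetical helper; the cosine normalisation is optional and is my addition.

```python
import numpy as np


def embedding_similarity(emb: np.ndarray, normalise: bool = True) -> np.ndarray:
    """Pairwise dot products between embedding rows, as an (n, n) matrix.

    With normalise=True each row is scaled to unit length first, so the
    result is a cosine-similarity matrix that is easier to read as a heatmap.
    """
    emb = np.asarray(emb, dtype=float)
    if normalise:
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = emb / np.maximum(norms, 1e-12)  # guard against all-zero rows
    return emb @ emb.T
```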

Several things I feel will be required for falsifying predictions of how the model is working:

implement a variant of path patching for DTs either in a notebook or as part of the app.
CaSc, not sure how feasible this is but it has always been the goal.

jbloomAus · 2023-04-16T23:56:02Z

Would storing/calculating mean kurtosis of activations be interesting? https://transformer-circuits.pub/2023/privileged-basis/index.html
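A sketch of what that could look like, assuming activations are stored as `(n_tokens, d_model)` and kurtosis is taken across the hidden dimension for each token, then averaged. The function name is hypothetical. As I understand the linked post, values near 3 are what a Gaussian-like (no privileged basis) activation would give, and much larger values hint at a privileged basis.

```python
import numpy as np


def mean_kurtosis(acts: np.ndarray) -> float:
    """Mean (non-excess) kurtosis of activations.

    acts: (n_tokens, d_model). Kurtosis is computed per token across the
    d_model axis as E[(x - mean)^4] / var^2, then averaged over tokens.
    """
    centred = acts - acts.mean(axis=-1, keepdims=True)
    var = (centred ** 2).mean(axis=-1)
    fourth = (centred ** 4).mean(axis=-1)
    return float((fourth / var ** 2).mean())
```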

jbloomAus · 2023-04-17T22:29:13Z

On a whim I added basic history visualization. Main issues are:

One-hot encoded observations can't easily be visualized by co-opting the grid render method, which makes this difficult. I just rendered the whole state view, but this feels inaccurate/bad.
Indexing is a little messy with the adjustment, but I think I sorted it.

I also started a time embedding dot product visualization but didn't finish it; I'll leave it there. It didn't seem super interesting.

jbloomAus · 2023-05-08T02:29:44Z

Plot L2 norm of residual streams (gives a sense of the amount of info in a layer compared to the amount of info going into the logits).

jbloomAus added this to Decision Transformer Interpretability Apr 16, 2023

jbloomAus converted this from a draft issue Apr 16, 2023

jbloomAus changed the title ~~Improve Analysis App in various ways to facilitate better interpretability analysis of the new models~~ Mega Card: Improve Analysis App in various ways to facilitate better interpretability analysis of the new models Apr 16, 2023

jbloomAus added the enhancement New feature or request label Apr 16, 2023

jbloomAus moved this from Todo to In Progress in Decision Transformer Interpretability May 10, 2023

jbloomAus moved this from In Progress to Todo in Decision Transformer Interpretability May 10, 2023

jbloomAus moved this from Todo to In Progress in Decision Transformer Interpretability May 10, 2023

jbloomAus removed this from Decision Transformer Interpretability May 22, 2023
