Mega Card: Improve Analysis App in various ways to facilitate better interpretability analysis of the new models #44

Open
23 of 41 tasks
jbloomAus opened this issue Apr 16, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@jbloomAus
Owner

jbloomAus commented Apr 16, 2023

Analysis features

Static

Composition

  • Make composition maps
  • Replace composition scores with strip plots?
  • Create a meta-composition score. Something that measures total influence?
  • How do we check for composition between MLP_in and W_out? (Seems expensive; maybe tie this to very specific hypotheses.)
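
As a reference point for the composition items above, here is a minimal sketch of one common composition score: the Frobenius-norm ratio used in "A Mathematical Framework for Transformer Circuits". The matrix names, shapes, and the random weights are assumptions for illustration; the app's actual weight layout may differ.

```python
import numpy as np

def composition_score(w_later: np.ndarray, w_earlier: np.ndarray) -> float:
    """Return ||w_later @ w_earlier||_F / (||w_later||_F * ||w_earlier||_F).

    Both matrices are assumed to act on the residual stream, so both are
    [d_model, d_model]. The score lies in [0, 1] because the Frobenius
    norm is submultiplicative.
    """
    product = w_later @ w_earlier
    denom = np.linalg.norm(w_later) * np.linalg.norm(w_earlier)
    return float(np.linalg.norm(product) / denom)

# Hypothetical stand-ins for an earlier head's OV circuit and a later head's QK circuit.
rng = np.random.default_rng(0)
d_model = 16
w_ov = rng.normal(size=(d_model, d_model))
w_qk = rng.normal(size=(d_model, d_model))
score = composition_score(w_qk, w_ov)
```

Computing this for every (earlier, later) pair gives the matrix a composition map or strip plot would display. Random matrices give a nonzero baseline, so scores are usually read relative to that baseline rather than to zero.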

Dynamic

Logit Lens

  • By Layer
  • By Layer accumulated
  • By Head
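
A rough sketch of the three logit lens views, assuming each component (layer or head) writes an additive contribution to the residual stream and that the final layer norm can be ignored. `logit_lens`, the shapes, and the random data are illustrative, not the app's API.

```python
import numpy as np

def logit_lens(component_outputs: np.ndarray, w_u: np.ndarray):
    """Project residual-stream contributions through the unembedding.

    component_outputs: [n_components, d_model], one row per layer (or per head).
    w_u: [d_model, n_actions] unembedding matrix.
    Returns (per_component, accumulated), each [n_components, n_actions].
    """
    per_component = component_outputs @ w_u
    # Accumulated view: logits implied by the residual stream after each component.
    accumulated = np.cumsum(component_outputs, axis=0) @ w_u
    return per_component, accumulated

rng = np.random.default_rng(0)
n_layers, d_model, n_actions = 3, 16, 7
layer_outputs = rng.normal(size=(n_layers, d_model))
w_u = rng.normal(size=(d_model, n_actions))
per_layer, accumulated = logit_lens(layer_outputs, w_u)
```

The "By Head" view is the same call with one row per head instead of one row per layer.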

Attention Maps:

  • Make it easier to export a nice visualization of the attention map (cv is actually not great for that).
  • Make it possible to calculate the rank(k) approximation to the attention map.
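
The rank(k) approximation can be sketched with a truncated SVD, which gives the best rank-k approximation in the Frobenius norm. The causal attention pattern below is synthetic, and `rank_k_approximation` is a hypothetical helper name.

```python
import numpy as np

def rank_k_approximation(attn: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of a 2D attention map via truncated SVD."""
    u, s, vt = np.linalg.svd(attn, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

# Synthetic causal attention pattern: masked scores followed by a row-wise softmax.
rng = np.random.default_rng(0)
n_pos = 8
scores = rng.normal(size=(n_pos, n_pos))
causal_mask = np.tril(np.ones((n_pos, n_pos), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn = attn / attn.sum(axis=1, keepdims=True)

approx = rank_k_approximation(attn, k=2)
errors = [np.linalg.norm(attn - rank_k_approximation(attn, k)) for k in range(1, n_pos + 1)]
```

Plotting `errors` against k shows how quickly a head's pattern is captured by a few components. Note that the approximation is not itself a valid attention pattern: rows need not sum to 1 and entries can be negative.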

Causal

Activation Patching (features)

  • Set up component
  • Set up RTG Metric
  • Residual stream patching.
  • Patching via Attn and MLP
  • Head All Pos Patching
  • Head Specific Pos Patching (do later)
  • Head All Pos by Component
  • MLP at different Positions
  • Show counterfactual attention map (ie: show difference in attention given intervention)
  • Show what the logit diff is for each metric score.

Activation Patching (token variations):

  • Action (fairly easy)
  • Key/Ball (important!)
  • Timestep (also fairly easy)
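
The patching items above share one mechanic: run the model on a clean and a corrupted input, write a cached clean activation into the corrupted run, and report how much of the metric is recovered. Below is a toy sketch of that loop. `ToyResidualModel` is a hypothetical single-position stand-in, not the Decision Transformer, and the logit diff between two actions stands in for the real metric.

```python
import numpy as np

class ToyResidualModel:
    """Hypothetical stand-in: each layer adds its output to one residual vector."""

    def __init__(self, d_model=8, n_layers=3, n_actions=4, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]
        self.w_u = rng.normal(size=(d_model, n_actions))

    def forward(self, x, patch_out=None, patch_resid=None):
        """Run the model, optionally overwriting activations.

        patch_out / patch_resid map a layer index to the activation written in
        at that layer. Returns logits, layer outputs, and residual streams.
        """
        patch_out = patch_out or {}
        patch_resid = patch_resid or {}
        resid = np.array(x, dtype=float)
        outs, resids = [], []
        for i, w in enumerate(self.layers):
            out = patch_out.get(i, np.tanh(resid @ w))
            resid = patch_resid.get(i, resid + out)
            outs.append(out)
            resids.append(resid)
        return resid @ self.w_u, outs, resids

def logit_diff(logits, a=0, b=1):
    """Metric: logit of action a minus logit of action b."""
    return float(logits[a] - logits[b])

def fraction_recovered(patched, clean, corrupted):
    """0 means no change from the corrupted run, 1 means clean behaviour restored."""
    return (patched - corrupted) / (clean - corrupted)

model = ToyResidualModel()
rng = np.random.default_rng(1)
clean_x, corrupt_x = rng.normal(size=8), rng.normal(size=8)
clean_logits, clean_outs, clean_resids = model.forward(clean_x)
corrupt_logits, corrupt_outs, _ = model.forward(corrupt_x)
clean_m, corrupt_m = logit_diff(clean_logits), logit_diff(corrupt_logits)

by_layer_out, by_layer_resid = [], []
for layer in range(len(model.layers)):
    logits, _, _ = model.forward(corrupt_x, patch_out={layer: clean_outs[layer]})
    by_layer_out.append(fraction_recovered(logit_diff(logits), clean_m, corrupt_m))
    logits, _, _ = model.forward(corrupt_x, patch_resid={layer: clean_resids[layer]})
    by_layer_resid.append(fraction_recovered(logit_diff(logits), clean_m, corrupt_m))
```

With a single position, patching the whole residual stream trivially restores the clean run, so `by_layer_resid` is all ones here. In a real model the residual stream would be patched one position at a time, which is what makes the per-position and per-head variants informative.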

RTG Scan

  • Switch to using t-lens for decomp
  • Provide more than one level of decomp
  • Add a clustergram to show heads which mediate a similar relationship between RTG and logits/logit diff

Congruence -> If features aren't in superposition, what effect do they have on the predictions?

  • Pos
  • Time
  • W_in
  • W_Out
  • MLP Out

Renew old features:

  • QK circuit visualizations for action and RTG embeddings

SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing.

Cache Characterization?

  • Plot L2 norm of residual streams (along with mean and std)
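
A minimal sketch of the statistic behind that plot, assuming the cached residual streams can be stacked into an array of shape [n_layers, batch, pos, d_model]. The function name and the stacking convention are assumptions.

```python
import numpy as np

def residual_norm_stats(resid: np.ndarray):
    """Mean and std of the residual-stream L2 norm, per layer.

    resid: [n_layers, batch, pos, d_model].
    Returns two arrays of shape [n_layers].
    """
    norms = np.linalg.norm(resid, axis=-1)        # [n_layers, batch, pos]
    flat = norms.reshape(norms.shape[0], -1)      # pool over batch and position
    return flat.mean(axis=1), flat.std(axis=1)

rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 10, 6, 32))
mean_norm, std_norm = residual_norm_stats(resid)
```

`mean_norm` with `std_norm` as error bars, plotted against layer index, gives the plot described in the item above.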

Advanced

Implement Path Patching

  • Understand Callum's code.

Implement AVEC

  • Reread post to see if we can find.

Several things I feel are missing which are required for exploratory analysis to be more complete:

  • visualise dot product of time embeddings with each other
  • visualise dot product of positional embeddings with each other
  • Use Jay's head type analysis but write specific patterns for attending to RTG, attending to positive RTG, attending to states, and attending to actions.
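
For the first two items, a sketch of the dot product matrix of an embedding table with itself. Normalizing to cosine similarity is an assumption that keeps the heatmap scale comparable across embedding types; the random table stands in for the learned time or positional embeddings.

```python
import numpy as np

def similarity_matrix(emb: np.ndarray, normalize: bool = True) -> np.ndarray:
    """Pairwise dot products between rows of an embedding table.

    emb: [n_embeddings, d_model]. With normalize=True each row is scaled to
    unit length first, so the entries are cosine similarities.
    """
    if normalize:
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T

rng = np.random.default_rng(0)
time_emb = rng.normal(size=(20, 32))   # hypothetical [max_timestep, d_model] table
sim = similarity_matrix(time_emb)
```

`sim` can go straight into a heatmap. Banding along the diagonal would suggest that nearby timesteps or positions have similar embeddings.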

Several things I feel will be required for falsifying predictions of how the model is working:

  • implement a variant of path patching for DTs either in a notebook or as part of the app.
  • CaSc: not sure how feasible this is, but it has always been the goal.
@jbloomAus jbloomAus converted this from a draft issue Apr 16, 2023
@jbloomAus jbloomAus changed the title from "Improve Analysis App in various ways to facilitate better interpretability analysis of the new models" to "Mega Card: Improve Analysis App in various ways to facilitate better interpretability analysis of the new models" Apr 16, 2023
@jbloomAus jbloomAus added the enhancement New feature or request label Apr 16, 2023
@jbloomAus
Owner Author

Would storing/calculating mean kurtosis of activations be interesting? https://transformer-circuits.pub/2023/privileged-basis/index.html
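
A sketch of what that could look like, assuming kurtosis is taken across the d_model dimension of each activation vector and then averaged over samples. This is the non-excess convention, so isotropic Gaussian activations score about 3; the function name and the synthetic data are illustrative.

```python
import numpy as np

def mean_kurtosis(acts: np.ndarray) -> float:
    """Mean kurtosis across the last axis of acts ([n_samples, d_model])."""
    centered = acts - acts.mean(axis=-1, keepdims=True)
    variance = (centered ** 2).mean(axis=-1)
    fourth_moment = (centered ** 4).mean(axis=-1)
    return float((fourth_moment / variance ** 2).mean())

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(2000, 512))
privileged = isotropic.copy()
privileged[:, 0] *= 20.0   # one outlier dimension, as a privileged basis might show

k_isotropic = mean_kurtosis(isotropic)
k_privileged = mean_kurtosis(privileged)
```

Values well above 3 would hint that a few dimensions carry outsized activations, which is the signature the linked post associates with a privileged basis.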

@jbloomAus
Owner Author

On a whim I added basic history visualization. The main issues are:

  1. One-hot encoded observations aren't amenable to visualization by co-opting the grid render method, which makes this difficult. I just rendered the whole state view, but this feels inaccurate.
  2. Indexing is a little messy with the adjustment, but I think I sorted it.

I also started a time embedding dot product viz but didn't finish it; I'll leave it there. It didn't seem super interesting.

@jbloomAus
Owner Author

Plot L2 norm of residual streams (gives sense for amount of info in a layer as compared to the amount of info going into the logit).
