Plot L2 norm of residual streams (along with mean and std)
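As a rough illustration of the residual-stream norm item, the sketch below computes the mean and standard deviation of the L2 norm per layer. It assumes the residual stream has already been cached as nested lists indexed `[layer][position][d_model]`; the function name `residual_norm_stats` and the toy data are hypothetical, and real activations would be tensors from whatever caching mechanism the app uses.

```python
import math
import statistics


def residual_norm_stats(resid):
    """Return a (mean, std) pair of the L2 norm for each layer.

    `resid` is assumed to be a list over layers; each layer is a list
    over token positions of d_model-length vectors.
    """
    stats = []
    for layer in resid:
        # L2 norm of the residual vector at every position in this layer
        norms = [math.sqrt(sum(x * x for x in vec)) for vec in layer]
        # Population std, since we describe exactly these positions
        stats.append((statistics.fmean(norms), statistics.pstdev(norms)))
    return stats


# Toy data: 2 layers, 2 positions, d_model = 2 (purely illustrative)
toy_resid = [
    [[3.0, 4.0], [0.0, 5.0]],
    [[6.0, 8.0], [0.0, 0.0]],
]
layer_stats = residual_norm_stats(toy_resid)
for i, (mean, std) in enumerate(layer_stats):
    print(f"layer {i}: mean={mean:.2f} std={std:.2f}")
```

The per-layer means can then be plotted as a line with the standard deviation as a shaded band or error bars.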
Advanced
Implement Path Patching
Understand Callum's code.
Implement AVEC
Reread post to see if we can find.
Several things I feel are missing which are required for exploratory analysis to be more complete:
visualise dot product of time embeddings with each other
visualise dot product of positional embeddings with each other
Use Jay's head type analysis but write specific patterns for attending to RTG, attending to positive RTG, attending to states, and attending to actions.
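The two embedding items above boil down to computing a pairwise similarity matrix and rendering it as a heatmap. A minimal sketch, assuming the time or positional embeddings are available as a list of vectors (one per timestep or position); the helper names and the toy embeddings are hypothetical, and the cosine variant is an optional normalisation not mentioned in the list.

```python
import math


def dot_product_matrix(embeddings):
    """Pairwise dot products: entry [i][j] is <e_i, e_j>."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in embeddings]
        for u in embeddings
    ]


def cosine_matrix(embeddings):
    """Same as above, but each entry is divided by the two vector norms."""
    gram = dot_product_matrix(embeddings)
    norms = [math.sqrt(gram[i][i]) for i in range(len(embeddings))]
    return [
        [gram[i][j] / (norms[i] * norms[j]) for j in range(len(embeddings))]
        for i in range(len(embeddings))
    ]


# Toy embeddings for three timesteps (purely illustrative values)
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
gram = dot_product_matrix(toy_embeddings)
cos = cosine_matrix(toy_embeddings)
for row in gram:
    print(row)
```

Either matrix can be passed to a heatmap plot; the cosine version removes the effect of embedding norm, which may make periodic or banded structure easier to see.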
Several things I feel will be required for falsifying predictions of how the model is working:
implement a variant of path patching for DTs either in a notebook or as part of the app.
CaSc (causal scrubbing): I'm not sure how feasible this is, but it has always been the goal.
jbloomAus changed the title from "Improve Analysis App in various ways to facilitate better interpretability analysis of the new models" to "Mega Card: Improve Analysis App in various ways to facilitate better interpretability analysis of the new models" on Apr 16, 2023
On a whim I added basic history visualization. The main issues are:
One-hot encoded observations aren't amenable to visualization by co-opting the grid render method, which makes this difficult. I just rendered the whole state view, but this feels inaccurate.
Indexing is a little messy because of the adjustment, but I think I sorted it out.
I also started on the time embedding dot product visualization but didn't finish it. I'll leave it in place for now; it didn't seem especially interesting.
Analysis features
Static
Composition
Dynamic
Logit Lens
Attention Maps:
Causal
Activation Patching (features)
Activation Patching (token variations):
RTG Scan
Congruence -> If features aren't in superposition, what effect do they have on the predictions?
Renew old features:
SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing.
Cache Characterization?
Advanced
Implement Path Patching
Implement AVEC
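For the RTG Scan item in the list above, one possible reading is: hold the trajectory fixed, sweep the return-to-go (RTG) input over a range, and record how the predicted action distribution changes. The sketch below illustrates that idea only. `toy_model` is a hypothetical stand-in for a decision transformer forward pass, and `rtg_scan` is an assumed helper name, not part of the app.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(rtg):
    """Hypothetical stand-in for the model: logits for two actions.

    Higher RTG pushes probability towards the second action.
    """
    return [-2.0 * rtg, 2.0 * rtg]


def rtg_scan(model, rtg_values):
    """Run the model at each RTG value; return (rtg, action_probs) pairs."""
    return [(rtg, softmax(model(rtg))) for rtg in rtg_values]


scan = rtg_scan(toy_model, [-1.0, 0.0, 1.0])
for rtg, probs in scan:
    print(f"RTG={rtg:+.1f} -> " + ", ".join(f"{p:.3f}" for p in probs))
```

With a real model, the callable would wrap a forward pass that replaces only the RTG token, so any change in the output can be attributed to the RTG input.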