Mechanistic_Interpretability

Eval number 3:

Variable stack length data -> very good accuracy -> this led us to believe that the model was using a stack. We also probed the model to check whether it was using a stack, but the results were inconclusive because the accuracy varied with the stack depth.
Data with higher stack depth -> lower accuracy -> this result was not very reliable, because generating unbalanced bracket strings at high stack depth did not work well and produced very few samples. (standardised length)
Added perturbation to generate more unbalanced data at a particular stack depth -> 50% accuracy (loss does not decrease) -> the hypothesis became: the Transformer DOES NOT USE A STACK, but uses counts. Reasoning: the perturbed data had a very low count, i.e. the unbalanced strings were only slightly unbalanced and therefore very similar to the balanced strings. A Transformer that relies on the count of brackets would struggle to separate the two, which matches the chance-level accuracy.
To test this hypothesis, we generated a dataset of unbalanced strings with count 0 and tested the previously trained model on it -> 0% accuracy. We then also trained on this new dataset and saw that the model still could not solve the problem, but it learnt one additional pattern: any string starting with a closing bracket is unbalanced. This further supports our hypothesis, since the model was not able to learn the unbalanced strings with count 0. It fails only in the case where counting fails and a stack succeeds.
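The notes do not record how the perturbation was implemented. As a rough sketch only, assuming a single bracket type and a hypothetical helper `flip_one_bracket`, flipping one bracket of a balanced string always gives an unbalanced string whose count (opening minus closing brackets) is +2 or -2, which is the "very low count" regime described above:

```python
import random


def net_count(s):
    """Opening brackets minus closing brackets (assumed meaning of 'count')."""
    return s.count("(") - s.count(")")


def flip_one_bracket(s, rng):
    """Hypothetical perturbation: flip the bracket at one random position."""
    i = rng.randrange(len(s))
    flipped = ")" if s[i] == "(" else "("
    return s[:i] + flipped + s[i + 1:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
balanced = "(()(()))"
perturbed = flip_one_bracket(balanced, rng)

# A single flip changes the count by exactly 2, so the perturbed string
# has an odd-free length but can no longer be balanced.
print(perturbed, net_count(perturbed))
```

The actual perturbation in the experiments may differ, for example by flipping several brackets or by preserving the stack depth.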
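To make the distinction concrete, here is a minimal sketch of the two strategies, assuming "count" means the number of opening brackets minus the number of closing brackets. The function names and the `PAIRS` table are illustrative and are not taken from the project code:

```python
# Assumed bracket alphabet; the experiments may use only one bracket type.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def net_count(s):
    """Opening brackets minus closing brackets."""
    opens = sum(1 for ch in s if ch in OPENERS)
    closes = sum(1 for ch in s if ch in PAIRS)
    return opens - closes


def count_only_guess(s):
    """What a pure counting strategy would predict: balanced iff count is 0."""
    return net_count(s) == 0


def is_balanced(s):
    """Stack-based ground truth."""
    stack = []
    for ch in s:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer with no matching opener on top of the stack fails.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


for s in ["(())", "(()", ")(", "([)]"]:
    print(s, net_count(s), count_only_guess(s), is_balanced(s))
```

The strings `)(` and `([)]` both have count 0 but are unbalanced, so a counting strategy labels them as balanced while the stack check rejects them. These are the cases the count-0 dataset targets.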

DATA NEEDED:

Scaled-up data (very large).
Normal training data with all counts, for fairness.
Testing data with low counts -> should get low accuracy.
Testing data with high counts -> should get high accuracy.
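One possible way to build the two test sets is to bucket unbalanced strings by the absolute value of their count. This is a sketch under assumptions: a single bracket type, and a hypothetical `threshold` parameter whose value would need to be chosen for the actual data:

```python
def abs_count(s):
    """Absolute difference between opening and closing brackets."""
    return abs(s.count("(") - s.count(")"))


def split_by_count(strings, threshold):
    """Split strings into (low-count, high-count) lists.

    Strings with abs_count <= threshold go to the low-count list.
    """
    low, high = [], []
    for s in strings:
        (low if abs_count(s) <= threshold else high).append(s)
    return low, high


unbalanced = [")(", "(((", "(()", "(((("]
low, high = split_by_count(unbalanced, threshold=1)
print(low)   # expected to be hard for a counting model
print(high)  # expected to be easy for a counting model
```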

Next Steps:

Set up mechanistic interpretability to find where the model is using counts.
We will look at the attention weights, the embeddings output by the transformer, and the logits that the MLP gives as the final output.
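If a probe is used to look for counts in the embeddings, one candidate set of per-position targets is the running count after each token. The sketch below assumes a single bracket type and a hypothetical helper name; it only builds the targets and does not touch the model:

```python
def running_counts(s):
    """Running (opening minus closing) count after each position of s."""
    counts = []
    current = 0
    for ch in s:
        current += 1 if ch == "(" else -1
        counts.append(current)
    return counts


print(running_counts("(()"))
print(running_counts(")("))
```

With a single bracket type, the running count equals the stack depth on every valid prefix, so a probe on these targets cannot by itself separate "counting" from "stack" behaviour. Prefixes where the count goes negative, or strings with several bracket types, are where the two would differ.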

NEW DATA:

plot:
