Lai Wei*, Zhiquan Tan*, Chenghai Li, Jindong Wang, Weiran Huang (*Equal Contribution).
Shanghai Jiao Tong University & Tsinghua University & Microsoft Research Asia
We introduce matrix entropy, a novel metric rooted in information theory and geometric principles, to quantify the data compression proficiency of LLMs. It reflects the model's ability to extract relevant information and eliminate unnecessary elements, thereby providing insight into the language model's intrinsic capability. Specifically, we demonstrate its applicability in both single-modal (language) and multi-modal settings. For language models, our findings reveal that the matrix entropy of representations follows a scaling-law-type reduction as the model scales up, serving as a complement to the traditional loss scaling law. For multi-modal models, we also propose an evaluation method based on matrix entropy for assessing alignment quality, and we find that modern multi-modal large language models exhibit good alignment performance.
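For reference, the quantity implemented below is the normalized matrix entropy of the covariance matrix of token representations. A sketch of the definition, written to match the `cal_entropy` function in the example code (see the paper for the formal statement):

$$
\mathrm{H}(A) = -\sum_{i=1}^{d} \lambda_i \log \lambda_i, \qquad \lambda_1,\dots,\lambda_d \ \text{the eigenvalues of } A/\operatorname{tr}(A), \qquad \widetilde{\mathrm{H}}(A) = \frac{\mathrm{H}(A)}{\log d},
$$

where $A \in \mathbb{R}^{d \times d}$ is the covariance matrix of the centered, unit-normalized token embeddings.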
from transformers import AutoTokenizer, AutoModel
import torch
import math

# R: token representations of shape N * d (N tokens, hidden dimension d)
def normalize(R):
    with torch.no_grad():
        mean = R.mean(dim=0)
        R = R - mean                                         # center each coordinate
        norms = torch.norm(R, p=2, dim=1, keepdim=True)
        R = R / norms                                        # scale each token embedding to unit norm
    return R

def cal_cov(R):
    with torch.no_grad():
        Z = torch.nn.functional.normalize(R, dim=1)
        A = torch.matmul(Z.T, Z) / Z.shape[0]                # d * d covariance (Gram) matrix
    return A

def cal_entropy(A):
    with torch.no_grad():
        eig_val = torch.svd(A / torch.trace(A))[1]           # eigenvalues of the trace-normalized matrix
        entropy = -(eig_val * torch.log(eig_val)).nansum().item()
        normalized_entropy = entropy / math.log(A.shape[0])  # divide by log(d)
    return normalized_entropy

model_path = "cerebras/Cerebras-GPT-1.3B"  # for example
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, device_map="auto")

text = "I love Generative AI very much."  # for example
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    R = model(inputs.input_ids)[0][0, :, :]  # last hidden states, shape N * d
    R = normalize(R)
    A = cal_cov(R)
    Entropy = cal_entropy(A)
print(Entropy)
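The printed value is the normalized matrix entropy of the sentence's token representations, which lies in [0, 1]; roughly speaking, lower values correspond to a more concentrated eigenvalue spectrum, i.e., stronger compression of the representations.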
cd utils
python entropy_single_sentence.py
Please download the wiki-en, dolly-15k, openwebtext2, and hh-rlhf datasets from Hugging Face and edit the data paths in your scripts.
cd utils
python entropy_dataset.py
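For reference, the dataset-level computation might look like the following minimal sketch; the dataset name, split, subsample size, and text field below are placeholders, and the actual entropy_dataset.py may batch, truncate, and aggregate differently. It reuses normalize, cal_cov, cal_entropy, tokenizer, and model from the single-sentence example above.

# Hypothetical sketch: mean normalized matrix entropy over (a subsample of) a dataset.
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")  # for example

entropies = []
for example in dataset.select(range(100)):  # subsample for illustration
    text = example["instruction"]           # text field depends on the dataset
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():
        R = model(inputs.input_ids)[0][0, :, :]
        A = cal_cov(normalize(R))
        entropies.append(cal_entropy(A))

print(sum(entropies) / len(entropies))      # mean normalized matrix entropy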
If you use matrix entropy in your research or applications, please cite it with this BibTeX entry:
@article{wei2024large,
  title={Large Language Model Evaluation via Matrix Entropy},
  author={Wei, Lai and Tan, Zhiquan and Li, Chenghai and Wang, Jindong and Huang, Weiran},
  journal={arXiv preprint arXiv:2401.17139},
  year={2024}
}