LogIX: Logging for Interpretable and Explainable AI
Warning
This repository is under active development. If you have suggestions or find bugs in LogIX, please open a GitHub issue or reach out.
With a few additional lines of code, traditional logging tracks loss, hyperparameters, and similar metrics, giving users basic insight into their AI/ML experiments. But can we also enable an in-depth understanding of large-scale training data, the most important ingredient in AI/ML, through a similar logging interface? Try out LogIX, which is built upon our cutting-edge data valuation/attribution research and supports Hugging Face Transformers and PyTorch Lightning integrations!
- PyPI
pip install logix-ai
- From source (Latest, recommended)
git clone https://github.com/logix-project/logix.git
cd logix
pip install -e .
Our software design allows for seamless integration with popular high-level frameworks, including Hugging Face Transformers and PyTorch Lightning, which conveniently handle distributed training, data loading, etc. Advanced users who don't use high-level frameworks can still integrate LogIX into their existing training code, much like any traditional logging software (see our Tutorial).
A full example can be found here.
from transformers import Trainer, Seq2SeqTrainer
from logix.huggingface import patch_trainer, LogIXArguments
# Define LogIX arguments
logix_args = LogIXArguments(project="myproject",
                            config="config.yaml",
                            lora=True,
                            hessian="raw",
                            save="grad")
# Patch HF Trainer
LogIXTrainer = patch_trainer(Trainer)
# Pass LogIXArguments as TrainingArguments
trainer = LogIXTrainer(logix_args=logix_args,
                       model=model,
                       train_dataset=train_dataset,
                       *args,
                       **kwargs)
# Instead of trainer.train(), use
trainer.extract_log()
trainer.influence()
trainer.self_influence()
A full example can be found here.
from lightning import LightningModule, Trainer
from logix.lightning import patch, LogIXArguments
class MyLitModule(LightningModule):
    ...

def data_id_extractor(batch):
    return tokenizer.batch_decode(batch["input_ids"])
# Define LogIX arguments
logix_args = LogIXArguments(project="myproject",
                            config="config.yaml",
                            lora=True,
                            hessian="raw",
                            save="grad")
# Patch Lightning Module and Trainer
LogIXModule, LogIXTrainer = patch(MyLitModule,
                                  Trainer,
                                  logix_args=logix_args,
                                  data_id_extractor=data_id_extractor)
# Use patched Module and Trainer as before
module = LogIXModule(user_args)
trainer = LogIXTrainer(user_args)
# Instead of trainer.fit(module, train_loader), use
trainer.extract_log(module, train_loader)
trainer.influence(module, train_loader)
Training log extraction with LogIX is as simple as adding one `with` statement to the existing training code. LogIX automatically extracts user-specified logs using PyTorch hooks and stores them as a tuple of `([data_ids], log[module_name][log_type])`. If needed, LogIX writes these logs to disk efficiently with memory-mapped files.
import torch.nn as nn

import logix

# Initialize LogIX
run = logix.init(project="my_project")

# Specify modules to be tracked for logging
run.watch(model, name_filter=["mlp"], type_filter=[nn.Linear])

# Specify plugins to be used in logging
run.setup({"grad": ["log", "covariance"]})
run.save(True)

for batch in data_loader:
    # Set `data_id` (and optionally `mask`) for the current batch
    with run(data_id=batch["input_ids"], mask=batch["attention_mask"]):
        model.zero_grad()
        loss = model(batch)
        loss.backward()

# Synchronize statistics (e.g. covariance) and write logs to disk
run.finalize()
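To make the log layout described above more concrete, here is a minimal sketch of keeping per-sample logs as `([data_ids], log[module_name][log_type])` with the arrays backed by a memory-mapped file. It uses NumPy's `np.memmap` rather than LogIX's own writer. The helper names `write_log` and `read_log`, the file layout, and the module name `"mlp.fc1"` are hypothetical and are not part of the LogIX API.

```python
import os
import tempfile

import numpy as np


def write_log(path, data_ids, grads):
    """Write per-sample gradients to a memory-mapped file (hypothetical layout)."""
    # One row per data point; flush() pushes the written pages to disk.
    mm = np.memmap(path, dtype=np.float32, mode="w+", shape=grads.shape)
    mm[:] = grads
    mm.flush()
    del mm
    return list(data_ids), grads.shape


def read_log(path, data_ids, shape, module_name="mlp.fc1", log_type="grad"):
    """Re-open the file read-only and return ([data_ids], log[module_name][log_type])."""
    mm = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
    return data_ids, {module_name: {log_type: mm}}


rng = np.random.default_rng(0)
grads = rng.standard_normal((4, 8)).astype(np.float32)  # 4 samples, 8-dim gradients
path = os.path.join(tempfile.mkdtemp(), "log.mmap")

ids, shape = write_log(path, ["a", "b", "c", "d"], grads)
data_ids, log = read_log(path, ids, shape)

# Indexing copies out one row without loading the whole file into memory.
row = np.array(log["mlp.fc1"]["grad"][2])
```

The point of the memory map is that a later analysis pass can iterate over logs that are larger than RAM, touching only the rows it needs.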
As part of our initial research, we implemented influence functions using LogIX. We plan to provide more pre-implemented interpretability algorithms if there is demand.
# Build PyTorch DataLoader from saved log data
log_loader = run.build_log_dataloader()
with run(data_id=test_batch["input_ids"]):
    test_loss = model(test_batch)
    test_loss.backward()
test_log = run.get_log()
run.influence.compute_influence_all(test_log, log_loader) # Data attribution
run.influence.compute_self_influence(test_log) # Uncertainty estimation
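For readers unfamiliar with influence functions, the following NumPy sketch shows the kind of computation involved: a score of the form g_test^T H^{-1} g_train for every test/train pair. It is an illustration only, not LogIX's implementation. It assumes the Hessian H is approximated by a damped, uncentered gradient covariance, and the function names and the `damping` parameter are hypothetical. Sign conventions differ across papers; here a larger score means the two gradients are more aligned under H^{-1}.

```python
import numpy as np


def influence_scores(test_grads, train_grads, damping=1e-3):
    """Return a (n_test, n_train) matrix of g_test^T H^{-1} g_train."""
    n, d = train_grads.shape
    # Damped, uncentered gradient covariance as a stand-in for the Hessian.
    hessian = train_grads.T @ train_grads / n + damping * np.eye(d)
    # Solve H X = G_train^T instead of forming the explicit inverse.
    preconditioned = np.linalg.solve(hessian, train_grads.T)  # (d, n_train)
    return test_grads @ preconditioned


def self_influence_scores(grads, damping=1e-3):
    """Return g_i^T H^{-1} g_i for every row of grads."""
    return np.diag(influence_scores(grads, grads, damping))


rng = np.random.default_rng(0)
train_grads = rng.standard_normal((16, 4))  # toy per-sample training gradients
test_grads = rng.standard_normal((2, 4))    # toy per-sample test gradients

scores = influence_scores(test_grads, train_grads)   # shape (2, 16)
self_scores = self_influence_scores(train_grads)     # shape (16,)
top_train_idx = np.argsort(-scores, axis=1)[:, :3]   # highest-scoring train points first
```

Ranking training points by `scores` is the data attribution use case, while unusually large `self_scores` flag data points the model is sensitive to.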
Please check out Examples for more detailed examples!
Logs from neural networks are difficult to handle because of their size. For example, the gradient of each training data point is about as large as the whole model. We therefore provide systems support to scale neural network analysis efficiently to billion-scale models. Below are a few features that LogIX currently supports:
- Gradient compression (compression ratio: 1,000-100,000x)
- Memory-map-based data IO
- CPU offloading of statistics
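As a rough illustration of why gradient compression helps, the sketch below projects the weight gradient of a single linear layer into a much smaller space using fixed random matrices. It is a toy NumPy example under stated assumptions, not LogIX's implementation: the shapes, the random Gaussian projections, and all variable names are hypothetical. It also shows that projecting the layer inputs and output gradients first yields the same result as projecting the full gradient, without ever materializing it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, k_in, k_out, seq_len = 64, 32, 4, 4, 10

# Hypothetical per-sample tensors for one linear layer.
x = rng.standard_normal((seq_len, n_in))    # layer inputs
dy = rng.standard_normal((seq_len, n_out))  # gradients w.r.t. layer outputs

# Fixed random projections, shared by all data points.
p_in = rng.standard_normal((k_in, n_in)) / np.sqrt(k_in)
p_out = rng.standard_normal((k_out, n_out)) / np.sqrt(k_out)

# Naive route: materialize the full (n_out, n_in) gradient, then project it.
full_grad = dy.T @ x
compressed_naive = p_out @ full_grad @ p_in.T

# Cheaper route: project inputs and output gradients before the outer product.
compressed = (dy @ p_out.T).T @ (x @ p_in.T)  # shape (k_out, k_in)

# Number of stored values shrinks from n_out * n_in to k_out * k_in.
compression_ratio = full_grad.size / compressed.size
```

In this toy setting the stored gradient shrinks from 2,048 to 16 values; the much larger ratios quoted above correspond to far larger layers than the ones used here.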
| DistributedDataParallel | Mixed Precision | Gradient Checkpointing | torch.compile | FSDP |
|---|---|---|---|---|
| ✅ | ✅ | ✅ | ✅ | ✅ |
We welcome contributions from the community. Please see our contributing guidelines for details on how to contribute to LogIX.
To cite this repository:
@article{choe2024your,
title={What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions},
author={Choe, Sang Keun and Ahn, Hwijeen and Bae, Juhan and Zhao, Kewen and Kang, Minsoo and Chung, Youngseog and Pratapa, Adithya and Neiswanger, Willie and Strubell, Emma and Mitamura, Teruko and others},
journal={arXiv preprint arXiv:2405.13954},
year={2024}
}
LogIX is licensed under the Apache 2.0 License.