AlphaZip

Welcome to the AlphaZip - Neural Networks Enhanced Lossless Text Compression project! This project explores leveraging the power of Large Language Models (LLMs) for compressing text in a lossless manner using rank-based prediction followed by compression techniques.

Find the pre-print here: https://arxiv.org/abs/2409.15046

Overview

Our approach utilizes advanced neural network models to achieve efficient and effective text compression. For real-time text file compression on your personal computer refer to the instructions below.

Requirements

To get started, ensure you have the following:

NVIDIA GPU: GeForce RTX 4080
Python: 3.10.12
PyTorch: 2.2.2
TensorFlow: 2.11.0 (for XLA support)

Installation

Clone the repository:

git clone https://github.com/Swathi-Shree-Narashiman/AlphaZip.git

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Compressing Text Files:
- Use the compress.py script to compress any text file.
- Modify the path variable in the script to point to your file.
PDF Files:
- Utilize the read_PDF function to extract text from a PDF.
- Save the extracted text to a file and then compress it using the compress.py script.
Adaptive Huffman Method:
- To use the adaptive Huffman method, either copy and paste the function or import it as a user-defined library.

Hyperparameters

input_length: Number of ASCII characters from the input text to compress.
context_size: Number of characters used as context for the transformer block to predict the next token.

Domain-Specific Compression

Fine-Tuning

To fine-tune the model:

Create a directory inside your current directory using

    mkdir fine_tuning_weights

Run the PEFT/fine_tuning.py script with your input file, e.g., input.txt.
```
python PEFT/fine_tuning.py <input.txt>
```
Test compression performance from any checkpoint using compress.py:
```
python compress.py <file_to_be_compressed_path> <current_directory_path/fine_tuning_weights/checkpoint-XXXX>
```
Replace XXXX with the checkpoint you would like to load.

Knowledge Distillation

To perform knowledge distillation on GPT-2:

Create a directory inside your current directory using

    mkdir knowledge_distillation_weights

Run the PEFT/knowledge_distillation.py script with your input file, e.g., input.txt.
```
python PEFT/knowledge_distillation.py <input.txt>
```

Test compression performance from any checkpoint using compress.py:

python compress.py <file_to_be_compressed_path> <current_directory_path/knowledge_distillation_weights/checkpoint-XXXX>

Replace XXXX with the checkpoint you would like to load.

Feel free to modify any paths or links to fit your specific repository details. This README.md file provides a clear and organized overview of your project, making it easier for users to get started and understand how to use the code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

AlphaZip

Overview

Requirements

Installation

Usage

Hyperparameters

Domain-Specific Compression

Fine-Tuning

Knowledge Distillation

Files

README.md

Latest commit

History

README.md

File metadata and controls

AlphaZip

Overview

Requirements

Installation

Usage

Hyperparameters

Domain-Specific Compression

Fine-Tuning

Knowledge Distillation