This repository contains the code and configuration to use a transformer-based language model built with a custom Linformer architecture. The model is designed to handle long-sequence tasks more efficiently by incorporating a low-rank projection mechanism for attention. This allows scaling the model to longer sequences while maintaining manageable memory and computational requirements.
This project features a Linformer-based language model designed to optimize attention mechanism efficiency, reducing the quadratic complexity typical in transformer architectures to linear complexity. The Linformer model achieves this through low-rank projections, making it ideal for processing long sequences efficiently.
The model is available for download from Hugging Face and can be easily integrated into projects via pip installation. The weights for the pre-trained model are also hosted on Hugging Face.
The core of this project revolves around a Linformer-based Transformer architecture, which optimizes the self-attention mechanism by reducing its quadratic complexity to linear time, making it more efficient for long sequences.
- Efficient Attention with Linformer:
  - The Linformer architecture reduces the cost of self-attention from quadratic to linear in the sequence length. In traditional transformers, the self-attention mechanism has a time and memory complexity of $O(n^2)$, where $n$ is the sequence length. Linformer addresses this by projecting the keys and values to a shorter, fixed length $k$ using low-rank projections, which reduces the overall memory and computational load to $O(n)$ for fixed $k$.
  - In the standard transformer, the self-attention is computed as:

    $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

    where:
    - $Q \in \mathbb{R}^{n \times d}$ are the queries,
    - $K \in \mathbb{R}^{n \times d}$ are the keys,
    - $V \in \mathbb{R}^{n \times d}$ are the values, and
    - $d_k$ is the dimension of the keys/queries.
  - Linformer modifies this by introducing a projection matrix $P \in \mathbb{R}^{n \times k}$ with $k < n$, which shortens $K$ and $V$ along the sequence axis:

    $$K' = P^\top K, \quad V' = P^\top V$$

    so that $K', V' \in \mathbb{R}^{k \times d}$ and the attention matrix $Q K'^\top$ has shape $n \times k$ instead of $n \times n$.
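The shape bookkeeping behind the projection can be checked with a small, dependency-free sketch. It assumes the projection acts along the sequence axis (that is, $P^\top K$), which is the shape-consistent reading for $P \in \mathbb{R}^{n \times k}$ and $K \in \mathbb{R}^{n \times d}$. The helper names and the toy sizes are illustrative and are not taken from the Lumenspark code.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def shape(m):
    """Return (rows, columns)."""
    return (len(m), len(m[0]))

def random_matrix(rows, cols, rng):
    """Gaussian random matrix, standing in for learned weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, k = 8, 4, 3  # toy sequence length, feature size, projected length

Q = random_matrix(n, d, rng)
K = random_matrix(n, d, rng)
P = random_matrix(n, k, rng)

full_scores = matmul(Q, transpose(K))             # standard attention: n x n
K_proj = matmul(transpose(P), K)                  # projected keys: k x d
linformer_scores = matmul(Q, transpose(K_proj))   # projected attention: n x k

print(shape(full_scores), shape(K_proj), shape(linformer_scores))
# prints (8, 8) (3, 4) (8, 3)
```

The score matrix grows with $n \cdot k$ rather than $n^2$, so for a fixed $k$ the cost is linear in the sequence length.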
- Low-Rank Linear Projections:
  - LowRankLinear is used throughout the architecture to reduce dimensionality while maintaining model expressiveness. It factorizes the weight matrix $W$ of a linear transformation into two smaller matrices $U$ and $V$ (here $V$ is a factor matrix, not the attention values):

    $$W \approx U V^\top$$

  - Here, $U \in \mathbb{R}^{d \times r}$ and $V \in \mathbb{R}^{d \times r}$, where $r$ is the rank of the projection. For a $d \times d$ layer this reduces the number of weights from $d^2$ to $2dr$, which is a saving whenever $r < d/2$.
  - This method compresses the model and lowers the computational cost of matrix multiplications in dense layers.
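To make the saving concrete, the following sketch counts weights for a dense $d \times d$ layer and for its rank-$r$ factorization, using the default `embed_dim` (768) and `rank` (256) listed in this README. Biases are ignored and the function names are illustrative only.

```python
def dense_params(d_in, d_out):
    """Weights in a full d_in x d_out matrix."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Weights in the two factors: (d_in x rank) plus (rank x d_out)."""
    return d_in * rank + rank * d_out

d, r = 768, 256
dense = dense_params(d, d)
factored = low_rank_params(d, d, r)

print(dense, factored, f"{factored / dense:.3f}")
# prints 589824 393216 0.667
```

With these values the factorized layer holds two thirds of the weights of the dense one. At $r = d/2$ the two counts are equal, so the factorization only pays off below that rank.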
- Self-Attention Mechanism:
  - The SelfAttention module implements multi-head self-attention without low-rank projections in this architecture. Each attention head operates on the input sequence and computes self-attention as in a standard transformer. The attention matrix remains $n \times n$, ensuring full expressivity.
  - For each attention head, the queries, keys, and values are computed as follows:

    $$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

    where $X \in \mathbb{R}^{n \times d}$ is the input sequence, and $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learned projection matrices for queries, keys, and values.
  - The self-attention is then calculated using the scaled dot-product attention mechanism:

    $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

  - The complexity of this operation remains $O(n^2 \cdot d)$, as the attention matrix is not reduced with low-rank projections.
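A minimal single-head sketch of scaled dot-product attention in plain Python is given here for illustration. It is not the SelfAttention module itself: the learned matrices $W_Q$, $W_K$, $W_V$ are skipped (treated as identity), there is no masking, and the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for single-head attention on lists of rows."""
    d_k = len(K[0])
    # scores[i][j] = <Q_i, K_j> / sqrt(d_k): one entry per (query, key) pair
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]  # n x n, each row sums to 1
    # Each output row is a weighted average of the rows of V
    output = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 tokens, d = 2 features
output, weights = scaled_dot_product_attention(X, X, X)
print(len(weights), len(weights[0]))  # 3 3
print(len(output), len(output[0]))    # 3 2
```

The `weights` matrix has one entry per pair of positions, which is where the $O(n^2 \cdot d)$ cost comes from.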
- Factorized Feed-Forward Layers:
  - Each transformer block includes a feed-forward network (FFN) that follows the attention layer. In this implementation, the FFN is factorized using LowRankLinear layers, reducing the computational cost of the FFN while maintaining performance.
  - The FFN consists of two linear layers with a GELU non-linearity.
  - Instead of directly projecting from $d$ to $d$, the factorized layers project from $d$ to $r$ and back to $d$, where $r$ is the reduced rank.
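One possible reading of the factorized FFN is sketched below in plain Python. It assumes that each of the two linear layers is a $d \to r \to d$ factorization and that GELU sits between them; biases are omitted, and every name and weight value is hypothetical rather than taken from the Lumenspark source.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(matrix, vec):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def low_rank_linear(x, down, up):
    """Apply a factorized map: down is r x d, up is d x r."""
    return matvec(up, matvec(down, x))

def factorized_ffn(x, layer1, layer2):
    """Low-rank linear, then GELU, then a second low-rank linear."""
    hidden = [gelu(h) for h in low_rank_linear(x, *layer1)]
    return low_rank_linear(hidden, *layer2)

# Toy weights with d = 4 and r = 2: keep the first two coordinates only.
down = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0]]
up = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, 0.0],
      [0.0, 0.0]]

out = factorized_ffn([1.0, -1.0, 2.0, 3.0], (down, up), (down, up))
print([round(v, 4) for v in out])
```

Because every vector passes through an $r$-dimensional bottleneck, each layer costs roughly $2dr$ multiplications per token instead of $d^2$.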
- PreNorm with LayerNorm and LayerScale:
  - Instead of applying normalization after each module (post-norm), we use a PreNorm architecture where LayerNorm is applied before the attention and feed-forward layers. This gives smoother gradient flow and better stability during training.
  - In this architecture, LayerNorm normalizes each vector $x \in \mathbb{R}^{d}$ by subtracting the mean and dividing by the standard deviation:

    $$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma}$$

    where $\mu$ and $\sigma$ are the mean and standard deviation of the $d$ components of $x$.
  - Additionally, we incorporate LayerScale, a technique where a learned scaling factor is applied to the output of the residual branch. This helps in modulating the output of each transformer block and improves the model's ability to learn deeper representations. In the usual LayerScale formulation, with $f$ denoting the attention or feed-forward sublayer and $\lambda$ the learned parameter, this reads:

    $$y = x + \lambda \cdot f(\text{LayerNorm}(x))$$

  - The scale factor $\lambda$ is initialized to a small value (e.g., 0.1) and learned during training.
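The PreNorm ordering can be sketched in a few lines of plain Python. This is an illustration under assumptions: the small `eps` term is the usual numerical safeguard, LayerNorm's learned scale and shift are left out, and $\lambda$ is applied to the sublayer branch before the residual addition, as in the standard LayerScale formulation. The function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of one vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def prenorm_block(x, sublayer, lam=0.1):
    """Normalize first, run the sublayer, scale by lam, then add the residual."""
    branch = sublayer(layer_norm(x))
    return [xi + lam * bi for xi, bi in zip(x, branch)]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
y = prenorm_block(x, sublayer=lambda h: h)  # identity stands in for attention or FFN

print([round(v, 3) for v in normed])
print([round(v, 3) for v in y])
```

With a small initial $\lambda$, each block starts close to the identity map, which is the usual motivation for LayerScale in deep stacks.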
- Dropout and Residual Connections:
  - To prevent overfitting, dropout layers are applied after the attention mechanism and feed-forward layers. Dropout regularizes the model during training by randomly zeroing some of the activations.
  - Residual connections are included around the attention and feed-forward layers, allowing for better gradient flow during backpropagation and preventing vanishing gradients in deep networks.
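The interaction of dropout and the residual path can be illustrated with a short sketch. It assumes inverted dropout (surviving activations are rescaled by $1/(1-p)$ during training, which is how common frameworks such as PyTorch implement it) and uses the default rate of 1/17 listed in this README. The function names are hypothetical.

```python
import random

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def residual_with_dropout(x, sublayer, p, rng, training=True):
    """Add the (possibly dropped-out) sublayer output back onto the input."""
    branch = dropout(sublayer(x), p, rng, training)
    return [xi + bi for xi, bi in zip(x, branch)]

rng = random.Random(0)
x = [1.0] * 1000
dropped = dropout(x, 1 / 17, rng)

print(sum(1 for v in dropped if v == 0.0), "of", len(dropped), "activations zeroed")
```

At inference time (`training=False`) dropout is a no-op, so the block reduces to a plain residual connection.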
The model architecture is highly configurable through several hyperparameters:
- vocab_size: The size of the vocabulary (default: 50,257).
- embed_dim: Dimensionality of the token and positional embeddings (default: 768).
- depth: Number of Linformer transformer layers (default: 8).
- heads: Number of attention heads (default: 8).
- seq_length: Maximum sequence length (default: 768).
- dropout: Dropout rate applied throughout the network (default: 1/17).
- k: The projection dimension for the low-rank attention (default: 384).
- rank: Defines the reduced dimensionality for low-rank projections (default: 256).
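For quick reference, the defaults above can be collected in a small container. The class below is a hypothetical sketch, not the configuration class shipped with the package, and the divisibility check is an assumption based on how multi-head attention is usually split across heads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HyperparameterSketch:
    """Hypothetical holder for the defaults listed in this README."""
    vocab_size: int = 50257
    embed_dim: int = 768
    depth: int = 8
    heads: int = 8
    seq_length: int = 768
    dropout: float = 1 / 17
    k: int = 384
    rank: int = 256

    def __post_init__(self):
        # Assumption: each head receives an equal slice of the embedding.
        if self.embed_dim % self.heads != 0:
            raise ValueError("embed_dim must be divisible by heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def head_dim(self):
        """Per-head feature size under the equal-split assumption."""
        return self.embed_dim // self.heads

config = HyperparameterSketch()
print(config.head_dim)           # 96
print(round(config.dropout, 4))  # 0.0588
```

With the defaults, $k$ is half of `seq_length` and `rank` is one third of `embed_dim`.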
To install the model, use pip:
pip install lumenspark
This will install the Linformer-based language model and its dependencies.
After installing the package, you can easily load the pre-trained model and tokenizer from Hugging Face to generate text.
from lumenspark import LumensparkModel
import torch

# 1. Set up the device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 2. Load the model and move it to the device
model = LumensparkModel.from_pretrained("anto18671/lumenspark").to(device)

# 3. Example input text
input_text = "Once upon a time"

# 4. Generate text
output_text = model.generate(
    input_text,
    max_length=100,          # Maximum length of the generated sequence
    temperature=0.7,         # Controls randomness in predictions
    top_k=50,                # Top-k sampling to filter high-probability tokens
    top_p=0.9,               # Nucleus sampling to control diversity
    repetition_penalty=1.2,  # Penalize repetition
)

# 5. Print the generated text
print(output_text)
This example demonstrates loading the model and tokenizer, and generating a text sequence based on an initial prompt.
We would like to extend our gratitude to RunPod for their generous sponsorship, supporting the training and development of Lumenspark. Their contribution has been instrumental in pushing the project forward.
If you find Lumenspark valuable and would like to support its ongoing development, consider becoming a sponsor!
Click the Sponsor button above or visit GitHub Sponsors to choose a sponsorship tier that suits you.
Thank you for your support!
This project is licensed under the MIT License. See the LICENSE file for more details.