The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉
FlashInfer: Kernel Library for LLM Serving
InternEvo is an open-source lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
Train LLMs (bloom, llama, baichuan2-7b, chatglm3-6b) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Triton implementation of FlashAttention2 that adds Custom Masks.
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster vs SDPA EA.
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Decoding Attention is specially optimized for multi-head attention (MHA), using CUDA cores for the decoding stage of LLM inference.
Python package for rematerialization-aware gradient checkpointing
Utilities for efficient fine-tuning, inference and evaluation of code generation models
Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention.
Flash Attention Implementation with Multiple Backend Support and Sharding. This module provides a flexible implementation of Flash Attention with support for different backends (GPU, TPU, CPU) and platforms (Triton, Pallas, JAX).
A simple PyTorch implementation of Flash MultiHead Attention
🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations
Long-term project on a custom AI architecture. Consists of cutting-edge machine learning techniques such as Flash-Attention, Group-Query-Attention, ZeRO-Infinity, BitNet, etc.
Poplar implementation of FlashAttention for IPU