neox Flash attn #31
Conversation
arnocandel commented Apr 11, 2023 (edited)
- compile flash attention
- perform fusion and flash-self/cross-attention on model state
- get it to run
- combine with LoRA
- combine with 8-bit
As is, if you want to reuse a setup like this:

Install GPT-NeoX:

# load mamba into the shell (assumes a ~/.bashrc.mamba that initializes mamba/conda)
source ~/.bashrc.mamba
mamba create -n gptneox
conda activate gptneox
mamba install python=3.8 -y
mamba install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y
# assumes the EleutherAI/gpt-neox repo has already been cloned into ./gpt-neox
cd gpt-neox/
pip install -r requirements/requirements.txt
mamba install cudatoolkit-dev=11.7 cudatoolkit=11.7 -c conda-forge -c nvidia -y
# build GPT-NeoX's fused CUDA kernels against the conda CUDA toolkit
unset CUDA_HOME
python ./megatron/fused_kernels/setup.py install
pip install -r ./requirements/requirements-flashattention.txt
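# Optional sanity check (not part of the original steps): confirm the package
# installed by requirements-flashattention.txt is importable; the module name
# "flash_attn" is assumed here.
python -c "import flash_attn; print('flash-attn import OK')"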
cd ..
# DeeperSpeed is EleutherAI's DeepSpeed fork used by GPT-NeoX
git clone https://github.com/EleutherAI/DeeperSpeed.git
cd DeeperSpeed
./install.sh
# back in the gpt-neox checkout: download and tokenize a small sample dataset
cd ../gpt-neox
python prepare_data.py -d ./data
# fetch the slim (no optimizer state) GPT-NeoX-20B weights
wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P 20B_checkpoints

Now you can train, fine-tune, and run inference with flash attention by changing the NeoX config file to specify the attention type as flash.
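A rough sketch of that change, assuming the GPT-NeoX convention of an "attention-config" key listing an attention type per layer group (the exact key name and the [[["flash"], 44]] format are recalled from the repo's configs and should be verified against your checkout), and that deepy.py merges extra config files passed on the command line:

```bash
# Hypothetical override config enabling flash attention for all 44 layers of
# GPT-NeoX-20B; key name and format assumed from the GPT-NeoX config convention.
cat > ./configs/flash_override.yml <<'EOF'
{
  "attention-config": [[["flash"], 44]]
}
EOF

# GPT-NeoX merges the config files listed on the command line.
./deepy.py generate.py ./configs/20B.yml ./configs/flash_override.yml
```

Keeping the override in a separate file avoids editing the stock 20B.yml.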
The change to model parallel size is to use one pipeline per GPU, which is required to satisfy deepy.py: ./deepy.py generate.py ./configs/20B.yml (a sketch of such an override follows below). Flash attention can also be used with LLaMA via the Vicuna/FastChat repo.
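In the same spirit, a hypothetical parallelism override; the pipe-parallel-size / model-parallel-size keys mirror the ones in 20B.yml, and the values below assume an 8-GPU node with one pipeline stage per GPU and no tensor parallelism, so adjust them to your hardware:

```bash
# Hypothetical override: one pipeline stage per GPU, no tensor (model)
# parallelism, sized for an 8-GPU node.
cat > ./configs/parallel_override.yml <<'EOF'
{
  "pipe-parallel-size": 8,
  "model-parallel-size": 1
}
EOF
./deepy.py generate.py ./configs/20B.yml ./configs/flash_override.yml ./configs/parallel_override.yml
```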
Commits edfa2ad to 1187a5c
Flash attention is now native in Torch 2.0.1 for float16.
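For reference, a minimal sketch (assuming torch>=2.0.1 and a CUDA GPU) of exercising the built-in kernels via torch.nn.functional.scaled_dot_product_attention, with the torch.backends.cuda.sdp_kernel context manager restricting the call to the flash backend:

```bash
python - <<'EOF'
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) tensors in float16 on the GPU,
# the layout scaled_dot_product_attention expects.
q = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Allow only the flash backend so the call fails loudly if it cannot be used.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape, out.dtype)
EOF
```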