
Quantized model updates, switch to recommending TheBloke #208

Merged: 10 commits, May 30, 2023
3 changes: 2 additions & 1 deletion .env_gpt4all
@@ -4,7 +4,8 @@
model_name_gptj=ggml-gpt4all-j-v1.3-groovy.bin

# llama-cpp-python type, supporting version 3 quantization, here from locally built llama.cpp q4 v3 quantization
model_path_llama=./models/7B/ggml-model-q4_0.bin
# below uses prompt_type=wizard2
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
# below assumes max_new_tokens=256
n_ctx=1792
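
Presumably `n_ctx=1792` is chosen so that the prompt context plus `max_new_tokens=256` still fits within LLaMa's 2048-token window (1792 + 256 = 2048).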

15 changes: 12 additions & 3 deletions FAQ.md
@@ -258,8 +258,13 @@ etc.

### CPU with no AVX2 or using LLaMa.cpp

GPT4All-based models require AVX2, unless one recompiles that project on your system. Until then, use llama.cpp models instead,
e.g. by compiling the llama model on your system by following the [instructions](https://github.com/ggerganov/llama.cpp#build) and [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), e.g. for Linux:
GPT4All-based models require AVX2, unless one recompiles that project on your system. Until then, use llama.cpp models instead.
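
A quick way to check whether your CPU supports AVX2 on Linux (a minimal sketch; on macOS one would inspect `sysctl` output instead):
```bash
# count how many logical CPUs report the avx2 flag; 0 means no AVX2
grep -c avx2 /proc/cpuinfo
```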

We recommend downloading version 3 quantized ggml files (typically with `ggmlv3` in the filename) from [TheBloke](https://huggingface.co/TheBloke), since these work with the latest llama.cpp. See the main [README.md](README.md#cpu).

The example below uses the base LLaMa model, which is not instruct-tuned, so it is not recommended for chatting. It only illustrates how to quantize a model yourself if you are an expert.

Compile llama.cpp on your system by following the [instructions](https://github.com/ggerganov/llama.cpp#build) and install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), e.g. for Linux:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
@@ -295,7 +300,7 @@ python convert.py models/7B/
# test by running the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
```
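
The collapsed middle of the example covers converting and quantizing the weights; a typical flow for llama.cpp of that era looks roughly like the following (a sketch with assumed paths, not the exact hidden lines):
```bash
# build the main and quantize binaries
make
# convert the original 7B weights placed under models/7B/ into a ggml fp16 file
python3 convert.py models/7B/
# quantize the fp16 file down to 4-bit (q4_0)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
```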
then add an entry in the .env file like the following (assumes version 3 quantization)
then add an entry in the `.env_gpt4all` file like the following (assumes version 3 quantization)
```.env_gpt4all
# model path and model_kwargs
model_path_llama=./models/7B/ggml-model-q4_0.bin
@@ -358,6 +363,10 @@ Ignore this warning.

These can be useful on HuggingFace Spaces, where one sets secret tokens because CLI options cannot be used.

### h2oGPT LLM not producing output.

To be fixed soon: https://github.com/h2oai/h2ogpt/issues/192

### GPT4All not producing output.

Please contact the GPT4All team. Even a basic test can give an empty result.
32 changes: 22 additions & 10 deletions README.md
@@ -78,7 +78,7 @@ Also check out [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio) for our

### Getting Started

For help installing a Python 3.10 environment, see [Install Python 3.10 Environment](INSTALL.md#install-python-environment)
First, one needs a Python 3.10 environment. For help installing a Python 3.10 environment, see [Install Python 3.10 Environment](INSTALL.md#install-python-environment)

#### GPU (CUDA)

@@ -90,7 +90,17 @@ cd h2ogpt
pip install -r requirements.txt
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b --load_8bit=True
```
Then point browser at http://0.0.0.0:7860 (linux) or http://localhost:7860 (windows/mac) or the public live URL printed by the server (disable shared link with `--share=False`). For 4-bit or 8-bit support, older GPUs may require older bitsandbytes installed as `pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1`.
Then point browser at http://0.0.0.0:7860 (linux) or http://localhost:7860 (windows/mac) or the public live URL printed by the server (disable shared link with `--share=False`). For 4-bit or 8-bit support, older GPUs may require older bitsandbytes installed as `pip uninstall bitsandbytes -y ; pip install bitsandbytes==0.38.1`. For production use, we recommend at least the 12B model, run as:
```
python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True --debug=True
```
and one can use `--h2ocolors=False` to get soft blue-gray colors instead of H2O.ai colors.
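
For example, combining the flags above into one command (an illustrative sketch using the 12B model already mentioned):
```bash
python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b --load_8bit=True --debug=True --h2ocolors=False
```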

Note that if you download the model yourself and point `--base_model` to that location, you'll need to specify the `prompt_type` as well by running:
```
python generate.py --base_model=<user path> --load_8bit=True --prompt_type=human_bot
```
for some user path `<user path>`.
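
For instance, with a hypothetical local download directory (the path below is purely illustrative):
```bash
python generate.py --base_model=/path/to/h2ogpt-oasst1-512-12b --load_8bit=True --prompt_type=human_bot
```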

For quickly using a private document collection for Q/A, place documents (PDFs, text, etc.) into a folder called `user_path` and run
```bash
@@ -112,7 +122,7 @@ Any other instruct-tuned base models can be used, including non-h2oGPT ones. [L

CPU support is obtained after installing two optional requirements.txt files. GPU support is also present if one has GPUs.

1) Install base, langchain, and GPT4All dependencies:
1) Install base, langchain, GPT4All, and Python LLaMa dependencies:
```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
@@ -125,25 +135,27 @@ One can run `make req_constraints.txt` to ensure that the constraints file is co

2. Change `.env_gpt4all` model name if desired.
```.env_gpt4all
# model path and model_kwargs
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin
model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin
```
You can choose a different model than our default choice by browsing the GPT4All Model explorer for a [GPT4All-J compatible model](https://gpt4all.io/index.html). You do not need to download it manually; the gpt4all package will download it at runtime and put it into `.cache`, like huggingface would.
See [llama.cpp](https://github.com/ggerganov/llama.cpp) for instructions on getting a model for the `--base_model=llama` case.
For `gptj` and `gpt4all_llama`, you can choose a different model than our default choice by browsing the GPT4All Model explorer for a [GPT4All-J compatible model](https://gpt4all.io/index.html). One does not need to download manually; the gpt4all package will download the model at runtime and put it into `.cache`, like huggingface would. However, the `gptj` model often gives [no output](FAQ.md#gpt4all-not-producing-output), even outside h2oGPT.

So, for chatting, a better instruct fine-tuned LLaMa-based model for llama.cpp can be downloaded from [TheBloke](https://huggingface.co/TheBloke), for example [13B Vicuna Quantized](https://huggingface.co/TheBloke/wizardLM-13B-1.0-GGML) or [7B WizardLM Quantized](https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML). TheBloke offers a variety of model types, quantization bit widths, and memory footprints, so choose what is best for your system's specs. However, be aware that LLaMa-based models are not [commercially viable](FAQ.md#commercial-viability).

For the 7B case, download [WizardLM-7B-uncensored.ggmlv3.q8_0.bin](https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/blob/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin) into a local path. Then set `model_path_llama` in `.env_gpt4all` to that file, which is currently the default.
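
For example, one way to fetch that file (several GB) into the repository directory, assuming `wget` and the usual Hugging Face `resolve` download path:
```bash
cd h2ogpt
wget https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin
```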

3. Run generate.py

For LangChain support using documents in the `user_path` folder, run h2oGPT like:
```bash
python generate.py --base_model=gptj --score_model=None --langchain_mode='UserData' --user_path=user_path
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```
See [LangChain Readme](README_LangChain.md) for more details.
For no langchain support (the LangChain package is still used as a model wrapper), run as:
```bash
python generate.py --base_model=gptj --score_model=None
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None
```
However, the `gptj` model often gives [no output](FAQ.md#gpt4all-not-producing-output), even outside h2oGPT, so we recommend using a [llama.cpp](FAQ.md#cpu-with-no-avx2-or-using-llamacpp)-based model,
although such models perform much worse than standard non-quantized models.

#### MACOS
