CUDA does not work anymore with llama backend #840
Comments
In addition: I strace'd the local-ai process inside the container and found out that it is searching for
Try bumping your CUDA version to 12.2. It should work with 12.0 no issue, but I had the same error and upgrading to 12.2 made it disappear. I'm running it on an AW 17R5 with a GTX 1080 and an i9, driver version 535.
Updated my CUDA packages to 12.2 and switched the image to (Update: I had to revert to CUDA 11.8 and the 525 driver from the original Ubuntu repositories, because the drivers from Nvidia's repository seem not to be working correctly with Ubuntu 23.04.)
I think there is something wrong with the image, as other open source AI projects are working around these problems by using
Bumping this issue. I wrote my comment a bit too late on issue #812, so adding it here to hopefully get more support. I've now tested on two sets of hardware, my primary computer and my NAS, and both had the same issue.

PC 1:
PC 2:

My logs report the same "error when dialing" issue all the way back to 1.21.0; I haven't gone back further than that. I was using CUDA 12.2 and -cuda12. I'll try doing what djmaze mentioned and revert to the older CUDA version and driver to see if that works.
@Polkadoty To be clear, reverting the CUDA version did not fix this problem. Updating the CUDA version using Nvidia's repositories just prevented any CUDA stuff from working on my system, so I had to revert to at least make the other stuff work again.
Hmm weird, 1.23.2 cuda12 version works fine for me.
Maybe you need to update your nvidia driver?
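For reference, a quick way to check whether the host driver and the image's CUDA runtime agree (the CUDA base image tag below is only an example; any recent tag works):

# On the host: driver version reported by the kernel module
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Inside a throwaway CUDA container: confirms the NVIDIA runtime passes the GPU through at all
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

If the second command fails, or reports a lower supported CUDA version than the image expects, failures like the "CUDA driver version is insufficient for CUDA runtime version" error in this thread become much more likely.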
Hi, I have tried every possible way (from LocalAI's documentation, GitHub issues in the repo, hours of searching the internet, my own testing...) but I cannot get LocalAI running on the GPU. I have tested quay images from master back to v1.21, but none is working for me. Please help. Here is my setup:

On my Docker host:
# nvidia-smi
Wed Aug 16 09:22:26 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 On | 00000000:03:00.0 Off | N/A |
| 0% 28C P8 7W / 220W | 10MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:0A:00.0 Off | N/A |
| 0% 25C P8 5W / 370W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:0B:00.0 Off | N/A |
| 0% 26C P8 6W / 370W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 3070 On | 00000000:0C:00.0 Off | N/A |
| 0% 29C P8 6W / 220W | 10MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1280 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1280 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1280 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1280 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+

In the docker container:
root@96bdf1a0d925:/build# ls -alF
total 7399444
drwxr-xr-x 2 root root 4096 Aug 3 09:55 ./
drwxr-xr-x 1 root root 4096 Aug 16 07:35 ../
-rw-r--r-- 1 root root 3785248281 Apr 15 13:17 ggml-gpt4all-j
-rw-r--r-- 1 root root 257 Aug 16 07:35 gpt-3.5-turbo.yaml
-rw-r--r-- 1 root root 3791749248 Aug 3 08:53 open-llama-7b-q4_0.bin
-rw-r--r-- 1 root root 18 Aug 16 07:35 openllama-chat.tmpl
-rw-r--r-- 1 root root 48 Aug 16 07:35 openllama-completion.tmpl
root@96bdf1a0d925:/build# nvidia-smi
Tue Aug 15 16:34:32 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 On | 00000000:03:00.0 Off | N/A |
| 0% 30C P8 7W / 220W | 10MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:0A:00.0 Off | N/A |
| 0% 28C P8 5W / 370W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:0B:00.0 Off | N/A |
| 0% 28C P8 6W / 370W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 3070 On | 00000000:0C:00.0 Off | N/A |
| 0% 31C P8 6W / 220W | 10MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I have

Here is the

Master branch:
docker run --rm -ti --gpus all -p 51080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=8 -e BUILD_TYPE=cublas -e REBUILD=false -v /data/EXAMPLE/containers/apps/localai/models:/models quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Rebuild: Off
docker run --rm -ti --gpus all -p 51080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=8 -e BUILD_TYPE=cublas -e REBUILD=false -v /data/EXAMPLE/containers/apps/localai/models:/models quay.io/go-skynet/local-ai:v1.24.1-cublas-cuda12-ffmpeg

Rebuild: On
docker run --rm -ti --gpus all -p 51080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=8 -e BUILD_TYPE=cublas -e REBUILD=true -v /data/EXAMPLE/containers/apps/localai/models:/models quay.io/go-skynet/local-ai:v1.24.1-cublas-cuda12-ffmpeg

Example output of an execution (NOTE: it gives errors):
@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name : Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
CPU: AVX found OK
CPU: AVX2 found OK
CPU: no AVX512 found
@@@@@
4:50PM INF Starting LocalAI using 8 threads, with models path: /models
4:50PM INF LocalAI version: v1.24.1 (9cc8d9086580bd2a96f5c96a6b873242879c70bc)
4:50PM DBG Model: gpt-3.5-turbo (config: {PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false} Name:gpt-3.5-turbo F16:true Threads:0 Debug:false Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 NUMA:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false} Step:0})
4:50PM DBG Extracting backend assets files to /tmp/localai/backend_data
4:50PM DBG Config overrides map[batch:512 f16:true gpu_layers:35 mmap:true]
4:50PM DBG Checking "open-llama-7b-q4_0.bin" exists and matches SHA
4:50PM DBG File "open-llama-7b-q4_0.bin" already exists and matches the SHA. Skipping download
4:50PM DBG Prompt template "openllama-completion" written
4:50PM DBG Prompt template "openllama-chat" written
4:50PM DBG Written config file /models/gpt-3.5-turbo.yaml
┌───────────────────────────────────────────────────┐
│ Fiber v2.48.0 │
│ http://127.0.0.1:8080 │
│ (bound on host 0.0.0.0 and port 8080) │
│ │
│ Handlers ............ 56 Processes ........... 1 │
│ Prefork ....... Disabled PID ................ 14 │
└───────────────────────────────────────────────────┘
[127.0.0.1]:43530 200 - GET /readyz
[127.0.0.1]:41682 200 - GET /readyz
4:52PM DBG Request received:
4:52PM DBG `input`: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false} Context:context.Background.WithCancel Cancel:0x4b9060 File: ResponseFormat: Size: Prompt:A long time ago in a galaxy far, far away Instruction: Input:<nil> Stop:<nil> Messages:[] Functions:[] FunctionCall:<nil> Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject:<nil> Backend: ModelBaseName:}
4:52PM DBG Parameter Config: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.7 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false} Name: F16:false Threads:8 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[A long time ago in a galaxy far, far away] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false} Step:0}
4:52PM DBG Loading model 'open-llama-7b-q4_0.bin' greedly from all the available backends: llama, gpt4all, falcon, gptneox, bert-embeddings, falcon-ggml, gptj, gpt2, dolly, mpt, replit, starcoder, bloomz, rwkv, whisper, stablediffusion, piper, /build/extra/grpc/exllama/exllama.py, /build/extra/grpc/huggingface/huggingface.py, /build/extra/grpc/autogptq/autogptq.py, /build/extra/grpc/bark/ttsbark.py, /build/extra/grpc/diffusers/backend_diffusers.py
4:52PM DBG [llama] Attempting to load
4:52PM DBG Loading model llama from open-llama-7b-q4_0.bin
4:52PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
4:52PM DBG Loading GRPC Model llama: {backendString:llama model:open-llama-7b-q4_0.bin threads:8 assetDir:/tmp/localai/backend_data context:0xc00003e0b0 gRPCOptions:0xc0001c2000 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
4:52PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
4:52PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:40559'
4:52PM DBG GRPC Service state dir: /tmp/go-processmanager1136322216
4:52PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40559: connect: connection refused"
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:40559): stderr 2023/08/15 16:52:31 gRPC Server listening at 127.0.0.1:40559
4:52PM DBG GRPC Service Ready
4:52PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:open-llama-7b-q4_0.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/open-llama-7b-q4_0.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false}
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:40559): stderr create_gpt_params: loading model /models/open-llama-7b-q4_0.bin
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:40559): stderr CUDA error 999 at /build/go-llama/llama.cpp/ggml-cuda.cu:4235: unknown error
4:52PM DBG [llama] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
4:52PM DBG [gpt4all] Attempting to load
4:52PM DBG Loading model gpt4all from open-llama-7b-q4_0.bin
4:52PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
4:52PM DBG Loading GRPC Model gpt4all: {backendString:gpt4all model:open-llama-7b-q4_0.bin threads:8 assetDir:/tmp/localai/backend_data context:0xc00003e0b0 gRPCOptions:0xc0001c2000 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
4:52PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all
4:52PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:33361'
4:52PM DBG GRPC Service state dir: /tmp/go-processmanager940124080
4:52PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33361: connect: connection refused"
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr 2023/08/15 16:52:32 gRPC Server listening at 127.0.0.1:33361
4:52PM DBG GRPC Service Ready
4:52PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:open-llama-7b-q4_0.bin ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/open-llama-7b-q4_0.bin Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false}
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: format = ggjt v3 (latest)
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_vocab = 32000
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_ctx = 2048
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_embd = 4096
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_mult = 256
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_head = 32
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_layer = 32
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_rot = 128
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: ftype = 2 (mostly Q4_0)
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_ff = 11008
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: n_parts = 1
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: model size = 7B
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: ggml ctx size = 0.07 MB
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
4:52PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:33361): stderr llama_new_context_with_model: kv self size = 1024.00 MB
4:52PM DBG [gpt4all] Loads OK
[127.0.0.1]:33344 200 - GET /readyz
4:54PM DBG Response: {"object":"text_completion","model":"open-llama-7b-q4_0.bin","choices":[{"index":0,"finish_reason":"stop","text":"… a film called Star Wars: The Force Awakens was released in theaters.\nThe film has become the biggest box office hit of all time, but that wasn’t always the case.\nThe movie was originally slated to be released in April of 2015, but Disney decided to push it back for a few months.\nThe film was originally supposed to be released in December of 2015, but Disney decided to move it back to March of 2016.\nThe film was finally released on March 17, 2016 and has since grossed over $2 billion worldwide.\nThe Force Awakens was originally slated for release in December 2015, but Disney delayed it to April 2016.\nThe movie was released in theaters on March 16, 2017.\nThe film was originally slated for release on March 24, 2018.\nThe film was originally scheduled for release on April 2, 2019.\nThe film was released on April 1, 2020.\nThe film was released on December 21, 2020.\nThe film was released on December 23, 2021.\nThe film was released on December 24, 2022.\nThe film was released on December 31, 2024.\nThe film was released on January 1, 2025.\nThe film was released on February 1, 2026.\nThe film was released on March 1, 2027.\nThe film was released on April 1, 2028.\nThe film was released on May 1, 2029.\nThe film was released on June 1, 2030.\nThe film was released on July 1, 2031.\nThe film was released on August 1, 2032.\nThe film was released on September 1, 2033.\nThe film was released on October 1, 2034.\nThe film was released on November 1, 2035.\nThe film was released on December 1, 2036.\nThe film was released on January 1, 2037"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[192.168.215.248]:62738 200 - POST /v1/completions
[127.0.0.1]:36808 200 - GET /readyz
Could you try and see if this helps you out: https://cloud.apex-migrations.net/s/8sTpCjG44jqxcyw I've created a folder with some tmpl files and a yml file, which is a config for the model binary.
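For anyone who cannot access that link: judging from the debug output earlier in this thread, the model config presumably looks roughly like the sketch below. This is reconstructed from the PRELOAD_MODELS overrides and the logged parameters, not the actual shared file, and field placement can differ between LocalAI versions:

name: gpt-3.5-turbo
backend: llama
parameters:
  model: open-llama-7b-q4_0.bin
f16: true
gpu_layers: 35
mmap: true
batch: 512
context_size: 1024
template:
  chat: openllama-chat
  completion: openllama-completion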
I'm afraid I'm having the same problem. After getting it to work with just the CPU, I am now trying to get it to work with an NVIDIA GPU as well. So I set up a new VM with Debian 12 and installed Docker, the NVIDIA drivers, the container toolkit etc. from scratch.

nvidia-smi output:
+-----------------------------------------------------------------------------+

I set up local.ai as per instructions and loaded the model as per instructions. But when I run a query, it returns:

And in the logs I see:

The only thing I can think of out of the ordinary is that I am trying to run this in a rootless Docker container (and to get that working I needed to toggle off cgroups). Any solution in sight? Thanks.
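For the rootless Docker case specifically: the "toggle off cgroups" step usually corresponds to the following setting in the NVIDIA container toolkit config (a sketch using the commonly documented path and section name; verify against your toolkit version):

# /etc/nvidia-container-runtime/config.toml
[nvidia-container-cli]
no-cgroups = true

followed by a restart of the (rootless) Docker daemon.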
So I recreated everything with rootful Docker, but it just does not work. Same errors as before. Has anybody figured it out yet? What does it mean that the gRPC service is refusing the connection?
On my install I also have the same error, but my model does offload to the GPU just fine. Make sure all your YML config files are saved as YAML (.yaml); I had problems getting my YAML config files detected.
Thanks for the feedback. May I ask (because I am a noob and really wouldn't know how to tell): how do you know that your model offloads to the GPU? Do you get a log entry like "model successfully loaded"? And does it work for you overall (despite the error), or does it not work? If it does work, can I please ask what your system environment looks like (bare metal or VM, operating system including version, rootful or rootless Docker including version, or direct build, GPU driver version, etc.)? I would like to try and replicate it here. I actually did rename my docker-compose.yaml to .yml (as this is what I am used to), but it didn't work before that already, and that should not have an impact, I'm guessing. My lunademo.yaml is actually a .yaml (and I will remember not to change that). Thanks!
Yes, the logs should indeed indicate that the model has been successfully loaded (depending on the number of layers, it offloaded some data to your VRAM). You could also run "watch nvidia-smi" on the host to monitor what happens at the driver level of your GPU. It should indicate when a model was loaded and offloaded.
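A couple of examples of what that monitoring can look like (standard nvidia-smi invocations, nothing LocalAI-specific):

# refresh every second; a loaded model shows up as a process with real VRAM usage
watch -n 1 nvidia-smi

# or just the memory numbers per GPU
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv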
Okay, I found the problem in my case. I am using swarm mode and it turns out I needed to explicitly set the env variable. It seems to me all the other problems reported here have different causes, so I will close this issue. Feel free to open new issues as necessary.
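For reference: in swarm mode the --gpus flag of docker run is not available, so GPU visibility is usually controlled through the environment instead. The exact variable is cut off in the comment above; NVIDIA_VISIBLE_DEVICES is the one such deployments commonly need, so treat its use here as an assumption. A minimal stack-file sketch reusing the image and paths from this thread:

services:
  localai:
    image: quay.io/go-skynet/local-ai:v1.24.1-cublas-cuda12-ffmpeg
    environment:
      # Assumption: the variable referred to above is not named in the comment
      - NVIDIA_VISIBLE_DEVICES=all
      - BUILD_TYPE=cublas
      - MODELS_PATH=/models
    volumes:
      - /data/EXAMPLE/containers/apps/localai/models:/models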
LocalAI version:
quay.io/go-skynet/local-ai:v1.22.0-cublas-cuda11
Environment, CPU architecture, OS, and Version:
Linux glados 6.2.0-26-generic #26-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 10 23:39:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0
RTX 3090, Ubuntu 23.04
Describe the bug
Previously, I had the v1.18.0 image with cuda11 running correctly. Now, after updating the image to v1.22.0, I get the following error in the debug log when trying to do a chat completion with a llama-based model:
stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:2478: CUDA driver version is insufficient for CUDA runtime version
To Reproduce
Start the container with PRELOAD_MODELS set to e.g.
'[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]'
Expected behavior
The completion result is returned.
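For reference, the kind of request behind the logs above looks roughly like this (host and port depend on your own mapping; the service listens on 8080 inside the container):

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "prompt": "A long time ago in a galaxy far, far away", "temperature": 0.7}'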
Logs
Additional context