
[bug] ollama returns garbage for longer texts #12702

Open
frzifus opened this issue Jan 12, 2025 · 3 comments
frzifus commented Jan 12, 2025

I have installed ollama on a system with an Intel Arc A770 and loaded llama3.2:3b.
The initial loading of the model takes a long time, but it works.
Initial requests are successfully answered at ~1000 t/s. As the chat continues, things get a bit weird: in the middle of a story, the text turned into JavaScript and then into pure garbage.

(screenshot of the garbled chat output attached)

That's the deployment I used:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: open-webui-config
  namespace: ollama
data:
  OLLAMA_BASE_URL: "http://ollama:11434"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: intelanalytics/ipex-llm-inference-cpp-xpu:2.2.0-SNAPSHOT
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0:11434"
            - name: ZES_ENABLE_SYSMAN
              value: "1"
            - name: OLLAMA_INTEL_GPU
              value: "true"
          command:
            - /bin/sh
            - -c
            - |
              mkdir -p /llm/ollama
              cd /llm/ollama
              init-ollama
              ./ollama serve
          ports:
            - containerPort: 11434
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /root/.ollama
              name: ollama-data
          resources:
            requests:
              memory: "4096Mi"
              cpu: "1"
            limits:
              cpu: "4"
              memory: "8192Mi"
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-data

Logs:

found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.6|    512|    1024|   32| 16225M|            1.3.31294|
llama_kv_cache_init:      SYCL0 KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     2.00 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   256.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 790
llama_new_context_with_model: graph splits = 2
time=2025-01-12T19:31:36.457+08:00 level=WARN source=runner.go:894 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-01-12T19:31:45.794+08:00 level=INFO source=server.go:619 msg="llama runner started in 11.28 seconds"
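The KV-cache figure in the log can be sanity-checked against the model's architecture. Below is a minimal sketch, assuming llama3.2:3b's published dimensions (28 layers, 8 KV heads, head dim 128) and an f16 cache; with an effective context of 8192 tokens (e.g. a 2048-token window multiplied by four parallel slots, which ollama allocates by default) the formula reproduces the 896 MiB SYCL0 KV buffer reported above:

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, each holding
    n_ctx * n_kv_heads * head_dim elements (f16 = 2 bytes per element)."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed llama3.2:3b dimensions: 28 layers, 8 KV heads, head dim 128.
size = kv_cache_bytes(n_layers=28, n_ctx=8192, n_kv_heads=8, head_dim=128)
print(size / (1024 * 1024))  # 896.0 MiB, matching the log
```

This also shows why a larger context window eats VRAM quickly: the cache grows linearly with `n_ctx`.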

@sgwhat sgwhat self-assigned this Jan 13, 2025
ACupofAir (Collaborator) commented
We could not reproduce the problem; the output remains normal after multiple rounds of sessions.


frzifus commented Jan 14, 2025

Hmm, let me try again and get back to you.


frzifus commented Jan 17, 2025

It worked until this log line occurred:

time=2025-01-17T09:34:37.836+08:00 level=WARN source=runner.go:129 msg="truncating input prompt" limit=2048 prompt=2175 keep=5 new=2048
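That WARN line means the prompt (2175 tokens) exceeded the runner's context limit (2048), so it was truncated. A rough sketch of the behavior the log fields suggest, assuming the runner keeps the first `keep` tokens and fills the remainder from the tail of the prompt (the actual ollama runner logic may differ):

```python
def truncate_prompt(tokens, limit, keep):
    """Drop tokens from the middle of the prompt: keep the first `keep`
    tokens and as many trailing tokens as still fit within `limit`."""
    if len(tokens) <= limit:
        return tokens
    return tokens[:keep] + tokens[-(limit - keep):]

# Reproduce the numbers from the log: prompt=2175, limit=2048, keep=5.
prompt = list(range(2175))
new = truncate_prompt(prompt, limit=2048, keep=5)
print(len(new))  # 2048, matching new=2048 in the log line
```

Truncation alone should degrade coherence, not produce outright garbage, which is consistent with the later finding that this line was a red herring.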

Update

It seems to be unrelated to the log line listed above.
The next test did not log anything, yet the following error occurred:

(screenshot of the error attached)

Update

It appears to be a problem when VRAM fills up: as soon as I reduce the context length, the problem disappears.
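For reference, one way to pin a smaller context window is to bake `num_ctx` into a derived model via a Modelfile (a sketch; the derived model name and the value 2048 are arbitrary choices, not from this thread):

```
# Modelfile: derive a variant of llama3.2:3b with a smaller context window
FROM llama3.2:3b
PARAMETER num_ctx 2048
```

Build it with `ollama create llama3.2-3b-small -f Modelfile` and chat against that model; alternatively, API clients can pass `options.num_ctx` per request.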

3 participants