
progress stuck while doing Entity Extraction #76

Open
Greenhub-1215 opened this issue Oct 12, 2024 · 4 comments

@Greenhub-1215

Sometimes, while using nano-graphrag, progress gets stuck during the Entity Extraction step.
When it hangs, the terminal output looks like this:

"Processed 26 chunks, 378 entities found(duplicated), 263 relations(duplicated)"

The demo.py code is shown below:

import os
import sys

sys.path.append("..")
import logging
import ollama
import numpy as np
from nano_graphrag import GraphRAG, QueryParam
from nano_graphrag.base import BaseKVStorage
from nano_graphrag._utils import compute_args_hash, wrap_embedding_func_with_attrs

logging.basicConfig(level=logging.WARNING)
logging.getLogger("nano-graphrag").setLevel(logging.INFO)

# Assumed LLM model settings
MODEL = "qwen2.5:ctx32k"

# Assumed embedding model settings
EMBEDDING_MODEL = "bge-m3"
EMBEDDING_MODEL_DIM = 1024
EMBEDDING_MODEL_MAX_TOKENS = 8192


async def ollama_model_if_cache(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    # remove kwargs that are not supported by ollama
    kwargs.pop("max_tokens", None)
    kwargs.pop("response_format", None)

    ollama_client = ollama.AsyncClient()
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})

    # Return the cached response if one exists ----------------------
    hashing_kv: BaseKVStorage = kwargs.pop("hashing_kv", None)
    messages.extend(history_messages)
    messages.append({"role": "user", "content": prompt})
    if hashing_kv is not None:
        args_hash = compute_args_hash(MODEL, messages)
        if_cache_return = await hashing_kv.get_by_id(args_hash)
        if if_cache_return is not None:
            return if_cache_return["return"]
    # -----------------------------------------------------
    response = await ollama_client.chat(model=MODEL, messages=messages, **kwargs)

    result = response["message"]["content"]
    # Cache the response for future calls ----------------------
    if hashing_kv is not None:
        await hashing_kv.upsert({args_hash: {"return": result, "model": MODEL}})
    # -----------------------------------------------------
    return result


def remove_if_exist(file):
    if os.path.exists(file):
        os.remove(file)


WORKING_DIR = "./nano_graphrag_cache_1"


def query():
    rag = GraphRAG(
        working_dir=WORKING_DIR,
        best_model_func=ollama_model_if_cache,
        cheap_model_func=ollama_model_if_cache,
        embedding_func=ollama_embedding,
    )
    print(
        rag.query(
            "第一章主要讲述了什么内容?", param=QueryParam(mode="global")
        )
    )


def insert():
    from time import time

    with open("./计算机网络(第7版)-谢希仁_第一章.docx
[计算机网络(第7版)-谢希仁_第一章.docx](https://github.com/user-attachments/files/17350050/7.-._.docx)
", encoding="utf-8-sig") as f:
        FAKE_TEXT = f.read()

    remove_if_exist(f"{WORKING_DIR}/vdb_entities.json")
    remove_if_exist(f"{WORKING_DIR}/kv_store_full_docs.json")
    remove_if_exist(f"{WORKING_DIR}/kv_store_text_chunks.json")
    remove_if_exist(f"{WORKING_DIR}/kv_store_community_reports.json")
    remove_if_exist(f"{WORKING_DIR}/graph_chunk_entity_relation.graphml")

    rag = GraphRAG(
        working_dir=WORKING_DIR,
        enable_llm_cache=True,
        best_model_func=ollama_model_if_cache,
        cheap_model_func=ollama_model_if_cache,
        embedding_func=ollama_embedding,
    )
    start = time()
    rag.insert(FAKE_TEXT)
    print("indexing time:", time() - start)
    # rag = GraphRAG(working_dir=WORKING_DIR, enable_llm_cache=True)
    # rag.insert(FAKE_TEXT[half_len:])


# We're using Ollama to generate embeddings with the BGE model
@wrap_embedding_func_with_attrs(
    embedding_dim=EMBEDDING_MODEL_DIM,
    max_token_size=EMBEDDING_MODEL_MAX_TOKENS,
)
async def ollama_embedding(texts: list[str]) -> np.ndarray:
    embed_text = []
    for text in texts:
        data = ollama.embeddings(model=EMBEDDING_MODEL, prompt=text)
        embed_text.append(data["embedding"])

    return np.array(embed_text)  # return an ndarray, matching the declared type


if __name__ == "__main__":
    insert()
    query()

The input file, [计算机网络(第7版)-谢希仁_第一章.docx](https://github.com/user-attachments/files/17350050/7.-._.docx), is in Simplified Chinese.
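
Not a fix, but a way to make the hang visible: a minimal sketch (assuming the stall is a single Ollama request that never returns) that wraps the demo's ollama_model_if_cache in asyncio.wait_for, so a stuck call raises TimeoutError instead of stalling silently. The 300-second budget is an arbitrary placeholder, not a recommended value:

import asyncio

REQUEST_TIMEOUT_S = 300  # arbitrary per-request budget; tune for your hardware

async def ollama_model_with_timeout(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    # Delegates to ollama_model_if_cache from demo.py above, but bounds each
    # call so a wedged request surfaces as an error instead of a silent hang.
    return await asyncio.wait_for(
        ollama_model_if_cache(
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            **kwargs,
        ),
        timeout=REQUEST_TIMEOUT_S,
    )

Passing this as best_model_func/cheap_model_func in place of ollama_model_if_cache should turn the stall into a traceback that shows which request never came back.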

@Sangeeth123sj

Ollama gets stuck. Using smaller text inputs lets the process complete, but it still doesn't extract entities well.
OpenAI seems like the only reliable option.
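
To check whether Ollama itself is the bottleneck rather than nano-graphrag, one rough probe is to send a single large prompt to the same model outside the pipeline and time it. A sketch, with the model tag taken from demo.py above and the prompt size an arbitrary stand-in for one chunk:

import time
import ollama

MODEL = "qwen2.5:ctx32k"  # same model tag as in demo.py

# Filler text roughly the size of one entity-extraction request
prompt = "List the entities and relations in this text:\n" + ("计算机网络 " * 2000)

start = time.time()
resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
print(f"latency: {time.time() - start:.1f}s")
print(resp["message"]["content"][:200])

If this call also hangs or takes many minutes, the issue is the model and context size on your hardware rather than the library.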

@rangehow
Collaborator

Same here :( Hope someone can give us some advice.

@lxz12

lxz12 commented Nov 12, 2024

Same here. It gets stuck at this point:

nohup: ignoring input
INFO:datasets:PyTorch version 2.5.1 available.
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-large-zh-v1.5
INFO:nano-graphrag:Load KV full_docs with 0 data
INFO:nano-graphrag:Load KV text_chunks with 0 data
INFO:nano-graphrag:Load KV llm_response_cache with 0 data
INFO:nano-graphrag:Load KV community_reports with 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1024, 'metric': 'cosine', 'storage_file': './nano_graphrag_cache_ollama_TEST/vdb_entities.json'} 0 data
INFO:nano-graphrag:[New Docs] inserting 1 docs
INFO:nano-graphrag:[New Chunks] inserting 44 chunks
INFO:nano-graphrag:[Entity Extraction]...
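
In case it helps triage: turning up logging (nano-graphrag's own logger, plus httpx, which the ollama Python client uses for transport) shows whether requests are still leaving while the progress line sits still. A small snippet, assuming the default logger names:

import logging

logging.basicConfig(level=logging.WARNING)
logging.getLogger("nano-graphrag").setLevel(logging.DEBUG)
# The ollama client issues HTTP requests through httpx; its DEBUG logs
# reveal whether calls are still being made during the apparent hang.
logging.getLogger("httpx").setLevel(logging.DEBUG)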

@lxz12

lxz12 commented Nov 12, 2024


The log in my previous comment is from using Ollama for the answer model, as well as the embedding model.
When I use gpt-3.5-turbo, it gets stuck at:

INFO:nano-graphrag:Inserting 1751 vectors to entities
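
For the stall at "Inserting 1751 vectors to entities", one way to see whether the embedding function is the slow part is to log per-text timing. A sketch reusing the ollama_embedding loop from demo.py above (the timing print is an addition; when passed to GraphRAG it would still need the same wrap_embedding_func_with_attrs decorator):

import time
import numpy as np
import ollama

EMBEDDING_MODEL = "bge-m3"

async def ollama_embedding_timed(texts: list[str]) -> np.ndarray:
    # Same loop as the demo's ollama_embedding, with per-text timing so a
    # stall can be pinned to a specific input instead of the whole step.
    embed_text = []
    for i, text in enumerate(texts):
        t0 = time.time()
        data = ollama.embeddings(model=EMBEDDING_MODEL, prompt=text)
        embed_text.append(data["embedding"])
        print(f"embedded {i + 1}/{len(texts)} in {time.time() - t0:.2f}s")
    return np.array(embed_text)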
