
progress stuck while doing Entity Extraction #76

Open
Greenhub-1215 opened this issue Oct 12, 2024 · 4 comments

@Greenhub-1215

Sometimes, while using nano-graphrag, progress gets stuck during the Entity Extraction step.
When it hangs, the terminal output looks like this:

"Processed 26 chunks, 378 entities found(duplicated), 263 relations(duplicated)"

The demo.py code is shown below:

import os
import sys

sys.path.append("..")
import logging
import ollama
import numpy as np
from nano_graphrag import GraphRAG, QueryParam
from nano_graphrag.base import BaseKVStorage
from nano_graphrag._utils import compute_args_hash, wrap_embedding_func_with_attrs

logging.basicConfig(level=logging.WARNING)
logging.getLogger("nano-graphrag").setLevel(logging.INFO)

# Assumed LLM model settings
MODEL = "qwen2.5:ctx32k"

# Assumed embedding model settings
EMBEDDING_MODEL = "bge-m3"
EMBEDDING_MODEL_DIM = 1024
EMBEDDING_MODEL_MAX_TOKENS = 8192


async def ollama_model_if_cache(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    # remove kwargs that are not supported by ollama
    kwargs.pop("max_tokens", None)
    kwargs.pop("response_format", None)

    ollama_client = ollama.AsyncClient()
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})

    # Return the cached response if one exists ----------------------
    hashing_kv: BaseKVStorage = kwargs.pop("hashing_kv", None)
    messages.extend(history_messages)
    messages.append({"role": "user", "content": prompt})
    if hashing_kv is not None:
        args_hash = compute_args_hash(MODEL, messages)
        if_cache_return = await hashing_kv.get_by_id(args_hash)
        if if_cache_return is not None:
            return if_cache_return["return"]
    # -----------------------------------------------------
    response = await ollama_client.chat(model=MODEL, messages=messages, **kwargs)

    result = response["message"]["content"]
    # Cache the response for future calls ----------------------
    if hashing_kv is not None:
        await hashing_kv.upsert({args_hash: {"return": result, "model": MODEL}})
    # -----------------------------------------------------
    return result


def remove_if_exist(file):
    if os.path.exists(file):
        os.remove(file)


WORKING_DIR = "./nano_graphrag_cache_1"


def query():
    rag = GraphRAG(
        working_dir=WORKING_DIR,
        best_model_func=ollama_model_if_cache,
        cheap_model_func=ollama_model_if_cache,
        embedding_func=ollama_embedding,
    )
    print(
        rag.query(
            "第一章主要讲述了什么内容?", param=QueryParam(mode="global")
        )
    )


def insert():
    from time import time

    with open("./计算机网络(第7版)-谢希仁_第一章.docx
[计算机网络(第7版)-谢希仁_第一章.docx](https://github.com/user-attachments/files/17350050/7.-._.docx)
", encoding="utf-8-sig") as f:
        FAKE_TEXT = f.read()

    remove_if_exist(f"{WORKING_DIR}/vdb_entities.json")
    remove_if_exist(f"{WORKING_DIR}/kv_store_full_docs.json")
    remove_if_exist(f"{WORKING_DIR}/kv_store_text_chunks.json")
    remove_if_exist(f"{WORKING_DIR}/kv_store_community_reports.json")
    remove_if_exist(f"{WORKING_DIR}/graph_chunk_entity_relation.graphml")

    rag = GraphRAG(
        working_dir=WORKING_DIR,
        enable_llm_cache=True,
        best_model_func=ollama_model_if_cache,
        cheap_model_func=ollama_model_if_cache,
        embedding_func=ollama_embedding,
    )
    start = time()
    rag.insert(FAKE_TEXT)
    print("indexing time:", time() - start)
    # rag = GraphRAG(working_dir=WORKING_DIR, enable_llm_cache=True)
    # rag.insert(FAKE_TEXT[half_len:])


# We're using Ollama to generate embeddings with the BGE model
@wrap_embedding_func_with_attrs(
    embedding_dim=EMBEDDING_MODEL_DIM,
    max_token_size=EMBEDDING_MODEL_MAX_TOKENS,
)
async def ollama_embedding(texts: list[str]) -> np.ndarray:
    embed_text = []
    for text in texts:
        data = ollama.embeddings(model=EMBEDDING_MODEL, prompt=text)
        embed_text.append(data["embedding"])

    return np.array(embed_text)  # return an ndarray, matching the declared type


if __name__ == "__main__":
    insert()
    query()

The input file, [计算机网络(第7版)-谢希仁_第一章.docx](https://github.com/user-attachments/files/17350050/7.-._.docx), is in Simplified Chinese.
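
Not a fix, but a way to make the hang visible: a minimal sketch (assuming the stall is a single Ollama request that never returns) that wraps the demo's ollama_model_if_cache in asyncio.wait_for, so a stuck call raises TimeoutError instead of stalling silently. The 300-second budget is an arbitrary placeholder, not a recommended value:

import asyncio

REQUEST_TIMEOUT_S = 300  # arbitrary per-request budget; tune for your hardware

async def ollama_model_with_timeout(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    # Delegates to ollama_model_if_cache from demo.py above, but bounds each
    # call so a wedged request surfaces as an error instead of a silent hang.
    return await asyncio.wait_for(
        ollama_model_if_cache(
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            **kwargs,
        ),
        timeout=REQUEST_TIMEOUT_S,
    )

Passing this as best_model_func/cheap_model_func in place of ollama_model_if_cache should turn the stall into a traceback that shows which request never came back.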

@Sangeeth123sj

Ollama gets stuck. Using smaller text inputs lets the process complete, but it still doesn't extract entities well.
OpenAI seems like the only reliable option.
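
To check whether Ollama itself is the bottleneck rather than nano-graphrag, one rough probe is to send a single large prompt to the same model outside the pipeline and time it. A sketch, with the model tag taken from demo.py above and the prompt size an arbitrary stand-in for one chunk:

import time
import ollama

MODEL = "qwen2.5:ctx32k"  # same model tag as in demo.py

# Filler text roughly the size of one entity-extraction request
prompt = "List the entities and relations in this text:\n" + ("计算机网络 " * 2000)

start = time.time()
resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
print(f"latency: {time.time() - start:.1f}s")
print(resp["message"]["content"][:200])

If this call also hangs or takes many minutes, the issue is the model and context size on your hardware rather than the library.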

@rangehow
Collaborator

Same here :( Hope someone can give us some advice.

@lxz12

lxz12 commented Nov 12, 2024

Same here. It gets stuck at this point:

nohup: ignoring input
INFO:datasets:PyTorch version 2.5.1 available.
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-large-zh-v1.5
INFO:nano-graphrag:Load KV full_docs with 0 data
INFO:nano-graphrag:Load KV text_chunks with 0 data
INFO:nano-graphrag:Load KV llm_response_cache with 0 data
INFO:nano-graphrag:Load KV community_reports with 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1024, 'metric': 'cosine', 'storage_file': './nano_graphrag_cache_ollama_TEST/vdb_entities.json'} 0 data
INFO:nano-graphrag:[New Docs] inserting 1 docs
INFO:nano-graphrag:[New Chunks] inserting 44 chunks
INFO:nano-graphrag:[Entity Extraction]...
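
In case it helps triage: turning up logging (nano-graphrag's own logger, plus httpx, which the ollama Python client uses for transport) shows whether requests are still leaving while the progress line sits still. A small snippet, assuming the default logger names:

import logging

logging.basicConfig(level=logging.WARNING)
logging.getLogger("nano-graphrag").setLevel(logging.DEBUG)
# The ollama client issues HTTP requests through httpx; its DEBUG logs
# reveal whether calls are still being made during the apparent hang.
logging.getLogger("httpx").setLevel(logging.DEBUG)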

@lxz12

lxz12 commented Nov 12, 2024


The log in my previous comment is from using Ollama for the answer model, as well as the embedding model.
When I use gpt-3.5-turbo, it gets stuck at:

INFO:nano-graphrag:Inserting 1751 vectors to entities
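
For the stall at "Inserting 1751 vectors to entities", one way to see whether the embedding function is the slow part is to log per-text timing. A sketch reusing the ollama_embedding loop from demo.py above (the timing print is an addition; when passed to GraphRAG it would still need the same wrap_embedding_func_with_attrs decorator):

import time
import numpy as np
import ollama

EMBEDDING_MODEL = "bge-m3"

async def ollama_embedding_timed(texts: list[str]) -> np.ndarray:
    # Same loop as the demo's ollama_embedding, with per-text timing so a
    # stall can be pinned to a specific input instead of the whole step.
    embed_text = []
    for i, text in enumerate(texts):
        t0 = time.time()
        data = ollama.embeddings(model=EMBEDDING_MODEL, prompt=text)
        embed_text.append(data["embedding"])
        print(f"embedded {i + 1}/{len(texts)} in {time.time() - t0:.2f}s")
    return np.array(embed_text)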
