Using a TextSplitter on multiple documents with filetype="recursive_paths" fails #11

Closed
rfishermonteith opened this issue Nov 11, 2024 · 1 comment

rfishermonteith commented Nov 11, 2024

Using a TextSplitter on multiple documents with filetype="recursive_paths" fails with the error below.

This seems to be fixed by changing https://github.com/thiswillbeyourgithub/wdoc/blame/main/wdoc/utils/misc.py#L459 to:

return text_splitters[task][modelname] 
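
To illustrate why that one-line change would matter, here is a minimal sketch of the failure mode. It assumes the splitters are cached in a nested dict keyed by task and then by model name, and that the cache-hit path returned the inner dict instead of the splitter. I have not confirmed that this is what the wdoc source does. `FakeSplitter`, `get_splitter_buggy` and `get_splitter_fixed` are hypothetical names made up for the example. If the assumption holds, it would also explain why only the second and later documents fail.

```python
from typing import Dict, List


class FakeSplitter:
    """Hypothetical stand-in for a text splitter; it only has the method named in the traceback."""

    def __init__(self, modelname: str) -> None:
        self.modelname = modelname

    def transform_documents(self, docs: List[str]) -> List[str]:
        # A real splitter would chunk the documents; here they pass through unchanged.
        return list(docs)


# Assumed layout of the cache: task -> modelname -> splitter.
text_splitters: Dict[str, Dict[str, FakeSplitter]] = {}


def get_splitter_buggy(task: str, modelname: str):
    """Assumed shape of the bug: a cache hit returns the inner dict."""
    if task in text_splitters and modelname in text_splitters[task]:
        return text_splitters[task]  # a dict, not a splitter
    splitter = FakeSplitter(modelname)
    text_splitters.setdefault(task, {})[modelname] = splitter
    return splitter


def get_splitter_fixed(task: str, modelname: str) -> FakeSplitter:
    """Same lookup, with the return suggested in this issue."""
    if task in text_splitters and modelname in text_splitters[task]:
        return text_splitters[task][modelname]
    splitter = FakeSplitter(modelname)
    text_splitters.setdefault(task, {})[modelname] = splitter
    return splitter
```

With the buggy variant, the first call builds and returns a real splitter, and the second call with the same arguments returns a dict. Calling `transform_documents` on that dict raises the `AttributeError` shown in the traceback.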

Command I'm running:

python -m wdoc \
  --path="data_for_wdoc" \
  --filetype="recursive_paths" \
  --task=search \
  --query="How can I make wdoc run faster?" \
  --query_retrievers='default_multiquery' \
  --top_k=auto_200_500 \
  --llms_api_bases="{'model':'http://localhost:11434','query_eval_model':'http://localhost:11434'}" \
  --modelname="ollama/gemma2:2b" \
  --query_eval_modelname="ollama/gemma2:2b" \
  --recursed_filetype="txt" \
  --pattern="*.txt"

Error:

Error when loading doc with filetype txt: ''dict' object has no attribute 'transform_documents''. Arguments: {'llm_name': 'ollama/gemma2:2b', 'task': 'search', 'temp_dir': PosixPath('XXXX'), 'path': 'data_for_wdoc/fe061b430a2c4991a002f039c8ca6cb9.txt', 'filetype': 'txt', 'recur_parent_id': '206b66c9-9d44-4138-a413-fc1561d601a3', 'file_hash': '74a0d0bb291717058af1'}
Line number: 340
Full traceback:
  File "XXXX/venv/lib/python3.11/site-packages/wdoc/utils/loaders.py", line 340, in load_one_doc_wrapped
    out = load_one_doc(**doc_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "<@beartype(wdoc.utils.loaders.load_one_doc) at 0x12b15aca0>", line 205, in load_one_doc

  File "XXXX/venv/lib/python3.11/site-packages/wdoc/utils/loaders.py", line 507, in load_one_doc
    docs = text_splitter.transform_documents(docs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'm seeing some issues with using recursed_filetype, which I'll open a separate issue for.

thiswillbeyourgithub added a commit that referenced this issue Nov 12, 2024
Signed-off-by: thiswillbeyourgithub <26625900+thiswillbeyourgithub@users.noreply.github.com>
thiswillbeyourgithub (Owner) commented

Taking a closer look during my commute, it appears to be a no-brainer that your suggested fix is right. Thank you very much. I just pushed that to the dev branch.
