Add weaviate support #218

hsm207 · 2023-06-01T09:43:48Z

Fixes #145

Output of running pytest -s -v tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate:

================================================================================= test session starts =================================================================================
platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /usr/local/py-utils/venvs/pytest/bin/python
cachedir: .pytest_cache
rootdir: /workspaces/h2ogpt
plugins: xdist-3.2.1, anyio-3.7.0
collected 1 item

tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1792
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 8620.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Downloading (…)e9125/.gitattributes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.18k/1.18k [00:00<00:00, 5.09MB/s]
Downloading (…)_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:00<00:00, 1.04MB/s]
Downloading (…)7e55de9125/README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.6k/10.6k [00:00<00:00, 26.3MB/s]
Downloading (…)55de9125/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 612/612 [00:00<00:00, 4.07MB/s]
Downloading (…)ce_transformers.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 743kB/s]
Downloading (…)125/data_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 39.3k/39.3k [00:00<00:00, 87.5MB/s]
Downloading pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:00<00:00, 376MB/s]
Downloading (…)nce_bert_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 341kB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 854kB/s]
Downloading (…)e9125/tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 64.4MB/s]
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:00<00:00, 2.45MB/s]
Downloading (…)9125/train_script.py: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 13.2k/13.2k [00:00<00:00, 57.2MB/s]
Downloading (…)7e55de9125/vocab.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 71.1MB/s]
Downloading (…)5de9125/modules.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 2.52MB/s]
Binary /home/vscode/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.19.3/weaviate-v1.19.3-linux-amd64.tar.gz
Started /home/vscode/.cache/weaviate-embedded: process ID 4442
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-06-02T10:21:13Z"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-06-02T10:21:13Z"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-06-02T10:21:13Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-06-02T10:21:13Z"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-06-02T10:21:13Z"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"langchain_cb113cea4c3246fc9584cb8f01423ca2_JyjC68qeAYYp","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-06-02T10:21:18Z","took":81193}
('\n\nLinux is an open-source operating system that is free to use and modify, while Windows is a proprietary operating system that is owned by Microsoft Corporation.\n\nLinux is based on the Unix operating system, which is known for its stability and security, while Windows is based on the MS-DOS operating system, which is known for its ease of use.\n\nLinux is typically used for servers and other high-performance computing environments, while Windows is more commonly used for personal computers and laptops.\n\nLinux is also known for its flexibility and customizability, as users can modify the operating system to suit their needs, while Windows is more locked down and requires more user input to customize.\n\nOverall, the main differences between Linux and Windows are their licensing models, design principles, and target audience.\n\nSources [Score | Link]:
0.61 | https://en.wikipedia.org/wiki/Linux
End Sources
', '\nSources [Score | Link]:
0.61 | https://en.wikipedia.org/wiki/Linux
End Sources
')
PASSED

================================================================================== warnings summary ===================================================================================
../../usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:121
/usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../../usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870
/usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('google').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)

../../usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870
/usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('mpl_toolkits').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== 1 passed, 3 warnings in 205.00s (0:03:24) ======================================================================

hsm207 · 2023-06-01T16:40:39Z

@pseudotensor could you please review this PR?

pseudotensor

Awesome thanks!

Just three things:

Please provide an output of:

pytest -s -v tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate

This verifies that the new test works, and it is the minimum required.

I'll test the rest.

I notice when running the test there is verbose output from weaviate.

Binary /home/jon/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.19.3/weaviate-v1.19.3-linux-amd64.tar.gz
Started /home/jon/.cache/weaviate-embedded: process ID 67378
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-06-01T14:02:40-07:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-06-01T14:02:40-07:00"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-06-01T14:02:40-07:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-06-01T14:02:40-07:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-06-01T14:02:40-07:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"langchain_92432e47dd174559a7ffaa9bba83fbaa_zo3rVHt6AaEy","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-06-01T14:02:42-07:00","took":41977}
('\n\nLinux is an open-source operating system that is free to use and modify, while Windows is a proprietary operating system that is owned by Microsoft Corporation.\n\nLinux is based on the Unix operating system, which is known for its stability and security, while Windows is based on the MS-DOS operating system, which is known for its ease of use.\n\nLinux is typically used for servers and other high-performance computing environments, while Windows is more commonly used for personal computers and laptops.\n\nLinux is also known for its flexibility and customizability, as users can modify the operating system to suit their needs, while Windows is more locked down and requires more user input to customize.\n\nOverall, the main differences between Linux and Windows are their licensing models, design principles, and target audience.\n\nSources [Score | Link]:<p><ul><li>0.61 | <a href="https://en.wikipedia.org/wiki/Linux" target="_blank"  rel="noopener noreferrer">https://en.wikipedia.org/wiki/Linux</a></li></ul></p>End Sources<p>', '\nSources [Score | Link]:<p><ul><li>0.61 | <a href="https://en.wikipedia.org/wiki/Linux" target="_blank"  rel="noopener noreferrer">https://en.wikipedia.org/wiki/Linux</a></li></ul></p>End Sources<p>')

Let's pass a verbose through to the weaviate client or other places. I don't see that currently.

Please update make_db to handle either Chroma or weaviate more naturally, so there is no hardcoded Chroma logic in there. E.g. maybe add_to_db should handle the case when db=None and load the db from persistent_directory, or something like that. This is in response to your comment: Add support for weaviate #145 (comment)

As for why db_type='chroma' was hardcoded in make_db: same reason, the only persistent case was Chroma before. You are extending that, so it has to be adjusted.
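A minimal sketch of what the un-hardcoding suggested above could look like, assuming a small lookup table keyed by db_type. Everything here is hypothetical and for illustration only: `FakeStore`, `DB_OPENERS`, the opener functions, and the `persist_directory` parameter name are stand-ins, not h2oGPT or LangChain code, and the real `add_to_db` signature may differ.

```python
from typing import Callable, Dict, List, Optional


class FakeStore:
    """Hypothetical stand-in for a vector-store handle (not a real client)."""

    def __init__(self, kind: str, location: Optional[str]) -> None:
        self.kind = kind
        self.location = location
        self.docs: List[str] = []


def _open_chroma(persist_directory: Optional[str]) -> FakeStore:
    # Assumption: a Chroma-like store is reloaded from a directory on disk.
    return FakeStore("chroma", persist_directory)


def _open_weaviate(persist_directory: Optional[str]) -> FakeStore:
    # Assumption: a weaviate-like store is reached through a client,
    # so the directory is ignored in this sketch.
    return FakeStore("weaviate", None)


# One entry per supported backend; adding a backend means adding a row here
# instead of editing branches scattered through make_db.
DB_OPENERS: Dict[str, Callable[[Optional[str]], FakeStore]] = {
    "chroma": _open_chroma,
    "weaviate": _open_weaviate,
}


def add_to_db(
    db: Optional[FakeStore],
    sources: List[str],
    db_type: str = "chroma",
    persist_directory: Optional[str] = None,
) -> FakeStore:
    """Add sources to db, opening the store first when db is None."""
    if db is None:
        if db_type not in DB_OPENERS:
            raise ValueError(f"Unsupported db_type: {db_type!r}")
        db = DB_OPENERS[db_type](persist_directory)
    db.docs.extend(sources)
    # Always return the handle so callers never end up without a db.
    return db
```

With this shape, `add_to_db(None, docs, db_type="weaviate")` opens the store itself, while passing an existing handle reuses it unchanged.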

pseudotensor · 2023-06-01T21:01:12Z

@achraf-mer we are getting contributors, would be great to have jenkins smoke going!

pseudotensor

Just a few changes, which the author of the PR also brought up.

hsm207 · 2023-06-02T00:35:15Z

@pseudotensor

For point 1:

Please provide an output of:
pytest -s -v tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate

What do you mean by "provide the output of the test"? Do you mean you want me to paste the output of the test in this PR's description?

For point 2:

Let's pass a verbose through to the weaviate client or other places. I don't see that currently.

Are you okay with me introducing a new param called verbose to the get_db() function to control the embedded weaviate's verbosity?

I also plan to add an option to connect to weaviate on a local docker instance too. For this method, the user needs to provide url, username and password (only the url is compulsory).

I'm leaning towards configuring the connection methods using environment variables instead of adding additional params to the functions. What do you think?
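For illustration, the environment-variable approach described above might look roughly like the sketch below. The variable names (`WEAVIATE_URL`, `WEAVIATE_USERNAME`, `WEAVIATE_PASSWORD`), the `WeaviateSettings` class, and `load_weaviate_settings` are assumptions made for this example and are not taken from the PR. The sketch only collects settings and does not call the weaviate client.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class WeaviateSettings:
    url: Optional[str]        # None means "fall back to the embedded instance"
    username: Optional[str]
    password: Optional[str]
    verbose: bool


def load_weaviate_settings(
    env: Mapping[str, str] = os.environ,
    verbose: bool = False,
) -> WeaviateSettings:
    """Read connection settings from the environment (hypothetical names)."""
    # Treat empty strings the same as unset variables.
    url = env.get("WEAVIATE_URL") or None
    username = env.get("WEAVIATE_USERNAME") or None
    password = env.get("WEAVIATE_PASSWORD") or None
    # Only the URL is compulsory for a remote connection, so credentials
    # without a URL are rejected as a configuration mistake.
    if url is None and (username or password):
        raise ValueError("WEAVIATE_URL must be set when credentials are given")
    return WeaviateSettings(url, username, password, verbose)
```

Passing the mapping in as an argument keeps the function easy to test without touching the real process environment.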

pseudotensor · 2023-06-02T00:56:55Z

Hi, I just mean please run the test and show that it passes.

Yes, please add a new verbose parameter.

For make_db, yes let's do minimal required work for now and un-hardcode things.

Yes, for Docker etc., the remote client can then be added in a different PR.

hsm207 · 2023-06-02T17:39:47Z

@pseudotensor I've addressed your feedback.

For point 2, it is not possible to silence the weaviate client. I've raised a feature request for this (see weaviate/weaviate-python-client#346).

pseudotensor · 2023-06-02T19:46:42Z

============================================================================================================ short test summary info ============================================================================================================
FAILED tests/test_langchain_units.py::test_md_add - AssertionError: assert 'h2oGPT is a large language model' in 'LangChain file types supported\n\nCLI Database control\n\nWhy h2oGPT for Doc Q&A\n\nFAQ\n\nUseful Links\n\nFine-Tuning\n\nDocker\n\nTriton\n\nAcknowledgements\n\nWhy H2O.ai?\...
================================================================================== 1 failed, 25 passed, 7 skipped, 1 xpassed, 8 warnings in 400.35s (0:06:40) ===================================================================================
(h2ollm) jon@pseudotensor:~/h2ogpt$

I'll fix that test, unrelated to this PR, and add make_db_main() tests in another PR.

pseudotensor · 2023-06-02T19:58:43Z

@hsm207 Thanks so much for your contribution!!

pseudotensor · 2023-06-02T20:21:27Z

@hsm207 Some issues with the changes when adding more coverage: #231

pseudotensor · 2023-06-02T20:23:32Z

Even PyCharm showed problems, and db wasn't returned. Trying the tests again after the changes.

hsm207 added 6 commits June 1, 2023 09:38

add test for weaviate

b40061a

add weaviate dependency

947a417

add weaviate support

cb51472

Merge branch 'main' of https://github.com/h2oai/h2ogpt into weaviate-…

6c00a42

add support for weaviate in add_to_db

38b385d

update docs

0fc9f21

hsm207 marked this pull request as ready for review June 1, 2023 16:40

pseudotensor approved these changes Jun 1, 2023

pseudotensor self-requested a review June 1, 2023 21:16

pseudotensor requested changes Jun 1, 2023

hsm207 added 2 commits June 2, 2023 16:13

support custom index name

a2d9d21

refactor make_db_main to support other dbs

e89368c

Merge branch 'main' of https://github.com/h2oai/h2ogpt into weaviate-…

8070b4b

pseudotensor self-requested a review June 2, 2023 19:46

pseudotensor approved these changes Jun 2, 2023

pseudotensor merged commit 9371e51 into h2oai:main Jun 2, 2023

hsm207 deleted the weaviate-vectorstore branch June 2, 2023 20:40
