Add weaviate support #218

Merged
merged 9 commits into from
Jun 2, 2023

Contributor

@hsm207 hsm207 commented Jun 1, 2023

Fixes #145

Output of running pytest -s -v tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate:

================================================================================= test session starts =================================================================================
platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /usr/local/py-utils/venvs/pytest/bin/python
cachedir: .pytest_cache
rootdir: /workspaces/h2ogpt
plugins: xdist-3.2.1, anyio-3.7.0
collected 1 item

tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1792
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 8620.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Downloading (…)e9125/.gitattributes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.18k/1.18k [00:00<00:00, 5.09MB/s]
Downloading (…)_Pooling/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 190/190 [00:00<00:00, 1.04MB/s]
Downloading (…)7e55de9125/README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.6k/10.6k [00:00<00:00, 26.3MB/s]
Downloading (…)55de9125/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 612/612 [00:00<00:00, 4.07MB/s]
Downloading (…)ce_transformers.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 743kB/s]
Downloading (…)125/data_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 39.3k/39.3k [00:00<00:00, 87.5MB/s]
Downloading pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 90.9M/90.9M [00:00<00:00, 376MB/s]
Downloading (…)nce_bert_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 341kB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 112/112 [00:00<00:00, 854kB/s]
Downloading (…)e9125/tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 64.4MB/s]
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:00<00:00, 2.45MB/s]
Downloading (…)9125/train_script.py: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 13.2k/13.2k [00:00<00:00, 57.2MB/s]
Downloading (…)7e55de9125/vocab.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 71.1MB/s]
Downloading (…)5de9125/modules.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 2.52MB/s]
Binary /home/vscode/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.19.3/weaviate-v1.19.3-linux-amd64.tar.gz
Started /home/vscode/.cache/weaviate-embedded: process ID 4442
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-06-02T10:21:13Z"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-06-02T10:21:13Z"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-06-02T10:21:13Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-06-02T10:21:13Z"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-06-02T10:21:13Z"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"langchain_cb113cea4c3246fc9584cb8f01423ca2_JyjC68qeAYYp","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-06-02T10:21:18Z","took":81193}
('\n\nLinux is an open-source operating system that is free to use and modify, while Windows is a proprietary operating system that is owned by Microsoft Corporation.\n\nLinux is based on the Unix operating system, which is known for its stability and security, while Windows is based on the MS-DOS operating system, which is known for its ease of use.\n\nLinux is typically used for servers and other high-performance computing environments, while Windows is more commonly used for personal computers and laptops.\n\nLinux is also known for its flexibility and customizability, as users can modify the operating system to suit their needs, while Windows is more locked down and requires more user input to customize.\n\nOverall, the main differences between Linux and Windows are their licensing models, design principles, and target audience.\n\nSources [Score | Link]:

End Sources

', '\nSources [Score | Link]:

End Sources

')
PASSED

================================================================================== warnings summary ===================================================================================
../../usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:121
/usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../../usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870
/usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('google').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)

../../usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870
/usr/local/py-utils/shared/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to pkg_resources.declare_namespace('mpl_toolkits').
Implementing implicit namespace packages (as specified in PEP 420) is preferred to pkg_resources.declare_namespace. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== 1 passed, 3 warnings in 205.00s (0:03:24) ======================================================================

@hsm207 hsm207 marked this pull request as ready for review June 1, 2023 16:40
Contributor Author

hsm207 commented Jun 1, 2023

@pseudotensor could you please review this PR?

Collaborator

@pseudotensor pseudotensor left a comment

Awesome thanks!

Just three things:

  1. Please provide an output of:
pytest -s -v tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate

To verify that the new test works. That should be the minimal requirement.

I'll test the rest.

  2. I notice when running the test there is verbose output from weaviate.
Binary /home/jon/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.19.3/weaviate-v1.19.3-linux-amd64.tar.gz
Started /home/jon/.cache/weaviate-embedded: process ID 67378
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-06-01T14:02:40-07:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-06-01T14:02:40-07:00"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-06-01T14:02:40-07:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-06-01T14:02:40-07:00"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-06-01T14:02:40-07:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"langchain_92432e47dd174559a7ffaa9bba83fbaa_zo3rVHt6AaEy","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-06-01T14:02:42-07:00","took":41977}
('\n\nLinux is an open-source operating system that is free to use and modify, while Windows is a proprietary operating system that is owned by Microsoft Corporation.\n\nLinux is based on the Unix operating system, which is known for its stability and security, while Windows is based on the MS-DOS operating system, which is known for its ease of use.\n\nLinux is typically used for servers and other high-performance computing environments, while Windows is more commonly used for personal computers and laptops.\n\nLinux is also known for its flexibility and customizability, as users can modify the operating system to suit their needs, while Windows is more locked down and requires more user input to customize.\n\nOverall, the main differences between Linux and Windows are their licensing models, design principles, and target audience.\n\nSources [Score | Link]:<p><ul><li>0.61 | <a href="https://en.wikipedia.org/wiki/Linux" target="_blank"  rel="noopener noreferrer">https://en.wikipedia.org/wiki/Linux</a></li></ul></p>End Sources<p>', '\nSources [Score | Link]:<p><ul><li>0.61 | <a href="https://en.wikipedia.org/wiki/Linux" target="_blank"  rel="noopener noreferrer">https://en.wikipedia.org/wiki/Linux</a></li></ul></p>End Sources<p>')

Let's pass a verbose through to the weaviate client or other places. I don't see that currently.

  3. Please update make_db to handle either Chroma or weaviate more naturally, so there is no hardcoded Chroma logic in there. E.g. maybe add_to_db should handle the case when db=None and load the db from persistent_directory, or something like that. For your comment: Add support for weaviate #145 (comment)

As for why db_type='chroma' was hardcoded in make_db: same reason, the only persistent case was Chroma before. You are extending that, so it has to be adjusted.
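
A minimal sketch of what the third point might look like, assuming a loader registry keyed by db_type. The loader functions, the dict stand-ins for the stores, and the exact add_to_db signature are hypothetical illustrations, not the h2oGPT implementation:

```python
from typing import Callable, Dict, List, Optional

# Hypothetical loaders: each one reopens a persisted store of one type.
# Plain dicts stand in for the real Chroma / Weaviate vector store objects.
DbLoader = Callable[[str], dict]


def load_chroma(persistent_directory: str) -> dict:
    # Stand-in for reopening a Chroma store from disk.
    return {"db_type": "chroma", "dir": persistent_directory, "docs": []}


def load_weaviate(persistent_directory: str) -> dict:
    # Stand-in for reconnecting to a Weaviate index. The directory is kept
    # only so that both loaders share one signature.
    return {"db_type": "weaviate", "dir": persistent_directory, "docs": []}


LOADERS: Dict[str, DbLoader] = {
    "chroma": load_chroma,
    "weaviate": load_weaviate,
}


def add_to_db(
    db: Optional[dict],
    sources: List[str],
    db_type: str = "chroma",
    persistent_directory: str = "db_dir",
) -> dict:
    """Add sources to db, loading the persisted store first when db is None."""
    if db is None:
        try:
            loader = LOADERS[db_type]
        except KeyError:
            raise ValueError(f"Unsupported db_type: {db_type!r}") from None
        db = loader(persistent_directory)
    db["docs"].extend(sources)
    # Always hand the store back so callers can keep using it.
    return db
```

With this shape, make_db would only pass db_type through and would not need to know which backend is in use.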

@pseudotensor
Collaborator

@achraf-mer we are getting contributors, would be great to have jenkins smoke going!

@pseudotensor pseudotensor self-requested a review June 1, 2023 21:16
Collaborator

@pseudotensor pseudotensor left a comment

Just a few changes, which the PR author also brought up.

Contributor Author

hsm207 commented Jun 2, 2023

@pseudotensor

For point 1:

Please provide an output of:
pytest -s -v tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate

What do you mean by "provide the output of the test"? Do you mean you want me to paste the output of the test in this PR's description?

For point 2:

Let's pass a verbose through to the weaviate client or other places. I don't see that currently.

Are you okay with me introducing a new param called verbose to the get_db() function to control the embedded weaviate's verbosity?

I also plan to add an option to connect to weaviate on a local docker instance. For this method, the user needs to provide a url, username and password (only the url is compulsory).

I'm leaning towards configuring the connection methods using environment variables instead of adding additional params to the functions. What do you think?
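
To make the environment-variable idea concrete, here is a rough sketch. The variable names WEAVIATE_URL, WEAVIATE_USERNAME and WEAVIATE_PASSWORD, the WeaviateConnection class and the connection_from_env helper are all made up for illustration; nothing in this PR defines them:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WeaviateConnection:
    """Connection settings; url=None means 'use the embedded instance'."""

    url: Optional[str] = None
    username: Optional[str] = None
    password: Optional[str] = None

    @property
    def embedded(self) -> bool:
        return self.url is None


def connection_from_env(env: Mapping[str, str] = os.environ) -> WeaviateConnection:
    # Hypothetical variable names. Empty strings are treated as unset.
    url = env.get("WEAVIATE_URL") or None
    username = env.get("WEAVIATE_USERNAME") or None
    password = env.get("WEAVIATE_PASSWORD") or None
    # Only the url is compulsory for a remote connection, so credentials
    # without a url are treated as a configuration mistake.
    if url is None and (username or password):
        raise ValueError("WEAVIATE_URL is required when credentials are set")
    return WeaviateConnection(url=url, username=username, password=password)
```

With no variables set this falls back to the embedded instance, so the existing test would keep working unchanged.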

@pseudotensor
Collaborator

Hi, I just mean please run the test and show that it passes, etc.

Yes, please add a new verbose parameter.

For make_db, yes let's do minimal required work for now and un-hardcode things.

Yes, for docker etc. we can then add the remote client, in a different PR.

Contributor Author

hsm207 commented Jun 2, 2023

@pseudotensor I've addressed your feedback.

For point 2, it is not possible to silence the weaviate client. I've raised a feature request for this (see weaviate/weaviate-python-client#346 )
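
Until such an option exists, one generic workaround is to redirect the process-level stdout/stderr file descriptors while the noisy child process is started, since a child normally inherits them. The sketch below is not part of this PR and has not been tried against embedded weaviate; whether it would silence the binary depends on how the client launches it. The names redirect_fds and run_child are hypothetical:

```python
import os
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def redirect_fds(target_path=os.devnull, fds=(1, 2)):
    """Temporarily point OS-level file descriptors (stdout, stderr) at a file."""
    # Flush Python-level buffers so earlier output is not redirected late.
    sys.stdout.flush()
    sys.stderr.flush()
    saved = [(fd, os.dup(fd)) for fd in fds]
    target = os.open(target_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        for fd in fds:
            os.dup2(target, fd)
        yield
    finally:
        # Flush anything printed inside the block before restoring.
        sys.stdout.flush()
        sys.stderr.flush()
        for fd, backup in saved:
            os.dup2(backup, fd)
            os.close(backup)
        os.close(target)


def run_child(code: str) -> None:
    # Stand-in for a noisy child process such as an embedded server.
    subprocess.run([sys.executable, "-c", code], check=True)
```

A child started inside `with redirect_fds():` keeps the redirected descriptors for its whole lifetime, so a verbose flag in get_db() could decide whether to wrap client creation this way.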

@pseudotensor
Collaborator

============================================================================================================ short test summary info ============================================================================================================
FAILED tests/test_langchain_units.py::test_md_add - AssertionError: assert 'h2oGPT is a large language model' in 'LangChain file types supported\n\nCLI Database control\n\nWhy h2oGPT for Doc Q&A\n\nFAQ\n\nUseful Links\n\nFine-Tuning\n\nDocker\n\nTriton\n\nAcknowledgements\n\nWhy H2O.ai?\...
================================================================================== 1 failed, 25 passed, 7 skipped, 1 xpassed, 8 warnings in 400.35s (0:06:40) ===================================================================================
(h2ollm) jon@pseudotensor:~/h2ogpt$ 

I'll fix that test, unrelated to this PR, and add make_db_main() tests in another PR.

@pseudotensor pseudotensor self-requested a review June 2, 2023 19:46
@pseudotensor pseudotensor merged commit 9371e51 into h2oai:main Jun 2, 2023
@pseudotensor
Collaborator

@hsm207 Thanks so much for your contribution!!

@pseudotensor
Collaborator

@hsm207 Some issues with the changes when adding more coverage: #231

@pseudotensor
Collaborator

Even pycharm showed problems, and db wasn't returned. Trying tests again after changes

@hsm207 hsm207 deleted the weaviate-vectorstore branch June 2, 2023 20:40
Development

Successfully merging this pull request may close these issues.

Add support for weaviate
2 participants