-
Notifications
You must be signed in to change notification settings - Fork 15.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cassandra support for chat history using CassIO library #6771
Cassandra support for chat history using CassIO library #6771
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
Trying to fix linter/build errors I see only here in the PR discussion. |
@rlancemartin could you take a look at this Cassandra extension? |
78287f0
to
6f64f5b
Compare
Thank you for taking action @rlancemartin ! |
Ya, re-generating it now from the updated Just discussed a bit w/ @dev2049 and typical workflow when the
Which will re-generate the lock file (without updating the dependencies to their latest versions). |
6f64f5b
to
2619cd1
Compare
### Overview This PR aims at building on #4378, expanding the capabilities and building on top of the `cassIO` library to interface with the database (as opposed to using the core drivers directly). Usage of `cassIO` (a library abstracting Cassandra access for ML/GenAI-specific purposes) is already established since #6426 was merged, so no new dependencies are introduced. In the same spirit, we try to uniform the interface for using Cassandra instances throughout LangChain: all our appreciation of the work by @jj701 notwithstanding, who paved the way for this incremental work (thank you!), we identified a few reasons for changing the way a `CassandraChatMessageHistory` is instantiated. Advocating a syntax change is something we don't take lighthearted way, so we add some explanations about this below. Additionally, this PR expands on integration testing, enables use of Cassandra's native Time-to-Live (TTL) features and improves the phrasing around the notebook example and the short "integrations" documentation paragraph. We would kindly request @hwchase to review (since this is an elaboration and proposed improvement of #4378 who had the same reviewer). ### About the __init__ breaking changes There are [many](https://docs.datastax.com/en/developer/python-driver/3.28/api/cassandra/cluster/) options when creating the `Cluster` object, and new ones might be added at any time. Choosing some of them and exposing them as `__init__` parameters `CassandraChatMessageHistory` will prove to be insufficient for at least some users. On the other hand, working through `kwargs` or adding a long, long list of arguments to `__init__` is not a desirable option either. For this reason, (as done in #6426), we propose that whoever instantiates the Chat Message History class provide a Cassandra `Session` object, ready to use. This also enables easier injection of mocks and usage of Cassandra-compatible connections (such as those to the cloud database DataStax Astra DB, obtained with a different set of init parameters than `contact_points` and `port`). We feel that a breaking change might still be acceptable since LangChain is at `0.*`. However, while maintaining that the approach we propose will be more flexible in the future, room could be made for a "compatibility layer" that respects the current init method. Honestly, we would to that only if there are strong reasons for it, as that would entail an additional maintenance burden. ### Other changes We propose to remove the keyspace creation from the class code for two reasons: first, production Cassandra instances often employ RBAC so that the database user reading/writing from tables does not necessarily (and generally shouldn't) have permission to create keyspaces, and second that programmatic keyspace creation is not a best practice (it should be done more or less manually, with extra care about schema mismatched among nodes, etc). Removing this (usually unnecessary) operation from the `__init__` path would also improve initialization performance (shorter time). We suggest, likewise, to remove the `__del__` method (which would close the database connection), for the following reason: it is the recommended best practice to create a single Cassandra `Session` object throughout an application (it is a resource-heavy object capable to handle concurrency internally), so in case Cassandra is used in other ways by the app there is the risk of truncating the connection for all usages when the history instance is destroyed. Moreover, the `Session` object, in typical applications, is best left to garbage-collect itself automatically. As mentioned above, we defer the actual database I/O to the `cassIO` library, which is designed to encode practices optimized for LLM applications (among other) without the need to expose LangChain developers to the internals of CQL (Cassandra Query Language). CassIO is already employed by the LangChain's Vector Store support for Cassandra. We added a few more connection options in the companion notebook example (most notably, Astra DB) to encourage usage by anyone who cannot run their own Cassandra cluster. We surface the `ttl_seconds` option for automatic handling of an expiration time to chat history messages, a likely useful feature given that very old messages generally may lose their importance. We elaborated a bit more on the integration testing (Time-to-live, separation of "session ids", ...). ### Remarks from linter & co. We reinstated `cassio` as a dependency both in the "optional" group and in the "integration testing" group of `pyproject.toml`. This might not be the right thing do to, in which case the author of this PR offer his apologies (lack of confidence with Poetry - happy to be pointed in the right direction, though!). During linter tests, we were hit by some errors which appear unrelated to the code in the PR. We left them here and report on them here for awareness: ``` langchain/vectorstores/mongodb_atlas.py:137: error: Argument 1 to "insert_many" of "Collection" has incompatible type "List[Dict[str, Sequence[object]]]"; expected "Iterable[Union[MongoDBDocumentType, RawBSONDocument]]" [arg-type] langchain/vectorstores/mongodb_atlas.py:186: error: Argument 1 to "aggregate" of "Collection" has incompatible type "List[object]"; expected "Sequence[Mapping[str, Any]]" [arg-type] langchain/vectorstores/qdrant.py:16: error: Name "grpc" is not defined [name-defined] langchain/vectorstores/qdrant.py:19: error: Name "grpc" is not defined [name-defined] langchain/vectorstores/qdrant.py:20: error: Name "grpc" is not defined [name-defined] langchain/vectorstores/qdrant.py:22: error: Name "grpc" is not defined [name-defined] langchain/vectorstores/qdrant.py:23: error: Name "grpc" is not defined [name-defined] ``` In the same spirit, we observe that to even get `import langchain` run, it seems that a `pip install bs4` is missing from the minimal package installation path. Thank you!
Overview
This PR aims at building on #4378, expanding the capabilities and building on top of the
cassIO
library to interface with the database (as opposed to using the core drivers directly).Usage of
cassIO
(a library abstracting Cassandra access for ML/GenAI-specific purposes) is already established since #6426 was merged, so no new dependencies are introduced.In the same spirit, we try to uniform the interface for using Cassandra instances throughout LangChain: all our appreciation of the work by @jj701 notwithstanding, who paved the way for this incremental work (thank you!), we identified a few reasons for changing the way a
CassandraChatMessageHistory
is instantiated. Advocating a syntax change is something we don't take lighthearted way, so we add some explanations about this below.Additionally, this PR expands on integration testing, enables use of Cassandra's native Time-to-Live (TTL) features and improves the phrasing around the notebook example and the short "integrations" documentation paragraph.
We would kindly request @hwchase to review (since this is an elaboration and proposed improvement of #4378 who had the same reviewer).
About the init breaking changes
There are many options when creating the
Cluster
object, and new ones might be added at any time. Choosing some of them and exposing them as__init__
parametersCassandraChatMessageHistory
will prove to be insufficient for at least some users.On the other hand, working through
kwargs
or adding a long, long list of arguments to__init__
is not a desirable option either. For this reason, (as done in #6426), we propose that whoever instantiates the Chat Message History class provide a CassandraSession
object, ready to use. This also enables easier injection of mocks and usage of Cassandra-compatible connections (such as those to the cloud database DataStax Astra DB, obtained with a different set of init parameters thancontact_points
andport
).We feel that a breaking change might still be acceptable since LangChain is at
0.*
. However, while maintaining that the approach we propose will be more flexible in the future, room could be made for a "compatibility layer" that respects the current init method. Honestly, we would to that only if there are strong reasons for it, as that would entail an additional maintenance burden.Other changes
We propose to remove the keyspace creation from the class code for two reasons: first, production Cassandra instances often employ RBAC so that the database user reading/writing from tables does not necessarily (and generally shouldn't) have permission to create keyspaces, and second that programmatic keyspace creation is not a best practice (it should be done more or less manually, with extra care about schema mismatched among nodes, etc). Removing this (usually unnecessary) operation from the
__init__
path would also improve initialization performance (shorter time).We suggest, likewise, to remove the
__del__
method (which would close the database connection), for the following reason: it is the recommended best practice to create a single CassandraSession
object throughout an application (it is a resource-heavy object capable to handle concurrency internally), so in case Cassandra is used in other ways by the app there is the risk of truncating the connection for all usages when the history instance is destroyed. Moreover, theSession
object, in typical applications, is best left to garbage-collect itself automatically.As mentioned above, we defer the actual database I/O to the
cassIO
library, which is designed to encode practices optimized for LLM applications (among other) without the need to expose LangChain developers to the internals of CQL (Cassandra Query Language). CassIO is already employed by the LangChain's Vector Store support for Cassandra.We added a few more connection options in the companion notebook example (most notably, Astra DB) to encourage usage by anyone who cannot run their own Cassandra cluster.
We surface the
ttl_seconds
option for automatic handling of an expiration time to chat history messages, a likely useful feature given that very old messages generally may lose their importance.We elaborated a bit more on the integration testing (Time-to-live, separation of "session ids", ...).
Remarks from linter & co.
We reinstated
cassio
as a dependency both in the "optional" group and in the "integration testing" group ofpyproject.toml
. This might not be the right thing do to, in which case the author of this PR offer his apologies (lack of confidence with Poetry - happy to be pointed in the right direction, though!).During linter tests, we were hit by some errors which appear unrelated to the code in the PR. We left them here and report on them here for awareness:
In the same spirit, we observe that to even get
import langchain
run, it seems that apip install bs4
is missing from the minimal package installation path.Thank you!