[DOCS] Add CIP 6 - Pipelines, The Registry and Storage #1110
Conversation
awesome start
## Status
Current Status: `Under Discussion`
are we comfortable supporting only python for ops? what is the story for other languages? in particular javascript.
how would a Go client use pipelines?
My thought is that:
- Pipeline steps are written in python; in the future we can support other execution runtimes, but for now, python.
- You can write simple pipeline pointers in the client languages that will encode the right step.
I think that constraint probably makes sense for now (we need some!)
What do you imagine this would look like in JS then? I guess we could still have types that are chainable... but not runnable locally? hmmm
So they could be runnable locally; they would just have to be reimplemented, and then if you wanted portability, you'd have to ensure they are 1:1 in behavior. Or we could make it so that they can never be run locally, but I don't love that.
With power comes tradeoffs :(
yeah mostly just trying to think through the pros and cons. there will be cons and we will just have to live with them.
I think for a minimal surface area re-implementation is ok. (like we currently do with embedding functions)
@HammadB @jeffchuber, can this not be only a server-side thing? Then all we need is a DSL or whatever is "natural" in the target language to encode, as suggested above.
This is how CQL (Chroma Query Language) is born.
inventing new query languages is a red flag to me! 🚩🚩🚩
I think the reason to support client-side is to allow for rapid experimentation.
@HammadB, do you think supporting Client and Server side pipelines makes sense?
Since I'm now working on client/server encryption, there is a "world" where encrypting (public key) on the client side and decrypting on the server side (private key) can be helpful.
If I understand correctly, the individual operators in the pipeline are stateless, right?
So, there could be a list of operators that you can chain together, given that inputs are compatible with the output after the previous operation in the chain.
Since pipelines are supposed to run server-side, pipelines developed in other languages are not possible. However, what is possible is that each client assembles pipelines based on the operators. I would plan for this from the beginning.
What needs to be created on the client-side in the end to build a pipeline is a manifest file. Most users will likely reuse pipelines, and it makes sense to define these via a manifest in the repository, which also serves as good documentation for the pipeline. I think the idea of using YAML for this is a good one. There doesn't need to be a new query language for that.
In JavaScript, you could create pipelines as a manifest that you send (automatically generated by code, even possible in a pipeline syntax), for example:
```js
const collection = client.collection('my-collection');
pdfHttpStream.pipe(
  chunk(),
  embed(),
  quantize(),
  execute(collection)
);
```
Alternatively, you could pass a pipeline manifest to the collection:
```js
const collection = client.collection('my-collection', {pipeline: manifestFile});
collection.add(data);
```
On the server-side, the Python server needs to dynamically build and execute the pipeline based on the manifest. This way every client can control the pipeline to be used before storing the embeddings.
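As a rough sketch of how the server side could assemble a pipeline from such a manifest (the registry, `register_step`, and `build_pipeline` names are hypothetical, not part of the CIP):

```python
# Hypothetical sketch: the server resolves step names from a client-supplied manifest
# into registered callables and composes them into a pipeline.
from typing import Any, Callable, Dict, List

STEP_REGISTRY: Dict[str, Callable[[Any], Any]] = {}

def register_step(name: str) -> Callable[[Callable[[Any], Any]], Callable[[Any], Any]]:
    def decorator(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
        STEP_REGISTRY[name] = fn
        return fn
    return decorator

@register_step("chunk")
def chunk(data: Any) -> Any:
    ...  # split documents into chunks

@register_step("embed")
def embed(data: Any) -> Any:
    ...  # embed the chunks

def build_pipeline(manifest: List[str]) -> Callable[[Any], Any]:
    steps = [STEP_REGISTRY[name] for name in manifest]  # unknown names fail loudly
    def run(data: Any) -> Any:
        for step in steps:
            data = step(data)
        return data
    return run

# A client in any language only has to send the manifest, e.g. ["chunk", "embed"].
pipeline = build_pipeline(["chunk", "embed"])
```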
# CIP: Pipelines, The Registry and Storage
I'm still mixed on the term pipeline ... alternatives include
- workflow (temporal)
- flow and tasks (prefect)
- transforms
- lambdas
- functions
- jobs & ops (dagster)
- chain (langchain)
- Workflow - fine, but I think it's less descriptive than pipeline, evokes a different set of tools, and implies state-machine-like behavior.
- Flow - refers to a generalization of a pipeline; I am fine with this. A pipeline is a 1D flow (https://en.wikipedia.org/wiki/Pipeline_(software)).
- Transforms - not all-encompassing; some of these have side effects or are not transformations.
- Lambdas - feels really specific to something not explicitly always used here, especially in python, and does not communicate sequentiality.
- Functions - doesn't communicate sequence.
- Jobs & Ops - Dagster's model is far more complicated (assets, graphs, etc). A "Job" feels really broad, and an "Op" is fine but seems like needlessly using language no one uses.
- Chain - feels owned by langchain and also is not descriptive for parallel execution.
I prefer pipeline or flow because they are most descriptive of a sequence of operations on data, where inputs and outputs flow into one another, potentially with side effects. Pipeline seems like something people will be familiar with; flow would be a new term.
appreciate the reactions!
@HammadB @jeffchuber, I too think that pipeline conveys the meaning of chaining a bunch of discrete functions together. That said, why not just "invent" a name for this? We can still call it a pipeline just so people draw a parallel, but then the actual "catchy" name can be something of the sort of ChromaticFunnel :D
Naming aside, I am wondering about two additional concepts that come to mind with this - filters and error sinks.
- Filters - useful when dropping out samples is a thing
- Error sinks - basically a way to "nicely" handle inevitable errors (again, user-driven functions)
i am personally generally against making up new names for things that already have good names :)
oh and another "nice" concept from Java world - parallel() - the ability to split/join a sequence of pipeline steps. This can be done further down the road but I can see how a batch of 100k docs can have parallel embedding going for it.
ah yeah, supporting something akin to promise like error handling is a good idea
error handling is def interesting and important
my initial thought here was that pipelines are synchronous with the insert call, so error states are trivially propagated. Once we allow for async pipelines though this becomes much harder.
@HammadB, I think that the error handling can be made optional, e.g. if people want to do something with their failing batches - retry, dump them to disk so they can retry later, etc.
Edit: I think that the moment this is released, people will start doing async calls over the wire.
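As a rough illustration of the optional error-sink idea (the `on_error` hook and record shapes here are made up, not part of the proposal):

```python
# Hypothetical sketch: an optional per-pipeline error sink.
from typing import Any, Callable, List, Optional

def run_pipeline(
    steps: List[Callable[[Any], Any]],
    batch: Any,
    on_error: Optional[Callable[[Exception, Any], None]] = None,
) -> Optional[Any]:
    try:
        for step in steps:
            batch = step(batch)
        return batch
    except Exception as exc:
        if on_error is None:
            raise  # default: errors propagate synchronously with the insert call
        on_error(exc, batch)  # e.g. dump the failing batch to disk for a later retry
        return None
```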
Pipeline steps can be composed, and return a reference to the composed pipeline step. For example, if you wanted to create a pipeline that chunked a document and then embedded the chunks, you could do something like this:

```python
to_add = Embed(Chunk(data))
```
how can we make it easy to see what things can be chained? can we use type hints?
yes.
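One way type hints could make chainability visible (a sketch with made-up `PipelineStep` generics, not the proposed API):

```python
# Hypothetical sketch: generic step types so a type checker can flag incompatible chains.
from typing import Callable, Generic, List, TypeVar

In = TypeVar("In")
Mid = TypeVar("Mid")
Out = TypeVar("Out")

class PipelineStep(Generic[In, Out]):
    def __init__(self, fn: Callable[[In], Out]) -> None:
        self.fn = fn

    def __call__(self, data: In) -> Out:
        return self.fn(data)

def compose(outer: PipelineStep[Mid, Out], inner: PipelineStep[In, Mid]) -> PipelineStep[In, Out]:
    # Embed(Chunk(data)) style composition; mismatched input/output types fail type checking.
    return PipelineStep(lambda data: outer(inner(data)))

Chunk: PipelineStep[str, List[str]] = PipelineStep(lambda doc: doc.split("\n\n"))
Embed: PipelineStep[List[str], List[List[float]]] = PipelineStep(lambda chunks: [[0.0] for _ in chunks])

pipeline = compose(Embed, Chunk)   # OK: str -> chunks -> embeddings
# compose(Chunk, Embed)            # rejected by the type checker
```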
@HammadB, I love functional programming as much as the next guy, but do you think the same result can be achieved with a fluent (chained) API?
@tazarov can you provide an example?
An "extra" late-night thought - wdyt of YAML-based manifest, as a portable version of this (python functions must be bundled or can be installed as deps)
Yeah I'm not wedded to this specific interface tbh
i still lean towards "it's just a function" because I think it's the easiest to understand and, more importantly, I think it may be the easiest to "break out of" when you want to add a new operation.
I like "withData(data).pipe([Embed,Chunk])" as the function approach could look messy once you pass 5 pipelines, which is feasible at the moment, : FixMetadata, AssignIDs, ParentChunking, Chunking, Embedding, etc
Using the `|` operator really conveys the pipe in the pipeline, and for most devs (familiar with bash), pipes will look natural and also Pythonic. Another aspect is that if one wants to pass parameters to steps, it gets even messier.
Another aspect of using the chained method calls is that we can even use lambdas for filtering and inline processing without the need for decorators:
`withData(data).pipe([lambda x: 'allowed' in x, Chunk, Embed])`
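A sketch of what a `|`-based surface could look like via `__or__` overloading (illustrative only; not the interface proposed in the CIP):

```python
# Hypothetical sketch: overload | so that Chunk | Embed builds a pipeline, shell-style.
from typing import Any, Callable

class Step:
    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def __call__(self, data: Any) -> Any:
        return self.fn(data)

    def __or__(self, other: "Step") -> "Step":
        # self runs first, then other - reads left to right like a shell pipe
        return Step(lambda data: other(self(data)))

Chunk = Step(lambda doc: doc.split("\n\n"))
Embed = Step(lambda chunks: [[0.0] for _ in chunks])

pipeline = Chunk | Embed
embeddings = pipeline("paragraph one\n\nparagraph two")
```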
Came here to suggest a fluent/chained API, but i'm quite late to the party :)
Totally subjective IMO, but i've tended to rely on using indentation when wrapping lots of functions together (clojure transducers jump to mind) as a way of visually communicating the level that code is at.
As a python noob, i genuinely don't know, but i wonder if such indentation tricks are possible with chaining in a way they're not with function wrapping. Like, i believe this is allowable, where a particular indentation level clearly indicates where each step definition starts:

```python
(
    withData(data)
    .pipe(Chunk)
    .pipe(Embed)
)
```

But idk if there's an equivalently clear way of doing it with function wrapping?
Chroma right now implicitly stores your documents, however we also want to support multimodal embeddings and part of this CIP is to propose we deprecate documents= in favor of data=. The contract then becomes that the data you input (or that is generated by your pipeline) is stored by default and can be retrieved by the user in some form. We also want to support storing mixed modalities in a collection. For example, a CLIP based collection should be able to store both text and images.

We propose to add Storage Pipeline steps capable of storing the inputted data into a Storage layer, and then storing a string reference to that data in the metadata of the document. This is already how the document is currently stored and backed by the metadata segment. The metadata segment stores a KV pair "chroma:document" -> the document contents. For example, say you had an image in memory and wanted to store it in Chroma, you could do something like this:
do we want to use metadata, or a new field `uri`?
yes, we can be flexible here, it's a minor implementation detail. I just didn't elaborate because it's not super important.
@HammadB, I think we can introduce a mime-type router here so that different data types are automatically routed to a specific storage pipeline that can handle it.
Am I correct in assuming that when you mention multimodal above, you mean to be able to store any type of "data" in your collection - text, images, audio?
yes, or binary. I was also considering a design for storage that's more like index definition, where you declaratively specify storage and it gets routed.
If we are to adopt a Storage Adapter-like concept, we can have:

```python
# Route files to a storage adapter by mime-type.
import mimetypes
from abc import ABC, abstractmethod


class StorageAdapter(ABC):
    @abstractmethod
    def handles(self, mimetype: str) -> bool:
        ...


class GoogleDriveStorageAdapter(StorageAdapter):
    def handles(self, mimetype: str) -> bool:
        return mimetype.startswith("application/pdf")


def extract_mimetype(file: str) -> str:
    # Guess from the file name; a real implementation might sniff the file contents.
    mime, _ = mimetypes.guess_type(file)
    return mime or "application/octet-stream"


class StorageAdapterRouter:
    adapters = [GoogleDriveStorageAdapter()]

    def handle_storage_request(self, file: str) -> StorageAdapter:
        mime = extract_mimetype(file)
        for adapter in self.adapters:
            if adapter.handles(mime):
                return adapter
        raise ValueError("No adapter found for mime-type: {}".format(mime))
```
Of course, the actual supported mime-types (or just types of files) can be declaratively configured.
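Under the same assumptions, usage could then look like this (purely illustrative, continuing the sketch above):

```python
# Hypothetical usage of the StorageAdapterRouter sketched above.
router = StorageAdapterRouter()
adapter = router.handle_storage_request("report.pdf")  # routes to GoogleDriveStorageAdapter
```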
```python
collection.query(data=[image], n_results=100) # LoadFromStorage will run before the results are returned and will load the data from the storage layer and return it to the user in memory
```

#### Future work: More Storage Pipelines
another idea I had is this...
perhaps we should embrace the idea of a "repo per database" or "repo with /mapped folders as DB"
then custom ops can be defined in that github repo.
Chroma can link a database to a repo and then enable CD - so when you update the repo (on a set branch)- it updates that database. (could also extend to have ideas of dev/staging branches)
Then we could have a `chroma init` command in the CLI that would set up the scaffold for this repo to make it easy.
It's a way to use Git to manage the lifecycle of the DB itself.
There are three concepts being intermingled here.
- What is the schema of your chroma instance.
- How is that schema stored and defined.
- How is that schema updated.
I don't think OSS Chroma should become a gitops platform, which is what you are suggesting; that feels far out of scope and very specific to one way of doing things. Not all organizations will want that.
We certainly can allow you to declaratively define the schema of your DB in code, and then allow programmatic updates to that schema. This starts to feel a lot like reinventing SQL, but that's neither here nor there.
I can see hosted chroma offering such functionality, but it seems far off.
definitely far off and very hypothetical
For example, the default query pipeline would simply Embed the text, and would be defined as follows:

```python
collection = client.create_collection("my_collection", pipeline=DefaultEmbeddingFunctionPipelineStep(), query_pipeline=DefaultEmbeddingFunctionPipelineStep()) # The defaults would not actually have to be specified, as they would be the default values. Here we are specifying them for clarity.
```
Haha, I like the "inversion of control". With this, LC becomes an upstream dep of Chroma
```python
@pipeline_step(name="Chunk")
def chunk(data: str) -> List[str]:
```
Will there be a way to parametrize functions? Should we consider/support partials, too?
And how about lambdas?
Yeah we will support partials
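For instance, parametrizing a step with `functools.partial` could look roughly like this (the `chunk` signature here is made up for illustration):

```python
# Hypothetical sketch: fix step parameters with functools.partial.
from functools import partial
from typing import List

def chunk(data: str, chunk_size: int = 512, overlap: int = 0) -> List[str]:
    step = max(chunk_size - overlap, 1)
    return [data[i : i + chunk_size] for i in range(0, len(data), step)]

# The partial keeps the single-argument call signature the pipeline expects.
SmallChunk = partial(chunk, chunk_size=128, overlap=16)
chunks = SmallChunk("some very long document ...")
```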
It is important to preserve the existing behavior of our API where users can simply pass text to the add step and have it be embedded. To do this, we will create a default pipeline that is used when no pipeline is provided. This default pipeline will simply embed the text. This default pipeline will be persisted with the collection, and will be used when no pipeline is provided. This means that users can create a collection with a default pipeline, and then query the collection without having to provide the pipeline. This is a significant improvement over the current system, where users must remember which embedding function they used when they created the collection.

Pipelines can be persisted by serializing the pipeline steps and storing them in the metadata of the collection. This means that when a collection is loaded, the pipeline steps can be deserialized and used to recreate the pipeline. This is a significant improvement over the current system, where users must remember which embedding function they used when they created the collection.
+1 for DSL/portable format. One caveat is when we get pipeline function sprawl from community contributions. Lots of projects split functions into core and contrib.
collection = client.create_collection("my_collection", pipeline=Embed(Chunk()), query_pipeline=Embed()) | ||
``` | ||
|
||
#### The I/0 of a pipeline stage |
@HammadB, do you think it makes sense for pipelines to be allowed to operate over metadata and Ids?
This can be useful for the following:
- Generate Ids
- When doing chunking add metadata associated with chunks
It can! You can imagine an AssignIds() step, and I gave a chunking example - did you want something else?
Following on the metadata, would this allow me to change the metadata before inserting alongside the data? For example, maybe I have a URL and want to get the domain from it and add it as a new metadata field before inserting. Or would that be something to be done before the whole pipeline chaining?
I understand that pipeline steps will have access to read and update all - docs, embeddings, ids, metadata.
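As an illustration of a step that rewrites metadata (the record shape follows the example below; the step itself is hypothetical):

```python
# Hypothetical sketch: derive a new metadata field (domain) from an existing one (url).
from typing import Any, Dict
from urllib.parse import urlparse

def add_domain_metadata(batch: Dict[str, Any]) -> Dict[str, Any]:
    # batch is assumed to look like {"ids": [...], "documents": [...], "metadata": [...], ...}
    for metadata in batch.get("metadata") or []:
        url = metadata.get("url")
        if url:
            metadata["domain"] = urlparse(url).netloc
    return batch
```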
```python
collection.add(data=data, ids=[0, 1])

# Input to pipeline would be {ids: [0, 1], documents: ["large_document_1", "large_document_2"], metadata: None, embeddings: None}
# Output of Chunk could be {ids: [A, B, C, D], documents: ["chunk_1", "chunk_2", "chunk_3", "chunk_4"], metadata: [{"source_id": 0}, {"source_id": 0}, {"source_id": 1}, {"source_id": 1}], embeddings: None}
```
@HammadB, ok, I see that my point above is kind of moot with this, but still: do you think it could be useful to "evidence" somehow to the user that e.g. metadata will be injected, perhaps through function documentation, e.g. what the side effects of the pipeline function are?
On the same note, maybe we can have a utility method that will create a local representation of the transformed data so the user can review it prior to storing it in Chroma.
Yeah, one general concern I have about this, is the somewhat opaque transformation of data that happens for you. For example if you assign ids, how do those ids get returned to you? How can you see results of pipeline execution?
Pipelines need good introspection, I reckon. But maybe just start with logging :)
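A rough sketch of what a local dry-run helper with logging could look like (the `preview` name and record shape are made up):

```python
# Hypothetical sketch: run the pipeline locally and log each intermediate result,
# without writing anything to the collection.
import logging
from typing import Any, Callable, List

logger = logging.getLogger("chroma.pipeline")

def preview(steps: List[Callable[[Any], Any]], batch: Any) -> Any:
    for step in steps:
        batch = step(batch)
        logger.info("after %s: %r", getattr(step, "__name__", step), batch)
    return batch  # caller can inspect assigned ids, chunks, etc. before calling add()
```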
```python
collection.add(data=data, ids=[2, 3])
```

#### Server-side registration of pipeline steps
@HammadB, do you think there's a way to sandbox things? I like the idea of self-contained things, but let's consider unsafe code execution on the server side.
So if you are running your own server, it's not unsafe, it's code you defined.
For hosted chroma, we will have to sandbox the execution.
With community-contributed functions this can get messy; if not too complicated, perhaps we can offer sandboxing across the board.
I think we will have to audit them quite exhaustively for now. I don't think we have to ship with sandboxing in v1.
```python
# The metadata segment would then store the KV pair "chroma:document" -> "https://storage.com/image_1"
```

We can ship two StoragePipelines to start
For some data types like images, we can also use the underlying DB - base64 the image and put it in string_value (TEXT size is 2GB for SQLite). This has implications, of course, for other RDBMSs (distributed version), but so do datatypes in general.
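For example, inlining a small image into the metadata segment could look roughly like this (a sketch; the data-URI convention is an assumption, not part of the CIP):

```python
# Hypothetical sketch: base64-encode an image for storage in a TEXT column.
import base64

with open("image_1.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# e.g. stored as the KV pair "chroma:document" -> "data:image/png;base64,<encoded>"
document_value = f"data:image/png;base64,{encoded}"
```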
Note to self: think through where_document behavior

#### Future work: Parallel pipeline execution
Not all pipeline stages need to be executed sequentially. In the future, we can add support for parallel pipeline execution by changing how the pipeline is defined.

## Storage Pipelines
@HammadB, just a thought: Can this be, in theory, called storage adapters/interfaces?
An adapter, as in a wrapper on top of a specific storage medium?
Sure
First, register the "Chunk" pipeline step with the name "Chunk" as follows:

```python
@pipeline_step(name="Chunk")
```
If we use a chained syntax:
`withData(data).pipe([Chunk,Embed])`
then we can even remove the decorator part - the actual wrapping can happen within `pipe()`.
Elaborating further on the chained pipeline:
- `.register(pipeline_name="my_pipeline")` or `.upload(pipeline_name="my_pipeline")` - packages the pipeline and uploads it to the server
- `.exec_local()` - local execution for testing/dev
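A sketch of what that fluent surface might look like (everything here - `withData`, `pipe`, `register`, `exec_local` - mirrors the comment and is illustrative, not a proposed API):

```python
# Hypothetical sketch of the fluent/chained pipeline surface described above.
from typing import Any, Callable, List

class FluentPipeline:
    def __init__(self, data: Any) -> None:
        self.data = data
        self.steps: List[Callable[[Any], Any]] = []

    def pipe(self, steps: List[Callable[[Any], Any]]) -> "FluentPipeline":
        self.steps.extend(steps)
        return self

    def exec_local(self) -> Any:
        # local execution for testing/dev
        result = self.data
        for step in self.steps:
            result = step(result)
        return result

    def register(self, pipeline_name: str) -> "FluentPipeline":
        # would package the steps and upload them to the server under this name
        ...
        return self

def withData(data: Any) -> FluentPipeline:
    return FluentPipeline(data)

# withData(data).pipe([Chunk, Embed]).register(pipeline_name="my_pipeline")
```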
For example, the default pipeline would simply Embed the text, and would be persisted as follows:

```python
collection = client.create_collection("my_collection", pipeline=DefaultEmbeddingFunctionPipelineStep())
```
If we do something like in https://github.com/chroma-core/chroma/pull/1110/files#r1318974190 (`.register(pipeline_name="my_pipeline")`), then we can use just the name for the pipeline, `pipeline="my_pipeline"`, thus indicating remote pipeline execution.
We can also consider alternative semantics to differentiate between local (client-side) and remote (server-side).
Pipelines can be persisted by serializing the pipeline steps and storing them in the metadata of the collection. This means that when a collection is loaded, the pipeline steps can be deserialized and used to recreate the pipeline. This is a significant improvement over the current system, where users must remember which embedding function they used when they created the collection.
@HammadB, wdyt about promoting pipelines to first-class citizens with a separate table? This promotes reusability, and we can even introduce a small API for managing them.
All this comes, of course, with downsides:
- Deletion - if one deletes a pipeline that is used by several collections (we can prevent deletion of referenced pipelines)
- Portability - while collection data is "relatively" easy to export, external pipelines add an extra layer of complexity
- Versioning - if two collections start with the same pipeline but one decides it needs an extra step; introducing versioning might solve this, but it too comes with its challenges
I think this needs a bit deeper analysis to fully understand all the pros/cons.
Yeah I was thinking about this too, this is a good point, I'll give it more thought.
Stored procedures are the historical comp of course... but I think in general keeping "code" out of the database is very important. Code should be managed by gitops.
Storing pointers to code is what's being suggested here. Pipeline step definition will still be persisted however you manage that.
(I agree that we shouldn't version/store your code)
```python
collection.add(data=[image], ids=[0]) # image is a PIL image, or numpy array etc
collection.query(data=[image], n_results=100) # LoadFromStorage will run before the results are returned and will load the data from the storage layer and return it to the user in memory
```
do we also need to think about how querying works with multi-modality? e.g. right now we have `query_texts`... but maybe this should be `query_data(s)`? cc @HammadB
Yep, we'd change query too, thanks for pointing that out.
Users often want to load documents from a variety of sources. For example, users may want to load documents from a database, a file system, or a remote service; these documents may be in a variety of formats, such as text, images, or audio. Today, users are responsible for loading these documents
"responsible for loading" - and keeping them up-to-date, e.g. triggering a sync, which brings some more challenges.
also this!
Description of changes
Add CIP 6 - Pipelines, The Registry and Storage for discussion.