
[DOCS] Add CIP 6 - Pipelines, The Registry and Storage #1110

Open - wants to merge 5 commits into base: main
Conversation

@HammadB (Collaborator) commented Sep 6, 2023

Description of changes

Add CIP 6 - Pipelines, The Registry and Storage for discussion.

github-actions bot commented Sep 6, 2023

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@jeffchuber (Contributor) left a comment:

awesome start


## Status

Current Status: `Under Discussion`
Contributor:

are we comfortable supporting only python for ops? what is the story for other languages? in particular javascript.

how would a Go Client use pipelines

@HammadB (Collaborator, Author) commented Sep 6, 2023:

My thought is that:

  • Pipeline steps are written in Python; in the future we can support other execution runtimes, but for now, Python.
  • You can write simple pipeline pointers in the client languages that will encode the right step (a sketch of what such a pointer could encode follows below).
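(For illustration only: a client-side "pipeline pointer" could be little more than a serializable reference to a server-registered step. `PipelineStepRef` and `encode_pipeline` below are invented for this sketch, not part of the proposal.)

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PipelineStepRef:
    """A client-side pointer to a pipeline step that is implemented on the server."""
    name: str                                   # e.g. "Chunk" or "Embed"
    params: Dict[str, Any] = field(default_factory=dict)


def encode_pipeline(steps: List[PipelineStepRef]) -> str:
    """Serialize an ordered list of step references for the server to resolve and run."""
    return json.dumps([{"name": s.name, "params": s.params} for s in steps])


# A JS or Go client would only need to produce the same payload.
payload = encode_pipeline([PipelineStepRef("Chunk", {"size": 512}), PipelineStepRef("Embed")])
```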

Contributor:

I think that constraint probably makes sense for now (we need some!)

What do you imagine this would look like in JS then? I guess we could still have types that are chainable... but not runnable locally? hmmm

@HammadB (Collaborator, Author) commented Sep 6, 2023:

So they could be runnable locally; they would just have to be reimplemented, and if you wanted portability you'd have to ensure they are 1:1 in behavior. Or we could make it so that they can never be run locally, but I don't love that.

With power comes tradeoffs :(

Contributor:

yeah mostly just trying to think through the pros and cons. there will be cons and we will just have to live with them.

I think for a minimal surface area re-implementation is ok. (like we currently do with embedding functions)

Contributor:

@HammadB @jeffchuber, can this not be only a server-side thing? Then all we need is a DSL or whatever is "natural" in the target language to encode, as suggested above.

This is how CQL (Chroma Query Language) is born.

Contributor:

inventing new query languages is a red flag to me! 🚩🚩🚩

@HammadB (Collaborator, Author):

I think the reason to support client-side is to allow for rapid experimentation.

Contributor:

@HammadB, do you think supporting Client and Server side pipelines makes sense?

Since I'm now working on client/server encryption, there is a "world" where encrypting (public key) on the client side and decrypting on the server side (private key) can be helpful.

@perzeuss (Contributor) commented Oct 29, 2023:

If I understand correctly, the individual operators in the pipeline are stateless, right?
So, there could be a list of operators that you can chain together, given that inputs are compatible with the output after the previous operation in the chain.

Since pipelines are supposed to run server-side, pipelines developed in other languages are not possible. However, what is possible is that each client assembles pipelines based on the operators. I would plan for this from the beginning.

What needs to be created on the client-side in the end to build a pipeline is a manifest file. Most users will likely reuse pipelines, and it makes sense to define these via a manifest in the repository, which also serves as good documentation for the pipeline. I think the idea of using YAML for this is a good one. There doesn't need to be a new query language for that.

In JavaScript, you could create pipelines as a manifest that you send (automatically generated by code, even possible in a pipeline syntax), for example:

const collection = client.collection('my-collection');
 pdfHttpStream.pipe(
  chunk(),
  embed(),
  quantize(),
  execute(collection)
);

Alternatively, you could pass a pipeline manifest to the collection:

const collection = client.collection('my-collection', {pipeline: manifestFile});
collection.add(data);

On the server-side, the Python server needs to dynamically build and execute the pipeline based on the manifest. This way every client can control the pipeline to be used before storing the embeddings.
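(A rough sketch of what server-side assembly from such a manifest could look like, assuming a registry of named, stateless operators; every name below is illustrative, not a proposed API.)

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry of named operators available on the server.
OPERATOR_REGISTRY: Dict[str, Callable[..., Any]] = {
    "chunk": lambda data, size=512: [data[i:i + size] for i in range(0, len(data), size)],
    "embed": lambda chunks: [[float(len(c))] for c in chunks],  # stand-in for a real model
}


def build_pipeline(manifest: List[Dict[str, Any]]) -> Callable[[Any], Any]:
    """Compose the operators named in the manifest into a single callable."""
    def run(data: Any) -> Any:
        for step in manifest:
            data = OPERATOR_REGISTRY[step["op"]](data, **step.get("params", {}))
        return data
    return run


# A manifest sent by any client (JS, Go, ...); the server resolves names to code.
pipeline = build_pipeline([{"op": "chunk", "params": {"size": 4}}, {"op": "embed"}])
print(pipeline("some longer document text"))
```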

# CIP: Pipelines, The Registry and Storage
Contributor:

I'm still mixed on the term pipeline ... alternatives include

  • workflow (temporal)
  • flow and tasks (prefect)
  • transforms
  • lambdas
  • functions
  • jobs & ops (dagster)
  • chain (langchain)

@HammadB (Collaborator, Author) commented Sep 6, 2023:

  • Workflow - fine, but I think it's less descriptive than pipeline, evokes a different set of tools, and implies state-machine-like behavior.
  • Flow - refers to a generalization of a pipeline; I am fine with this. A pipeline is a 1D flow (https://en.wikipedia.org/wiki/Pipeline_(software)).
  • Transforms - not all-encompassing; some of these have side effects or are not transformations.
  • Lambdas - feels really specific to something not always used here, especially in Python, and does not communicate sequentiality.
  • Functions - doesn't communicate sequence.
  • Jobs & Ops - Dagster's model is far more complicated (assets, graphs, etc.). "Job" feels really broad, and "Op" is fine but seems like needlessly using language no one uses.
  • Chain - feels owned by LangChain and is not descriptive for parallel execution.

I prefer pipeline or flow because they are most descriptive of a sequence of operations on data, where inputs and outputs flow into one another, potentially with side effects. Pipeline seems like something people will already be familiar with; flow would be a new term.

Contributor:

appreciate the reactions!

Contributor:

@HammadB @jeffchuber, I too think that pipeline conveys the meaning of chaining a bunch of discrete functions together. That said, why not just "invent" a name for this? We can still call it a pipeline just so people draw a parallel, but then the actual "catchy" name can be something of the sort of ChromaticFunnel :D

Naming aside, I am wondering about two additional concepts that come to mind with this - filters and error sinks.

  • Filters - useful when dropping out samples is a thing
  • Error sinks - basically a way to "nicely" handle inevitable errors (again user-driven functions); a rough sketch of both ideas follows below
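(A minimal illustration of how filter steps and an error sink might compose; `run_with_error_sink` and the "None means filtered out" convention are made up for this sketch.)

```python
from typing import Any, Callable, List, Tuple


def run_with_error_sink(
    items: List[Any],
    steps: List[Callable[[Any], Any]],
    error_sink: Callable[[Any, Exception], None],
) -> List[Any]:
    """Run each item through the steps; drop filtered items, route failures to the sink."""
    results = []
    for item in items:
        try:
            current = item
            for step in steps:
                current = step(current)
                if current is None:          # convention: None means "filtered out"
                    break
            else:
                results.append(current)
        except Exception as exc:             # inevitable per-item failures
            error_sink(item, exc)            # e.g. retry later, or dump to disk
    return results


def flaky_step(x: str) -> str:
    if x == "boom":
        raise ValueError("step failed")      # simulate a failing operator
    return x.upper()


failed: List[Tuple[Any, Exception]] = []
out = run_with_error_sink(
    ["keep me", "drop me", "boom"],
    [lambda x: x if "drop" not in x else None, flaky_step],
    error_sink=lambda item, exc: failed.append((item, exc)),
)
# out == ["KEEP ME"]; "drop me" was filtered; "boom" ended up in the error sink
```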

Contributor:

i am personally generally against making up new names for things that already have good names :)

Contributor:

oh and another "nice" concept from Java world - parallel() - the ability to split/join a sequence of pipeline steps. This can be done further down the road but I can see how a batch of 100k docs can have parallel embedding going for it.
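(A rough sketch of the split/join idea for an embed step, assuming the step is a plain function over a batch; `parallel_map`, the thread pool, and the batch sizes are assumptions of this sketch, not the CIP's design.)

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(step: Callable[[Sequence[T]], List[R]], items: Sequence[T],
                 batch_size: int = 1000, workers: int = 8) -> List[R]:
    """Split items into batches, run the step on each batch concurrently, then join."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(step, batches))   # preserves batch order
    return [r for batch in results for r in batch]


def embed(batch: Sequence[str]) -> List[List[float]]:
    return [[float(len(doc))] for doc in batch]   # stand-in for a real embedding call


embeddings = parallel_map(embed, ["doc"] * 10_000, batch_size=1_000)
```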

@HammadB (Collaborator, Author):

ah yeah, supporting something akin to promise-like error handling is a good idea

Contributor:

error handling is def interesting and important

@HammadB (Collaborator, Author):

my initial thought here was that pipelines are synchronous with the insert call, so error states are trivially propagated. Once we allow for async pipelines though this becomes much harder.

@tazarov (Contributor) commented Sep 7, 2023:

@HammadB, I think that the error handling can be made optional, e.g. if people want to do something with their failing batches - retry, dump them to disk so they can retry later, etc.

Edit: I think that the moment this is released, people will start doing async calls over the wire.

Pipeline steps can be composed, and return a reference to the composed pipeline step. For example, if you wanted to create a pipeline that chunked a document and then embedded the chunks, you could do something like this:

```python
to_add = Embed(Chunk(data))
```
Contributor:

how can we make it easy to see what things can be chained? can we use type hints?

@HammadB (Collaborator, Author):

yes.
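(A sketch of how type hints could make chainability visible to a type checker; `compose`, `chunk`, and `embed` here are illustrative, not the CIP's API.)

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def compose(first: Callable[[A], B], second: Callable[[B], C]) -> Callable[[A], C]:
    """Chain two steps; a type checker rejects the call if the types don't line up."""
    return lambda data: second(first(data))


def chunk(doc: str) -> List[str]:
    return [doc[i:i + 512] for i in range(0, len(doc), 512)]


def embed(chunks: List[str]) -> List[List[float]]:
    return [[float(len(c))] for c in chunks]


chunk_then_embed = compose(chunk, embed)   # OK: chunk's List[str] output feeds embed
# compose(embed, chunk)                    # flagged by mypy/pyright: str vs List[str]
```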

Contributor:

@HammadB, I love functional programming as much as the next guy, but do you think the same result can be achieved with a fluid (chained) API?

Contributor:

@tazarov can you provide an example?

Contributor:

An "extra" late-night thought - wdyt of a YAML-based manifest as a portable version of this? (Python functions would have to be bundled or installed as deps.)

@HammadB (Collaborator, Author):

Yeah I'm not wedded to this specific interface tbh

Contributor:

i still lean towards "it's just a function" because I think it's the easiest to understand and, more importantly, I think it may be the easiest to "break out of" when you want to add a new operation.

Comment:

I like "withData(data).pipe([Embed,Chunk])", as the function-wrapping approach could look messy once you pass 5 pipeline steps, which is feasible at the moment: FixMetadata, AssignIDs, ParentChunking, Chunking, Embedding, etc.

Contributor:

Using the | operator really conveys the pipe in the pipeline, and for most devs (familiar with bash), pipes will look natural and also Pythonic. Another aspect is that if one wants to pass parameters to steps, the function-wrapping approach gets even messier.

Another aspect of using chained method calls is that we can use lambdas for filtering and inline processing without the need for decorators:

withData(data).pipe([lambda x: 'allowed' in x, Chunk, Embed])
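(For a sense of what the `|` style could look like, a throwaway sketch; `Pipe`, `Chunk`, `Embed`, and `OnlyAllowed` are invented for illustration.)

```python
from typing import Any, Callable


class Pipe:
    """Tiny sketch of a `|`-composable pipeline wrapper."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def __or__(self, other: "Pipe") -> "Pipe":
        return Pipe(lambda data: other.fn(self.fn(data)))

    def __call__(self, data: Any) -> Any:
        return self.fn(data)


Chunk = Pipe(lambda doc: [doc[i:i + 512] for i in range(0, len(doc), 512)])
Embed = Pipe(lambda chunks: [[float(len(c))] for c in chunks])
OnlyAllowed = Pipe(lambda doc: doc if "allowed" in doc else "")   # inline lambda filter

pipeline = OnlyAllowed | Chunk | Embed     # reads like a shell pipe
result = pipeline("allowed document text ...")
```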

Comment:

Came here to suggest a fluid/chained API, but i'm quite late to the party :)

Totally subjective IMO, but i've tended to rely on using indentation when wrapping lots of functions together (clojure transducers jump to mind) as a way of visually communicating the level that code is at.

As a python noob, i genuinely don't know, but i wonder if such indentation tricks are possible with chaining in a way they're not with function wrapping. Like, i believe this is allowable, where a particular indentation level clearly indicates where each step definition starts:

(
    withData(data)
        .pipe(Chunk)
        .pipe(Embed)
)

But idk if there's an equivalently clear way of doing it with function wrapping?

Chroma right now implicitly stores your documents; however, we also want to support multimodal embeddings, and part of this CIP proposes deprecating documents= in favor of data=. The contract then becomes that the data you input (or that is generated by your pipeline) is stored by default and can be retrieved by the user in some form. We also want to support storing mixed modalities in a collection. For example, a CLIP-based collection should be able to store both text and images.

We propose to add Storage Pipeline steps capable of storing the inputted data into a Storage layer and then storing a string reference to that data in the metadata of the document. This is already how the document is stored today, backed by the metadata segment: the metadata segment stores the KV pair "chroma:document" -> the document contents. For example, say you had an image in memory and wanted to store it in Chroma; you could do something like this:
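(The concrete example in the CIP is not shown at this point in the excerpt; it appears further down. As a self-contained stand-in for the idea - persist the raw data, keep only a string reference in the metadata - a storage step could look like this; `store_and_reference` and the local directory are invented for the sketch.)

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict

STORAGE_DIR = Path(tempfile.gettempdir()) / "chroma_object_store"   # stand-in for S3/GCS


def store_and_reference(data: bytes, metadata: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical storage step: persist the raw bytes, keep only a reference in metadata."""
    STORAGE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(data).hexdigest()
    path = STORAGE_DIR / key
    path.write_bytes(data)
    # Mirrors how the document is stored today: a KV pair in the metadata segment.
    metadata["chroma:document"] = str(path)   # in production this would be a URI
    return metadata


meta = store_and_reference(b"\x89PNG...fake image bytes", {})
print(meta["chroma:document"])
```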
Contributor:

do we want to use metadata, or a new field uri?

@HammadB (Collaborator, Author):

yes, we can be flexible here; it's a minor implementation detail. I just didn't elaborate because it's not super important.

Contributor:

@HammadB, I think we can introduce a mime-type router here so that different data types are automatically routed to a specific storage pipeline that can handle it.

Am I correct in assuming that when you mention multimodal above, you mean to be able to store any type of "data" in your collection - text, images, audio?

@HammadB (Collaborator, Author):

yes, or binary. I was also considering a design for storage that's more like index definition, where you declaratively specify storage and it gets routed.

@tazarov (Contributor) commented Sep 7, 2023:

If we are to adopt a Storage Adapter-like concept, we can have:

from abc import abstractmethod

import overrides


class StorageAdapter(overrides.EnforceOverrides):
    @abstractmethod
    def handles(self, mimetype: str) -> bool:
        ...


class GoogleDriveStorageAdapter(StorageAdapter):
    def handles(self, mimetype: str) -> bool:
        return mimetype.startswith("application/pdf")


def extract_mimetype(file: str) -> str:
    ...


class StorageAdapterRouter:
    # Adapters are tried in order; the first one that claims the mime-type wins.
    adapters = [GoogleDriveStorageAdapter()]

    def handle_storage_request(self, file: str) -> StorageAdapter:
        mime = extract_mimetype(file)
        for adapter in self.adapters:
            if adapter.handles(mime):
                return adapter
        raise ValueError("No adapter found for mime-type: {}".format(mime))

Of course, the actual supported mime-types (or just types of files) can be declaratively configured.

```python
collection.query(data=[image], n_results=100) # LoadFromStorage will run before the results are returned and will load the data from the storage layer and return it to the user in memory
```

#### Future work: More Storage Pipelines
Contributor:

another idea I had is this...

perhaps we should embrace the idea of a "repo per database" or "repo with /mapped folders as DB"

then custom ops can be defined in that github repo.

Chroma can link a database to a repo and then enable CD - so when you update the repo (on a set branch)- it updates that database. (could also extend to have ideas of dev/staging branches)

Then we could have a chroma init command in the CLI that would set up the scaffold for this repo to make it easy.

It's a way to use Git to manage the lifecycle of the DB itself.

@HammadB (Collaborator, Author):

There are three concepts being intermingled here.

  1. What is the schema of your chroma instance.
  2. How is that schema stored and defined.
  3. How is that schema updated.

I don't think OSS Chroma should become a gitops platform, which is what you are suggesting; that feels far out of scope and very specific to one way of doing things. Not all organizations will want that.

We certainly can allow you to declaratively define the schema of your DB in code, and then allow programmatic updates to that schema. This starts to feel a lot like reinventing SQL, but that's neither here nor there.

I can see hosted chroma offering such functionality, but seems far off.

Contributor:

definitely far off and very hypothetical

For example, the default query pipeline would simply Embed the text, and would be defined as follows:

```python
# The defaults would not have to be specified here, since they are the default values; they are shown for clarity.
collection = client.create_collection(
    "my_collection",
    pipeline=DefaultEmbeddingFunctionPipelineStep(),
    query_pipeline=DefaultEmbeddingFunctionPipelineStep(),
)
```
Contributor:

Haha, I like the "inversion of control". With this, LC becomes an upstream dep of Chroma


```python
@pipeline_step(name="Chunk")
def chunk(data: str) -> List[str]:
    ...
```
Contributor:

Will there be a way to parametrize functions? Should we consider/support partials, too?

Contributor:

And how about lambdas?

@HammadB (Collaborator, Author):

Yeah we will support partials
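(A sketch of what partial-based parametrization could look like; the `pipeline_step` decorator here is a stand-in that only registers the function, since its real behavior isn't pinned down in this thread.)

```python
from functools import partial
from typing import Callable, Dict, List

STEP_REGISTRY: Dict[str, Callable] = {}   # stand-in for the CIP's registry


def pipeline_step(name: str):
    def register(fn: Callable) -> Callable:
        STEP_REGISTRY[name] = fn
        return fn
    return register


@pipeline_step(name="Chunk")
def chunk(data: str, size: int = 512, overlap: int = 0) -> List[str]:
    stride = size - overlap
    return [data[i:i + size] for i in range(0, len(data), stride)]


# Parametrization via functools.partial: the pipeline stores the bound parameters.
small_chunks = partial(chunk, size=128, overlap=16)
pieces = small_chunks("a long document ...")
```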


It is important to preserve the existing behavior of our API, where users can simply pass text to the add step and have it be embedded. To do this, we will create a default pipeline that is used when no pipeline is provided; it simply embeds the text. This default pipeline will be persisted with the collection, so users can create a collection with it and then query the collection without having to provide a pipeline.

Pipelines can be persisted by serializing the pipeline steps and storing them in the metadata of the collection. When a collection is loaded, the pipeline steps are deserialized and used to recreate the pipeline. This is a significant improvement over the current system, where users must remember which embedding function they used when they created the collection.
Contributor:

+1 for a DSL/portable format. One caveat is when we get pipeline function sprawl from community contributions. Lots of projects split functions into core and contrib.

```python
collection = client.create_collection("my_collection", pipeline=Embed(Chunk()), query_pipeline=Embed())
```

#### The I/O of a pipeline stage
Contributor:

@HammadB, do you think it makes sense for pipelines to be allowed to operate over metadata and IDs?

This can be useful for the following:

  • Generating IDs
  • Adding metadata associated with chunks when doing chunking

@HammadB (Collaborator, Author):

It can! You can imagine an AssignIds() step, and I gave a chunking example - did you want something else?

Comment:

Following on the metadata, would this allow me to change the metadata before inserting alongside the data? For example, maybe I have a URL and want to extract the domain from it and add it as a new metadata field before inserting. Or would that be something to do before the whole pipeline chaining?
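(The kind of step being asked about could look roughly like this, assuming steps are allowed to read and return metadata, which is exactly what is being discussed; `add_domain_metadata` is invented for illustration.)

```python
from typing import Dict, List
from urllib.parse import urlparse


def add_domain_metadata(metadatas: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Hypothetical pipeline step: derive a `domain` field from an existing `url` field."""
    for meta in metadatas:
        if "url" in meta:
            meta["domain"] = urlparse(meta["url"]).netloc
    return metadatas


enriched = add_domain_metadata([{"url": "https://docs.trychroma.com/usage-guide"}])
# enriched == [{"url": "https://docs.trychroma.com/usage-guide", "domain": "docs.trychroma.com"}]
```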

Contributor:

I understand that pipeline steps will have access to read and update all - docs, embeddings, ids, metadata.

```python
collection.add(data=data, ids=[0, 1])

# Input to pipeline would be {ids: [0, 1], documents: ["large_document_1", "large_document_2"], metadata: None, embeddings: None}
# Output of Chunk could be {ids: [A, B, C, D], documents: ["chunk_1", "chunk_2", "chunk_3", "chunk_4"], metadata: [{"source_id": 0}, {"source_id": 0}, {"source_id": 1}, {"source_id": 1}], embeddings: None}
```
Contributor:

@HammadB, ok I see that my point above is kind of moot with this, but still - do you think it could be useful to surface to the user that e.g. metadata will be injected, perhaps through function documentation describing the side effects of the pipeline function?

On the same note, maybe we can have a utility method that will create a local representation of the transformed data so the user can review it prior to storing it in Chroma.

@HammadB (Collaborator, Author):

Yeah, one general concern I have about this is the somewhat opaque transformation of data that happens for you. For example, if you assign IDs, how do those IDs get returned to you? How can you see the results of pipeline execution?

Contributor:

Pipelines need good introspection, I reckon. But maybe just start with logging :)

```python
collection.add(data=data, ids=[2, 3])
```

#### Server-side registration of pipeline steps
Contributor:

@HammadB, do you think there's a way to sandbox things? I like the idea of self-contained things, but let's consider unsafe code execution on the server side.

@HammadB (Collaborator, Author):

So if you are running your own server, it's not unsafe; it's code you defined.

For hosted Chroma, we will have to sandbox the execution.

Contributor:

With community-contributed functions this can get messy; if it's not too complicated, perhaps we can offer sandboxing across the board.

@HammadB (Collaborator, Author):

I think we will have to audit them quite exhaustively for now. I don't think we have to ship with sandboxing in v1.

```python
# The metadata segment would then store the KV pair "chroma:document" -> "https://storage.com/image_1"
```

We can ship two StoragePipelines to start
Contributor:

For some data types like images, we can also use the underlying DB - base64 the image and put it in string_value (TEXT size is 2GB for SQLite). This has implications, of course, for other RDBMSs (distributed version), but so do datatypes in general.
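(A quick self-contained illustration of the base64-in-TEXT idea; the table below is a simplified stand-in, not Chroma's actual schema.)

```python
import base64
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embedding_metadata (key TEXT, string_value TEXT)")  # simplified

image_bytes = b"\x89PNG\r\n...fake image bytes..."
encoded = base64.b64encode(image_bytes).decode("ascii")
conn.execute("INSERT INTO embedding_metadata VALUES (?, ?)", ("chroma:document", encoded))

(row,) = conn.execute("SELECT string_value FROM embedding_metadata").fetchall()
assert base64.b64decode(row[0]) == image_bytes   # round-trips through a TEXT column
```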

@HammadB (Collaborator, Author) commented Sep 6, 2023:

Note to self: think through where_document behavior

#### Future work: Parallel pipeline execution
Not all pipeline stages need to be executed sequentially. In the future, we can add support for parallel pipeline execution by changing how the pipeline is defined.

## Storage Pipelines
Contributor:

@HammadB, just a thought: Can this be, in theory, called storage adapters/interfaces?

An adapter, as in a wrapper on top of a specific storage medium?

@HammadB (Collaborator, Author):

Sure

First, register the "Chunk" pipeline step with the name "Chunk" as follows:

```python
@pipeline_step(name="Chunk")
```
Contributor:

If we use a chained syntax - withData(data).pipe([Chunk, Embed]) - then we can even remove the decorator part; the actual wrapping can happen within pipe().

Elaborating further on the chained pipeline:

  • .register(pipeline_name="my_pipeline") or .upload(pipeline_name="my_pipeline") - Packages the pipeline and uploads to server
  • .exec_local() - local execution for testing/dev

For example, the default pipeline would simply Embed the text, and would be persisted as follows:

```python
collection = client.create_collection("my_collection", pipeline=DefaultEmbeddingFunctionPipelineStep())
```
Contributor:

If we do something like in https://github.com/chroma-core/chroma/pull/1110/files#r1318974190 (.register(pipeline_name="my_pipeline")), then we can use just the name for the pipeline (pipeline="my_pipeline"), thus indicating remote pipeline execution.

We can also consider alternative semantics to differentiate between local (client-side) and remote (server-side) execution.


It is important to preserve the existing behavior of our API, where users can simply pass text to the add step and have it be embedded. To do this, we will create a default pipeline that is used when no pipeline is provided; it simply embeds the text. This default pipeline will be persisted with the collection, so users can create a collection with it and then query the collection without having to provide a pipeline.

Pipelines can be persisted by serializing the pipeline steps and storing them in the metadata of the collection. When a collection is loaded, the pipeline steps are deserialized and used to recreate the pipeline. This is a significant improvement over the current system, where users must remember which embedding function they used when they created the collection.
Contributor:

@HammadB, wdyt about promoting pipelines as first-class citizens with a separate table? This promotes reusability and we can even introduce a small API for managing them.

All this comes, of course, with downsides:

  • If one deletes a pipeline that is used by several collections (we can prevent deletion of referenced pipelines)
  • Portability - While collection data is "relatively" easy to export, external pipelines add an extra layer of complexity
  • Versioning - if two collections start with the same pipeline but one decides it needs an extra step - introducing versioning might solve this problem, but it too comes with its challenges

I think this needs a bit deeper analysis to fully understand all the pros/cons.
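(A rough sketch of what a first-class pipeline registry with reference counting might look like; `PipelineRegistry` and its methods are invented for illustration.)

```python
from typing import Dict, List, Set


class PipelineRegistry:
    """Pipelines stored in their own table/namespace, with deletion guarded by references."""

    def __init__(self) -> None:
        self._pipelines: Dict[str, List[str]] = {}    # name -> ordered step names
        self._used_by: Dict[str, Set[str]] = {}       # name -> collections referencing it

    def register(self, name: str, steps: List[str]) -> None:
        self._pipelines[name] = steps
        self._used_by.setdefault(name, set())

    def attach(self, name: str, collection: str) -> None:
        self._used_by[name].add(collection)

    def delete(self, name: str) -> None:
        if self._used_by.get(name):
            raise ValueError(f"pipeline '{name}' is still used by {sorted(self._used_by[name])}")
        self._pipelines.pop(name, None)
        self._used_by.pop(name, None)


registry = PipelineRegistry()
registry.register("chunk_and_embed", ["Chunk", "Embed"])
registry.attach("chunk_and_embed", "my_collection")
# registry.delete("chunk_and_embed")   # would raise: still referenced by my_collection
```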

@HammadB (Collaborator, Author):

Yeah I was thinking about this too, this is a good point, I'll give it more thought.

Contributor:

Stored procedures are the historical comp, of course... but I think in general keeping "code" out of the database is very important. Code should be managed by gitops.

@HammadB (Collaborator, Author):

Storing pointers to code is what's being suggested here. Pipeline step definition will still be persisted however you manage that.

@HammadB (Collaborator, Author):

(I agree that we shouldn't version/store your code)

```python
collection.add(data=[image], ids=[0]) # image is a PIL image, or numpy array etc
collection.query(data=[image], n_results=100) # LoadFromStorage will run before the results are returned and will load the data from the storage layer and return it to the user in memory
```

Contributor:

do we also need to think about how querying works with multi-modality? eg right now we have query_texts ... but maybe this should be query_data(s)? cc @HammadB

@HammadB (Collaborator, Author):

Yep, we'd change query too - thanks for pointing that out.

@atroyn mentioned this pull request Oct 12, 2023

Users often want to load documents from a variety of sources. For example, users may want to load documents from a database, a file system, or a remote service; these documents may be in a variety of formats, such as text, images, or audio. Today, users are responsible for loading these documents
@perzeuss (Contributor) commented Oct 29, 2023:

> responsible for loading

and keeping them up-to-date, e.g. triggering a sync, which brings some more challenges.

Contributor:

also this!
