Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve message on collection.get empty embeddings #300

Closed
arnaudmiribel opened this issue Apr 7, 2023 · 20 comments · Fixed by #2044
Closed

Improve message on collection.get empty embeddings #300

arnaudmiribel opened this issue Apr 7, 2023 · 20 comments · Fixed by #2044
Assignees
Labels
enhancement New feature or request to-discuss

Comments

@arnaudmiribel
Copy link

arnaudmiribel commented Apr 7, 2023

Problem

I have successfully defined my embedding function, and added documents.

And yet when calling Collection.get(), the default behaviour is to show None for embeddings in the output:

>>> collection.get("doc1")
{'ids': [['doc1']],
 'embeddings': None,  # <---- Why?
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

Seeing None is rather counter intuitive*, as it gives the impression that embeddings were not computed. They are! But you need to explicitly pass include=["embeddings"] to discover that:

>>> collection.get("doc1", include=["embeddings", "metadatas"])
{'ids': [['doc1']],
 'embeddings': [[0.2352, 2324.... ]], 
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

*Saw a couple of persons having the same issue on Discord

Proposals (additive)

  1. Have Collection.get only show "id" + keys passed through include. Avoid the counter intuitive None!
  2. Have "embeddings" part of the default values for include with a pretty printer, to avoid cluttering the output.

New behaviours:

>>> collection.get("doc1")
{'ids': [['doc1']],
 'embeddings': [[0.2352, 2324.... ]], 
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

>>> collection.get("doc1", include=["documents", "metadatas"])
{'ids': [['doc1']],
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

Settings

Chroma version: 0.3.21
Python version: 3.8.12

@arnaudmiribel
Copy link
Author

Happy to work on a PR should you find this a good idea.

@jeffchuber
Copy link
Contributor

@arnaudmiribel hi there!

Originally .get and .query did return the embedding itself. However this, as a default, led to very poor performance, especially over the current REST API. The rest api was not chosen for its performance (we will likely switch to ArrowFlight or other down the road), but because it's ease of integration. The serialization of the large arrays led to a huge amount of latency. We prioritized developers have a good getting started experience with chroma being fast to return a query, as most developers don't use the returned embeddings. This is strictly necessary for .query, and we implemented it into .get to be consistent.

We could do a few things

  • improve the docs
  • append a notes object in the return that specifies "to get embeddings, please do X" (though this feels messy)
  • make .get return embeddings by default, even if .query doesnt

I think improving the docs is an obvious thing to do.

Thoughts?

@arnaudmiribel
Copy link
Author

arnaudmiribel commented Apr 7, 2023

Thanks for reaching out!

I agree that improving the docs is certainly a low hanging fruit! But I still think it is misleading if not wrong to show "embeddings": None, when embeddings were actually computed and not included in the include= parameter.

What about:

  • (Straightforward) Not show anything about "embeddings" if "embeddings" is not in the include= keyword.
  • Have the API return a prettyprinted / lazy version of the embeddings? Not the actual embeddings. Something like "embeddings": Embeddings(shape=[x, y]),. So you can still surface that embeddings were computed, but without compromise on latency.

@jeffchuber
Copy link
Contributor

these are both great ideas! thanks for brainstorming this :)

@Vikho
Copy link

Vikho commented Apr 11, 2023

That's great, finally solved my puzzle, thanks!

@jeffchuber
Copy link
Contributor

Update: added a section on the new troubleshooting page about this - https://docs.trychroma.com/troubleshooting

@jeffchuber
Copy link
Contributor

jeffchuber commented Jul 4, 2023

Options

  • drop the field entirely if the user does not include it
  • include the field, but just include some high level metadata about it, eg Embeddings(shape=[x, y])
  • include the field as None, but also include another key-value something like included_fields: ['distances', 'ids']
  • include the field with literally a note about how to include it embeddings: "add include=['embeddings'] get to them back too"
  • ....

@atroyn @HammadB i added a few options here

@jeffchuber jeffchuber changed the title Collection.get displays None for embeddings even if embeddings exist Improve message on collection.get empty embeddings Aug 29, 2023
@jeffchuber jeffchuber added the enhancement New feature or request label Aug 29, 2023
@jeffchuber
Copy link
Contributor

@tazarov would love your thoughts on this long-outstanding issue as well

@jeffchuber
Copy link
Contributor

return?

"embeddings": "how to include object"

@tazarov
Copy link
Contributor

tazarov commented Sep 14, 2023

@jeffchuber I like the idea of always including embeddings in the response, but instead of actual embeddings, have a proxy object for lazy-loading. The proxy object is "invisible" to the end user, where each Embedding object will hold the underlying embedding ID to lazy fetch this.

There are two aspects to this; although most users don't realize it when operating with a persistent client it mostly doesn't make a difference, so for SegmentAPI we can always return the embeddings. For HttpClient, things look different, if our embedding proxy is too fine-grained e.g. fetching from the server on a per-vector basis, which can have even worse performance impacts than always returning the vectors in the original query. Therefore, I suggest we use Embeddings to do our proxying so that whenever access to any vector in the set is requested, we fetch the full batch.

Here is a dumbed-down version to illustrate:

class Embedding(List[Union[float, int]]):
    def __init__(self, eid: str, vector: List[Union[float, int]] = None):
        super().__init__(vector if vector else [])
        self._eid:str = eid #this is the embedding id to be lazy loaded

    @property
    def eid(self) -> str:
        return self._eid


class Embeddings(List[Embedding]):
    def __init__(self, fetch: Callable, embeddings: List[Embedding]) -> None:
        super().__init__(embeddings if embeddings else [])
        self._loaded = False if embeddings else True
        self._fetch = fetch

    def _ensure_loaded(self) -> None:
        if not self._loaded:
            self._load()

    def __getitem__(self, index: int) -> Embedding:
        self._ensure_loaded()
        return super().__getitem__(index)

    def __len__(self) -> int:
        self._ensure_loaded()
        return super().__len__()

    def _load(self) -> None:
        #TODO: Implement parallel/paginated fetches
        fetched = self._fetch(ids=[e.eid for e in self])
        self.extend([Embedding(eid, vector) for eid, vector in fetched])
        self._loaded = True
        

So FastAPI API impl will now return GetResult where:

class GetResult(TypedDict):
    ids: List[ID]
    embeddings: Optional[Embeddings]
    documents: Optional[List[Document]]
    metadatas: Optional[List[Metadata]]

@tazarov
Copy link
Contributor

tazarov commented Sep 14, 2023

To point out that that I have also looked at the other options, but I feel that returning something that is not the embeddings or proxy thereof is not a great developer experience and will lead to more questions/confusion.

Additionally, if we don't return embeddings at all, it'll mismatch the docs.

Following the Python Zen, if it's a duck, it should quack.

@HammadB wdyt?

@jeffchuber
Copy link
Contributor

hmmmm. That's an interesting idea. @HammadB thoughts?

@HammadB
Copy link
Collaborator

HammadB commented Dec 8, 2023

The lazy fetch idea is good, lets prototype it/think some more

  1. If you include, it should get loaded
  2. If you don't include it, it should get a lazy handle it
  3. getitem or similar on the lazy handle should fetch it based on the cached query

There is some wierdness, what if the data was mutated in between?

@thorwhalen
Copy link

thorwhalen commented Jan 4, 2024

Hi, struggled for a while to figure out what was happening here, until I found this issue.
Indeed, we should at least update the docs temporarily, so as to make it clear what is happening here (I'll do it myself if you want -- where can we edit and PR the docs?)

Regarding @jeffchuber's list of options, I'd vote for:

  • drop the field entirely if the user does not include it

As always, there's pros and cons for all the solutions proposed, but it does seem to me that:

{
  'ids': ['foo', 'apple'],
  'metadatas': [None, None],
  'documents': ['bar', 'crumble'],
}

communicates what's happening better than

{
  'ids': ['foo', 'apple'],
  'embeddings': None,
  'metadatas': [None, None],
  'documents': ['bar', 'crumble'],
  'uris': None,
  'data': None
}

Here, I asked for metadatas (of which there were none) and documents.

The only not-easily-circumventable cons I can think of are if some users depend on the current "fixed-schema" output (which is currently typed by the chromadb.api.types.GetResult TypedDict, in fact).
Hopefully, I would't be instigating a debate on the pros and cons of schemas/types here, but one could even argue here that setting such a schema will make it harder to evolve what the DB responses are in the future.

Another aspect that might count in our choice here: Is there an existing design (and corresponding lingo) we are following here? This "results always include an id", and several other choices chromadb made are reminiscent of mongodb to me. Mongodb's version of include would then be projection, and has made the choice that "only the fields requested are returned" (+ id, unless explicitly requested NOT TO).

Regarding a lazy loading wrapper: I whole heartedly cheer for that effort to, but I think it is more complex than what is immediately needed here, and is probably needed in more places than just here: Personally, I'd want to have such an object available anytime the lazy-load pattern solutions my problems.

@jeffchuber
Copy link
Contributor

@thorwhalen here is the docs repo :) https://github.com/chroma-core/docs

thanks for thinking this through with us!

@thorwhalen
Copy link

@jeffchuber. I see where I can submit a PR now.

Looking there, though, raises a bunch more questions about what your contributing rules (style, structure, content) is.
The CONTRIBUTING.md is essentially empty. Do you hold that information elsewhere?

Here are two images showing the Collections.add method's signature and docs:

image

image

So I see the google style is to be used, but here are a few immediate questions I have:

  • I see only a short description, plus args, returns, and raises sections. A quick scan tells me that most of your method follow a similar structure. Where therefore should I put extra information, such as some clarification of the output's content (our current subject) or doctests (do you even do those?).
  • I see that docs and function contents are misaligned. The docs don't show the images and the uris argument, which the function has, and documents. Sometimes I do this on purpose, because I don't want to document a feature before giving it time to mature -- but in general, I like to automate things when there's multiple sources of truths.

More on this "multiple sources of truths". In the images shared about we can see four sources of what the arguments are (signature and doc text in both code and md files). This means there's six different misalignment pairs. A scary prospect to me. I prefer to generate the documentation from the code's __doc__ strings, and write some "after the fact" diagnosis scripts where I can't. I don't know what the state-of-the-art is on those tools though (I have the bad habit of just doing it myself, but I wouldn't advise that!).

@thorwhalen
Copy link

Also, on a side note. I see that some functions have a lot of arguments. Perhaps we can consider putting some of these as keyword-only. Relying on positional arguments becomes risky when there starts to be a lot of arguments (it makes sense, but you can find some prose about that on the internet, if not convinced.)

@thorwhalen
Copy link

thorwhalen commented Jan 9, 2024

@arnaudmiribel, this is what I use to get the include to act more like MongoDBs projection field, which has the meaning of "only give me these fields".

Essentially, it's a transform_methods_to_keep_only_include_keys function that will wrap your collection instance so that the get and query act as I want.
This is not a good long term solution: I do believe that the behaviors discussed above would be better.
For now, I use this to depend on that (hopefully future) behavior instead of the current one.

First, the test, so you can see the behavior I'm talking about:

def test_transform_methods_to_keep_only_include_keys():
    from chromadb import Client
    c = Client().get_or_create_collection('test_inclusion_filter')
    if ids := c.get()['ids']:
        c.delete(ids)
        
    c.add(ids=['apple', 'banana'], documents=['crumble', 'split'])

    assert c.get() == {
        'ids': ['apple', 'banana'],
        'embeddings': None,
        'metadatas': [None, None],
        'documents': ['crumble', 'split'],
        'uris': None,
        'data': None
    }

    cc = transform_methods_to_keep_only_include_keys(c)

    # Now see that there's only 3 fields (the default include value of Collection.get)
    assert cc.get() == {
        'ids': ['apple', 'banana'],
        'metadatas': [None, None],
        'documents': ['crumble', 'split']
    }

    assert cc.get(include=['uris']) == {
        'ids': ['apple', 'banana'], 
        'uris': [None, None]
    }

And now the (complete) code:

from inspect import signature
from functools import wraps

def subdict(d, keys):
    return {k: v for k, v in d.items() if k in keys}

def map_arguments(func, args, kwargs):
    """
    Get a `{argname: argval, ...}` dict from the args and kwargs of a function call.
    
    >>> map_arguments(lambda x, y, z=42: None, [1], {'y': 2})
    {'x': 1, 'y': 2, 'z': 42}
    """
    b = signature(func).bind(*args, **kwargs)
    b.apply_defaults()
    return b.arguments

def argument_value(argname, func, args, kwargs):
    """
    Extract the argument value from a function call, or the default if not given.
    
    >>> func = lambda x, y, z=42: None
    >>> argument_value('z', func, [1, 2], {})  # z not given, so use default
    42
    >>> argument_value('z', func, [1, 2], {'z': 4})  # z given, so use that
    4
    """
    kwargs = map_arguments(func, args, kwargs)
    return kwargs[argname]

 
def keep_only_include_keys(method):
    """Return wrapped method that only subdicts it's outputs to only `include` keys."""

    @wraps(method)
    def _wrapped(*args, **kwargs):
        kwargs = map_arguments(method, args, kwargs)
        include = list(kwargs.get('include', []))
        keep_keys = ['ids'] + include
        return subdict(method(*args, **kwargs), keep_keys)

    return _wrapped


# TODO: The pydantic.BaseModel parent class of Collection forbids me to re-assign methods
#  so I need to wrap the whole Collection
class Delegator:
    """Delegates all attributes to the wrapped object"""

    def __init__(self, wrapped_obj):
        self.wrapped_obj = wrapped_obj

    def __getattr__(self, attr):  # delegation: just forward attributes to wrapped_obj
        return getattr(self.wrapped_obj, attr)
    
def transform_methods_to_keep_only_include_keys(
    instance, method_names=('get', 'query')
):
    """
    Wraps all methods that contain an `include` argument so they filter their output
    accordingly.
    """
    delegated = Delegator(instance)
    for method_name in dir(instance):
        if not method_name.startswith('_'):
            method = getattr(instance, method_name)
            if callable(method) and 'include' in signature(method).parameters:
                setattr(delegated, method_name, keep_only_include_keys(method))
    return delegated

@jeffchuber : Note that I originally had my method_names default be populated dynamically by all Collection methods that had an include in their arguments, but then noticed that there were other methods that used the argument name to mean something else. You may want to be aware of this name reuse in case it wasn't on purpose.

>>> sorted(_methods_containing_include_argument(__import__('chromadb').Collection))
['copy', 'dict', 'get', 'json', 'query']

@thorwhalen
Copy link

Made this issue to describe a solution which I believe is good non-disruptive step of @jeffchuber's

drop the field entirely if the user does not include it

solution.

Basically, give Collection.get and Collection.query a filter_include argument to say whether this should be done or not.

With that, the user has a choice of what the behavior should be.

Also--have the default be False so that we don't change current behavior, and switch to True when it makes sense.
When the switch is done, the user can still get the old behavior by specifying filter_include=False (could be a config).

@thorwhalen
Copy link

thorwhalen commented Jan 9, 2024

I added this to my PR from 6 days ago: https://github.com/chroma-core/chroma/pull/1607/commits

I'll update the docs too, if the PR is accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request to-discuss
Projects
None yet
6 participants