Improve message on `collection.get` empty embeddings #300

arnaudmiribel · 2023-04-07T13:02:34Z

Problem

I have successfully defined my embedding function, and added documents.

And yet when calling Collection.get(), the default behaviour is to show None for embeddings in the output:

>>> collection.get("doc1")
{'ids': [['doc1']],
 'embeddings': None,  # <---- Why?
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

Seeing None is rather counter intuitive*, as it gives the impression that embeddings were not computed. They are! But you need to explicitly pass include=["embeddings"] to discover that:

>>> collection.get("doc1", include=["embeddings", "metadatas"])
{'ids': [['doc1']],
 'embeddings': [[0.2352, 2324.... ]], 
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

*Saw a couple of persons having the same issue on Discord

Proposals (additive)

Have Collection.get only show "id" + keys passed through include. Avoid the counter intuitive None!
Have "embeddings" part of the default values for include with a pretty printer, to avoid cluttering the output.

New behaviours:

>>> collection.get("doc1")
{'ids': [['doc1']],
 'embeddings': [[0.2352, 2324.... ]], 
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

>>> collection.get("doc1", include=["documents", "metadatas"])
{'ids': [['doc1']],
 'documents': [['Louis XIV lived for tens of years']],
 'metadatas': [[{'source': 'my_source'}]],
}

Settings

Chroma version: 0.3.21
Python version: 3.8.12

The text was updated successfully, but these errors were encountered:

arnaudmiribel · 2023-04-07T13:05:05Z

Happy to work on a PR should you find this a good idea.

jeffchuber · 2023-04-07T16:16:04Z

@arnaudmiribel hi there!

Originally .get and .query did return the embedding itself. However this, as a default, led to very poor performance, especially over the current REST API. The rest api was not chosen for its performance (we will likely switch to ArrowFlight or other down the road), but because it's ease of integration. The serialization of the large arrays led to a huge amount of latency. We prioritized developers have a good getting started experience with chroma being fast to return a query, as most developers don't use the returned embeddings. This is strictly necessary for .query, and we implemented it into .get to be consistent.

We could do a few things

improve the docs
append a notes object in the return that specifies "to get embeddings, please do X" (though this feels messy)
make .get return embeddings by default, even if .query doesnt

I think improving the docs is an obvious thing to do.

Thoughts?

arnaudmiribel · 2023-04-07T19:18:06Z

Thanks for reaching out!

I agree that improving the docs is certainly a low hanging fruit! But I still think it is misleading if not wrong to show "embeddings": None, when embeddings were actually computed and not included in the include= parameter.

What about:

(Straightforward) Not show anything about "embeddings" if "embeddings" is not in the include= keyword.
Have the API return a prettyprinted / lazy version of the embeddings? Not the actual embeddings. Something like "embeddings": Embeddings(shape=[x, y]),. So you can still surface that embeddings were computed, but without compromise on latency.

jeffchuber · 2023-04-07T19:23:19Z

these are both great ideas! thanks for brainstorming this :)

Vikho · 2023-04-11T08:21:28Z

That's great, finally solved my puzzle, thanks!

jeffchuber · 2023-06-23T17:03:43Z

Update: added a section on the new troubleshooting page about this - https://docs.trychroma.com/troubleshooting

jeffchuber · 2023-07-04T17:00:29Z

Options

drop the field entirely if the user does not include it
include the field, but just include some high level metadata about it, eg Embeddings(shape=[x, y])
include the field as None, but also include another key-value something like included_fields: ['distances', 'ids']
include the field with literally a note about how to include it embeddings: "add include=['embeddings'] get to them back too"
....

@atroyn @HammadB i added a few options here

jeffchuber · 2023-09-05T22:57:31Z

@tazarov would love your thoughts on this long-outstanding issue as well

jeffchuber · 2023-09-13T21:53:52Z

return?

"embeddings": "how to include object"

tazarov · 2023-09-14T12:00:53Z

@jeffchuber I like the idea of always including embeddings in the response, but instead of actual embeddings, have a proxy object for lazy-loading. The proxy object is "invisible" to the end user, where each Embedding object will hold the underlying embedding ID to lazy fetch this.

There are two aspects to this; although most users don't realize it when operating with a persistent client it mostly doesn't make a difference, so for SegmentAPI we can always return the embeddings. For HttpClient, things look different, if our embedding proxy is too fine-grained e.g. fetching from the server on a per-vector basis, which can have even worse performance impacts than always returning the vectors in the original query. Therefore, I suggest we use Embeddings to do our proxying so that whenever access to any vector in the set is requested, we fetch the full batch.

Here is a dumbed-down version to illustrate:

class Embedding(List[Union[float, int]]):
    def __init__(self, eid: str, vector: List[Union[float, int]] = None):
        super().__init__(vector if vector else [])
        self._eid:str = eid #this is the embedding id to be lazy loaded

    @property
    def eid(self) -> str:
        return self._eid


class Embeddings(List[Embedding]):
    def __init__(self, fetch: Callable, embeddings: List[Embedding]) -> None:
        super().__init__(embeddings if embeddings else [])
        self._loaded = False if embeddings else True
        self._fetch = fetch

    def _ensure_loaded(self) -> None:
        if not self._loaded:
            self._load()

    def __getitem__(self, index: int) -> Embedding:
        self._ensure_loaded()
        return super().__getitem__(index)

    def __len__(self) -> int:
        self._ensure_loaded()
        return super().__len__()

    def _load(self) -> None:
        #TODO: Implement parallel/paginated fetches
        fetched = self._fetch(ids=[e.eid for e in self])
        self.extend([Embedding(eid, vector) for eid, vector in fetched])
        self._loaded = True

So FastAPI API impl will now return GetResult where:

class GetResult(TypedDict):
    ids: List[ID]
    embeddings: Optional[Embeddings]
    documents: Optional[List[Document]]
    metadatas: Optional[List[Metadata]]

tazarov · 2023-09-14T12:08:13Z

To point out that that I have also looked at the other options, but I feel that returning something that is not the embeddings or proxy thereof is not a great developer experience and will lead to more questions/confusion.

Additionally, if we don't return embeddings at all, it'll mismatch the docs.

Following the Python Zen, if it's a duck, it should quack.

@HammadB wdyt?

jeffchuber · 2023-09-15T16:32:54Z

hmmmm. That's an interesting idea. @HammadB thoughts?

HammadB · 2023-12-08T18:37:46Z

The lazy fetch idea is good, lets prototype it/think some more

If you include, it should get loaded
If you don't include it, it should get a lazy handle it
getitem or similar on the lazy handle should fetch it based on the cached query

There is some wierdness, what if the data was mutated in between?

thorwhalen · 2024-01-04T21:06:13Z

Hi, struggled for a while to figure out what was happening here, until I found this issue.
Indeed, we should at least update the docs temporarily, so as to make it clear what is happening here (I'll do it myself if you want -- where can we edit and PR the docs?)

Regarding @jeffchuber's list of options, I'd vote for:

drop the field entirely if the user does not include it

As always, there's pros and cons for all the solutions proposed, but it does seem to me that:

{
  'ids': ['foo', 'apple'],
  'metadatas': [None, None],
  'documents': ['bar', 'crumble'],
}

communicates what's happening better than

{
  'ids': ['foo', 'apple'],
  'embeddings': None,
  'metadatas': [None, None],
  'documents': ['bar', 'crumble'],
  'uris': None,
  'data': None
}

Here, I asked for metadatas (of which there were none) and documents.

The only not-easily-circumventable cons I can think of are if some users depend on the current "fixed-schema" output (which is currently typed by the chromadb.api.types.GetResult TypedDict, in fact).
Hopefully, I would't be instigating a debate on the pros and cons of schemas/types here, but one could even argue here that setting such a schema will make it harder to evolve what the DB responses are in the future.

Another aspect that might count in our choice here: Is there an existing design (and corresponding lingo) we are following here? This "results always include an id", and several other choices chromadb made are reminiscent of mongodb to me. Mongodb's version of include would then be projection, and has made the choice that "only the fields requested are returned" (+ id, unless explicitly requested NOT TO).

Regarding a lazy loading wrapper: I whole heartedly cheer for that effort to, but I think it is more complex than what is immediately needed here, and is probably needed in more places than just here: Personally, I'd want to have such an object available anytime the lazy-load pattern solutions my problems.

jeffchuber · 2024-01-05T06:42:38Z

@thorwhalen here is the docs repo :) https://github.com/chroma-core/docs

thanks for thinking this through with us!

thorwhalen · 2024-01-05T10:13:08Z

@jeffchuber. I see where I can submit a PR now.

Looking there, though, raises a bunch more questions about what your contributing rules (style, structure, content) is.
The CONTRIBUTING.md is essentially empty. Do you hold that information elsewhere?

Here are two images showing the Collections.add method's signature and docs:

So I see the google style is to be used, but here are a few immediate questions I have:

I see only a short description, plus args, returns, and raises sections. A quick scan tells me that most of your method follow a similar structure. Where therefore should I put extra information, such as some clarification of the output's content (our current subject) or doctests (do you even do those?).
I see that docs and function contents are misaligned. The docs don't show the images and the uris argument, which the function has, and documents. Sometimes I do this on purpose, because I don't want to document a feature before giving it time to mature -- but in general, I like to automate things when there's multiple sources of truths.

More on this "multiple sources of truths". In the images shared about we can see four sources of what the arguments are (signature and doc text in both code and md files). This means there's six different misalignment pairs. A scary prospect to me. I prefer to generate the documentation from the code's __doc__ strings, and write some "after the fact" diagnosis scripts where I can't. I don't know what the state-of-the-art is on those tools though (I have the bad habit of just doing it myself, but I wouldn't advise that!).

thorwhalen · 2024-01-05T10:15:04Z

Also, on a side note. I see that some functions have a lot of arguments. Perhaps we can consider putting some of these as keyword-only. Relying on positional arguments becomes risky when there starts to be a lot of arguments (it makes sense, but you can find some prose about that on the internet, if not convinced.)

thorwhalen · 2024-01-09T12:17:58Z

@arnaudmiribel, this is what I use to get the include to act more like MongoDBs projection field, which has the meaning of "only give me these fields".

Essentially, it's a transform_methods_to_keep_only_include_keys function that will wrap your collection instance so that the get and query act as I want.
This is not a good long term solution: I do believe that the behaviors discussed above would be better.
For now, I use this to depend on that (hopefully future) behavior instead of the current one.

First, the test, so you can see the behavior I'm talking about:

def test_transform_methods_to_keep_only_include_keys():
    from chromadb import Client
    c = Client().get_or_create_collection('test_inclusion_filter')
    if ids := c.get()['ids']:
        c.delete(ids)
        
    c.add(ids=['apple', 'banana'], documents=['crumble', 'split'])

    assert c.get() == {
        'ids': ['apple', 'banana'],
        'embeddings': None,
        'metadatas': [None, None],
        'documents': ['crumble', 'split'],
        'uris': None,
        'data': None
    }

    cc = transform_methods_to_keep_only_include_keys(c)

    # Now see that there's only 3 fields (the default include value of Collection.get)
    assert cc.get() == {
        'ids': ['apple', 'banana'],
        'metadatas': [None, None],
        'documents': ['crumble', 'split']
    }

    assert cc.get(include=['uris']) == {
        'ids': ['apple', 'banana'], 
        'uris': [None, None]
    }

And now the (complete) code:

from inspect import signature
from functools import wraps

def subdict(d, keys):
    return {k: v for k, v in d.items() if k in keys}

def map_arguments(func, args, kwargs):
    """
    Get a `{argname: argval, ...}` dict from the args and kwargs of a function call.
    
    >>> map_arguments(lambda x, y, z=42: None, [1], {'y': 2})
    {'x': 1, 'y': 2, 'z': 42}
    """
    b = signature(func).bind(*args, **kwargs)
    b.apply_defaults()
    return b.arguments

def argument_value(argname, func, args, kwargs):
    """
    Extract the argument value from a function call, or the default if not given.
    
    >>> func = lambda x, y, z=42: None
    >>> argument_value('z', func, [1, 2], {})  # z not given, so use default
    42
    >>> argument_value('z', func, [1, 2], {'z': 4})  # z given, so use that
    4
    """
    kwargs = map_arguments(func, args, kwargs)
    return kwargs[argname]

 
def keep_only_include_keys(method):
    """Return wrapped method that only subdicts it's outputs to only `include` keys."""

    @wraps(method)
    def _wrapped(*args, **kwargs):
        kwargs = map_arguments(method, args, kwargs)
        include = list(kwargs.get('include', []))
        keep_keys = ['ids'] + include
        return subdict(method(*args, **kwargs), keep_keys)

    return _wrapped


# TODO: The pydantic.BaseModel parent class of Collection forbids me to re-assign methods
#  so I need to wrap the whole Collection
class Delegator:
    """Delegates all attributes to the wrapped object"""

    def __init__(self, wrapped_obj):
        self.wrapped_obj = wrapped_obj

    def __getattr__(self, attr):  # delegation: just forward attributes to wrapped_obj
        return getattr(self.wrapped_obj, attr)
    
def transform_methods_to_keep_only_include_keys(
    instance, method_names=('get', 'query')
):
    """
    Wraps all methods that contain an `include` argument so they filter their output
    accordingly.
    """
    delegated = Delegator(instance)
    for method_name in dir(instance):
        if not method_name.startswith('_'):
            method = getattr(instance, method_name)
            if callable(method) and 'include' in signature(method).parameters:
                setattr(delegated, method_name, keep_only_include_keys(method))
    return delegated

@jeffchuber : Note that I originally had my method_names default be populated dynamically by all Collection methods that had an include in their arguments, but then noticed that there were other methods that used the argument name to mean something else. You may want to be aware of this name reuse in case it wasn't on purpose.

>>> sorted(_methods_containing_include_argument(__import__('chromadb').Collection))
['copy', 'dict', 'get', 'json', 'query']

thorwhalen · 2024-01-09T16:17:57Z

Made this issue to describe a solution which I believe is good non-disruptive step of @jeffchuber's

drop the field entirely if the user does not include it

solution.

Basically, give Collection.get and Collection.query a filter_include argument to say whether this should be done or not.

With that, the user has a choice of what the behavior should be.

Also--have the default be False so that we don't change current behavior, and switch to True when it makes sense.
When the switch is done, the user can still get the old behavior by specifying filter_include=False (could be a config).

thorwhalen · 2024-01-09T16:27:02Z

I added this to my PR from 6 days ago: https://github.com/chroma-core/chroma/pull/1607/commits

I'll update the docs too, if the PR is accepted.

jeffchuber mentioned this issue Apr 10, 2023

local_persistence.ipynb in /examples is giving embeddings as none #313

Closed

McDo mentioned this issue Apr 11, 2023

Collection.add doesn't embed the document automatically in client-server mode #325

Closed

jeffchuber changed the title ~~Collection.get displays None for embeddings even if embeddings exist~~ Improve message on collection.get empty embeddings Aug 29, 2023

jeffchuber added the enhancement New feature or request label Aug 29, 2023

jeffchuber added the to-discuss label Sep 5, 2023

jeffchuber assigned HammadB Sep 13, 2023

mrtunguyen mentioned this issue Sep 26, 2023

chroma vector store doesn't return similar questions (even the same question) Dataherald/dataherald#187

Closed

HammadB assigned tazarov and unassigned HammadB Dec 8, 2023

HammadB added the taz-sprint-2 label Dec 8, 2023

thorwhalen mentioned this issue Jan 9, 2024

[Feature Request]: A filter_include argument for Collection.get and Collection.query #1622

Open

thorwhalen mentioned this issue Jan 12, 2024

Filter include #1626

Open

HammadB added taz-sprint-4 and removed taz-sprint-2 labels Jan 15, 2024

HammadB removed the taz-sprint-4 label Apr 17, 2024

codetheweb mentioned this issue Apr 19, 2024

[ENH]: clarify non-included fields in .get() and .query() responses #2028

Closed

1 task

codetheweb linked a pull request Apr 23, 2024 that will close this issue

[ENH] add included to .get() & .query() response #2044

Merged

1 task

codetheweb closed this as completed in #2044 Apr 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve message on `collection.get` empty embeddings #300

Improve message on `collection.get` empty embeddings #300

arnaudmiribel commented Apr 7, 2023 •

edited

Loading

arnaudmiribel commented Apr 7, 2023

jeffchuber commented Apr 7, 2023

arnaudmiribel commented Apr 7, 2023 •

edited

Loading

jeffchuber commented Apr 7, 2023

Vikho commented Apr 11, 2023

jeffchuber commented Jun 23, 2023

jeffchuber commented Jul 4, 2023 •

edited

Loading

jeffchuber commented Sep 5, 2023

jeffchuber commented Sep 13, 2023

tazarov commented Sep 14, 2023

tazarov commented Sep 14, 2023 •

edited

Loading

jeffchuber commented Sep 15, 2023

HammadB commented Dec 8, 2023 •

edited

Loading

thorwhalen commented Jan 4, 2024 •

edited

Loading

jeffchuber commented Jan 5, 2024

thorwhalen commented Jan 5, 2024

thorwhalen commented Jan 5, 2024

thorwhalen commented Jan 9, 2024 •

edited

Loading

thorwhalen commented Jan 9, 2024

thorwhalen commented Jan 9, 2024 •

edited

Loading

Improve message on collection.get empty embeddings #300

Improve message on collection.get empty embeddings #300

Comments

arnaudmiribel commented Apr 7, 2023 • edited Loading

Problem

Proposals (additive)

Settings

arnaudmiribel commented Apr 7, 2023

jeffchuber commented Apr 7, 2023

arnaudmiribel commented Apr 7, 2023 • edited Loading

jeffchuber commented Apr 7, 2023

Vikho commented Apr 11, 2023

jeffchuber commented Jun 23, 2023

jeffchuber commented Jul 4, 2023 • edited Loading

jeffchuber commented Sep 5, 2023

jeffchuber commented Sep 13, 2023

tazarov commented Sep 14, 2023

tazarov commented Sep 14, 2023 • edited Loading

jeffchuber commented Sep 15, 2023

HammadB commented Dec 8, 2023 • edited Loading

thorwhalen commented Jan 4, 2024 • edited Loading

jeffchuber commented Jan 5, 2024

thorwhalen commented Jan 5, 2024

thorwhalen commented Jan 5, 2024

thorwhalen commented Jan 9, 2024 • edited Loading

thorwhalen commented Jan 9, 2024

thorwhalen commented Jan 9, 2024 • edited Loading

Improve message on `collection.get` empty embeddings #300

Improve message on `collection.get` empty embeddings #300

arnaudmiribel commented Apr 7, 2023 •

edited

Loading

arnaudmiribel commented Apr 7, 2023 •

edited

Loading

jeffchuber commented Jul 4, 2023 •

edited

Loading

tazarov commented Sep 14, 2023 •

edited

Loading

HammadB commented Dec 8, 2023 •

edited

Loading

thorwhalen commented Jan 4, 2024 •

edited

Loading

thorwhalen commented Jan 9, 2024 •

edited

Loading

thorwhalen commented Jan 9, 2024 •

edited

Loading