Improve message on collection.get empty embeddings #300
Comments
Happy to work on a PR should you find this a good idea. |
@arnaudmiribel hi there! Originally We could do a few things
I think improving the docs is an obvious thing to do. Thoughts? |
Thanks for reaching out! I agree that improving the docs is certainly low-hanging fruit! But I still think it is misleading, if not wrong, to show What about:
|
these are both great ideas! thanks for brainstorming this :) |
That's great, finally solved my puzzle, thanks! |
Update: added a section on the new |
@tazarov would love your thoughts on this long-outstanding issue as well |
return?
|
@jeffchuber I like the idea of always including embeddings in the response, but instead of actual embeddings, have a proxy object for lazy-loading. The proxy object is "invisible" to the end user: each Embedding object holds the underlying embedding ID so that the vector can be fetched lazily.

There are two aspects to this. Although most users don't realize it, with a persistent client it mostly makes no difference, so for SegmentAPI we can always return the embeddings. For HttpClient, things look different: if our embedding proxy is too fine-grained, e.g. fetching from the server on a per-vector basis, it can have even worse performance impacts than always returning the vectors in the original query. Therefore, I suggest we use Here is a dumbed-down version to illustrate:

from typing import Callable, List, Optional, TypedDict, Union

class Embedding(List[Union[float, int]]):
    def __init__(self, eid: str, vector: Optional[List[Union[float, int]]] = None):
        super().__init__(vector if vector else [])
        self._eid: str = eid  # this is the embedding id to be lazy loaded

    @property
    def eid(self) -> str:
        return self._eid

class Embeddings(List[Embedding]):
    def __init__(self, fetch: Callable, embeddings: List[Embedding]) -> None:
        super().__init__(embeddings if embeddings else [])
        # Placeholders (ids without vectors) still need their vectors fetched
        self._loaded = False if embeddings else True
        self._fetch = fetch

    def _ensure_loaded(self) -> None:
        if not self._loaded:
            self._load()

    def __getitem__(self, index: int) -> Embedding:
        self._ensure_loaded()
        return super().__getitem__(index)

    def __len__(self) -> int:
        self._ensure_loaded()
        return super().__len__()

    def _load(self) -> None:
        # TODO: Implement parallel/paginated fetches
        fetched = self._fetch(ids=[e.eid for e in self])
        # Replace the placeholders in place; extend() would append duplicates
        self[:] = [Embedding(eid, vector) for eid, vector in fetched]
        self._loaded = True

So the FastAPI API impl will now return:

# ID, Document and Metadata are the existing Chroma type aliases
class GetResult(TypedDict):
    ids: List[ID]
    embeddings: Optional[Embeddings]
    documents: Optional[List[Document]]
    metadatas: Optional[List[Metadata]]
|
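The concern about per-vector fetching can be illustrated with a small stand-alone sketch. Everything here is hypothetical (`LazyVectors`, `fake_fetch`, and the call counter are illustration names, not Chroma APIs); it only shows that a proxy can defer the fetch until first access and still resolve all vectors in a single batched call.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class LazyVectors:
    """Sequence-like proxy that fetches all vectors in one batch on first access."""

    def __init__(
        self, ids: Sequence[str], fetch: Callable[[List[str]], Dict[str, Vector]]
    ) -> None:
        self._ids = list(ids)
        self._fetch = fetch  # hypothetical batch-fetch callable
        self._vectors: List[Vector] = []
        self._loaded = False

    def _ensure_loaded(self) -> None:
        if not self._loaded:
            by_id = self._fetch(self._ids)  # one round trip for all ids
            self._vectors = [by_id[i] for i in self._ids]
            self._loaded = True

    def __len__(self) -> int:
        return len(self._ids)  # the length is known without fetching

    def __getitem__(self, index: int) -> Vector:
        self._ensure_loaded()
        return self._vectors[index]


calls: List[List[str]] = []  # records every (simulated) server round trip


def fake_fetch(ids: List[str]) -> Dict[str, Vector]:
    calls.append(list(ids))
    return {i: [float(len(i))] for i in ids}


vectors = LazyVectors(["a", "bb", "ccc"], fake_fetch)
n_before = len(calls)  # nothing has been fetched yet
first, last = vectors[0], vectors[2]  # triggers exactly one batched fetch
```

A per-vector proxy would instead call the fetch function once per accessed item, which is the cost the comment warns about for an HTTP client.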
To be clear, I have also looked at the other options, but I feel that returning something that is neither the embeddings nor a proxy thereof is not a great developer experience and will lead to more questions and confusion. Additionally, if we don't return embeddings at all, it will mismatch the docs. Following the Zen of Python, if it's a duck, it should quack. @HammadB wdyt? |
hmmmm. That's an interesting idea. @HammadB thoughts? |
The lazy fetch idea is good, let's prototype it and think some more.
There is some weirdness: what if the data was mutated in between? |
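To make the mutation concern concrete, here is a toy sketch. The `VersionedStore`, `LazyEmbedding`, and `StaleReadError` names are invented for illustration and the version counter is only one possible mitigation, not something proposed in this thread. It shows a lazy proxy silently returning data written after the original query, and an optional strict mode that detects it.

```python
from typing import Dict, List


class VersionedStore:
    """Toy store that bumps a version counter on every write (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[str, List[float]] = {}
        self.version = 0

    def upsert(self, eid: str, vector: List[float]) -> None:
        self.data[eid] = vector
        self.version += 1


class StaleReadError(RuntimeError):
    """Raised when the store changed between query time and load time."""


class LazyEmbedding:
    """Defers the read and remembers the store version seen at query time."""

    def __init__(self, store: VersionedStore, eid: str) -> None:
        self._store = store
        self._eid = eid
        self._seen_version = store.version

    def load(self, strict: bool = False) -> List[float]:
        if strict and self._store.version != self._seen_version:
            raise StaleReadError(f"{self._eid} may have changed since the query")
        return self._store.data[self._eid]


store = VersionedStore()
store.upsert("apple", [1.0, 0.0])
proxy = LazyEmbedding(store, "apple")  # "query time"
store.upsert("apple", [0.0, 1.0])      # mutation before the lazy load
late_value = proxy.load()              # non-strict: returns the newer vector
```

Whether a late read should return the new value, fail, or return a snapshot is a design decision the prototype would have to settle.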
Hi, struggled for a while to figure out what was happening here, until I found this issue. Regarding @jeffchuber's list of options, I'd vote for:
As always, there's pros and cons for all the solutions proposed, but it does seem to me that:
communicates what's happening better than
Here, I asked for The only not-easily-circumventable cons I can think of are if some users depend on the current "fixed-schema" output (which is currently typed by the

Another aspect that might count in our choice here: Is there an existing design (and corresponding lingo) we are following here? This "results always include an id", and several other choices

Regarding a lazy loading wrapper: I wholeheartedly cheer for that effort too, but I think it is more complex than what is immediately needed here, and is probably needed in more places than just here. Personally, I'd want to have such an object available anytime the lazy-load pattern solves my problems. |
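On the "fixed-schema" concern, one way the two output shapes could be typed is sketched below. `FixedResult` and `SparseResult` are hypothetical names, not Chroma's types; the sketch assumes Python 3.8+ and only illustrates the difference between "key always present, value may be None" and "key may be absent".

```python
from typing import List, Optional, TypedDict


class FixedResult(TypedDict):
    """Every key is always present; fields that were not requested are None."""

    ids: List[str]
    embeddings: Optional[List[List[float]]]
    documents: Optional[List[str]]


class SparseResult(TypedDict, total=False):
    """Keys may be absent; a key that is present always carries real data."""

    ids: List[str]
    embeddings: List[List[float]]
    documents: List[str]


fixed: FixedResult = {"ids": ["apple"], "embeddings": None, "documents": ["crumble"]}
sparse: SparseResult = {"ids": ["apple"], "documents": ["crumble"]}

# Callers check for data differently under the two schemas.
fixed_has_embeddings = fixed["embeddings"] is not None
sparse_has_embeddings = "embeddings" in sparse
```

Note that `total=False` also makes `ids` optional as far as a type checker is concerned, so a real design would need to mark it as required separately.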
@thorwhalen here is the docs repo :) https://github.com/chroma-core/docs thanks for thinking this through with us! |
@jeffchuber. I see where I can submit a PR now. Looking there, though, raises a bunch more questions about what your contributing rules (style, structure, content) are. Here are two images showing the So I see the google style is to be used, but here are a few immediate questions I have:
More on this "multiple sources of truth". In the images shared above we can see four sources of what the arguments are (signature and doc text in both code and md files). This means there are six different pairs that can be misaligned. A scary prospect to me. I prefer to generate the documentation from the code's |
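As a rough illustration of the single-source-of-truth idea, the sketch below renders a Markdown entry from a function's signature and docstring using only the standard library. The `get` function is a toy stand-in (not Chroma's implementation), `render_markdown` is a hypothetical helper, and it assumes that docstrings are the source the docs would be generated from.

```python
import inspect


def get(ids=None, where=None, limit=None, include=("metadatas", "documents")):
    """Get items from a toy collection.

    Args:
        ids: The ids to fetch.
        include: Which fields to return.
    """


def render_markdown(func) -> str:
    """Build a Markdown entry from the live signature and docstring of `func`."""
    sig = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    return f"### {func.__name__}\n\n`{func.__name__}{sig}`\n\n{doc}\n"


page = render_markdown(get)
print(page)
```

Because the signature is read from the code at generation time, the argument list in the rendered page cannot drift from the implementation.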
Also, on a side note: I see that some functions have a lot of arguments. Perhaps we can consider making some of these keyword-only. Relying on positional arguments becomes risky once there are many of them (it makes sense, but you can find some prose about that on the internet, if not convinced). |
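For reference, a minimal sketch of what keyword-only arguments look like in Python. The function name and parameters are made up for illustration and are not Chroma's actual signature.

```python
def query_keyword_only(texts, *, n_results=10, where=None, include=("documents",)):
    """Everything after the bare `*` must be passed by keyword."""
    return {
        "texts": texts,
        "n_results": n_results,
        "where": where,
        "include": list(include),
    }


# Keyword use is explicit and robust to reordering of parameters.
ok = query_keyword_only(["hello"], n_results=3, include=["embeddings"])

# Positional use of a keyword-only parameter fails loudly instead of
# silently binding the value to the wrong argument.
error = None
try:
    query_keyword_only(["hello"], 3)
except TypeError as exc:
    error = str(exc)
```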
@arnaudmiribel, this is what I use to get the Essentially, it's a First, the test, so you can see the behavior I'm talking about:

def test_transform_methods_to_keep_only_include_keys():
    from chromadb import Client

    c = Client().get_or_create_collection('test_inclusion_filter')
    if ids := c.get()['ids']:
        c.delete(ids)
    c.add(ids=['apple', 'banana'], documents=['crumble', 'split'])
    assert c.get() == {
        'ids': ['apple', 'banana'],
        'embeddings': None,
        'metadatas': [None, None],
        'documents': ['crumble', 'split'],
        'uris': None,
        'data': None
    }
    cc = transform_methods_to_keep_only_include_keys(c)
    # Now see that there's only 3 fields (the default include value of Collection.get)
    assert cc.get() == {
        'ids': ['apple', 'banana'],
        'metadatas': [None, None],
        'documents': ['crumble', 'split']
    }
    assert cc.get(include=['uris']) == {
        'ids': ['apple', 'banana'],
        'uris': [None, None]
    }

And now the (complete) code:

from inspect import signature
from functools import wraps

def subdict(d, keys):
    return {k: v for k, v in d.items() if k in keys}

def map_arguments(func, args, kwargs):
    """
    Get a `{argname: argval, ...}` dict from the args and kwargs of a function call.
    >>> map_arguments(lambda x, y, z=42: None, [1], {'y': 2})
    {'x': 1, 'y': 2, 'z': 42}
    """
    b = signature(func).bind(*args, **kwargs)
    b.apply_defaults()
    return dict(b.arguments)  # plain dict, also on Python versions that return OrderedDict

def argument_value(argname, func, args, kwargs):
    """
    Extract the argument value from a function call, or the default if not given.
    >>> func = lambda x, y, z=42: None
    >>> argument_value('z', func, [1, 2], {})  # z not given, so use default
    42
    >>> argument_value('z', func, [1, 2], {'z': 4})  # z given, so use that
    4
    """
    kwargs = map_arguments(func, args, kwargs)
    return kwargs[argname]

def keep_only_include_keys(method):
    """Return wrapped method that subdicts its output to only the `include` keys."""

    @wraps(method)
    def _wrapped(*args, **kwargs):
        # Resolve `include` (or its default) without changing how `method` is called;
        # passing both *args and the fully mapped kwargs would duplicate arguments.
        include = list(argument_value('include', method, args, kwargs) or [])
        keep_keys = ['ids'] + include
        return subdict(method(*args, **kwargs), keep_keys)

    return _wrapped

# TODO: The pydantic.BaseModel parent class of Collection forbids me to re-assign methods
# so I need to wrap the whole Collection
class Delegator:
    """Delegates all attributes to the wrapped object"""

    def __init__(self, wrapped_obj):
        self.wrapped_obj = wrapped_obj

    def __getattr__(self, attr):  # delegation: just forward attributes to wrapped_obj
        return getattr(self.wrapped_obj, attr)

def transform_methods_to_keep_only_include_keys(
    instance, method_names=('get', 'query')
):
    """
    Wraps the named methods that take an `include` argument so they filter their
    output accordingly.
    """
    delegated = Delegator(instance)
    # Only wrap the listed methods: other methods may also have an `include` argument
    for method_name in method_names:
        method = getattr(instance, method_name, None)
        if callable(method) and 'include' in signature(method).parameters:
            setattr(delegated, method_name, keep_only_include_keys(method))
    return delegated

@jeffchuber: Note that I originally had my

>>> sorted(_methods_containing_include_argument(__import__('chromadb').Collection))
['copy', 'dict', 'get', 'json', 'query']
|
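The `_methods_containing_include_argument` helper is not shown in the thread. A guess at what such a helper might look like is sketched below on a toy class, so it runs without Chroma; the helper name, its body, and `ToyCollection` are assumptions for illustration only.

```python
import inspect


def methods_containing_argument(cls, argname="include"):
    """Yield names of public callables of `cls` whose signature has `argname`."""
    for name, attr in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue
        try:
            params = inspect.signature(attr).parameters
        except (TypeError, ValueError):
            continue  # some builtins expose no introspectable signature
        if argname in params:
            yield name


class ToyCollection:
    """Toy stand-in: `dict` mimics a base-class method that also takes `include`."""

    def get(self, ids=None, include=("documents",)): ...

    def query(self, texts, include=("documents",)): ...

    def dict(self, include=None, exclude=None): ...

    def count(self): ...


found = sorted(methods_containing_argument(ToyCollection))
print(found)  # ['dict', 'get', 'query']
```

As in the output quoted above, inherited or unrelated methods that happen to accept `include` are picked up too, which is why an explicit list of method names is the safer filter.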
Made this issue to describe a solution which I believe is a good, non-disruptive step toward @jeffchuber's "drop the field entirely if the user does not include it" solution. Basically, give With that, the user has a choice of what the behavior should be. Also, have the default be |
I added this to my PR from 6 days ago: https://github.com/chroma-core/chroma/pull/1607/commits I'll update the docs too, if the PR is accepted. |
Problem
I have successfully defined my embedding function, and added documents.
And yet when calling Collection.get(), the default behaviour is to show None for embeddings in the output:
Seeing None is rather counterintuitive*, as it gives the impression that embeddings were not computed. They are! But you need to explicitly pass include=["embeddings"] to discover that:
*Saw a couple of persons having the same issue on Discord
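The behaviour described above can be mimicked without Chroma. The sketch below uses a hypothetical in-memory stand-in (`_STORE` and `fake_get` are illustration names, and the default `include` value is an assumption) to show why `None` reads as "not computed" when it really means "not requested".

```python
from typing import Any, Dict, Sequence

# Hypothetical in-memory stand-in for a collection; not Chroma's implementation.
_STORE: Dict[str, list] = {
    "ids": ["apple", "banana"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "documents": ["crumble", "split"],
    "metadatas": [None, None],
}


def fake_get(include: Sequence[str] = ("metadatas", "documents")) -> Dict[str, Any]:
    """Mimic the reported behaviour: fields that were not requested come back as None."""
    result: Dict[str, Any] = {"ids": list(_STORE["ids"])}
    for field in ("embeddings", "documents", "metadatas"):
        result[field] = list(_STORE[field]) if field in include else None
    return result


default = fake_get()
explicit = fake_get(include=["embeddings"])
print(default["embeddings"])   # None, although embeddings are stored
print(explicit["embeddings"])  # [[0.1, 0.2], [0.3, 0.4]]
```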
Proposals (additive)
Make Collection.get only show "id" + keys passed through include. Avoid the counterintuitive None!
Make "embeddings" part of the default values for include, with a pretty printer to avoid cluttering the output.
New behaviours:
Settings
Chroma version: 0.3.21
Python version: 3.8.12