Filter include #1626

thorwhalen · 2024-01-12T11:31:00Z

Description of changes

Resolves Feature Request: A filter_include argument for Collection.get and Collection.query #1622, which itself is related to #300.

Add a filter_include argument to Collection.get and Collection.query. When True, the result contains only the fields that were requested (plus "ids").

I propose to set the default of filter_include to be False for now, controlled via a DFLT_FILTER_INCLUDE variable.
This is so that the current behavior doesn't change, so we have minimal disruption.
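
A minimal sketch of the intended effect, using a plain dictionary in place of a real Collection.get result. The keep_keys line and the name subdict come from the diff in this PR; the body of subdict and the apply_filter_include wrapper are assumptions made for illustration only.

```python
from typing import Any, Dict, Iterable

DFLT_FILTER_INCLUDE = False  # default named in the PR description


def subdict(d: Dict[str, Any], keys: Iterable[str]) -> Dict[str, Any]:
    """Return a new dict with only the given keys that exist in d (assumed behavior)."""
    return {k: d[k] for k in keys if k in d}


def apply_filter_include(results, include, filter_include=DFLT_FILTER_INCLUDE):
    """Hypothetical stand-in for the last step of Collection.get / Collection.query."""
    if not filter_include:
        return results  # current behavior: every field is present, unrequested ones are None
    keep_keys = ["ids"] + [key for key in include if key != "ids"]
    return subdict(results, keep_keys)


# Stand-in for an unfiltered result: unrequested fields are present but None.
raw = {
    "ids": ["a", "b"],
    "documents": ["doc a", "doc b"],
    "metadatas": None,
    "embeddings": None,
    "uris": None,
}

unfiltered = apply_filter_include(raw, include=["documents"])
filtered = apply_filter_include(raw, include=["documents"], filter_include=True)
```

With the default left at False, unfiltered is the same dictionary as before, so existing callers see no change; only callers that opt in get the two-key result.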

Test plan

Included tests in chromadb/test/property/test_collections.py.

Documentation Changes

The documentation will need to mention the new filter_include argument, but I will make those changes if and when this PR is accepted.
One reason: the Collection.get and Collection.query documentation is already missing some arguments, so I will add those at the same time.

github-actions · 2024-01-12T11:31:12Z

tazarov

@thorwhalen, thanks for splitting the PRs and thinking of this ergonomic aspect of the API. Even then, I'm not too sure about this change. Seemingly innocuous, it has the potential to invalidate assumptions made by downstream projects. Also, the need to add filter_include as an extra keyword on top of the already existing include seems counter-intuitive - it seems like saying something and then saying, "What I mean to say is..."

@HammadB, wdyt?

tazarov · 2024-01-18T06:23:34Z

chromadb/api/models/Collection.py

+
+        if filter_include:
+            keep_keys = ['ids'] + [key for key in include if key != 'ids']
+            return subdict(get_results, keep_keys)


@thorwhalen, do you find this might break existing downstream projects that may depend on attribute None checks?

Also it feels awkward that we return something else than GetResult, in this case, a subset dictionary.
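
To make the "None checks" concern concrete, here is a hypothetical sketch of downstream code written against the current behavior, where an unrequested field is present with the value None. The function names and the sample dictionaries are invented for illustration and are not taken from any real downstream project.

```python
def count_documents(result):
    """Hypothetical downstream helper that relies on the key always being present."""
    if result["documents"] is None:  # raises KeyError if the key is absent
        return 0
    return len(result["documents"])


def count_documents_tolerant(result):
    """Variant that works with both result shapes by using dict.get."""
    docs = result.get("documents")
    return 0 if docs is None else len(docs)


current_style = {"ids": ["a"], "documents": None}  # unrequested field set to None
filtered_style = {"ids": ["a"]}                    # unrequested field removed

count_current = count_documents(current_style)

try:
    count_documents(filtered_style)
    raised_key_error = False
except KeyError:
    raised_key_error = True
```

Code in the first style keeps working only as long as the key is guaranteed to exist, which is the assumption a filtered result would break.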

thorwhalen · 2024-01-19

I'm not sure I understand the part about how this might "break existing downstream projects that may depend on attribute None checks".

As for the GetResult TypedDict: in case you weren't aware, since Python 3.11 individual keys can be marked as optional with the NotRequired type qualifier.

See PEP 655 for details.
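
As a rough sketch of what this could look like, the TypedDict below marks every field except ids as NotRequired. The class name FilteredGetResult and the field types are assumptions for illustration; they are not the actual GetResult definition in chromadb.

```python
from typing import List, Optional

try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:
    # Backport for older interpreters (requires the typing_extensions package).
    from typing_extensions import NotRequired, TypedDict


class FilteredGetResult(TypedDict):
    """Hypothetical result type: only ids is guaranteed to be present."""

    ids: List[str]
    documents: NotRequired[Optional[List[str]]]
    metadatas: NotRequired[Optional[List[dict]]]


# Both shapes satisfy the same type for a static type checker.
full: FilteredGetResult = {"ids": ["a"], "documents": ["doc a"], "metadatas": [{}]}
lean: FilteredGetResult = {"ids": ["a"]}
```

TypedDict is not enforced at runtime, so this only affects static type checking; callers would still need to handle missing keys themselves.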

tazarov · 2024-01-18T06:24:43Z

chromadb/api/models/Collection.py

@@ -364,6 +388,10 @@ def query(
        if "uris" not in include:
            query_results["uris"] = None

+        if filter_include:
+            keep_keys = ['ids'] + [key for key in include if key != 'ids']
+            return subdict(query_results, keep_keys)


Same issues as above: downstream impacts and non-typed returns.

thorwhalen · 2024-01-19T09:23:23Z

> @thorwhalen, thanks for splitting the PRs and thinking of this ergonomic aspect of the API. Even then, I'm not too sure about this change. Seemingly innocuous, it has the potential to invalidate assumptions made by downstream projects. Also, the need to add filter_include as an extra keyword on top of the already existing include seems counter-intuitive - it seems like saying something and then saying, "What I mean to say is..."
>
> @HammadB, wdyt?

Quite frankly, I think the solution proposed earlier in the issue, returning only the keys that are listed in include (plus ids), is the better one.

The PR I'm proposing here is just to see what a non-breaking solution would be.
Perhaps an even "better" one would be to have an "external" class decorator that changes the behavior of the get and query functions instead, so that users only have to pull in that functionality if they want to.

Still, if it were my choice, I'd go for changing the semantics of include as mentioned above.
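
For reference, the "external" class decorator idea could look roughly like the sketch below. The decorator name only_included and the FakeCollection stand-in are hypothetical; nothing here is part of chromadb, and a real collection's get and query take more arguments than shown.

```python
import functools


def only_included(cls):
    """Hypothetical class decorator: make get and query return only "ids"
    plus the fields named in include."""

    def wrap(method):
        @functools.wraps(method)
        def wrapper(self, *args, include=None, **kwargs):
            if include is None:
                # No explicit include: leave the method's default behavior alone.
                return method(self, *args, **kwargs)
            results = method(self, *args, include=include, **kwargs)
            keep = ["ids"] + [k for k in include if k != "ids"]
            return {k: results[k] for k in keep if k in results}

        return wrapper

    for name in ("get", "query"):
        setattr(cls, name, wrap(getattr(cls, name)))
    return cls


class FakeCollection:
    """Stand-in for a real collection: unrequested fields come back as None."""

    _data = {"ids": ["a"], "documents": ["doc a"], "metadatas": [{"k": 1}]}

    def _fetch(self, include):
        return {k: (v if k == "ids" or k in include else None)
                for k, v in self._data.items()}

    def get(self, include=("metadatas", "documents")):
        return self._fetch(include)

    def query(self, include=("metadatas", "documents")):
        return self._fetch(include)


@only_included
class LeanCollection(FakeCollection):
    """Opt-in variant; FakeCollection itself is left untouched."""


plain = FakeCollection().get(include=["documents"])
lean = LeanCollection().get(include=["documents"])
```

Because the wrapping is applied to a subclass, users who never import the decorator keep the existing behavior, which is the point of making it external.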

Downstream project

I'm not sure I understand the downstream concern specifically, but understand all too well the concern generally. It's always a hard choice, and the only thing we can be certain of is any change will invalidate someone's assumptions.

Yet the solution of changing the semantics of include (my favorite solution) seems to be more of a breaking change than the one I'm proposing, no?

Non-typed returns

The return value can still be typed, by using NotRequired with the TypedDict.

Awkward `filter_include`

Agreed. I thought about it, and it's the least awkward name I could come up with.
Any other ideas?

I asked ChatGPT just now, and this is what it gave me:

exclude_unrequested: This name clearly indicates that any fields not requested are excluded from the response.
trim_response: Suggests the method will trim or reduce the response to only include requested fields.
lean_output: Implies the output will be lean or minimal, containing only the requested keys.
compact_result: Indicates that the result will be compact, containing only essential or requested data.
include_only_specified: Directly states that only the specified fields in the include argument will be present in the result.
omit_unspecified: This name suggests that unspecified fields will be omitted from the response.
sparse_fields: Indicates that the fields in the response will be sparse, limited to those specified.
restrict_to_included: This name denotes that the response will be restricted to only the included fields.
minimal_keys: Suggests that the response will contain a minimal set of keys, aligning with the user's request.
prune_extras: Implies that extra or unrequested fields will be pruned from the response.

tazarov · 2024-01-19T11:52:22Z

@thorwhalen, let me give it a thought.

thorwhalen added 2 commits January 11, 2024 15:06

feat: add filter_include arg to Collection get and query methods

32546f1

Merge branch 'chroma-core:main' into filter_include

36ff95a

This was referenced Jan 12, 2024

feat: FileLoader and vectorize. Closes #1606 #1607

Closed

[Feature Request]: A filter_include argument for Collection.get and Collection.query #1622

Open

tazarov reviewed Jan 18, 2024

jeffchuber added the community label Sep 16, 2024
