Filter include #1626

Open: wants to merge 2 commits into main

thorwhalen
Description of changes

Resolves Feature Request: A filter_include argument for Collection.get and Collection.query #1622, which itself is related to #300.

Add a filter_include argument to Collection.get and Collection.query. When True, the result contains only the fields that were requested in include (plus "ids"); fields that were not requested are left out instead of being returned as None.

I propose that filter_include default to False for now, controlled via a DFLT_FILTER_INCLUDE variable.
That way the current behavior doesn't change and disruption is minimal.
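
To make the proposed behavior concrete, here is a minimal sketch of the filtering step on a plain dictionary. The key-selection line mirrors the diff in this PR; the body of subdict and the apply_filter_include wrapper are assumptions made for illustration, not the code in the PR or in chromadb.

```python
from typing import Any, Dict, Iterable, List

DFLT_FILTER_INCLUDE = False  # default named in the description above


def subdict(d: Dict[str, Any], keys: Iterable[str]) -> Dict[str, Any]:
    """Assumed helper: return a new dict holding only the listed keys found in d."""
    return {k: d[k] for k in keys if k in d}


def apply_filter_include(
    results: Dict[str, Any],
    include: List[str],
    filter_include: bool = DFLT_FILTER_INCLUDE,
) -> Dict[str, Any]:
    """Hypothetical stand-in for the last step of Collection.get / Collection.query."""
    if not filter_include:
        return results  # current behavior: every key present, unrequested ones are None
    keep_keys = ["ids"] + [key for key in include if key != "ids"]
    return subdict(results, keep_keys)


# A result shaped like today's output when only "documents" was requested.
raw = {
    "ids": ["a", "b"],
    "embeddings": None,
    "metadatas": None,
    "documents": ["x", "y"],
    "uris": None,
}

unfiltered = apply_filter_include(raw, include=["documents"])
filtered = apply_filter_include(raw, include=["documents"], filter_include=True)
```

With the default, the caller gets the dictionary back unchanged; with filter_include=True, only "ids" and "documents" remain.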

Test plan

Included tests in chromadb/test/property/test_collections.py.

Documentation Changes

The documentation needs to mention the new filter_include argument. I will make that change if and when this PR is accepted.
One reason for waiting: the Collection.get and Collection.query documentation is already missing some arguments, so I will add those at the same time.

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Contributor

@tazarov tazarov left a comment

@thorwhalen, thanks for splitting the PRs and thinking of this ergonomic aspect of the API. Even then, I'm not too sure about this change. Seemingly innocuous, it has the potential to invalidate assumptions made by downstream projects. Also, the need to add filter_include as an extra keyword on top of the already existing include seems counter-intuitive - it seems like saying something and then saying, "What I mean to say is..."

@HammadB, wdyt?


if filter_include:
    keep_keys = ['ids'] + [key for key in include if key != 'ids']
    return subdict(get_results, keep_keys)
Contributor

@thorwhalen, do you think this might break existing downstream projects that depend on None checks on these attributes?

It also feels awkward that we return something other than GetResult, in this case a subset dictionary.
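
One possible reading of the None-check concern, as a hedged sketch: downstream code that indexes a key directly and then compares it to None works with today's results but raises KeyError once the key is dropped. The function names and sample dictionaries below are hypothetical and do not come from any real downstream project.

```python
from typing import Any, Dict


def count_embeddings_strict(result: Dict[str, Any]) -> int:
    # Assumes the key is always present and None means "not requested".
    if result["embeddings"] is None:
        return 0
    return len(result["embeddings"])


def count_embeddings_tolerant(result: Dict[str, Any]) -> int:
    # Treats a missing key and a None value the same way.
    embeddings = result.get("embeddings")
    return 0 if embeddings is None else len(embeddings)


current_style = {"ids": ["a"], "documents": ["x"], "embeddings": None}
filtered_style = {"ids": ["a"], "documents": ["x"]}  # key dropped entirely

try:
    count_embeddings_strict(filtered_style)
    strict_survives = True
except KeyError:
    strict_survives = False
```

The strict variant only survives the current result shape; the tolerant variant handles both.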

Author

Not sure I understand: "(...) break existing downstream projects that may depend on attribute None checks?".

But about the GetResult TypedDict: note (in case you didn't know) that since Python 3.11, individual keys can be made optional with the NotRequired type.

Here's PEP 655 about it.
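
As a minimal sketch of what PEP 655 allows, the TypedDict below keeps "ids" required and marks the other keys as NotRequired. SparseGetResult and its fields are hypothetical names for illustration; this is not chromadb's actual GetResult definition. The fallback import assumes the typing_extensions backport is installed on interpreters older than 3.11.

```python
from typing import Any, Dict, List, Optional

try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:
    # Backport for older interpreters (assumes typing_extensions is installed).
    from typing_extensions import NotRequired, TypedDict


class SparseGetResult(TypedDict):
    """Hypothetical result type: only ids is always present."""

    ids: List[str]
    documents: NotRequired[Optional[List[str]]]
    metadatas: NotRequired[Optional[List[Dict[str, Any]]]]


# Both values type-check: the NotRequired keys may be absent.
only_ids: SparseGetResult = {"ids": ["a"]}
with_docs: SparseGetResult = {"ids": ["a"], "documents": ["x"]}
```

A type checker would accept both assignments, while still flagging a dictionary that lacks "ids".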

@@ -364,6 +388,10 @@ def query(
        if "uris" not in include:
            query_results["uris"] = None

        if filter_include:
            keep_keys = ['ids'] + [key for key in include if key != 'ids']
            return subdict(query_results, keep_keys)
Contributor

Same issue as above:

  • Downstream impacts
  • Non-typed returns

@thorwhalen
Author

> @thorwhalen, thanks for splitting the PRs and thinking of this ergonomic aspect of the API. Even then, I'm not too sure about this change. Seemingly innocuous, it has the potential to invalidate assumptions made by downstream projects. Also, the need to add filter_include as an extra keyword on top of the already existing include seems counter-intuitive - it seems like saying something and then saying, "What I mean to say is..."
>
> @HammadB, wdyt?

Quite frankly, I think the solution, proposed earlier in the issue, that we only return keys that are listed in include (plus ids) is a better one.

The PR I'm proposing here is just to see what a non-breaking solution would be.
Perhaps an even "better" one would be to have an "external" class decorator that changes the behavior of the get and query functions instead, so that users only have to pull in that functionality if they want to.

Still, if it were my choice, I'd go for changing the semantics of include as mentioned above.
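
The "external" class decorator idea mentioned above could look roughly like the sketch below. Everything here is hypothetical: the decorator name, the toy FakeCollection class, and its default include value are assumptions for illustration and are not part of chromadb or of this PR. The wrapper also assumes include is passed as a keyword argument.

```python
import functools
from typing import Any, Callable, Dict, Sequence


def _keep_requested(results: Dict[str, Any], include: Sequence[str]) -> Dict[str, Any]:
    # Same selection rule as in the PR diff: "ids" first, then the requested keys.
    keep = ["ids"] + [k for k in include if k != "ids"]
    return {k: results[k] for k in keep if k in results}


def _filtering(method: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    @functools.wraps(method)
    def wrapper(self, *args: Any, include: Sequence[str] = ("metadatas", "documents"), **kwargs: Any):
        results = method(self, *args, include=list(include), **kwargs)
        return _keep_requested(results, include)

    return wrapper


def filter_include(cls: type) -> type:
    """Class decorator: wrap get and query so they return only the requested keys."""
    for name in ("get", "query"):
        if hasattr(cls, name):
            setattr(cls, name, _filtering(getattr(cls, name)))
    return cls


class FakeCollection:
    """Toy stand-in that mimics the current behavior (unrequested keys are None)."""

    def get(self, include=("metadatas", "documents")):
        data = {
            "ids": ["a"],
            "embeddings": [[0.1]],
            "metadatas": [{"k": 1}],
            "documents": ["x"],
            "uris": None,
        }
        return {k: (v if k == "ids" or k in include else None) for k, v in data.items()}


@filter_include
class LeanCollection(FakeCollection):
    pass


lean = LeanCollection().get(include=["documents"])
plain = FakeCollection().get(include=["documents"])
```

Only users who opt in by decorating (or subclassing) see the filtered shape; the undecorated class keeps returning every key.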

Downstream project

I'm not sure I understand the downstream concern specifically, but I understand the general concern all too well. It's always a hard choice, and the only thing we can be certain of is that any change will invalidate someone's assumptions.

Yet the solution of changing the semantics of include (my favorite solution) seems to be more of a breaking change than the one I'm proposing, no?

Non-typed returns

The return value can still be typed by using NotRequired in the TypedDict.

Awkward filter_include

Agreed. I thought about it, and filter_include is the least awkward name I could come up with.
Any other ideas?

I asked ChatGPT just now, and this is what it gave me:

  1. exclude_unrequested: This name clearly indicates that any fields not requested are excluded from the response.
  2. trim_response: Suggests the method will trim or reduce the response to only include requested fields.
  3. lean_output: Implies the output will be lean or minimal, containing only the requested keys.
  4. compact_result: Indicates that the result will be compact, containing only essential or requested data.
  5. include_only_specified: Directly states that only the specified fields in the include argument will be present in the result.
  6. omit_unspecified: This name suggests that unspecified fields will be omitted from the response.
  7. sparse_fields: Indicates that the fields in the response will be sparse, limited to those specified.
  8. restrict_to_included: This name denotes that the response will be restricted to only the included fields.
  9. minimal_keys: Suggests that the response will contain a minimal set of keys, aligning with the user's request.
  10. prune_extras: Implies that extra or unrequested fields will be pruned from the response.

@tazarov
Contributor

tazarov commented Jan 19, 2024

@thorwhalen, let me give it a thought.

Successfully merging this pull request may close these issues.

[Feature Request]: A filter_include argument for Collection.get and Collection.query
3 participants