-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed #27149 -- Added Subquery and Exists database expressions. #6478
Conversation
@jarshwah provided a heap of guidance on how to make this better than it was, but it seems to be coming along nicely now. Would love to have some others provide further feedback. |
# Circular reference if we do it earlier. | ||
from django.db.models.lookups import Exact | ||
|
||
# Is there a single relationship between subquery and query? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This join generation concerns me - not that it won't work just that it's kinda magical and ugly. It would be awesome if we could use the relationship name somewhere. Perhaps SubQuery(rel_name, qs=BLAH)
which is a similar API to Prefetch
? I don't know how easy that would be to get to work as the rel
object would probably need to do some of the transformations. It may allow a wider variety of rel objects to work though - e.g. subquery on a M2M field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the magic here is probably the thing I like least (although the where.children.append()
also feels nasty, but maybe there isn't another way to do that).
I played around with different ways of passing stuff around (but that was before I was pointed towards resolve_expression
)
Nice patch Matthew! I was wondering if it had been considered to allow passing a boolean to the |
I was taking the approach that you might want to filter on filtered queryset. Using the True as a shortcut for RelatedModel.objects.all() may be nice though. |
Building on @charettes suggestion, I think maybe there should also be special-cased versions:
Or should this collapse to a single case, where you can use |
FWIW I really dislike exclude. I think Django could solve a lot of (internal) problems by ditching exclude and implementing everything in terms of filter (not that that can happen now, but no need to continue on that path). If you choose to special case True, then do so for False also. |
So I've managed to implement |
Hmm. Looking at the documentation, it may be possible to use the |
The implementation of .exclude(cond) isn't anything more than .filter(~Q(cond)), and we definitely can't get rid of negated Q() objects. There are multiple problems with the way negated filters are implemented currently in Django. I'll got a bit into why we have problems in this comment even if this PR might not be the right forum. First, the definition of a negated query is pretty much this: If obj in qs and obj not in qs.filter(cond), then obj in qs.filter(~Q(cond)). Now, this is pretty nice definition and we can make it work nicely, except for the case where we are filtering along a join generating multiple entries (aka a multivalued relation). Say, we have a book with three authors, aged 40, 50 and 60. If we do The correct way to write both the filter() and negated filter() query is to use a subquery. But, we can't use a subquery for the .filter() case, as Extra fun is generated because So, the tl;dr version of this is that we do have a problem with negated filters, but there isn't an easy fix in sight. The fix might be that we change the definition of filter against a multivalued join to use a subquery (thus breaking .filter().annotate()), or that we change the definition of .filter(~Q(cond)) to not use a subquery (thus breaking the current definition of negated filtering). Both of these will break silently a lot of correctly written projects, so we need something else than just a release note for this. We could also try to fix the code, and I had some PRs a while ago trying to do exactly that, but making sure the code works for complex combinations of negated and non-negated filters over multiple multivalued relations and with annotations over the same relations is extremely hard. |
For the specific problem in this PR with exclude, we should push the joins into the subquery. That is, instead of generating a query like
the query should be
Given models Book and Chapters(book=ForeignKey(Book)), I'd start with making sure the following queryset works: Building on |
@akaariai Thanks for the input. Lots there: I think I understand some of it. I assume the last comment is basically stating that we shouldn't push the where automatically into the sub-query? Instead, there should be a pseudo-field called |
The way negated querysets and multivalued relations work is complicated, unfortunately I don't see any easy way to make it simpler. What I am after for SubQuery and Exists is that if you issue:
then what should happen is that the F('outerq__pk') gets resolved against the outer query (the Book query). So, F('outerq__pk') is replaced by book.id in the generated query. Similarly, if you did:
then the F('outerq...') would be resolved against the book query, generating a join to the publisher of the book and fetching the pk of the publisher. So, the query would look something like:
If we have that working, we are pretty far in getting all kinds of interesting subqueries implemented. You can then built on this for automatically generating the connecting condition in the subqueries. Does this make sense? |
Hmmh, it might be we can't easily implement the F('outerq__') style of connecting subqueries to their parent queries. We'd need to resolve the F('outerq__') lazily as at the point the filter() method is called for the inner query, we don't yet have a reference to the outer query. An idea is to add a Query.delayed_expressions list. When the inner query is connected to the outer query, we resolve all the delayed expressions, and similarly to how relabeled_clone is implemented, replace the delayed expression with a resolved expression. So, for example, condition This isn't exactly easy, but I believe it should work reliably. |
After starting to use In the end, I wound up writing it using |
Hmm. This seems like the reasonable place to resolve the expression to the delayed expression. |
Okay, pretty happy with how this went tonight. I've managed to implement delayed It makes the syntax for using these expressions slightly more cumbersome:
However, this explicit nature is better, because it means that you can have an arbitrary relationship to the outer query (and potentially more than one), or indeed no relationship at all. I needed to not grab a |
Oh, @akaariai, I didn't use an explicit Thanks for your help. |
[resolve(_child) for _child in child.children] | ||
if hasattr(child, 'rhs') and isinstance(child.rhs, F): | ||
fieldname = child.rhs.name.split('__')[1] | ||
if fieldname == 'pk': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can likely just resolve the F() expression using resolve_expression() once you have removed the outerq prefix from the F obj.
Apart from the |
'ORDER BY "lookup_author"."name"' | ||
) | ||
print(expected) | ||
result = str(Author.objects.exclude(article__exists=Article.objects.filter(author=OuterRef('pk'))).query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem here is using the lookup machinery in the first place. It will try to create a subquery for you...
Could you add a hack where exists and not exists are implemented as annotation + adding that annotation to the where clause directly? That might work, though no guarantees. Btw why do you need not exists, isn't ~Q(field__exists) enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking about a way to use the annotation version: I'm currently doing some stuff in a test project where I annotate and then filter on those annotations.
The rationale for the not_exists was they are just temporary, so I can see what the query is supposed to look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had even thought about Model.objects.filter(article=Exists(...))
as the filtering syntax, but I'm not sure how that will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just Model.objects.filter(Exists(...))? It seems you have the outerq-innerq connection already manually added.
It might be a good choice to skip the lookup support for the initial patch and go with Exists and SubQuery only. Usually if it is possible to write a patch in two parts, it is better to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I'll have a quick look at the bare .filter(Exists(...))
, and if that isn't super-simple, scale down the patch.
Also, I'm not sure how to handle evaluation of a queryset that contains an unresolved OuterRef()
(or the resolved F()
expression that refers to an outer query).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep OuterRef as OuterRef instead of resolving to F, and add as_sql to OuterRef throwing an exception with info of what went wrong.
Best thing about this: all of this functionality only adds classes, doesn't actually require any changes to any other django code. That's got to be a win. |
1a61598
to
24bda5b
Compare
FWIW I've used this today. I managed to get a bit of an odd issue with sql params being a tuple, in most of the expressions it's a list. This may just be a bug in my code for |
Lists. We often use all_params.extend(part_params) when combining parts of SQL together. I'm not sure if we have actually documented this anywhere - if not, we probably should. |
Hmm ok, so I'm seeing django/django/db/models/sql/compiler.py Line 1224 in 92053ac
For now I've fixed it locally by forcing |
I thought i'd only CamelCased SubQuery when talking about the class (or instances of it): otherwise I'd used subquery. I'm not exactly sure when I'll get enough time to look at the Oracle traceback stuff: I don't have VirtualBox set up (it crashed when then using xhyve), and I have after-work commitments today. |
My question was: why camel case the class name? |
buildbot, test on oracle |
For what it's worth I prefer the CamelCased |
Oracle errors are due to Oracle not supporting
Simply wrapping in a Case/When expression won't work either, since When expects a LHS/Condition (just like filter does). You'll need to change the template and hardcode the CASE WHEN SQL. I'll have a go doing this later tonight. I think there are other issues with how the SQL is being compiled, but that might just be my setup or how I'm printing SQL. I'll look into it further. |
OK, the
But now running into problems with how Oracle does LIMITs. Oracle requires a subquery for doing limits. It pushes the original query into a subquery, then wraps it with a TestCase: test_nested_subquery_outer_ref_2:
One way I've fixed this SQL manually (in the sql shell), is by aliasing the outer subquery (that does ROWNUM <= 1), then changing the "SELECT *" to "SELECT newalias.OUTER". I don't think that's the right path though. It requires the outer most query to have knowledge of the inner most query select list, and to return just those values. A pretty shitty situation. I'd be interested in hearing ways to solve this particular problem in SQL, so that it can be translated more easily into python. |
Oh, will need a test to make sure NOT EXISTS is still working correctly on Oracle. |
This diff fixes the
|
This sure is promising. I ran a bunch of test modules, and the test_subquery is the only test to fail with the above patches. I'll try to solve this too though. hold my beer. |
OK, all tests passing with a PR I opened up against @schinckel branch: schinckel#1 If you can merge this, we can kick off the oracle build to verify, and then Tim can finish review. |
Can I put your beer down yet? |
|
||
def as_sql(self, *args, **kwargs): | ||
raise FieldError( | ||
'This queryset contains an unresolved reference to an outer query ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test for this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure how to get this method to execute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove it?
alias for alias in change_map.values() if alias not in clone.subquery.query.tables) | ||
return clone | ||
|
||
def as_sql(self, compiler, connection, template=None, **extra_context): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usage of template=None, **extra_context
params seems untested. Not sure if it's important.
|
||
template = template or template_params.get('template', self.template) | ||
sql = template % template_params | ||
sql = connection.ops.unification_cast_sql(self.output_field) % sql |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test for this?
``SubQuery()`` expressions | ||
-------------------------- | ||
|
||
.. currentmodule:: django.db.models.expressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you feel this adds some value? You added aliases in django.db.models so I don't see why these expressions would be treated differently from the rest in this document. I didn't want to change it without asking.
a1f76e8
to
9acac32
Compare
|
||
.. currentmodule:: django.db.models.expressions | ||
|
||
.. class:: Subquery(queryset, output_field=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
queryset
is called subquery
in the code -- should it be consistent one way or the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be queryset, because that shows what type of argument is expected.
.. versionadded:: 1.11 | ||
|
||
When a value is required that is dependent upon values from within a related | ||
object, it's possible to use an explicit ``Subquery``. For example:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to describe in words what the querysets are computing. Also, I'm not sure how valuable the examples are without the model definitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think maybe I'll rewrite the examples to use an existing model structure, or at least something with "known" semantics. Perhaps Author, Post, Comment?
buildbot, test on oracle. |
So I've had a crack at https://code.djangoproject.com/ticket/16603, however I can't get it to work (because of the M2M stuff, and I'm not sure I understand exactly what the regression test is supposed to be testing). I've also had a look at https://code.djangoproject.com/ticket/25789, and I think I'm just not going to claim that they can be solved by these classes. I'm not convinced they can't, but I don't think we have time to investigate further before feature freeze. |
fe0adac
to
d1cf201
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, please check my edits.
|
||
def as_sql(self, *args, **kwargs): | ||
raise FieldError( | ||
'This queryset contains an unresolved reference to an outer query ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove it?
|
||
>>> from django.db.models import Exists, OuterRef | ||
>>> recent_comments = Comment.objects.filter(post=OuterRef('pk'), created_at__gte=one_day_ago) | ||
>>> Post.objects.filter(recent_comment=Exists(recent_comments)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, this doesn't work (and the next section says so -- FieldError: Cannot resolve keyword 'recent_comment' into field.
). I revised it as I saw fit.
self.assertCountEqual(qs.values('pk').filter(is_point_of_contact=True), expected_result) | ||
# A less elegant way to write the same query: this uses a LEFT OUTER | ||
# JOIN and an IS NULL which is probably less efficient than EXISTS. | ||
self.assertCountEqual( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL/PostgreSQL test failure suggests this query isn't equivalent (unless it really only cares about .count()
as you had previously). Let me know what you want to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.
I think maybe the expected_result
is incorrect, because the test value matches the one in the next query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact: that's testing a company, not an employee.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I'm not 100% happy about with this is that the query that is generated contains a superfluous = true
:
SELECT "expressions_employee"."id"
FROM "expressions_employee"
WHERE EXISTS(SELECT U0."id",
U0."name",
U0."num_employees",
U0."num_chairs",
U0."ceo_id",
U0."point_of_contact_id"
FROM "expressions_company" U0
WHERE U0."point_of_contact_id" = ("expressions_employee"."id")) = True
ORDER BY "expressions_employee"."firstname" ASC
(And we are selecting superfluous columns with the EXISTS (SELECT ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the = true
is a byproduct of filter requiring a left and right hand side. This can be fixed when that restriction is removed via filter/expressions.
Yeah, your edits are great! |
6d1243e
to
d7e4a08
Compare
Thanks Josh Smeaton for Oracle fixes.
d7e4a08
to
236ebe9
Compare
Sort-of related to:
https://code.djangoproject.com/ticket/16603
https://code.djangoproject.com/ticket/25789
but this is dealing with .annotate(), rather than .filter() methods
on a queryset.
https://code.djangoproject.com/ticket/27149