Bedrock performance tooling and initial optimisations #15513

Open · wants to merge 7 commits into main from 15505-bedrock-perf-pass
Conversation


@stevejalim stevejalim commented Nov 18, 2024

This changeset adds support for profiling Bedrock using django-silk locally (or anywhere the bedrock_test image is used - but not in production).

It also contains some optimisations - via caching - to reduce the DB queries executed on the busiest pages.

Significant changes and points to review

Please review this PR commit by commit, paying sceptical attention to the usage of caching etc. The commit messages contain details of the number of queries saved.

I know that at the moment Bedrock is using a version of Django's LocMemCache backend - this means that for 45 pods in production, we'll still get plenty of cache misses until the pods have all had a call that warms the cache. It might be that, given the TTL of the cached items, we never really get to that point in prod, but we will in Dev and Stage where there are far fewer pods.

We'd certainly get more caching uplift if we had a shared cache backend, such as Redis. Given we now have Redis in play for the rq backend, we could switch to it at this point (expanding this PR), or as a separate change - opinions welcome!
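Roughly, that switch would be a settings change along the lines of the sketch below - this is illustrative only, the env var name and URL are placeholders, and it relies on Django 4+ shipping a built-in Redis cache backend:

```python
# Sketch only: swap the default LocMemCache for a shared Redis cache.
# REDIS_URL is a placeholder for however the deployment exposes Redis.
import os

CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": os.environ.get("REDIS_URL", "redis://localhost:6379/1"),
    }
}
```

With a shared backend, the first request to any pod warms the cache for all of them, which is the uplift described above.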

Issue / Bugzilla link

#15505

Testing

Unit tests passing should be enough here, but feel free to follow the notes in profiling/hit_popular_pages.py to test drive things yourself.
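For reference, that script does something along these lines - a sketch with placeholder host and paths, not the actual contents of profiling/hit_popular_pages.py:

```python
# Sketch: generate some traffic against a locally running Bedrock so that
# django-silk has requests to profile. Host and paths are placeholders.
import requests

BASE_URL = "http://localhost:8000"
PATHS = [
    "/en-US/",
    "/en-US/firefox/new/",
    "/en-US/firefox/132.0.1/releasenotes/",
]

for path in PATHS:
    resp = requests.get(BASE_URL + path)
    print(resp.status_code, path)
```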

Questions

Is the addition of django-silk worth mentioning in formal documentation?

@stevejalim stevejalim requested a review from a team as a code owner November 18, 2024 11:43
@stevejalim stevejalim requested review from pmac and robhudson November 18, 2024 11:44

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 90.47619% with 6 lines in your changes missing coverage. Please review.

Project coverage is 79.13%. Comparing base (61085d4) to head (dbd3748).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| bedrock/settings/base.py | 44.44% | 5 Missing ⚠️ |
| bedrock/urls.py | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15513      +/-   ##
==========================================
+ Coverage   79.06%   79.13%   +0.06%     
==========================================
  Files         160      160              
  Lines        8355     8411      +56     
==========================================
+ Hits         6606     6656      +50     
- Misses       1749     1755       +6     

☔ View full report in Codecov by Sentry.

@stevejalim
Collaborator Author

Talking with @bkochendorfer, we're both OK with the idea of using Redis for the Web deployments too (the same Redis we use for the CMS task queues because there's little real-world risk of task eviction if Redis gets low on memory). As such, I will follow this PR with another one focused on using Redis as an FE cache backend.

@pmac @robhudson Are you aware of any issues that are likely when switching from LocMem to Redis for a cache - is there anywhere where we specifically exploit the local nature of it?

@pmac
Member

pmac commented Nov 19, 2024

I don't know of any issues specific to locmem other than this will be a lot slower. I am a little concerned about using the same redis for cache and queue. We are doing this with basket, but that's a much lower volume. My understanding is that you'd use very different settings in redis for a cache vs. queue.

@stevejalim
Collaborator Author

> I don't know of any issues specific to locmem other than this will be a lot slower. I am a little concerned about using the same redis for cache and queue. We are doing this with basket, but that's a much lower volume. My understanding is that you'd use very different settings in redis for a cache vs. queue.

I hear you. I'm happy to try these cache uses based on LocMemCache for now and see what the uplift is like. My leaning towards a shared cache was to reduce cache misses from pods with cold local-memory caches, but doing things one step at a time is good - we may not need a shared cache once the pods are warm.

```python
if qs is None:
    qs = Position.objects.exclude(job_locations="Remote")
    cache.set(_key, qs, settings.CACHE_TIME_SHORT)
return qs
```
Member


Should we cache the queryset or cache the list of objects?

Collaborator Author


Ideally yeah, we'd cache the objects, not the qs, but it's a bit more fiddly than I expected.

Because these are CBVs, caching the queryset in get_queryset is easier to do (and to discover) than caching the Position objects themselves, because there's no clean way to slot in caching when the view sets the object_list attribute, that I can see.

Later, even though we don't use pagination on these pages, the list view runs the data through paginate_queryset(), which currently (Django 4.2.x) takes a queryset arg but treats it as a general iterable -- all of which means we could make get_queryset return a list of objects (from the cache), not a QuerySet, but it kind of muddies things.

What do you think?

Member


Thanks for sharing that. It's a bit more complex than it looks at first glance. I'm actually a bit surprised we can cache the queryset objects, which seems like the more complex thing to cache. I suppose they get pickled? But this is fine with me, I was mostly curious about your thoughts on the consideration. Thanks!
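For reference, a minimal sketch of the pattern being discussed - caching inside get_queryset on a list view. The class name and cache key are illustrative, not the PR's actual code; the queryset and timeout mirror the diff above:

```python
from django.conf import settings
from django.core.cache import cache
from django.views.generic import ListView

from bedrock.careers.models import Position


class PositionListView(ListView):  # illustrative view, not the PR's actual class
    def get_queryset(self):
        key = "careers:positions"  # hypothetical cache key
        qs = cache.get(key)
        if qs is None:
            qs = Position.objects.exclude(job_locations="Remote")
            # Pickling a QuerySet for the cache forces it to evaluate, so the
            # cached entry carries the results. Caching list(qs) instead would
            # store plain objects, but then get_queryset no longer returns a
            # QuerySet - the trade-off described above.
            cache.set(key, qs, settings.CACHE_TIME_SHORT)
        return qs
```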

```python
from django.core.cache import cache

from bedrock.careers.models import Position
from bedrock.careers.tests import PositionFactory
from bedrock.mozorg.tests import TestCase


class TestPositionModel(TestCase):
    def setUp(self):
        cache.clear()
```

Member


For bonus points you could test that the DB call doesn't get triggered on a 2nd call to the wrapped methods.

Collaborator Author


Yeah - good call. Time to break out assertNumQueries!
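Something along these lines, assuming a hypothetical cached helper on Position (the method name is made up for illustration):

```python
from django.core.cache import cache

from bedrock.careers.models import Position
from bedrock.careers.tests import PositionFactory
from bedrock.mozorg.tests import TestCase


class TestCachedPositionQueries(TestCase):
    def setUp(self):
        cache.clear()

    def test_second_call_is_served_from_cache(self):
        PositionFactory()
        with self.assertNumQueries(1):
            Position.position_types()  # hypothetical cached classmethod
        with self.assertNumQueries(0):
            Position.position_types()  # second call should not hit the DB
```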


@robhudson robhudson left a comment


r+wc

@stevejalim stevejalim force-pushed the 15505-bedrock-perf-pass branch from 6784cc0 to 9e6ee0f on November 21, 2024 12:14
@stevejalim stevejalim force-pushed the 15505-bedrock-perf-pass branch 2 times, most recently from 3683216 to dc54a7d on December 2, 2024 10:05
Also add script to hit some common/popular URLs
to give django-silk some traffic to capture
Saves 11 SQL queries on the releasenotes page by caching the country code
lookup for an hour.

Tested on /en-US/firefox/132.0.1/releasenotes/

Cold cache: 14 queries / 2066ms
Warm cache: 3 queries / 222ms
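Roughly, the pattern is the one sketched below - the cache key, helper name, and the way the lookup function is passed in are illustrative, not the actual Bedrock code:

```python
from django.core.cache import cache

ONE_HOUR = 60 * 60


def cached_country_codes(lookup_fn):
    # lookup_fn stands in for whatever function performs the DB lookup; its
    # result is cached for an hour so repeat page views skip those queries.
    return cache.get_or_set("releasenotes:country-codes", lookup_fn, ONE_HOUR)
```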
Caches derived values (location options, etc) for longer than the actual page of positions

Takes careers listing page down from 9 queries to 2 for the CMS deployment and 0
on the Web deployment
It's used on the newsletter form which shows up
in a bunch of places on the site.
@stevejalim stevejalim force-pushed the 15505-bedrock-perf-pass branch from dc54a7d to dbd3748 on December 13, 2024 10:56