-
Notifications
You must be signed in to change notification settings - Fork 50
Django command for generating waveforms #530
Django command for generating waveforms #530
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested it locally and it worked! I ran the command and then just dj shell
and confirmed the processed waveforms were in the audio_add_on table.
I've got a couple questions/gut-checks I'd like to verify before approving. Unit tests would also be nice to have for this, especially if my suggestions end up getting implemented (in particular the start/stop/resume aspect) and the complexity of it increases.
# information, so they get silenced | ||
logging.getLogger("catalog.api.utils.waveform").setLevel(logging.WARNING) | ||
|
||
audios = Audio.objects.all().order_by("id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is loading all of these into memory... okay? Or should we paginate this somehow? I've never done this kind of thing with Postgres before but previously with MySQL paginating these sorts of full-table iterations was a must.
Django ORM supports slicing queries for this which should make it not-too-bad to do if we need to avoid loading the entire table into memory on the API.
Also if this command crashes for any reason halfway through or we have to stop and restart it for any reason, this is going to iterate over Audio that we've already got waveforms for. Would it be worth excluding any rows that already have corresponding rows in the add_on table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah drat, I naively assumed that iterating over them in this manner would automatically paginate a cursor on the database. According to the docs1, all()
evaluates the entire QuerySet
. I think we can use a Paginator
to improve that.
Would it be worth excluding any rows that already have corresponding rows in the add_on table?
Absolutely, great call!
Footnotes
with self.tqdm(total=count) as progress: | ||
for audio in audios: | ||
try: | ||
audio.get_or_create_waveform() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned about this approach, in particular I'm worried about us hitting the upstream provider endpoints too quickly. Should we rate-limit ourselves and slow this process down, maybe just as simple as sleeping for a quarter of a second after processing each record? We're not in a super rush I don't think, if we run the command in a screen we can leave it running and continue to monitor it throughout whatever amount of time it predicts will take to complete the full list of audio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be okay with adding a delay at this step 🙂 0.25 seconds is a small fraction of the time it takes to actually retrieve + compute the waveform, so that seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was noticing that as I ran it locally. I almost suggested doing something like a speed limit (1 row per second but include the time spent processing the previous row in the "wait" time between each row... or something?) but decided that was probably too complicated/overkill (though if there's some kind of utility for already doing that easily it would be sweet to use it).
Agreed, I'll try to start on those if I have time before I'm out. Initially this was really straightforward, but then I added error handling, and with the other suggestions it definitely makes sense to do! |
@AetherUnbound If you're okay with handing this off and run out of time, I'm happy to pick this up wherever you leave off, no sweat, just let me know. |
@sarayourfriend I don't think I'm going to end up getting to this before I'm out - please feel free to take it over! |
@AetherUnbound Sure thing! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. A few lint fixes pointed out by the check and we could merge this!
@@ -14,6 +14,7 @@ sphinx = "*" | |||
sphinx-autobuild = "*" | |||
furo = "*" | |||
myst-parser = "*" | |||
factory-boy = "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice addition.
@sarayourfriend Seeing three remaining lint errors in the latest run: api/test/unit/management/commands/generatewaveforms_test.py:53:5: F841 local variable 'waveforms' is assigned to but never used
api/test/factory/models/media.py:6:1: F401 'catalog.api.models.media.AbstractMedia' imported but unused
api/test/factory/models/media.py:21:58: E741 ambiguous variable name 'l' |
dd246ac
to
b285290
Compare
Thanks @zackkrida, fixed and pushed. Tho now there are some conflicts I need to resolve 👍 |
b285290
to
3d7fb7a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't approve my own PR 😅 but here's a few comments, nothing to stop a merge though!
audios = Audio.objects.exclude( | ||
identifier__in=existing_waveform_audio_identifiers_query | ||
).order_by("id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we taking this approach because Audio
and AudioAddOn
don't have a foreign key relationship?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "this approach" what do you mean? The inner select on the add ons table?
The tables don't have FK relationship though, you're right, at least not one the DB knows about. Trying to guess what your question was about, I don't think there's a JOIN we could do here, just an inner select (though for clarity for anyone reading this who isn't familiar with Django ORM, the inner select will be executed in the DB as an actual inner-SELECT, not Python-side).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I meant that I could have sworn we could do something like
Audio.objects.filter(audioaddon__waveform_peaks__isnull=True)
if these were foreign keys. They're not, so I don't think Django would be able to interpret that relationship appropriately.
And sorry for the lack of context, I had that initially but second guessed myself and deleted the entire comment 🤦♀️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh yes no that's not possible unfortunately because without the explicit relationship field Django won't add the reverse field relationships you'd need to make that JOIN.
from factory.django import DjangoModelFactory | ||
|
||
|
||
class AudioFactory(MediaFactory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super neat!
call_generatewaveforms() | ||
|
||
assert_all_audio_have_waveforms() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this approach!
@AetherUnbound I think your symbolic approval here would be fine, once you think it's good to go I can merge it, if no one else reviews it by then. |
This looks great! Manually testing with interrupts, with the rate limit enabled/disabled, and setting a limit to let the batch complete all seem to be working great 🎉 Other than merge conflicts, I think the
For the audio detail & search endpoints. |
I'm also getting the error Staci is getting, as well as:
|
@stacimc @AetherUnbound does that happen on a fresh version of the API or only with existing data after running If I do I think this is because our API integration tests rely on the existing database rather than (what would be ideal) building their own data that then gets torn down after each test iteration. Is fixing that in scope for this issue? I could probably find a workaround but it would likely cheapen the tests 😕 The easiest thing I can think would be to just delete the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried generating the waveforms and running together with the frontend locally, and I got myself some instant waveforms! :)
The problem is that there is a mismatch between what the front end assumes and what the API returns when no waveforms are available. If I understand correctly, the API returns an empty list []
if the waveform is not available.
The frontend expects either an array with at least 1 item where all values are between 0 and 1, or a falsy value (then it generates random peaks).
When the frontend gets an empty list for peaks, []
, it does not show any waveform, the validator throws an error, and there are lots of Nan
errors when it is trying to calculate the waveform peaks.
I am not sure whether it should be fixed in this PR, or a separate one, though, so I'm approving :)
Ah, my bad! I didn't run |
Kudos to @AetherUnbound for coming up with much better ones than I was able to.
753182c
to
3d3c05e
Compare
@obulat that's a bug introduced by the previous PR that actually added the peaks data to the API. Let's fix it in a separate issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is because our API integration tests rely on the existing database rather than (what would be ideal) building their own data that then gets torn down after each test iteration.
My bad, I didn't look closely enough at the tests. They are passing after recreating/reinitializing everything. That's definitely not ideal but absolutely a separate issue and shouldn't be fixed here, agreed!
Fixes
Fixes #529 by @AetherUnbound
Description
This PR adds a Django command,
generatewaveforms
, for generating waveforms of all audio records in order to populate the waveform cache.I've added a new dependency,
django-tqdm
, to make the output easier to read. We can remove this dependency once the API no longer generates waveforms.Command list
Help text
Sample output
Testing Instructions
just recreate
just init
docker-compose exec web bash
python manage.py generatewaveforms
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin