This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Django command for generating waveforms #530

Merged
sarayourfriend merged 13 commits into main from feature/generate-waveforms-django-command#529 on Mar 8, 2022

Conversation

AetherUnbound
Contributor

Fixes

Fixes #529 by @AetherUnbound

Description

This PR adds a Django command, generatewaveforms, for generating waveforms of all audio records in order to populate the waveform cache.

I've added a new dependency, django-tqdm, to make the output easier to read. We can remove this dependency once the API no longer generates waveforms.

Command list

root@c219265a4b39:/api# python manage.py --help

Type 'manage.py help <subcommand>' for help on a specific subcommand.

Available subcommands:

[auth]
    changepassword
    createsuperuser

[catalog]
    generatewaveforms

[contenttypes]
    remove_stale_contenttypes

...

Help text

root@c219265a4b39:/api# python manage.py generatewaveforms --help
usage: manage.py generatewaveforms [-h] [--version] [-v {0,1,2,3}] [--settings SETTINGS] [--pythonpath PYTHONPATH] [--traceback] [--no-color] [--force-color] [--skip-checks]

Generates waveforms for all audio records to populate the cache

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v {0,1,2,3}, --verbosity {0,1,2,3}
                        Verbosity level; 0=minimal output, 1=normal output, 2=verbose output, 3=very verbose output
  --settings SETTINGS   The Python path to a settings module, e.g. "myproject.settings.main". If this isn't provided, the DJANGO_SETTINGS_MODULE environment variable will be used.
  --pythonpath PYTHONPATH
                        A directory to add to the Python path, e.g. "/home/djangoprojects/myproject".
  --traceback           Raise on CommandError exceptions
  --no-color            Don't colorize the command output.
  --force-color         Force colorization of the command output.
  --skip-checks         Skip system checks.

Sample output

root@c219265a4b39:/api# python manage.py generatewaveforms
Generating waveforms for 5,000 records
  0%|                                                                                                                            | 1/5000 [00:00<16:19,  5.10it/s]
  Unable to process 67acc776-838c-4242-bb99-be9e0be061b6: Can't generate json format output from Unknown file format
  1%|██▍                                                                                                                        | 69/5000 [01:04<45:46,  1.80it/s]
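
To illustrate the behaviour in the sample output, where one unprocessable record is reported and the run continues, here is a minimal sketch. It assumes a hypothetical `generate` callable and a `fake_generate` stand-in; the command itself calls `audio.get_or_create_waveform()` and reports progress through the tqdm bar, neither of which is reproduced here.

```python
from typing import Callable, Iterable, List, Tuple


def process_all(
    identifiers: Iterable[str],
    generate: Callable[[str], None],
) -> Tuple[int, List[Tuple[str, str]]]:
    """Run ``generate`` for every identifier, collecting failures instead of aborting.

    ``generate`` is a hypothetical stand-in for the per-record waveform work.
    Returns the number of successes and a list of (identifier, error message).
    """
    succeeded = 0
    failures: List[Tuple[str, str]] = []
    for identifier in identifiers:
        try:
            generate(identifier)
        except Exception as err:  # one bad record should not stop the batch
            failures.append((identifier, str(err)))
            print(f"Unable to process {identifier}: {err}")
        else:
            succeeded += 1
    return succeeded, failures


def fake_generate(identifier: str) -> None:
    # Hypothetical failure condition, used only to exercise the error path.
    if identifier == "bad":
        raise ValueError("Unknown file format")
```

Calling `process_all(["a", "bad", "b"], fake_generate)` reports the failing record and still processes the other two.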

Testing Instructions

  1. just recreate
  2. just init
  3. docker-compose exec web bash
  4. python manage.py generatewaveforms

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner February 23, 2022 00:50
@AetherUnbound AetherUnbound added 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🤖 aspect: dx Concerns developers' experience with the codebase labels Feb 23, 2022
@AetherUnbound AetherUnbound requested a review from a team February 23, 2022 00:50
@AetherUnbound AetherUnbound removed the request for review from a team February 23, 2022 00:50
Contributor

@sarayourfriend sarayourfriend left a comment

Tested it locally and it worked! I ran the command and then just dj shell and confirmed the processed waveforms were in the audio_add_on table.

I've got a couple questions/gut-checks I'd like to verify before approving. Unit tests would also be nice to have for this, especially if my suggestions end up getting implemented (in particular the start/stop/resume aspect) and the complexity of it increases.

# information, so they get silenced
logging.getLogger("catalog.api.utils.waveform").setLevel(logging.WARNING)

audios = Audio.objects.all().order_by("id")
Contributor

Is loading all of these into memory... okay? Or should we paginate this somehow? I've never done this kind of thing with Postgres before but previously with MySQL paginating these sorts of full-table iterations was a must.

Django ORM supports slicing queries for this which should make it not-too-bad to do if we need to avoid loading the entire table into memory on the API.

Also if this command crashes for any reason halfway through or we have to stop and restart it for any reason, this is going to iterate over Audio that we've already got waveforms for. Would it be worth excluding any rows that already have corresponding rows in the add_on table?
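
As a rough illustration of the slicing idea raised above, here is a minimal sketch of walking a large result set in fixed-size batches. The `fetch(start, stop)` callable and the `iter_in_batches` name are hypothetical; with the Django ORM, `fetch` could be `lambda start, stop: queryset[start:stop]`, since slicing an unevaluated, ordered queryset becomes LIMIT/OFFSET in SQL. A plain list stands in for the table so the example runs on its own.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_in_batches(
    fetch: Callable[[int, int], Sequence[T]],
    batch_size: int,
) -> Iterator[List[T]]:
    """Yield successive batches until ``fetch`` returns an empty slice.

    ``fetch(start, stop)`` is a hypothetical callable that returns the rows
    in the half-open range [start, stop) of an ordered result set.
    """
    start = 0
    while True:
        batch = list(fetch(start, start + batch_size))
        if not batch:
            return
        yield batch
        start += batch_size


# Stand-in for an ordered table of seven rows.
rows = list(range(7))
batches = list(iter_in_batches(lambda start, stop: rows[start:stop], batch_size=3))
```

One caveat, if the queryset also excludes rows that already have waveforms: rows leave the result set as they are processed, so fixed offsets could skip records. Paginating over a set that does not change during the run, or keying each page on the last seen id, would avoid that.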

Contributor Author

Ah drat, I naively assumed that iterating over them in this manner would automatically paginate a cursor on the database. According to the docs [1], all() evaluates the entire QuerySet. I think we can use a Paginator to improve that.

> Would it be worth excluding any rows that already have corresponding rows in the add_on table?

Absolutely, great call!

Footnotes

  1. https://docs.djangoproject.com/en/dev/ref/models/querysets/#when-querysets-are-evaluated

with self.tqdm(total=count) as progress:
    for audio in audios:
        try:
            audio.get_or_create_waveform()
Contributor

I'm a little concerned about this approach; in particular, I'm worried about us hitting the upstream provider endpoints too quickly. Should we rate-limit ourselves and slow this process down, maybe as simply as sleeping for a quarter of a second after processing each record? I don't think we're in a super rush: if we run the command in a screen session, we can leave it running and keep monitoring it for however long it predicts the full list of audio will take.

Contributor Author

I would be okay with adding a delay at this step 🙂 0.25 seconds is a small fraction of the time it takes to actually retrieve + compute the waveform, so that seems fine.

Contributor

I was noticing that as I ran it locally. I almost suggested doing something like a speed limit (1 row per second but include the time spent processing the previous row in the "wait" time between each row... or something?) but decided that was probably too complicated/overkill (though if there's some kind of utility for already doing that easily it would be sweet to use it).
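
For reference, the "speed limit" idea above (one row per interval, counting the time already spent processing) can be sketched without a dedicated utility. The names `remaining_wait` and `run_rate_limited` are hypothetical and not part of this PR; the `sleep` and `clock` parameters exist only so the behaviour can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable


def remaining_wait(interval: float, elapsed: float) -> float:
    """Seconds still to sleep so one iteration takes at least ``interval``."""
    return max(0.0, interval - elapsed)


def run_rate_limited(
    items: Iterable[str],
    work: Callable[[str], None],
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Call ``work`` on each item, starting at most one item per ``interval``."""
    for item in items:
        started = clock()
        work(item)
        # Only sleep for whatever part of the interval the work did not use up.
        sleep(remaining_wait(interval, clock() - started))
```

With a one-second interval, a row that takes 0.25 seconds is followed by a 0.75 second pause, while a row that takes longer than a second is followed by no pause at all.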

@AetherUnbound
Contributor Author

> Unit tests would also be nice to have for this, especially if my suggestions end up getting implemented (in particular the start/stop/resume aspect) and the complexity of it increases.

Agreed, I'll try to start on those if I have time before I'm out. Initially this was really straightforward, but then I added error handling, and with the other suggestions it definitely makes sense to do!

@AetherUnbound AetherUnbound added 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🤖 aspect: dx Concerns developers' experience with the codebase and removed 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 🤖 aspect: dx Concerns developers' experience with the codebase labels Feb 23, 2022
@sarayourfriend
Contributor

@AetherUnbound If you're okay with handing this off and run out of time, I'm happy to pick this up wherever you leave off, no sweat, just let me know.

@AetherUnbound
Contributor Author

@sarayourfriend I don't think I'm going to end up getting to this before I'm out - please feel free to take it over!

@sarayourfriend
Contributor

@AetherUnbound Sure thing!

@sarayourfriend sarayourfriend self-assigned this Feb 28, 2022
@sarayourfriend sarayourfriend requested a review from dhruvkb March 1, 2022 14:21
@sarayourfriend
Contributor

@obulat @dhruvkb this should be ready for review now after adding pagination and unit tests

Member

@dhruvkb dhruvkb left a comment

This looks good to me. A few lint fixes pointed out by the check and we could merge this!

@@ -14,6 +14,7 @@ sphinx = "*"
sphinx-autobuild = "*"
furo = "*"
myst-parser = "*"
factory-boy = "*"
Member

This is a nice addition.

@zackkrida
Member

@sarayourfriend Seeing three remaining lint errors in the latest run:

api/test/unit/management/commands/generatewaveforms_test.py:53:5: F841 local variable 'waveforms' is assigned to but never used
api/test/factory/models/media.py:6:1: F401 'catalog.api.models.media.AbstractMedia' imported but unused
api/test/factory/models/media.py:21:58: E741 ambiguous variable name 'l'

@sarayourfriend sarayourfriend force-pushed the feature/generate-waveforms-django-command#529 branch from dd246ac to b285290 Compare March 2, 2022 20:21
@sarayourfriend
Contributor

Thanks @zackkrida, fixed and pushed. Tho now there are some conflicts I need to resolve 👍

@sarayourfriend sarayourfriend force-pushed the feature/generate-waveforms-django-command#529 branch from b285290 to 3d7fb7a Compare March 2, 2022 20:23
Contributor Author

@AetherUnbound AetherUnbound left a comment

I can't approve my own PR 😅 but here's a few comments, nothing to stop a merge though!

api/catalog/management/commands/generatewaveforms.py (outdated)
Comment on lines +93 to +95
audios = Audio.objects.exclude(
    identifier__in=existing_waveform_audio_identifiers_query
).order_by("id")
Contributor Author

Are we taking this approach because Audio and AudioAddOn don't have a foreign key relationship?

Contributor

@sarayourfriend sarayourfriend Mar 7, 2022

By "this approach" what do you mean? The inner select on the add ons table?

The tables don't have FK relationship though, you're right, at least not one the DB knows about. Trying to guess what your question was about, I don't think there's a JOIN we could do here, just an inner select (though for clarity for anyone reading this who isn't familiar with Django ORM, the inner select will be executed in the DB as an actual inner-SELECT, not Python-side).
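
To make the inner-select point concrete, here is a small sketch of the kind of SQL the database ends up running, using the standard library's sqlite3 as a stand-in for Postgres. The `audio_add_on` table name comes from this thread, but the schema and the `audio_identifier` column are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed, simplified schema: no foreign key links the two tables.
    CREATE TABLE audio (id INTEGER PRIMARY KEY, identifier TEXT NOT NULL);
    CREATE TABLE audio_add_on (audio_identifier TEXT PRIMARY KEY, waveform_peaks TEXT);
    INSERT INTO audio (identifier) VALUES ('a'), ('b'), ('c');
    INSERT INTO audio_add_on VALUES ('b', '[0.1, 0.9]');
    """
)

# The subquery runs inside the database; no identifiers are pulled into Python first.
pending = [
    row[0]
    for row in conn.execute(
        """
        SELECT identifier FROM audio
        WHERE identifier NOT IN (SELECT audio_identifier FROM audio_add_on)
        ORDER BY id
        """
    )
]
conn.close()
```

Here `pending` holds only the audio rows without a matching add-on row, which is also what lets an interrupted run pick up where it left off.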

Contributor Author

@AetherUnbound AetherUnbound Mar 7, 2022

Ah, I meant that I could have sworn we could do something like

Audio.objects.filter(audioaddon__waveform_peaks__isnull=True)

if these were foreign keys. They're not, so I don't think Django would be able to interpret that relationship appropriately.

And sorry for the lack of context, I had that initially but second guessed myself and deleted the entire comment 🤦‍♀️

Contributor

Ohh yes no that's not possible unfortunately because without the explicit relationship field Django won't add the reverse field relationships you'd need to make that JOIN.

from factory.django import DjangoModelFactory


class AudioFactory(MediaFactory):
Contributor Author

Super neat!

api/test/run_test.sh
Comment on lines +42 to +44
call_generatewaveforms()

assert_all_audio_have_waveforms()
Contributor Author

I love this approach!

@zackkrida zackkrida requested a review from obulat March 7, 2022 21:42
@zackkrida
Member

@AetherUnbound I think your symbolic approval here would be fine, once you think it's good to go I can merge it, if no one else reviews it by then.

@stacimc
Contributor

stacimc commented Mar 7, 2022

This looks great! Manually testing with interrupts, with the rate limit enabled/disabled, and setting a limit to let the batch complete all seem to be working great 🎉

Other than merge conflicts, I think the test_audio_success_examples test may also need to be updated. When I run the api-tests locally I get something like:

Differing items:
E         {'peaks': [0.61275, 0.19593, 0.74023, 0.45534, 0.52315, 0.40059, ...]} != {'peaks': []}

For the audio detail & search endpoints.

@AetherUnbound
Contributor Author

I'm also getting the error Staci is getting, as well as:

in_val = '\n# Search for music titled "Wish You Were Here" by The.madpix.project\ncurl  "http://localhost:8000/v1/audio/?title=Wish%20You%20Were%20Here&creator=The.madpix.project"\n'
out_val = {'application/json': {'page': 1, 'page_count': 0, 'page_size': 20, 'result_count': 1, ...}}

    @pytest.mark.parametrize("in_val, out_val", list(audio_mappings.items()))
    def test_audio_success_examples(in_val, out_val):
        res = execute_request(in_val)
>       assert res == out_val["application/json"]
E       AssertionError: assert {'page': 1, '...ount': 1, ...} == {'page': 1, '...ount': 1, ...}
E         Omitting 4 identical items, use -vv to show
E         Differing items:
E         {'results': [{'category': 'music', 'creator': 'The.madpix.project', 'creator_url': 'https://www.jamendo.com/artist/441585/the.madpix.project', 'detail_url': 'http://localhost:8000/v1/audio/8624ba61-57f1-4f98-8a85-ece206c319cf/', ...}]} != {'results': [{'category': 'music', 'creator': 'The.madpix.project', 'creator_url': 'https://www.jamendo.com/artist/441585/the.madpix.project', 'detail_url': 'http://localhost:8000/v1/audio/8624ba61-57f1-4f98-8a85-ece206c319cf/', ...}]}
E         Use -v to get the full diff

@sarayourfriend
Contributor

@stacimc @AetherUnbound does that happen on a fresh version of the API or only with existing data after running generatewaveforms?

If I do just recreate && just init && just api-tests they all pass. But if I run just dj "generatewaveforms --max_records=3" and then run just api-tests again they fail.

I think this is because our API integration tests rely on the existing database rather than (what would be ideal) building their own data that then gets torn down after each test iteration.

Is fixing that in scope for this issue? I could probably find a workaround but it would likely cheapen the tests 😕 The easiest thing I can think would be to just delete the peaks property from both sides of the comparison.
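
For what it's worth, the "delete the peaks property from both sides" workaround could look roughly like the sketch below. The `without_key` helper is hypothetical, and the response shape used here (a `results` list whose items carry `peaks`) is an assumption based on the failing diff rather than the actual test fixtures.

```python
from typing import Any


def without_key(value: Any, key: str = "peaks") -> Any:
    """Return a copy of nested dicts/lists with every ``key`` entry removed."""
    if isinstance(value, dict):
        return {k: without_key(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [without_key(item, key) for item in value]
    return value


# Assumed shapes: the live response has cached peaks, the documented example does not.
actual = {"result_count": 1, "results": [{"title": "x", "peaks": [0.6, 0.2]}]}
expected = {"result_count": 1, "results": [{"title": "x", "peaks": []}]}
```

Comparing `without_key(actual)` with `without_key(expected)` passes regardless of whether waveforms have been generated, at the cost of no longer checking the peaks field at all.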

Contributor

@obulat obulat left a comment

I tried generating the waveforms and running together with the frontend locally, and I got myself some instant waveforms! :)
The problem is a mismatch between what the frontend assumes and what the API returns when no waveform is available. If I understand correctly, the API returns an empty list, [], in that case.
The frontend expects either an array with at least one item, where all values are between 0 and 1, or a falsy value (in which case it generates random peaks).
When the frontend gets an empty list for peaks, it does not show any waveform, the validator throws an error, and there are lots of NaN errors when it tries to calculate the waveform peaks.
I am not sure whether it should be fixed in this PR, or a separate one, though, so I'm approving :)
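
The contract described above can be written down as a small check. This is only a sketch in Python of the rule as stated in this comment, not the frontend's actual validator, and `peaks_acceptable` is a hypothetical name. Note that an empty array is truthy in JavaScript, so `None` is used here to stand in for the falsy case.

```python
from typing import Optional, Sequence


def peaks_acceptable(peaks: Optional[Sequence[float]]) -> bool:
    """Apply the stated rule: no value at all, or a non-empty list within [0, 1]."""
    if peaks is None:  # stands in for a falsy JavaScript value such as null or undefined
        return True
    return len(peaks) > 0 and all(0 <= peak <= 1 for peak in peaks)
```

Under this rule an empty list is rejected, which matches the validator error seen when the API returns [] for a record without a waveform.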

@AetherUnbound
Contributor Author

Ah, my bad! I didn't run recreate before testing. I'm doing so now, as soon as that finishes OK I'll fix the Pipfile.lock conflicts and merge 😄

@zackkrida zackkrida added the 💬 talk: discussion Open for discussions and feedback label Mar 8, 2022
@AetherUnbound AetherUnbound force-pushed the feature/generate-waveforms-django-command#529 branch from 753182c to 3d3c05e Compare March 8, 2022 16:47
@sarayourfriend
Contributor

@obulat that's a bug introduced by the previous PR that actually added the peaks data to the API. Let's fix it in a separate issue.

Contributor

@stacimc stacimc left a comment

> I think this is because our API integration tests rely on the existing database rather than (what would be ideal) building their own data that then gets torn down after each test iteration.

My bad, I didn't look closely enough at the tests. They are passing after recreating/reinitializing everything. That's definitely not ideal but absolutely a separate issue and shouldn't be fixed here, agreed!

@sarayourfriend sarayourfriend merged commit f74c128 into main Mar 8, 2022
@sarayourfriend sarayourfriend deleted the feature/generate-waveforms-django-command#529 branch March 8, 2022 17:06
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 💬 talk: discussion Open for discussions and feedback
Development

Successfully merging this pull request may close these issues.

Audio waveform cache-warming Django command
6 participants