Django command for generating waveforms #530

AetherUnbound · 2022-02-23T00:50:29Z

Fixes

Description

This PR adds a Django command, generatewaveforms, for generating waveforms of all audio records in order to populate the waveform cache.

I've added a new dependency, django-tqdm, to make the output easier to read. We can remove this dependency once the API no longer generates waveforms.

Command list

root@c219265a4b39:/api# python manage.py --help

Type 'manage.py help <subcommand>' for help on a specific subcommand.

Available subcommands:

[auth]
    changepassword
    createsuperuser

[catalog]
    generatewaveforms

[contenttypes]
    remove_stale_contenttypes

...

Help text

root@c219265a4b39:/api# python manage.py generatewaveforms --help
usage: manage.py generatewaveforms [-h] [--version] [-v {0,1,2,3}] [--settings SETTINGS] [--pythonpath PYTHONPATH] [--traceback] [--no-color] [--force-color] [--skip-checks]

Generates waveforms for all audio records to populate the cache

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v {0,1,2,3}, --verbosity {0,1,2,3}
                        Verbosity level; 0=minimal output, 1=normal output, 2=verbose output, 3=very verbose output
  --settings SETTINGS   The Python path to a settings module, e.g. "myproject.settings.main". If this isn't provided, the DJANGO_SETTINGS_MODULE environment variable will be used.
  --pythonpath PYTHONPATH
                        A directory to add to the Python path, e.g. "/home/djangoprojects/myproject".
  --traceback           Raise on CommandError exceptions
  --no-color            Don't colorize the command output.
  --force-color         Force colorization of the command output.
  --skip-checks         Skip system checks.

Sample output

root@c219265a4b39:/api# python manage.py generatewaveforms
Generating waveforms for 5,000 records
  0%|                                                                                                                            | 1/5000 [00:00<16:19,  5.10it/s]
  Unable to process 67acc776-838c-4242-bb99-be9e0be061b6: Can't generate json format output from Unknown file format
  1%|██▍                                                                                                                        | 69/5000 [01:04<45:46,  1.80it/s]

Testing Instructions

just recreate
just init
docker-compose exec web bash
python manage.py generatewaveforms

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

sarayourfriend

Tested it locally and it worked! I ran the command, then opened just dj shell and confirmed the processed waveforms were in the audio_add_on table.

I've got a couple questions/gut-checks I'd like to verify before approving. Unit tests would also be nice to have for this, especially if my suggestions end up getting implemented (in particular the start/stop/resume aspect) and the complexity of it increases.

sarayourfriend · 2022-02-23T01:17:41Z

api/catalog/management/commands/generatewaveforms.py

+        # information, so they get silenced
+        logging.getLogger("catalog.api.utils.waveform").setLevel(logging.WARNING)
+
+        audios = Audio.objects.all().order_by("id")


Is loading all of these into memory... okay? Or should we paginate this somehow? I've never done this kind of thing with Postgres before but previously with MySQL paginating these sorts of full-table iterations was a must.

Django ORM supports slicing queries for this which should make it not-too-bad to do if we need to avoid loading the entire table into memory on the API.

Also if this command crashes for any reason halfway through or we have to stop and restart it for any reason, this is going to iterate over Audio that we've already got waveforms for. Would it be worth excluding any rows that already have corresponding rows in the add_on table?

Ah drat, I naively assumed that iterating over them in this manner would automatically paginate a cursor on the database. According to the docs¹, iterating over a QuerySet evaluates it in full and caches the results, so the whole table would end up in memory. I think we can use a Paginator to improve that.

Would it be worth excluding any rows that already have corresponding rows in the add_on table?

Absolutely, great call!

Footnotes

1. https://docs.djangoproject.com/en/dev/ref/models/querysets/#when-querysets-are-evaluated
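
The two ideas in this thread, reading the table in bounded batches and skipping records that already have a waveform, can be sketched without Django. This is only an illustration under assumptions: it uses keyset pagination (filtering on an id greater than the last one seen) rather than the Paginator mentioned above, and AUDIO_TABLE, ALREADY_PROCESSED, fetch_batch and iter_in_batches are hypothetical stand-ins, not names from the PR.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical in-memory stand-ins for the audio table and for the set of
# identifiers that already have a waveform.
AUDIO_TABLE: List[Dict] = [{"id": i, "identifier": f"audio-{i}"} for i in range(1, 8)]
ALREADY_PROCESSED = {"audio-2", "audio-5"}


def fetch_batch(after_id: int, limit: int) -> List[Dict]:
    """Return up to `limit` unprocessed rows with id > after_id, ordered by id."""
    rows = [
        row
        for row in AUDIO_TABLE
        if row["id"] > after_id and row["identifier"] not in ALREADY_PROCESSED
    ]
    return rows[:limit]


def iter_in_batches(
    fetch: Callable[[int, int], List[Dict]], batch_size: int
) -> Iterator[Dict]:
    """Yield rows one at a time while holding at most one batch in memory."""
    last_id = 0
    while True:
        batch = fetch(last_id, batch_size)
        if not batch:
            return
        yield from batch
        # Resume after the last row seen instead of using a numeric offset.
        last_id = batch[-1]["id"]


pending_ids = [row["id"] for row in iter_in_batches(fetch_batch, batch_size=3)]
```

Keyset paging is used in this sketch because the set of unprocessed rows shrinks as waveforms are saved, which could make offset-based pages skip rows. Whether that matters for the real command depends on how its query is built.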

sarayourfriend · 2022-02-23T11:56:32Z

api/catalog/management/commands/generatewaveforms.py

+        with self.tqdm(total=count) as progress:
+            for audio in audios:
+                try:
+                    audio.get_or_create_waveform()


I'm a little concerned about this approach; in particular, I'm worried about us hitting the upstream provider endpoints too quickly. Should we rate-limit ourselves and slow this process down, maybe as simply as sleeping for a quarter of a second after processing each record? I don't think we're in a big rush. If we run the command in a screen session, we can leave it running and keep monitoring it for however long it predicts the full list of audio will take.

I would be okay with adding a delay at this step 🙂 0.25 seconds is a small fraction of the time it takes to actually retrieve + compute the waveform, so that seems fine.

I was noticing that as I ran it locally. I almost suggested doing something like a speed limit (1 row per second but include the time spent processing the previous row in the "wait" time between each row... or something?) but decided that was probably too complicated/overkill (though if there's some kind of utility for already doing that easily it would be sweet to use it).
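
The "speed limit" idea above (a minimum interval per row, with processing time counted toward the wait) can be sketched as a small generator. This is a minimal illustration, not the code from this PR: rate_limited and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def rate_limited(
    items: Iterable[T],
    min_interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items no faster than one per `min_interval` seconds.

    Time the caller spends processing an item counts toward the interval,
    so a slow item is not followed by an extra full-length pause.
    """
    last_start = None
    for item in items:
        if last_start is not None:
            elapsed = clock() - last_start
            if elapsed < min_interval:
                sleep(min_interval - elapsed)
        last_start = clock()
        yield item
```

A loop such as `for audio in rate_limited(audios, 1.0): ...` would then process at most one record per second. A fixed sleep of 0.25 seconds after each record, as first suggested, is simpler and may be enough here.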

AetherUnbound · 2022-02-23T18:55:30Z

Unit tests would also be nice to have for this, especially if my suggestions end up getting implemented (in particular the start/stop/resume aspect) and the complexity of it increases.

Agreed, I'll try to start on those if I have time before I'm out. Initially this was really straightforward, but then I added error handling, and with the other suggestions it definitely makes sense to do!

sarayourfriend · 2022-02-23T19:23:04Z

@AetherUnbound If you're okay with handing this off and run out of time, I'm happy to pick this up wherever you leave off, no sweat, just let me know.

AetherUnbound · 2022-02-25T18:23:10Z

@sarayourfriend I don't think I'm going to end up getting to this before I'm out - please feel free to take it over!

sarayourfriend · 2022-02-25T19:10:08Z

@AetherUnbound Sure thing!

sarayourfriend · 2022-03-01T14:21:39Z

@obulat @dhruvkb this should be ready for review now after adding pagination and unit tests

dhruvkb

This looks good to me. A few lint fixes pointed out by the check and we could merge this!

dhruvkb · 2022-03-01T14:24:45Z

api/Pipfile

@@ -14,6 +14,7 @@ sphinx = "*"
 sphinx-autobuild = "*"
 furo = "*"
 myst-parser = "*"
+factory-boy = "*"


This is a nice addition.

zackkrida · 2022-03-02T16:41:52Z

@sarayourfriend Seeing three remaining lint errors in the latest run:

api/test/unit/management/commands/generatewaveforms_test.py:53:5: F841 local variable 'waveforms' is assigned to but never used
api/test/factory/models/media.py:6:1: F401 'catalog.api.models.media.AbstractMedia' imported but unused
api/test/factory/models/media.py:21:58: E741 ambiguous variable name 'l'

sarayourfriend · 2022-03-02T20:21:28Z

Thanks @zackkrida, fixed and pushed. Though now there are some conflicts I need to resolve 👍

AetherUnbound

I can't approve my own PR 😅 but here's a few comments, nothing to stop a merge though!

AetherUnbound · 2022-03-07T19:41:58Z

api/catalog/management/commands/generatewaveforms.py

+        audios = Audio.objects.exclude(
+            identifier__in=existing_waveform_audio_identifiers_query
+        ).order_by("id")


Are we taking this approach because Audio and AudioAddOn don't have a foreign key relationship?

By "this approach" what do you mean? The inner select on the add ons table?

You're right that the tables don't have an FK relationship, at least not one the DB knows about. Trying to guess what your question was about: I don't think there's a JOIN we could do here, just an inner select. (For clarity, for anyone reading this who isn't familiar with the Django ORM: the inner select will be executed in the DB as an actual inner SELECT, not Python-side.)

Ah, I meant that I could have sworn we could do something like

Audio.objects.filter(audioaddon__waveform_peaks__isnull=True)

if these were foreign keys. They're not, so I don't think Django would be able to interpret that relationship appropriately.

And sorry for the lack of context, I had that initially but second guessed myself and deleted the entire comment 🤦‍♀️

Ohh yes, no, that's not possible unfortunately, because without the explicit relationship field Django won't add the reverse field relationships you'd need to make that JOIN.
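
The inner-select pattern discussed in this thread can be shown in plain SQL, using sqlite3 from the standard library. The schema below is a hypothetical, simplified stand-in: the column audio_identifier and the IS NOT NULL filter are assumptions for illustration and may not match the real tables or the real subquery.

```python
import sqlite3

# Hypothetical, simplified schema: the real tables and column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audio (id INTEGER PRIMARY KEY, identifier TEXT NOT NULL);
    CREATE TABLE audioaddon (
        audio_identifier TEXT NOT NULL PRIMARY KEY,  -- no FOREIGN KEY to audio
        waveform_peaks TEXT
    );
    INSERT INTO audio (identifier) VALUES ('a'), ('b'), ('c');
    INSERT INTO audioaddon VALUES ('a', '[0.1, 0.9]'), ('b', NULL);
    """
)

# Select audio rows whose identifier has no stored waveform. The subquery runs
# inside the database; nothing is joined or filtered on the Python side.
rows = conn.execute(
    """
    SELECT identifier FROM audio
    WHERE identifier NOT IN (
        SELECT audio_identifier FROM audioaddon
        WHERE waveform_peaks IS NOT NULL
    )
    ORDER BY id
    """
).fetchall()
pending = [identifier for (identifier,) in rows]
```

Here 'a' is excluded because it already has peaks, while 'b' (an add-on row without peaks) and 'c' (no add-on row at all) remain to be processed.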

AetherUnbound · 2022-03-07T19:43:03Z

api/test/factory/models/audio.py

+from factory.django import DjangoModelFactory
+
+
+class AudioFactory(MediaFactory):


Super neat!

api/test/run_test.sh

AetherUnbound · 2022-03-07T19:45:32Z

api/test/unit/management/commands/generatewaveforms_test.py

+    call_generatewaveforms()
+
+    assert_all_audio_have_waveforms()


I love this approach!

api/test/unit/management/commands/generatewaveforms_test.py

zackkrida · 2022-03-07T21:43:14Z

@AetherUnbound I think your symbolic approval here would be fine, once you think it's good to go I can merge it, if no one else reviews it by then.

stacimc · 2022-03-07T21:56:31Z

This looks great! Manually testing with interrupts, with the rate limit enabled/disabled, and setting a limit to let the batch complete all seem to be working great 🎉

Other than merge conflicts, I think the test_audio_success_examples test may also need to be updated. When I run the api-tests locally I get something like:

Differing items:
E         {'peaks': [0.61275, 0.19593, 0.74023, 0.45534, 0.52315, 0.40059, ...]} != {'peaks': []}

For the audio detail & search endpoints.

AetherUnbound · 2022-03-07T22:20:37Z

I'm also getting the error Staci is getting, as well as:

in_val = '\n# Search for music titled "Wish You Were Here" by The.madpix.project\ncurl  "http://localhost:8000/v1/audio/?title=Wish%20You%20Were%20Here&creator=The.madpix.project"\n'
out_val = {'application/json': {'page': 1, 'page_count': 0, 'page_size': 20, 'result_count': 1, ...}}

    @pytest.mark.parametrize("in_val, out_val", list(audio_mappings.items()))
    def test_audio_success_examples(in_val, out_val):
        res = execute_request(in_val)
>       assert res == out_val["application/json"]
E       AssertionError: assert {'page': 1, '...ount': 1, ...} == {'page': 1, '...ount': 1, ...}
E         Omitting 4 identical items, use -vv to show
E         Differing items:
E         {'results': [{'category': 'music', 'creator': 'The.madpix.project', 'creator_url': 'https://www.jamendo.com/artist/441585/the.madpix.project', 'detail_url': 'http://localhost:8000/v1/audio/8624ba61-57f1-4f98-8a85-ece206c319cf/', ...}]} != {'results': [{'category': 'music', 'creator': 'The.madpix.project', 'creator_url': 'https://www.jamendo.com/artist/441585/the.madpix.project', 'detail_url': 'http://localhost:8000/v1/audio/8624ba61-57f1-4f98-8a85-ece206c319cf/', ...}]}
E         Use -v to get the full diff

sarayourfriend · 2022-03-08T01:27:41Z

@stacimc @AetherUnbound does that happen on a fresh version of the API or only with existing data after running generatewaveforms?

If I do just recreate && just init && just api-tests they all pass. But if I run just dj "generatewaveforms --max_records=3" and then run just api-tests again they fail.

I think this is because our API integration tests rely on the existing database rather than (what would be ideal) building their own data that then gets torn down after each test iteration.

Is fixing that in scope for this issue? I could probably find a workaround but it would likely cheapen the tests 😕 The easiest thing I can think of would be to just delete the peaks property from both sides of the comparison.
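
For reference, the workaround floated above (dropping peaks from both sides before comparing) could look roughly like the helper below. This is a hypothetical sketch, not something added in this PR; without_peaks is an invented name and the payloads are made-up examples.

```python
from typing import Any


def without_peaks(payload: Any) -> Any:
    """Return a copy of `payload` with every "peaks" key removed.

    Recurses into dicts and lists so nested results are handled too.
    """
    if isinstance(payload, dict):
        return {
            key: without_peaks(value)
            for key, value in payload.items()
            if key != "peaks"
        }
    if isinstance(payload, list):
        return [without_peaks(value) for value in payload]
    return payload


# Made-up example payloads that differ only in their peaks.
expected = {"result_count": 1, "results": [{"id": "x", "peaks": []}]}
actual = {"result_count": 1, "results": [{"id": "x", "peaks": [0.61, 0.19]}]}
same_apart_from_peaks = without_peaks(expected) == without_peaks(actual)
```

As noted in the comment, this weakens the tests: a regression in the peaks field itself would go unnoticed.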

obulat

I tried generating the waveforms and running together with the frontend locally, and I got myself some instant waveforms! :)
The problem is that there is a mismatch between what the front end assumes and what the API returns when no waveforms are available. If I understand correctly, the API returns an empty list [] if the waveform is not available.
The frontend expects either an array with at least 1 item where all values are between 0 and 1, or a falsy value (then it generates random peaks).
When the frontend gets an empty list for peaks, [], it does not show any waveform, the validator throws an error, and there are lots of NaN errors when it is trying to calculate the waveform peaks.
I am not sure whether it should be fixed in this PR, or a separate one, though, so I'm approving :)
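
The contract described above (either a non-empty list of values between 0 and 1, or a falsy value) can be written down as a small check. This is only a sketch of that contract as stated in the comment: normalize_peaks is a hypothetical helper, and neither the API nor the frontend is known to implement it this way.

```python
from typing import Optional, Sequence


def normalize_peaks(peaks: Optional[Sequence[float]]) -> Optional[list]:
    """Map "no waveform" to None and validate real peaks.

    An empty or missing list becomes None (a falsy value), so a consumer can
    fall back to placeholder peaks instead of receiving [].
    """
    if not peaks:
        return None
    if not all(0 <= peak <= 1 for peak in peaks):
        raise ValueError("every peak must be between 0 and 1")
    return list(peaks)
```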

AetherUnbound · 2022-03-08T15:15:59Z

Ah, my bad! I didn't run recreate before testing. I'm doing so now, as soon as that finishes OK I'll fix the Pipfile.lock conflicts and merge 😄

sarayourfriend · 2022-03-08T16:49:12Z

@obulat that's a bug introduced by the previous PR that actually added the peaks data to the API. Let's fix it in a separate issue.

stacimc

I think this is because our API integration tests rely on the existing database rather than (what would be ideal) building their own data that then gets torn down after each test iteration.

My bad, I didn't look closely enough at the tests. They are passing after recreating/reinitializing everything. That's definitely not ideal but absolutely a separate issue and shouldn't be fixed here, agreed!

AetherUnbound requested a review from a team as a code owner February 23, 2022 00:50

AetherUnbound requested review from obulat and dhruvkb February 23, 2022 00:50

AetherUnbound added 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🤖 aspect: dx Concerns developers' experience with the codebase labels Feb 23, 2022

AetherUnbound requested a review from a team February 23, 2022 00:50

AetherUnbound removed the request for review from a team February 23, 2022 00:50

AetherUnbound mentioned this pull request Feb 23, 2022

Update Airflow to 2.2.4 WordPress/openverse-catalog#372

Merged

7 tasks

dhruvkb approved these changes Feb 23, 2022

sarayourfriend reviewed Feb 23, 2022

sarayourfriend self-assigned this Feb 28, 2022

sarayourfriend requested a review from dhruvkb March 1, 2022 14:21

dhruvkb approved these changes Mar 1, 2022

sarayourfriend force-pushed the feature/generate-waveforms-django-command#529 branch from dd246ac to b285290 Compare March 2, 2022 20:21

sarayourfriend force-pushed the feature/generate-waveforms-django-command#529 branch from b285290 to 3d7fb7a Compare March 2, 2022 20:23

AetherUnbound commented Mar 7, 2022

zackkrida requested a review from obulat March 7, 2022 21:42

obulat approved these changes Mar 8, 2022

zackkrida added the 💬 talk: discussion Open for discussions and feedback label Mar 8, 2022

AetherUnbound and others added 13 commits March 8, 2022 08:21

Add django-tqdm dependency

b46a656

Add waveform generation command

98768e2

Incorporate error handling

aaebea6

Paginate generatewaveforms

ae7a1b0

Lint and test different exception types

b2959f4

Lint

aab0205

Fix bad merge

4d25dc6

Fix test cases

9c053cc

Correctly handle keyboard interrupt and impose self rate limit

9ad8e90

Move comments to prevent line breaks

f665ea4

Use clearer argument names

71ddcec

Kudos to @AetherUnbound for coming up with much better ones than I was able to.

Add back in destructive logging until we fix this for CI

c412147

Re-lock pipfile

3d3c05e

AetherUnbound force-pushed the feature/generate-waveforms-django-command#529 branch from 753182c to 3d3c05e Compare March 8, 2022 16:47

stacimc approved these changes Mar 8, 2022

sarayourfriend merged commit f74c128 into main Mar 8, 2022

sarayourfriend deleted the feature/generate-waveforms-django-command#529 branch March 8, 2022 17:06

sarayourfriend mentioned this pull request Feb 22, 2023

API tests rely on pre-existing data WordPress/openverse#725

Open

1 task
