Add support for other media types to popularity calculations #112
Signed-off-by: Olga Bulat <obulat@gmail.com>
Hm, I'm getting an odd failure, more so with the Jamendo script than the popularity calculations, I believe:
postgres_1 | 2021-07-06 17:02:18.399 UTC [3380] ERROR: spiexceptions.BadCopyFileFormat: extra data after last expected column
postgres_1 | 2021-07-06 17:02:18.399 UTC [3380] CONTEXT: Traceback (most recent call last):
postgres_1 | PL/Python function "table_import_from_s3", line 20, in <module>
postgres_1 | options
postgres_1 | PL/Python function "table_import_from_s3"
postgres_1 | 2021-07-06 17:02:18.399 UTC [3380] STATEMENT:
postgres_1 | SELECT aws_s3.table_import_from_s3(
postgres_1 | 'provider_data_audio_20210706T170100',
postgres_1 | '',
postgres_1 | 'DELIMITER E'' ''',
postgres_1 | 'cccatalog-storage',
postgres_1 | 'audio/db_loader_staging/20210706T170100/jamendo_audio_20210706165612.tsv',
postgres_1 | 'us-east-1'
postgres_1 | );
I saw an error like this once with production, but it was when I accidentally had an extra column in my TSV file. I wonder what it could be here? Let me know if there's any other useful information I could give you. I'd love to be able to find the cause. I do have a good suspicion that the popularity code will work 😄
✅ Got it! I think my issue was related to forgetting the docker override file. Code looks good as well.
* Add support for other media types to popularity calculations Signed-off-by: Olga Bulat <obulat@gmail.com>
* Fix linting errors Signed-off-by: Olga Bulat <obulat@gmail.com>
* Fix linting errors Signed-off-by: Olga Bulat <obulat@gmail.com>
* Fix linting errors Signed-off-by: Olga Bulat <obulat@gmail.com>
* Add audio popularity view sql Signed-off-by: Olga Bulat <obulat@gmail.com>
* Add audio popularity view sql to Dockerfile Signed-off-by: Olga Bulat <obulat@gmail.com>
* Finish renaming functions Signed-off-by: Olga Bulat <obulat@gmail.com>
* Add two more audio popularity dags Signed-off-by: Olga Bulat <obulat@gmail.com>
…124)
* Add support for other media types to popularity calculations Signed-off-by: Olga Bulat <obulat@gmail.com>
* Add audio popularity view sql Signed-off-by: Olga Bulat <obulat@gmail.com>
* Add audio popularity view sql to Dockerfile Signed-off-by: Olga Bulat <obulat@gmail.com>
* Finish renaming functions Signed-off-by: Olga Bulat <obulat@gmail.com>
* Add two more audio popularity dags Signed-off-by: Olga Bulat <obulat@gmail.com>
One of the data points we use to rank the search results is the popularity of an item. We use the provider popularity metrics and normalize them so that the popularity value is between 0 and 1. This way, if provider A's most popular item has 100 views, and provider B's most popular item has 10 000 views, we can still compare them.
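To make the normalization idea concrete, here is a minimal sketch of one way to map raw provider metrics into the 0 to 1 range. It is an assumption for illustration, not the code in this PR: the function names, the 0.85 percentile, and the `raw / (raw + constant)` form are all hypothetical choices. The per-provider constant is derived from a percentile of that provider's own metric values, so items from providers with very different scales end up comparable.

```python
import math
from typing import Sequence


def percentile_value(values: Sequence[float], percentile: float) -> float:
    """Return the nearest-rank percentile of a non-empty list of metric values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]


def popularity_constant(values: Sequence[float], percentile: float = 0.85) -> float:
    """Per-provider constant (hypothetical): chosen so that an item sitting at
    the given percentile gets a normalized score equal to that percentile."""
    return ((1 - percentile) / percentile) * percentile_value(values, percentile)


def normalized_popularity(raw: float, constant: float) -> float:
    """Map a raw metric (views, likes, ...) into the half-open range [0, 1)."""
    if raw <= 0:
        return 0.0
    return raw / (raw + constant)


# Provider A tops out at 100 views, provider B at 10 000 views.
provider_a_views = [1, 10, 50, 100]
provider_b_views = [100, 1000, 5000, 10000]

constant_a = popularity_constant(provider_a_views)
constant_b = popularity_constant(provider_b_views)

top_a = normalized_popularity(100, constant_a)
top_b = normalized_popularity(10000, constant_b)
```

With this scheme the most popular item of each provider receives roughly the same score even though the raw view counts differ by two orders of magnitude, which is the property the paragraph above describes.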
The popularity calculations are done in three DAGs:
This PR adds an audio equivalent for each of them. The code is repetitive and is basically a copy of the image DAGs with audio as the media_type parameter. It should be refactored after we confirm that all the DAGs work fine in production.
These calculations create a materialized view in the main database with the popularity percentile, which is subsequently used by the ingestion server in the Openverse API to save the data to the API database and Elasticsearch.
To test this PR, you would need to have an Audio API script available like #113.
* tsv_to_postgres_loader DAG
* recreate_audio_popularity_calculation DAG

Signed-off-by: Olga Bulat <obulat@gmail.com>