This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add support for other media types to popularity calculations #112

Merged
merged 8 commits into from
Jul 7, 2021

Conversation

obulat
Contributor

@obulat obulat commented Jun 25, 2021

One of the data points we use to rank the search results is the popularity of an item. We use the provider popularity metrics and normalize them so that the popularity value is between 0 and 1. This way, if provider A's most popular item has 100 views, and provider B's most popular item has 10 000 views, we can still compare them.
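To make the normalization concrete, here is a minimal sketch. It is not necessarily the formula the catalog uses: it assumes a rational function `value / (value + constant)`, with a per-provider `constant` chosen so that a reference value (for example, a high-percentile view count for that provider) maps to a fixed score. The function names, parameter names, and the 0.85 reference score are all hypothetical.

```python
def popularity_constant(reference_value: float, reference_score: float = 0.85) -> float:
    """Return the constant that makes `reference_value` normalize to `reference_score`.

    Assumption: both names and the default score are illustrative only.
    """
    if reference_value <= 0:
        raise ValueError("reference_value must be positive")
    if not 0 < reference_score < 1:
        raise ValueError("reference_score must be strictly between 0 and 1")
    # Solve reference_value / (reference_value + c) == reference_score for c.
    return reference_value * (1 - reference_score) / reference_score


def normalized_popularity(raw_value: float, constant: float) -> float:
    """Map a non-negative raw metric into the range [0, 1)."""
    if raw_value < 0:
        raise ValueError("raw_value must be non-negative")
    if constant <= 0:
        raise ValueError("constant must be positive")
    return raw_value / (raw_value + constant)
```

With this sketch, an item with 100 views at provider A (reference value 100) and an item with 10 000 views at provider B (reference value 10 000) receive the same score, which is what makes them comparable in ranking.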

The popularity calculations are done in three DAGs:

  • src/cc_catalog_airflow/dags/recreate_image_popularity_calculation.py
  • src/cc_catalog_airflow/dags/refresh_all_image_popularity_data.py
  • src/cc_catalog_airflow/dags/refresh_image_view_data.py

This PR adds audio equivalents for all of them. The code is repetitive: it is essentially a copy of the image DAGs with audio as the media_type parameter. It should be refactored after we confirm that all the DAGs work in production.
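One possible direction for that refactor, sketched without Airflow so it stays self-contained: derive the three DAG identifiers from a single `media_type` value instead of keeping one file per media type. This assumes the DAG ids follow the file names listed above; only `recreate_audio_popularity_calculation` is confirmed by this PR, and the other audio names, the `PopularityDagIds` class, and the `popularity_dag_ids` helper are hypothetical.

```python
from dataclasses import dataclass

# Assumption: these are the only media types the popularity DAGs handle.
MEDIA_TYPES = ("image", "audio")


@dataclass(frozen=True)
class PopularityDagIds:
    """Identifiers of the three popularity DAGs for one media type."""

    recreate_calculation: str
    refresh_all_data: str
    refresh_view_data: str


def popularity_dag_ids(media_type: str) -> PopularityDagIds:
    """Build the DAG ids for `media_type` from one naming pattern."""
    if media_type not in MEDIA_TYPES:
        raise ValueError(f"Unsupported media type: {media_type!r}")
    return PopularityDagIds(
        recreate_calculation=f"recreate_{media_type}_popularity_calculation",
        refresh_all_data=f"refresh_all_{media_type}_popularity_data",
        refresh_view_data=f"refresh_{media_type}_view_data",
    )
```

A DAG factory could then loop over `MEDIA_TYPES` and register one set of DAGs per media type, so adding another type would be a one-line change.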

These calculations create a materialized view in the main database that includes the popularity percentile. The ingestion server in the Openverse API then uses this view to save the data to the API database and Elasticsearch.

To test this PR, you need an audio provider API script, such as the one in #113.

  1. Run the Docker container (instructions are in the main README)
  2. Open the Airflow web server (http://localhost:9090) and log in using 'airflow' for both username and password.
  3. Start the Audio API provider script DAG and tsv_to_postgres_loader DAG.
  4. After both finish successfully, run the recreate_audio_popularity_calculation DAG.
  5. In the end, you should have a Postgres database with a table for audio and a corresponding materialized view that also contains a popularity metrics column.

Signed-off-by: Olga Bulat obulat@gmail.com

obulat added 2 commits June 25, 2021 08:26
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
@obulat obulat requested a review from a team as a code owner June 25, 2021 10:35
@obulat obulat requested review from zackkrida and krysal and removed request for a team June 25, 2021 10:35
obulat added 6 commits June 25, 2021 14:14
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
@zackkrida
Member

Hm, I'm getting an odd failure, more so with the Jamendo script than the popularity calculations, I believe:

postgres_1   | 2021-07-06 17:02:18.399 UTC [3380] ERROR:  spiexceptions.BadCopyFileFormat: extra data after last expected column
postgres_1   | 2021-07-06 17:02:18.399 UTC [3380] CONTEXT:  Traceback (most recent call last):
postgres_1   | 	  PL/Python function "table_import_from_s3", line 20, in <module>
postgres_1   | 	    options
postgres_1   | 	PL/Python function "table_import_from_s3"
postgres_1   | 2021-07-06 17:02:18.399 UTC [3380] STATEMENT:
postgres_1   | 	SELECT aws_s3.table_import_from_s3(
postgres_1   | 	  'provider_data_audio_20210706T170100',
postgres_1   | 	  '',
postgres_1   | 	  'DELIMITER E''	''',
postgres_1   | 	  'cccatalog-storage',
postgres_1   | 	  'audio/db_loader_staging/20210706T170100/jamendo_audio_20210706165612.tsv',
postgres_1   | 	  'us-east-1'
postgres_1   | 	);
postgres_1   |

I saw an error like this once with production, but it was when I accidentally had an extra column in my TSV file. I wonder what it could be here? Let me know if there's any other useful information I could give you. I'd love to be able to find the jamendo_audio_20210706165612.tsv file, but I have no idea how the local s3 stuff works for viewing the files on the bucket.
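Since "extra data after last expected column" means at least one row has more fields than the target table has columns, a quick way to narrow it down is to scan the TSV for rows whose column count is unexpected. The following is a rough sketch, assuming plain tab-separated lines with no escaping rules applied; the helper name and the expected column count are hypothetical and would need to match the actual loading table.

```python
from typing import Iterable, Iterator, Tuple


def rows_with_unexpected_width(
    lines: Iterable[str], expected_columns: int
) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, column_count) for rows that do not have
    exactly `expected_columns` tab-separated fields."""
    for line_number, line in enumerate(lines, start=1):
        # Strip only the line terminator so trailing empty fields still count.
        column_count = len(line.rstrip("\r\n").split("\t"))
        if column_count != expected_columns:
            yield line_number, column_count


# In-memory sample; a real check would iterate over the opened TSV file instead.
sample = ["id\turl\ttitle\n", "1\thttp://a\tSong\textra\n", "2\thttp://b\n"]
bad_rows = list(rows_with_unexpected_width(sample, expected_columns=3))
```

Here `bad_rows` lists the second line (four fields) and the third line (two fields), which is the kind of row that would trip the import.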

I do have a good suspicion that the popularity code will work 😄

@obulat
Contributor Author

obulat commented Jul 6, 2021

Are you using the most recent version of the Jamendo branch? It had some errors related to the ingestion type column, which I fixed in later commits.

zackkrida
zackkrida previously approved these changes Jul 6, 2021
Member

@zackkrida zackkrida left a comment

✅ Got it! I think my issue was related to forgetting the docker override file. Code looks good as well.

@obulat obulat changed the base branch from add_audio_db to main July 7, 2021 03:09
@obulat obulat dismissed zackkrida’s stale review July 7, 2021 03:09

The base branch was changed.

@obulat obulat changed the base branch from main to add_audio_db July 7, 2021 03:10
@obulat obulat merged commit 3639d77 into add_audio_db Jul 7, 2021
@obulat obulat deleted the audio_popularity branch July 7, 2021 03:12
@obulat obulat restored the audio_popularity branch July 7, 2021 03:20
obulat added a commit that referenced this pull request Jul 7, 2021
* Add support for other media types to popularity calculations

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Fix linting errors

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Fix linting errors

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Fix linting errors

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Add audio popularity view sql

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Add audio popularity view sql to Dockerfile

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Finish renaming functions

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Add two more audio popularity dags

Signed-off-by: Olga Bulat <obulat@gmail.com>
obulat added a commit that referenced this pull request Jul 7, 2021
…124)

* Add support for other media types to popularity calculations

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Add audio popularity view sql

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Add audio popularity view sql to Dockerfile

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Finish renaming functions

Signed-off-by: Olga Bulat <obulat@gmail.com>

* Add two more audio popularity dags

Signed-off-by: Olga Bulat <obulat@gmail.com>
@zackkrida zackkrida added the ✨ goal: improvement Improvement to an existing user-facing feature label Aug 12, 2021