This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Extract media type from staged tsv file name for loader #110

Merged
merged 14 commits into from
Jun 25, 2021

Conversation

obulat
Contributor

@obulat obulat commented Jun 24, 2021

The tsv_to_postgres_loader DAG reads data from TSV files, saves it to S3 and to an intermediate database, cleans it, and finally saves the clean data to the main database.

This PR adds the media type (currently image; audio is planned) to the TSV file name. Downstream tasks read the media type to select the correct functions and columns when loading the data into the database.
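As an illustration of the idea, here is a minimal sketch of reading the media type back out of a staged file name. The naming scheme `<provider>_<media_type>_<timestamp>.tsv`, the function name `extract_media_type`, and the `KNOWN_MEDIA_TYPES` set are assumptions made for this example; they are not taken from the PR's diff.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: the PR description mentions image now and audio in the future.
KNOWN_MEDIA_TYPES = {"image", "audio"}


def extract_media_type(tsv_path: str) -> Optional[str]:
    """Return the media type embedded in a staged TSV file name, or None.

    Assumes a hypothetical naming scheme:
    <provider>_<media_type>_<timestamp>.tsv
    """
    # Drop the directory part and the ".tsv" suffix.
    stem = PurePosixPath(tsv_path).stem
    parts = stem.split("_")
    # A file name without a media type has too few segments to inspect.
    if len(parts) < 3:
        return None
    # The media type is assumed to sit right before the timestamp.
    candidate = parts[-2]
    return candidate if candidate in KNOWN_MEDIA_TYPES else None
```

Returning `None` for names that carry no recognizable media type leaves it to the caller to decide on a fallback, rather than silently guessing one.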

The tasks of the tsv_to_postgres_loader DAG pass the media_type either through the file name or, when the file name is not available, through Airflow XCom messaging (xcom_push/xcom_pull).
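The fallback described above could look roughly like the sketch below. It assumes the same hypothetical `<provider>_<media_type>_<timestamp>.tsv` naming scheme, a hypothetical upstream task id `stage_oldest_tsv_file`, and a hypothetical XCom key `media_type`; only the `xcom_pull(task_ids=..., key=...)` call shape comes from Airflow's task instance API. A small stub stands in for the task instance so the snippet runs without Airflow installed.

```python
from typing import Optional

# Assumption: media types recognized in file names.
KNOWN_MEDIA_TYPES = {"image", "audio"}


def media_type_from_filename(tsv_file_name: Optional[str]) -> Optional[str]:
    """Read the media type from a name like provider_mediatype_timestamp.tsv."""
    if not tsv_file_name:
        return None
    parts = tsv_file_name.split("/")[-1].rsplit(".", 1)[0].split("_")
    if len(parts) >= 3 and parts[-2] in KNOWN_MEDIA_TYPES:
        return parts[-2]
    return None


def resolve_media_type(tsv_file_name, ti, upstream_task_id="stage_oldest_tsv_file"):
    """Prefer the file name; fall back to the value an upstream task pushed to XCom.

    `ti` is expected to behave like an Airflow TaskInstance. The task id and
    the "media_type" key are hypothetical names chosen for this sketch.
    """
    media_type = media_type_from_filename(tsv_file_name)
    if media_type is None:
        media_type = ti.xcom_pull(task_ids=upstream_task_id, key="media_type")
    return media_type


class StubTaskInstance:
    """Stand-in for an Airflow TaskInstance, for demonstration only."""

    def __init__(self, pushed):
        self._pushed = pushed

    def xcom_pull(self, task_ids=None, key=None):
        return self._pushed.get((task_ids, key))


ti = StubTaskInstance({("stage_oldest_tsv_file", "media_type"): "audio"})
from_name = resolve_media_type("flickr_image_20210624.tsv", ti)
from_xcom = resolve_media_type(None, ti)
```

Here `from_name` is taken from the file name and `from_xcom` from the stubbed XCom value, so a task that receives no file name still learns which media type it is handling.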

To test this PR you can:

  1. build the Docker containers and run all tests.
  2. build the Docker containers and run one of the image ingestion DAGs. The Flickr workflow (not the Flickr ingestion workflow!) runs comparatively fast but needs authentication; the Science Museum workflow does not require authentication. After the ingestion workflow completes, turn on the tsv_to_postgres_loader DAG. Once it completes, you will be able to see the ingested data in the openledger database.

Signed-off-by: Olga Bulat <obulat@gmail.com>

obulat added 12 commits June 17, 2021 15:44
…tore

Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
# Conflicts:
#	src/cc_catalog_airflow/dags/common/storage/image.py
#	src/cc_catalog_airflow/dags/common/storage/test_image.py
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
@obulat obulat requested a review from a team as a code owner June 24, 2021 14:08
@obulat obulat requested review from zackkrida and krysal and removed request for a team June 24, 2021 14:08
Member

@zackkrida zackkrida left a comment

Everything worked; I now have 30k Flickr images in my local db.

@obulat obulat changed the title from "Extract media type from stage tsv file name for loader" to "Extract media type from staged tsv file name for loader" Jun 24, 2021
obulat added 2 commits June 25, 2021 08:45
Signed-off-by: Olga Bulat <obulat@gmail.com>
Signed-off-by: Olga Bulat <obulat@gmail.com>
@obulat obulat mentioned this pull request Jun 25, 2021
Base automatically changed from extract_media_storage to main June 25, 2021 12:23
@obulat obulat merged commit 2c98e24 into main Jun 25, 2021
@obulat obulat deleted the extract_media_db branch June 25, 2021 12:26
@zackkrida zackkrida added the ✨ goal: improvement Improvement to an existing user-facing feature label Aug 12, 2021