
API tests rely on pre-existing data #725

Open
sarayourfriend opened this issue Mar 8, 2022 · 10 comments
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

@sarayourfriend
Collaborator

Description

The API tests rely on data pre-existing in the database being tested against.

This is not good practice as it relies on information that is neither controlled nor indicated by the tests themselves.

Instead, we should use factories (introduced in WordPress/openverse-api#530) to generate new data for each test case and use `@pytest.mark.django_db` to run each test in a transaction that is rolled back after the test case finishes. Essentially, we want fresh data for every test. Ideally the database instance used for testing wouldn't even be the one shared by the local running instance of the API, but that can introduce more infrastructure complexity than we want to take on.
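The "fresh data for every test" idea can be sketched in miniature without Django. In the real suite this role would be played by factory_boy factories plus the `@pytest.mark.django_db` marker (which wraps each test in a rolled-back transaction); every name and field below is illustrative only:

```python
# Framework-free sketch of "fresh data for every test". In the actual API
# tests this would be a factory_boy factory and @pytest.mark.django_db;
# make_audio and its fields are hypothetical stand-ins.
import itertools

_counter = itertools.count()


def make_audio(**overrides):
    """Build a fresh audio record for a single test, the way a factory would."""
    record = {
        "identifier": f"audio-{next(_counter)}",
        "title": "Generated title",
        "provider": "test_provider",
    }
    record.update(overrides)
    return record


def test_detail_view():
    # The test creates exactly the data it needs; nothing is assumed to
    # pre-exist, and with transaction rollback nothing leaks to the next test.
    audio = make_audio(title="Example track")
    assert audio["title"] == "Example track"
```

The point of the pattern is that every piece of data a test asserts against is visible in the test itself, so changing the sample data (or forgetting to load it) cannot break the suite.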

This will prevent issues like the one described here.

Reproduction

  1. Run `just recreate && just init && just api-tests` and ensure they pass
  2. Run `just recreate && just init && just dj "generatewaveforms --max_records=3" && just api-tests` and note that they fail because of unexpected data in the database under test.

Expectation

Test data should be isolated to specific test cases, should not leak across tests, and should not rely on external data that can change arbitrarily.

Resolution

  • 🙋 I would be interested in resolving this bug.
@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 🤖 aspect: dx Concerns developers' experience with the codebase labels Mar 8, 2022
@dhruvkb
Member

dhruvkb commented Mar 9, 2022

@sarayourfriend this is true for unit tests but integration tests should be using the actual underlying database so that we can ensure that the overall behaviour works as expected.

In my view, there shouldn't be any server-side mocking/cleanup etc. in the integration tests and they should be conducted from the PoV of a consumer of the API.

@sarayourfriend
Collaborator Author

In that case should they just use production data?

@AetherUnbound
Collaborator

In that case should they just use production data?

Do you mean copy some production data? Or interact with the production database?

@dhruvkb
Member

dhruvkb commented Mar 10, 2022

Interacting with the production database for running tests seems wrong. Why not just use the sample entries set up by `just init` for the tests?

@sarayourfriend
Collaborator Author

I guess my concern is over tests like the ones being introduced in WordPress/openverse-api#553, which rely on specific data that isn't indicated by the tests themselves (namely, the list of providers). Maybe there's a different way to write the tests that would request the provider list first and then use that, rather than hard-coding the provider list?
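One way to sketch that suggestion: the test first asks the API which providers exist and asserts against that response, so it adapts to whatever data happens to be loaded. The endpoint path and `source_name` field below are assumptions for illustration, not the actual Openverse API contract:

```python
# Sketch: derive the provider list from the API itself instead of
# hard-coding it in the test. The /v1/audio/stats/ path and the
# "source_name" key are illustrative assumptions.


def fetch_providers(api_get):
    """api_get: any callable taking a path and returning parsed JSON."""
    return {entry["source_name"] for entry in api_get("/v1/audio/stats/")}


def test_search_covers_known_providers():
    # Stub standing in for a real HTTP client against the running API.
    def fake_api_get(path):
        return [{"source_name": "jamendo"}, {"source_name": "wikimedia_audio"}]

    providers = fetch_providers(fake_api_get)
    # The assertion is now driven by the API's own answer, so the test no
    # longer breaks whenever the sample data changes.
    assert providers == {"jamendo", "wikimedia_audio"}
```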

Idk, I've never used integration tests like this before. When I did, it was on apps with user-generated content, so the integration tests would go through the full flow of creating and then consuming data from various users. I've never worked on this kind of catalog-type data set, so sorry if I'm projecting incorrect understandings about how the tests should behave!

@dhruvkb
Member

dhruvkb commented Mar 10, 2022

Oh no @sarayourfriend I completely agree that tests like the ones in WordPress/openverse-api#553 are not ideal. But as it stands right now, the API completely relies on the sample data loaded into it at the start for the complete functionality so integration tests keep breaking whenever we change the sample data.

@sarayourfriend
Collaborator Author

the API completely relies on the sample data loaded into it at the start for the complete functionality so integration tests keep breaking whenever we change the sample data.

Right, I guess I just want to find a way to clarify how to fix that for the tests. Is the expectation that we update the sample data in the CSVs? Is there an easy way to generate them? CSV is a fine format for what it's doing currently, but maintaining the data in there would be a pain: adding demo waveforms, for example, would be very annoying, and related models like `AudioAddOn` don't seem to have a clear place in the current sample data, at least not one that wouldn't require manually cross-referencing IDs across multiple CSVs. The `load_sample_data.sh` script is currently simple (which is good), but it would be hard to extend it to include cross-referenced data without maintaining duplicated UUIDs across multiple files (in this case just two, but possibly more in the future). Is there a simple way that I'm not seeing for us to fix that now that we have "side tables" present?

I can't find documentation about how they're created or maintained in the repo. I suspect they're maybe extracted from production data somehow, or maybe a local run of the ingestion process?

Could we change the way our fixture data is created to use Python factories and define our fixture data in code rather than purely declaratively like we have now? Or should we try to refresh them on a regular basis somehow?
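As a concrete sketch of the factory idea: a small code-defined generator could produce the cross-referenced rows in one place, so the shared UUID is generated once instead of hand-maintained in two CSVs. The table and field names here are illustrative, not the real schema:

```python
# Sketch: define sample data in code rather than static CSVs, so rows that
# reference each other (e.g. an audio record and its add-on) share a single
# generated UUID. Field names are hypothetical.
import uuid


def build_sample_rows(n=3):
    """Generate n audio rows plus matching add-on rows sharing identifiers."""
    audio_rows, addon_rows = [], []
    for i in range(n):
        identifier = str(uuid.uuid4())  # generated once, referenced in both "tables"
        audio_rows.append({"identifier": identifier, "title": f"Sample audio {i}"})
        addon_rows.append({"audio_identifier": identifier, "waveform_peaks": "[]"})
    return audio_rows, addon_rows
```

The output could then be serialized to the existing CSV layout (or loaded directly), so `load_sample_data.sh` stays simple while the cross-references are guaranteed consistent by construction.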

It also seems like a fix for the problem we ran into with the audio waveforms is to automatically load the sample data before the tests run. That's essentially what happens in CI already, because the tests run immediately after `just init`, but locally you can run into the edge case we saw with the waveform peaks.

@Prakharkarsh1

Hi @dhruvkb, I want to contribute to fixing this issue.

@dhruvkb
Member

dhruvkb commented Sep 10, 2022

@Prakharkarsh1 please go ahead. Feel free to comment here or alternatively ask in the #openverse channel in the Making WordPress Slack workspace if you need help.

@Prakharkarsh1

Yeah sure @dhruvkb thank you!

@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Feb 23, 2023
@obulat obulat added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels Mar 20, 2023
Projects
Status: 📋 Backlog
Development

No branches or pull requests

5 participants