
Migrate the test suite to vcrpy #541

Merged
28 commits merged into GeneralMills:master on Oct 31, 2022

Conversation

@Terseus (Collaborator) commented Oct 30, 2022

Fixes #503

Using vcrpy we can now deterministically test the backend responses without making any real requests.
This has a lot of advantages:

  • Removes false negatives and skipped tests caused by 429 responses.
  • The tests are much faster now; all 23 tests run in about 1 second on my PC.
  • Allows refactoring anything that doesn't involve a change in the requests made.
  • If the backend response changes, we can regenerate the vcrpy cassettes with pytest --record-mode=rewrite.

There are some gotchas though:

  • The backend can be unpredictable when asked for shorter timeframes (e.g. 10 days), so we chose to do yearly requests in almost all the tests.
  • Because of that, the results to check can be quite big, so we don't compare the full result; instead we test the first and last rows and the length of the result.
  • There is a lot of YAML now from the generated cassettes. I've tried to make GitHub ignore them by marking them with linguist-vendored in .gitattributes (see the linguist doc and this Stackoverflow response), but we won't know if it works until the change is merged and some days have passed.
    • Alternatively, we can merge only the first two commits in a separate PR, let some time pass, and come back to this PR to see if it worked.
  • Writing new tests requires some knowledge that isn't documented right now (see the sketch after this list):
    • By default pytest-recording runs with --record-mode=none, so any new test without a cassette fails; this is intended to prevent generating cassettes inadvertently. To generate a new cassette you need to use --record-mode=once.
    • The cassette must be deleted after any change that requires modifying the requests made to the backend, and the error message may not communicate this clearly enough to anyone who doesn't know vcrpy.
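
For illustration, a minimal vcrpy-backed test under pytest-recording could look like the sketch below. The test name, keyword, and expected length are made up; `@pytest.mark.vcr` and the record-mode flags are pytest-recording's actual interface.

```python
import pytest

from pytrends.request import TrendReq


# pytest-recording replays the matching cassette from tests/cassettes/.
@pytest.mark.vcr
def test_interest_over_time_year():
    pytrends = TrendReq()
    # A yearly timeframe keeps the recorded response deterministic.
    pytrends.build_payload(["pizza"], timeframe="2021-01-01 2021-12-31")
    df = pytrends.interest_over_time()
    # Compare the length and boundary rows instead of the full frame.
    assert len(df) == 52
```

To record the cassette for a new test, run `pytest --record-mode=once`; after a backend change, regenerate existing cassettes with `pytest --record-mode=rewrite` and commit the updated YAML.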

This has some advantages:
* prevents accidentally importing the test files, which may have
  different dependencies than the runtime files.
* tests are no longer shipped inside the package, which reduces the
  package size.
We're going to use vcrpy via pytest-recording, which will record all
HTTP interactions in YAML files inside the `tests/cassettes` path.

Since we don't want to show them in diffs or mark YAML as the main
project language, we configure the `.gitattributes` accordingly.

The `linguist-vendored` attribute has been taken from the GitHub
linguist docs:
https://github.com/github/linguist/blob/master/docs/overrides.md#vendored-code
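
The exact pattern isn't quoted in this PR; based on the description, the entry would be something like the following (the path glob is an assumption):

```gitattributes
# Assumed pattern: mark recorded cassettes as vendored so linguist
# excludes them from language stats and diffs collapse them by default.
tests/cassettes/** linguist-vendored
```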
Now we can test the expected generated content at `token_payload`.
Now we can test the expected payload at `related_queries_widget_list`.
The downside is that the expected payload is a little bit of a monster.

Also we start defining explicit timeframes in the tests where the timeframe
determines the response, so we get deterministic responses.
Now we can test the value returned by the backend; however, we don't
compare the _full_ response because it's a `pd.DataFrame` with 52 rows.
Instead we just compare the length, the first 3 rows, and the last 3
rows, and from that we can infer that the result is correct.

Later we can add more fine-grained tests if we want, but I think this is
a good starting point.
Also, the packages are now defined in `pyproject.toml`; otherwise it
fails when there is an HTML coverage report in `htmlcov`.
As for why pip takes `html` as a package, I have no clue.
We introduce two tools to reduce the ever-growing test boilerplate
(sketched below):
* `build_interest_over_time_df`.
* `ExpectedResult`, a dataclass that encapsulates everything that's tested
  against a `pd.DataFrame`.
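
The PR doesn't reproduce these helpers; a minimal sketch of what `ExpectedResult` might encapsulate, going by the description above (field and method names are assumptions):

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class ExpectedResult:
    """Everything asserted against a result DataFrame (names assumed)."""

    length: int
    head: pd.DataFrame  # expected first rows
    tail: pd.DataFrame  # expected last rows

    def assert_matches(self, df: pd.DataFrame) -> None:
        # Compare the size and the boundary rows instead of the full frame.
        assert len(df) == self.length
        pd.testing.assert_frame_equal(df.head(len(self.head)), self.head)
        pd.testing.assert_frame_equal(df.tail(len(self.tail)), self.tail)
```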
We split the original `test_related_topics` into 3 tests: one to test
the result dict shape, one to test the 'top' results, and one to test
the 'rising' results.
There should be a special circle of hell for the programmers who design
this kind of response in an API.
Here we took a different approach because there is no observable
behavior in the result; in this case it's in the requests made.
We switch to the responses library to test the exact requests made (see
the sketch below).

Also we do it with `trending_searches` because it's the easiest of all
the calls to mock, and because all the code passes through `_get_data`,
which is the one that makes the requests.

Also we wrote just one test instead of two; there's no point in testing
only two methods, since we can either test only one and hope for the
best or test them all.
While testing them all would be better, it's also a time-consuming chore
with very little value.
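
As a sketch of this technique with the responses library: the endpoint below is hypothetical and a plain `requests.get` stands in for the pytrends call, since the PR doesn't quote either, but `responses.add` and `responses.calls` are the library's real interface.

```python
import requests
import responses

# Hypothetical endpoint for illustration; the real URL lives inside pytrends.
API_URL = "https://example.com/trends/api/trending"


@responses.activate
def test_exact_request_made():
    responses.add(responses.GET, API_URL, json={"default": {}}, status=200)

    # Stand-in for the code under test, which funnels through _get_data.
    requests.get(API_URL, params={"hl": "en-US", "tz": "360"})

    # responses records every HTTP call, so we can assert the exact request.
    assert len(responses.calls) == 1
    assert responses.calls[0].request.url == f"{API_URL}?hl=en-US&tz=360"
```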
They're not needed anymore because we now test all the response fields
and index dtypes.
Now tests in CI cannot use the network, so no test without a vcrpy
cassette or responses patching should silently slip through.
* remove the `tool.coverage.run.source` directive and the `html` section in
  `pyproject.toml`; they were useless.
* add coverage by default in the pytest config.
@Terseus (Collaborator, Author) commented Oct 30, 2022

@emlazzarin
Please note, I don't consider this 100% finished until we add some docs on how to write/maintain/use the test suite. However, I don't know if we should simply add it to README.md or create a CONTRIBUTING.md file along with more content; what do you prefer?

@emlazzarin (Collaborator) commented Oct 30, 2022 via email

@Terseus (Collaborator, Author) commented Oct 30, 2022

@emlazzarin

> Is comparing the full results very slow? Is that why we only check the first+last+size?

No, it's because the expected result variables can become quite big, since we ask for a year of data; I did that because the daily data from the API is not trustworthy and I fear it might change.
Maybe I was paranoid and we can change the tests to ask for a week of data or so and compare the full results; then we could get rid of the `ExpectedResult` class.
What do you think?

> Let’s link from the README to a new CONTRIBUTING file we can start from scratch, just so the doc has the freedom to be as expansive as necessary.

Perfect, do you want me to start it with info about working with the test suite in this PR?

@Terseus (Collaborator, Author) commented Oct 30, 2022

Ok, I reduced the date ranges in the tests where we can use daily data, but that was only 5 tests; for all the others I don't think it's worth comparing the full result, since it would mean hardcoding DataFrames of between 25 and 250 rows.
I think it's okay right now.

Looks like GitHub doesn't support splitting a command across more than
one line.
@emlazzarin emlazzarin merged commit de31fb0 into GeneralMills:master Oct 31, 2022
@Terseus Terseus deleted the vcr-based-unit-tests branch November 1, 2022 08:36