Reduce permissions of default authentication scope #4372

sarayourfriend · 2024-05-22T06:20:16Z

Fixes

Description

This PR adds a new Privilege concept to the Openverse API, and allows us to "upgrade" access to fine-grained aspects of the API for certain requesters. I've used this to add further gradations to the limits on page size and pagination depth in line with the issue's requirements.

I've left comments on the PR to help explain some of the changes. The line count looks big, but it really is mostly changes to unit tests and reorganisation of the admin site configuration. The unit tests had some severe duplication of the media_type_config fixture. That is now fixed, which also cleans the test data up a bit. It makes the fixture consistent across the integration and unit tests, and makes the tests easier to reason about and work with (no more context switching over "which media type fixture is this one?").

I ended up touching these because the changes in the serializer at first assumed the request was always available. This works in practice, but not in tests, because we do not always instantiate the serializer with a fully and correctly configured request object. That's pretty frustrating, but fixing it would require even more changes outside the scope of this PR, so for now the serializer shamefully allows the request to be missing so that tests will not fail. I've fixed the issue for some tests, but really we should have a helper method for the tests to use that creates the serializer with particular data and a correctly configured authed or anonymous request. Or, even better, move more tests towards the integration level, rather than unit level, and forget about these implementation details of the serializer in the tests (where they really don't matter). Anyway, that's all a lot more work and this is messy enough as it is.

Otherwise, the bulk of the real changes are the new api.constants.privileges module and its data, along with the changes to the media serializer.

Testing Instructions

Run the app with just api/test and make anonymous requests, and see that limits are enforced. Page size should be limited to 20, and you shouldn't be able to go deeper than 240 results (12 pages of 20 results each).
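
The anonymous limits described here come down to simple arithmetic. As a rough sketch only (the function and constant names are illustrative and not taken from this PR's code; only the values 20 and 240 come from the description), a request stays within the limits when its page size is at most 20 and the deepest result it reaches is at most 240:

```python
# Hypothetical constants mirroring the anonymous limits described above.
ANON_MAX_PAGE_SIZE = 20
ANON_MAX_RESULT_DEPTH = 240


def anonymous_request_allowed(page: int, page_size: int) -> bool:
    """Return True if an anonymous request stays within both limits."""
    if page < 1 or page_size < 1:
        return False
    if page_size > ANON_MAX_PAGE_SIZE:
        return False
    # The deepest result a request reaches is page * page_size.
    return page * page_size <= ANON_MAX_RESULT_DEPTH
```

Under these assumptions, page 12 with a page size of 20 is the last allowed page, while page 13 of the same size, or any request with a page size of 21, would be rejected.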

Make authenticated requests, and confirm the behaviour matches the configuration and your understanding of it. Do the same with a privileged app, by visiting the Django admin and adding the privileges on the ThrottledApplication. Enable the privilege by checking the checkbox for that privilege, and save the throttled application.

Review the new unit tests in test_auth and confirm they cover the requirements.

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
[N/A] I ran the DAG documentation generator (just catalog/generate-docs for catalog
PRs) or the media properties generator (just catalog/generate-docs media-props
for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

sarayourfriend · 2024-05-23T03:43:23Z

Something (presumably caused by my deduplication of the media_type fixture in favour of the more complete, older, and widely used media_type_config) is causing test state to leak between tests, and the query assertions are failing due to the enabled sources query augmentation.

Very frustrating! I can't see at all how the test configuration is causing that to happen, and it only happens on a full test run, so it's purely down to some luck of ordering. Very annoying. I'm trying to find a way to completely squash the redis data on each subsequent test run so there's no chance of either the real redis being used or the data leaking between tests.

sarayourfriend · 2024-05-23T03:55:37Z

api/test/fixtures/cache.py

@@ -41,18 +43,20 @@ def get_redis_connection(*args, **kwargs):


 @pytest.fixture(autouse=True)
-def django_cache(redis, monkeypatch) -> RedisCache:


Got it! Indeed, something was not working about the cache fixtures, and they were sometimes persisting between tests. Adding new tests (and moving fixtures around) in this PR exposed the issue, but it ends up being easy to solve by actually completely squashing the possibility of a cache being reused by literally replacing it. We can access the real data structure underneath (and it isn't even a hack to do so), so we don't need to monkeypatch in this case.

This explains the changes in this fixture module.
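
To illustrate the general idea of replacing the cache outright rather than patching it, here is a minimal, library-free sketch. It does not reproduce the fixture in this PR: `InMemoryCache`, `cache_registry`, and `fresh_cache` are all hypothetical names standing in for the real cache backend and wherever the real connections are stored.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


class InMemoryCache:
    """Hypothetical stand-in for a cache backend."""

    def __init__(self) -> None:
        self._data: Dict[str, object] = {}

    def set(self, key: str, value: object) -> None:
        self._data[key] = value

    def get(self, key: str, default: Optional[object] = None) -> object:
        return self._data.get(key, default)


# Hypothetical registry of named caches, standing in for the real data
# structure that holds cache connections.
cache_registry: Dict[str, InMemoryCache] = {"default": InMemoryCache()}


@contextmanager
def fresh_cache(alias: str = "default") -> Iterator[InMemoryCache]:
    """Swap in a brand-new cache for the duration of one test."""
    previous = cache_registry[alias]
    cache_registry[alias] = InMemoryCache()
    try:
        yield cache_registry[alias]
    finally:
        # Restore the original entry so nothing written during the test survives.
        cache_registry[alias] = previous
```

In a pytest suite the body of `fresh_cache` would typically live in an autouse fixture, so every test starts from an empty cache and nothing written by one test is visible to the next.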

sarayourfriend · 2024-05-23T03:59:23Z

api/api/admin/__init__.py

@@ -1,140 +1,22 @@
 from django.conf import settings


I had to reorganise this module and the admin in general. We've been putting things pretty haphazardly, and it was becoming a mess. The forms module was too generically named, and encouraged needlessly separating a form definition from the admin it was meant for. That doesn't make sense.

I've split it up into basically the conceptual model of our data, roughly matching the serializers, models, and so forth, rather than the implementation detail (form this, admin that).

The only new things in the admin are the throttled admin changes and the new form for it in admin.oauth.

sarayourfriend · 2024-05-23T04:02:39Z

api/conf/settings/spectacular.py

@@ -50,19 +50,17 @@
            "name": "auth",
            "description": dedent(
                """
-                The API has rate-limiting and throttling in place to prevent


I chose not to list out all the details of what the limits are for throttling or privileged pagination. The throttle limits were wrong, sort of, in that those are the limits for some views, but thumbnails and some others have different limits. I don't think we actually need to publish the configured limits explicitly, and doing it like this (when we know they are configured via environment variables elsewhere anyway) inevitably leads to inaccurate information and drifting documentation.

Instead, I've changed this to describe the general idea of Openverse's API having intentional limits to requesters, how we communicate those limits (kept from the existing docs), and letting the reader know they can request higher limits if they wish.

sarayourfriend · 2024-05-23T04:03:29Z

api/test/fixtures/media_type_config.py

@@ -0,0 +1,161 @@
+from dataclasses import dataclass


This is just moved from api/test/unit/conftest.py and deduplicates the much more recent and less complete media_type fixture removed from api/test/integration.

sarayourfriend · 2024-05-23T04:05:13Z

api/test/fixtures/rest_framework.py

+        "/", HTTP_AUTHORIZATION=f"Bearer {access_token.token}"
+    )
+
+    return APIView().initialize_request(request)


Fixtures in this module are just moved around, but I've also fixed an issue in this authed_request fixture where it was causing request.auth to be a token string, rather than the actual authenticated access token object that it would be in normal circumstances. It was using force_authenticate which causes auth to be just a string, and to bypass the authentication class. We don't need to do that, we have a real token, so we can just rely on the normal authentication configuration. That's a good thing! It means we're integrating the whole authentication flow into these tests and ensuring it works in all scenarios that we have tests for.

sarayourfriend · 2024-05-23T04:05:34Z

api/test/integration/test_auth.py

-@pytest.mark.django_db
-def test_page_size_limit_unauthed(api_client):
-    query_params = {"page_size": 20}
-    res = api_client.get("/v1/images/", query_params)
-    assert res.status_code == 200
-    query_params["page_size"] = 21
-    res = api_client.get("/v1/images/", query_params)
-    assert res.status_code == 401
-
-
-@pytest.mark.django_db
-def test_page_size_limit_authed(api_client, test_auth_token_exchange):
-    time.sleep(1)
-    token = test_auth_token_exchange["access_token"]
-    query_params = {"page_size": 21}
-    res = api_client.get(
-        "/v1/images/", query_params, HTTP_AUTHORIZATION=f"Bearer {token}"
-    )
-
-    assert res.status_code == 200
-
-    query_params = {"page_size": 500}
-    res = api_client.get(
-        "/v1/images/", query_params, HTTP_AUTHORIZATION=f"Bearer {token}"
-    )
-    assert res.status_code == 200


These tests are superseded by the new ones at the end of this module.

sarayourfriend · 2024-05-23T04:05:51Z

api/test/integration/test_dead_link_filter.py

-@pytest.mark.django_db
-def test_max_page_count(api_client):
-    response = api_client.get(
-        "/v1/images/", {"page": settings.MAX_PAGINATION_DEPTH + 1}
-    )
-    assert response.status_code == 400


This test is replaced by the new test_auth tests.

sarayourfriend · 2024-05-23T04:06:16Z

api/test/unit/conftest.py

@@ -1,40 +1,6 @@
-from dataclasses import dataclass


All of these removed lines are just moved to api/test/fixtures/media_type_config.py

obulat · 2024-05-28T17:19:31Z

I tried testing this PR locally, but I get

django.db.utils.ProgrammingError: column api_throttledapplication.privileges does not exist
LINE 1: ..."verified", "api_throttledapplication"."revoked", "api_throt...

error in tests and when I try to open the ThrottledApplication admin page. I assumed that I needed to run just api/dj migrate, but this gives me the following error:

  Applying api.0062_decision_through_tables...Traceback (most recent call last):

django.db.utils.ProgrammingError: relation "api_imagedecisionthrough" already exists

What should I do?

It would also be easier to review this PR if the admin splitting was extracted to a separate PR.

sarayourfriend · 2024-05-28T23:17:34Z

@obulat you need to run just api/init, I thought? It entirely depends on what branches you've worked on recently and what migrations you've applied to your local API database. Unfortunately I don't know what more specific instructions to give other than just recreate (or just down -v && just api/up && just api/init).

openverse-bot · 2024-05-29T00:00:08Z

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 3 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)².

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

¹ Specifically, Saturday and Sunday.
² For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

stacimc

The code, including tests, looks great to me @sarayourfriend. I really like the refactoring here, and the effort in comments and the commit history to make that easier to review 😅

I have been having trouble actually testing but just confirmed it's an issue with my local env and not this branch. Submitting my comments so far, but I'm going to continue looking at this now!

stacimc · 2024-05-28T19:59:44Z

api/api/constants/privilege.py

+    """
+
+    slug: str
+    anonymous: typing.Any


Why typing.Any for these?

In practice, they're currently int, but I presume they could be anything. I'll try and see if Python will support a Generic parameter here 🤔

A generic appears to work! 🎉 Well, whether it would actually pass mypy, I have no idea, but the code runs still and I don't get any errors with my informal type information from my IDE.
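
For readers unfamiliar with the pattern, a generic dataclass along these lines might look like the sketch below. Only `slug`, `anonymous`, the class name `RestrictedFeature` (from the later rename commit), and the anonymous page size of 20 come from this PR; the `authenticated` and `privileged` fields, the `limit_for` helper, and the values 50 and 500 are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RestrictedFeature(Generic[T]):
    """A feature whose limit depends on the requester's access level."""

    slug: str
    anonymous: T
    authenticated: T  # assumed field name, not confirmed by the diff
    privileged: T  # assumed field name, not confirmed by the diff

    def limit_for(self, *, is_authenticated: bool, has_privilege: bool) -> T:
        """Hypothetical helper: pick the limit for a given requester."""
        if is_authenticated and has_privilege:
            return self.privileged
        if is_authenticated:
            return self.authenticated
        return self.anonymous


# The values 50 and 500 are placeholders chosen for this example.
MAX_PAGE_SIZE: RestrictedFeature[int] = RestrictedFeature(
    slug="max_page_size", anonymous=20, authenticated=50, privileged=500
)
```

With the type parameter in place, a type checker can tell that `MAX_PAGE_SIZE.limit_for(...)` returns an `int`, which `typing.Any` would not convey.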

stacimc · 2024-05-28T20:13:47Z

api/api/serializers/media_serializers.py

    """This serializer passes pagination parameters from the query string."""

+    _SUBJECT_TO_PAGINATION_LIMITS = (
+        "This parameter is subject to limitations based on authentication "
+        "and special privileges. For details, refer to [the authentication "


special privileges -> access level here might also be more clear.

stacimc · 2024-05-29T00:21:07Z

api/api/serializers/media_serializers.py

+
+        if real_page_count > max_possible_page_count:
+            return floor(max_possible_page_count)
+


The math here reads confusingly to me for the same reason commented above related to the name PAGINATION_DEPTH. That being said -- very cool to clamp the actual result count independent of page size!

Let me know if the changes to names fix this for you. Otherwise, I can try to think of a different way to express this.
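
As a worked illustration of the clamping discussed in this thread, here is a standalone sketch. The names `real_page_count` and `max_possible_page_count` come from the diff excerpt; the function name, its signature, and `max_result_depth` are assumptions, and the real serializer may compute these values differently.

```python
from math import ceil, floor


def clamp_page_count(result_count: int, page_size: int, max_result_depth: int) -> int:
    """Limit the reported page count so it never exceeds the result depth."""
    # Pages needed to show every result at this page size.
    real_page_count = ceil(result_count / page_size)
    # Pages reachable before hitting the depth limit (may be fractional).
    max_possible_page_count = max_result_depth / page_size

    if real_page_count > max_possible_page_count:
        # Round down: a partial page past the limit is not reachable.
        return floor(max_possible_page_count)

    return real_page_count
```

With a depth of 240, a query matching 10,000 results reports 12 pages at a page size of 20 but only 4 pages at a page size of 50, since 240 / 50 = 4.8 rounds down. The number of reachable results stays bounded regardless of page size.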

stacimc · 2024-05-29T00:21:56Z

api/conf/settings/spectacular.py

@@ -50,19 +50,17 @@
            "name": "auth",
            "description": dedent(
                """
-                The API has rate-limiting and throttling in place to prevent


+1, this is a great improvement IMO.

github-actions · 2024-05-29T01:38:11Z

This PR has migrations. Please rebase it before merging to ensure that conflicting migrations are not introduced.

stacimc

Thanks for your patience @sarayourfriend! Finally got it working and this tested well for me. I tried authenticated & unauthenticated, with setting the privileges individually and both at once. Everything worked beautifully :)

obulat

The code looks great and works well locally (just recreate worked, thank you!).

My main change request is for form clarification. It can be done in this PR, or in a follow-up one.

As a maintainer trying to restrict a violating app's access level, what do I need to do? When I look at the form, without the knowledge about the underlying code, I'm not sure if I should be changing the rate_limit_model or the privileges, and which privileges should I restrict to fix the problem.
We could add a documentation page about it, but I think for the maintainer trying to fix an API traffic problem, it would be nice to have an instruction inside the form itself. Something like "Unselect max_page_size for an app that is violating Openverse TOS by scraping content" in the privileges label.

sarayourfriend · 2024-05-29T06:12:39Z

That's definitely a documentation problem, and I think a misunderstanding of how to apply decisions based on ToS violations. The correct response to an application abusing any level of access it's been granted is to revoke its access. I'll talk with @zackkrida about where best to document this, I'm not entirely sure if it's suitable for public documentation right now.

sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 🕹 aspect: interface Concerns end-users' experience with the software 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels May 22, 2024

github-actions bot added the migrations Modifications to Django migrations label May 22, 2024

sarayourfriend force-pushed the add/refined-registered-app-limits branch from 47a42a8 to 53b2423 Compare May 23, 2024 03:32

WordPress deleted a comment from github-actions bot May 23, 2024

sarayourfriend marked this pull request as ready for review May 23, 2024 04:14

sarayourfriend requested a review from a team as a code owner May 23, 2024 04:14

sarayourfriend requested review from obulat and stacimc May 23, 2024 04:14

stacimc reviewed May 29, 2024

sarayourfriend added 11 commits May 29, 2024 11:36

Split admin registration and reorganise definitions

ca4052c

Deduplicate media type fixture

a87cd9f

Add page size and pagination depth privileges

ba6b41c

Remove base request serializer

955cf9e

Fix tuples

42d6426

Prevent cache persistence between tests

67ea4c2

Rename dataclass to RestrictedFeature

74ad74b

Rename PAGINATION_DEPTH to QUERY_DEPTH

5c46e83

Use clearer names for restricted features

efe5469

Privilege -> access level

9486804

Rewrite migration after rebase

f8e26dc

sarayourfriend force-pushed the add/refined-registered-app-limits branch from c66cbe2 to f8e26dc Compare May 29, 2024 01:37

WordPress deleted a comment from github-actions bot May 29, 2024

stacimc approved these changes May 29, 2024

obulat approved these changes May 29, 2024

sarayourfriend merged commit e0c2951 into main May 29, 2024
48 checks passed

sarayourfriend deleted the add/refined-registered-app-limits branch May 29, 2024 06:12

obulat mentioned this pull request Jun 12, 2024

The API result_count is no more than 240 for unauthenticated requests #4474

Closed
