Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks/WP-190: Handle concurrency with Tapis OAuth Token Refresh #932

Merged
merged 33 commits into from
Aug 21, 2024

Conversation

chandra-tacc
Copy link
Collaborator

@chandra-tacc chandra-tacc commented Jan 25, 2024

Overview

When Tapis OAuth token expires for a user and multiple tapis api calls are requested concurrently (example: page refresh), all of them send requests to Tapis to refresh token. This duplicate request is a waste of resource and slows performance.

The solution is to only make one request per user when token expires. Any solution should work with requests distributed across multiple processes.

Possible Solutions

django select_for_update

  • operates at the row level, Tapis OAuth info is one row per user. So, select_for_update a convenient mechanism to do distributed locking.
  • allows for transactions to wait until they acquire the lock. This allows for incoming requests to not fail because they cannot acquire lock.

django-db-mutex

  • allows you to create higher-level named locks that can be used to protect larger sections of code or critical operations. This needs a third party library installation
  • If a request cannot acquire a mutex because it is already locked, the db-mutex throws an error. This will fail the request if a retry-wait like mechanism is not implemented.

This PR uses select_for_update since is readily available in django and has waits.

Update
PR now uses middleware to process (similar to solution used in DesignSafe)

Related

Changes

  • Adds a middleware to handle tapis token refresh.
  • Uses select_for_update for a specific user, check if the updated data has non-expired token. Otherwise, refreshes the token.
  • There are expiry checks at two places - one before acquiring a lock (on current model data) and one after acquiring row lock, but with latest db data.

Testing

Expire token

docker exec -it core_portal_django /bin/bash
python3 manage.py shell
from django.contrib.auth.models import User
u = User.objects.get(username='cyemparala')
u.tapis_oauth.expires_in = 1
u.tapis_oauth.save()

Test cases:

Whole page refresh - triggers multiple concurrent requests with expired token:
  • Go to applications page on cep.test
  • Expiry token (use above code)
  • Go to dashboard page.
  • See logs
    ** Only one "Refreshing Tapis OAuth token" message is seen, rest are all waiting for row lock
    ** 2 of waiting transactions, get the update access token info and process the request. And see log message Refreshing token from DB for those 2 waiting transactions
core_portal_django         | [DJANGO] INFO 2024-08-21 15:38:23,638 UTC middleware portal.apps.auth.middleware.process_request:50: Tapis OAuth token expired for user . Refreshing token
core_portal_django         | [DJANGO] INFO 2024-08-21 15:38:23,638 UTC middleware portal.apps.auth.middleware.process_request:50: Tapis OAuth token expired for user . Refreshing token
core_portal_django         | [DJANGO] INFO 2024-08-21 15:38:23,638 UTC middleware portal.apps.auth.middleware.process_request:50: Tapis OAuth token expired for user . Refreshing token
core_portal_django         | [DJANGO] INFO 2024-08-21 15:38:23,640 UTC middleware portal.apps.auth.middleware.process_request:62: Refreshing Tapis OAuth token
core_portal_django         | [DJANGO] INFO 2024-08-21 15:38:24,427 UTC middleware portal.apps.auth.middleware.process_request:73: Token updated by another request. Refreshing token from DB.
core_portal_django         | [DJANGO] INFO 2024-08-21 15:38:24,428 UTC middleware portal.apps.auth.middleware.process_request:73: Token updated by another request. Refreshing token from DB.

Single request
  • Expiry token
  • Click on history menu item in cep.test
  • See logs
    ** Only one "Refreshing Tapis OAuth token" message is seen, rest are all waiting for row lock
    ** No other request acquired row lock because they already saw the non-expired token
core_portal_nginx          | 172.22.0.1 - - [25/Jan/2024:20:30:35 +0000] "PATCH /api/notifications/ HTTP/2.0" 200 17 "https://cep.test/workbench/history/jobs" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "-"
core_portal_django         | [DJANGO] INFO 2024-01-25 20:30:35,397 UTC models portal.apps.auth.models.optimized_client:98: Tapis OAuth token expired
core_portal_django         | [DJANGO] INFO 2024-01-25 20:30:35,398 UTC models portal.apps.auth.models.optimized_client:104: Refreshing tapis oauth token
core_portal_django         | [DJANGO] INFO 2024-01-25 20:30:36,262 UTC basehttp django.server.log_message:212: "GET /api/workspace/jobs/listing?offset=0&limit=50&query_string= HTTP/1.0" 200 421569
core_portal_nginx          | 172.22.0.1 - - [25/Jan/2024:20:30:36 +0000] "GET /api/workspace/jobs/listing?offset=0&limit=50&query_string= HTTP/2.0" 200 30057 "https://cep.test/workbench/history/jobs" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" 
Basic UI Sanity Tests

Ran through basic UI sanity tests to ensure acquiring client does not fail

UI

Notes

Copy link

codecov bot commented Jan 25, 2024

Codecov Report

Attention: Patch coverage is 0% with 39 lines in your changes missing coverage. Please review.

Project coverage is 65.25%. Comparing base (fa15f5f) to head (6e2e81e).
Report is 1 commits behind head on main.

Files Patch % Lines
server/portal/apps/auth/middleware.py 0.00% 38 Missing ⚠️
server/portal/apps/auth/models.py 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #932      +/-   ##
==========================================
- Coverage   65.40%   65.25%   -0.16%     
==========================================
  Files         437      438       +1     
  Lines       12653    12681      +28     
  Branches     2667     2636      -31     
==========================================
- Hits         8276     8275       -1     
- Misses       4140     4169      +29     
  Partials      237      237              
Flag Coverage Δ
javascript 70.30% <ø> (ø)
unittests 60.08% <0.00%> (-0.29%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
server/portal/settings/settings.py 0.00% <ø> (ø)
server/portal/urls.py 86.66% <ø> (ø)
server/portal/apps/auth/models.py 82.50% <0.00%> (+14.50%) ⬆️
server/portal/apps/auth/middleware.py 0.00% <0.00%> (ø)

@chandra-tacc chandra-tacc marked this pull request as ready for review January 25, 2024 23:15
Copy link
Member

@rstijerina rstijerina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good I think

server/portal/apps/auth/models.py Outdated Show resolved Hide resolved
server/portal/apps/auth/models.py Outdated Show resolved Hide resolved
chandra-tacc and others added 2 commits February 9, 2024 13:21
Co-authored-by: Sal Tijerina <r.sal.tijerina@gmail.com>
@nathanfranklin
Copy link
Member

I haven’t had the chance to test it yet, but I appreciate the detailed PR description, particularly the 'Possible Solutions' section 💯

Copy link
Member

@rstijerina rstijerina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

client.refresh_tokens()
except Exception:
logger.exception('Tapis Token refresh failed')
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On failure, perhaps we should call logout

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will test it and share info.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic is all in a model, can't do http redirect or control view responses from here. Have to do from view. I setup an custom exception and handled it in Base View to send 401 back to client. On testing, by forcing an error - it turned the 401 to redirect to tapis oauth, but that failed due to CORS policy issue, I have to check if this is local setup or something else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came to that realization as well today. I tried another solution with DesignSafe, which is to put the refresh logic in a middleware. In CEP we originally moved that logic from middleware to the client() method, but I think the solution you propose here might solve the original issue there?

Haven't tested yet, what are your thoughts?
https://github.com/DesignSafe-CI/portal/blob/task/DES-2702--tapis-v3-oauth/designsafe/apps/auth/middleware.py

Copy link
Collaborator Author

@chandra-tacc chandra-tacc Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rstijerina - sorry for delay in response, I missed this note.
I looked at the code in that branch. It looks good, one comment on overall integration:

  • Should you do this also for logout?
    logout(request)
    return HttpResponseRedirect(reverse('designsafe_auth:login'))

Some testing aspects:

  • Behavior on xhr requests when tapis token expiry fails. If xhr does not handle 302 cleanly, some extra check and redirect might be needed.
  • Walking through the code, it is protected from infinite loop through this(which is good). If refresh fails, goes to login, which has to authenticate with tapis, if authentication works, this middleware immediately returns (because there is no expiry) and move away from this middleware.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use do this also for logout?
logout(request)
return HttpResponseRedirect(reverse('designsafe_auth:login'))

Yes, thanks. Added:
https://github.com/DesignSafe-CI/portal/blob/task/DES-2709--v3-apps-views/designsafe/apps/auth/middleware.py

Behavior on xhr requests when tapis token expiry fails. If xhr does not handle 302 cleanly, some extra check and redirect might be needed.

Can you expand more on this part? Where would tapis token expiry fail?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior on xhr requests when tapis token expiry fails. If xhr does not handle 302 cleanly, some extra check and redirect might be needed.

Can you expand more on this part? Where would tapis token expiry fail?
I meant - "Behavior on xhr requests when tapis token expires, and the refresh fails - this will hit the logout code and send a 302 back to client. If javascript side of response handling does not handle 302 cleanly (page rendering after 302, etc), may be extra logic need be needed to check for 302 status and specific error type(token expired) and then setting location href to logout".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rstijerina - regarding this PR, if middleware is the right place for auth and if it is working in designsafe, should I do the same here and start testing. What is your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution I have in DesginSafe does work, here are example logs from a refresh that just occurred for me:

des_django         | [DJANGO] INFO 2024-04-12 14:28:57,764 middleware designsafe.apps.auth.middleware.process_request:49: Tapis OAuth token expired for user sal. Refreshing token
des_django         | [DJANGO] INFO 2024-04-12 14:28:57,769 middleware designsafe.apps.auth.middleware.process_request:49: Tapis OAuth token expired for user sal. Refreshing token
des_django         | [DJANGO] INFO 2024-04-12 14:28:57,775 middleware designsafe.apps.auth.middleware.process_request:61: Refreshing Tapis OAuth token
des_django         | [DJANGO] INFO 2024-04-12 14:29:02,626 middleware designsafe.apps.auth.middleware.process_request:72: Token updated by another request. Refreshing token from DB.

It might not be fool-proof though, and could definitely use more testing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could talk about best place for token refresh in the next infra scrum?

Copy link
Contributor

@edmondsgarrett edmondsgarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! And tested well

@chandra-tacc
Copy link
Collaborator Author

Copy link
Contributor

@edmondsgarrett edmondsgarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@chandra-tacc chandra-tacc marked this pull request as draft April 19, 2024 22:00
chandra-tacc and others added 4 commits August 21, 2024 11:09
* Handle validation for FORK job type

* Prettier fix

* Remove unrelated fix

* Adjust Yup validation and initialValues
Bumps [sqlparse](https://github.com/andialbrecht/sqlparse) from 0.4.4 to 0.5.0.
- [Changelog](https://github.com/andialbrecht/sqlparse/blob/master/CHANGELOG)
- [Commits](andialbrecht/sqlparse@0.4.4...0.5.0)

---
updated-dependencies:
- dependency-name: sqlparse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chandra Y <cyemparala@tacc.utexas.edu>
Bumps [idna](https://github.com/kjd/idna) from 3.4 to 3.7.
- [Release notes](https://github.com/kjd/idna/releases)
- [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.rst)
- [Commits](kjd/idna@v3.4...v3.7)

---
updated-dependencies:
- dependency-name: idna
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chandra Y <cyemparala@tacc.utexas.edu>
dependabot bot and others added 18 commits August 21, 2024 11:09
Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.1 to 3.0.3.
- [Release notes](https://github.com/pallets/werkzeug/releases)
- [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
- [Commits](pallets/werkzeug@3.0.1...3.0.3)

---
updated-dependencies:
- dependency-name: werkzeug
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [braces](https://github.com/micromatch/braces) from 3.0.2 to 3.0.3.
- [Changelog](https://github.com/micromatch/braces/blob/master/CHANGELOG.md)
- [Commits](micromatch/braces@3.0.2...3.0.3)

---
updated-dependencies:
- dependency-name: braces
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chandra Y <cyemparala@tacc.utexas.edu>
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.18 to 1.26.19.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/1.26.19/CHANGES.rst)
- [Commits](urllib3/urllib3@1.26.18...1.26.19)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chandra Y <cyemparala@tacc.utexas.edu>
Bumps [ws](https://github.com/websockets/ws) from 7.5.7 to 7.5.10.
- [Release notes](https://github.com/websockets/ws/releases)
- [Commits](websockets/ws@7.5.7...7.5.10)

---
updated-dependencies:
- dependency-name: ws
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chandra Y <cyemparala@tacc.utexas.edu>
Bumps [certifi](https://github.com/certifi/python-certifi) from 2023.7.22 to 2024.7.4.
- [Commits](certifi/python-certifi@2023.07.22...2024.07.04)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [django](https://github.com/django/django) from 4.2.11 to 4.2.14.
- [Commits](django/django@4.2.11...4.2.14)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [setuptools](https://github.com/pypa/setuptools) from 68.2.2 to 70.0.0.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst)
- [Commits](pypa/setuptools@v68.2.2...v70.0.0)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chandra Y <cyemparala@tacc.utexas.edu>
* Allow docs to be behind login

* Lint fix

* Customize url path

* Fix unit test

* conditionally add urls.py
* TAS: Country is no longer available.

* Adjust tests
Bumps [django](https://github.com/django/django) from 4.2.14 to 4.2.15.
- [Commits](django/django@4.2.14...4.2.15)

---
updated-dependencies:
- dependency-name: django
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chandra Y <cyemparala@tacc.utexas.edu>
Bumps [twisted](https://github.com/twisted/twisted) from 23.10.0 to 24.7.0.
- [Release notes](https://github.com/twisted/twisted/releases)
- [Changelog](https://github.com/twisted/twisted/blob/trunk/NEWS.rst)
- [Commits](twisted/twisted@twisted-23.10.0...twisted-24.7.0)

---
updated-dependencies:
- dependency-name: twisted
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Working version of execution system changes

+ get available systems
+ handle client side scenarios

* Add tests, fix issues found from unit tests

* Fix job history and also lint

* Use one attribute for exec systems instead of two

* Fix formatting

* Sort system list in UI

* Address code review comments

* Adjusted exec system label text and fixed jest tests

* Working version of execution system changes

+ get available systems
+ handle client side scenarios

* Add tests, fix issues found from unit tests

* Fix job history and also lint

* Use one attribute for exec systems instead of two

* Fix formatting

* Sort system list in UI

* Address code review comments

* Adjusted exec system label text and fixed jest tests

* Fix bug related to job history execution system

* Merge fix

* Redo the fix on job history

* Rework commit for exec and allocation

* Prettier, lint and test fix

* Fix merge issues

* Make exec system dependent on allocation

* Fix job history display of system name

* Handle max memory on a system

* Validation for execSystemId and fix express VM job submission

---------

Co-authored-by: Sal Tijerina <r.sal.tijerina@gmail.com>
@chandra-tacc chandra-tacc marked this pull request as ready for review August 21, 2024 16:26
Copy link
Member

@rstijerina rstijerina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@edmondsgarrett edmondsgarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!!

@chandra-tacc chandra-tacc merged commit 0a5fe6f into main Aug 21, 2024
4 of 6 checks passed
@chandra-tacc chandra-tacc deleted the tasks/WP-190-Tapis-Mutex branch August 21, 2024 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants