feat(oauth2): add support for trino #30081

Merged
merged 7 commits into from
Nov 4, 2024

Conversation

joaoferrao
Copy link
Contributor

@joaoferrao joaoferrao commented Aug 31, 2024

SUMMARY

Under #27631 under #20300
It also fixes an issue that #29981 did not fully resolve, and that fix is required for OAuth2 to work with trino.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screen.Recording.2024-08-31.at.16.57.22.mov

TESTING INSTRUCTIONS

  1. I created a Keycloak client for trino and added this configuration in superset_docker_config.py:
DATABASE_OAUTH2_REDIRECT_URI = "http://localhost:8088/api/v1/database/oauth2/"
DATABASE_OAUTH2_CLIENTS = {
    'Trino': {
        'id': 'trino',
        'secret': '<some-secret>',
        'scope': 'openid email offline_access roles profile',
        'redirect_uri': 'http://localhost:8088/api/v1/database/oauth2/',
        'authorization_request_uri': 'https://<the url of keycloak deploy>/realms/master/protocol/openid-connect/auth',
        'token_request_uri': 'https://<the url of keycloak deploy>/realms/master/protocol/openid-connect/token',
        'request_content_type': 'data',  # keycloak doesn't accept an application/json body.
    }
}
  2. Database configured via the UI with the following settings:
trino://<trino_url>:443/tpcds

{"connect_args":{"http_scheme":"https"}}

Impersonate: true

ADDITIONAL INFORMATION

  • Has associated issue: [SIP-85] OAuth2 for databases #20300
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

Need feedback with:

We still need to trigger this OAuth2 dance in at least 2 contexts (I don't know much about superset, so possibly there are more):

  • Automatic attempt to list schemas and tables
  • Testing Connection when adding the database: temp: previous OAuth2 features implemented and already merged don't include a way to trigger this flow when adding a connection via UI. For this reason, I had to hack the test_connection.py so I'm allowed to

@dosubot dosubot bot added api Related to the REST API authentication Related to authentication data:connect:trino Related to Trino labels Aug 31, 2024
Copy link
Contributor

@github-actions github-actions bot left a comment

Congrats on making your first PR and thank you for contributing to Superset! 🎉 ❤️

We hope to see you in our Slack community too! Not signed up? Use our Slack App to self-register.

@joaoferrao joaoferrao marked this pull request as draft August 31, 2024 15:09
@github-actions github-actions bot removed the api Related to the REST API label Aug 31, 2024
@joaoferrao joaoferrao marked this pull request as ready for review September 2, 2024 14:04
Copy link

codecov bot commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.

Project coverage is 83.91%. Comparing base (76d897e) to head (4011e69).
Report is 924 commits behind head on master.

Files with missing lines Patch % Lines
superset/db_engine_specs/trino.py 70.58% 5 Missing ⚠️
superset/db_engine_specs/base.py 88.88% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #30081       +/-   ##
===========================================
+ Coverage   60.48%   83.91%   +23.42%     
===========================================
  Files        1931      534     -1397     
  Lines       76236    38788    -37448     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    32549    -13565     
+ Misses      28017     6239    -21778     
+ Partials     2105        0     -2105     
Flag Coverage Δ
hive 48.91% <48.27%> (-0.25%) ⬇️
javascript ?
mysql 76.73% <51.72%> (?)
postgres 76.85% <51.72%> (?)
presto 53.41% <48.27%> (-0.40%) ⬇️
python 83.91% <79.31%> (+20.42%) ⬆️
sqlite 76.31% <51.72%> (?)
unit 60.90% <79.31%> (+3.27%) ⬆️


@sfirke
Copy link
Member

sfirke commented Sep 5, 2024

Looks like @betodealmeida was working on #30126 at the same time which might be related -- Beto, are you able to review this one?

@mistercrunch
Copy link
Member

tagging @betodealmeida as he has most context to review this

@@ -192,3 +192,8 @@ class OAuth2ClientConfigSchema(Schema):
)
authorization_request_uri = fields.String(required=True)
token_request_uri = fields.String(required=True)
request_content_type = fields.String(
Copy link
Contributor

@fisjac fisjac Sep 17, 2024

I might be mistaken, but I believe this validation is only applied to the client_info contained within encrypted_extra when provided by a user. Is the intent that this is where the request_content_type is going to be provided, or is it going to be set as a default for the Trino engine spec?

Copy link
Member

Right, I think the idea is that DB engine specs can have a default content type (defined in the oauth2_token_request_type class attribute), but it can be overridden on a per-database basis by setting it in the encrypted_extra.
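
A minimal sketch of that precedence, not the actual Superset code: the class names, the `resolve_request_content_type` helper, the `oauth2_client_info` key and the default values below are assumptions made for illustration. Only the `oauth2_token_request_type` attribute and the `request_content_type` field come from this thread.

```python
from typing import Any, Optional


class BaseEngineSpecSketch:
    # Hypothetical stand-in for a DB engine spec; "json" is an assumed default.
    oauth2_token_request_type: str = "json"


class TrinoEngineSpecSketch(BaseEngineSpecSketch):
    # Assumed engine-level default for illustration only.
    oauth2_token_request_type = "data"


def resolve_request_content_type(
    spec: type, encrypted_extra: Optional[dict[str, Any]]
) -> str:
    """Prefer the per-database override, else fall back to the spec default."""
    # "oauth2_client_info" is an assumed key name inside encrypted_extra.
    client_info = (encrypted_extra or {}).get("oauth2_client_info", {})
    return client_info.get("request_content_type") or spec.oauth2_token_request_type
```

With no override the engine default applies; a database whose `encrypted_extra` carries `request_content_type` wins over it.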

Copy link
Member

@betodealmeida betodealmeida left a comment

Awesome! I left a few comments but nothing blocking.

Comment on lines +566 to +578
if config["request_content_type"] == "data":
    return requests.post(uri, data=req_body, timeout=timeout).json()
return requests.post(uri, json=req_body, timeout=timeout).json()
Copy link
Member

Nice!

I think that the most common (standard?) workflow is to send form encoded data, and not JSON. It's just that the first OAuth2 implementation done for Superset targeted GSheets, and Google uses JSON instead of form encoded (it's the same for BigQuery, for example). But now changing the default would break existing databases, so we need to leave JSON as the default.

Copy link
Member

I agree that data is likely the more common implementation. As we're still in early days, I feel it's well motivated to introduce a breaking change, as long as we have an UPDATING.md entry explaining the steps required to get it working again. This is in the interest of having a clean/idiomatic API in the long term.

Copy link
Member

@villebro yeah, that sounds good to me.
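
For readers comparing the two modes discussed in this thread: with `requests`, passing a dict as `data=` sends a form-encoded body, while `json=` sends a JSON body. The sketch below mirrors that difference using only the standard library; `encode_token_request` is a hypothetical helper, not part of Superset.

```python
import json
from urllib.parse import urlencode


def encode_token_request(req_body: dict, request_content_type: str) -> tuple[str, bytes]:
    """Return (Content-Type header, encoded body) for a token request."""
    if request_content_type == "data":
        # Form encoding, the mode Keycloak required in the testing instructions.
        return "application/x-www-form-urlencoded", urlencode(req_body).encode()
    # JSON encoding, the mode the thread says Google's endpoints use.
    return "application/json", json.dumps(req_body).encode()
```

The payload keys are the same in both cases; only the wire format and the `Content-Type` header differ.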

Comment on lines 67 to 69
if isinstance(instance, HttpError):
    return "error 401: b'Invalid credentials'" in str(instance)
return False
Copy link
Member

Nice, I like this approach!

You can simplify this to:

return isinstance(instance, HttpError) and "error 401: b'Invalid credentials'" in str(instance)
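
A self-contained way to check that the simplified form behaves the same, assuming a stand-in `HttpError` class and a hypothetical function name (`needs_oauth2`), since the real method name is not shown in this snippet:

```python
class HttpError(Exception):
    """Hypothetical stand-in for the HttpError class referenced in the diff."""


def needs_oauth2(instance: Exception) -> bool:
    # Short-circuits to False for any exception that is not an HttpError.
    return isinstance(instance, HttpError) and (
        "error 401: b'Invalid credentials'" in str(instance)
    )
```

Both operands are booleans, so the expression returns a `bool` exactly as the three-line version does.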

@pytest.fixture
def oauth2_config() -> OAuth2ClientConfig:
    """
    Config for GSheets OAuth2.
Copy link
Member

Suggested change
- Config for GSheets OAuth2.
+ Config for Trino OAuth2.

@mistercrunch
Copy link
Member

tried my luck fixing the merge conflict in the GitHub web editor, didn't pull the branch and run CI, so no pre-commit checks.... let's see if I get lucky, if not feel free to disregard my commit

@rusackas rusackas requested review from villebro and nytai October 15, 2024 20:52
Copy link
Member

@villebro villebro left a comment

A few comments, other than that LGTM

Comment on lines 434 to 436
oauth2_authorization_request_uri: str | None = "" # pylint: disable=invalid-name
oauth2_token_request_uri: str | None = ""
oauth2_token_request_type: str | None = ""
Copy link
Member

nit: If we're defaulting to "", when might these still be None?

@rusackas
Copy link
Member

@villebro @betodealmeida looks like we have a few nits and unresolved questions here, but the thread has gone quiet. If @joaoferrao doesn't respond, would anyone want to block this, or should someone smash that merge button?

@rusackas
Copy link
Member

(note that code suggestions can be committed here by the person offering them, btw!)

@joaoferrao
Copy link
Contributor Author

joaoferrao commented Oct 29, 2024 via email

@joaoferrao
Copy link
Contributor Author

@betodealmeida @villebro I have addressed the main comments, I believe.

One question I raised when opening my PR is still open:

Testing Connection when adding the database: temp: previous OAuth2 features implemented and already merged don't include a way to trigger this flow when adding a connection via UI. For this reason, I had to hack the test_connection.py so I'm allowed to

I don't know enough about superset, but should this be addressed in a separate PR? Basically, I had to do a lot of hacking to be able to add the connection without the test connection working in the UI, so that I could develop the functionality and actually prove that the connection itself works.

@anushreegawdeAP
Copy link

I’ve been working on setting up OAuth2.0 integration for Trino with Okta in Superset, following the guidance from this PR (#30081). However, despite configuring authorization_request_uri, token_request_uri, and handling impersonation, I’m still encountering issues with 401 Unauthorized errors when trying to connect to Trino.
My current setup looks like this:

trino_oauth

DATABASE_OAUTH2_REDIRECT_URI: "http://localhost:8088/api/v1/database/oauth2/"
DATABASE_OAUTH2_CLIENTS: {
'HSH Trino Oauth': {
'id': '<trino_app_client_id>',
'secret': '<trino_app_client_secret>',
'scope': 'openid email offline_access roles profile',
'redirect_uri': 'http://localhost:8088/api/v1/database/oauth2/',
'authorization_request_uri': 'https://example.okta.com/oauth2/v1/authorize',
'token_request_uri': 'https://example.okta.com/oauth2/v1/token'
}
}

{"connect_args":{"http_scheme":"https"}}

Impersonate: true is in place as well and UI shows the correct configs.

I've reviewed the setup and ensured that the OAuth tokens are obtained correctly from Okta, and all necessary permissions seem in place. Are there any additional configurations or specifics for Trino that this PR setup might require? Any guidance or insights on further troubleshooting would be greatly appreciated!

Thank you!

@joaoferrao
Copy link
Contributor Author

I've reviewed the setup and ensured that the OAuth tokens are obtained correctly from Okta, and all necessary permissions seem in place. Are there any additional configurations or specifics for Trino that this PR setup might require? Any guidance or insights on further troubleshooting would be greatly appreciated!

A bit more detail would help narrow down the issue. Even if some of this is redundant, can I ask:

  1. Have you confirmed trino is working with OAuth2 without superset? Meaning: trino cli with trino --server='<your trino endpoint uri>' --external-authentication=true?
  2. Does that work with the exact same user you are trying to impersonate from superset?
  3. Have you configured superset to use the same "username" as you are using for trino?

If you are sure you are getting the access token back from the IdP (after being requested to authenticate), then this is what is sent to trino:

        if backend_name == "trino" and username is not None:
            connect_args["user"] = username
            if access_token is not None:
                http_session = requests.Session()
                http_session.headers.update({"Authorization": f"Bearer {access_token}"})
                connect_args["http_session"] = http_session

@anushreegawdeAP
Copy link

@joaoferrao the first part works. I verified that. I am unclear about the username part. Can you shed some more light on that?
Have you configured superset to use the same "username" as you are using for trino? how do i verify this?

@betodealmeida
Copy link
Member

I don't know enough about superset, but should this be addressed in a separate PR? Basically, I had to do a lot of hacking to be able to add the connection without the test connection working in the UI, so that I could develop the functionality and actually prove that the connection itself works.

Yeah, this was a chicken-and-egg problem. The token is stored associated with the database ID, so you can't do OAuth2 before the database is created. But for the database to be created, we require a successful connection, which in this case is impossible.

I added some logic in #30071 for Snowflake so that if the test connection fails because OAuth2 is needed then the test connection command succeeds:

https://github.com/apache/superset/pull/30071/files#diff-627a6d549d121af6805862d2edcf1fbbbc68c6df72013a81e67cc8f771b85717R166-R171

Wasn't this logic enough for the same flow with Trino?
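
The idea described above can be sketched as follows. This is not the code from #30071: `OAuth2Needed` and `run_test_connection` are hypothetical names, and the returned status strings are assumptions chosen for illustration.

```python
from typing import Callable


class OAuth2Needed(Exception):
    """Hypothetical marker: the database rejected us because no token exists yet."""


def run_test_connection(ping: Callable[[], None]) -> str:
    """Treat 'OAuth2 is required' as a successful test so the database can be saved."""
    try:
        ping()
    except OAuth2Needed:
        # The token is stored against the database ID, which does not exist yet,
        # so the OAuth2 flow can only run after the database has been created.
        return "oauth2_pending"
    return "ok"
```

Any other exception still propagates, so a genuinely broken connection continues to fail the test.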

@joaoferrao
Copy link
Contributor Author

@betodealmeida that is indeed very similar to what I did to test the feature. I just didn't publish it, because I hadn't delved deeply enough into test_connection and it would impact more than just trino.

Thanks for the feedback. I see all checks have passed and my PR is still approved. Am I right to assume I should merge it?

@joaoferrao
Copy link
Contributor Author

joaoferrao commented Oct 31, 2024

@joaoferrao the first part works. I verified that. I am unclear about the username part. Can you shed some more light on that? Have you configured superset to use the same "username" as you are using for trino? how do i verify this?

Take what I'm saying with a grain of salt, but the point of impersonation is for your superset user to be impersonated in the request to trino. Whatever the username is for superset, it is going to be used in the --user argument of the trino connection. Currently, depending on your Superset settings, this username will be either your Superset username or everything in your email address up to the @.
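
As a rough illustration of the two outcomes described above (not Superset's actual logic): `effective_trino_user` and the `use_email_prefix` flag are hypothetical, since the thread does not name the Superset setting that controls this.

```python
def effective_trino_user(username: str, email: str, use_email_prefix: bool) -> str:
    """Return the user name that would be sent to Trino for impersonation."""
    if use_email_prefix:
        # Everything in the email address before the "@".
        return email.split("@", 1)[0]
    return username
```

Whichever value results must match a user that Trino accepts, which is why testing the same name with the CLI's --user argument is a useful check.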

@anushreegawdeAP
Copy link

anushreegawdeAP commented Oct 31, 2024

@joaoferrao I did verify the command trino --server='' --external-authentication=true with my okta username and that works on the browser but throws cannot impersonate user error in the end on the cluster. I tried setting that to false, still no luck.
Will adding 'request_content_type': 'application/x-www-form-urlencoded' instead of data make any difference? I suspect it's not retrieving the token correctly. I have also checked a snippet of the log, and it does not show an Authorization: Bearer <access_token> header in the request.

@joaoferrao
Copy link
Contributor Author

and that works on the browser but throws cannot impersonate user error in the end on the cluster

Does it or doesn't it work when you remove superset from the equation? The previous comment left me a bit confused about the scenario you described.

  • This should work with standalone trino + okta: trino --server='<your trino endpoint uri>' --external-authentication=true --user=<username_used_by_superset> (notice the last argument) + execute some query such as SHOW CATALOGS; to confirm you have permissions to see the catalogs. This username must be the same one that superset injects into the query. If it doesn't work, you may have a problem with your Trino setup.
  • adding 'request_content_type': 'application/x-www-form-urlencoded': this is what data does. This is used for the request from superset to okta to retrieve the tokens. This doesn't affect impersonation directly.
  • You mentioned in your initial comment that "I've reviewed the setup and ensured that the OAuth tokens are obtained correctly from Okta": this would mean that superset is correctly retrieving the token from okta, so there should be no issue with the data type of the request.

@anushreegawdeAP
Copy link

@joaoferrao thanks for all the pointers. I noticed that trino uses the email as the username while superset uses the okta token, hence the mismatch and the 401. I will need to change the setup on either the trino or the superset side to get the authentication right.

@betodealmeida betodealmeida merged commit 305b6df into apache:master Nov 4, 2024
37 checks passed
@betodealmeida
Copy link
Member

Thanks for working on this, @joaoferrao!

@datasc24
Copy link

datasc24 commented Dec 9, 2024

Testing Connection when adding the database: temp: previous OAuth2 features implemented and already merged don't include a way to trigger this flow when adding a connection via UI. For this reason, I had to hack the test_connection.py so I'm allowed to

Hello @joaoferrao, do you have a solution to create the connection to Trino with oauth 2 enabled in a production environment ?

Labels
authentication Related to authentication data:connect:trino Related to Trino size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants