Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support to connections with Thrift over HTTP transport. #325

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

joaopedroantonio
Copy link

@joaopedroantonio joaopedroantonio commented Apr 4, 2020

  • Supports HTTP transport for Thrift protocol.
  • Three types of authentication supported: NONE, NOSASL, BASIC and KERBEROS.
  • BASIC authentication is useful when the Thrift HTTP interface is behind a proxy (e.g. in Azure HDInsight clusters).

@joaopedroantonio
Copy link
Author

This PR solves issue #69. Let me know what you think and if it looks good for you I'll invest time in the tests that are missing in this PR. :)

- Supports HTTP transport for Thrift protocol.
- Three types of authentication supported: NONE, BASIC and KERBEROS.
@codecov
Copy link

codecov bot commented Apr 4, 2020

Codecov Report

Attention: Patch coverage is 91.71975% with 13 lines in your changes missing coverage. Please review.

Project coverage is 93.67%. Comparing base (c448ed9) to head (97fe26e).
Report is 25 commits behind head on master.

Files Patch % Lines
pyhive/hive.py 86.76% 4 Missing and 5 partials ⚠️
pyhive/tests/test_hive.py 95.50% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #325      +/-   ##
==========================================
+ Coverage   93.23%   93.67%   +0.44%     
==========================================
  Files          14       14              
  Lines        1523     1677     +154     
  Branches      165      185      +20     
==========================================
+ Hits         1420     1571     +151     
+ Misses         75       74       -1     
- Partials       28       32       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bkyryliuk
Copy link
Collaborator

will take a pass on it next week, unfortunately can't get to this earlier

# TODO
# Setting the Cookie in the headers should be implemented in the thrift library.
# We'll keep this here until that change is available in there.
class TCookieHttpClient(thrift.transport.THttpClient.THttpClient):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the meantime, this commit was merged into Thrift master branch, we just have to wait for a new Thrift release to get rid of this TCookieHttpClient:

apache/thrift@69642f3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this fix has been included in the 0.13.0 Release

@bkyryliuk
Copy link
Collaborator

@joaopedroantonio I'm aligned with the approach here, let's add the unit tests & add the test plan in the PR description if you don't mind.

@joaopedroantonio
Copy link
Author

Thank you for the feedback @bkyryliuk, will start working on the tests between this weekend and start of next week.

@MinuMichaelOlassayil
Copy link

Any update on this?

@joaopedroantonio
Copy link
Author

Any update on this?

Oh well, tbh this got lost on my backlog, thanks for the bump. :) Will try to wrap it up this week.

@joaopedroantonio
Copy link
Author

Added unit tests for different HTTP scenarios and fixed a few issues in the process.

@bkyryliuk the codecov checks failed but I guess that's mostly because I moved the binary transport implementation and there are a few line/branch misses. Do you think it's relevant to cover those misses or can we move forward as it is?

@joaopedroantonio
Copy link
Author

@bkyryliuk, any idea when you might be able to pick this up? :)

@joaopedroantonio
Copy link
Author

@bkyryliuk bump. :)

@joaopedroantonio
Copy link
Author

Hello again everyone! Can you give some feedback on this PR? Thanks! (cc @bkyryliuk )

if http_path is None:
http_path = '/'

socket = TCookieHttpClient('http://{}:{}{}'.format(host, port, http_path))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could also pass through a http_protocol, either http or https and have this set.

@gthomas-slack
Copy link

I've tested this PR and this is working nicely for our HiveServer2 service running over http. Awesome work @joaopedroantonio - would love to see this get merged

@gthomas-slack
Copy link

For our particular use-case we need a few more features:

  • Support for connecting to a https endpoint
  • Support for passing through custom http headers
  • Support for passing through custom cookie values

I don't think these should block this PR though, I'd be happy to attempt to add these features once this initial version is in, unless of course you want to add them :)

if auth is None:
auth = 'NONE'
if http_path is None:
http_path = '/'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be better to default this to /cliservice, as that seems to be the default path HiveServer2 provides

@joaopedroantonio
Copy link
Author

@gthomas-slack sorry for the late reply! I'm glad this was helpful! I'll take a better look at your comments over the weekend and I'm happy to include your suggestions if they aren't too complex. :)

@lvazquez
Copy link

For thrift=0.13.0 (current version in conda/pip) it doesn't work with the included TCookieHttpClient wrapper class, and I get the following error in the connect constructor when executing the first "use default" query:

  File "/root/superset/pyhive/pyhive/hive.py", line 189, in __init__
    cursor.execute('USE `{}`'.format(database))
  File "/root/superset/pyhive/pyhive/hive.py", line 459, in execute
    response = self._connection.client.ExecuteStatement(req)
  File "/root/superset/pyhive/TCLIService/TCLIService.py", line 280, in ExecuteStatement
    return self.recv_ExecuteStatement()
  File "/root/superset/pyhive/TCLIService/TCLIService.py", line 292, in recv_ExecuteStatement
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "/root/superset/pyhive/env/lib/python3.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 148, in readMessageBegin
    name = self.trans.readAll(sz)
  File "/root/superset/pyhive/env/lib/python3.7/site-packages/thrift/transport/TTransport.py", line 68, in readAll
    raise EOFError()

After replacing TCookieHttpClient with the original THttpClient it seems to work ok for http connections (no SSL support included).

@lvazquez
Copy link

lvazquez commented Nov 19, 2020

I believe the parameters name for the HTTP Thrift should be renamed to match the (simpler) names used in the original JDBC Thrift URI. Also 'binary', 'http', 'sasl' are not named "protocols" but transport "modes" according to JDBC/Hive documentation
Connection URL When HiveServer2 Is Running in HTTP Mode

Example Thrift JDBC over HTTPS URI:
jdbc:hive2://{host}:{port}/{database};transportMode=http;httpPath=/hive;ssl=true;sslTrustStore=truststore.jks;trustStorePassword=changeit

Based on this it seems the names should be:

  • thrift_transport_protocol -> transport_mode (transportMode in JDBC/Thrift)
  • http_path -> http_path (httpPath in JDBC/Thrift)

The remaining parameters needed to support HTTPS (HTTP over SSL) should be:

  • ssl = True/False (could also be handled with transport_mode=https but this seems cleaner).
  • cacert, ca_cert or cafile (Python uses plain PEM cert files instead of Java keystores/truststores
  • OPTIONAL: cert_file, key_file and key_pass - to support SSL client authentication (as intended in Support for thrift over HTTPS for Hive connections #306)

@kent-pawar
Copy link

Hi @joaopedroantonio . Bumping this. Thank you for working on this! Could you pls look at the new feedback. Thank you

@mhconradt
Copy link

I've tested this with my Spark Thrift HiveServer2 deployment behind a load balancer on Kubernetes and it works great.
I'm looking forward to using this to connect Superset to my Spark Thrift server and would like to see this accepted soon.
Thanks @joaopedroantonio and everyone in the thread.

@gthomas-slack
Copy link

gthomas-slack commented Dec 14, 2020

For anyone who needs to use http/https HiveServer2 in production right now, It's actually possible with the current release of PyHive, as the HiveServer2 Connection class exposes a thrift_transport which allows you to pass in a custom configured transport to use: https://github.com/dropbox/PyHive/blob/master/pyhive/hive.py#L112-L113

Here is a basic example using HTTPS and adding a custom authentication header:

from pyhive import hive
import base64
import thrift.transport.THttpClient


def thrift_http_transport():
    transport = thrift.transport.THttpClient.THttpClient(uri_or_host='https://my-hiveserver2.com:443/cliservice')

    auth_credentials = '{}:{}'.format('test', 'test').encode('UTF-8')
    auth_credentials_base64 = base64.standard_b64encode(auth_credentials).decode('UTF-8')
    transport.setCustomHeaders(
        {
            'Authorization': 'Basic {}'.format(auth_credentials_base64),  # HiveServer2 BASIC auth
            'X-Auth-Token': 'xxx' # Custom header to auth with some kind of middleware
        }
    )
    return transport


conn = hive.connect(thrift_transport=thrift_http_transport())
cursor = conn.cursor()
cursor.execute("""SELECT SUM(1) from model.dim_date""")
data = cursor.fetchall()
cursor.close()
conn.close()
print(data)

@mithunjmistry
Copy link

Can we use http protocol with SQLAlchemy?

@ajprabhu09
Copy link

Does anyone have an example for http connection with kerberos auth ?

@gthomas-slack
Copy link

FYI there is currently a bug using HTTP transport connections when using thrift 0.15.0, some details here:
#417 (comment)

@CLAassistant
Copy link

CLAassistant commented Apr 16, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ Joao Antonio
❌ joaopedroantonio


Joao Antonio seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@danjampro
Copy link

Hi @joaopedroantonio are there any updates on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.