Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(mssql): use odbc #7317

Merged
merged 2 commits into from
Oct 11, 2023
Merged

feat(mssql): use odbc #7317

merged 2 commits into from
Oct 11, 2023

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Oct 9, 2023

This PR moves the MSSQL backend to use ODBC as the underlying driver. This is necessary to support connecting to Azure services. The main user-facing API change is that you now need to set up ODBC to use the MSSQL backend.

Closes #6640.
Closes #6012.

xref #7306

@cpcloud cpcloud added feature Features or general enhancements refactor Issues or PRs related to refactoring the codebase mssql The Microsoft SQL Server backend labels Oct 9, 2023
@cpcloud cpcloud changed the base branch from master to 8.x.x October 9, 2023 14:28
@cpcloud cpcloud force-pushed the mssql-pyodbc branch 4 times, most recently from 6ff2b45 to ebb8f0c Compare October 9, 2023 14:46
@inigohidalgo
Copy link
Contributor

inigohidalgo commented Oct 9, 2023

Wow, that was so fast. Thank you so much for prioritziing this!

I think I am slowly getting it to work, I've had to modify the code a little bit to get it to accept the parameters I needed to. These are some very rough changes I made just to brute-force get the parameters through, I'm sure there's a cleaner way to do this.

inigohidalgo@1805dda

conn_ib = ibis.mssql.connect(
    host=None,
    port=None,
    user=None,
    query={"odbc_connect": f"Driver=ODBC+Driver+17+for+SQL+Server;SERVER={server};DATABASE={database}"},
    **connect_args
)

where connect_args are those I defined here #7306

It seems that the odbc_connect kwarg I first deleted in the script I posted in the discussion is actually making all the strings passthrough to pyodbc directly sqlalchemy docs

EDIT: just to confirm, using the code block I posted above using the version of the code in my branch, I already successfully queried a simple table! 🙌

updates:
- [github.com/pre-commit/pre-commit-hooks: v4.4.0 → v4.5.0](pre-commit/pre-commit-hooks@v4.4.0...v4.5.0)
@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

@inigohidalgo Nice!

I incorporated most of your changes, they look reasonable to me, thank you for that!

@inigohidalgo
Copy link
Contributor

I'm not a huge fan of passing

f"SERVER={server};DATABASE={database}"

as a string in the query field when ibis' API already offers those arguments as part of the connect functions. Those arguments get passed to self._build_alchemy_url but the output of that call returns the wrong format for pyodbc connections.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

@inigohidalgo What does your ideal call to ibis.connect (or ibis.mssql.connect) look like? Maybe there's a way we can improve it.

@inigohidalgo
Copy link
Contributor

inigohidalgo commented Oct 10, 2023

I'm going to have a look around because from what I see I should be able to get the default _build_alchemy_url to work with the host and database parameters: https://stackoverflow.com/a/46101255/9807171

EDIT: although my host looks like "XXX-XXX-XXX.database.windows.net" instead of just a single "host", not sure if that makes a difference.

@inigohidalgo
Copy link
Contributor

Yeah I'm not able to connect without passing "SERVER={server};DATABASE={database}" inside the raw string.

In my ideal call, the host and database parameters would be added to that query={"odbc_connect": f"Driver=ODBC+Driver+17+for+SQL+Server;SERVER={server};DATABASE={database}"} but that seems like a very specific implementation just for my usecase, and it's adding maintenance burden to ibis instead of being handled by sqlalchemy. I'm trying to see if there's a URL builder in SQLAlchemy which handles this but I'm not seeing it anywhere.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

I'm not sure we can special case odbc_connect usage since it seems there are many ways to connect with odbc. In CI for example, we're using a combination of odbc driver parameters and user/host/password etc.

@cpcloud cpcloud force-pushed the mssql-pyodbc branch 4 times, most recently from 7af2a85 to f09ba0f Compare October 10, 2023 10:06
@inigohidalgo
Copy link
Contributor

Absolutely, it wouldn't make any sense to have such a specific implementation, especially when it seems it SHOULD work from what I'm reading online.

According to this SQLAlchemy should be converting the connection string to the "SERVER={server};DATABASE={database}" but it isn't working.

Also apparently SQLAlchemy adds a parameter Trusted_Connection=Yes which I'm supposed to remove, but I'm having issues understanding where I'm supposed to remove it

@event.listens_for(engine, "do_connect")
def provide_token(dialect, conn_rec, cargs, cparams):
    # remove the "Trusted_Connection" parameter that SQLAlchemy adds
    cargs[0] = cargs[0].replace(";Trusted_Connection=Yes", "")

    # create token credential
    raw_token = azure_credentials.get_token(TOKEN_URL).token.encode("utf-16-le")
    token_struct = struct.pack(f"<I{len(raw_token)}s", len(raw_token), raw_token)

    # apply it to keyword arguments
    cparams["attrs_before"] = {SQL_COPT_SS_ACCESS_TOKEN: token_struct}

@inigohidalgo
Copy link
Contributor

At worst since all our connections are going to be done through my kedro wrapper, I can handle this on my side, but that would make that wrapper implementation less of a general Ibis wrapper and more of an Ibis+Azure one.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

I think to attach that event listener you'd do this:

con = ibis.mssql.connect(...)


@event.listens_for(con.con, "do_connect")
def provide_token(dialect, conn_rec, cargs, cparams):
    ...

@inigohidalgo
Copy link
Contributor

That did the trick!

import struct
from sqlalchemy import create_engine, event
from sqlalchemy.engine.url import URL
from azure import identity
import ibis


SQL_COPT_SS_ACCESS_TOKEN = 1256  # Connection option for access tokens, as defined in msodbcsql.h
TOKEN_URL = "https://database.windows.net/"  # The token URL for any Azure SQL database

connection_string = "mssql://XXX.database.windows.net/XXX?driver=ODBC+Driver+17+for+SQL+Server"

conn = ibis.connect(connection_string)

azure_credentials = identity.DefaultAzureCredential()

@event.listens_for(conn.con, "do_connect")
def provide_token(dialect, conn_rec, cargs, cparams):
    # remove the "Trusted_Connection" parameter that SQLAlchemy adds
    cargs[0] = cargs[0].replace(";Trusted_Connection=Yes", "")

    # create token credential
    raw_token = azure_credentials.get_token(TOKEN_URL).token.encode("utf-16-le")
    token_struct = struct.pack(f"<I{len(raw_token)}s", len(raw_token), raw_token)

    # apply it to keyword arguments
    cparams["attrs_before"] = {SQL_COPT_SS_ACCESS_TOKEN: token_struct}

conn.table(name="INPUT_VARIABLES_CONN_PT", schema="IGD").to_pandas()

One issue I have run into is that I'm having to monkeypatch query["driver"] = "ODBC Driver 17 for SQL Server" inside the mssql.Backend.do_connect because it is being lost from the connection_string at some point.

Just a heads up I'm gonna be quite busy for the next 24-48h so I'll be a little bit less responsive.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

One issue I have run into is that I'm having to monkeypatch query["driver"] = "ODBC Driver 17 for SQL Server"

I think you should be able to write

conn = ibis.connect(
    "mssql://XXX.database.windows.net/XXX",
    query=dict(driver="ODBC+Driver+17+for+SQL+Server"),
)

Any **kwargs you pass to connect should be forwarded on to do_connect.

Can you give that a try?

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Given the complexity of using tokens, I think it makes sense to add a token_provider parameter to the MSSQL backend's do_connect so that you can define your token provider callable without having to manage sqlalchemy events or depend on creating an ibis backend first.

That would make the connection code look like this:

import struct
from sqlalchemy import create_engine, event
from sqlalchemy.engine.url import URL
from azure import identity
import ibis


SQL_COPT_SS_ACCESS_TOKEN = 1256  # Connection option for access tokens, as defined in msodbcsql.h
TOKEN_URL = "https://database.windows.net/"  # The token URL for any Azure SQL database

azure_credentials = identity.DefaultAzureCredential()

def provide_token(dialect, conn_rec, cargs, cparams):
    # remove the "Trusted_Connection" parameter that SQLAlchemy adds
    cargs[0] = cargs[0].replace(";Trusted_Connection=Yes", "")

    # create token credential
    raw_token = azure_credentials.get_token(TOKEN_URL).token.encode("utf-16-le")
    token_struct = struct.pack(f"<I{len(raw_token)}s", len(raw_token), raw_token)

    # apply it to keyword arguments
    cparams["attrs_before"] = {SQL_COPT_SS_ACCESS_TOKEN: token_struct}


connection_string = "mssql://XXX.database.windows.net/XXX?driver=ODBC+Driver+17+for+SQL+Server"
conn = ibis.connect(connection_string, token_provider=provide_token)
conn.table(name="INPUT_VARIABLES_CONN_PT", schema="IGD").to_pandas()

@inigohidalgo
Copy link
Contributor

Indeed that worked!

What is the more commonly-recommended API, ibis.connect(resource) or ibis.backend.connect(host...)? It seems like as things stand I should be ready to start working using the ibis.connect API.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Start with ibis.connect. We try to make sure that it has parity with ibis.${backend}.connect, and it doesn't require you to remember specific argument names (user vs username, etc).

If there's something you can't do with ibis.connect that can do with ibis.${backend}.connect then it's probably a bug 😄

@inigohidalgo
Copy link
Contributor

Regarding token_provider=provide_token, would that be an implementation on Ibis' side?

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Regarding token_provider=provide_token, would that be an implementation on Ibis' side?

Yep.

It's basically this code at the end of ibis.backends.mssql.Backend.do_connect:

        if token_provider is not None:
            sa.event.listens_for(engine, "do_connect")(token_provider)

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

The main benefit is that you can put your token providing code wherever you want. It doesn't need to be right next to the connection.

@inigohidalgo
Copy link
Contributor

Cool! Is that something you'd be okay with adding into Ibis? I assume there'll be more ppl who have the same issue but when I was searching the issues I couldn't find anyone with the same question as me.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Cool! Is that something you'd be okay with adding into Ibis?

Personally, yes, but I'll put up a separate PR for it where we can discuss whether folks think it's a good idea or not.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

My hunch is that once people are able to use Azure services with ibis then it'll be an issue :)

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

One thing to note is that this is likely not going to be released until 8.0 because it's a pretty big change for MS SQL users (unclear how many of those we have).

It seems like many of the would-be MS SQL users aren't currently using ibis because they require odbc support, so for them this is going from zero to one, but for any existing users they'll have to migrate over to PyODBC.

@inigohidalgo
Copy link
Contributor

this is likely not going to be released until 8.0 because it's a pretty big change for MS SQL users

Not a problem for us, it'll take a while for us to migrate our pipelines to Ibis so in the meantime I'm happy to just build using either :8.x.x or your mssql-pyodbc branch if it's yet to be merged.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Sweet, yeah, assuming this gets merged, it'll be done in the 8.x.x branch which we'll rebase on top of master every so often. That'll run through CI like everything else.

@inigohidalgo
Copy link
Contributor

One final kink I see is the difference between Ibis' driver kwarg which defaults to mssql+pyodbc and then the driver I pass in the query parameter. Do you think it makes sense to somehow merge those in some way, or at least distinguish between these? Because I think pyodbc usually requires you to specify the driver version, the naming might be a little bit confusing.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Perhaps we can remove the query parameter?

@inigohidalgo
Copy link
Contributor

And provide a pyodbc_driver parameter which gets added to the query inside do_connect?

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Is the current driver parameter not enough for that? For example, why wouldn't we just stuff driver into a query parameter dict inside do_connect?

@inigohidalgo
Copy link
Contributor

inigohidalgo commented Oct 10, 2023

I think the driver you're using atm informs the connection string mssql+pyodbc which tells SQLAlchemy which dialect to use, whereas the pyodbc driver gets added to the end of the connection string and goes through to pyodbc.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

I think the driver you're using atm informs the connection string mssql+pyodbc which tells SQLAlchemy which dialect to use, whereas the pyodbc driver gets added to the end of the connection string and goes through to pyodbc.

Ah, no, but that's how it used to be before this PR!

In this PR, query and driver are a bit redundant: driver is being stuffed into query.

A point of confusion is that there's the driver argument to URL.create and then the driver query parameter which has nothing to do with the first 😂

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

Hm, I might need to tinker with this a bit to work out how url-based and non-url-based connection works.

@inigohidalgo
Copy link
Contributor

But at the moment, using ibis.connect, driver is everything before the :// whereas I am currently providing the ODBC driver using the query kwarg

#7317 (comment)

Would you keep the query argument in ibis.connect but then ibis.mssql.connect the driver arg would get added into the query dict?

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

But at the moment, using ibis.connect, driver is everything before the :// whereas I am currently providing the ODBC driver using the query kwarg

Those are two different drivers, unrelated to each other.

The first (mssql://) is SQLAlchemy's notion of a driver. I'm currently hardcoding that into _build_alchemy_url as mssql+pyodbc. This driver is a fixed value.

The second is the PyODBC notion of driver, which is the thing being put into query. The driver argument to do_connect corresponds to the PyODBC notion of driver.

@inigohidalgo
Copy link
Contributor

Ok, that makes sense. I'm not sure if it might lead to a bit of confusion down the line, but for a start, keeping the query in ibis.connect but using the driver in do_connect to inform the PyODBC driver through the query could work.

@cpcloud
Copy link
Member Author

cpcloud commented Oct 10, 2023

I think to make query and driver less confusing will require some deeper work, across backends, to differentiate query parameters from connect_args.

Copy link
Member

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kszucs kszucs merged commit d5e3db5 into ibis-project:8.x.x Oct 11, 2023
85 checks passed
@cpcloud cpcloud deleted the mssql-pyodbc branch October 17, 2023 08:56
@cpcloud cpcloud mentioned this pull request Nov 16, 2023
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements mssql The Microsoft SQL Server backend refactor Issues or PRs related to refactoring the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants