Access Issue #504

Closed
devilaadi opened this issue Nov 26, 2021 · 3 comments

Comments


devilaadi commented Nov 26, 2021

Hi,

I am facing an authorization issue while loading a table.

self._table = RawDeltaTable(table_uri, version=version)
deltalake.PyDeltaTableError: Failed to load checkpoint: Failed to read checkpoint content: Generic error: HTTP error status (status: 403, body: "\u{feff}AuthorizationFailure: This request is not authorized to perform this operation.
RequestId:173822eb-301e-009e-4c9a-e2f970000000
Time:2021-11-26T07:49:44.2187383Z")
127.0.0.1 - - [26/Nov/2021 08:49:43] "GET /hello HTTP/1.1" 500 -
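
For reference, a minimal sketch that reproduces the same checkpoint load outside Flask (the account, container, and table names below are placeholders mirroring the masked values in this report):

import os
from deltalake import DeltaTable

# Placeholder credentials; the real values are masked with * in this issue.
os.environ["AZURE_STORAGE_ACCOUNT"] = "straccount"
os.environ["AZURE_STORAGE_KEY"] = "***"

# The 403 above is raised from this constructor, while the Rust core reads
# the _delta_log checkpoint files.
dt = DeltaTable("abfss://containername@straccount.dfs.core.windows.net/deltatablefolder/")
print(dt.files())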

Code

# Importing the flask module is mandatory:
# an object of the Flask class is our WSGI application.
from flask import Flask
from deltalake import DeltaTable
import os
from typing import Optional, List, Tuple, Any
import adlfs
from urllib.parse import urlparse
import pyarrow
from pyarrow.dataset import dataset, partitioning
from flask import jsonify


os.environ["AZURE_STORAGE_ACCOUNT"] = "*"
os.environ["AZURE_STORAGE_KEY"]='*'

def to_pyarrow_dataset2(
    dt: DeltaTable, fs, container_name, partitions: Optional[List[Tuple[str, str, Any]]] = None
) -> pyarrow.dataset.Dataset:
    """
    Build a PyArrow Dataset using data from the DeltaTable.

    :param dt: the DeltaTable to read from
    :param fs: an fsspec-compatible filesystem (here an adlfs.AzureBlobFileSystem)
    :param container_name: storage container name, prefixed to each file path
    :param partitions: A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
    :return: the PyArrow dataset
    """
    if partitions is None:
        file_paths = dt.file_uris()
    else:
        file_paths = dt.files_by_partitions(partitions)
    paths = [urlparse(curr_file) for curr_file in file_paths]

    empty_delta_table = len(paths) == 0
    if empty_delta_table:
        return dataset(
            [],
            schema=dt.pyarrow_schema(),
            partitioning=partitioning(flavor="hive"),
        )

    # Decide, based on the first file, whether the data is on cloud storage or local
    if paths[0].netloc:
        query_str = ""
        # pyarrow doesn't properly support the AWS_ENDPOINT_URL environment variable
        # for non-AWS S3-like resources. This is a slight hack until such a
        # point when pyarrow learns about AWS_ENDPOINT_URL.
        endpoint_url = os.environ.get("AWS_ENDPOINT_URL")
        if endpoint_url is not None:
            endpoint = urlparse(endpoint_url)
            # This format is specific to the URL scheme inference done inside
            # of pyarrow; consult their tests/dataset.py for examples.
            query_str += (
                f"?scheme={endpoint.scheme}&endpoint_override={endpoint.netloc}"
            )
        # Note: query_str is not used below, since an explicit filesystem is passed in.

        keys = [container_name + curr_file.path for curr_file in paths]
        return dataset(
            keys,
            schema=dt.pyarrow_schema(),
            filesystem=fs,
            partitioning=partitioning(flavor="hive"),
        )
    else:
        return dataset(
            file_paths,
            schema=dt.pyarrow_schema(),
            format="parquet",
            partitioning=partitioning(flavor="hive"),
        )

storage_options = {
    "account_name": "*",
    "account_key": "*",
}

fs = adlfs.AzureBlobFileSystem(**storage_options)

# Flask constructor takes the name of
# current module (__name__) as argument.
app = Flask(__name__)
 
# The route() function of the Flask class is a decorator
# that tells the application which URL should trigger
# the associated function.

@app.route("/")
def hello():
    return "ok"

@app.route('/hello')
def hello_name():
    dt = DeltaTable("abfss://containername@straccount.dfs.core.windows.net/deltatablefolder/")
    df = to_pyarrow_dataset2(dt, fs, 'shared').to_table().to_pandas()
    
    return jsonify(df.to_dict(orient='records'))
 
# main driver function
if __name__ == '__main__':
 
    # run() method of Flask class runs the application
    # on the local development server.
    app.run()

The storage account name, account key, and table name have been hidden with *.
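
Running the app with python app.py and requesting the /hello route reproduces the 500 response from the traceback above; a minimal sketch of that check (assuming the requests package and Flask's default port 5000):

import requests

# The route that loads the DeltaTable returns HTTP 500 while the 403 persists.
print(requests.get("http://127.0.0.1:5000/hello").status_code)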


roeap (Collaborator) commented Nov 29, 2021

Hi @devilaadi, as per our discussion on Slack, can this issue be closed?


roeap (Collaborator) commented Apr 22, 2022

@devilaadi - is this still relevant, or can we close this issue?


roeap (Collaborator) commented May 9, 2022

Closing this issue, since there has been no further feedback and we have validated that the Azure integration does indeed work.
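
For anyone hitting a similar 403, a quick sanity check that is independent of deltalake is to list the table's _delta_log directory with adlfs directly (a sketch with placeholder names):

import adlfs

# Placeholder credentials, mirroring the snippet in the report above.
fs = adlfs.AzureBlobFileSystem(account_name="straccount", account_key="***")

# If this listing also fails with a 403, the problem lies with the storage
# credentials or network rules rather than with the deltalake Azure integration.
print(fs.ls("containername/deltatablefolder/_delta_log"))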
