dbx does not use credential passthrough #864

Open
mathurk1 opened this issue Apr 26, 2024 · 0 comments

Expected Behavior

I am working with Azure Databricks. I have a cluster with credential passthrough, which allows me to read data stored in ADLS Gen2 using my own identity. I can simply log into the Databricks workspace, attach a notebook to the cluster, and query the Delta tables in ADLS Gen2 without any setup.
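For example, a notebook cell like the one below (placeholder path, same shape as my actual read) works on that cluster with no service principal or storage key configured:

    # Notebook attached to the credential-passthrough cluster -- reads with my own
    # identity, nothing else configured (path is a placeholder):
    df = (
        spark.read.format("delta")
        .load("abfss://containername@storageaccount.dfs.core.windows.net/path/to/table")
    )
    display(df.limit(10))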

I would expect that when I submit dbx execute --cluster-id cluster123 --job jobABC to the same cluster, the job would be able to read those datasets from ADLS Gen2 using my identity.

Thanks!

Current Behavior

Currently, the job fails when I dbx execute it to the same cluster, with the following error:

Py4JJavaError: An error occurred while calling o469.load.
: com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen2 Token
        at com.databricks.backend.daemon.data.client.adl.AdlGen2UpgradeCredentialContextTokenProvider.$anonfun$getToken$1(AdlGen2UpgradeCredentialContextTokenProvider.scala:37)
        at scala.Option.getOrElse(Option.scala:189)
        at com.databricks.backend.daemon.data.client.adl.AdlGen2UpgradeCredentialContextTokenProvider.getToken(AdlGen2UpgradeCredentialContextTokenProvider.scala:31)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAccessToken(AbfsClient.java:1371)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:306)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:238)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:211)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:464)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:209)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:1213)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:1194)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:437)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1107)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:901)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:891)

From my understanding, it is expecting a service principal or storage account keys to be configured, as sketched below.
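For comparison, this is roughly the kind of service-principal (OAuth) setup the error seems to expect, which is exactly what I am trying to avoid. All names below are placeholders, not my actual configuration:

    # Hypothetical workaround (NOT what I want): authenticate to ADLS Gen2 with a
    # service principal instead of credential passthrough. Placeholder values only.
    account = "storageaccount.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
    spark.conf.set(
        f"fs.azure.account.oauth.provider.type.{account}",
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    )
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<application-id>")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", "<client-secret>")
    spark.conf.set(
        f"fs.azure.account.oauth2.client.endpoint.{account}",
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    )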

Steps to Reproduce (for bugs)

  1. Clone the charming-aurora repo: https://github.com/gstaubli/dbx-charming-aurora
  2. Run dbx configure --token to set up the link with the Databricks workspace.
  3. Add a new job to the conf/deployment.yml file:
      - name: "my-test-job"
        spark_python_task:
          python_file: "file://charming_aurora/tasks/sample_etl_task.py"
          parameters: [ "--conf-file", "file:fuse://conf/tasks/sample_etl_config.yml" ]
  4. Update the sample ETL task to read an ADLS Delta table - https://github.com/gstaubli/dbx-charming-aurora/blob/main/charming_aurora/tasks/sample_etl_task.py
      # inside the sample task class; assumes `from pyspark.sql import functions as f`
      # at the top of the file
      def _write_data(self):
          df = (
              self.spark.read.format("delta")
              .load(
                  "abfss://containername@storageaccount.dfs.core.windows.net/path/to/table"
              )
              .filter(f.col("date") == "2024-01-01")
          )
          print(df.count())
  5. Submit the job: dbx execute --cluster-id=cluster-id-with-credential-passthrough --job my-test-job
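If it helps with triage, a small check like the one below could be added to the task to confirm the cluster reports passthrough as enabled when running via dbx execute (the conf key is from the Azure Databricks credential passthrough docs; this snippet is my own addition, not part of the sample task):

    # Quick sanity check inside the dbx-executed task: does the cluster report
    # credential passthrough as enabled, even though the ADLS token lookup fails?
    print(self.spark.conf.get("spark.databricks.passthrough.enabled", "not set"))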

Context

I specifically want to "dbx execute" against my interactive cluster rather than create a job cluster.

Your Environment

  • dbx version used: 0.8.18
  • Databricks Runtime version: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)