
Table schema is not written to hive metastore #695

Closed
bn2302 opened this issue Jun 11, 2021 · 14 comments
Labels
acknowledged This issue has been read and acknowledged by Delta admins

Comments


bn2302 commented Jun 11, 2021

Description
When writing a Delta table using PySpark, the table schema is not written into the Hive metastore. When querying the table through the Spark Thrift Server via JDBC, I can't see the columns.


Steps
The table is created using:

df.write.format("delta").saveAsTable("mytable")

Results
The information in the Hive metastore is as follows:

SELECT * FROM TABLE_PARAMS

| TBL_ID | PARAM_KEY | PARAM_VALUE |
|---|---|---|
| 5 | spark.sql.create.version | 3.1.2 |
| 5 | numFiles | 6 |
| 5 | spark.sql.sources.provider | delta |
| 5 | transient_lastDdlTime | 1623398918 |
| 5 | totalSize | 2688705839 |
| 5 | spark.sql.partitionProvider | catalog |
| 5 | spark.sql.sources.schema.numParts | 1 |
| 5 | spark.sql.sources.schema.part.0 | {"type":"struct","fields":[]} |

with the following warning in the spark session:

21/06/11 07:08:37 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table default.myname into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

Environments
Tested configurations:

PySpark: 3.1.2
Hadoop: 3.2.0
Delta: 1.0.0
Hive metastore: 3.0.0

PySpark: 3.1.2
Hadoop: 3.2.0
Delta: 1.0.0
Hive metastore: 2.7.3

PySpark: 3.1.1
Hadoop: 3.2.0
Delta: 1.0.0
Hive metastore: 2.7.3

Settings
conf.set("spark.hadoop.hive.metastore.client.connect.retry.delay", "5")
conf.set("spark.hadoop.hive.metastore.client.socket.timeout", "1800")
conf.set("spark.hadoop.hive.metastore.uris", "thrift://metastore.hive-metastore.svc.cluster.local:9083")
conf.set("spark.hadoop.hive.input.format", "io.delta.hive.HiveInputFormat")
conf.set("spark.hadoop.hive.tez.input.format", "io.delta.hive.HiveInputFormat")
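
For context, a rough PySpark sketch of how a session with these settings and Delta enabled is typically assembled; the app name and sample DataFrame are hypothetical, and the Delta extension/catalog configs are the standard ones rather than values confirmed in this report:

```python
from pyspark.sql import SparkSession

# Minimal session sketch: external Hive metastore plus the standard Delta Lake
# extension and catalog settings (assumed here, not taken from the report).
spark = (
    SparkSession.builder
    .appName("delta-metastore-repro")  # hypothetical app name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.hive.metastore.uris",
            "thrift://metastore.hive-metastore.svc.cluster.local:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Writing a DataFrame as a Delta table registers it in the metastore, but only
# in Spark's own format: the persisted schema parts come out empty, as the
# TABLE_PARAMS dump above shows.
df = spark.range(5).withColumnRenamed("id", "value")
df.write.format("delta").mode("overwrite").saveAsTable("mytable")
```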


azsefi commented Jun 11, 2021

This happens because the Thrift Server's SparkGetColumnsOperation currently uses the old SessionCatalog, which relies on the Hive metastore. Instead, TableCatalog should be used, which delegates to the correct catalog for different table providers (the Delta provider in your case).
https://github.com/apache/spark/blob/e958833c727442fc9efa4fc92f93db16cd5c8476/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala#L57
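
To make the two code paths concrete, a rough PySpark illustration (a sketch using the `mytable` name from the report above, not something verified here):

```python
# Inside the Spark session, the Delta source resolves the schema from the
# table's _delta_log, so the columns show up fine:
spark.table("mytable").printSchema()
spark.sql("DESCRIBE TABLE mytable").show(truncate=False)

# The JDBC GetColumns metadata call, however, is answered by
# SparkGetColumnsOperation from the Hive-backed SessionCatalog entry, whose
# persisted schema (spark.sql.sources.schema.part.0 above) is empty, hence
# the blank column list in the client.
```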


bn2302 commented Jun 11, 2021

As a workaround, I created a view on the Delta table:

spark.sql("CREATE OR REPLACE VIEW mytable_view AS SELECT * FROM mytable;")

This allows me to view the columns and also query the data from Power BI.
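
For anyone sanity-checking the view workaround from within the session, a rough sketch (not part of the original comment):

```python
# Recreate the workaround view and confirm its columns resolve before
# pointing JDBC clients (e.g. Power BI via the Thrift Server) at it.
spark.sql("CREATE OR REPLACE VIEW mytable_view AS SELECT * FROM mytable")
spark.sql("DESCRIBE mytable_view").show(truncate=False)
```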


ghost commented Jun 13, 2021

The view creation worked for me using DBeaver, though I have a problem with PowerBI reading the views. Can you tell me the driver and settings you are using to connect PowerBI?


bn2302 commented Jun 15, 2021

I use the following settings:
Server: myserver:10000
Protocol: Standard
Data connection mode: DirectQuery

Since I haven't configured any auth for the STS yet, I just put in an arbitrary user/password combination.

@Data-drone

> This happens because the Thrift Server's SparkGetColumnsOperation currently uses the old SessionCatalog, which relies on the Hive metastore. Instead, TableCatalog should be used, which delegates to the correct catalog for different table providers (the Delta provider in your case).
> https://github.com/apache/spark/blob/e958833c727442fc9efa4fc92f93db16cd5c8476/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala#L57

Is there a way to fix it through the way we start the Thrift Server?


azsefi commented Jun 22, 2021 via email


zsxwing commented Jul 9, 2021

Since this is a Spark issue, it would be great to open a new ticket with the Spark community instead. You can raise a ticket at https://issues.apache.org/jira/projects/SPARK

@dennyglee

By any chance, @bn2302, were you able to create a new ticket in the Spark community? It would be great if you could link it here prior to closing. Thanks!

@dennyglee dennyglee added acknowledged This issue has been read and acknowledged by Delta admins need author feedback Issue is waiting for the author to respond labels Oct 12, 2021
@hanna-liashchuk

I've created a Spark issue here: https://issues.apache.org/jira/browse/SPARK-37648
Meanwhile, @azsefi, could you please share how you fixed it on your side? I could probably use it until a proper fix lands.

@dennyglee dennyglee removed the need author feedback Issue is waiting for the author to respond label Dec 16, 2021
@dennyglee

Thanks, @hanna-liashchuk - appreciate you creating the issue. And yes, @azsefi, if you could share what you did, that would be super helpful. Thanks!


pan3793 commented Dec 20, 2021

Hi @bn2302, we have a workaround for this issue in Apache Kyuubi (Incubating): apache/kyuubi#1476
Kyuubi can be considered a more powerful Spark Thrift Server; it's worth a try.

@dennyglee

Closing this issue as it is a Spark bug; please re-open if this is incorrect. Thanks!

@nitindatta

Hi, I looked at this issue in Spark and it is still open. Apart from creating a view over the table, is anyone aware of any other way to get the Delta table schema directly into Hive?

@felipepessoto

Isn't this fixed by #2409?

Related: #1478
