
External Hive table broken after delta merge #375

Closed
jsaurav opened this issue Apr 3, 2020 · 7 comments

@jsaurav

jsaurav commented Apr 3, 2020

We have a directory on S3 under which parquet files are present. We have an external hive table created on these files.

e.g.
create external table ratings_np(userId int, movieId int, rating double, time_stamp date) stored as parquet location "s3://<some-location>/data/";

We then performed a Delta operation (open-source Delta Lake) at the same location, which converted it to a Delta table. When the Delta operation completed, it created a _delta_log directory containing transaction-log JSON files.

Now when I run a query against my Hive table, it throws an exception because of these JSON files.
Exception:
Failed with exception java.io.IOException:java.lang.RuntimeException: s3://<some-location>/data/_delta_log/00000000000000000000.json is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [101, 125, 125, 10]
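The exception comes from the tail-magic check that Parquet readers perform: a valid Parquet file must end with the four bytes `PAR1` (80, 65, 82, 49), while a Delta log entry is JSON text, so its tail bytes (101, 125, 125, 10, i.e. `e}}\n`) fail the check. A minimal sketch of that check, not from the thread (the helper and file name are hypothetical):

```python
import json
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # bytes [80, 65, 82, 49] from the exception message

def looks_like_parquet(path: Path) -> bool:
    """Check the 4-byte tail magic that Parquet readers validate."""
    data = path.read_bytes()
    return len(data) >= len(PARQUET_MAGIC) and data.endswith(PARQUET_MAGIC)

with tempfile.TemporaryDirectory() as d:
    # A Delta transaction-log entry is newline-delimited JSON, not Parquet.
    log_file = Path(d) / "00000000000000000000.json"
    log_file.write_text(json.dumps({"commitInfo": {"operation": "MERGE"}}) + "\n")
    print(looks_like_parquet(log_file))  # → False: the tail is '}}\n', not b'PAR1'
```

This is why Hive, which still believes the table is plain Parquet, fails as soon as it tries to read the files under _delta_log.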

After I manually delete _delta_log directory, I am able to run the queries again.

I have looked at the Delta connector solution proposed in @tdas's comment here. I haven't tried it yet, though.

I am wondering if this can be solved in a simpler way, e.g. by making the path under which the _delta_log directory is created configurable: a location I could pass as input, so that the Delta process creates the _delta_log directory there instead of where the data is.

@zsxwing
Member

zsxwing commented Apr 3, 2020

Did you create the table and query it using Hive? This exception looks like the query tried to use the Parquet format to load a Delta table, so it was confused by the transaction logs under the _delta_log directory.

@abiratsis

abiratsis commented Apr 18, 2020

@zsxwing not sure if this is related or useful, but this is how we created a Delta table and managed to register it in the Hive metastore (RDS/MySQL), and it seems to work fine on Databricks:

val someDf = spark.emptyDataset[SomeSchema].toDF

// first create Delta table
someDf.write.format("delta").partitionBy("day").save(outputPath)

// then register it to hiveMetastore
spark.sql(s"CREATE TABLE ${tableName} USING DELTA LOCATION '${outputPath}'")

The only strange behaviour is the unexpected path of the Delta table, as already mentioned here.

@jsaurav
Author

jsaurav commented Apr 26, 2020

@zsxwing We already had an external Hive table registered in the metastore. Later this table was converted to a Delta table, and then it broke. We are unable to query this table in Hive and get the above exception.

@zsxwing
Member

zsxwing commented Apr 26, 2020

@jsaurav Do you mean you have an external Hive table registered using Parquet? If you convert it to a Delta table, it cannot be queried by Hive. CONVERT TO DELTA on a path doesn't touch anything in the Hive metastore; it just creates Delta transaction logs under the _delta_log directory. You can remove the _delta_log directory to recover.
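A sketch of that layout and of the recovery step, using local paths instead of S3 and faked file contents (all names here are illustrative, not from the thread): CONVERT TO DELTA leaves the existing Parquet data files in place and only adds _delta_log, so deleting that one directory restores a plain Parquet layout.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    table = Path(d) / "data"
    table.mkdir()
    # Pre-existing Parquet data file (contents faked for the sketch).
    (table / "part-00000.parquet").write_bytes(b"PAR1...PAR1")

    # CONVERT TO DELTA only adds transaction logs under _delta_log;
    # the Parquet data files themselves are not rewritten.
    log_dir = table / "_delta_log"
    log_dir.mkdir()
    (log_dir / "00000000000000000000.json").write_text('{"commitInfo": {}}\n')

    # Recovery workaround: remove the transaction log directory.
    shutil.rmtree(log_dir)
    print(sorted(p.name for p in table.iterdir()))  # → ['part-00000.parquet']
```

Note that this recovery discards the Delta transaction history; it only makes sense if you want the directory treated as plain Parquet again.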

@zsxwing
Member

zsxwing commented Apr 26, 2020

NVM. I just saw you already described the workaround in the issue description. Currently CONVERT TO DELTA doesn't support changing information in the Hive metastore. I created #403 to support this case.

@zsxwing
Member

zsxwing commented Apr 26, 2020

> We are unable to query this table in Hive and getting the above exception.

If you would like to query the Delta table in Hive (Spark support is still waiting for #85), you can use the Hive connector: https://github.com/delta-io/connectors/releases . You will need to drop your table and re-create it with the following statement so that Hive knows how to load the Delta table.

CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '/delta/table/path'

@zsxwing
Member

zsxwing commented Apr 7, 2021

Closing this. CONVERT TO DELTA now also updates the table information in Hive Metastore.

@zsxwing zsxwing closed this as completed Apr 7, 2021
tdas pushed a commit to tdas/delta that referenced this issue May 31, 2023
…urce (delta-io#375)
