External Hive table broken after delta merge #375
Comments
Did you create the table and query it using Hive? This exception looks like the query tried to use …
@zsxwing not sure if this is related or useful, but this is how we created a Delta table and managed to register it in the Hive metastore (RDS/MySQL), and it seems to work fine on Databricks:

```scala
val someDf = spark.emptyDataset[SomeSchema].toDF

// first create the Delta table
someDf.write.format("delta").partitionBy("day").save(outputPath)

// then register it in the Hive metastore
spark.sql(s"CREATE TABLE ${tableName} USING DELTA LOCATION '${outputPath}'")
```

The only strange behaviour is the unexpected path of the Delta table, as already mentioned here.
@zsxwing We already had an external Hive table registered in the metastore. Later this table was converted to a Delta table, and then it broke. We are unable to query the table in Hive and get the exception above.
@jsaurav Do you mean you have an external Hive table registered using …
NVM. I just saw you already described the workaround in the issue description. Currently …
If you would like to query the Delta table in Hive (Spark support is still waiting for #85), you can use the Hive connector: https://github.com/delta-io/connectors/releases. You will need to drop your table and re-create it with a statement like the one below so that Hive knows how to load the Delta table.
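For illustration, a minimal sketch of such a statement, assuming the `ratings_np` table and placeholder S3 path from this issue; `io.delta.hive.DeltaStorageHandler` is the storage handler class shipped with the Hive connector:

```sql
-- Run in Hive: drop the broken Parquet-backed table, then re-create it
-- through the Delta storage handler so Hive reads the Delta transaction
-- log instead of treating every file under the location as Parquet.
DROP TABLE IF EXISTS ratings_np;

CREATE EXTERNAL TABLE ratings_np(userId INT, movieId INT, rating DOUBLE, time_stamp DATE)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION 's3://<some-location>/data/';
```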
Closing this.
Original issue description:

We have a directory on S3 under which Parquet files are present, and an external Hive table created on these files, e.g.:

```sql
create external table ratings_np(userId int, movieId int, rating double, time_stamp date)
stored as parquet
location "s3://<some-location>/data/";
```
We then performed a Delta operation (open-source Delta Lake) at the same location, which converted it to a Delta table. After the operation completed, it had created a _delta_log directory containing transaction log JSON files.
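For illustration, a hedged sketch of the kind of operation that creates that directory; the issue does not say exactly which command was run, so this shows an in-place conversion with Delta Lake's `CONVERT TO DELTA`, using the placeholder path from above:

```sql
-- Run in Spark SQL with Delta Lake available. This rewrites no data files;
-- it only adds s3://<some-location>/data/_delta_log/ containing JSON
-- transaction log files such as 00000000000000000000.json.
CONVERT TO DELTA parquet.`s3://<some-location>/data/`
```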
Now when I run a query against my Hive table, it throws an exception because of these JSON files.
Exception:

```
Failed with exception java.io.IOException:java.lang.RuntimeException: s3://<some-location>/data/_delta_log/00000000000000000000.json is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [101, 125, 125, 10]
```

(Hive still treats every file under the table location as Parquet: [80, 65, 82, 49] is the ASCII for "PAR1", the magic footer that ends every Parquet file, while [101, 125, 125, 10] is simply the last few bytes of the JSON transaction log.)
After I manually delete the _delta_log directory, I am able to run queries again.
I had looked at the Delta connector solution proposed in @tdas's comment here; I haven't tried it yet, though.
I am wondering if this can be solved in a simpler way, e.g. by configuring the path under which the _delta_log directory is created: a location I could pass as an input so that the Delta process creates the _delta_log directory there instead of where the data is present.