Fix for: HiveMetastore outputFormat should not be accessed from a null StorageFormat (#6972) #9837
Conversation
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
To fix issue #6972
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Here is the issue: We have some Hive tables that sit on top of Kinesis streams.
When creating these tables in Hive, the metastore automatically records NULL for the INPUTFORMAT and OUTPUTFORMAT of these tables. In earlier versions of Presto this was not an issue. With newer versions of Presto (i.e. > 0.152), this causes an error, because if even one table in the metastore has a NULL input/output format, then the metadata of any other table cannot be fetched through Presto.
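The failure mode described above can be sketched roughly as follows. This is a minimal, hypothetical illustration (the class and method names are not Presto's real ones); the only thing taken from the thread is the error message and the fact that a strict accessor throws when the stored format is null:

```java
// Hypothetical sketch of a strict storage-format accessor: any table whose
// OUTPUTFORMAT was stored as NULL (e.g. a Kinesis-backed table) makes the
// metadata fetch throw, which is the behavior reported for Presto > 0.152.
public class StorageFormatSketch {
    static final class StorageFormat {
        final String inputFormat;   // may be null for Kinesis-backed tables
        final String outputFormat;  // may be null as well

        StorageFormat(String inputFormat, String outputFormat) {
            this.inputFormat = inputFormat;
            this.outputFormat = outputFormat;
        }

        String getOutputFormat() {
            if (outputFormat == null) {
                // mirrors the error message quoted in the PR title
                throw new IllegalStateException(
                        "outputFormat should not be accessed from a null StorageFormat");
            }
            return outputFormat;
        }
    }

    public static void main(String[] args) {
        StorageFormat kinesisTable = new StorageFormat(null, null);
        try {
            kinesisTable.getOutputFormat();
        } catch (IllegalStateException e) {
            System.out.println("metadata fetch failed: " + e.getMessage());
        }
    }
}
```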
Does anything block merging this?
+1
@findepi any plans to merge this PR?
@pdanilew I haven't thought about how the problem should be addressed, so I didn't plan to review this, but if you take a look at the code in this PR, you will see it is not ready for review just yet.
@findepi thanks for stepping in – do you mean that the commented blocks should be removed, or is there something less obvious that needs changing? We've been patching our Presto in production with this exact diff for half a year now and it's worked fine. I wonder if @pashalogin is still around to update this, or should someone else continue with the editing?
@juhoautio yes, I meant the commented-out code, and the failing Travis build.
Now you're touching upon the functional merits of this change. As I said, I don't have the context to judge these.
@juhoautio, how do you know this is the correct behavior for Hive? Also, is this hard-coded in Hive or configurable?
We have been running this fix in our production environment, but we have not been able to get the code merged. Can anyone review this code?
@pashalogin, I can but I will need answers to these questions:
TextInputFormat is the default InputFormat of MapReduce.
@pashalogin Are the "defaults" configurable in MapReduce? I am asking because I need to know whether we are expected to make this configurable in Presto.
@dain sorry, I don't have answers to your questions. I'm assuming that @pashalogin will provide them. If not, I could look into it. Thanks all!
When you create external Hive tables on a Kinesis stream, the external Hive metastore captures null values for INPUTFORMAT and OUTPUTFORMAT. This works well in Hive, as it picks the default values, but it fails in the Presto connector, which throws an exception if these values are null. It is not mandatory to set these values. This check was introduced in newer versions of Presto (i.e. > 0.152).
As a side note, @pashalogin, I suggest you file a bug with the Kinesis team, as they should really take the time to declare the format to everyone... it's just cleaner.
This documentation is more user facing, and what we need to understand is what the code is doing. Here is what I believe is the relevant section in Hive: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L502 I believe this code is trying the following for input and output format:
For SerDe, it seems to allow null. Assuming my analysis is correct, that would mean your table has the input and output formats set to text. Is that true?
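Assuming the reading of `MetaStoreUtils` above is right, the defaulting could be sketched like this. The two default class names are Hive's conventional text formats; the helper methods themselves are illustrative, not copied from Hive's API:

```java
// Hedged sketch of the fall-back-to-text-defaults behavior discussed above.
// The constant values are Hive's well-known text format classes; the helper
// methods are illustrative, not MetaStoreUtils' actual signatures.
public class HiveFormatDefaults {
    static final String DEFAULT_INPUT_FORMAT =
            "org.apache.hadoop.mapred.TextInputFormat";
    static final String DEFAULT_OUTPUT_FORMAT =
            "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

    // If the metastore stored null, fall back to the text default.
    static String inputFormatOrDefault(String stored) {
        return stored != null ? stored : DEFAULT_INPUT_FORMAT;
    }

    static String outputFormatOrDefault(String stored) {
        return stored != null ? stored : DEFAULT_OUTPUT_FORMAT;
    }

    // SerDe, by contrast, is reportedly allowed to remain null.

    public static void main(String[] args) {
        System.out.println(inputFormatOrDefault(null));
        System.out.println(outputFormatOrDefault(null));
    }
}
```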
I initially approached AWS for a fix, but they stated it is a Presto issue, as it works fine in Hive.
Yes, if it is null, it is set to the default values instead of throwing an error message.
Not sure if this helps, but in our case the table was created in Hive with:

```sql
CREATE TEMPORARY EXTERNAL TABLE temp_dynamo_table (
  UserId STRING,
  SourceGame STRING,
  LastModified BIGINT,
  Scores STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  'dynamodb.table.name' = 'UserScores',
  'dynamodb.column.mapping' = 'UserId:UserId,SourceGame:SourceGame,LastModified:LastModified,Scores:Scores',
  'dynamodb.throughput.read.percent' = '0.2',
  'dynamodb.throughput.write.percent' = '1.0'
);
```

Then (still in Hive):

```sql
DESCRIBE FORMATTED temp_dynamo_table;
```
If you can spot something wrong here, please indicate so that it can be reported to AWS. This kind of table causes the `outputFormat should not be accessed from a null StorageFormat` error; by patching Presto with the changes in this PR that error can be mitigated. I haven't tried reading this table in Presto or writing to it. As a user I find it important that Presto doesn't return an error in an information_schema query even if such tables exist in the metastore. Having proper read & write support for such tables seems like another story, and I believe the other users who have commented on this issue don't really need that right now(?). Nothing wrong with implementing it properly, of course. IMHO it would be fine to have this fix ASAP and improve later if read/write with such tables should be officially supported. Thank you once again.
@pashalogin, I'm not saying it is a bug, but they are definitely relying on risky legacy behavior. The problem is that if they do not record the formats in the partition, and the user changes the table-level format (which describes how new partitions should be added by default), all existing partitions would no longer be readable. Recording the formats is safer and is what Hive does by default nowadays... so not technically a bug, but a good improvement to make.
@juhoautio I've never heard of
@dain indeed I don't mind if Presto doesn't support read/write for tables with a null StorageFormat. I'm only hoping that Presto would tolerate the existence of tables with a null StorageFormat in the metastore, i.e. allow the table metadata to be queried without returning an error. Is that somehow a bad idea? Thanks for your answers.
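One way to get the tolerance asked for above is to expose the possibly-missing format as `Optional`, so listings can show a blank instead of throwing. This is a sketch of the idea only; the class and method names are illustrative, not Presto's actual API:

```java
// Null-tolerant sketch: information_schema-style listings can call
// getOutputFormat() safely and render a placeholder when the metastore
// stored null, instead of failing the whole query.
import java.util.Optional;

public class TolerantStorageFormat {
    private final String outputFormat; // may legitimately be null in the metastore

    public TolerantStorageFormat(String outputFormat) {
        this.outputFormat = outputFormat;
    }

    public Optional<String> getOutputFormat() {
        return Optional.ofNullable(outputFormat);
    }

    public static void main(String[] args) {
        TolerantStorageFormat kinesisTable = new TolerantStorageFormat(null);
        // listing code renders a placeholder rather than throwing
        System.out.println(kinesisTable.getOutputFormat().orElse("<unset>"));
    }
}
```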
I agree, and there have been some recent PRs to fix this kind of issue. I think the fix to that issue might be different as well (there are lots of things that can go wrong), so can you open an issue with the exact commands you are running and the stack traces you get? This will help us add safety measures to prevent a single bad table from failing metadata commands.
@dain @findepi @juhoautio
@pashalogin Can we close this in favor of #11847?
Yes, we can close this.
#6972