
Fix for: HiveMetastore outputFormat should not be accessed from a null StorageFormat (#6972) #9837

Closed
wants to merge 1 commit into from

Conversation

pashalogin

@facebook-github-bot
Collaborator

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@pashalogin
Author

This fixes issue #6972.

@facebook-github-bot
Collaborator

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@pashalogin
Author

pashalogin commented Jan 28, 2018

Here is the issue:

We have some hive tables that sit on top of Kinesis streams:
For example:

CREATE EXTERNAL TABLE `l1_XXX.analytical_home_events`(
  `json_data` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED BY 
  'com.amazon.emr.kinesis.hive.KinesisStorageHandler' 
TBLPROPERTIES (
  'kinesis.stream.name'='eds-CCC-prod-stream-home-analytical')

When creating these tables in Hive, the metastore automatically sets the INPUTFORMAT and OUTPUTFORMAT of these tables to NULL.

In earlier versions of Presto, this was not an issue. With newer versions of Presto (i.e. >0.152), this causes an error, because if even one table in the metastore has a NULL input/output format, then the metadata of any other table cannot be fetched through Presto.
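The failure mode can be sketched as follows. This is an illustrative class, not Presto's actual source; only the exception message is taken from the error in the title. An accessor that throws whenever the recorded format is missing means a single such table poisons any metadata listing that touches it.

```java
// Illustrative sketch (not Presto's actual code): a storage format whose
// fields were never recorded in the metastore throws on access, which is
// what surfaces as the error named in the title of this PR.
public class StorageFormat {
    private final String outputFormat; // null for storage-handler tables like the Kinesis one above

    public StorageFormat(String outputFormat) {
        this.outputFormat = outputFormat;
    }

    public String getOutputFormat() {
        if (outputFormat == null) {
            throw new IllegalStateException(
                    "outputFormat should not be accessed from a null StorageFormat");
        }
        return outputFormat;
    }
}
```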

@findepi findepi changed the title https://github.com/prestodb/presto/issues/6972 Fix for: HiveMetastore outputFormat should not be accessed from a null StorageFormat (#6972) Jan 29, 2018
@juhoautio

Does anything block merging this?

@tjlee

tjlee commented Aug 24, 2018

+1

@pdanilew

pdanilew commented Nov 7, 2018

@findepi any plans to merge this PR?

@findepi
Contributor

findepi commented Nov 7, 2018

@pdanilew I haven't thought about how the problem should be addressed, so I didn't plan to review this, but if you take a look at the code in this PR, you will see it is not ready for review just yet.

@juhoautio

@findepi thanks for stepping in – do you mean that the commented-out blocks should be removed, or is there something less obvious that needs changing? We've been patching our Presto in production with this exact diff for half a year now and it's worked fine.

I wonder if @pashalogin is still around to update this, or should someone else continue with the editing?

@findepi
Contributor

findepi commented Nov 7, 2018

@juhoautio yes, I meant the commented-out code, and the failing Travis build.

We've been patching our presto with this in production for half a year now with this exact diff and it's worked fine.

Now you're touching upon the functional merits of this change. As I said, I don't have the context to judge these.

@dain
Contributor

dain commented Nov 7, 2018

@juhoautio, how do you know this is the correct behavior for Hive? Also, is this hard-coded in Hive or configurable?

@pashalogin
Author

We have been running this fix in our production environment, but we are not able to merge the code. Can anyone review this code?

@dain
Contributor

dain commented Nov 7, 2018

@pashalogin, I can but I will need answers to these questions:

  • How do you know this is the correct behavior for Hive?
  • Is this hard coded in Hive or configurable?

@pashalogin
Author

pashalogin commented Nov 8, 2018

TextInputFormat is the default InputFormat of MapReduce.
TextOutputFormat is the default OutputFormat of MapReduce.
The same applies to the SerDe as well.
It shouldn't throw a runtime exception if these objects are not explicitly defined.

@dain
Contributor

dain commented Nov 8, 2018

@pashalogin Are the "defaults" configurable in MapReduce? I am asking because I need to know if we are expected to make this configurable in Presto.

@juhoautio

@dain sorry, I don't have answers to your questions. I'm assuming that @pashalogin will provide. If not, I could look into it. Thanks all!

@pdanilew

pdanilew commented Nov 8, 2018

The default format is in configuration. This link should cover all aspects as a starting point.
I don't know if it could be fetched from this. I'm mentioning it because I don't like the idea of duplicated configuration (parameters in both Hive and the Presto connector).

@pashalogin
Author

pashalogin commented Nov 8, 2018

When you create external Hive tables on a Kinesis stream, the external Hive metastore captures null values for INPUTFORMAT and OUTPUTFORMAT. This works well in Hive, as it picks default values, but it fails in the Presto connector, which throws an illegal-state exception if these values are null. It is not mandatory to set these values. This check was introduced in newer versions of Presto (i.e. >0.152).

@dain
Contributor

dain commented Nov 8, 2018

As a side note, @pashalogin, I suggest you file a bug with the Kinesis team, as they should really take the time to declare the format to everyone... it's just cleaner.

@dain
Contributor

dain commented Nov 8, 2018

Default format is in configuration. This link should cover all aspects as a starting point.

This documentation is more user facing, and what we need to understand is what the code is doing. Here is what I believe is the relevant section in Hive: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L502

I believe this code is trying the following for input and output format:

  1. Storage descriptor of the partition
  2. Schema properties of the table
  3. Assume SequenceFile format

For SerDe, it seems to allow null.

Assuming my analysis is correct, that would mean your table has the input and output formats set to text. Is that true?
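The lookup order described above can be sketched like this. It is a hedged approximation of the logic in Hive's MetaStoreUtils, not a copy of it; the property key "file.inputformat" and the method names here are assumptions for illustration, while the SequenceFile default matches step 3 of the list.

```java
import java.util.Map;

// Hedged sketch of the fallback order described above (not the actual
// MetaStoreUtils code): 1) the partition's storage descriptor,
// 2) the table's schema properties, 3) assume SequenceFile.
// The "file.inputformat" key is an illustrative assumption.
public class FormatFallback {
    static final String SEQUENCE_FILE_INPUT =
            "org.apache.hadoop.mapred.SequenceFileInputFormat";

    public static String resolveInputFormat(
            String partitionDescriptorFormat, Map<String, String> tableSchemaProperties) {
        if (partitionDescriptorFormat != null) {
            return partitionDescriptorFormat;              // 1) storage descriptor of the partition
        }
        String fromTable = tableSchemaProperties.get("file.inputformat");
        if (fromTable != null) {
            return fromTable;                              // 2) schema properties of the table
        }
        return SEQUENCE_FILE_INPUT;                        // 3) assume SequenceFile format
    }
}
```

Under this reading, a table like the Kinesis one above would only fail if all three steps come up empty, which is why accepting the fallback rather than throwing matches Hive's behavior.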

@pashalogin
Author

As a side note, @pashalogin, I suggest you file a bug with the Kinesis team, as they should really take the time to declare the format to everyone... just is cleaner.

I have initially approached AWS for fix. But, they have stated it is Presto issue as it is working fine in Hive.

@pashalogin
Author


Assuming my analysis is correct, that would mean your table has the input and output formats set to text. Is that true?

Yes. If it is null, it should be set to the default values instead of throwing an error message.

@juhoautio

Not sure if this helps, but in our case the table was created in Hive with:

CREATE TEMPORARY EXTERNAL TABLE temp_dynamo_table (
    UserId STRING,
    SourceGame STRING,
    LastModified BIGINT,
    Scores STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    'dynamodb.table.name' = 'UserScores',
    'dynamodb.column.mapping' = 'UserId:UserId,SourceGame:SourceGame,LastModified:LastModified,Scores:Scores',
    'dynamodb.throughput.read.percent' = '0.2',
    'dynamodb.throughput.write.percent' = '1.0'
);

Then (still Hive):

DESCRIBE FORMATTED temp_dynamo_table;
col_name data_type comment
NULL NULL
userid string from deserializer
sourcegame string from deserializer
lastmodified bigint from deserializer
scores string from deserializer
NULL NULL
# Detailed Table Information NULL NULL
Database: tmp NULL
Owner: hadoop NULL
CreateTime: Thu Nov 08 21:24:07 UTC 2018 NULL
LastAccessTime: UNKNOWN NULL
Retention: 0 NULL
Location: hdfs://ip-10-1-59-186.obfuscated:8020/tmp/hive/hadoop/fd231996-8078-4729-b5c1-3afe15f10109/_tmp_space.db/f169674f-4afc-438e-a5f6-447714d37213 NULL
Table Type: EXTERNAL_TABLE NULL
Table Parameters: NULL NULL
EXTERNAL TRUE
dynamodb.column.mapping "UserId:UserId
dynamodb.table.name UserScores
dynamodb.throughput.read.percent 0.2
dynamodb.throughput.write.percent 1.0
storage_handler org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler
NULL NULL
# Storage Information NULL NULL
SerDe Library: org.apache.hadoop.hive.dynamodb.DynamoDBSerDe NULL
InputFormat: null NULL
OutputFormat: null NULL
Compressed: No NULL
Num Buckets: -1 NULL
Bucket Columns: [] NULL
Sort Columns: [] NULL
Storage Desc Params: NULL NULL
serialization.format 1

If you can spot something wrong here, please indicate so that it can be reported to AWS.

This kind of table causes the "HiveMetastore outputFormat should not be accessed from a null StorageFormat" error in Presto. As mentioned in the issue (#6972), this makes some tools fail when they try to read information_schema and system.jdbc.columns (to refresh schemas & tables in Presto).

By patching Presto with the changes in this PR, that error can be mitigated. I haven't tried reading this table in Presto or writing to it. As a user I find it important that Presto doesn't return an error in an information_schema query even if such tables exist in the metastore.

Having proper read & write support for such tables seems like another story, and I believe the other users who have commented on this issue don't really need that right now(?). Nothing wrong with implementing it properly, of course. IMHO it would be fine to have this fix ASAP and improve later if read/write for such tables should be officially supported. Thank you once again.

@dain
Contributor

dain commented Nov 8, 2018

@pashalogin, I'm not saying it is a bug, but they are definitely relying on risky legacy behavior. The problem is that if they do not record the formats in the partition, then when the user changes the table-level format (which describes how new partitions should be created by default), all existing partitions would no longer be readable. Recording the formats is safer and is what Hive does by default nowadays... so not technically a bug, but a good improvement to make.

@dain
Contributor

dain commented Nov 8, 2018

@juhoautio I've never heard of org.apache.hadoop.hive.dynamodb.DynamoDBSerDe, so my guess is this fix isn't going to work anyway (unless that format happens to be the default Hive text format). I would suggest you file a separate issue to have that format supported. Maybe we can get one of the AWS folks to implement this.

@juhoautio

@dain indeed I don't mind if Presto doesn't support read/write for tables with null StorageFormat. I'm only hoping that Presto would tolerate the existence of tables with null StorageFormat in the metastore ie. allowing the table meta to be queried without returning an error. Is that somehow a bad idea? Thanks for your answers.

@dain
Contributor

dain commented Nov 8, 2018

@dain indeed I don't mind if Presto doesn't support read/write for tables with null StorageFormat. I'm only hoping that Presto would tolerate the existence of tables with null StorageFormat in the metastore ie. allowing the table meta to be queried without returning an error. Is that somehow a bad idea?

I agree, and there have been some recent PRs to fix this kind of issue. I think the fix for that issue might be different too (there are lots of things that can go wrong), so can you open an issue with the exact commands you are running and the stack traces you get? This will help us add safety measures to prevent a single bad table from failing metadata commands.

@findepi
Contributor

findepi commented Nov 9, 2018

@dain indeed I don't mind if Presto doesn't support read/write for tables with null StorageFormat. I'm only hoping that Presto would tolerate the existence of tables with null StorageFormat in the metastore ie. allowing the table meta to be queried without returning an error. Is that somehow a bad idea? Thanks for your answers.

@juhoautio @dain
#11847

@VicoWu

VicoWu commented Nov 10, 2018

@dain @findepi @juhoautio
I believe that Presto, as a shared engine for many different backend catalogs and data formats, should not make any self-righteous assumptions or hard-coded judgments about uncertain things, like the null storage format for DynamoDBSerDe; even a hard-coded config for this is overkill and unnecessary. So when a user tries to read such a storage-null table, just report an error, don't hesitate.
But it would be more user-friendly, and even necessary, to skip a table whose storageFormat is null when the user scans catalogs (reported in #11847), because in most cases users don't care whether some specific table's metadata is broken when scanning tables. At the least, a null storage format should raise a PrestoException(HIVE_UNSUPPORTED_FORMAT) that is caught by the upper caller so it can skip that table, instead of directly failing the whole scan. To be conservative, we could also add a server-side configuration like skip-null-storage-format-when-scanning to leave the choice to the user.
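The skip-on-scan idea above can be sketched as follows. This is a hypothetical illustration, not Presto's actual implementation; the class and method names are invented for the example. The point is that a per-table failure during a catalog scan is caught and the table skipped, so one bad table cannot fail the whole listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the skip-on-scan behavior proposed above (names
// are illustrative, not Presto's actual code): when listing tables for a
// catalog scan, a table whose metadata resolution throws is skipped so a
// single bad table cannot fail the entire scan.
public class TolerantCatalogScan {
    public static List<String> listTables(Map<String, Supplier<String>> tableToFormatResolver) {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, Supplier<String>> entry : tableToFormatResolver.entrySet()) {
            try {
                entry.getValue().get(); // resolve the table's storage format; may throw
                visible.add(entry.getKey());
            }
            catch (RuntimeException e) {
                // e.g. "outputFormat should not be accessed from a null
                // StorageFormat": skip this table instead of failing the scan
            }
        }
        return visible;
    }
}
```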

@kokosing
Contributor

@pashalogin Can we close this in favor of #11847?

@pashalogin
Author

Yes, we can close this.

@pashalogin pashalogin closed this Nov 13, 2018