
Column type mismatch between partition schema and table schema: table schema is string, partition schema is decimal(4,0) #2817

Closed
anirbanch opened this issue Feb 12, 2020 · 16 comments


@anirbanch

Query fails when table schema is defined as string but partition schema is decimal. Error message below:

Query failed (#1111): There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'aaaa' in table 'ttttt' is declared as type 'string', but partition ‘pppp’ declared column 'aaaa' as type 'decimtal(4,0)'. [DB Errorcode=16777224]

The Hive Metastore has column aaaa defined as type string, but the Parquet schema of older partitions (like pppp) has that column defined as type decimal. Hive is able to read from the older partitions.

We ended up in this situation because the field type had to be changed from decimal to string for business reasons.

Can Presto cast decimal to string in the above situation and return a result? (Not asking for string-to-decimal casting.)
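
For illustration, here is a minimal Hive-side sketch of how such a mismatch typically arises (hypothetical table, column, and partition names; assumes Parquet storage as in the report). Hive's ALTER TABLE ... CHANGE updates only the table-level schema unless CASCADE is specified, so existing partitions keep the old type in the metastore:

-- hypothetical names; run in Hive
CREATE EXTERNAL TABLE ttttt (aaaa DECIMAL(4,0)) PARTITIONED BY (dt STRING) STORED AS PARQUET;
INSERT INTO ttttt PARTITION (dt = '2019-12-01') VALUES (1234);   -- old partition written while the column is DECIMAL
ALTER TABLE ttttt CHANGE COLUMN aaaa aaaa STRING;                -- table-level type changed later (no CASCADE)
SELECT aaaa FROM ttttt WHERE dt = '2019-12-01';                  -- Hive coerces DECIMAL to STRING; Presto fails with the mismatch error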

@electrum
Member

electrum commented Feb 12, 2020

The Hive connector should support any conversion that is supported consistently in Hive (certain conversions behave differently for different file formats, which is harder). Any numeric type to string should work. It starts with HiveCoercionPolicy and likely needs to be implemented elsewhere as well.

This commit is a good starting point to see what needs to be changed and where the tests go: f0eb4f3

@tooptoop4
Contributor

tooptoop4 commented Feb 12, 2020

decimtal instead of decimal? @anirbanch

@findepi findepi added the good first issue Good for newcomers label Feb 13, 2020
@degoyal

degoyal commented Mar 3, 2020

Is anyone looking into this issue?

Also, one quick question: why does Presto check partition metadata for all columns? For the issue reported above, the user is not referencing the problematic column anywhere in the query, yet Presto still reports an error because it compares the full table/partition metadata. We noticed that Hive reports the issue for the same column, but only when that column is actually used in the query.

Since the data type was changed for only one or two columns in a way Presto does not support, every user of this table is paying the penalty: none of them can query it now. This is our detail table and it is used very heavily.

Until this issue is fixed, as a workaround we are thinking of creating another external table, specific to our Presto users, with the old structure over the older partitions (the historical data) to unblock them, but it would be great if the code could be fixed soon to support this.
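
For what it's worth, a rough sketch of that external-table workaround (hypothetical names and location; the second table keeps the old column type and points at the same data, for Presto users only):

-- hypothetical names and location; run in Hive
CREATE EXTERNAL TABLE ttttt_legacy (aaaa DECIMAL(4,0))
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/ttttt';
MSCK REPAIR TABLE ttttt_legacy;   -- register the existing historical partitions with the legacy table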

@electrum
Member

electrum commented Mar 3, 2020

The reason is that Presto connectors need to return metadata for all columns in the table. The engine doesn’t ask for metadata about specific columns.

@electrum
Member

electrum commented Mar 3, 2020

The table/partition mismatch is actually a different issue. It happens at runtime during scheduling, and it might be possible to perform the check only for the columns used in the query, but I don't know the complexity of that change.

@findepi findepi removed the good first issue Good for newcomers label May 12, 2020
@adnanz

adnanz commented Jun 9, 2020

Is there a workaround to this problem? We are getting the same issue:

There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'equipments' in table 'xxxxx' is declared as type 'array<struct<action:string,devicename:string,devicetype:string,model:string,outlet:string,serialnumber:string,isplaceholder:boolean,reason:string,categorycode:string,reasoncategory:string,reasoncode:string,otherreason:string>>', but partition 'scheduledate=2020-01-10' declared column 'equipments' as type 'array<struct<action:string,deviceName:string,deviceName:string,deviceType:string,model:string,outlet:string,serialNumber:string,isPlaceHolder:boolean,reason:string,categoryCode:string,reasonCategory:string,reasonCode:string,otherReason:string>>'.

All we are trying to do is get the data for the day before the current date using WHERE date(SCHEDULEDATE) = date_add('day', -1, current_date)

We are not sure why it's looking into partition scheduledate=2020-01-10, and we are not even interested in the equipments column.
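
One thing that may help while this coercion is unsupported (not guaranteed, and it assumes the scheduledate partition values are stored as 'yyyy-MM-dd' strings): filter on the raw partition column with a constant-foldable expression, so the engine can prune the mismatched partitions before their schemas are checked. For example:

SELECT *
FROM xxxxx
WHERE scheduledate = cast(date_add('day', -1, current_date) AS varchar)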

@hashhar
Member

hashhar commented Jun 16, 2020

I'm willing to pick this up starting the coming weekend. This is a pain point for me too.

The existing workaround is to create a new external table with the problematic column removed, but it would be nice to support such conversions where possible.

@hashhar
Member

hashhar commented Jul 11, 2020

@electrum As a first step, would allowing coercion from any numeric type to string be useful? I've started work on this in #4426.

Having Presto stay silent about a schema mismatch on an unused column would require changes to the SPI, as you said, and I'm not sure that is a better decision than failing early with a clear message; otherwise people may discover the issue so late that they can no longer fix it.

@electrum
Member

Our general rule for the Hive connector is that it should have the same behavior as Hive. If Hive can read the data in this evolution case, then we should as well (returning the same result).

@degoyal

degoyal commented Jul 31, 2020

Thanks @hashhar and @electrum. Is it ready to be merged to master?

@hashhar
Member

hashhar commented Jul 31, 2020

@degoyal No. I haven't been able to spend much time on this. The pending work is to verify whether Hive allows implicit conversion between any numeric type and string. If someone could put together a matrix of supported conversions it would help, since I don't have a Hive setup handy and can only devote weekends to this.
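
For anyone willing to build that matrix, a sketch of the kind of Hive-side probe involved (hypothetical names; repeat per numeric type and per file format, since the earlier comment notes some conversions behave differently across formats):

-- run in Hive; hypothetical names
CREATE TABLE coercion_probe (v INT) PARTITIONED BY (p STRING) STORED AS ORC;
INSERT INTO coercion_probe PARTITION (p = 'old') VALUES (42);
ALTER TABLE coercion_probe CHANGE COLUMN v v STRING;          -- table-level type change only
SELECT v FROM coercion_probe WHERE p = 'old';                 -- succeeds if Hive supports the INT to STRING coercion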

@findepi
Member

findepi commented Jul 31, 2020

For a Hive setup, you can use this:

Hive 1.2

./presto-product-tests-launcher/bin/run-launcher env up --environment singlenode --without-presto

Hive 3.1

./presto-product-tests-launcher/bin/run-launcher env up --environment singlenode-hdp3 --without-presto

@cyofeiyue

Any progress on this issue? @hashhar

@hashhar
Member

hashhar commented Nov 24, 2021

@cyofeiyue I had a PR open at #4426 but I ran into issues regarding all the possible type coercions that can be performed within Hive (since they are undocumented).

I'm not actively working on it at the moment. Whoever picks this up should focus on just a single coercion per pull request; otherwise it's unlikely the work will get finished.

@TravelEarth10001Days

After modifying the fields, the table is still inaccessible even after running MSCK REPAIR TABLE. The working approach is to drop the table, recreate it, and then repair the partitions, but that can be troublesome for internal (managed) tables.
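
A sketch of that drop-and-recreate route (hypothetical names and location; as noted above, it is only safe as written for external tables, since dropping a managed table also deletes its data):

-- external table: dropping it only removes metadata, the files stay in place
DROP TABLE ttttt;
CREATE EXTERNAL TABLE ttttt (aaaa STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/ttttt';
MSCK REPAIR TABLE ttttt;   -- re-register the existing partitions under the new schema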

@findepi
Member

findepi commented Sep 21, 2022

Fixed by #14172
