Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LogicalType for TypeSignature Numeric and String, Coercible #13240

Merged
merged 14 commits into from
Nov 6, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Nov 4, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

An attempt to use logical type for TypeSignature

Why NumericAndNumericString is added

In Postgres, math functions accept numeric string, but functions like aggregation var they don't accept numeric string.
Therefore we need two kinds of numeric signature, one accepts string one doesn't

What changes are included in this PR?

TypeSignature::Coercible, TypeSignature::Numeric and TypeSignature::String, TypeSignature:: NumericAndNumericString

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) common Related to common crate functions labels Nov 4, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 changed the title Use LogicalType for TypeSignature Coercible and String Use LogicalType for TypeSignature Coercible, Numeric and String Nov 4, 2024
],
Volatility::Immutable,
),
signature: Signature::any(1, Volatility::Immutable),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally first/last value get the first/last item in the column, so it should be the type of the column.

I forgot why we don't use Any.
Without the change, we need to add another type signature for boolean, so I change it to Any in this PR

Comment on lines 424 to 429
/// Avoid adding specific coercion cases here.
/// Aim to keep this logic as SIMPLE as possible!
pub fn can_cast_to(&self, target_type: &Self) -> bool {
// In Postgres, most functions coerce numeric strings to numeric inputs,
// but they do not accept numeric inputs as strings.
if self.is_numeric() && target_type == &NativeType::String {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This talks about "can cast" and "can coerce" without making clear distinction between them.
Can we make it less ambigous and clarify whether we may "can cast" (ie does CAST(type_from AS type_to) exist?) or "can coerce" (does cast exist and is applied implicitly in various contexts?)

return false;
}

true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default for "can cast" and "can coerce" should be false.
Maps cannot coerce to numbers or lists.

Let'e have

Suggested change
true
false

here and let's define what can be converted in rules above. This will make the code simpler to reason about

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 changed the title Use LogicalType for TypeSignature Coercible, Numeric and String Use LogicalType for TypeSignature Numeric and String. Deprecate Coercible and add NumericAndNumericString Nov 4, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 changed the title Use LogicalType for TypeSignature Numeric and String. Deprecate Coercible and add NumericAndNumericString Use LogicalType for TypeSignature Numeric and String, Coercible and introduce NumericAndNumericString Nov 4, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review November 5, 2024 02:23
@jayzhan211 jayzhan211 requested a review from findepi November 5, 2024 02:23
/// See [`NativeType::is_numeric`] to know which type is considered numeric
/// This signature accepts numeric string
/// Example of functions In Postgres that support numeric string
/// 1. Mathematical Functions, like `abs`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i confirm this works in PostgreSQL (was testing with PostgreSQL v 17)

select abs('-123');

https://www.postgresql.org/docs/17/typeconv-oper.html#id-1.5.9.7.8 suggests why this works
'...' is not a varchar literal in PostgreSQL.
it's "unknown-type" literal, which gets interpreted as float8

Here the system has implicitly resolved the unknown-type literal as type float8 before applying the chosen operator.

indeed, select pg_typeof(abs('-123')); returns double precision

However, this doesn't work in PostgreSQL

select abs(CAST('-123' AS varchar));

This fails with "Query Error: function abs(character varying) does not exist"
Indicating that there is no special coercion rules from varchar to numbers when calling functions like abs().


BTW, coercion rules can be retrieved from PostgreSQL from https://www.postgresql.org/docs/17/catalog-pg-cast.html

PostgreSQL doesn't declare any implicit coercions between (selected) numeric types and (selected) varchar type, and abs() function resolution behavior as described above matches that.

with selected_types(name) AS (VALUES ('int4'), ('int8'), ('float8'), ('varchar'))
select 
	src.typname src_type,
    dst.typname dst_type,
    castcontext,
    case castcontext
    	when 'e' then 'only explicit'
        when 'a' then 'explicit | implicitly in assignment'
        when 'i' then 'implicitly in expressions, as well as the other cases'
        else '???' -- undocumented @ https://www.postgresql.org/docs/17/catalog-pg-cast.html
    end castcontext_explained
from pg_cast
join pg_type src on pg_cast.castsource = src.oid
join pg_type dst on pg_cast.casttarget = dst.oid
where true
and src.oid != dst.oid
and src.typname in (select name from selected_types)
and dst.typname in (select name from selected_types)
order by src.typname, dst.typname
;
src_type dst_type castcontext castcontext_explained
float8 int4 a explicit | implicitly in assignment
float8 int8 a explicit | implicitly in assignment
int4 float8 i implicitly in expressions, as well as the other cases
int4 int8 i implicitly in expressions, as well as the other cases
int8 float8 i implicitly in expressions, as well as the other cases
int8 int4 a explicit | implicitly in assignment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @findepi.
It makes sense to me. NumericAndNumericString allows abs('-1') but also allows abs('-1'::varchar), which isn't allowed by Postgres.

A similar behavior in DuckDB is with to_timestamp.

D select to_timestamp('-1');
┌──────────────────────────┐
│    to_timestamp('-1')    │
│ timestamp with time zone │
├──────────────────────────┤
│ 1970-01-01 07:59:59+08   │
└──────────────────────────┘
D select to_timestamp('-1'::varchar);
Binder Error: No function matches the given name and argument types 'to_timestamp(VARCHAR)'. You might need to add explicit type casts.
	Candidate functions:
	to_timestamp(DOUBLE) -> TIMESTAMP WITH TIME ZONE

LINE 1: select to_timestamp('-1'::varchar);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to follow PostgreSQL behavior, we would need to model '...' literals as unknown type literals. This would be parser / frontend work. It should have no impact on projects that do not use the DF parser (#12723).
Not sure it's worthwhile to do. For now, let's just focus on coercions and make them reasonably correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. We need to model unknown type literal and avoid numeric string to string coercion

@@ -98,6 +98,12 @@ impl fmt::Debug for dyn LogicalType {
}
}

impl std::fmt::Display for dyn LogicalType {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{self:?}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make a distinction between Debug and Display. Currently, we print something like

        let int: Box<dyn LogicalType> = Box::new(Int32);
        println!(format!("{}", int));

------- output ------
LogicalType(Native(Int32), Int32))

I imagine we can print Int32 toInt32, print JSON to JSON ... 🤔
However, I think it may not be the point of this PR. We can do it in another PR.

/// See [`NativeType::is_numeric`] to know which type is considered numeric
/// This signature accepts numeric string
/// Example of functions In Postgres that support numeric string
/// 1. Mathematical Functions, like `abs`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @findepi.
It makes sense to me. NumericAndNumericString allows abs('-1') but also allows abs('-1'::varchar), which isn't allowed by Postgres.

A similar behavior in DuckDB is with to_timestamp.

D select to_timestamp('-1');
┌──────────────────────────┐
│    to_timestamp('-1')    │
│ timestamp with time zone │
├──────────────────────────┤
│ 1970-01-01 07:59:59+08   │
└──────────────────────────┘
D select to_timestamp('-1'::varchar);
Binder Error: No function matches the given name and argument types 'to_timestamp(VARCHAR)'. You might need to add explicit type casts.
	Candidate functions:
	to_timestamp(DOUBLE) -> TIMESTAMP WITH TIME ZONE

LINE 1: select to_timestamp('-1'::varchar);

datafusion/expr/src/type_coercion/functions.rs Outdated Show resolved Hide resolved
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211
Copy link
Contributor Author

I rm remove incorrect numeric string case in this PR

@jayzhan211 jayzhan211 changed the title Use LogicalType for TypeSignature Numeric and String, Coercible and introduce NumericAndNumericString Use LogicalType for TypeSignature Numeric and String, Coercible Nov 6, 2024
Copy link
Contributor

@goldmedal goldmedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211, overall looks good to me. Just some minor comments. 👍

datafusion/expr-common/src/signature.rs Outdated Show resolved Hide resolved
datafusion/sqllogictest/test_files/scalar.slt Outdated Show resolved Hide resolved
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 merged commit 6686e03 into apache:main Nov 6, 2024
24 checks passed
@jayzhan211 jayzhan211 deleted the logical-signature branch November 6, 2024 23:56
@jayzhan211
Copy link
Contributor Author

Thanks @goldmedal @findepi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate functions logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants