Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signature::Coercible with user defined implicit casting #14440

Open
wants to merge 19 commits into
base: main
Choose a base branch
from
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 9 additions & 0 deletions datafusion/common/src/types/native.rs
Original file line number Diff line number Diff line change
Expand Up @@ -198,6 +198,11 @@ impl LogicalType for NativeType {
TypeSignature::Native(self)
}

/// Returns the default casted type for the given arrow type
///
/// For types like String or Date, multiple arrow types mapped to the same logical type
/// If the given arrow type is one of them, we return the same type
/// Otherwise, we define the default casted type for the given arrow type
fn default_cast_for(&self, origin: &DataType) -> Result<DataType> {
use DataType::*;

Expand Down Expand Up @@ -226,6 +231,10 @@ impl LogicalType for NativeType {
(Self::Decimal(p, s), _) if p <= &38 => Decimal128(*p, *s),
(Self::Decimal(p, s), _) => Decimal256(*p, *s),
(Self::Timestamp(tu, tz), _) => Timestamp(*tu, tz.clone()),
// If given type is Date, return the same type
(Self::Date, origin) if matches!(origin, Date32 | Date64) => {
origin.to_owned()
}
(Self::Date, _) => Date32,
(Self::Time(tu), _) => match tu {
TimeUnit::Second | TimeUnit::Millisecond => Time32(*tu),
Expand Down
1 change: 1 addition & 0 deletions datafusion/expr-common/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -39,5 +39,6 @@ path = "src/lib.rs"
[dependencies]
arrow = { workspace = true }
datafusion-common = { workspace = true }
indexmap = { workspace = true }
itertools = { workspace = true }
paste = "^1.0"
202 changes: 164 additions & 38 deletions datafusion/expr-common/src/signature.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,13 @@
//! and return types of functions in DataFusion.

use std::fmt::Display;
use std::hash::Hash;
use std::num::NonZeroUsize;

use crate::type_coercion::aggregates::NUMERICS;
use arrow::datatypes::{DataType, IntervalUnit, TimeUnit};
use datafusion_common::types::{LogicalTypeRef, NativeType};
use indexmap::IndexSet;
use itertools::Itertools;

/// Constant that is used as a placeholder for any valid timezone.
Expand Down Expand Up @@ -127,12 +129,11 @@ pub enum TypeSignature {
Exact(Vec<DataType>),
/// One or more arguments belonging to the [`TypeSignatureClass`], in order.
///
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![Int32]` or `vec![Float32]`
/// since i32 and f32 can be cast to f64
/// [`Coercion`] contains not only the desired type but also the allowed casts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this API makes a lot of sense to me-- in fact I think it is pretty close to being able to express most other signatures.

/// For example, if you expect a function has string type, but you also allow it to be casted from binary type.
///
/// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`].
Coercible(Vec<TypeSignatureClass>),
Coercible(Vec<Coercion>),
/// One or more arguments coercible to a single, comparable type.
///
/// Each argument will be coerced to a single type using the
Expand Down Expand Up @@ -209,14 +210,13 @@ impl TypeSignature {
#[derive(Debug, Clone, Eq, PartialEq, PartialOrd, Hash)]
pub enum TypeSignatureClass {
Timestamp,
Date,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use NativeType::Date instead, not need TypeSignatureClass::Date

Time,
Interval,
Duration,
Native(LogicalTypeRef),
// TODO:
// Numeric
// Integer
Integer,
}

impl Display for TypeSignatureClass {
Expand Down Expand Up @@ -310,8 +310,8 @@ impl TypeSignature {
TypeSignature::Comparable(num) => {
vec![format!("Comparable({num})")]
}
TypeSignature::Coercible(types) => {
vec![Self::join_types(types, ", ")]
TypeSignature::Coercible(coercions) => {
vec![Self::join_types(coercions, ", ")]
}
TypeSignature::Exact(types) => {
vec![Self::join_types(types, ", ")]
Expand Down Expand Up @@ -365,7 +365,12 @@ impl TypeSignature {
}
}

/// get all possible types for the given `TypeSignature`
/// This function is used specifically internally for `information_schema`
/// We suggest not to rely on this function
///
/// Get all possible types for `information_schema` from the given `TypeSignature`
//
// TODO: Make this function private
pub fn get_possible_types(&self) -> Vec<Vec<DataType>> {
match self {
TypeSignature::Exact(types) => vec![types.clone()],
Expand All @@ -378,31 +383,24 @@ impl TypeSignature {
.cloned()
.map(|data_type| vec![data_type; *arg_count])
.collect(),
TypeSignature::Coercible(types) => types
TypeSignature::Coercible(coercions) => coercions
.iter()
.map(|logical_type| match logical_type {
TypeSignatureClass::Native(l) => get_data_types(l.native()),
TypeSignatureClass::Timestamp => {
vec![
DataType::Timestamp(TimeUnit::Nanosecond, None),
DataType::Timestamp(
TimeUnit::Nanosecond,
Some(TIMEZONE_WILDCARD.into()),
),
]
}
TypeSignatureClass::Date => {
vec![DataType::Date64]
}
TypeSignatureClass::Time => {
vec![DataType::Time64(TimeUnit::Nanosecond)]
}
TypeSignatureClass::Interval => {
vec![DataType::Interval(IntervalUnit::DayTime)]
}
TypeSignatureClass::Duration => {
vec![DataType::Duration(TimeUnit::Nanosecond)]
.map(|c| {
let mut all_types: IndexSet<DataType> =
get_possible_types_from_signature_classes(&c.desired_type)
.into_iter()
.collect();

if let Some(implicit_coercion) = &c.implicit_coercion {
let allowed_casts: Vec<DataType> = implicit_coercion
.allowed_source_types
.iter()
.flat_map(get_possible_types_from_signature_classes)
.collect();
all_types.extend(allowed_casts);
}

all_types.into_iter().collect::<Vec<_>>()
})
.multi_cartesian_product()
.collect(),
Expand Down Expand Up @@ -431,6 +429,32 @@ impl TypeSignature {
}
}

fn get_possible_types_from_signature_classes(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would make more sense to me as a method on TypeSignatureClass

Copy link
Contributor Author

@jayzhan211 jayzhan211 Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is used in information.schema to list all possible signature combination, but it is too granular than necessary, sometimes we can use NativeType or a new struct that represent a set of DataType and it should be enough.

@goldmedal
If there is function requires Integer, we don't need to list all possible i8, i16, i32 or i64 but integer instead. I think we need a output other than Vec<DataType> but something that could combines both DataType or a set of type similar to NativeType. Otherwise we will generate tons of DataType combination for Coercible signature and I guess it is not readable for information.schema

signature_classes: &TypeSignatureClass,
) -> Vec<DataType> {
match signature_classes {
TypeSignatureClass::Native(l) => get_data_types(l.native()),
TypeSignatureClass::Timestamp => {
vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the TimeUnits are missing. It would be great if TIMEZONE_WILDCARD encompassed None and it would also be great if there was a wildcard for TimeUnit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is "simplified" because of some reasons, if we add all possible types, the combination will be huge. This function is used in information.schema that shows the possible signatures a function has. Is listing all possible DataType combination helpful?

DataType::Timestamp(TimeUnit::Nanosecond, None),
DataType::Timestamp(TimeUnit::Nanosecond, Some(TIMEZONE_WILDCARD.into())),
]
}
TypeSignatureClass::Time => {
vec![DataType::Time64(TimeUnit::Nanosecond)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should encompass all possible DataType::Time32 and DataType::Time64. Also same comment as above regarding TimeUnit wildcard.

}
TypeSignatureClass::Interval => {
vec![DataType::Interval(IntervalUnit::DayTime)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should encompass all possible DataType::Interval

}
TypeSignatureClass::Duration => {
vec![DataType::Duration(TimeUnit::Nanosecond)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should encompass all possible DataType::Duration. Also same comment as above regarding TimeUnit wildcard.

}
TypeSignatureClass::Integer => {
vec![DataType::Int64]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should encompass all possible ints.

}
}
}

fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
match native_type {
NativeType::Null => vec![DataType::Null],
Expand Down Expand Up @@ -460,6 +484,110 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
}
}

#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct Coercion {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea of Coercion.

Can this idea be used for user defined coercions or coercions to specific (Arrow) types?

Also, is there any way to allow users to provide their own coercion rules? For example, if Sail / @shehabgamin wants to support converting numeric values to strings automatically, would he be express that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this would be really great to have!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to ascii that binary to string is supported, you add numeric types in allowed_source_types

pub desired_type: TypeSignatureClass,
implicit_coercion: Option<ImplicitCoercion>,
}

impl Coercion {
pub fn new(desired_type: TypeSignatureClass) -> Self {
Self {
desired_type,
implicit_coercion: None,
}
}

/// Create a new coercion with implicit coercion rules.
///
/// `allowed_source_types` defines the possible types that can be coerced to `desired_type`.
/// `default_casted_type` is the default type to be used for coercion if we cast from other types via `allowed_source_types`.
pub fn new_with_implicit_coercion(
desired_type: TypeSignatureClass,
allowed_source_types: Vec<TypeSignatureClass>,
default_casted_type: NativeType,
) -> Self {
Self {
desired_type,
implicit_coercion: Some(ImplicitCoercion {
allowed_source_types,
default_casted_type,
}),
}
}

pub fn allowed_source_types(&self) -> &[TypeSignatureClass] {
self.implicit_coercion
.as_ref()
.map(|c| c.allowed_source_types.as_slice())
.unwrap_or_default()
}

pub fn default_casted_type(&self) -> Option<&NativeType> {
self.implicit_coercion
.as_ref()
.map(|c| &c.default_casted_type)
}
}

impl Display for Coercion {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "Coercion({}", self.desired_type)?;
if let Some(implicit_coercion) = &self.implicit_coercion {
write!(f, ", implicit_coercion={implicit_coercion}",)
} else {
write!(f, ")")
}
}
}

impl PartialEq for Coercion {
fn eq(&self, other: &Self) -> bool {
self.desired_type == other.desired_type
&& self.implicit_coercion == other.implicit_coercion
}
}

impl Hash for Coercion {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.desired_type.hash(state);
self.implicit_coercion.hash(state);
}
}

#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct ImplicitCoercion {
allowed_source_types: Vec<TypeSignatureClass>,
/// For types like Timestamp, there are multiple possible timeunit and timezone from a given TypeSignatureClass
/// We need to specify the default type to be used for coercion if we cast from other types via `allowed_source_types`
/// Other types like Int64, you don't need to specify this field since there is only one possible type.
default_casted_type: NativeType,
}

impl Display for ImplicitCoercion {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"ImplicitCoercion({:?}, default_type={:?})",
self.allowed_source_types, self.default_casted_type
)
}
}

impl PartialEq for ImplicitCoercion {
fn eq(&self, other: &Self) -> bool {
self.allowed_source_types == other.allowed_source_types
&& self.default_casted_type == other.default_casted_type
}
}

impl Hash for ImplicitCoercion {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.allowed_source_types.hash(state);
self.default_casted_type.hash(state);
}
}

/// Defines the supported argument types ([`TypeSignature`]) and [`Volatility`] for a function.
///
/// DataFusion will automatically coerce (cast) argument types to one of the supported
Expand Down Expand Up @@ -536,11 +664,9 @@ impl Signature {
volatility,
}
}

/// Target coerce types in order
pub fn coercible(
target_types: Vec<TypeSignatureClass>,
volatility: Volatility,
) -> Self {
pub fn coercible(target_types: Vec<Coercion>, volatility: Volatility) -> Self {
Self {
type_signature: TypeSignature::Coercible(target_types),
volatility,
Expand Down Expand Up @@ -739,8 +865,8 @@ mod tests {
);

let type_signature = TypeSignature::Coercible(vec![
TypeSignatureClass::Native(logical_string()),
TypeSignatureClass::Native(logical_int64()),
Coercion::new(TypeSignatureClass::Native(logical_string())),
Coercion::new(TypeSignatureClass::Native(logical_int64())),
]);
let possible_types = type_signature.get_possible_types();
assert_eq!(
Expand Down
Loading