Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SparkSql function to_pretty_string #10359

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

jinchengchenghh
Copy link
Contributor

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 1, 2024
Copy link

netlify bot commented Jul 1, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit a4ae7d4
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66c6aa81ff2b0e0008b69c2f

@jinchengchenghh
Copy link
Contributor Author

@rui-mo @PHILO-HE Can you help review this PR? Thanks!

SELECT toprettystring(null); -- "NULL"
SELECT toprettystring(cast('abcd' as binary)); -- "[61 62 63 64 65 66]"

.. spark:function:: toprettystring(x, timeZone) -> varchar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think timeZone can be removed and let the above doc for toprettystring(x) cover timestamp type. TimeZone is actually read from Velox config. And like other time/date function, Gluten will not directly pass it to native function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I change the implement but forget the document.

{prefix + "toprettystring"});
registerFunction<ToPrettyStringFunction, Varchar, float>(
{prefix + "toprettystring"});
registerFunction<ToPrettyStringFunction, Varchar, double>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use a helper function for these numeric types. E.g., add registerUnaryNumeric in RegistrationHelpers.h, then use it here.

velox/functions/sparksql/ToPrettyString.h Show resolved Hide resolved
if constexpr (std::is_same_v<TInput, StringView>) {
if (inputType_->isVarchar()) {
result = input;
} else if (inputType_->isVarbinary()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code for cast cannot be re-used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

template <typename TInput>
void callNullable(out_type<Varchar>& result, const TInput* a) {
if (a) {
call(result, *a);
Copy link
Contributor

@PHILO-HE PHILO-HE Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As call method co-exists, the non-null input will directly go into call method and will not go here. Maybe, we can only keep callNullable method with the impl. for call covered.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We can merge the implementation of call into callNullable.

template <typename TInput>
void callNullable(out_type<Varchar>& result, const TInput* a) {
if (a) {
call(result, *a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

velox/functions/sparksql/ToPrettyString.h Show resolved Hide resolved
input, scale_, maxRowSize_, result.data());
if (view.isInline()) {
result.setNoCopy(view);
result.resize(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right. Why are we resizing to 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is right. The result is inline, so we don't need to allocate from buffer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we write some bytes to result.data() indeed, so it looks not right to resize it to 0. Alternatively, we could write data to an inlined buffer and set it to result.

char inlined[StringView::kInlineSize];

result.setNoCopy(view);
result.resize(0);
} else {
result.resize(view.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is not need for if-else block. You can just do this unconditionally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for decimal, min long decimal value -1'000'000'000'000'000'000 + 1 with . is not inline

velox/functions/sparksql/ToPrettyString.h Outdated Show resolved Hide resolved
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added several comments.

@@ -283,6 +283,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa"
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa"

.. spark:function:: toprettystring(x) -> varchar
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name suggestion: to_pretty_string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is used in Spark plan, so I use this name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps follow Velox's style for the function name and add a mapping in compute engine.


- It prints binary values (either from column or struct field) using the hex format. ::

SELECT toprettystring('spark'); -- "spark"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The queries fail in Spark as internal function cannot be called by user. Shall we remove from documentation if their tests are covered?

SELECT toprettystring('spark'); -- "spark"
SELECT toprettystring(12); -- "12"
SELECT toprettystring(null); -- "NULL"
SELECT toprettystring(cast('abcd' as binary)); -- "[61 62 63 64 65 66]"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the result [61 62 63 64]?

SELECT toprettystring(12); -- "12"
SELECT toprettystring(null); -- "NULL"
SELECT toprettystring(cast('abcd' as binary)); -- "[61 62 63 64 65 66]"
SELECT toprettystring('2000-01-01 12:21:56'); -- "2000-01-01 04:21:56"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps document how timezone is handled.

@@ -51,6 +51,9 @@ inline std::exception_ptr makeBadCastException(
makeErrorMessage(input, row, resultType, errorDetails),
false));
}
} // namespace

namespace detail {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to extract convertToStringView out in a header to use it elsewhere.

APIs in 'detail' namespaces are not supposed to be used outside of the header/cpp files.
#10241 (comment)

velox/functions/sparksql/ToPrettyString.h Show resolved Hide resolved
velox/functions/sparksql/ToPrettyString.h Outdated Show resolved Hide resolved
velox/functions/sparksql/ToPrettyString.h Outdated Show resolved Hide resolved
input, scale_, maxRowSize_, result.data());
if (view.isInline()) {
result.setNoCopy(view);
result.resize(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we write some bytes to result.data() indeed, so it looks not right to resize it to 0. Alternatively, we could write data to an inlined buffer and set it to result.

char inlined[StringView::kInlineSize];

velox/functions/sparksql/ToPrettyString.h Show resolved Hide resolved
@jinchengchenghh
Copy link
Contributor Author

setNoCopy will copy the data to StringView itself when it is inline.
https://github.com/facebookincubator/velox/blob/main/velox/vector/FlatVector.cpp#L137
So we don't need to reserve the data in String allocators @rui-mo

@rui-mo
Copy link
Collaborator

rui-mo commented Jul 5, 2024

@jinchengchenghh I notice some bytes are written to result.data(), but resizing it to 0 does not clear the data. Therefore, result has some data but its size is zero, which looks not correct.
https://github.com/facebookincubator/velox/blob/main/velox/functions/UDFOutputString.h#L45-L51

@@ -283,6 +283,20 @@ Unless specified otherwise, all functions return NULL if at least one of the arg
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa"
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa"

.. spark:function:: toprettystring(x) -> varchar
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps follow Velox's style for the function name and add a mapping in compute engine.

@@ -283,6 +283,16 @@ Unless specified otherwise, all functions return NULL if at least one of the arg
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa"
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa"

.. spark:function:: toprettystring(x) -> varchar

Returns pretty string for all scalar type. It has several differences with casting value to string:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document suggestion:

Returns pretty string for the input value ``x``. All scalar types are supported. There are several differences with the behavior of casting ``x`` as string.

Also notice Spark describes it to support all kinds of values, and are we choosing to support scalar types only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't support complex datatype now.


- It prints binary values (either from column or struct field) using the hex format.

Considering the time zone if config ``session_timezone`` is set. ::
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to L288.

@@ -165,6 +166,29 @@ inline void registerArrayMinMaxFunctions(const std::string& prefix) {
registerArrayMinMaxFunctions<Timestamp>(prefix);
registerArrayMinMaxFunctions<Date>(prefix);
}

void registerToPrettyStringFunctions(const std::string& prefix) {
registerUnaryIntegralWithTReturn<ToPrettyStringFunction, Varchar>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: declare a const variable to avoid repeating "toprettystring".

velox/functions/sparksql/Register.cpp Show resolved Hide resolved
}
if constexpr (std::is_same_v<TInput, int32_t>) {
if (inputType_->isDate()) {
auto output = DATE()->toString(*input);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion could also throw exception as it calls DateType::toIso8601. We may need to handle the exception correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for your catch. I will change the return type to Status.
Spark will also throw exception when it cannot be valid Date, the LocalDate.ofEpochDay(days) may throw exception. https://github.com/apache/spark/blob/branch-3.5/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala#L70

util::Converter<TypeKind::VARCHAR, void, util::SparkCastPolicy>::
tryCast(*input);
if (castResult.hasError()) {
result.setNoCopy(detail::kNull);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indicates for invaild input null is returned. Is that the case in Spark? Perhaps also document and test it.

/// toprettystring(x) -> varchar
/// Returns pretty string for int8, int16, int32, int64, bool, Date, Varchar. It
/// has one difference with casting value to string:
/// - It prints null values (either from column or struct field) as "NULL".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Date -> date, Varchar -> varchar to align with others.

/// It has one difference with casting value to string:
/// - It prints null values (either from column or struct field) as "NULL".

Comment suggestion: The difference with casting input value as string is that "NULL" is returned for null inputs.

either from column or struct field

It seems complex type is not supported. Do we need this description?

/// Returns pretty string for int8, int16, int32, int64, bool, Date, Varchar. It
/// has one difference with casting value to string:
/// - It prints null values (either from column or struct field) as "NULL".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this empty line.

};

/// Returns pretty string for Varbinary. It has several differences with casting
/// value to string:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Varbinary -> varbinary
There are several differences with cast(varbinary as string).


/// Returns pretty string for Varbinary. It has several differences with casting
/// value to string:
/// - It prints null values (either from column or struct field) as "NULL".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: - > 1), perhaps remove either from column or struct field as varbinary could not have struct field.

/// Returns pretty string for Varbinary. It has several differences with casting
/// value to string:
/// - It prints null values (either from column or struct field) as "NULL".
/// - It prints binary values (either from column or struct field) using the hex
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

/// - It prints null values (either from column or struct field) as "NULL".
/// - It prints binary values (either from column or struct field) using the hex
/// format. Returns a pretty string of the byte array which prints each byte as
/// a hex digit and add spaces between them. For example, [1A C0].
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns a pretty string of the byte array which prints each byte as a hex digit and add spaces between them. For example, [1A C0].

Comment suggestion:

/// The pretty string is composed of the hex digits of bytes and spaces between
/// them. E.g., the result of to_pretty_string("abc") is "[31 32 33]".

TypePtr inputType_;
};

/// Returns pretty string for Varbinary. It has several differences with casting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the binary format is controlled by a config SQLConf.BINARY_OUTPUT_STYLE, and shall we document that only hex format is supported?

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToPrettyString.scala#L52

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char* pos = startPosition;
*pos++ = '[';
for (auto i = 0; i < input->size(); i++) {
auto formated = fmt::format("{:X}", input->data()[i]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: formated -> formatted

@jinchengchenghh jinchengchenghh changed the title Add SparkSql function toprettystring Add SparkSql function to_pretty_string Aug 19, 2024
@jinchengchenghh
Copy link
Contributor Author

Address all the comments.

@@ -342,6 +342,16 @@ String Functions
SELECT substring_index('aaaaa', 'aa', 5); -- "aaaaa"
SELECT substring_index('aaaaa', 'aa', -5); -- "aaaaa"

.. spark:function:: to_pretty_string(x) -> varchar

Returns pretty string for all scalar type. Considering the time zone if config ``session_timezone`` is set.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document suggestion:

Returns pretty string for ``x``. All scalar types are supported. Adjusts the timestamp input to the given time zone if set through ``session_timezone`` config.


Returns pretty string for all scalar type. Considering the time zone if config ``session_timezone`` is set.

It has several differences with casting value to string:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document suggestion:

The result is different from that of casting ``x`` as string in the following aspects.


It has several differences with casting value to string:

- It prints null values as "NULL".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document suggestion:

It prints null input as "NULL" rather than producing null output.

/// to_pretty_string(x) -> varchar
/// Returns pretty string for int8, int16, int32, int64, bool, Date, Varchar. It
/// has one difference with casting value to string:
/// 1) It prints null values as "NULL".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It prints null input as "NULL" rather than producing null output.

for (auto i = 0; i < input->size(); i++) {
auto formatted = fmt::format("{:X}", input->data()[i]);
*pos++ = formatted.data()[0];
*pos++ = formatted.data()[1];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can use 'snprintf' to write to 'pos' directly and avoid creating intermediate string.

std::snprintf(pos, 2, "%02X", static_cast(input->data()[i]));


- It prints null values as "NULL".

- It prints binary values using the hex format.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It prints binary value as hex string representation rather than UTF-8.

Timestamp::tsToStringView(inputValue, options_, result.data());
result.resize(stringView.size());
} catch (const std::exception& e) {
return Status::Invalid(e.what());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try-catch is used for date and timestamp types. Wondering if it is preferred to make these APIs return status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants