Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for StringView / character functions, fix regexp_like and regexp_match to work with StringView #11753

Merged
merged 7 commits into from
Aug 8, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 31, 2024

Which issue does this PR close?

Part of #11752
Part of #11790

Rationale for this change

I am trying to write up the remaining work for StringView, and one remaining item is add StringView support for StringFunctions. I want to know what functions already support StringView

It turns out we have a lot of string functions that probably need work to convert to natively support string view

What changes are included in this PR?

  1. Add some tests to string_view.slt showing if functions support StringView or not
  2. Fix regexp_like and regexp_match to work with StringView

Are these changes tested?

All tests

Are there any user-facing changes?

@alamb alamb marked this pull request as draft July 31, 2024 19:43
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 31, 2024
@alamb alamb force-pushed the alamb/test_string_view branch from 264523b to 45117f6 Compare July 31, 2024 20:13
@alamb alamb changed the title Minor: Add tests for StringView / character functions Add tests for StringView / character functions, fix regexp_like and regexp_match to work with StringView Jul 31, 2024
@alamb alamb marked this pull request as ready for review August 2, 2024 21:13
"The regexp_like function can only accept strings. Got {other}"
);
}
_ => Boolean,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, regexp_like fails with an error about unsupported types if passed a StringView

With this change it succeeds (though the StringView is cast to a String)

@tshauck and I are going to organize a bunch of changes to get StringView working across functions as part of #11790

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we return error for non-string type?

LargeUtf8 | Utf8 | Utf8View => Boolean
Null => Null
other => not supported

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the type coercion already handles, in theory, ensuring the correct types are passed in, so the extra check was not necessary. I can put back the explicit check if you prefer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems #11618 is striking again. I think @alamb is right, but it is very counter intuitive to return a Boolean because that case shouldn't exists.

Should we use 'unreachable!' ?

Copy link
Contributor

@jayzhan211 jayzhan211 Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the type coercion already handles, in theory, ensuring the correct types are passed in, so the extra check was not necessary. I can put back the explicit check if you prefer

I think you are right, we don't need another type check in return_type.

it is very counter intuitive to return a Boolean because that case shouldn't exists.

The type check should be handled in signature before return_type. We could improve docs about how the types and length check are handled in UDF/UDAF

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about I'll at least add some comments to make it clear why this isn't doing the typechecks here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use 'unreachable!' ?

I prefer using internal error when possible as unreachable! will panic. But now that you mention this, the function is going to panic at runtime anyways 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad I mixed up two things. In the thread about better error handling I think we agreed to return a planning error in this scenario, and some form of unreachable within the invoke function to signal that branch should never be executed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments in ead4a7c



## Ensure no casts for ASCII
## TODO file ticket
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We plan to file tickets for these functions as follow ons -- see #11787 (comment) for details

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the one for ASCII: #11834

I don't have the ability to update the label, so maybe it could be changed to a good first issue? Might help it get picked up sooner, and with the examples, it's not doing it from scratch.

## TODO file ticket
query TT
EXPLAIN SELECT
BTRIM(column1_utf8view, 'foo') AS l
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTRIM: #11835

## TODO file ticket
query TT
EXPLAIN SELECT
CHARACTER_LENGTH(column1_utf8view) AS l
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #11677

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 but for some reason the argument is still cast to Utf8

01)Projection: character_length(CAST(test.column1_utf8view AS Utf8)) AS l

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debugged and fixed it here: f42aa84

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks!

## TODO file ticket
query TT
EXPLAIN SELECT
concat(column1_utf8view, column2_utf8view) as c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONCAT: #11836

## TODO file ticket
query TT
EXPLAIN SELECT
concat_ws(', ', column1_utf8view, column2_utf8view) as c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
CONTAINS(column1_utf8view, 'foo') as c1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@tshauck tshauck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next batch

## TODO file ticket
query TT
EXPLAIN SELECT
ENDS_WITH(column1_utf8view, 'foo') as c1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
INITCAP(column1_utf8view) as c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
levenshtein(column1_utf8view, 'foo') as c1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
LOWER(column1_utf8view) as c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
LTRIM(column1_utf8view) as c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
LPAD(column1_utf8view, 12, ' ') as c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
OCTET_LENGTH(column1_utf8view) as c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
STARTS_WITH(column1_utf8view, 'foo') as c,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #11787

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tshauck @jayzhan211 and others who helped with this PR

I updated the tests / added the ticket links, and I now plan to merge this PR once the CI is green

@@ -425,102 +425,26 @@ logical_plan
01)Projection: starts_with(test.column1_utf8view, Utf8View("äöüß")) AS c1, starts_with(test.column1_utf8view, Utf8View("")) AS c2, starts_with(test.column1_utf8view, Utf8View(NULL)) AS c3, starts_with(Utf8View(NULL), test.column1_utf8view) AS c4
02)--TableScan: test projection=[column1_utf8view]

statement ok
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved these tests down to the other test for ASCII

@alamb alamb merged commit 368df80 into apache:main Aug 8, 2024
24 checks passed
@alamb
Copy link
Contributor Author

alamb commented Aug 8, 2024

Thanks again@

@alamb alamb deleted the alamb/test_string_view branch August 8, 2024 20:39
## TODO file ticket
query TT
EXPLAIN SELECT
OVERLAY(column1_utf8view PLACING 'foo' FROM 2 ) as c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## Ensure no casts for REGEXP_LIKE
query TT
EXPLAIN SELECT
REGEXP_LIKE(column1_utf8view, '^https?://(?:www\.)?([^/]+)/.*$') AS k
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@tshauck tshauck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another batch

## Ensure no casts for REGEXP_MATCH
query TT
EXPLAIN SELECT
REGEXP_MATCH(column1_utf8view, '^https?://(?:www\.)?([^/]+)/.*$') AS k
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## Ensure no casts for REGEXP_REPLACE
query TT
EXPLAIN SELECT
REGEXP_REPLACE(column1_utf8view, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
REPLACE(column1_utf8view, 'foo', 'bar') as c1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
REPEAT(column1_utf8view, 2) as c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
REVERSE(column1_utf8view) as c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
RTRIM(column1_utf8view) as c1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
RTRIM(column1_utf8view) as c,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RTRIM is here twice, ticket: #11916

## TODO file ticket
query TT
EXPLAIN SELECT
RIGHT(column1_utf8view, 3) as c2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
RPAD(column1_utf8view, 1) as c1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@tshauck tshauck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final batch

## TODO file ticket
query TT
EXPLAIN SELECT
SPLIT_PART(column1_utf8view, 'f', 1) as c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
STRPOS(column1_utf8view, 'f') as c,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
SUBSTR(column1_utf8view, 1) as c,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
TRANSLATE(column1_utf8view, 'foo', 'bar') as c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## TODO file ticket
query TT
EXPLAIN SELECT
FIND_IN_SET(column1_utf8view, 'a,b,c,d') as c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

('value2', arrow_cast('datafusion', 'Utf8View'), arrow_cast('cool', 'Utf8View'));

query T
select column2||' is fast' from temp;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket for concat operator is at: #11766

@alamb
Copy link
Contributor Author

alamb commented Aug 14, 2024

You are the best @tshauck -- given how fast these implementations are pouring in I am not even going to try and update the URLs in the text :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants