Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(mssql): add lpad and rpad ops #10060

Merged
merged 3 commits into from
Sep 10, 2024
Merged

Conversation

IndexSeek
Copy link
Contributor

Description of changes

I noticed that MSSQL was the only backend missing lpad and rpad, so I wanted to add support for these.

The LPAD operation was a bit tricky. I'm wondering if there might be a more straightforward way, but after several attempts, this seemed to handle all of the cases well.

@IndexSeek
Copy link
Contributor Author

Here was my initial implementation before it became what it is now:

    def visit_LPad(self, op, *, arg, length, pad=" "):
        return self.f.right(self.f.concat(self.f.replicate(pad, length), arg), length)

But it was giving me this result:

string_col index_col LPad(string_col, 4, '-')
AbC\t 0 AbC\t
\n123\n 1 \n
abc, 123 2 123
123 3 -123
aBc 4 -aBc
🐍 5 ---🐍
ÉéÈèêç 6 Èèêç

Where I was expecting this result:

string_col index_col LPad(string_col, 4, '-')
AbC\t 0 AbC\t
\n123\n 1 \n123
abc, 123 2 abc,
123 3 -123
aBc 4 -aBc
🐍 5 ---🐍
ÉéÈèêç 6 ÉéÈè

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through this!

ibis/backends/sql/compilers/mssql.py Outdated Show resolved Hide resolved
ibis/backends/sql/compilers/mssql.py Outdated Show resolved Hide resolved
@cpcloud cpcloud added feature Features or general enhancements mssql The Microsoft SQL Server backend labels Sep 9, 2024
@cpcloud
Copy link
Member

cpcloud commented Sep 9, 2024

@IndexSeek The pad functions are not consistent in their truncation behavior right now, so this PR should handle that at least for MS SQL.

The desired behavior is the Python behavior, which is to never truncate if the padded string would be shorter than existing string.

For example:

>>> s = "abc, 123"
>>> s.ljust(4, '-')
'abc, 123'

@cpcloud
Copy link
Member

cpcloud commented Sep 9, 2024

Added @gforsyth for review since he's been doing a bunch of work trying to vanquish (or least contain) the dark underbelly of accursed database strings.

@IndexSeek
Copy link
Contributor Author

The desired behavior is the Python behavior, which is to never truncate if the padded string would be shorter than existing string.

For example:

>>> s = "abc, 123"
>>> s.ljust(4, '-')
'abc, 123'

That makes total sense! I have an idea (that involves a case expression) of how we can make this work, and I'll try to tackle it this evening. Thanks for the initial review @cpcloud and the quick fix on the default value!

Added @gforsyth for review since he's been doing a bunch of work trying to vanquish (or least contain) the dark underbelly of accursed database strings.

Thanks for looping @gforsyth in! Containing the strings is no easy feat! When I looked through the tests, I thought the following section would make a great candidate for addition to the new "Cursed Knowledge" page:

pytest.mark.notyet(
["flink", "oracle"],
raises=AssertionError,
reason="Treats len(🐍) == 2 so padding is off",
),
pytest.mark.notyet(
["impala"],
raises=AssertionError,
reason="Treats len(🐍) == 4, len(Éé) == 4",
),

@cpcloud
Copy link
Member

cpcloud commented Sep 9, 2024

Oh, yeah, that page is just getting started 😂

@gforsyth
Copy link
Member

gforsyth commented Sep 9, 2024

Oh, yeah, that page is just getting started 😂

It will grow without bound

@gforsyth
Copy link
Member

gforsyth commented Sep 9, 2024

I think I'll add a few more (technically redundant) test tables for padding and length (don't let this block you here @IndexSeek) so we can more explicitly map out "levels" of string support into things like:

  1. Python style padding for "simple" strings (ascii-like)
  2. Python style padding for basic unicode (accented letters, non-english alphabets)
  3. Python style padding for graphemes, treating emoji as single characters

@IndexSeek
Copy link
Contributor Author

I have adjusted the functions now to support Python-style padding. I was able to reuse much of the previous logic with a slight change, essentially:

if the length in the l/rpad argument is greater than or equal to the length of the arg being padded:
    return the arg being padded
else:
    return the original expression

Since this behavior differs from the other backends, I added "mssql" to the notyet section for reason="Python style padding, e.g. doesn't trim strings to pad-length". I hope this will work in the meantime!

Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, @IndexSeek ! As noted, I have some new string data that's a little more simple-ASCII for validating the simpler lpad and rpad cases and your implementation here matches Python's padding behavior, modulo issues with the unclear definition of what "length" is.

@cpcloud cpcloud added this to the 9.5 milestone Sep 10, 2024
@cpcloud cpcloud merged commit 77af14b into ibis-project:main Sep 10, 2024
81 checks passed
cpcloud pushed a commit that referenced this pull request Sep 10, 2024
## Description of changes

Simplifying the previous implementation from `sge.Case(ifs...)` per
#10060 (comment).
@IndexSeek
Copy link
Contributor Author

This looks good to me, @IndexSeek ! As noted, I have some new string data that's a little more simple-ASCII for validating the simpler lpad and rpad cases and your implementation here matches Python's padding behavior, modulo issues with the unclear definition of what "length" is.

Thank you for the review! "length" is definitely tricky here.

SQL Server/T-SQL has the DATALENGTH function, returning the number of bytes, where as it's LEN function returns the number of characters. But it seems that some backends treat LEN like DATALENGTH. ¯\__(ツ)__/¯

I am excited to review the new string data - it will be great to cover more cases!

@IndexSeek IndexSeek deleted the mssql-lpad-rpad branch September 10, 2024 23:35
ncclementi pushed a commit to ncclementi/ibis that referenced this pull request Sep 24, 2024
Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
ncclementi pushed a commit to ncclementi/ibis that referenced this pull request Sep 24, 2024
## Description of changes

Simplifying the previous implementation from `sge.Case(ifs...)` per
ibis-project#10060 (comment).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements mssql The Microsoft SQL Server backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants