Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(strings): make StringValue.capitalize() consistent across backends #8270

Merged
merged 11 commits into from
Feb 15, 2024

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Feb 7, 2024

Fixes #8271.

BREAKING CHANGE: Backends that previouslly used initcap (analogous to str.title) to implement StringValue.capitalize() will produce different results when the input string contains multiple words (a word's definition being backend-specific).

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this Nick!

Without doing a survey of what's more common, I personally would find capitalizing anything but the first character (and lowercasing the remaining) in the string surprising. I'd vote to follow Python's str.capitalize behavior here. It's also easier to implement using more common string slice/concat/upper/lower primatives for backends that don't have this functionality.

ibis/backends/tests/test_string.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_string.py Outdated Show resolved Hide resolved
@NickCrews
Copy link
Contributor Author

@jcrist I wrote this when I thought the intended behavior was like .title(). But I think you are right, let me change these test cases.

@NickCrews
Copy link
Contributor Author

Wow, after the sqlglot refactor this is SO much cleaner. Love it, thanks so much for your work on that big rewrite!

@NickCrews
Copy link
Contributor Author

This might deserve a callout fairly prominently in the release notes, because although it might technically be a fix, that is only due to a poor definition of what capitalize was supposed to do. I bet some users' results are going to change as a result of this. There was even a test in the postgres backend testing for the old semantics that I had to deleted.

@NickCrews NickCrews force-pushed the capitalize-bug branch 2 times, most recently from cfe6bfd to fa05985 Compare February 13, 2024 19:26
@NickCrews
Copy link
Contributor Author

NickCrews commented Feb 13, 2024

It appears that exasol treats || as equivalent to concat(), which is not ANSI standard SQL when you have NULL || 'abc': You should get NULL, but exasol gives 'abc'. We already have a workaround for this in ops.StringConcat compiler for the exasol backend. Maybe we should redo this implementation to not compile directly with the DPipe, but instead rewrite the entire op.Capitalize in terms of op.StringConcat, op.Upper, etc ?

@NickCrews
Copy link
Contributor Author

I filed a support ticket with exasol and asked them to chime in here. Curious if they will fix their implementation, or if we will need to have a custom workaround forever.

@NickCrews NickCrews marked this pull request as draft February 13, 2024 20:22
@NickCrews NickCrews marked this pull request as ready for review February 13, 2024 21:33
@NickCrews
Copy link
Contributor Author

@jcrist @cpcloud OK I'm a little stumped as to the reason for the failing tests, since they should be using StringConcat and Substring, which both look tested on these backends. I can't run any of these backends locally since I don't have the creds. Do you think either of you could finish this up?

@NickCrews
Copy link
Contributor Author

also, it appears that many backends have that "bug" that exasol does, so maybe it's not a bug.

@gforsyth
Copy link
Member

I can't run any of these backends locally since I don't have the creds.

Except for BigQuery and Snowflake, where there are real credentials that are necessarily secret, you can find the credentials for all the backends in their respective conftest.py files.
I'll often run the test suite with a pytest -m exasol and then kill it after the first test passes, so that the test startup has a chance to load the test data into the backend container.

@NickCrews
Copy link
Contributor Author

I try just up and see

$ just up
docker compose up --wait 
[+] Building 0.0s (0/0)                                                                                                                   docker:desktop-linux
[+] Running 17/18
 ✔ Container ibis-flink-jobmanager-1   Created                                                                                                            0.0s 
 ✘ Container ibis-minio-1              Error                                                                                                              0.0s 
 ✔ Container ibis-statestored-1        Started                                                                                                            0.0s 
 ✔ Container impala-hive-metastore     Created                                                                                                            0.0s 
 ✔ Container ibis-mysql-1              Created                                                                                                            0.0s 
 ✔ Container druid-postgres            Created                                                                                                            0.0s 
 ✔ Container ibis-kudu-1               Started                                                                                                            0.0s 
 ✔ Container ibis-postgres-1           Running                                                                                                            0.0s 
 ✔ Container ibis-mssql-1              Created                                                                                                            0.0s 
 ✔ Container ibis-hive-metastore-db-1  Created                                                                                                            0.0s 
 ✔ Container zookeeper                 Created                                                                                                            0.0s 
 ✔ Container ibis-exasol-1             Created                                                                                                            0.0s 
 ✔ Container ibis-clickhouse-1         Running                                                                                                            0.0s 
 ✔ Container ibis-flink-1              Created                                                                                                            0.0s 
 ✔ Container ibis-hive-metastore-1     Created                                                                                                            0.0s 
 ✔ Container coordinator               Created                                                                                                            0.0s 
 ✔ Container ibis-trino-1              Created                                                                                                            0.0s 
 ⠦ Container ibis-oracle-1             Starting                                                                                                           0.5s 
Error response from daemon: mkdir /var/lib/docker/overlay2/9e12c9117dc506231210816af71282562ba269c4bd946ce870c4ad65316c75cc/merged: no space left on device
error: Recipe `up` failed on line 101 with exit code 1

so I am ignoring the minio error (maybe a separate issue?), since exasol looks to be working, and then try pytest -m exasol -k capitalize ibis/backends/tests/test_string.py

and then get a bunch of

__________________________________________________ ERROR at setup of test_capitalize[exasol-aBc1dEf-Abc1def] __________________________________________________

self = <ExaConnection session_id= dsn=localhost:8563 user=sys>

    def _init_ws(self):
        """
        Init websocket connection
        Connection redundancy is supported
        Specific Exasol node is randomly selected for every connection attempt
        """
        dsn_items = self._process_dsn(self.options['dsn'])
        failed_attempts = 0
    
        ws_prefix = 'wss://' if self.options['encryption'] else 'ws://'
        ws_options = self._get_ws_options()
    
        for hostname, ipaddr, port, fingerprint in dsn_items:
            self.logger.debug(f"Connection attempt [{ipaddr}:{port}]")
    
            # Use correct hostname matching IP address for each connection attempt
            if self.options['encryption']:
                ws_options['sslopt']['server_hostname'] = hostname
    
            try:
>               self._ws = websocket.create_connection(f'{ws_prefix}{ipaddr}:{port}', **ws_options)

.venv/lib/python3.11/site-packages/pyexasol/connection.py:667: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.11/site-packages/websocket/_core.py:646: in create_connection
    websock.connect(url, **options)
.venv/lib/python3.11/site-packages/websocket/_core.py:256: in connect
    self.sock, addrs = connect(
.venv/lib/python3.11/site-packages/websocket/_http.py:141: in connect
    sock = _open_socket(addrinfo_list, options.sockopt, options.timeout)
.venv/lib/python3.11/site-packages/websocket/_http.py:228: in _open_socket
    raise err
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

addrinfo_list = [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 8563))], sockopt = [], timeout = 10

    def _open_socket(addrinfo_list, sockopt, timeout):
        err = None
        for addrinfo in addrinfo_list:
            family, socktype, proto = addrinfo[:3]
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            for opts in DEFAULT_SOCKET_OPTION:
                sock.setsockopt(*opts)
            for opts in sockopt:
                sock.setsockopt(*opts)
    
            address = addrinfo[4]
            err = None
            while not err:
                try:
>                   sock.connect(address)
E                   ConnectionRefusedError: [Errno 61] Connection refused

.venv/lib/python3.11/site-packages/websocket/_http.py:205: ConnectionRefusedError

During handling of the above exception, another exception occurred:

request = <SubRequest 'backend' for <Function test_capitalize[exasol-Abc-Abc]>>, data_dir = PosixPath('/Users/nc/code/ibis/ci/ibis-testing-data')
tmp_path_factory = TempPathFactory(_given_basetemp=None, _trace=<pluggy._tracing.TagTracerSub object at 0x152821510>, _basetemp=PosixPath...var/folders/0b/tpgpsvx54f9g1gg1lkq45syc0000gp/T/pytest-of-nc/pytest-164'), _retention_count=3, _retention_policy='all')
worker_id = 'master'

    @pytest.fixture(params=_get_backends_to_test(), scope="session")
    def backend(request, data_dir, tmp_path_factory, worker_id) -> BackendTest:
        """Return an instance of BackendTest, loaded with data."""
    
        cls = _get_backend_conf(request.param)
>       return cls.load_data(data_dir, tmp_path_factory, worker_id)

ibis/backends/conftest.py:434: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ibis/backends/tests/base.py:158: in load_data
    inst = cls(data_dir=data_dir, tmpdir=tmpdir, worker_id=worker_id, **kw)
ibis/backends/tests/base.py:100: in __init__
    self.connection = self.connect(tmpdir=tmpdir, worker_id=worker_id, **kw)
ibis/backends/exasol/tests/conftest.py:49: in connect
    return ibis.exasol.connect(
ibis/__init__.py:107: in connect
    return backend.connect(*args, **kwargs)
ibis/backends/base/__init__.py:864: in connect
    new_backend.reconnect()
ibis/backends/base/__init__.py:874: in reconnect
    self.do_connect(*self._con_args, **self._con_kwargs)
ibis/backends/exasol/__init__.py:87: in do_connect
    self.con = pyexasol.connect(
.venv/lib/python3.11/site-packages/pyexasol/__init__.py:68: in connect
    return ExaConnection(**kwargs)
.venv/lib/python3.11/site-packages/pyexasol/connection.py:184: in __init__
    self._init_ws()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ExaConnection session_id= dsn=localhost:8563 user=sys>

    def _init_ws(self):
        """
        Init websocket connection
        Connection redundancy is supported
        Specific Exasol node is randomly selected for every connection attempt
        """
        dsn_items = self._process_dsn(self.options['dsn'])
        failed_attempts = 0
    
        ws_prefix = 'wss://' if self.options['encryption'] else 'ws://'
        ws_options = self._get_ws_options()
    
        for hostname, ipaddr, port, fingerprint in dsn_items:
            self.logger.debug(f"Connection attempt [{ipaddr}:{port}]")
    
            # Use correct hostname matching IP address for each connection attempt
            if self.options['encryption']:
                ws_options['sslopt']['server_hostname'] = hostname
    
            try:
                self._ws = websocket.create_connection(f'{ws_prefix}{ipaddr}:{port}', **ws_options)
            except Exception as e:
                self.logger.debug(f'Failed to connect [{ipaddr}:{port}]: {e}')
    
                failed_attempts += 1
    
                if failed_attempts == len(dsn_items):
>                   raise ExaConnectionFailedError(self, 'Could not connect to Exasol: ' + str(e))
E                   pyexasol.exceptions.ExaConnectionFailedError: 
E                   (
E                       message     =>  Could not connect to Exasol: [Errno 61] Connection refused
E                       dsn         =>  localhost:8563
E                       user        =>  sys
E                       schema      =>  
E                       session_id  =>  
E                   )

.venv/lib/python3.11/site-packages/pyexasol/connection.py:674: ExaConnectionFailedError

I am on an M1 mac, is that an issue? Am I doing something else wrong?

@cpcloud
Copy link
Member

cpcloud commented Feb 14, 2024

You won't be able to run every service on macos. Some of these services use Linux-specific APIs.

@gforsyth
Copy link
Member

Am I doing something ... wrong?

I am on an M1 mac

So yeah, as Phillip mentioned, not all of these work out of the box on mac. I believe @ncclementi has at least some of them working using colima and is planning to write up some instructions on how to do it (not sure if that will cover exasol or not, but maybe?)

@NickCrews
Copy link
Contributor Author

OK, thanks. Do you think either of you could finish this PR up then in the meantime?

@ncclementi
Copy link
Contributor

ncclementi commented Feb 14, 2024

I got the postgress one up and running, and I was able to run tests. But I was trying to see if I could get up the exasol container to run the tests but I didn't make it pass. I couldn't get the container up and healthy.

I tried also using colima's virtualization but I get

container ibis-exasol-1 is unhealthy
error: Recipe `up` failed on line 101 with exit code 1

When I check the logs I get

[2024-02-14 20:54:39] child 757 (membership) returned with state 1.
Wed Feb 14 20:54:42 CET 2024: start script '/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/sbin/cos_sync_files'.
Wed Feb 14 20:54:42 CET 2024: pack files for sync.
exa/etc/dwad/db_DB1_ca.pem
exa/etc/dwad/db_DB1_cert.pem
exa/etc/dwad/db_DB1_key.pem
No nodes to synchronize.
Wed Feb 14 20:54:46 CET 2024: start script '/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/sbin/cos_sync_files'.
Wed Feb 14 20:54:46 CET 2024: pack files for sync.
exa/etc/dwad/db_DB1.conf
No nodes to synchronize.
Traceback (most recent call last):
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/libexec/exainit.py", line 132, in <module>
    main(sys.argv)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/libexec/exainit.py", line 124, in main
    run(args[1])
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/libexec/exainit.py", line 92, in run
    serv.run_result = serv.run()
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 135, in run
    self._db_setup_and_start (db, dbname, dbpars, True)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 117, in _db_setup_and_start
    self.start(db)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 84, in <lambda>
    return lambda db, *a, **kw: self._wrappercall(getattr(db, name), *a, **kw)
  File "/usr/opt/EXASuite-7/EXAClusterOS-7.1.25/lib/exainit/tasks/database.py", line 90, in _wrappercall
    raise RuntimeError("Could not call function %s(*%s, **%s): %s,%s" % (repr(func), repr(args), repr(kw), repr(ret), repr(val)))
RuntimeError: Could not call function <built-in method start of libDWAd_py3.exa_db object at 0x7fe863c5a830>(*(), **{}): 1,'Could not start database: system does not have enough active nodes or DWAd was not able to create startup parameters for system'
[2024-02-14 20:54:48] root child 235 (exainit.py) returned with state 1.

any idea what is wrong?

Maybe this is because exasol only supports linux, and their image says https://github.com/EXASOL/docker-db?tab=readme-ov-file#host-os

@NickCrews
Copy link
Contributor Author

hmm, it seems to me that MacOS is the most likely cause of exasol not working, and doesn't look like there are easy workarounds.

@ncclementi
Copy link
Contributor

Yes, it seems people have been asking for a while for support of the exasol image for M1 but no answers.
see https://github.com/exasol/docker-db/issues/63

I bypass the error above providing more memory but the container is always "unhealthy", I hit this same error https://github.com/exasol/docker-db/issues/77 (it's been open for a while) and not sure what's the reason behind.

Improve docstring, add some more test cases.
Also remove some backend-specific
test cases that are now redundant.

I tried compiling the operation directly, but
since differenct backedns have different semantics
around concatenating null strings, there were a lot
of compat code. Since StringConcat already
has that compat built in already, let's leverage that.
@cpcloud cpcloud self-assigned this Feb 15, 2024
@cpcloud
Copy link
Member

cpcloud commented Feb 15, 2024

The amount of additional code generated here doesn't seem worth the trouble.

For Oracle we end up generating this monstrosity to capitalize the empty string:

SELECT
  CASE
    WHEN LENGTH('') = 0
    THEN ''
    ELSE CASE
      WHEN UPPER(
        CASE
          WHEN (
            0 + 1
          ) >= 1
          THEN SUBSTR('', 0 + 1, 1)
          ELSE SUBSTR('', 0 + 1 + LENGTH(''), 1)
        END
      ) IS NULL
      OR LOWER(
        CASE
          WHEN (
            1 + 1
          ) >= 1
          THEN SUBSTR('', 1 + 1, CASE WHEN (
            LENGTH('') - 1
          ) < 0 THEN 0 ELSE LENGTH('') - 1 END)
          ELSE SUBSTR(
            '',
            1 + 1 + LENGTH(''),
            CASE WHEN (
              LENGTH('') - 1
            ) < 0 THEN 0 ELSE LENGTH('') - 1 END
          )
        END
      ) IS NULL
      THEN NULL
      ELSE CONCAT(
        UPPER(
          CASE
            WHEN (
              0 + 1
            ) >= 1
            THEN SUBSTR('', 0 + 1, 1)
            ELSE SUBSTR('', 0 + 1 + LENGTH(''), 1)
          END
        ),
        LOWER(
          CASE
            WHEN (
              1 + 1
            ) >= 1
            THEN SUBSTR('', 1 + 1, CASE WHEN (
              LENGTH('') - 1
            ) < 0 THEN 0 ELSE LENGTH('') - 1 END)
            ELSE SUBSTR(
              '',
              1 + 1 + LENGTH(''),
              CASE WHEN (
                LENGTH('') - 1
              ) < 0 THEN 0 ELSE LENGTH('') - 1 END
            )
          END
        )
      )
    END
  END AS "Capitalize('')"

And it's still not correct.

@cpcloud
Copy link
Member

cpcloud commented Feb 15, 2024

Alright, it turns out the mishandling of the empty string is reproducible in the oracledb driver alone. I opened oracle/python-oracledb#298 for that.

@cpcloud cpcloud added this to the 9.0 milestone Feb 15, 2024
@cpcloud cpcloud added bug Incorrect behavior inside of ibis refactor Issues or PRs related to refactoring the codebase labels Feb 15, 2024
@cpcloud cpcloud changed the title fix: make String.capitalize() consistent fix(strings): make StringValue.capitalize() consistent across backends Feb 15, 2024
@cpcloud
Copy link
Member

cpcloud commented Feb 15, 2024

Gosh, what a hellscape this problem is.

@cpcloud cpcloud enabled auto-merge (squash) February 15, 2024 14:29
@cpcloud cpcloud disabled auto-merge February 15, 2024 14:29
@cpcloud cpcloud merged commit c4055d6 into ibis-project:main Feb 15, 2024
74 checks passed
@NickCrews
Copy link
Contributor Author

You're a hero @cpcloud ! Thank you!

ncclementi pushed a commit to ncclementi/ibis that referenced this pull request Feb 21, 2024
…nds (ibis-project#8270)

Fixes ibis-project#8271.

BREAKING CHANGE: Backends that previously used initcap (analogous to str.title) to implement StringValue.capitalize() will produce different results when the input string contains multiple words (a word's definition being backend-specific).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis refactor Issues or PRs related to refactoring the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: String.capitalize() is broken/inconsistent
5 participants