Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection CharacterSet is incorrectly determined #322

Closed
dezinezync opened this issue Oct 7, 2023 · 5 comments
Closed

Connection CharacterSet is incorrectly determined #322

dezinezync opened this issue Oct 7, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@dezinezync
Copy link

Describe the bug

For the following query:

SELECT * FROM performance_schema.session_variables
WHERE VARIABLE_NAME IN (
 'character_set_client', 'character_set_connection',
 'character_set_results', 'collation_connection'
) ORDER BY VARIABLE_NAME;

Sequel Pro reports the results as follows:

VARIABLE_NAME VARIABLE_VALUE
character_set_client utf8mb4
character_set_connection utf8mb4
character_set_results utf8mb4
collation_connection utf8mb4_general_ci

However, my Vapor-Fluent setup with the mysql driver reports it as follows:

VARIABLE_NAME VARIABLE_VALUE
character_set_client latin1
character_set_connection latin1
character_set_results latin1
collation_connection latin1_swedish_ci

Configuring the database in Vapor:

import Vapor
import Fluent
import FluentMySQLDriver

public func configure(_ app: Application) throws {
  // ...
  app.databases.use(
    .mysql(
      hostname: readHostname,
      port: port,
      username: username,
      password: password,
      database: DB_NAME,
      tlsConfiguration: tlsConfig,
      maxConnectionsPerEventLoop: 5
    ),
    as: .readDB
  )
}

Many years ago, there was an option to set the CharacterSet on the connection, but has since been removed. I couldn't figure out a way to configure this with the latest versions of the related packages.

Due to the incorrect inferred character set, unicode characters get substituted by ? in VARCHAR and TEXT fields.

I'd appreciate any help debugging this issue or being able to configure the connection character set and collation params.

Environment

  • mysql: 5.7.19
  • macOS: 14.0
  • ubuntu: 20.04.6 LTS
  • Swift: 5.8.1
  • Vapor: 4.84.4
  • mysql-kit: 4.7.1
  • fluent-mysql-driver: 4.4.0

Related

The following issues/PRs may be related:

@dezinezync dezinezync added the bug Something isn't working label Oct 7, 2023
@dezinezync
Copy link
Author

Temporary workaround: Executing the following query switches the connection to use the desired charset and collation:

SET NAMES 'utf8mb4' COLLATE 'utf8mb4_general_ci';

@gwynne
Copy link
Member

gwynne commented Oct 7, 2023

This is addressed by the mysql-nio rewrite I've 90% completed (and trust me, no one's more sick of hearing me say it's "almost done" for a year straight than I am 😓😆). I'll make sure this issue gets mentioned in the PR as soon as it's up.

In the meantime, you can work around it by issuing an appropriate SET NAMES query, but because of how Fluent's connection pools currently work, the only way this can work reliably is if you run it inside a .withConnection() closure wrapped around every single route handler (and/or other database usage). For the record, the query would be as simple as:

try await (db as! any MySQLDatabase).simpleQuery("SET NAMES utf8mb4", onRow: { _ in }).run()

Technical note: This is equally as much an issue caused by MySQL 5.7's epically outdated handling of Unicode as it is any fault of the (very shoddy) existing MySQLNIO implementation. For some gory details, have a look at the documentation comments I've included with the relevant logic in the MySQLNIO rewrite: https://gist.github.com/gwynne/95679fc31b6b897799684ad5b9073066. (Please note this will almost certainly not be the final code; as mentioned, the rewrite is still in progress.)

P.S.: As per the comments in the Gist, you probably want to use the utf8mb4_unciode_520_ci collation if you can't update to MySQL 8.x (which I would very strongly recommend if at all possible).

@dezinezync
Copy link
Author

@gwynne I really appreciate the super quick response, with details (especially on a weekend 🫡)

Glad to know this is being refactored and support will be landing soon.

In the meantime, you can work around it by issuing an appropriate SET NAMES query

Yes, I have a temporary workaround for this, but it'd be just cleaner to have upstream support.

As per the comments in the Gist, you probably want to use the utf8mb4_unciode_520_ci collation if you can't update to MySQL 8.x

I'll test with the suggested collation param and follow up shortly.

I'm unable to update to MySQL 8.x immediately, but this is scheduled for December. Do you reckon this is a non-issue when running MySQL 8.x or would the workaround still be necessary until your refactor is complete?

@gwynne
Copy link
Member

gwynne commented Oct 7, 2023

The single-byte collection identifier that the current version of MySQLNIO sends in the protocol HandshakeResponse41 packet is 255, which corresponds to utf8mb4_0900_ai_ci, which means that with MySQL 8.x you get the correct behavior every time.

Of course, that collation ID doesn't exist yet in 5.7 (or, for that matter, in even the most recent versions of MariaDB), so the server falls back on its default-configured character set instead. 5.7 has still been getting maintenance updates all this time (although it's soon to FINALLY be declared EOL and put out of its misery at long last); at some point and they changed the default for new installations (and most hosted solutions and UNIX distros in general patched that default in a much longer time ago), so even those still using it have a decent chance of lucking into a correct configuration. Unfortunately, many - such as yourself - still end up with utf8 (aka utf8mb3, the crippled half-Unicode BMP-only encoding MySQL fail-invented) or latin1 - and since by deliberate design Unicode's first 256 codepoints correspond exactly to those of ISO-Latin-1 (which in turn means the first 128 are the ASCII table), it's easy to not notice it for an extended period.

@dezinezync
Copy link
Author

As per the comments in the Gist, you probably want to use the utf8mb4_unciode_520_ci collation if you can't update to MySQL 8.x (which I would very strongly recommend if at all possible)

Tested locally updating MySQL to v8.0.3-rc (latest supported by the OS) and I can confirm this works without any workarounds.

I'll keep the workaround for my production MySQL server which I'm unable to update immediately to v8.

Of course, that collation ID doesn't exist yet in 5.7 (or, for that matter, in even the most recent versions of MariaDB), so the server falls back on its default-configured character set instead. 5.7 has still been getting maintenance updates all this time (although it's soon to FINALLY be declared EOL and put out of its misery at long last);

Yes, as per your suggestion, using utf8mb4_unicode_520_ci works as expected for now.

I'm closing this issue, and I look forward to your updates. I appreciate your help @gwynne.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants