Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommend/convert to use case-sensitive collation for MySQL/MSSQL #28662

Merged
merged 8 commits into from
Jan 10, 2024

Conversation

wxiaoguang
Copy link
Contributor

@wxiaoguang wxiaoguang commented Dec 30, 2023

Mainly for MySQL/MSSQL.

It is important for Gitea to use case-sensitive database charset collation. If the database is using a case-insensitive collation, Gitea will show startup error/warning messages, and show the errors/warnings on the admin panel's Self-Check page.

Make gitea doctor convert work for MySQL to convert the collations of database & tables & columns.

⚠️ BREAKING ⚠️

It is not quite breaking, but it's highly recommended to convert the database&table&column to a consistent and case-sensitive collation.

@wxiaoguang wxiaoguang added the type/refactoring Existing code has been cleaned up. There should be no new functionality. label Dec 30, 2023
@wxiaoguang wxiaoguang added this to the 1.22.0 milestone Dec 30, 2023
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Dec 30, 2023
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 30, 2023
@github-actions github-actions bot added modifies/translation modifies/cli PR changes something on the CLI, i.e. gitea doctor or gitea admin modifies/docs labels Dec 30, 2023
@wxiaoguang wxiaoguang marked this pull request as draft December 30, 2023 18:10
@wxiaoguang wxiaoguang force-pushed the improve-db-collation branch 2 times, most recently from a160445 to e59c91f Compare December 30, 2023 18:48
@wxiaoguang wxiaoguang force-pushed the improve-db-collation branch 2 times, most recently from f95fa87 to ebc65a7 Compare December 31, 2023 05:06
@wxiaoguang wxiaoguang added the pr/breaking Merging this PR means builds will break. Needs a description what exactly breaks, and how to fix it! label Dec 31, 2023
@wxiaoguang wxiaoguang force-pushed the improve-db-collation branch 5 times, most recently from 03fc215 to 6fb3f5f Compare December 31, 2023 12:10
@wxiaoguang wxiaoguang changed the title [WIP] Recommend/convert to use case-sensitive collation for MySQL/MSSQL Recommend/convert to use case-sensitive collation for MySQL/MSSQL Dec 31, 2023
@wxiaoguang
Copy link
Contributor Author

wxiaoguang commented Dec 31, 2023

The problem

In history, Gitea doesn't strictly require charset/collation. It causes a lot of problems. Especially for MySQL/MSSQL:

  • Emoji causes errors (MySQL/MariaDB utf8mb3).
  • UNIQUE conflicts (branch name main vs Main) if the collation is case-insensitive.
  • SQL error when trying to compare columns with different collations.
  • Ordering problem, different collation generates different "ORDER BY" result.
  • Searching problem. LIKE '%...%' behaves differently for case-sensitive and case-insensitive collations.
  • MySQL/MariaDB have different collation systems but Gitea only have "MySQL" database type.

The solution

So maybe it's the time to fix the problem.

Ideally, Gitea should require strict and consistent collations. Considering there are a lot of old instances which may already have various database collations, this PR only does some simple fix and problem detection, and don't really force users to do anything.

This PR does:

  1. Only improve for MySQL/MSSQL. These two are mostly affected by the problems.
    • MSSQL has various limitations, so there is no "converting" solution for it at the moment.
  2. Introduce a config option [database].CHARSET_COLLATION
    • If it is empty, Gitea will try to find a default collation. Try to use utf8mb4_0900_as_cs/uca1400_as_cs for MySQL/MariaDB, and fallback to utf8mb4_bin if the database is too old. Try to use Latin1_General_CS_AS for MSSQL.
    • If it is not empty, for example, end users would really like to customize their collation, Gitea could also accept it, even if it is case-insensitive. Then the old behaviors could be kept (not too much breaking)
  3. Introduce a concept ExpectedCollation: [database].CHARSET_COLLATION if not emtpy, or the one found by Gitea (above).
  4. When Gitea starts, it tries to alter the empty database's collation to the ExpectedCollation (quite fast)
  5. Gitea will check database collations and all column collations, to find "unexpected collations"
    • These "unexpected collations" will be reported as startup error/warning logs and will be shown on admin panel's Self Check page.
  6. If end users see the error/warning:
    1. They could choose to do nothing, just keep using the "unexpected collations". Well, it would cause problems in rare cases, indeed that's the user's choice.
    2. Run gitea doctor convert, then the database and columns will be converted to ExpectedCollation.
      • Only MySQL has such support at the moment.
      • MSSQL users need to do the converting manually.

The admin panel self-check page also partially addresses #27011

image

Then after gitea doctor convert:

image

@wxiaoguang wxiaoguang marked this pull request as ready for review December 31, 2023 12:55
@lunny

This comment was marked as outdated.

@wxiaoguang

This comment was marked as outdated.

@wxiaoguang

This comment was marked as outdated.

@lunny

This comment was marked as outdated.

@pull-request-size pull-request-size bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 1, 2024
@wxiaoguang
Copy link
Contributor Author

wxiaoguang commented Jan 1, 2024

Found another bug for MSSQL testing: Make SQL work with case-sensitivie SQLServer go-testfixtures/testfixtures#182


I think it doesn't need to wait for the go-testfixtures fix. This PR doesn't affect MSSQL at the moment, so the MSSQL fix could be done later separately, the approach is general enough to support it.

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Jan 2, 2024
@lunny
Copy link
Member

lunny commented Jan 10, 2024

One review needed and Last Call @go-gitea/technical-oversight-committee

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Jan 10, 2024
@lunny lunny enabled auto-merge (squash) January 10, 2024 10:38
@lunny lunny added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Jan 10, 2024
@lunny lunny merged commit 2df7563 into go-gitea:main Jan 10, 2024
25 checks passed
@GiteaBot GiteaBot removed the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Jan 10, 2024
@wxiaoguang wxiaoguang deleted the improve-db-collation branch January 10, 2024 11:22
fuxiaohei pushed a commit to fuxiaohei/gitea that referenced this pull request Jan 17, 2024
…-gitea#28662)

Mainly for MySQL/MSSQL.

It is important for Gitea to use case-sensitive database charset
collation. If the database is using a case-insensitive collation, Gitea
will show startup error/warning messages, and show the errors/warnings
on the admin panel's Self-Check page.

Make `gitea doctor convert` work for MySQL to convert the collations of
database & tables & columns.

* Fix go-gitea#28131

## ⚠️ BREAKING ⚠️

It is not quite breaking, but it's highly recommended to convert the
database&table&column to a consistent and case-sensitive collation.
silverwind pushed a commit to silverwind/gitea that referenced this pull request Feb 20, 2024
…-gitea#28662)

Mainly for MySQL/MSSQL.

It is important for Gitea to use case-sensitive database charset
collation. If the database is using a case-insensitive collation, Gitea
will show startup error/warning messages, and show the errors/warnings
on the admin panel's Self-Check page.

Make `gitea doctor convert` work for MySQL to convert the collations of
database & tables & columns.

* Fix go-gitea#28131

## ⚠️ BREAKING ⚠️

It is not quite breaking, but it's highly recommended to convert the
database&table&column to a consistent and case-sensitive collation.
lunny pushed a commit that referenced this pull request Mar 8, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/cli PR changes something on the CLI, i.e. gitea doctor or gitea admin modifies/docs modifies/translation pr/breaking Merging this PR means builds will break. Needs a description what exactly breaks, and how to fix it! size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/refactoring Existing code has been cleaned up. There should be no new functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MySQL/MariaDB: Unique Constraint on Branches doesn't respect casing: UQE_branch_s
4 participants