
setup mysql tables as utf8mb4 and convert them #3516

Closed
philfry wants to merge 2 commits into master from mysql_utf8mb4

Conversation

philfry
Contributor

@philfry philfry commented Feb 15, 2018

fixes #3513
looks like these changes in models.go are sufficient for database creation. This PR also adds a migration module for converting the mysql tables to utf8mb4.

Tested so far:

  1. installed gitea 1.4.0-rc1 using mysql with this patch
  2. show create table issue shows
CREATE TABLE `issue` (
-- ...
`name` varchar(255) DEFAULT NULL,          
`content` mediumtext DEFAULT NULL,
-- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

and

  1. vanilla gitea 1.3.2 installation using mysql
  2. show create table issue shows
CREATE TABLE `issue` (
-- ...
`name` varchar(255) DEFAULT NULL,          
`content` mediumtext DEFAULT NULL,
-- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  3. stopped gitea, updated to 1.4.0-rc1 with this patch
  4. log shows all tables were converted
  5. show create table issue shows
CREATE TABLE `issue` (
-- ...
`name` varchar(255) DEFAULT NULL,          
`content` mediumtext DEFAULT NULL,
-- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

@lafriks
Member

lafriks commented Feb 15, 2018

But for a new instance, tables would still be created as utf8

@tboerger tboerger added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 15, 2018
@philfry
Contributor Author

philfry commented Feb 15, 2018

@lafriks this should not be the case because of the connStr-changes in models/models.go. Have you tested it?
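(For context: the connStr change boils down to requesting utf8mb4 in the go-sql-driver/mysql DSN. A rough sketch with placeholder variables, not the literal models.go code:)

// Rough sketch: the key point is the charset=utf8mb4 parameter in the DSN;
// per this PR, that is enough for newly created tables to end up as utf8mb4.
connStr := fmt.Sprintf("%s:%s@tcp(%s)/%s?charset=utf8mb4",
	user, passwd, host, name) // placeholder variables for the [database] settings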

@lafriks
Member

lafriks commented Feb 15, 2018

Not yet. Sorry, that was more like a question, just without the question mark :)

@codecov-io

codecov-io commented Feb 15, 2018

Codecov Report

Merging #3516 into master will increase coverage by 15.61%.
The diff coverage is 2.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #3516       +/-   ##
==========================================
+ Coverage   20.08%   35.7%   +15.61%     
==========================================
  Files         146     285      +139     
  Lines       29867   40835    +10968     
==========================================
+ Hits         6000   14579     +8579     
- Misses      22961   24094     +1133     
- Partials      906    2162     +1256
Impacted Files Coverage Δ
models/migrations/migrations.go 2.89% <ø> (ø)
models/login_source.go 8.45% <ø> (+7.6%) ⬆️
models/repo_redirect.go 60% <ø> (ø) ⬆️
models/external_login_user.go 23.8% <ø> (+16.66%) ⬆️
models/lfs.go 28.26% <ø> (+28.26%) ⬆️
models/repo.go 42.8% <ø> (+24.9%) ⬆️
models/user.go 39.56% <ø> (+15.98%) ⬆️
models/notification.go 74.57% <ø> (+6.77%) ⬆️
models/issue_reaction.go 89.86% <ø> (ø) ⬆️
models/user_openid.go 28.98% <ø> (ø) ⬆️
... and 256 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ceb92f...278fd66. Read the comment docs.

@lafriks lafriks added the type/enhancement An improvement of existing functionality label Feb 15, 2018
@lafriks lafriks added this to the 1.5.0 milestone Feb 15, 2018
@lunny
Member

lunny commented Feb 16, 2018

So you have to check the connection string and enforce utf8mb4

@thehowl
Contributor

thehowl commented Feb 16, 2018

Tried to create the db from scratch locally, and I got this error:

[I] [SQL] CREATE UNIQUE INDEX `UQE_user_lower_name` ON `user` (`lower_name`)
[...itea/routers/init.go:60 GlobalInit()] [E] Failed to initialize ORM engine: sync database struct error: Error 1071: Specified key was too long; max key length is 767 bytes

MariaDB 10.1.29.

It looks like the length of the VARCHAR fields will need to be reduced. Or maybe there's another solution, idk 🤷‍♂️

@philfry
Contributor Author

philfry commented Feb 16, 2018

@thehowl strange. It works fine for me using MariaDB 10.2.13, see my xorm log file. I installed gitea 1.4.0+rc1 with my PR from scratch.
I'll try to reproduce your issue with MariaDB 10.1 on Monday.

@philfry
Contributor Author

philfry commented Feb 19, 2018

I can reproduce it with MariaDB < 10.2 and MySQL < 5.7, as their InnoDB index key prefix can only be up to 767 bytes long [1]. InnoDB versions >= 5.7 (MariaDB >= 10.2, MySQL >= 5.7) handle up to 3072 bytes by default. To retain compatibility, the indexed fields must be at most 191 characters long (floor(767/4) = 191, since utf8mb4 uses up to 4 bytes per character). These are:

  • notification.commit_id varchar(255)
  • reaction.type varchar(255)
  • release.tag_name varchar(255)
  • repository.lower_name varchar(255)
  • repository.name varchar(255)

The hardcoded varchar(255) default [2] is something we shouldn't change imho, because there might already be repositories in the wild with names or tags longer than 191 chars.

What's your opinion?

[1] There's a workaround by using large prefixes:

set global innodb_file_format = Barracuda;
set global innodb_file_per_table = on;
set global innodb_large_prefix = 1;
alter table `foo` ROW_FORMAT = DYNAMIC; -- COMPRESSED also works

[2] vendor/github.com/go-xorm/core/type.go:264: st = SQLType{Varchar, 255, 0}

@lunny
Member

lunny commented Feb 19, 2018

So we have to use xorm tags to define the length. For example, xorm:"VARCHAR(64)".
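For illustration, a hedged sketch (the struct is made up, not from this PR) of how such a tag caps the generated column so a utf8mb4 index fits the 767-byte key prefix of InnoDB 5.6:

// Hypothetical model: the xorm tag overrides the varchar(255) default,
// and 191 chars * 4 bytes = 764 bytes <= 767 bytes.
type Bookmark struct {
	ID   int64  `xorm:"pk autoincr"`
	Name string `xorm:"VARCHAR(191) UNIQUE NOT NULL"`
}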

@philfry
Contributor Author

philfry commented Feb 19, 2018

Ok, I changed every indexed column I could find to VARCHAR(191) and the database was correctly initialized with MariaDB 10.1 (InnoDB 5.6). Still unsure whether or not we should convert some cols with fixed lengths to CHAR(xxx).
Still to do: write migrations for these changes.

log.Info("%s: converting table to utf8mb4", table.Name)
if _, err := x.Exec("alter table `" + table.Name + "` convert to character set utf8mb4"); err != nil {
return fmt.Errorf("conversation of %s failed: %v", table, err)
}
Contributor

Hmm. Honestly I'd prefer if this effectively handled the case of some rows being >191 chars. I would suggest:

  1. adding at the beginning of the for loop a check to see if any data would be lost (select 1 from tbl where char_length(field1) > 191 [ or char_length(field2) > 191 ]); a sketch of such a check follows below.
  2. If so, abort the migration for that table and tell the user how to manually update by logging the needed statements.
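A minimal sketch of that pre-flight check, assuming it sits inside the migration's per-table loop (table and column names are only illustrative):

// Hedged sketch, not PR code: skip the automatic conversion for this table
// if any existing value would no longer fit into varchar(191).
rows, err := x.Query("SELECT 1 FROM `release` WHERE CHAR_LENGTH(`tag_name`) > 191 LIMIT 1")
if err != nil {
	return err
}
if len(rows) > 0 {
	log.Warn("release.tag_name holds values longer than 191 chars, please convert manually")
	continue
}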

}
}
default:
log.Info("Nothing to do")
Contributor

@thehowl thehowl Feb 19, 2018

at the top

if !setting.UseMySQL {
log.Info("Nothing to do")
return nil
}

and remove switch?

Contributor Author

great idea! I'll provide the requested changes tomorrow.

@philfry
Contributor Author

philfry commented Feb 20, 2018

It's funny that I'm worried about reducing field sizes when the GUI doesn't even allow user names (35 chars) or full names (100 chars) that long.
Anyways.
MySQL won't reduce a field's size if that could cause data loss; it throws an error instead (e.g. error 1406 or 1265). So there's no need to check char_length.
I think gitea should handle most of the table conversions where possible. Human interaction should only be needed for unsolvable problems like cutting down fields that are > 191 bytes long. So the new migration script tries to reduce the indexed field lengths and warns if that fails, then it tries to convert all tables to utf8mb4 and also warns in case of failure. At the end it checks whether any problem occurred and, if so, throws an error, preventing gitea from starting up.
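Roughly, the flow just described looks like this (a hedged sketch, not the exact diff; shrinkStmt stands in for the generated alter-table statements):

// Sketch only: try to shrink indexed varchar columns, convert every table
// to utf8mb4, warn on individual failures, and fail the whole migration at
// the end if anything went wrong.
success := true
for _, table := range tables {
	if _, err := x.Exec(shrinkStmt); err != nil { // shrinkStmt is a placeholder
		log.Warn("cannot reduce indexed columns of %s: %v", table.Name, err)
		success = false
	}
	if _, err := x.Exec("alter table `" + table.Name + "` convert to character set utf8mb4"); err != nil {
		log.Warn("cannot convert %s to utf8mb4: %v", table.Name, err)
		success = false
	}
}
if !success {
	return fmt.Errorf("utf8mb4 conversion incomplete, see warnings above")
}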

Contributor

@thehowl thehowl left a comment

Mostly style comments, the logic is sound.

}
}
}
}
Contributor

This level of indentation makes me a bit dizzy 😵

col := table.GetColumn(ix)
if col == nil {
  continue
}
if col.SQLType.Name != "VARCHAR" || col.Length <= maxvc {
  continue
}

and so on would be better I think. (Except for the error handling at the end. This is sort of my style of writing Go code, so feel free to trash my opinion if you feel I'm wrong, but my general approach is to a) handle at the top indentation level the case we are looking for, like varchar indexes, and b) handle in if branches the cases where something differs from that case. Errors are often still unexpected and thus deserve to be placed in an if branch.)

Contributor Author

you're right. Such indentations are the result of grown code. Sorry for the eye cancer.

}

const maxvc = 191
var migration_success = true
Contributor

Go's naming convention inside of functions is to use camelCase. Also, := true instead of var?

return fmt.Errorf("cannot get tables: %v", err)
}
for _, table := range tables {
var ready_for_conversion = true
Contributor

Same as the previous style comment

@philfry philfry force-pushed the mysql_utf8mb4 branch 2 times, most recently from 4759e31 to 8e7e29b on February 20, 2018 16:42
continue
}
log.Info("reducing column %s.%s from %d to %d bytes", table.Name, ix, col.Length, maxvc)
var sqlstmt = fmt.Sprintf("alter table `%s` change column `%s` `%s` varchar(%d)", table.Name, ix, ix, maxvc)
Contributor

Oops, something that slipped through the cracks: this one should be := as well.

Contributor

@thehowl thehowl left a comment

Okay, I tried running it on my own instance. Creating DB from scratch works fine, the migration? Not so much.

@@ -0,0 +1,68 @@
// Copyright 2017 The Gitea Authors. All rights reserved.
Contributor

oh also it's 2018 :P

continue
}
log.Info("reducing column %s.%s from %d to %d bytes", table.Name, ix, col.Length, maxvc)
sqlstmt := fmt.Sprintf("alter table `%s` change column `%s` `%s` varchar(%d)", table.Name, ix, ix, maxvc)
Contributor

Here lies the assumption that column name == index name, which is incorrect. (Took me about 20 minutes to track this down...) Since we already have the column anyway, use col.Name.

But alas, the issue is deeper. GetColumn only returns the first column of the index, and an index may span more than one column. So it's actually better to iterate over table.Columns() instead, check whether each column is part of any index (len(col.Indexes) > 0), and if it is, proceed to reducing the column size.
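Something along these lines (a rough sketch of the suggested loop, not the final code):

// Sketch only: walk every column of the table and shrink any indexed
// VARCHAR longer than maxvc, instead of resolving columns via index names.
for _, col := range table.Columns() {
	if len(col.Indexes) == 0 {
		continue // column is not part of any index
	}
	if col.SQLType.Name != "VARCHAR" || col.Length <= maxvc {
		continue
	}
	log.Info("reducing column %s.%s from %d to %d chars", table.Name, col.Name, col.Length, maxvc)
	sqlstmt := fmt.Sprintf("alter table `%s` change column `%s` `%s` varchar(%d)",
		table.Name, col.Name, col.Name, maxvc)
	if _, err := x.Exec(sqlstmt); err != nil {
		log.Warn("%s: %v", sqlstmt, err)
	}
}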

Contributor Author

You're right, thanks. Also we not only need to check for indexes but also for primary keys as they might need to be cut down, too.

@lafriks
Member

lafriks commented Feb 21, 2018

Somehow I don't like that indexed text columns will now be limited to 191 characters for all databases (for new instances). While currently this is not a problem, and even an openid URI would probably fit in 191 characters, in the future, just to support older MySQL versions, no indexed text column in any database could be created with more than 191 characters.
I think this would probably be better done in xorm: automatically check the mysql version/engine and decrease the default max length when creating a table or adding a new column. What do you think @lunny ?

@lunny
Member

lunny commented Feb 21, 2018

@lafriks Good idea for xorm to do that smartly. Currently it always uses varchar(255) if you haven't specified an xorm tag.

@thehowl
Contributor

thehowl commented Feb 21, 2018

Somehow I don't like that indexed text columns will now be limited to 191 characters for all databases (for new instances). While currently this is not a problem, and even an openid URI would probably fit in 191 characters, in the future, just to support older MySQL versions, no indexed text column in any database could be created with more than 191 characters.

I don't think we need >191 char indexes. If anything, we might consider hashing the text to get e.g. a 128 char long field, but I think allowing more than 191 chars is bad.
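For example (a hypothetical helper, not part of this PR), a hex-encoded SHA-512 is always 128 characters and could be indexed in place of the raw text:

// Hypothetical sketch: index a fixed-length digest instead of the raw value.
package util

import (
	"crypto/sha512"
	"encoding/hex"
)

// IndexKey maps a value of arbitrary length to a 128-character string,
// which fits comfortably into an indexed VARCHAR(191) column.
func IndexKey(value string) string {
	sum := sha512.Sum512([]byte(value))
	return hex.EncodeToString(sum[:])
}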

@tboerger tboerger added the lgtm/need 1 label and removed the lgtm/need 2 label Feb 22, 2018
Member

@strk strk left a comment

I don't understand why a limit is being added to structures ("VARCHAR(191)") - also I'd feel much better if you added a test for the fix, in one of the existing integration tests

@@ -26,7 +26,7 @@ const (
type ProtectedBranch struct {
ID int64 `xorm:"pk autoincr"`
RepoID int64 `xorm:"UNIQUE(s)"`
BranchName string `xorm:"UNIQUE(s)"`
BranchName string `xorm:"VARCHAR(191) UNIQUE(s)"`
Member

why is a limit being added here ?

Name string `xorm:"UNIQUE(s) NOT NULL"`
Commit string `xorm:"UNIQUE(s) NOT NULL"`
Name string `xorm:"VARCHAR(191) UNIQUE(s) NOT NULL"`
Commit string `xorm:"VARCHAR(191) UNIQUE(s) NOT NULL"`
Member

why is a limit being added here ?

@@ -8,7 +8,7 @@ import "github.com/markbates/goth"

// ExternalLoginUser makes the connecting between some existing user and additional external login sources
type ExternalLoginUser struct {
ExternalID string `xorm:"pk NOT NULL"`
ExternalID string `xorm:"VARCHAR(191) pk NOT NULL"`
Member

why is a limit being added here ?

@@ -18,7 +18,7 @@ import (
// Reaction represents a reactions on issues and comments.
type Reaction struct {
ID int64 `xorm:"pk autoincr"`
Type string `xorm:"INDEX UNIQUE(s) NOT NULL"`
Type string `xorm:"VARCHAR(191) INDEX UNIQUE(s) NOT NULL"`
Member

why is a limit being added here ?

Contributor Author

@strk please take a look at #3516 (comment) where I explained where that limit comes from. In short, it's for compatibility with MariaDB/MySQL versions that run the InnoDB 5.6 engine, as index key prefixes there cannot be longer than 767 bytes (191 chars x 4 bytes for utf8mb4 = 764 bytes).

@lunny
Member

lunny commented Mar 7, 2018

@philfry I also think we should give the columns meaningful lengths, rather than making them all 191.

@philfry
Contributor Author

philfry commented Mar 9, 2018

@lunny about reasonable lengths… let's take the branch name as an example:

$ git init ; touch foo ; git add foo ; git commit -am "."
$ for i in {255..128}; do git checkout -b $(perl -e "print 'a'x$i") >& /dev/null && { echo $i; break; }; done
250

afaik github allows branch names of up to 255 chars; locally, on ext4, I'm apparently limited to 250 chars. That's still more than 191.

afaict we have multiple possibilities:

  1. ignore legacy innodb indexes and stay with a length of 255 bytes
  • + easy migration, we only need alter table .. convert to ..
  • + very little impact in gitea's source code
  • + not limiting git
  • - lose backwards compatibility
  2. let xorm decide about the column length depending on the innodb version
  • + flexible
  • - artificially limiting git when using a legacy innodb
  3. cut down all indexed columns to 191 bytes
  • + backwards compatible
  • - artificially limiting git
  4. as we really don't need utf8mb4 for e.g. a branch name, set explicit character sets on those columns, like:
create table `branch` (
  `name` varchar(255) character set latin1 not null,
  `bar` varchar(255) character set latin1 not null,
  primary key (`name`), key barkey (`bar`)
) default charset=utf8mb4;
  • + not limiting git
  • + compatible with either innodb version
  • - migration nearly impossible (afaik we cannot exclude columns from alter table .. convert to ..)
  • - mixed charsets are ugly as hell

¯(°_o)/¯

@techknowlogick
Member

For no. 4 instead of latin1 you can use utf8 for the character sets of the specific columns, as long as it isn't utf8mb4. (although no. 4 is my least preferred option)

My vote is for no. 3 (due to being backwards compatible, and also who needs > 191 chars for a branch name). Although I'd be fine with the other options (as long as it isn't no. 4).

@lunny
Member

lunny commented Mar 10, 2018

@philfry Since only some columns need utf8mb4, maybe we could find out which columns need it and tag them with a specific character set?
As far as I know, only the title and content columns on the issue and comment tables need utf8mb4.

…ar(255), which is the default, to varchar(191) in order to deal with utf8mb4/innodb 5.6
@philfry
Contributor Author

philfry commented May 23, 2018

It's too complicated to implement the charset-thingy the other way round because the default charset (utf8mb4) is and has to be defined within the connector. Maybe xorm can help out by

  • checking the innodb version
  • using appropriate column sizes
  • using the right charset

Let's close this PR as "wontfix". Whoever is interested in using 4-byte-chars in gitea and is running mysql/mariadb with at least innodb 5.7: change the connstr in models/models.go and do a manual conversion of all tables.

@lafriks lafriks modified the milestones: 1.5.0, 1.x.x May 23, 2018
@stale

stale bot commented Jan 5, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label Jan 5, 2019
@stale

stale bot commented Jan 19, 2019

This pull request has been automatically closed because of inactivity. You can re-open it if needed.

@stale stale bot closed this Jan 19, 2019
@lafriks lafriks removed this from the 1.x.x milestone Jan 20, 2019
@Trollwut

change the connstr in models/models.go and do a manual conversion of all tables.

But this is not a nice solution when using the Docker image. Can we at least get something like a docker-compose environment where we can set the desired connection charset?

@lunny lunny removed the issue/stale label Jan 31, 2019
@echodreamz

So is UTF8MB4 support dead for Gitea?

@lafriks lafriks reopened this Mar 24, 2019
@lafriks
Member

lafriks commented Mar 24, 2019

Closing currently as probably this problem should be fixed a bit otherwise. Please reopen or submit other pr

@lafriks lafriks closed this Mar 24, 2019
@Trollwut

Closing currently as probably this problem should be fixed a bit otherwise. Please reopen or submit other pr

I'm not a native speaker, so could you please clarify that for me? Does it mean this is a low-priority problem that won't be fixed soon, or does it mean the problem may be solved via another fix that's coming?

Thanks for your time!

@lafriks
Member

lafriks commented Mar 26, 2019

I meant that we need to find a better solution, at least review the columns for more sane lengths.

@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Labels
lgtm/need 1 This PR needs approval from one additional maintainer to be merged.
type/enhancement An improvement of existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 byte unicode returns a 500 w/mysql
10 participants