
setup mysql tables as utf8mb4 and convert them #3516

Closed
philfry wants to merge 2 commits into master from mysql_utf8mb4

Conversation

philfry
Contributor

@philfry philfry commented Feb 15, 2018

fixes #3513
looks like these changes in models.go are sufficient for database creation. This PR also adds a migration module for converting the mysql tables to utf8mb4.

Tested so far:

  1. installed gitea 1.4.0-rc1 using mysql with this patch
  2. show create table issue shows
CREATE TABLE `issue` (
-- ...
`name` varchar(255) DEFAULT NULL,          
`content` mediumtext DEFAULT NULL,
-- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

and

  1. vanilla gitea 1.3.2 installation using mysql
  2. show create table issue shows
CREATE TABLE `issue` (
-- ...
`name` varchar(255) DEFAULT NULL,          
`content` mediumtext DEFAULT NULL,
-- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8
  3. stopped gitea, updated to 1.4.0-rc1 with this patch
  4. log shows all tables were converted
  5. show create table issue shows
CREATE TABLE `issue` (
-- ...
`name` varchar(255) DEFAULT NULL,          
`content` mediumtext DEFAULT NULL,
-- ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

@lafriks
Member

lafriks commented Feb 15, 2018

But for a new instance, tables would still be created as utf8

@tboerger tboerger added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 15, 2018
@philfry
Contributor Author

philfry commented Feb 15, 2018

@lafriks this should not be the case because of the connStr-changes in models/models.go. Have you tested it?
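(For context: the connStr change boils down to requesting utf8mb4 in the go-sql-driver/mysql DSN. A rough sketch with placeholder variables, not the literal models.go code:)

// Rough sketch: the key point is the charset=utf8mb4 parameter in the DSN;
// per this PR, that is enough for newly created tables to end up as utf8mb4.
connStr := fmt.Sprintf("%s:%s@tcp(%s)/%s?charset=utf8mb4",
	user, passwd, host, name) // placeholder variables for the [database] settings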

@lafriks
Member

lafriks commented Feb 15, 2018

Not yet. Sorry, that was more like a question, just without the question mark :)

@codecov-io

codecov-io commented Feb 15, 2018

Codecov Report

Merging #3516 into master will increase coverage by 15.61%.
The diff coverage is 2.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #3516       +/-   ##
==========================================
+ Coverage   20.08%   35.7%   +15.61%     
==========================================
  Files         146     285      +139     
  Lines       29867   40835    +10968     
==========================================
+ Hits         6000   14579     +8579     
- Misses      22961   24094     +1133     
- Partials      906    2162     +1256
Impacted Files Coverage Δ
models/migrations/migrations.go 2.89% <ø> (ø)
models/login_source.go 8.45% <ø> (+7.6%) ⬆️
models/repo_redirect.go 60% <ø> (ø) ⬆️
models/external_login_user.go 23.8% <ø> (+16.66%) ⬆️
models/lfs.go 28.26% <ø> (+28.26%) ⬆️
models/repo.go 42.8% <ø> (+24.9%) ⬆️
models/user.go 39.56% <ø> (+15.98%) ⬆️
models/notification.go 74.57% <ø> (+6.77%) ⬆️
models/issue_reaction.go 89.86% <ø> (ø) ⬆️
models/user_openid.go 28.98% <ø> (ø) ⬆️
... and 256 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ceb92f...278fd66. Read the comment docs.

@lafriks lafriks added the type/enhancement An improvement of existing functionality label Feb 15, 2018
@lafriks lafriks added this to the 1.5.0 milestone Feb 15, 2018
@lunny
Member

lunny commented Feb 16, 2018

So you have to check the connection string and enforce utf8mb4

@thehowl
Contributor

thehowl commented Feb 16, 2018

Tried to create the db from scratch locally, and I got this error:

[I] [SQL] CREATE UNIQUE INDEX `UQE_user_lower_name` ON `user` (`lower_name`)
[...itea/routers/init.go:60 GlobalInit()] [E] Failed to initialize ORM engine: sync database struct error: Error 1071: Specified key was too long; max key length is 767 bytes

MariaDB 10.1.29.

It looks like the length of the VARCHAR fields will need to be reduced. Or maybe there's another solution, idk 🤷‍♂️

@philfry
Contributor Author

philfry commented Feb 16, 2018

@thehowl strange. It works fine for me using MariaDB 10.2.13, see my xorm log file. I installed gitea 1.4.0+rc1 with my PR from scratch.
I'll try to reproduce your issue with MariaDB 10.1 on Monday.

@philfry
Contributor Author

philfry commented Feb 19, 2018

I can reproduce it with MariaDB < 10.2 and MySQL < 5.7, as their InnoDB index key prefix can only be up to 767 bytes long [1]. InnoDB versions >= 5.7 (MariaDB >= 10.2, MySQL >= 5.7) handle up to 3072 bytes by default. To retain compatibility, the indexed fields must be at most 191 characters long (floor(767/4) = 191, since utf8mb4 uses up to 4 bytes per character). These are:

  • notification.commit_id varchar(255)
  • reaction.type varchar(255)
  • release.tag_name varchar(255)
  • repository.lower_name varchar(255)
  • repository.name varchar(255)

The hardcoded varchar(255) default [2] is something we shouldn't change imho, because there might already be repositories in the wild with names or tags longer than 191 chars.

What's your opinion?

[1] There's a workaround by using large prefixes:

set global innodb_file_format = Barracuda;
set global innodb_file_per_table = on;
set global innodb_large_prefix = 1;
alter table `foo` ROW_FORMAT = DYNAMIC; -- COMPRESSED also works

[2] vendor/github.com/go-xorm/core/type.go:264: st = SQLType{Varchar, 255, 0}

@lunny
Member

lunny commented Feb 19, 2018

So we have to use xorm tags to define the length. For example, xorm:"VARCHAR(64)".
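For illustration, a hedged sketch (the struct is made up, not from this PR) of how such a tag caps the generated column so a utf8mb4 index fits the 767-byte key prefix of InnoDB 5.6:

// Hypothetical model: the xorm tag overrides the varchar(255) default,
// and 191 chars * 4 bytes = 764 bytes <= 767 bytes.
type Bookmark struct {
	ID   int64  `xorm:"pk autoincr"`
	Name string `xorm:"VARCHAR(191) UNIQUE NOT NULL"`
}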

@philfry
Contributor Author

philfry commented Feb 19, 2018

Ok, I changed every indexed column I could find to VARCHAR(191) and the database was correctly initialized with MariaDB 10.1 (InnoDB 5.6). Still unsure whether or not we should convert some cols with fixed lengths to CHAR(xxx).
Still to do: write migrations for these changes.

log.Info("%s: converting table to utf8mb4", table.Name)
if _, err := x.Exec("alter table `" + table.Name + "` convert to character set utf8mb4"); err != nil {
return fmt.Errorf("conversation of %s failed: %v", table, err)
}
Contributor

Hmm. Honestly I'd prefer if this effectively handled the case of some rows being >191 chars. I would suggest:

  1. adding at the beginning of the for loop a check to see if any data would be lost (select 1 from tbl where char_length(field1) > 191 [ or char_length(field2) > 191 ]); a sketch of such a check follows below.
  2. If so, abort the migration for that table and tell the user how to manually update by logging the needed statements.
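A minimal sketch of that pre-flight check, assuming it sits inside the migration's per-table loop (table and column names are only illustrative):

// Hedged sketch, not PR code: skip the automatic conversion for this table
// if any existing value would no longer fit into varchar(191).
rows, err := x.Query("SELECT 1 FROM `release` WHERE CHAR_LENGTH(`tag_name`) > 191 LIMIT 1")
if err != nil {
	return err
}
if len(rows) > 0 {
	log.Warn("release.tag_name holds values longer than 191 chars, please convert manually")
	continue
}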

}
}
default:
log.Info("Nothing to do")
Contributor

@thehowl thehowl Feb 19, 2018

at the top

if !setting.UseMySQL {
log.Info("Nothing to do")
return nil
}

and remove switch?

Contributor Author

great idea! I'll provide the requested changes tomorrow.

@philfry
Contributor Author

philfry commented Feb 20, 2018

It's funny that I'm worried about reducing field sizes when the GUI doesn't even allow user names (35 chars) or full names (100 chars) that long.
Anyways.
MySQL won't reduce a field's size if that could cause data loss; it throws an error instead (e.g. error 1406 or 1265). So there's no need to check char_length.
I think gitea should handle most of the table conversions where possible. Human interaction should only be needed for unsolvable problems like cutting down fields that are > 191 bytes long. So the new migration script tries to reduce the indexed field lengths and warns if that fails, then it tries to convert all tables to utf8mb4 and also warns in case of failure. At the end it checks whether any problem occurred and, if so, throws an error, preventing gitea from starting up.
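Roughly, the flow just described looks like this (a hedged sketch, not the exact diff; shrinkStmt stands in for the generated alter-table statements):

// Sketch only: try to shrink indexed varchar columns, convert every table
// to utf8mb4, warn on individual failures, and fail the whole migration at
// the end if anything went wrong.
success := true
for _, table := range tables {
	if _, err := x.Exec(shrinkStmt); err != nil { // shrinkStmt is a placeholder
		log.Warn("cannot reduce indexed columns of %s: %v", table.Name, err)
		success = false
	}
	if _, err := x.Exec("alter table `" + table.Name + "` convert to character set utf8mb4"); err != nil {
		log.Warn("cannot convert %s to utf8mb4: %v", table.Name, err)
		success = false
	}
}
if !success {
	return fmt.Errorf("utf8mb4 conversion incomplete, see warnings above")
}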

Contributor

@thehowl thehowl left a comment

Mostly style comments, the logic is sound.

}
}
}
}
Contributor

This level of indentation makes me a bit dizzy 😵

col := table.GetColumn(ix)
if col == nil {
  continue
}
if col.SQLType.Name != "VARCHAR" || col.Length <= maxvc {
  continue
}

and so on would be better I think. (Except for the error handling at the end. This is sort of my style of writing Go code, so feel free to trash my opinion if you feel I'm wrong, but my general approach is to a) handle at the top indentation level the case we are looking for, like varchar indexes, and b) handle in if branches the cases where something differs from that case. Errors are often still unexpected and thus deserve to be placed in an if branch.)

Contributor Author

you're right. Such indentations are the result of grown code. Sorry for the eye cancer.

}

const maxvc = 191
var migration_success = true
Contributor

Go's naming convention inside of functions is to use camelCase. Also, := true instead of var?

return fmt.Errorf("cannot get tables: %v", err)
}
for _, table := range tables {
var ready_for_conversion = true
Contributor

Same as the previous style comment

@philfry philfry force-pushed the mysql_utf8mb4 branch 2 times, most recently from 4759e31 to 8e7e29b on February 20, 2018 16:42
continue
}
log.Info("reducing column %s.%s from %d to %d bytes", table.Name, ix, col.Length, maxvc)
var sqlstmt = fmt.Sprintf("alter table `%s` change column `%s` `%s` varchar(%d)", table.Name, ix, ix, maxvc)
Contributor

Oops, something that slipped through the cracks: this one should be := as well.

Contributor

@thehowl thehowl left a comment

Okay, I tried running it on my own instance. Creating DB from scratch works fine, the migration? Not so much.

@@ -0,0 +1,68 @@
// Copyright 2017 The Gitea Authors. All rights reserved.
Contributor

oh also it's 2018 :P

continue
}
log.Info("reducing column %s.%s from %d to %d bytes", table.Name, ix, col.Length, maxvc)
sqlstmt := fmt.Sprintf("alter table `%s` change column `%s` `%s` varchar(%d)", table.Name, ix, ix, maxvc)
Contributor

Here lies the assumption that column name == index name, which is incorrect. (Took me about 20 minutes to track this down...) Since we already have the column anyway, use col.Name.

But alas, the issue is deeper. GetColumn only returns the first column of the index, and an index may span more than one column. So it's actually better to iterate over table.Columns() instead, check whether each column is part of any index (len(col.Indexes) > 0), and if it is, proceed to reducing the column size.
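Something along these lines (a rough sketch of the suggested loop, not the final code):

// Sketch only: walk every column of the table and shrink any indexed
// VARCHAR longer than maxvc, instead of resolving columns via index names.
for _, col := range table.Columns() {
	if len(col.Indexes) == 0 {
		continue // column is not part of any index
	}
	if col.SQLType.Name != "VARCHAR" || col.Length <= maxvc {
		continue
	}
	log.Info("reducing column %s.%s from %d to %d chars", table.Name, col.Name, col.Length, maxvc)
	sqlstmt := fmt.Sprintf("alter table `%s` change column `%s` `%s` varchar(%d)",
		table.Name, col.Name, col.Name, maxvc)
	if _, err := x.Exec(sqlstmt); err != nil {
		log.Warn("%s: %v", sqlstmt, err)
	}
}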

Contributor Author

You're right, thanks. Also we not only need to check for indexes but also for primary keys as they might need to be cut down, too.

@lafriks
Member

lafriks commented Feb 21, 2018

Somehow I don't like that indexed text columns will now be limited to 191 characters for all databases (for new instances). While currently this is not a problem, and even an openid URI would probably fit in 191 characters, in the future, just to support older MySQL versions, no indexed text column in any database could be created with more than 191 characters.
I think this would probably be better done in xorm: automatically check the mysql version/engine and decrease the default max length when creating a table or adding a new column. What do you think @lunny ?

@lunny
Member

lunny commented Feb 21, 2018

@lafriks Good idea for xorm to do that smartly. Currently it always uses varchar(255) if you haven't specified an xorm tag.

@thehowl
Contributor

thehowl commented Feb 21, 2018

Somehow I don't like that indexed text columns will now be limited to 191 characters for all databases (for new instances). While currently this is not a problem, and even an openid URI would probably fit in 191 characters, in the future, just to support older MySQL versions, no indexed text column in any database could be created with more than 191 characters.

I don't think we need >191 char indexes. If anything, we might consider hashing the text to get e.g. a 128 char long field, but I think allowing more than 191 chars is bad.
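For example (a hypothetical helper, not part of this PR), a hex-encoded SHA-512 is always 128 characters and could be indexed in place of the raw text:

// Hypothetical sketch: index a fixed-length digest instead of the raw value.
package util

import (
	"crypto/sha512"
	"encoding/hex"
)

// IndexKey maps a value of arbitrary length to a 128-character string,
// which fits comfortably into an indexed VARCHAR(191) column.
func IndexKey(value string) string {
	sum := sha512.Sum512([]byte(value))
	return hex.EncodeToString(sum[:])
}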

@tboerger tboerger added the lgtm/need 1 label and removed the lgtm/need 2 label Feb 22, 2018
Member

@strk strk left a comment

I don't understand why a limit is being added to structures ("VARCHAR(191)") - also I'd feel much better if you added a test for the fix, in one of the existing integration tests

@@ -26,7 +26,7 @@ const (
type ProtectedBranch struct {
ID int64 `xorm:"pk autoincr"`
RepoID int64 `xorm:"UNIQUE(s)"`
BranchName string `xorm:"UNIQUE(s)"`
BranchName string `xorm:"VARCHAR(191) UNIQUE(s)"`
Member

why is a limit being added here ?

Name string `xorm:"UNIQUE(s) NOT NULL"`
Commit string `xorm:"UNIQUE(s) NOT NULL"`
Name string `xorm:"VARCHAR(191) UNIQUE(s) NOT NULL"`
Commit string `xorm:"VARCHAR(191) UNIQUE(s) NOT NULL"`
Member

why is a limit being added here ?

@@ -8,7 +8,7 @@ import "github.com/markbates/goth"

// ExternalLoginUser makes the connecting between some existing user and additional external login sources
type ExternalLoginUser struct {
ExternalID string `xorm:"pk NOT NULL"`
ExternalID string `xorm:"VARCHAR(191) pk NOT NULL"`
Member

why is a limit being added here ?

@@ -18,7 +18,7 @@ import (
// Reaction represents a reactions on issues and comments.
type Reaction struct {
ID int64 `xorm:"pk autoincr"`
Type string `xorm:"INDEX UNIQUE(s) NOT NULL"`
Type string `xorm:"VARCHAR(191) INDEX UNIQUE(s) NOT NULL"`
Member

why is a limit being added here ?

Contributor Author

@strk please take a look at #3516 (comment) where I explained where that limit comes from. In short, it's for compatibility with MariaDB/MySQL versions that run the InnoDB 5.6 engine, as index key prefixes there cannot be longer than 767 bytes (191 chars x 4 bytes for utf8mb4 = 764 bytes).

@lunny
Member

lunny commented Mar 7, 2018

@philfry I also think we should give the columns meaningful lengths, rather than making them all 191.

@philfry
Contributor Author

philfry commented Mar 9, 2018

@lunny about reasonable lengths… let's take the branch name as an example:

$ git init ; touch foo ; git add foo ; git commit -am "."
$ for i in {255..128}; do git checkout -b $(perl -e "print 'a'x$i") >& /dev/null && { echo $i; break; }; done
250

afaik github allows branch names of up to 255 chars; locally, on ext4, I'm apparently limited to 250 chars. That's still more than 191.

afaict we have multiple possibilities:

  1. ignore legacy innodb indexes and stay with a length of 255 bytes
  • + easy migration, we only need alter table .. convert to ..
  • + very little impact in gitea's source code
  • + not limiting git
  • - lose backwards compatibility
  2. let xorm decide about the column length depending on the innodb version
  • + flexible
  • - artificially limiting git when using a legacy innodb
  3. cut down all indexed columns to 191 bytes
  • + backwards compatible
  • - artificially limiting git
  4. as we really don't need utf8mb4 for e.g. a branch name, set explicit character sets on those columns, like:
create table `branch` (
  `name` varchar(255) character set latin1 not null,
  `bar` varchar(255) character set latin1 not null,
  primary key (`name`), key barkey (`bar`)
) default charset=utf8mb4;
  • + not limiting git
  • + compatible with either innodb version
  • - migration nearly impossible (afaik we cannot exclude columns from alter table .. convert to ..)
  • - mixed charsets are ugly as hell

¯(°_o)/¯

@techknowlogick
Member

For no. 4 instead of latin1 you can use utf8 for the character sets of the specific columns, as long as it isn't utf8mb4. (although no. 4 is my least preferred option)

My vote is for no. 3 (due to being backwards compatible, and also who needs > 191 chars for a branch name). Although I'd be fine with the other options (as long as it isn't no. 4).

@lunny
Member

lunny commented Mar 10, 2018

@philfry Since only some columns need utf8mb4, maybe we could find out which columns need it and tag them with a specific character set?
As far as I know, only the title and content columns on the issue and comment tables need utf8mb4.

…ar(255), which is the default, to varchar(191) in order to deal with utf8mb4/innodb 5.6
@philfry
Contributor Author

philfry commented May 23, 2018

It's too complicated to implement the charset-thingy the other way round because the default charset (utf8mb4) is and has to be defined within the connector. Maybe xorm can help out by

  • checking the innodb version
  • using appropriate column sizes
  • using the right charset

Let's close this PR as "wontfix". Whoever is interested in using 4-byte-chars in gitea and is running mysql/mariadb with at least innodb 5.7: change the connstr in models/models.go and do a manual conversion of all tables.

@lafriks lafriks modified the milestones: 1.5.0, 1.x.x May 23, 2018
@stale

stale bot commented Jan 5, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label Jan 5, 2019
@stale

stale bot commented Jan 19, 2019

This pull request has been automatically closed because of inactivity. You can re-open it if needed.

@stale stale bot closed this Jan 19, 2019
@lafriks lafriks removed this from the 1.x.x milestone Jan 20, 2019
@Trollwut

change the connstr in models/models.go and do a manual conversion of all tables.

But this is not a nice solution when using the Docker image. Can we at least get something like a docker-compose environment where we can set the desired connection charset?

@lunny lunny removed the issue/stale label Jan 31, 2019
@echodreamz

So is UTF8MB4 support dead for Gitea?

@lafriks lafriks reopened this Mar 24, 2019
@lafriks
Member

lafriks commented Mar 24, 2019

Closing currently as probably this problem should be fixed a bit otherwise. Please reopen or submit other pr

@lafriks lafriks closed this Mar 24, 2019
@Trollwut

Closing currently as probably this problem should be fixed a bit otherwise. Please reopen or submit other pr

I'm not a native speaker, so could you please clarify that for me? Does it mean this is a low-priority problem that won't be fixed soon, or does it mean the problem may be solved via another fix that's coming?

Thanks for your time!

@lafriks
Member

lafriks commented Mar 26, 2019

I meant that we need to find a better solution, at least review the columns for more sane lengths.

@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Labels
lgtm/need 1 This PR needs approval from one additional maintainer to be merged.
type/enhancement An improvement of existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 byte unicode returns a 500 w/mysql
10 participants