Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arabic text shows as question marks (?) #12094

Closed
1 of 7 tasks
zuhairamahdi opened this issue Jun 30, 2020 · 20 comments · Fixed by #12269
Closed
1 of 7 tasks

Arabic text shows as question marks (?) #12094

zuhairamahdi opened this issue Jun 30, 2020 · 20 comments · Fixed by #12269
Labels
type/enhancement An improvement of existing functionality
Milestone

Comments

@zuhairamahdi
Copy link

  • Gitea version (or commit ref):
  • Git version:
  • Operating system:
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

Hi,
we just noticed that when we create a new issue or pull request all the Arabic text transforms to question marks.
I think that we can fix this by changing the database from varchar to be nvarchar in MS SQL.

Screenshots

arabictext

@zeripath
Copy link
Contributor

This is going to be a problem with your database collation and character set

@zuhairamahdi
Copy link
Author

zuhairamahdi commented Jul 6, 2020

@zeripath
it is a problem with the type in MS SQL as varchar doesn't include Arabic characters and need to be replaced to be nvarchar

@zeripath
Copy link
Contributor

zeripath commented Jul 6, 2020

OK as a holding measure whilst we think about this:

https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver15#remarks

If you have sites that support multiple languages:

Starting with SQL Server 2019 (15.x), consider using a UTF-8 enabled collation to support Unicode and minimize character conversion issues.
If using a lower version of the SQL Server Database Engine, consider using the Unicode nchar or nvarchar data types to minimize character conversion issues.

AFAICS You can set the collation to a UTF-8 collation to make varchar and nvarchar essentially the same.

My concern with migrating to use NVARCHAR is that we're essentially doubling the size of our data when setting the collation properly would allow better encodings to be used by the db.

@lunny what are your thoughts?

@lunny
Copy link
Member

lunny commented Jul 7, 2020

In fact I have ever thought xorm should provide an option to consider VARCHAR is NVARCHAR, but I don't know if anyone need that. Maybe it's the answer of this issue?

@LukeOwlclaw
Copy link

We are having the same issue for emoticons. They become ??

This problem cannot be reproduced on https://try.gitea.io/Luke2/Repo2/pulls/1

What kind of database is try.gitea.io using?

@zeripath
Copy link
Contributor

zeripath commented Jul 9, 2020

@LukeOwlclaw what database are you using? If, as I suspect, you are using MySQL in which you need to change to use utf8mb4 as the default charset and use gitea convert to convert your database from utf8 to utf8mb4.

@zuhairamahdi did changing the collation to a UTF8 enabled collation solve your problem?

@zuhairamahdi
Copy link
Author

@zuhairamahdi did changing the collation to a UTF8 enabled collation solve your problem?

I don't think it is possible to change the collation to UTF-8 as we use SQL Server 2012 not 2019
though there is a long list of collation to use which break the data if we selected Arabic:
image

@zeripath
Copy link
Contributor

Googling I think you might want to try: Latin1_General_100_CI_AI_SC

@LukeOwlclaw
Copy link

@zeripath We are using SQL Server 2016. So we are facing the same problem as @zuhairamahdi.

@zuhairamahdi
Copy link
Author

Googling I think you might want to try: Latin1_General_100_CI_AI_SC

I tried it in an instance in my local environment and did not fix the issue

@zeripath
Copy link
Contributor

OK, I think we need to add an option to https://gitea.com/xorm/xorm to allow us to default to nvarchar as required. Looking at engine.Sync2 I think if you change your columns to NVARCHAR manually that xorm would just deal with it without changing them back or complaining too loud.

Of course - make a backup before you do this.

@PaulBol
Copy link

PaulBol commented Jul 14, 2020

I did a quick test and changed the data type for comments with

ALTER TABLE [comment] ALTER COLUMN [content] nvarchar(max)

Everything worked fine afterwards. I'm only not sure if the change might interfere with a future database update.

For us this change seems sufficient. We want to keep things like issue titles, user names and release notes in plain English. So there's little reason to encourage users to include fancy characters. In other cases using UTF-16 makes little sense like for (commit) hashes, email addresses and access tokens.

If helpful for anyone, here's a T-SQL command for generating a script for converting all varchar columns to nvarchar. But I'm not in a position to tell if applying it blindly is a good idea.

SELECT 'ALTER TABLE [' + TABLE_NAME + '] ALTER COLUMN [' + COLUMN_NAME + '] nvarchar(' + IIF(CHARACTER_MAXIMUM_LENGTH=-1,'max',CAST(CHARACTER_MAXIMUM_LENGTH AS varchar)) + ')' + IIF(IS_NULLABLE='NO',' NOT NULL','') FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE='varchar'

@zuhairamahdi
Copy link
Author

ALTER TABLE [comment] ALTER COLUMN [content] nvarchar(max)

Thanks @PaulBol
it actually worked:
image

hope that there is a way to do it automatically rather than going manually and inspecting the tables if there is a column that need to be changed from varchar to nvarchar

@PaulBol
Copy link

PaulBol commented Jul 16, 2020

hope that there is a way to do it automatically rather than going manually and inspecting the tables if there is a column that need to be changed from varchar to nvarchar

Note that the SELECT 'ALTER TABLE ... command that I shared above does this. But it will create a script that changes every varchar column which is not needed. E.g. for columns storing commit hashes with only characters [0-9a-f].

@lunny
Copy link
Member

lunny commented Jul 16, 2020

In fact I have ever thought xorm should provide an option to consider VARCHAR is NVARCHAR, but I don't know if anyone need that. Maybe it's the answer of this issue?

I have said that. I think xorm should provide a special option for MSSQL

engine, err := NewEngineWithParams(driver, connStr, map[string]string{
 "DEFAULT_VARCHAR": "NVARCHAR",
})

Then all new columns will be created with nvarchar even if the tag on field of struct are blank or VARCHAR.

Of course, for existance columns, you have to convert them manually like above.

@zeripath
Copy link
Contributor

So I've looked at using that SQL and putting it directly in to a gitea convert command - but unfortunately it would require the constraints to be dropped and readded over time.

@lunny
Copy link
Member

lunny commented Jul 18, 2020

I have sent a PR to xorm and will send a PR to gitea after that merged. Please review https://gitea.com/xorm/xorm/pulls/1741

@lunny lunny added the type/enhancement An improvement of existing functionality label Jul 18, 2020
@lunny lunny added this to the 1.13.0 milestone Jul 18, 2020
@lunny
Copy link
Member

lunny commented Jul 19, 2020

I have sent a PR #12269 to try to resolve this issue.

@zuhairamahdi
Copy link
Author

I have sent a PR #12269 to try to resolve this issue.

thanks @lunny will this change current columns already in the database? or it will work only for new instances?

@lunny
Copy link
Member

lunny commented Jul 19, 2020

@zuhairamahdi It's only for new instance or new column on exist instance. You have to convert old column manually.

@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type/enhancement An improvement of existing functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants