Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify all Windows build configuration on the same CharacterSet #888

Open
arixmkii opened this issue Dec 5, 2023 · 11 comments
Open

Unify all Windows build configuration on the same CharacterSet #888

arixmkii opened this issue Dec 5, 2023 · 11 comments

Comments

@arixmkii
Copy link

arixmkii commented Dec 5, 2023

Right now there are cases, when configuration sets Multibyte

<CharacterSet>MultiByte</CharacterSet>
but in other places it uses Unicode
<CharacterSet>Unicode</CharacterSet>
and
<CharacterSet>Unicode</CharacterSet>
and in many it is left unset. I believe it would be beneficial to set all build configurations to the same default for the least surprise behavior. It seems that most common explicitly set in Multibyte, so, probably it should be explicit value everywhere.

@arixmkii arixmkii changed the title Unify all Windows build configuration on the same charset Unify all Windows build configuration on the same CharacterSet Dec 5, 2023
@irwir
Copy link

irwir commented Jan 20, 2024

Multibyte is obsolete, Unicode is the right way to go.

@Mr-Clam
Copy link

Mr-Clam commented Jan 20, 2024

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.

@Neustradamus
Copy link

@madler: Can you look this issue?

@irwir
Copy link

irwir commented Jan 20, 2024

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.

This is mostly correct, yet shows zero experience with Visual Studio and Windows.
Use Unicode Character Setin VS projects implies UTF-16 LE;. and the same UTF-16 LE is internally used by Windows.
Use of multi-byte (DBCS) was deprecated since VS 2013.

@arixmkii
Copy link
Author

According to this https://devblogs.microsoft.com/cppblog/mfc-support-for-mbcs-deprecated-in-visual-studio-2013/ it looks like they later undeprecated it, but still recommend Unicode for new code bases. Also, PowerShell team seems to be using Multibyte for their rebuild of ZLib https://github.com/PowerShell/ZLib/blob/20ce7b065269cda22beb1cd9ecc450be5ea99b8f/build/zlib.vcxproj#L56 most notably used by their OpenSSH fork. I don't think they would be using it if it would be completely deprecated.

Any downstream would be able to change the settings to match their specific needs in the rebuild. So, if it will be Unicode across the board for all projects it should be just fine.

@madler
Copy link
Owner

madler commented Jan 21, 2024

Does this choice make any difference at all in the compiled zlib code? I believe that zlib does not use any types or functions that change their nature depending on Unicode vs. MultiByte. There is only explcit use of the wide character type and associated functions for gzopen_w(), and for error reporting.

@irwir
Copy link

irwir commented Jan 21, 2024

@arixmkii

it looks like they later undeprecated it, but still recommend Unicode for new code bases.

It was repeated twice that the warning would be removed.
Even with continued support this is not exactly undeprecated.
One incurable flaw is that two differnt Unicode strings could be converted to the same MBCS byte sequence.

@irwir
Copy link

irwir commented Jan 21, 2024

Does this choice make any difference at all in the compiled zlib code? I believe that zlib does not use any types or functions that change their nature depending on Unicode vs. MultiByte. There is only explcit use of the wide character type and associated functions for gzopen_w(), and for error reporting.

Never had reasons to look at this part of the library because compression and decompression code do not depend on character width choice, and it worked well for me with Unicode.
But there is difference when calling Windows API even indirectly. Explicit use (or mix) of wide and byte characters is always possible, but requires caution.
For example,

zlib/gzlib.c

Line 206 in 88ec246

strcpy(state->path, path);

Copy of wchar_t string to char string usually would be the first character only, because second byte is null.

Another issue is Unix-ism 0666

zlib/gzlib.c

Line 233 in 88ec246

fd == -2 ? _wopen(path, oflag, 0666) :

By some luck this value includes desired flags, but properly it should be _S_IREAD | _S_IWRITE for read and write access.

Also, there are peculiarities when using printf with wide characters.
Standard way of handling character width is using TCHAR and _T (or _TEXT) macros, and macros for calling functions, such as _topen.

In conclusion it might be said that conversion to Unicode is desirable, but cannot be done by simple substitution of parameters in VS projects.

@madler
Copy link
Owner

madler commented Jan 21, 2024

The question was "Does this choice make any difference at all in the compiled zlib code?"

To be explicit, do any of the bits in the resulting library differ if I change Unicode to MultiByte in the project file?

It's those T and t functions and types that change behavior as a result, my point being that zlib doesn't use those.

Neither strcpy() nor any printf() variant is ever called with a pointer to wide characters in zlib.

_S_IREAD | _S_IWRITE is the first 6 in the 0666.

A Unicode open function in zlib is already provided by gzopen_w().

@Mr-Clam
Copy link

Mr-Clam commented Jan 22, 2024

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.
This is mostly correct, yet shows zero experience with Visual Studio and Windows.
Use Unicode Character Setin VS projects implies UTF-16 LE;. and the same UTF-16 LE is internally used by Windows.
Use of multi-byte (DBCS) was deprecated since VS 2013.

Actually, I have plenty of experience with this. I even benchmarked an application that did a lot string process about 20 years ago and found building for Unicode was slower. It seems the conversion in the all the "A" functions in the Win32 API is less expensive than doubling the size of all your strings with mostly null bytes. Of course, if zlib only uses char instead of something portable like TCHAR, the point is moot. Somebody has to convert char to wchar_t if there's a Win32 call - you either do it explicitly or it's done implicitly for you. For most users, using single byte is fine... until of course they have to use strings (e.g. filenames) that contain "foreign" characters that can't be expressed in their current active code page. Even Windows' double-byte Unicode support isn't as simple as you think: how does it handle surrogates? (That's a rhetorical question)

Looking at @madler 's comment, I'm wondering if defining Unicode is necessary at all, and also, what problem occurs if you don't define this?

@irwir
Copy link

irwir commented Jan 23, 2024

It's those T and t functions and types that change behavior as a result, my point being that zlib doesn't use those.

In any case, convertion to multi-byte string was not necessary.

_S_IREAD | _S_IWRITE is the first 6 in the 0666.

This also sets Execute/search permission and other (possibly reserved for the future or used internally) bits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants