A lightweight tool that converts text and source-code files to UTF-8 encoding. It can either be run from the command-line interface (a.k.a. "CLI" or "console") or imported into your own Python code.
1. Make sure Python 3 (preferably 3.7 or above) is properly installed. [Optional] A dependency-management tool such as Poetry is also recommended.
2. Install dependencies. In your console, execute one of:
    - `pip3 install cvt2utf`
    - `pip3 install -r "./requirements.txt"`
    - For Poetry users: `poetry install`
3. After installation, make sure `cvt2utf` is in your `PATH` environment variable.
There is only one mandatory argument: `filename`, which can be either a directory or a file name.
- Directory mode: pass a directory as the input, and all text files underneath it that meet the criteria will be converted to UTF-8.
- Single-file mode: if the input argument is an individual file, it will be converted to UTF-8 directly.
Examples:

- Convert all .txt files to UTF-8 encoding, and also remove BOMs from utf_8_sig-encoded files:

  `cvt2utf convert "/path/to/your/repo"`

- Convert all .php files to UTF-8 encoding, but skip processing utf_8_sig-encoded PHP files:

  `cvt2utf convert "/path/to/your/repo" -ext php --skiputf`

- Convert all .csv files to UTF-8-SIG encoding. Since BOMs are used by some applications (such as Microsoft Excel), here we want to add them:

  `cvt2utf convert "/path/to/your/repo" -bom -ext csv`

- Convert all .c and .cpp files to UTF-8 with BOMs. This action also adds BOMs to existing UTF-encoded files. Visual Studio may mandate BOMs in source files; if they are missing, Visual Studio may be unable to compile them:

  `cvt2utf convert "/path/to/your/repo" -bom -ext c cpp`

- Convert an individual file:

  `cvt2utf convert "/path/to/your/repo/a.txt"`

- After manually verifying that the new UTF-8 files are correct, remove all .bak files:

  `cvt2utf cleanbak "/path/to/your/repo"`

- Alternatively, if you are extremely confident in everything, you can convert files without creating backups in the first place. Use the `--nobak` option with extra caution!

  `cvt2utf convert "/path/to/your/repo" --nobak`

- Display help information:

  `cvt2utf -h`

- Show version information:

  `cvt2utf -v`
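Since the tool can also be imported into Python code, it may help to see the core idea in code form. The sketch below is a stdlib-only illustration of what "convert a file to UTF-8 with a backup" amounts to; the function name, the candidate-encoding list, and the backup scheme are assumptions for illustration, not cvt2utf's actual API or detection logic.

```python
from pathlib import Path

# Candidate source encodings to try, in order. This list is illustrative,
# not the detection logic cvt2utf actually uses; note that "latin-1"
# accepts any byte sequence, so it acts as a catch-all fallback.
CANDIDATES = ["utf-8-sig", "utf-8", "gbk", "latin-1"]

def convert_to_utf8(path: Path) -> bool:
    """Re-encode a single file as BOM-less UTF-8, keeping a .bak backup."""
    raw = path.read_bytes()
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        # Back up the original bytes before overwriting.
        path.with_suffix(path.suffix + ".bak").write_bytes(raw)
        path.write_text(text, encoding="utf-8")
        return True
    return False  # undecodable: leave the file untouched
```

Because `utf-8-sig` is tried first, a leading BOM is stripped during decoding, which mirrors the default behavior described in the examples above.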
By default, the converted output text files will NOT contain a BOM (byte order mark). However, you can use the switch `-b` or `--addbom` to explicitly include a BOM in the output text files.
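In Python terms, "adding a BOM" is what the `utf-8-sig` codec does, as this small demonstration shows:

```python
# Encoding with "utf-8-sig" prepends the three-byte UTF-8 BOM;
# plain "utf-8" does not.
with_bom = "hello".encode("utf-8-sig")
without_bom = "hello".encode("utf-8")

assert with_bom == b"\xef\xbb\xbf" + without_bom
# Decoding with "utf-8-sig" strips the BOM again if present.
assert with_bom.decode("utf-8-sig") == "hello"
```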
You should only feed text-like files to cvt2utf; binary files (such as .exe files) should be left untouched. How do we distinguish them? By extension name. By default, files with the extension `txt` are processed. Feel free to customize this list, either by editing the source code or via command-line arguments.

Empty files are ignored, as are files larger than 10 MB; this is a reasonable limit for most use cases. If you really want to change it, feel free to do so.
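A minimal sketch of such an eligibility check, assuming the default `txt` extension list and the 10 MB cap described above (the function name and set layout are illustrative, not cvt2utf's internals):

```python
from pathlib import Path

MAX_SIZE = 10 * 1024 * 1024  # 10 MB cap, mirroring the default above
EXTENSIONS = {".txt"}        # extend this set to taste

def eligible(path: Path) -> bool:
    """Return True for non-empty files with a known text-like extension
    that are no larger than MAX_SIZE bytes."""
    return (
        path.suffix.lower() in EXTENSIONS
        and 0 < path.stat().st_size <= MAX_SIZE
    )
```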
To learn more about byte-order-mark (BOM), please check: https://en.wikipedia.org/wiki/Byte_order_mark
Below is a list of places where a BOM might cause problems. To keep your life easy and smooth, it is advised to remove BOMs from these files.
- Jekyll: Jekyll is a Ruby-based CMS that generates static websites. Please remove BOMs from your source files. Also remove them from your CSS files if you are SASSifying.
- PHP: BOMs in `*.php` files should be stripped.
- JSP: BOMs in `*.jsp` files should be stripped.
- (to be added...)
BOMs in the following files are not strictly necessary, but adding them is recommended.

- Source code in Visual Studio projects: MSDN recommends to "Always prefix a Unicode plain text file with a byte order mark" (link). Visual Studio may mandate BOMs in source files; if they are missing, Visual Studio may not be able to compile them.
- CSV: BOMs in CSV files might be useful and even necessary, especially if they are opened in Excel.
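Both directions can be illustrated with a short stdlib-only sketch; the helper names here are hypothetical, not cvt2utf's API:

```python
import codecs
from pathlib import Path

def strip_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from a file, if present."""
    raw = path.read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        path.write_bytes(raw[len(codecs.BOM_UTF8):])
        return True
    return False

def add_bom(path: Path) -> bool:
    """Prepend a UTF-8 BOM (e.g. for CSVs destined for Excel)."""
    raw = path.read_bytes()
    if not raw.startswith(codecs.BOM_UTF8):
        path.write_bytes(codecs.BOM_UTF8 + raw)
        return True
    return False
```

Both helpers are idempotent: running them twice on the same file changes nothing the second time.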
- ASCII: just 1 byte. 1st byte: `00`~`7F`.
- Latin-1: just 1 byte. ASCII charset + (`80`~`FF`).
- GB2312: 2 bytes. ASCII charset + (1st byte: `A1`~`FE` (or, more restrictively, `A1`~`F7`) with 2nd byte: `A1`~`FE`).
- GBK: 2 bytes. ASCII charset + (1st byte: `81`~`FE` with 2nd byte: `40`~`FE`).
- UTF-8: variable length. Code points `0x00`~`0x7F` take 1 byte; `0x80`~`0x7FF` take 2 bytes; `0x800`~`0xFFFF` take 3 bytes; `0x10000`~`0x10FFFF` take 4 bytes. It is the de facto standard for i18n.
Compared with UTF-16, UTF-8 is usually more compact and preserves full fidelity. It also does not suffer from UTF-16's endianness issues.
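The byte lengths above, and the compactness comparison with UTF-16, can be checked directly in Python:

```python
# Byte lengths of UTF-8 for each code-point range listed above.
assert len("A".encode("utf-8")) == 1    # U+0041: 1-byte range
assert len("é".encode("utf-8")) == 2    # U+00E9: 2-byte range
assert len("中".encode("utf-8")) == 3   # U+4E2D: 3-byte range
assert len("😀".encode("utf-8")) == 4   # U+1F600: 4-byte range

# Mostly-ASCII text is far more compact in UTF-8 than in UTF-16,
# and Python's "utf-16" codec prepends a 2-byte BOM to record endianness.
assert len("hello".encode("utf-8")) == 5
assert len("hello".encode("utf-16")) == 12  # 2-byte BOM + 5 × 2 bytes
```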
Indeed, there are plenty of text editors with stunning text-encoding capabilities. Yet for users who want to do batch conversions, this tool can be handy.
Additionally, some users have brought to my attention Linux commands such as `sed`, `iconv`, and `enca`. All of them share one limitation: they are Linux-only commands, not applicable to other operating systems.

- `iconv` requires you to explicitly specify the "from-encoding" of the file. Moreover, it converts a single file at a time, so you have to write a shell script for batch conversion. Worst of all, it lacks adaptability: the whole set of files must be encoded in the same character set. See here for more information.
- `recode` is a really nice and powerful tool. It goes further by supporting CR-LF conversion and Base64. See here and here.
- `sed` can be used to add or remove BOMs. It can also be used in combination with `iconv`.
- `enca` is used to detect the current encoding of a file.