A lightweight tool that converts text and source-code files to UTF-8 encoding. It can either be run from the command-line interface (a.k.a. "CLI" or "console") or imported into your own Python code.
1. Make sure Python 3 (preferably 3.7 or above) is properly installed. [Optional] A dependency-management tool such as Poetry is also recommended.
2. Install dependencies. In your console, execute one of:
    - `pip3 install cvt2utf`
    - `pip3 install -r "./requirements.txt"`
    - For Poetry users: `poetry install`
3. After installation, make sure `cvt2utf` is in your `PATH` environment variable.
There is only one mandatory argument: `filename`, which can be either a directory or a file name.
- Directory mode: pass a directory as the input, and all text files underneath it that meet the criteria will be converted to UTF-8.
- Single-file mode: if the input argument is an individual file, it will be converted to UTF-8 directly.
Examples:

- Convert all .txt files to UTF-8 encoding, and also remove BOMs from utf_8_sig-encoded files:

  `cvt2utf convert "/path/to/your/repo"`

- Convert all .php files to UTF-8 encoding, but skip processing utf_8_sig-encoded PHP files:

  `cvt2utf convert "/path/to/your/repo" -ext php --skiputf`

- Convert all .csv files to UTF-8-SIG encoding. Since BOMs are used by some applications (such as Microsoft Excel), here we want to add them:

  `cvt2utf convert "/path/to/your/repo" -bom -ext csv`

- Convert all .c and .cpp files to UTF-8 with BOMs. This action also adds BOMs to existing UTF-encoded files. Visual Studio may mandate BOMs in source files; if they are missing, Visual Studio may be unable to compile them:

  `cvt2utf convert "/path/to/your/repo" -bom -ext c cpp`

- Convert an individual file:

  `cvt2utf convert "/path/to/your/repo/a.txt"`

- After manually verifying that the new UTF-8 files are correct, remove all .bak files:

  `cvt2utf cleanbak "/path/to/your/repo"`

- Alternatively, if you are extremely confident in everything, you can convert files without creating backups in the first place. Use the `--nobak` option with extra caution!

  `cvt2utf convert "/path/to/your/repo" --nobak`

- Display help information:

  `cvt2utf -h`

- Show version information:

  `cvt2utf -v`
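Since the tool can also be imported into Python code, it may help to see the core idea in code form. The sketch below is a stdlib-only illustration of what "convert a file to UTF-8 with a backup" amounts to; the function name, the candidate-encoding list, and the backup scheme are assumptions for illustration, not cvt2utf's actual API or detection logic.

```python
from pathlib import Path

# Candidate source encodings to try, in order. This list is illustrative,
# not the detection logic cvt2utf actually uses; note that "latin-1"
# accepts any byte sequence, so it acts as a catch-all fallback.
CANDIDATES = ["utf-8-sig", "utf-8", "gbk", "latin-1"]

def convert_to_utf8(path: Path) -> bool:
    """Re-encode a single file as BOM-less UTF-8, keeping a .bak backup."""
    raw = path.read_bytes()
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        # Back up the original bytes before overwriting.
        path.with_suffix(path.suffix + ".bak").write_bytes(raw)
        path.write_text(text, encoding="utf-8")
        return True
    return False  # undecodable: leave the file untouched
```

Because `utf-8-sig` is tried first, a leading BOM is stripped during decoding, which mirrors the default behavior described in the examples above.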
By default, the converted output text files will NOT contain a BOM (byte order mark). However, you can use the switch `-b` or `--addbom` to explicitly include a BOM in the output text files.
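In Python terms, "adding a BOM" is what the `utf-8-sig` codec does, as this small demonstration shows:

```python
# Encoding with "utf-8-sig" prepends the three-byte UTF-8 BOM;
# plain "utf-8" does not.
with_bom = "hello".encode("utf-8-sig")
without_bom = "hello".encode("utf-8")

assert with_bom == b"\xef\xbb\xbf" + without_bom
# Decoding with "utf-8-sig" strips the BOM again if present.
assert with_bom.decode("utf-8-sig") == "hello"
```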
You should only feed text-like files to cvt2utf; binary files (such as .exe files) should be left untouched. How do we distinguish them? By extension name. By default, files with the extension `txt` are processed. Feel free to customize this list, either by editing the source code or via command-line arguments.

Empty files are ignored, as are files larger than 10 MB; this is a reasonable limit for most use cases. If you really want to change it, feel free to do so.
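A minimal sketch of such an eligibility check, assuming the default `txt` extension list and the 10 MB cap described above (the function name and set layout are illustrative, not cvt2utf's internals):

```python
from pathlib import Path

MAX_SIZE = 10 * 1024 * 1024  # 10 MB cap, mirroring the default above
EXTENSIONS = {".txt"}        # extend this set to taste

def eligible(path: Path) -> bool:
    """Return True for non-empty files with a known text-like extension
    that are no larger than MAX_SIZE bytes."""
    return (
        path.suffix.lower() in EXTENSIONS
        and 0 < path.stat().st_size <= MAX_SIZE
    )
```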
To learn more about byte-order-mark (BOM), please check: https://en.wikipedia.org/wiki/Byte_order_mark
Below is a list of places where a BOM might cause problems. To keep your life easy and smooth, it is advised to remove BOMs from these files.
- Jekyll: Jekyll is a Ruby-based CMS that generates static websites. Please remove BOMs from your source files. Also remove them from your CSS files if you are SASSifying.
- PHP: BOMs in `*.php` files should be stripped.
- JSP: BOMs in `*.jsp` files should be stripped.
- (to be added...)
BOMs in the following files are not strictly necessary, but adding them is recommended.

- Source code in Visual Studio projects: MSDN recommends to "Always prefix a Unicode plain text file with a byte order mark" (link). Visual Studio may mandate BOMs in source files; if they are missing, Visual Studio may not be able to compile them.
- CSV: BOMs in CSV files might be useful and even necessary, especially if they are opened in Excel.
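Both directions can be illustrated with a short stdlib-only sketch; the helper names here are hypothetical, not cvt2utf's API:

```python
import codecs
from pathlib import Path

def strip_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from a file, if present."""
    raw = path.read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        path.write_bytes(raw[len(codecs.BOM_UTF8):])
        return True
    return False

def add_bom(path: Path) -> bool:
    """Prepend a UTF-8 BOM (e.g. for CSVs destined for Excel)."""
    raw = path.read_bytes()
    if not raw.startswith(codecs.BOM_UTF8):
        path.write_bytes(codecs.BOM_UTF8 + raw)
        return True
    return False
```

Both helpers are idempotent: running them twice on the same file changes nothing the second time.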
- ASCII: just 1 byte. 1st byte: `00`~`7F`.
- Latin-1: just 1 byte. ASCII charset + (`80`~`FF`).
- GB2312: 2 bytes. ASCII charset + (1st byte: `A1`~`FE` (or, more restrictively, `A1`~`F7`) with 2nd byte: `A1`~`FE`).
- GBK: 2 bytes. ASCII charset + (1st byte: `81`~`FE` with 2nd byte: `40`~`FE`).
- UTF-8: variable length. Code points `0x00`~`0x7F` take 1 byte; `0x80`~`0x7FF` take 2 bytes; `0x800`~`0xFFFF` take 3 bytes; `0x10000`~`0x10FFFF` take 4 bytes. It is the de facto standard for i18n.
Compared with UTF-16, UTF-8 is usually more compact and preserves full fidelity. It also does not suffer from UTF-16's endianness issues.
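The byte lengths above, and the compactness comparison with UTF-16, can be checked directly in Python:

```python
# Byte lengths of UTF-8 for each code-point range listed above.
assert len("A".encode("utf-8")) == 1    # U+0041: 1-byte range
assert len("é".encode("utf-8")) == 2    # U+00E9: 2-byte range
assert len("中".encode("utf-8")) == 3   # U+4E2D: 3-byte range
assert len("😀".encode("utf-8")) == 4   # U+1F600: 4-byte range

# Mostly-ASCII text is far more compact in UTF-8 than in UTF-16,
# and Python's "utf-16" codec prepends a 2-byte BOM to record endianness.
assert len("hello".encode("utf-8")) == 5
assert len("hello".encode("utf-16")) == 12  # 2-byte BOM + 5 × 2 bytes
```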
Indeed, there are plenty of text editors with stunning text-encoding capabilities. Yet for users who want to do batch conversions, this tool can be handy.
Additionally, some users have brought to my attention Linux commands such as `sed`, `iconv`, and `enca`. All of them share one limitation: they are Linux-only commands, not applicable to other operating systems.

- `iconv` requires you to explicitly specify the "from-encoding" of the file. Moreover, it converts a single file at a time, so you have to write a shell script for batch conversion. Worst of all, it lacks adaptability: the whole set of files must be encoded in the same character set. See here for more information.
- `recode` is a really nice and powerful tool. It goes further by supporting CR-LF conversion and Base64. See here and here.
- `sed` can be used to add or remove BOMs. It can also be used in combination with `iconv`.
- `enca` is used to detect the current encoding of a file.