This lightweight tool converts non-UTF-encoded (such as GB2312, GBK, BIG5 encoded) files to UTF-8 encoding.

x1angli/cvt2utf

Converts text files or source code files into UTF-8 encoding

A lightweight tool that converts text and source code files to UTF-8 encoding. It can be run from the command-line interface (a.k.a. "CLI" or "console") or imported into your own Python code.

Installation

  1. Make sure Python 3 (preferably 3.7 or above) is properly installed.
       • [Optional] Dependency management tools such as Poetry are also recommended.
  2. Install dependencies, using one of the following:
       • In your console, execute pip3 install cvt2utf
       • Or, pip3 install -r "./requirements.txt"
       • Or, for Poetry users, run poetry install
  3. After installation, make sure cvt2utf is in your PATH environment variable.

Usage

There is only one mandatory argument: filename, which specifies either a directory or a single file.

  • Directory mode: If the input is a directory, all text files underneath it that meet the criteria will be converted to UTF-8.
  • Single file mode: If the input is an individual file, it is converted to UTF-8 directly.
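
The two modes can be pictured with a short sketch. This is only an illustration of the idea, not the code cvt2utf actually runs: the function name `collect_targets` and its `extensions` parameter are hypothetical, and the default of `txt` is taken from the usage notes in this README.

```python
import os


def collect_targets(path, extensions=("txt",)):
    """Return the list of files that would be converted for a given input path.

    Hypothetical helper for illustration; not part of the cvt2utf API.
    """
    if os.path.isfile(path):
        # Single file mode: the file is taken as-is, whatever its extension.
        return [path]

    # Directory mode: walk the tree and keep files whose extension matches.
    targets = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if ext in extensions:
                targets.append(os.path.join(root, name))
    return sorted(targets)
```

Calling `collect_targets("/path/to/your/repo", extensions=("c", "cpp"))` would, under these assumptions, list every .c and .cpp file below that directory.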

Examples:

  • Changes all .txt files to UTF-8 encoding. Additionally, removes BOMs from utf_8_sig-encoded files:

    cvt2utf convert "/path/to/your/repo"

  • Changes all .php files to UTF-8 encoding, but skips PHP files that are already utf_8_sig-encoded:

    cvt2utf convert "/path/to/your/repo" -ext php --skiputf

  • Changes all .csv files to UTF-8-SIG encoding.

    Some applications (such as Microsoft Excel) rely on the BOM, so we add it here:

    cvt2utf convert "/path/to/your/repo" -bom -ext csv

  • Convert all .c and .cpp files to UTF-8 with BOMs.

    This action will also add BOMs to existing UTF-encoded files.

    Visual Studio may mandate BOMs in source files. If BOMs are missing, Visual Studio may be unable to compile them.

    cvt2utf convert "/path/to/your/repo" -bom -ext c cpp

  • Converts an individual file

    cvt2utf convert "/path/to/your/repo/a.txt"

  • After manually verifying that the new UTF-8 files are correct, you can remove all .bak files

    cvt2utf cleanbak "/path/to/your/repo"

  • Alternatively, if you are confident in the result, you can convert files without creating backups in the first place.

    Use the --nobak option with extra caution!

    cvt2utf convert "/path/to/your/repo" --nobak

  • Display help information

    cvt2utf -h

  • Show version information

    cvt2utf -v
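
To make the convert / cleanbak workflow above more concrete, here is a rough sketch of what converting one file with a backup could look like. It is not the tool's implementation: the names `convert_file` and `remove_backups` are hypothetical, the source encoding is passed in explicitly instead of being detected, and the backup is assumed to be the original file name with `.bak` appended.

```python
import os
import shutil


def convert_file(path, source_encoding, add_bom=False, keep_backup=True):
    """Re-encode one file to UTF-8 in place (hypothetical sketch)."""
    with open(path, "rb") as f:
        raw = f.read()

    # Decode first, so a decoding failure leaves the file untouched.
    # Decoding with "utf_8_sig" drops a leading BOM if one is present.
    text = raw.decode(source_encoding)

    if keep_backup:
        # Assumed naming scheme: "<original name>.bak".
        shutil.copy2(path, path + ".bak")

    target_encoding = "utf_8_sig" if add_bom else "utf_8"
    # Binary mode keeps line endings exactly as they were.
    with open(path, "wb") as f:
        f.write(text.encode(target_encoding))


def remove_backups(directory):
    """Delete every .bak file below a directory (hypothetical sketch)."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".bak"):
                os.remove(os.path.join(root, name))
```

Passing `keep_backup=False` mirrors the spirit of the --nobak option: the original bytes are gone once the file is rewritten, which is why that option deserves caution.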

Usage Note

1. About BOM

By default, the converted output text files will NOT contain BOM (byte order mark).

However, you can use the switch -b or --addbom to explicitly include BOM in the output text files.
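
For readers unfamiliar with what the BOM looks like on disk, the following sketch uses only the Python standard library. The helper names `strip_bom` and `add_bom` are made up for this example and are not functions exported by cvt2utf.

```python
import codecs


def strip_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def add_bom(data):
    """Return the bytes with exactly one leading UTF-8 BOM."""
    return codecs.BOM_UTF8 + strip_bom(data)


text = "hello"
plain = text.encode("utf_8")         # no BOM
with_bom = text.encode("utf_8_sig")  # BOM (EF BB BF) followed by the same bytes

assert with_bom == add_bom(plain)
assert strip_bom(with_bom) == plain
# The utf_8_sig codec accepts input with or without a BOM when decoding.
assert with_bom.decode("utf_8_sig") == text
assert plain.decode("utf_8_sig") == text
```

In Python, `utf_8` and `utf_8_sig` differ only in this three-byte prefix, which is why the examples in this README refer to BOM-prefixed files as utf_8_sig-encoded.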

2. About file extensions

You should only feed text-like files to cvt2utf; binary files (such as .exe files) should be left untouched. To tell them apart, cvt2utf relies on file extensions. By default, only files with the txt extension are processed. Feel free to customize this list, either by editing the source code or with command line arguments.

3. About file size limits

Empty files are ignored, and so are files larger than 10MB. This is a reasonable limit; if you really want to change it, feel free to do so.
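
The size rule can be expressed as a one-line check. The sketch below is an assumption-laden illustration: `should_process` and `MAX_SIZE` are hypothetical names, and "10MB" is interpreted as 10 * 1024 * 1024 bytes, which may not match the exact constant in the source code.

```python
import os

# Assumed interpretation of "10MB"; the real constant may differ.
MAX_SIZE = 10 * 1024 * 1024


def should_process(path, max_size=MAX_SIZE):
    """Return True for non-empty files that do not exceed the size limit."""
    size = os.path.getsize(path)
    return 0 < size <= max_size
```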

Trivial knowledge

1. About BOM

To learn more about byte-order-mark (BOM), please check: https://en.wikipedia.org/wiki/Byte_order_mark

1.1 When should we remove BOM?

Below is a list of places where a BOM might cause problems. Removing BOMs from these files is advised.

  • Jekyll : Jekyll is a Ruby-based CMS that generates static websites. Please remove BOMs in your source files. Also, remove them in your CSS if you are SASSifying.
  • PHP: BOMs in *.php files should be stripped.
  • JSP: BOMs in *.jsp files should be stripped.
  • (to be added...)

1.2 When should we add BOM?

BOMs are not strictly necessary in these files, but adding them is recommended.

  • Source code in Visual Studio projects: MSDN recommends: "Always prefix a Unicode plain text file with a byte order mark". Visual Studio may mandate BOMs in source files. If BOMs are missing, Visual Studio may not be able to compile them.

  • CSV: BOMs in CSV files can be useful, and sometimes necessary, especially if the files are opened in Excel.

2. About UTF & Unicode

  • ASCII: Just 1 byte. 1st byte: 00~7F
  • Latin-1: Just 1 byte. ASCII charset + (80~FF)
  • GB2312: 2 bytes. ASCII charset + (1st byte: A1~FE (or more restrictively, A1~F7) with 2nd byte: A1~FE).
  • GBK: 2 bytes. ASCII charset + (1st byte: 81~FE with 2nd byte: 40~FE, excluding 7F).
  • UTF-8: Variable length, 1 to 4 bytes per code point: 0x00~0x7F (1 byte); 0x80~0x7FF (2 bytes); 0x800~0xFFFF (3 bytes); 0x10000~0x10FFFF (4 bytes)
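
The UTF-8 code point ranges listed above map directly onto encoded lengths of 1 to 4 bytes. The sketch below checks that mapping against Python's own codecs; `utf8_length` is a helper written for this example, not something provided by cvt2utf.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for one code point (surrogates aside)."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("not a valid Unicode code point")


# One sample character from each UTF-8 range.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001f600"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf_8"))

# The same CJK character takes 2 bytes in GB2312 or GBK, but 3 bytes in UTF-8.
cjk = "\u4e2d"
print(len(cjk.encode("gb2312")), len(cjk.encode("gbk")), len(cjk.encode("utf_8")))
```

This is also why converting Chinese text from GBK to UTF-8 usually makes files somewhat larger, while ASCII-only content stays the same size.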

See Also

FAQ

Why do we choose UTF-8 among all charsets?

It is the de-facto standard for i18n.

Compared with UTF-16, UTF-8 is usually more compact while still representing every Unicode character "with full fidelity". It also doesn't suffer from the endianness issue of UTF-16.

Why do we need this tool?

Indeed, there are plenty of text editors with excellent text encoding capabilities. Yet for users who want to do batch conversions, this tool could be handy.

Additionally, some users have pointed me to Linux commands such as sed, iconv, recode and enca. All of them share the limitation that they are primarily Linux commands and are not readily available on other operating systems.

  • iconv requires you to explicitly specify the "from-encoding" of the file. Moreover, it converts a single file at a time, so you have to write a bash script for batch conversion. Worst of all, it cannot adapt per file, so all files in a batch have to be encoded in the same character set.
  • recode is a nice and powerful tool. It goes further by supporting CR-LF conversion and Base64.
  • sed can be used to add or remove BOM. It can also be used in combination with iconv.
  • enca is used to detect the current encoding of a file.
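
Handling a batch of files in mixed encodings requires guessing each file's encoding before converting it. The sketch below shows one naive way to do that with the standard library alone, by trying candidate encodings in order. It is not how cvt2utf detects encodings; the function name `guess_encoding` and the candidate list are assumptions chosen for this example.

```python
import codecs

# Assumed candidate list, tried in this order; adjust it to your own files.
CANDIDATES = ("utf_8", "gb18030", "big5")


def guess_encoding(raw, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes the bytes, or None."""
    if raw.startswith(codecs.BOM_UTF8):
        # A UTF-8 BOM is an explicit hint, so check it before anything else.
        return "utf_8_sig"
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None
```

Trial decoding is inherently ambiguous: many byte sequences are valid in several legacy encodings, so the order of candidates matters and a wrong guess can still decode without error. Dedicated detection libraries such as chardet rely on statistical analysis instead, and manually verifying the converted files remains a good idea.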
