Skip to content

Latest commit

 

History

History
191 lines (140 loc) · 7.83 KB

README.md

File metadata and controls

191 lines (140 loc) · 7.83 KB

base62

GoDoc Go Report Card Issues GitHub release MIT License

base62 is a correctly implemented, compact and fast implementation of Base62 encoding/decoding algorithm. It is inspired by the java implementation by glowfall.

This Base62 implementation can encode/decode both integers and bytes of arbitrary length, the correctness is tested using a large number of randomly generated bytes.

It is much faster than big.Int based implementation and is not much slower than typical Base64 implementations. See the benchmark results below.

Why Base62

In comparison with Base64/Base32, Base62 is more friendly to human:

  • it contains only alpha-numerical symbols, no special characters
  • can be validated by eyes and more simple regexp
  • can be fully selected by mouse double-click in any text editors and browser address bar
  • it's compact and generates shorter strings than Base32
  • it's the most natural and unambiguous way to encode your data in human-readable form :)

Variations of Base62 algorithm cann be widely used to represent authentication data in printable and easy-copyable form, for example to encode OAuth 2.0 access_token data.

Usage

// Basic usage.
Encode(src []byte) []byte
EncodeToString(src []byte) string
Decode(src []byte) ([]byte, error)
DecodeString(src string) ([]byte, error)
FormatInt(num int64) []byte
FormatUint(num uint64) []byte
ParseInt(src []byte) (int64, error)
ParseUint(src []byte) (uint64, error)

// Providing a dst buffer, you may reuse buffers to reduce memory allocation.
EncodeToBuf(dst []byte, src []byte) []byte
DecodeToBuf(dst []byte, src []byte) ([]byte, error)
AppendInt(dst []byte, num int64) []byte
AppendUint(dst []byte, num uint64) []byte

// Or you may use a custom encoding alphabet.
enc := NewEncoding("...my-62-byte-string-alphabet...")
enc.XXX()

Benchmark

Benchmark_Encode-12                     11654754                97.41 ns/op
Benchmark_Decode-12                     15481666                73.60 ns/op
Benchmark_EncodeToString-12             11950086                99.54 ns/op
Benchmark_DecodeString-12               16301325                74.36 ns/op

Benchmark_EncodeToBuf-12                13855840                84.68 ns/op
Benchmark_DecodeToBuf-12                97695962                12.21 ns/op

Benchmark_EncodeInteger-12              29119437                41.30 ns/op
Benchmark_DecodeInteger-12             120328183                 9.917 ns/op

Benchmark_Encode_BigInt-12               1000000              1048 ns/op

Benchmark_Base64_EncodeToString-12      17803440                70.12 ns/op
Benchmark_Base64_DecodeString-12        19884616                55.09 ns/op

Benchmark_Base64_Encode-12              68163142                17.93 ns/op
Benchmark_Base64_Decode-12              41990004                28.25 ns/op

How it works

Encoding Base64 and Base32 both are bit-aligned, 6 bits for Base64 (2^6 = 64) and 5 bits for Base32 (2^5 = 32), but Base62 is not bit-aligned, it's inefficient to do divmod operation for non-bit-aligned integers, typical Base62 implementations are BigInt based, which encodes input data block by block to get better performance, (e.g. https://github.com/keybase/saltpack/blob/master/encoding/basex).

64 characters can fully represent 6 bits, but 62 characters can not, if each character represents 5 bits, that is the Base32 encoding.

A naive BigInt based algorithm gives the shortest result, which is roughly like this (e.g. https://github.com/eknkc/basex/blob/6baac8ea8b19cc66d125286d213770fec0691867/basex.go#L46):

digits := []int{0}
for i := 0; i < len(src); i++ {
    carry := int(src[i])

    for j := 0; j < len(digits); j++ {
        carry += digits[j] << 8
        digits[j] = carry % e.base
        carry = carry / e.base
    }

    for carry > 0 {
        digits = append(digits, carry%e.base)
        carry = carry / e.base
    }
}
// map to encoding alphabet
_ = digits

This works but the time complexity is O(n^2) where n is the length of src. If the input can be very large, the cost is unacceptable.

Improved algorithm splits the input into blocks, which reduce the time complexity to O(kn) where k is the block size, and n is the length of src, it generates slightly longer result than naive BigInt algorithm, but it is worth for the reduced time complexity. When k is n, it degrades to the naive algorithm, when k is 1, the output length is twice of the input, we don't want that. 32 is chosen as the block size in library saltpack/encoding/basex.

Inspired by the java implementation by glowfall, this library is not BigInt based, it encodes and decodes any arbitrary bytes in O(n) time complexity. Here is a brief description of the algorithm.

This library uses a variadic length encoding, for each 6 bits, if the value is in range [0, 62), it can be directly map to the 62 characters, if the value is 62 or 63 which exceeds 61, we turn to encoding just the lower 5 bits. If we find a way to recognize the 5 bits pattern, then we can correctly decode it back to the source data.

The binary representation of 62 and 63 is:

62 - 0b_0011_1110
63 - 0b_0011_1111

They have a common mask 0b_0011_1110, in range [0, 62), there are another two integers have a similar mask 0b_0001_1110, 30 and 31, which is:

30 - 0b_0001_1110
31 - 0b_0001_1111

The four integers 30, 31, 62, 63 share a common mask 0b_0001_1110, while all other integers in range [0, 64) don't share the mask, i.e. for all other integers, the expression value & 0b_0001_1110 == 0b_0001_1110 evaluates to false.

This is the key point!

We define a compactMask as 0b_0001_1110.

When encoding, for each 6 bits integer x, if x & compactMask == compactMask is true, it must be one of 30, 31, 62 or 63, we just encode the lower 5 bits, which are 0b_0001_1110 (30) and 0b_0001_1111 (31), we leave the 6-th bit to next byte.

When decoding, for each encoded byte x, we check x & compactMask == compactMask, when it is false, we know that the byte represents 6 bits in the raw data, it is in range [0, 30) or [32, 62), else it represents 5 bits in the raw data, it is 30 or 31.

That is it, by using variadic length encoding, we successfully limit the value range to [0, 62), we get very compact result and a simple O(n) time complexity for encoding and decoding data of arbitrary length.

Compatibility

This library guarantees that it can correctly decode data encoded by itself.

The encoded result is not compatible with BigInt based algorithm, (e.g. saltpack/encoding/basex, GMP and GnuPG).

The algorithm may be ported to other languages in future (if someone does a porting, I'm glad to link it here :-) ).

Changelog

v1.1.0 - 2022/1/3

  1. Refactor the encoding code to be simpler and cleaner, as a bonus, it gives better performance which is 1.5X~ faster than the old.
  2. Add a brief description of the algorithm and state the compatibility with other Base62 implementations.

v1.0.0 - 2021/10/23

First stable release, the package has been used in several small and medium projects.

This release adds new APIs which help to reuse buffers to reduce memory allocation.