Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: math/big: base62 and/or arbitrary alphabet and base support #21558

Closed
anitgandhi opened this issue Aug 22, 2017 · 12 comments
Closed

proposal: math/big: base62 and/or arbitrary alphabet and base support #21558

anitgandhi opened this issue Aug 22, 2017 · 12 comments
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. Proposal Proposal-Accepted
Milestone

Comments

@anitgandhi
Copy link
Contributor

anitgandhi commented Aug 22, 2017

Introduction

In the process of developing fpe, I used math/big for it's useful big number equivalents of itoa and atoi with base/radix support. However, as of right now, math/big only supports a maximum base/radix of 36: https://github.com/golang/go/blob/master/src/math/big/natconv.go#L18-L24

I believe this can be expanded to allow at least base62 strings and possibly arbitrary alphabets and radices.

Proposal

I'm proposing two separate but related changes such that math/big can add support for:

  1. base62 strings. As of right now, interpreting a big.Int as base62 and base36 strings using Text() and SetString() results in the same string. In the context of FPE, that's effectively a collision in my opinion. See Base62 support capitalone/fpe#1 for details including a small 4-line diff that introduces base62 support
  2. Arbitrary alphabet and base support (Some context: Arbitrary alphabet and radix support capitalone/fpe#2) . This would allow users to define their own alphabets and use their own choice of base to use a subset of said alphabet. This would expand the current alphabet which is base36 (0-9, a-z) and MaxBase=36 even more.
    I believe this may entail:
    1. In natconv.go, change digits and MaxBase consts to vars.
    2. Addition of SetDigits and/or SetMaxBase function to math/big

Admittedly I don't know how many use cases there are for the 2nd point other than what I'm doing on fpe, however the flexibility of having arbitrary alphabets and bases could be useful in other cases.

Notes

CC: @griesemer

@gopherbot gopherbot added this to the Proposal milestone Aug 22, 2017
@griesemer
Copy link
Contributor

While it would be relatively simple to change digits and perhaps MaxBase - at least the MaxBase change would violate the Go 1 API guarantee since it would be a typed variable rather than an untyped constant. Furthermore, we wouldn't want digits to be a global variable - that would make it impossible to use the conversion functions concurrently with different bases. Thus, we would have to push the digits through all APIs which would require doubling that (API).

That said, arbitrary alphabet support for printing alone would be trivial (but for the API footprint increase). However, parsing (reading) would require at least some validation of the alphabet (to avoid parsing ambiguities). It gets complicated quickly.

A relatively easy change would be to increase MaxBase and expand the digits string with additional characters (and perhaps export it). I see that's what you have effectively done with your customized version of math/big. The result may not be using the desired alphabet, but it would be a linear pass over the result string to substitute characters as desired.

I think we might be able to do the latter. It's a trivial change, backward-compatible (ignoring the change of the MaxBase constant value for which we could allow an exception), and get you 90% there.

@anitgandhi
Copy link
Contributor Author

That's a fair point, I didn't realize that would break the Go 1 API.

Based on what you described I don't think the 2nd point would be a very desirable change unless others chime in with their needs and use cases where it would help. I wouldn't completely mind maintaining a separate customized version of math/big that does support my arbitrary alphabet and radix use case. I like the idea of the linear pass for character substitution.

That said, I do still feel the 1st point regarding the base62 "bug"/collision should be addressed with a CL. As you mentioned, the diff from the linked issue is a trivial and non-breaking change assuming the change in MaxBase from 36 to 62 is deemed acceptable.

I can provide the change for review if the above is approved.

@griesemer
Copy link
Contributor

griesemer commented Aug 23, 2017

Just to be clear: I am suggesting to simply increase the value of MaxBase to some higher value and internally expand the digits string. That's all.

Regarding the extension of the digit string: I don't know that simply adding the upper-case letters is the correct approach. In general, upper- and lower-case letters in numbers are considered to have the same value, at least for bases up to 16 (e.g., 0xef == 0xEF). Thoughts?

@rsc
Copy link
Contributor

rsc commented Aug 28, 2017

There are two issues:

  1. MaxBase may or may not need changing. (We could just accept > 36 without changing it.)

  2. We need to figure out what to do for > 36 as far as whether upper case comes before lower case or not. Are there any existing standard wire formats that run in base-62 that would help us figure out whether to prefer upper-before-lower (i.e. ASCII sorting) or lower-before-upper? It looks like GMP uses upper-before-lower but Node defaults to lower-before-upper.

@anitgandhi
Copy link
Contributor Author

anitgandhi commented Sep 13, 2017

Circling back and re-reading this discussion, I feel it's still easiest to add 2 new APIs to big.Int :

  1. func (z *Int) TextWithCharset(charset string) (possibly add base as an additional param)
    • This is similar to the digits(charset) function from pre-Go 1.5 (1.6?)
  2. func (z *Int) SetStringWithCharset(s string, charset string) (possibly add base as an additional param)
    • validate the charset and/or base here
    • no automatic detection of hex/octal/binary based on prefixes

These could have been implemented separately but the current Int.Text(base) and Int.SetString(str, base) functions basically wrap the unexported nat.scan and nat.itoa :/

@rsc
Copy link
Contributor

rsc commented Sep 18, 2017

TextWithCharset etc introduces all sorts of problems with extended Unicode, normalization, and so on.

For a default, it sounds like we should do 0123456789abc...xyzABC...XYZ, because today we use 0123456789abc...xyz.

We already change unicode.Version to reflect changes in reality, so I think we should change MaxBase too. Perhaps we should also update the compat doc to point out that we can change these kinds of "documentation" constants.

@griesemer says he will implement this.

-rsc after discussion with @golang/proposal-review

@robpike
Copy link
Contributor

robpike commented Sep 19, 2017

I worry about setting global state with the original API suggested. If this is to go ahead, I agree the new methods should accept the alphabet to use.

@griesemer
Copy link
Contributor

@robpike Setting global state as initially proposed is a no-go. Instead, the suggested alternative solution here is to simply increase the value of MaxBase from 36 (= 10 + 26) to 62 (= 10 + 26 + 26), and add the upper-case letters to the (constant) alphabet used. If a different alphabet is required, it's a simple linear pass over the output to substitute characters.

@robpike
Copy link
Contributor

robpike commented Sep 19, 2017

@griesemer Understood.

@griesemer
Copy link
Contributor

griesemer commented Sep 19, 2017

@anitgandhi, @rsc It turns out this is more complicated than I originally thought: Providing support to convert from a number to a 62-base string representation is one thing; but we also need to provide the opposite which is to convert a 62-base string back to a number. Extending the current alphabet with "A" to "Z" is problematic because "a" and "A" mean the same for all bases up to 36 at the moment. Changing the meaning of "a" and "A" for bases 37 and up seems at least odd.

For instance, while the value 10 will be printed as "a" in base 36 and 37, reading "A" in base 36 currently means 10, while in base 37 it would have to have the value 36. This seems at least inconsistent.

Waiting for more input before moving ahead with this.

@griesemer griesemer added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Sep 19, 2017
@rsc
Copy link
Contributor

rsc commented Sep 19, 2017

I think it's OK to say that the alphabet is 0123456789abc..xyzABC..XYZ and that, as a convenience, for base <= 36, ABC..XYZ maps down to abc..xyz during SetString. I think 36 being a dividing line is entirely defensible.

I see your point that "A"(35) = "A"(36) != "A"(37), but any such cross-base equality is already limited to single-digit inputs (or inputs with leading zeros). That is, even though "A"(35) = "A"(36), it's already the case that "AA"(35) != "AA"(36). The discontinuity that causes "A"(36) != "A"(37) seems OK.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/65970 mentions this issue: math/big: provide support for conversion bases up to 62

@golang golang locked and limited conversation to collaborators Oct 6, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. Proposal Proposal-Accepted
Projects
None yet
Development

No branches or pull requests

5 participants