Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode int as string #1179

Closed
8 of 11 tasks
lmazuel opened this issue Jul 16, 2024 · 24 comments
Closed
8 of 11 tasks

Encode int as string #1179

lmazuel opened this issue Jul 16, 2024 · 24 comments
Assignees
Labels
epic feature New feature or request

Comments

@lmazuel
Copy link
Member

lmazuel commented Jul 16, 2024

Problemm:
We have a service team that requires string on the wire, but this is actually always an integer. Given the simplicity of converting string to int (and vice versa), we want to add support for declaring that conversion, and get emitters supporting it, to avoid manual code for service team cross-language.

Edit:

After discussion, it became clear that the encode parameter is an encoding, and therefore shouldn't be string, but something describing what the encoding will look like. In the usual case, we would want base-10, which would serialize 12 as "12". As base-10 is technically "decimal", we could use that term, but it might be conflated with the decimal type. The current consensus seems to be number, as a fair semantic of base-10:

model Foo {
  @encode("number")
  x: int
}

Original proposal at creation:

model Foo {
  @encode("string")
  x: int
}

TypeSpec

  1. compiler:core design:accepted triaged:core
    timotheeguerin

Spec

  1. weidongxu-microsoft
  2. docs lib:tcgc

Implementation

  1. lib:tcgc
  2. feature-request p0
    msyyc
  3. weidongxu-microsoft
  4. HRLC P1 RLC
    MaryGao
  5. 6 of 6
    DPG DPG/RLC v2.1
    archerzz
  6. tadelesh
@lmazuel lmazuel added the epic label Jul 16, 2024
@lmazuel lmazuel added feature New feature or request and removed needs-area labels Jul 16, 2024
@lmazuel lmazuel self-assigned this Jul 16, 2024
@ArthurMa1978
Copy link
Member

Hey @lmazuel , in the past, we encountered similar issue where a property was defined as a string, but its value was in an integer format. We requested the service team to correct the Swagger spec, as this was identified as a definition bug. Why are we now considering this a feature for TypeSpec?

@timotheeguerin
Copy link
Member

timotheeguerin commented Jul 17, 2024

one use case is representing int64 without loosing precision(for JS) or decimal

@weidongxu-microsoft
Copy link
Member

weidongxu-microsoft commented Jul 17, 2024

Below discussion is on the condition that emitter is to support a general feature.

Please let dev know whether we have such requirement, besides that Batch bug (where backend should send an int, but instead send a string and cannot fix that due to backward-compatibility requirement).

For the Batch bug, I'd recommend Batch service write contentLength: string in typespec, as they do return it as a JSON string (spec is wrong, and same error on their pre-typespec Swagger). Service could do customization for SDK on the getter of the property.


The full typespec would be

model Foo {
  @encode("string", string)
  x: int64;
}

Question:

  1. Do we only allow the 2nd parameter to be string for this use case? That is, we only allow client conversion from basic scalar type to string type.
    Or do we also allow e.g. @encode("float", float64) x: int32;
    We probably want a lint rule to avoid service write arbitrary scalar conversion like @encode("string", url) x: int32;
  2. Do we have a better name at @encode("string", )? E.g. on date-time, we write @encode("rfc7231"), which is very clear, that it is date-time encoded as rfc7231 text.

A few proposals, of different scope of support

  • Only allow @encode("string", string) x: int64 and @encode("string", string) decimal (include decimal128). Possible reason: certain language may not able to process these types (e.g. JS can only process safeint which is smaller than int64). Lint rule (in typespec-azure-core) to disallow all other options.
  • Allow wire type be string, and client type be basic scalar. Possible reason: a float32 can always be represented as a string (however there could be multiple ways of such representation, e.g, "1000.0" vs "+1000.00" vs 1,000.0, and backend must accept all of them as equivalent if the property is used as input), but not vice versa. Downside is that there is not really a RFC like rfc3339 on what string representation is acceptable, what is not -- there is no guarantee a language-specific Float.parseFloat can handle all possible float-in-string-format.
  • Allow wire type be any basic scalar, and client type too. If in runtime, a "string" on wire cannot be converted to "int32" on client, throw exception. Downside, this probably going too far.

PS: by "basic scalar", I include "numeric types", "string", "boolean" from built-in-types

@MaryGao
Copy link
Member

MaryGao commented Jul 17, 2024

In the general use cases, most often the numbers we are talking about would fit well inside a double's range and precision.

With above assumption, JS decided to use number to represent decimal(see decision). So if service team really cares about the precision loss, it would be recommanded to use @encode (typespec issue).

So talking about encode with int epic, do we have similar scenario where we really care about precision or some other reasons? Actually we have lots of definition for "type": "integer", "format": "int64" in swagger, but JS just generates them as number in HLC and it works well. My understanding is data range outside JS could represent may not be common cases.

Back to weidong's proposal, I would prefer the option 1 with only allowing encoding this with necessary types e.g int64 or decimal. I haven't see any values with case @encode("string", string) x: int8 and obviously service bug may not be a good reason. Another point is this may more align with what OpenAPI is trying to support because their format int64 has base type with number and string(link and issue).

@timotheeguerin
Copy link
Member

timotheeguerin commented Jul 17, 2024

Yeah this encoding would only be allowed to target string(in the same way that rfc7231 will complain if you don't target a string)

Additionally same comment as before I don't think string is the right key. We pretty much always encode as a string. I think "decimal" might be more accurate(as you could imagine an hex or octal encoding as well) but it could be confusing with the decimal type

@lmazuel
Copy link
Member Author

lmazuel commented Jul 17, 2024

@weidongxu-microsoft you're on point that this is contentLength of Batch ;)

Agree with you all, let's keep things simple and accept only the scenario string/int for now, there is no need to get complicated. And if someone one day comes with another conversion scenario, we can investigate at time the validity of that scenario.

@m-nash
Copy link
Member

m-nash commented Jul 17, 2024

@timotheeguerin are you suggesting @encode("decimal", string)? Meaning the type on the wire is a string, but the encoding is decimal format meaning a floating point number?

I kind of get this because decimal narrows the domain of the wire type string, much like rfc3339 does.

Is there a list of known name of an encoding as referenced by these docs so we know which values are available to choose from for this use case?

@lmazuel
Copy link
Member Author

lmazuel commented Jul 17, 2024

I understood @timotheeguerin as:

@encode("decimal", string)
foo: int32

would convert 12 to "12", while

@encode("oct", string)
foo: int32

would convert 12 to "0o14"

IOW, still use the string parameter to describe the format you want it to look like, and usually we use base 10 (or decimal)

@m-nash
Copy link
Member

m-nash commented Jul 17, 2024

Might be confusing to use decimal as this is usually a floating point number in most languages? But yeah that tracks with what I thought the suggestion was, and I agree with the direction.

@timotheeguerin
Copy link
Member

No not decimal as floating point decimal as decimal notation of numbers(how you write numbers normally)

Playground

But yes that was my point that decimal might be confusing

@timotheeguerin
Copy link
Member

Could maybe be

  • decimal-str
  • decimal-string
  • decimal-notation

Also on a related note with encoding we are thinking that using string values it not great and it is better to keep the enum values as the only way for new ones

So could have that(name tbd with above discussion as well)

enum NumericEncoding {
  decimal,
  hexdecimal,
  octal,
  binary,
}

@MaryGao
Copy link
Member

MaryGao commented Jul 18, 2024

let's keep things simple and accept only the scenario string/int for now, there is no need to get complicated. And if someone one day comes with another conversion scenario, we can investigate at time the validity of that scenario.

@lmazuel except int not sure we could scope this a little to consider the decimal type also? I believe this would be useful for service team who cares precision.

@weidongxu-microsoft
Copy link
Member

Maybe we could limit this to integer (int32, int64, safeint) to string, for now. There could be some nuance of the "decimal" representation, but may not be too bad.

If we allow float to string, there could be more questions, e.g. whether 2e10 a valid "decimal" for float -- looks not?

@timotheeguerin
Copy link
Member

timotheeguerin commented Jul 22, 2024

Missed the meeting notes here, problem isn't anymore about the decimal naming as much as deciding what to do all together.

THe current proposal is to have

enum NumericEncoding {
  base10,

 // With option to add the following in the future
  // base16,
  // base 8
}

That you can then use

@encode(NumericEncoding.base10)
a: int64;

The current question is

  1. Should we consider that those non safe values for http services be seriaalized as string by default. I personnaly think this is not a great idea on top of being a breaking change
  2. Should we maybe just allow @encode with no argument to mean encode to string with the default string encoding.
    • this does sound interesting but it also feels like this is the only use case. All other use of @encode is when we want to change the default encoding because that type already needed to be encoded to a string implicitly(utcDateTime, plainTime, etc.)

@timotheeguerin
Copy link
Member

timotheeguerin commented Jul 24, 2024

Proposed changed signature for decorator

extern dec encode(
  target: Scalar | ModelProperty,
  encoding: string | EnumMember,
  encodedAs?: Scalar
);
extern dec encode2(target: Scalar | ModelProperty, encodedAs: Scalar);
extern dec encode2(
  target: Scalar | ModelProperty,
  encodingOrEncodedAs: string | EnumMember | Scalar,
  encodedAs?: Scalar
);

And then you can call

@encode(string) val: int64

restrict this overload format to only numeric scalar types and string as the encodeAs

@weidongxu-microsoft
Copy link
Member

IMHO, I'd avoid @encode(string) directly, if possible. If we had to, limit it only to a subset of scalar that there is a consensus of what is acceptable as string representative.

I think we should be clear about the conversion rule (it need not to be an RFC, but we should be explicit about what is a valid string representation, what is not.

I am not aware that there is consistency between different languages, when convert string to/from a scalar.

Say convert String to Boolean, when de-serialize the JSON).
Java's Boolean.parseBoolean handles true case-insensitively, and everything else is false.
https://docs.oracle.com/javase/8/docs/api/java/lang/Boolean.html#parseBoolean-java.lang.String-

,NET's Boolean.TryParse only handle True or False. Exception if other values.
https://learn.microsoft.com/en-us/dotnet/api/system.boolean.tryparse?view=net-8.0#system-boolean-tryparse(system-string-system-boolean@)

Golang also accept t f and 1 0, case-insensitive. Exception if other values.
https://pkg.go.dev/strconv#ParseBool

@timotheeguerin
Copy link
Member

We are limiting that to numeric types and it means the decimal representation of the number as if you were to serialize the number to a json string

@weidongxu-microsoft
Copy link
Member

Yeah, that works. Or we may say that JSON numeric wrapped in quote is acceptable.

@MaryGao
Copy link
Member

MaryGao commented Jul 25, 2024

Maybe irrelevant to the detailed design. But I'd like to clarify one thing - what does the typespec type stand for?

model Foo {
  @format("date-time-rfc1123") data1: string;
  @encode(string) val: int64;
}
  • with @format, we have string as wire type and emitter could leverage format hint to define client type
  • with @encode, we have int64 as client type and emitter could leverage encode to understand wire type

@timotheeguerin
Copy link
Member

@format is to be seen as a "known pattern" ( a known string validation format) thats it it doesn't affect the type. Here data1 is a string and should always be presented as a string. There is just some decorator that say you should validate this string is in this format(which our client do not do so basically @format is not something you should ever be concerned with)

@MaryGao
Copy link
Member

MaryGao commented Jul 25, 2024

Thanks for the explaination and it makes sense considering we have guideline on no client-side validation. So can I understand typespec type stands for client type or the type we want to show to our customers?

@timotheeguerin
Copy link
Member

Not sure I see the distinction here client type or the type we want to show to our customers

The typespec type represent the logic type. In some client/server/other target you might not be able to represent it as that logical type so you might choose to keep it as the wire type(e.g. @encode(string) a: int64 in JS)

@MaryGao
Copy link
Member

MaryGao commented Jul 25, 2024

I think the logic type is what I mean here. One more question - with below example we would expect different logic type if emitter could somehow differentiate the utcDateTime or offsetDateTime, but on the wire they are the same format?

model Foo {
  data1: utcDateTime;  // RFC3339 for json (default)
  @encode(DateTimeKnownEncoding.rfc3339) data2: offsetDateTime;
}

@timotheeguerin
Copy link
Member

Yeah they both use RFC3339 but data1 should always end with Z and data2 might have the offset (e.g.+8:00)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic feature New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants