[Proposal] ECMA-335 specification for "reflection-notation" serialized type names #4416

Open
davkean opened this issue Aug 5, 2015 · 12 comments
Labels
area-Meta documentation Documentation bug or enhancement, does not impact product or test code

@davkean
Member

davkean commented Aug 5, 2015

Updates

  • Update 1: Removed reference to assemblyQualifiedNameWithinTypeArgument, which was left over from previous iterations.
  • Update 2: Disallowed raw "]" completely in assembly name identifiers to simplify the spec. These now have to be escaped regardless of whether the assembly name appears in a generic type argument or not.
  • Update 3: Merged assembly names and type names into single format.
  • Update 4: Clarified difference between SerString and the canonical form. Misread the spec.

Rationale

This is a proposal that attempts to fully specify reflection-notation serialized types for inclusion in ECMA-335 (referred to from here on as "the CLI").

In metadata, when a type is persisted as the value of a fixed or named argument, such as in the following code block, it is serialized in a SerString in its canonical form.

    [Export(typeof(ILogger))]

SerString and the canonical form are documented like so (see _II.23.3 Custom attributes_):

  • If the parameter kind is string, (middle line in above diagram) then the blob contains a SerString – a PackedLen count of bytes, followed by the UTF8 characters. If the string is null, its PackedLen has the value 0xFF (with no following characters). If the string is empty (“”), then PackedLen has the value 0x00 (with no following characters).
  • If the parameter kind is System.Type, (also, the middle line in above diagram) its value is stored as a SerString (as defined in the previous paragraph), representing its canonical name. The canonical name is its full type name, followed optionally by the assembly where it is defined, its version, culture and public-key-token. If the assembly name is omitted, the CLI looks first in the current assembly, and then in the system library (mscorlib); in these two special cases, it is permitted to omit the assembly-name, version, culture and public-key-token.
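The SerString layout quoted above can be sketched in a few lines of Python. This is only an illustration, not code from the proposal: the function names `read_packed_len` and `read_ser_string` are made up here, and the one-, two- and four-byte length forms follow the compressed unsigned integer encoding of ECMA-335 II.23.2.

```python
def read_packed_len(blob, pos):
    """Return (value, next_pos) for a compressed unsigned integer."""
    first = blob[pos]
    if first & 0x80 == 0:        # 0xxxxxxx: value fits in one byte
        return first, pos + 1
    if first & 0xC0 == 0x80:     # 10xxxxxx: two bytes, big-endian payload
        return ((first & 0x3F) << 8) | blob[pos + 1], pos + 2
    if first & 0xE0 == 0xC0:     # 110xxxxx: four bytes, big-endian payload
        value = ((first & 0x1F) << 24) | (blob[pos + 1] << 16) \
            | (blob[pos + 2] << 8) | blob[pos + 3]
        return value, pos + 4
    raise ValueError("invalid PackedLen prefix")


def read_ser_string(blob, pos=0):
    """Return (string_or_None, next_pos) for a SerString starting at pos."""
    if blob[pos] == 0xFF:        # null string: a single 0xFF byte, no characters
        return None, pos + 1
    length, pos = read_packed_len(blob, pos)
    # An empty string is PackedLen 0x00 followed by no characters.
    return blob[pos:pos + length].decode("utf-8"), pos + length
```

Note that this only recovers the string itself. For a System.Type argument the decoded text is the canonical name, whose internal structure is what the rest of this proposal tries to pin down.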

The last paragraph is underspecified and does not provide enough information for metadata readers or other inspectors to consume and interpret this canonical form.

The documentation for Type.GetType also attempts to document a similar format, but it also falls short. And while nothing in the CLI or on MSDN indicates a relationship between the canonical name and the type name you pass to Reflection's Type.GetType, they are clearly related.

Based on this I've attempted to write up the grammar that makes up these formats into a single format. Note that I've used a custom form of BNF (Backus-Naur Form); if that puts an unpleasant taste in your mouth, I'm sorry in advance. :)

My hope is first to work towards an agreement on the format, and then move on to figuring out how to actually represent and document this within the CLI itself (that's where I hope @CarolEidt comes in).

Proposed Format

Format of a full type name or assembly-qualified name in "reflection-notation"

 The key is as follows: 

      Symbol:     <name> 
      Optional:   [<name>] 
      Literal:    "," 
      Or:         <pointer> 
                  <array> 

  <format> ::= 
      <assemblyQualifiedName> 
      <fullName> 

  <assemblyQualifiedName> ::= 
      <fullName> "," <assemblyName> 

  <fullName> ::=  
      <declaringTypeName>[<nestedTypeNames>][<genericTypeArguments>][<pointerOrArray>][<byReference>]

  <declaringTypeName> ::= 
      <simpleTypeName> 

  <nestedTypeNames> ::= 
      [<nestedTypeNames>] "+" <nestedTypeName> 

  <nestedTypeName> ::= 
      <simpleTypeName> 

  <simpleTypeName> ::= 
      [<whitespace>] <identifier> 

  <genericTypeArguments> ::= 
      "[" <genericTypeArgumentsList> "]" 

  <genericTypeArgumentsList> ::= 
      [<genericTypeArgumentsList> ","] <genericTypeArgument> 

  <genericTypeArgument> ::= 
      <genericTypeArgumentFullName> 
      <genericTypeArgumentAssemblyQualifiedName> 

  <genericTypeArgumentFullName> ::= 
      <fullName> 

  <genericTypeArgumentAssemblyQualifiedName> ::= 
      "[" <assemblyQualifiedName> "]" 

  <pointerOrArray> ::= 
      [<pointerOrArray>]<pointer> 
      [<pointerOrArray>]<array> 

  <byReference> ::=  
      "&" 

  <pointer> ::=  
      "*" 

  <array> ::= 
      <szArray> 
      <singleDimensionalArray> 
      <multiDimensionalArray>  

  <szArray> ::= 
      "[]" 

  <singleDimensionalArray> ::= 
      "[*]" 

  <multiDimensionalArray> ::= 
      "[" <arrayDimensionSeparator> "]" 

  <arrayDimensionSeparator> ::=  
      [<arrayDimensionSeparator>] "," 

  <identifier> ::= 
      [<identifier>]<identifierChar> 
      [<identifier>]<escapedChar> 

  <identifierChar> ::= 
      any unicode character except <delimiter> 

  <escapedChar> ::= 
      "\" <delimiter> 

  <whitespace> ::= 
      [<whitespace>] " " 

  <delimiter> ::= 
      "*" 
      "[" 
      "]" 
      ","  
      "\" 
      "&" 
      "+" 

  <assemblyName> ::= 
      <name>[<components>] 

  <name> ::= 
      [<whitespace>] <identifierOrQuotedIdentifier> [<whitespace>] 

  <components> ::= 
      [<components>]<component> 

  <component> ::= 
      "," <componentName> "=" <componentValue> 

  <componentName> ::= 
      <identifierOrQuotedIdentifier> 

  <componentValue> ::= 
      """" 
      <identifierOrQuotedIdentifier> 

  <identifierOrQuotedIdentifier> ::= 
      <identifier> 
      """ <quotedIdentifier> """ 

  <identifier> ::= 
      [<identifier>]<identifierChar> 
      [<identifier>]<escapedChar> 

  <quotedIdentifier> ::= 
      [<quotedIdentifier>]<quotedIdentifierChar> 
      [<quotedIdentifier>]<escapedChar> 

  <quotedIdentifierChar> ::= 
      any unicode character except """ 

  <identifierChar> ::= 
      any unicode character except <delimiter> 

  <escapedChar> ::= 
      "\" <delimiter> 

  <whitespace> ::= 
      [<whitespace>] " " 

  <delimiter> ::= 
      "," 
      "=" 
      """ 
      "\" 
      "]"

Notes

I've written an implementation of a decoder for the above format for inclusion as part of System.Reflection.Metadata 1.2.
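To give a feel for what a reader of this format has to do, here is a minimal Python sketch of the first step only: separating `<fullName>` from `<assemblyName>` in an `<assemblyQualifiedName>`. It is not the System.Reflection.Metadata decoder mentioned above. The function name `split_type_name` is hypothetical, and the sketch assumes well-formed input (balanced brackets, no dangling backslash).

```python
def split_type_name(name):
    """Split at the first top-level, unescaped comma.

    Returns (full_name, assembly_name), where assembly_name is None
    when the input is a bare <fullName>.
    """
    depth = 0  # nesting level of "[" ... "]" (generic arguments or array ranks)
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            i += 2  # skip the backslash and the delimiter it escapes
            continue
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            # Commas inside brackets belong to generic arguments or to
            # multi-dimensional array ranks, so only depth 0 counts.
            return name[:i], name[i + 1:].strip(" ")
        i += 1
    return name, None
```

For example, `split_type_name("System.Int32[,]")` keeps the comma of the array rank inside the full name, while an assembly-qualified generic argument such as `[[System.Int32, mscorlib]]` stays attached to the type it parameterizes.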

Questions

  1. What do we do about types that are valid and can appear in metadata, but are not currently represented by either reflection or ildasm with a textual equivalent? For example, function pointers or modifiers? What does C++/CLI even persist when I pass long::typeid or (const int*)::typeid as the value of a fixed or named argument? Should we disallow them?
  2. Reflection has lots of corner-case issues and inconsistencies in how it handles certain things, such as trailing chars and unclosed quotes. What should we do about them? Should we mimic this in the spec? Or should we just spec the format to be a little tighter and treat these inconsistencies as quirks of Type.GetType?
    We've decided not to mimic these quirks. Writers will be held to the above format; readers can choose to allow more.
@davkean
Member Author

davkean commented Aug 5, 2015

tag @CarolEidt @nguerrera @tmat @AlekseyTs.

@davkean davkean changed the title [Proposal] EMCA-335 Specification for "reflection-notation" serialized type names [Proposal] EMCA-335 specification for "reflection-notation" serialized type names Aug 5, 2015
@davkean davkean changed the title [Proposal] EMCA-335 specification for "reflection-notation" serialized type names [Proposal] ECMA-335 specification for "reflection-notation" serialized type names Aug 5, 2015
@davkean
Member Author

davkean commented Aug 5, 2015

Also @AnthonyDGreen as it was mentioned he's looked into this before.

@AnthonyDGreen

In Roslyn we considered this as a readable persistable text format for SymbolId that could be used to specify any symbol. But the reflection/ILDASM format as I recall doesn't have a format for properties and events. Though I think it would be trivial to extend the format to support them.

@tmat
Member

tmat commented Aug 5, 2015

@davkean Re Q2: Reflection inconsistencies should be treated as quirks, imo. The spec should be based on writers, not readers. Writers are in this case the compilers. If there is something that the reader can read but it's never produced by a writer we care about (and thus invalid if we base the spec on the writers we care about) then that's just a quirk of the reader.

@nguerrera
Contributor

+1 to @tmat re Q2.

Some first thoughts...

  1. I suggest that we use the same notation as the spec uses for ILAsm syntax.
  2. The whole "(within generic type argument)" handling is hard to decipher: e.g.
    • assemblyQualified and assemblyQualifiedNameWithinTypeArgument look the same
    • Why is there both identifierChar with "(within generic type argument)" and identifierCharWithinTypeArgument, and why are they different?

@davkean
Member Author

davkean commented Aug 5, 2015

  1. As far as I can tell, ILAsm doesn't have a syntax. They cop out: basically, they have a mode where you pass an opaque string and they'll take it and just shove it in without any validation.
  2. Good catch - I changed the way this was represented (I used to branch for type names within generic type arguments) and didn't fix up all the places.

@nguerrera
Contributor

Re (1) : I mean the grammar notation in Ecma 335 where it describes the syntax of the IL assembly language.

@nguerrera
Contributor

Further clarification, I meant my (1) not Q1 in proposal, i.e. use the same notation as how ILAsm syntax is specified.

Also, I believe 'SerString' in the spec refers to how any such string (including string attribute values, not just type attribute values) is encoded as bytes. The spec doesn't seem to give the type representation a name. It just has an underspecified description in the second paragraph.

Finally, re: the "within generic type arguments" confusion, if this is just about the handling of '[' in assembly name components, I suggest we either ban it globally or require it to be escaped globally.

@davkean
Member Author

davkean commented Aug 5, 2015

@nguerrera All good feedback, have simplified the generic type arguments assembly name confusion; always now treat ']' as a delimiter. Clarified difference between SerString and the canonical form of the type name, misread the spec. Will look at doing the same notation as ILAsm.

@MichalStrehovsky
Member

Some quick observations about the captured grammar rules:

  • <identifier>, <identifierChar>, <escapedChar>, <whitespace>, and <delimiter> are defined twice, with different definitions.
  • There's an inconsistency in the <fullName> definition: on one hand it makes sure that the syntax doesn't allow invalid combinations with ByRefs (e.g. a ByRef as an element of an array), but on the other it allows invalid combinations of pointers and arrays (e.g. an unmanaged pointer to an array). Why not ban that in the syntax too?
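The second observation can be made concrete with a small Python sketch. It covers only the `[<pointerOrArray>][<byReference>]` tail of `<fullName>` from the proposal, written as a regular expression. The names `SUFFIXES` and `grammar_accepts_suffix` are invented for this illustration.

```python
import re

# <pointerOrArray> is any sequence of "*", "[]", "[*]" or "[,...]";
# <byReference> is a single optional trailing "&".
SUFFIXES = re.compile(r"(?:\*|\[\]|\[\*\]|\[,+\])*&?")


def grammar_accepts_suffix(suffix):
    """True if the proposed grammar accepts this type-name suffix."""
    return SUFFIXES.fullmatch(suffix) is not None
```

Under this reading, `grammar_accepts_suffix("&[]")` is false, because a ByRef can only come last, whereas `grammar_accepts_suffix("[]*")` is true even though it spells an unmanaged pointer to an array.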

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@ericstj ericstj removed the untriaged New issue has not been triaged by the area owner label Jun 25, 2020
@MSDN-WhiteKnight
Contributor

What does C++/CLI even persist when I pass long::typeid

[MyAttr(long::typeid)] in C++/CLI is compiled into the following IL:

  .custom instance void MyAttrAttribute::.ctor(class [mscorlib]System.Type) = ( 01 00 59 53 79 73 74 65 6D 2E 49 6E 74 33 32 2C   // ..YSystem.Int32,
                                                                                20 6D 73 63 6F 72 6C 69 62 2C 20 56 65 72 73 69   //  mscorlib, Versi
                                                                                6F 6E 3D 34 2E 30 2E 30 2E 30 2C 20 43 75 6C 74   // on=4.0.0.0, Cult
                                                                                75 72 65 3D 6E 65 75 74 72 61 6C 2C 20 50 75 62   // ure=neutral, Pub
                                                                                6C 69 63 4B 65 79 54 6F 6B 65 6E 3D 62 37 37 61   // licKeyToken=b77a
                                                                                35 63 35 36 31 39 33 34 65 30 38 39 00 00 )       // 5c561934e089..

(const int*)::typeid

This is invalid syntax (error C2059: syntax error : 'typeid'). But if we work around it like this:

typedef const int* PCONSTINT;
[MyAttr(PCONSTINT::typeid)]

The resulting IL is:

.custom instance void MyAttrAttribute::.ctor(class [mscorlib]System.Type) = ( 01 00 5A 53 79 73 74 65 6D 2E 49 6E 74 33 32 2A   // ..ZSystem.Int32*
		2C 20 6D 73 63 6F 72 6C 69 62 2C 20 56 65 72 73   // , mscorlib, Vers
		69 6F 6E 3D 34 2E 30 2E 30 2E 30 2C 20 43 75 6C   // ion=4.0.0.0, Cul
		74 75 72 65 3D 6E 65 75 74 72 61 6C 2C 20 50 75   // ture=neutral, Pu
		62 6C 69 63 4B 65 79 54 6F 6B 65 6E 3D 62 37 37   // blicKeyToken=b77
		61 35 63 35 36 31 39 33 34 65 30 38 39 00 00 )    // a5c561934e089..

So the C++/CLI compiler does not preserve modifiers or C++-specific types when serializing attribute typeid arguments. It just stores the closest managed type that corresponds to the passed type.
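The first blob above can be checked mechanically. The following Python sketch walks the bytes as a custom attribute value with a single fixed System.Type argument: the 0x0001 prolog, a length-prefixed UTF-8 string, and a 16-bit count of named arguments. The hex is copied from the long::typeid dump. The helper name `decode_single_type_argument` is hypothetical, and the sketch handles only the single-byte length form, which is all this blob needs.

```python
HEX = """
01 00 59 53 79 73 74 65 6D 2E 49 6E 74 33 32 2C
20 6D 73 63 6F 72 6C 69 62 2C 20 56 65 72 73 69
6F 6E 3D 34 2E 30 2E 30 2E 30 2C 20 43 75 6C 74
75 72 65 3D 6E 65 75 74 72 61 6C 2C 20 50 75 62
6C 69 63 4B 65 79 54 6F 6B 65 6E 3D 62 37 37 61
35 63 35 36 31 39 33 34 65 30 38 39 00 00
"""
BLOB = bytes.fromhex("".join(HEX.split()))


def decode_single_type_argument(blob):
    """Return (type_name, named_argument_count) for a one-argument blob."""
    if blob[0:2] != b"\x01\x00":      # prolog: 0x0001, little-endian
        raise ValueError("missing custom attribute prolog")
    length = blob[2]
    if length >= 0x80:                # assumption: one-byte length only
        raise ValueError("sketch handles only single-byte lengths")
    end = 3 + length
    type_name = blob[3:end].decode("utf-8")
    num_named = int.from_bytes(blob[end:end + 2], "little")
    return type_name, num_named
```

Calling `decode_single_type_argument(BLOB)` yields the assembly-qualified name of System.Int32 and a named-argument count of 0. The length byte 0x59 (89) matches the 89 characters of that name.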

@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 17, 2021
@MichalStrehovsky
Member

Thanks for checking it @MSDN-WhiteKnight! It corresponds to what I would expect to see. Modifiers pretty much don't matter outside signature matching.

They e.g. don't impact the LDTOKEN instruction either. Here is one of our tests checking that LDTOKEN of a type and LDTOKEN of a type with a modifier in it produce the same handle:

ldtoken string[]
stloc.0
ldloca 0
ldtoken string modopt (MyModifier)[]
call instance bool valuetype [System.Runtime]System.RuntimeTypeHandle::Equals(valuetype [System.Runtime]System.RuntimeTypeHandle)
brtrue StringArrayModifiedStringArrayOK
ldc.i4.1
ret

Adding a representation of modifiers to the SerString format would be unlikely to result in a meaningful improvement (I would expect the reflection stack to return types without the modifiers to match what LDTOKEN does, because both of these essentially map to the C# typeof).

@joperezr joperezr removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 17, 2021