Make changes necessary to build IL asm grammar with Bison. #89704

Closed
wants to merge 4 commits into from

@kant2002 kant2002 commented Jul 31, 2023

To generate with Bison

cd src/coreclr/ilasm
yacc asmparse.y --output=prebuilt/asmparse.cpp

If bison is unacceptable due to licensing

cd src/coreclr/ilasm
byacc asmparse.y --output=prebuilt/asmparse.cpp

Thanks @hez2010 for the hint.

Related to #4776

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jul 31, 2023
ghost commented Jul 31, 2023

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Related to #4776

Author: kant2002
Assignees: -
Labels:

area-ILTools-coreclr, community-contribution

Milestone: -

@kant2002
Contributor Author

It would be great if my changes were tested with the internal yacc, so while the build infra prepares the necessary images and such, we can test this little experiment.

@kant2002 kant2002 changed the title from "Make changes necessary to build grammar with Bison." to "Make changes necessary to build IL asm grammar with Bison." Jul 31, 2023
@RIP-webmaster

Unrelated to this change, but does anyone know where the "Lexical tokens" entries in asmparse.grammar come from? Some of them look inaccurate. For example: ID - C style alphaNumeric identifier (e.g. Hello_There2). IlAsm identifiers can also contain the character "`" (backtick), while C-style identifiers can only contain letters, digits and underscores. Where is the exact meaning of the tokens specified?

Comment on lines +50 to +51
%token <int32> INT32_V /* 3425 0x34FA 0352 */
%token <int64> INT64_V /* 342534523534534 0x34FA434644554 */
Contributor

I would instead recommend INT32_T and INT64_T for naming.

Contributor Author

I assume _T means token here. I use _V for value. But whatever, let the maintainers decide.

Contributor

My 2 cents are: I personally like _V for value. If it were _T I would expect a token representing the type, not a value.

Contributor

"_V" is fine.

@@ -314,6 +314,7 @@ class AsmParse : public ErrorReporter
     friend char* nextBlank(_In_ __nullterminated char*);
     friend int ProcessEOF();
     friend unsigned __int8* skipType(unsigned __int8* ptr, BOOL fFixupType);
+    friend unsigned corCountArgs(BinStr* args);
Contributor

@hez2010 hez2010 Jul 31, 2023

This is unnecessary.
Instead, add a static unsigned corCountArgs(BinStr* args); declaration in grammar_before.cpp.

See 424e9a2
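
The suggestion above relies on an ordinary C++ rule: a `static` function can be forward-declared before its definition, as long as both sit in the same translation unit. Below is a minimal sketch of that pattern, assuming the prologue file, the generated parser and the helper's definition are all compiled as one translation unit. `ByteBuffer`, `CountArgs` and `ArgsPlusReturnSlot` are hypothetical stand-ins, not names from ilasm, and the helper body is a placeholder rather than the real `corCountArgs` logic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ilasm's BinStr; only what the sketch needs.
struct ByteBuffer {
    std::vector<unsigned char> bytes;
};

// Prologue (the role grammar_before.cpp plays in the comment above):
// a forward declaration with internal linkage, so code that appears
// before the definition can already call the helper.
static unsigned CountArgs(const ByteBuffer* args);

// Stand-in for a parser action in generated code that needs the helper.
unsigned ArgsPlusReturnSlot(const ByteBuffer* args)
{
    return CountArgs(args) + 1;
}

// Later in the same translation unit: the definition keeps `static`,
// so no friend declaration or linkage change is required.
// Placeholder body: it just reports the buffer size.
static unsigned CountArgs(const ByteBuffer* args)
{
    return args ? static_cast<unsigned>(args->bytes.size()) : 0u;
}
```

With this layout the definition does not need to lose `static`, which is why the friend declaration in the class becomes unnecessary.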

@@ -1557,7 +1557,7 @@ void FixupTyPars(BinStr* pbstype)
     FixupTyPars((PCOR_SIGNATURE)(pbstype->ptr()),(ULONG)(pbstype->length()));
 }
 /**************************************************************************/
-static unsigned corCountArgs(BinStr* args)
+unsigned corCountArgs(BinStr* args)
Contributor

Unnecessary change.

@kant2002
Contributor Author

@RIP-webmaster for example, how numbers are parsed as lexemes is defined here:

else if (IsDigit(curSym)
    || (curSym == '.' && IsDigit(Sym(nextchar(curPos))))
    || (curSym == '-' && IsDigit(Sym(nextchar(curPos)))))
{
    const char* begNum = curPos;
    unsigned radix = 10;
    neg = (curSym == '-'); // always make it unsigned
    if (neg) curPos = nextchar(curPos);
    if (Sym(curPos) == '0' && Sym(nextchar(curPos)) != '.')
    {
        curPos = nextchar(curPos);
        radix = 8;
        if (Sym(curPos) == 'x' || Sym(curPos) == 'X')
        {
            curPos = nextchar(curPos);
            radix = 16;
        }
    }
    begNum = curPos;
    {
        unsigned __int64 i64 = str2uint64(begNum, const_cast<const char**>(&curPos), radix);
        unsigned __int64 mask64 = neg ? UI64(0xFFFFFFFF80000000) : UI64(0xFFFFFFFF00000000);
        unsigned __int64 largestNegVal32 = UI64(0x0000000080000000);
        if ((i64 & mask64) && (i64 != largestNegVal32))
        {
            yylval.int64 = new __int64(i64);
            tok = INT64;
            if (neg) *yylval.int64 = -*yylval.int64;
        }
        else
        {
            yylval.int32 = (__int32)i64;
            tok = INT32;
            if(neg) yylval.int32 = -yylval.int32;
        }
    }
    if (radix == 10 && ((Sym(curPos) == '.' && Sym(nextchar(curPos)) != '.') || Sym(curPos) == 'E' || Sym(curPos) == 'e'))
    {
        unsigned L = (unsigned)(PENV->endPos - begNum);
        curPos = (char*)begNum + GetDouble((char*)begNum,L,&yylval.float64);
        if (neg) *yylval.float64 = -*yylval.float64;
        tok = FLOAT64;
    }
}
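
To make the integer part of the snippet above easier to follow, here is a minimal standalone sketch of the same decision: pick the radix from the prefix, then choose between a 32-bit and a 64-bit token using the two masks. It is an illustration only. `NumKind`, `NumToken` and `ClassifyInteger` are hypothetical names, `std::strtoull` stands in for ilasm's `str2uint64`, and the floating-point branch is left out.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical token kinds for this sketch; the real lexer uses grammar token ids.
enum class NumKind { Int32, Int64 };

struct NumToken {
    NumKind kind;
    std::int64_t value;
};

// Mirrors the integer branch of the quoted lexer code:
// optional '-', then "0x"/"0X" selects radix 16, a bare leading '0' selects
// radix 8, anything else is radix 10.
// Assumption: std::strtoull stands in for ilasm's str2uint64 helper.
NumToken ClassifyInteger(const std::string& text)
{
    const char* p = text.c_str();
    const bool neg = (*p == '-');
    if (neg) ++p;

    int radix = 10;
    if (p[0] == '0' && p[1] != '.') {
        ++p;
        radix = 8;
        if (*p == 'x' || *p == 'X') {
            ++p;
            radix = 16;
        }
    }

    const std::uint64_t magnitude = std::strtoull(p, nullptr, radix);
    // For negative literals bit 31 also counts as "too big for 32 bits",
    // except for the single magnitude 0x80000000 (the most negative int32).
    const std::uint64_t mask = neg ? 0xFFFFFFFF80000000ULL : 0xFFFFFFFF00000000ULL;
    const std::uint64_t largestNegVal32 = 0x0000000080000000ULL;

    if ((magnitude & mask) != 0 && magnitude != largestNegVal32) {
        // Negate in unsigned arithmetic to avoid signed overflow.
        const std::uint64_t bits = neg ? (0 - magnitude) : magnitude;
        return { NumKind::Int64, static_cast<std::int64_t>(bits) };
    }

    // Truncate to 32 bits, as the quoted code does with its (__int32) cast.
    const std::uint32_t low = static_cast<std::uint32_t>(magnitude);
    const std::uint32_t bits = neg ? (0u - low) : low;
    return { NumKind::Int32, static_cast<std::int32_t>(bits) };
}
```

One consequence of following the quoted logic literally: in this sketch a non-negative literal such as 4294967295 still fits the 32-bit mask, so it is classified as a 32-bit token whose truncated value is -1.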

Identifier rules

    _ValidCS[(unsigned char)'_'] = TRUE;
    _ValidCS[(unsigned char)'?'] = TRUE;
    _ValidCS[(unsigned char)'$'] = TRUE;
    _ValidCS[(unsigned char)'@'] = TRUE;
    _ValidCS[(unsigned char)'`'] = TRUE;
}
BOOL IsAlpha(unsigned x) { return (x < 128)&&_Alpha[x]; }
BOOL IsDigit(unsigned x) { return (x < 128)&&_Digit[x]; }
BOOL IsAlNum(unsigned x) { return (x < 128)&&_AlNum[x]; }
BOOL IsValidStartingSymbol(unsigned x) { return (x < 128)&&_ValidSS[x]; }
BOOL IsValidContinuingSymbol(unsigned x) { return (x < 128)&&_ValidCS[x]; }

which is used here

if (IsValidStartingSymbol(curSym))
{ // is it an ID
Its_An_Id:
    size_t offsetDot = (size_t)-1; // first appearance of '.'
    size_t offsetDotDigit = (size_t)-1; // first appearance of '.<digit>' (not DOTTEDNAME!)
    do
    {
        curPos = nextchar(curPos);
        if (Sym(curPos) == '.')
        {
            if (offsetDot == (size_t)-1) offsetDot = curPos - curTok;
            curPos = nextchar(curPos);
            if((offsetDotDigit==(size_t)-1)&&(Sym(curPos) >= '0')&&(Sym(curPos) <= '9'))
                offsetDotDigit = curPos - curTok - 1;
        }
    } while(IsValidContinuingSymbol(Sym(curPos)));

All of that is pretty standard lexer logic, encoded directly in the code.
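
As a rough illustration of the table-driven identifier check quoted above, here is a small standalone sketch. The five punctuation characters (including the backtick) come from the quoted snippet. Treating ASCII letters and digits as valid continuing characters, and rejecting a leading digit, are assumptions of this sketch, since the full table setup is not shown in the thread. `BuildContinuingTable` and `IdentifierPrefixLength` are hypothetical names, and the dotted-name handling of the real lexer is omitted.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Builds the table of characters that may continue an identifier.
// '_', '?', '$', '@' and '`' are taken from the quoted snippet; letters and
// digits are an assumption of this sketch.
static std::array<bool, 128> BuildContinuingTable()
{
    std::array<bool, 128> table{};
    for (char c = 'a'; c <= 'z'; ++c) table[static_cast<unsigned char>(c)] = true;
    for (char c = 'A'; c <= 'Z'; ++c) table[static_cast<unsigned char>(c)] = true;
    for (char c = '0'; c <= '9'; ++c) table[static_cast<unsigned char>(c)] = true;
    const char extras[] = {'_', '?', '$', '@', '`'};
    for (char c : extras) table[static_cast<unsigned char>(c)] = true;
    return table;
}

// Same shape as the quoted helper: anything outside 7-bit ASCII is rejected.
bool IsValidContinuingSymbol(unsigned x)
{
    static const std::array<bool, 128> table = BuildContinuingTable();
    return (x < 128) && table[x];
}

// Returns the length of the identifier at the start of `text`, or 0 if the
// text does not start with one. Assumption: an identifier cannot start with
// a digit.
std::size_t IdentifierPrefixLength(const std::string& text)
{
    if (text.empty()) return 0;
    if (text[0] >= '0' && text[0] <= '9') return 0;
    if (!IsValidContinuingSymbol(static_cast<unsigned char>(text[0]))) return 0;

    std::size_t len = 1;
    while (len < text.size()
           && IsValidContinuingSymbol(static_cast<unsigned char>(text[len]))) {
        ++len;
    }
    return len;
}
```

Under these assumptions a name like List`1 scans as a single identifier, which matches the observation earlier in the thread that IlAsm identifiers are wider than C-style ones.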

@@ -51,6 +51,7 @@ static char* newString(_In_ __nullterminated const char* str1);
 static void corEmitInt(BinStr* buff, unsigned data);
 static void AppendStringWithLength(BinStr* pbs, _In_ __nullterminated char* sz);
 static void AppendFieldToCustomBlob(BinStr* pBlob, _In_ BinStr* pField);
+static unsigned corCountArgs(BinStr* args);
Contributor

@hez2010 hez2010 Jul 31, 2023

Also append the lines below after this line:

extern void yyerror(_In_ __nullterminated const char*);
extern Instr* SetupInstr(unsigned short);
extern int yylex();

See 3bf3d70#diff-45997acd93aca6458c6a5b1ff5ebcc1eb2953ccff2062309d4c8592f366d0384

Contributor Author

Can you clarify why extern is needed?

Contributor

It's optional, you don't have to add the extern.
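
The reason `extern` is optional here is a general C++ rule: a function declaration at namespace scope has external linkage by default, so writing `extern` in front of it changes nothing. A tiny sketch, using the hypothetical function `NextToken` rather than anything from ilasm:

```cpp
// Two declarations of the same function. The first spells out `extern`,
// the second relies on the default; both are valid and equivalent.
extern int NextToken();
int NextToken();

// A single definition satisfies both declarations.
// Placeholder body: returns 1, 2, 3, ... on successive calls.
int NextToken()
{
    static int counter = 0;
    return ++counter;
}
```

This differs from variables, where `extern` is what turns a definition into a pure declaration.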

Contributor

@hez2010 hez2010 left a comment

LGTM

Comment on lines +50 to +51
%token <int32> INT32_V /* 3425 0x34FA 0352 */
%token <int64> INT64_V /* 342534523534534 0x34FA434644554 */
Contributor

@hez2010 hez2010 Aug 3, 2023

We don't have to change it if we use byacc (https://invisible-island.net/byacc/byacc.html) instead of bison.
It is also worth noting that using bison would create a license conflict between GPL and MIT.
So I would recommend reverting this change (the parser generated by byacc doesn't have an issue here) and using byacc instead.

Contributor

@TIHan TIHan Oct 13, 2023

BISON Conditions

I'm not a lawyer. @richlander , what do you think? This only applies to the parser that BISON generates from the grammar.

Contributor

Reverting this can help prevent developers from generating code using bison by accident (because it would just fail), which can make sure we won't have license issues.

Contributor

Personally, I think it would be better to use byacc because of the licensing.

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Sep 5, 2023
@JulieLeeMSFT
Member

@TIHan please review this community PR.

Contributor

@TIHan TIHan left a comment

Overall, this looks like a very simple change, though I do not know what the implications are of BISON being GPL and what that means for the parsers that are generated.

We need to get some insight from @richlander.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 13, 2023
@TIHan
Contributor

TIHan commented Oct 13, 2023

@kant2002 , we are going to check to see if the generated parser from BISON will be license compatible. If it isn't, will byacc work?

@ghost ghost added the no-recent-activity label Oct 28, 2023
ghost commented Oct 28, 2023

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

ghost commented Nov 11, 2023

This pull request will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the pull request, but please note that it will be locked if it remains inactive for another 30 days.

@ghost ghost closed this Nov 11, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 12, 2023
@TIHan TIHan reopened this Jan 10, 2024
@ghost ghost removed the no-recent-activity label Jan 10, 2024
@TIHan
Contributor

TIHan commented Jan 10, 2024

Re-opening. We can use this as long as we don't remove the special exception text that BISON generates; this is very important.

@TIHan
Contributor

TIHan commented Jan 10, 2024

@kant2002 we are willing to accept this. Can you generate the new parser and make it part of this PR?

@dotnet dotnet unlocked this conversation Jan 10, 2024
@TIHan
Contributor

TIHan commented Jan 10, 2024

I unlocked the thread, sorry about that. I didn't realize it was locked.

@kant2002
Contributor Author

Let me check. Give me a couple of days to find time for this one.

@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 11, 2024
@TIHan
Contributor

TIHan commented Jan 11, 2024

No worries and no rush.

@JulieLeeMSFT
Member

@kant2002, this is a kind reminder.

@kant2002 we are willing to accept this. Can you generate the new parser and make it part of this PR?

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 25, 2024
Contributor

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

Contributor

This pull request will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the pull request, but please note that it will be locked if it remains inactive for another 30 days.

@dotnet-policy-service dotnet-policy-service bot removed this from the 9.0.0 milestone Apr 23, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 23, 2024