Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TokenSet, First(), Follow() #3

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

SamDelaney
Copy link
Member

NEW:
A) The First and Follow methods in NonTerminals.
B) TokenSet.cs
C) Various changes implemented to help with A) and B) (marked with //SAM CHANGE for easy location)

Notes:
The newFromUnion() method in TokenSet.cs has two implementations:
One, based on the C compiler, is the one currently in use. This one I believe to be more efficient.
The other has been commented out. This one is more readable in my opinion.

These changes should be ready to be committed into master. For now I will leave them in a separate branch for any desired review before that happens.

NEW:
A) The First and Follow methods in NonTerminals.
B) TokenSet.cs
C) Various changes implemented to help with A) and B) (marked with //SAM CHANGE for easy location)

Notes:
The newFromUnion() method in TokenSet.cs has two implementations:
One, based on the C compiler, is the one currently in use. This one I believe to be more efficient.
The other has been commented out. This one is more readable in my opinion.

These changes should be ready to be committed into master. For now I will leave them in a separate branch for any desired review before that happens.
@SamDelaney SamDelaney changed the title Add files via upload TokenSet, First(), Follow() May 18, 2017
NonTerminals.cs Outdated

uint Count() {
uint total = 0;
foreach (String s in Enum.GetNames(typeof(Production))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C# we can use the Length() method to get the number of items in an enumeration:

Enum.GetNames(typeof(Production)).Length;

NonTerminals.cs Outdated
p += alternateSetOffset;
} /* end if */

return tokenset;
Copy link
Member

@trijezdci trijezdci May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always returns null.

We need initialisers for all the FIRST sets, ideally in an array where the index matches the value of p.

see
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-48
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-231

once we have the initialisers, we can get the set from the array

index = Convert.ToUInt32(p) + alternateSetOffset;
...
tokenset = firstSet[index];
...

NonTerminals.cs Outdated
} /* end if */

return tokenset;
} /* end FOLLOW */
Copy link
Member

@trijezdci trijezdci May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too, always returns null.

We need initialisers for all the FOLLOW sets, in the same manner as for the FIRST sets.

see
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-139
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-production.c#m2-production.c-325

and accordingly ...

index = Convert.ToUInt32(p) + alternateSetOffset;
...
tokenset = followSet[index];
...

NonTerminals.cs Outdated

TokenSet FIRST(Production p) {
TokenSet tokenset = null;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add uint index = 0;

NonTerminals.cs Outdated

if (IsConstParamDependent(p))
{
p += alternateSetOffset;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be index = Convert.ToUInt32(p) + alternateSetOffset;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason you like Convert.ToUInt32(p) over (uint)p?
Should run a find & replace and change all of the code to one or the other?

NonTerminals.cs Outdated
} /* end if */
if (IsVariantRecordDependent(p) && CompilerOptions.VariantRecords())
{
p += alternateSetOffset;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as for line 119

NonTerminals.cs Outdated

TokenSet FOLLOW(Production p) {
TokenSet tokenset = null;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see line 116

NonTerminals.cs Outdated
TokenSet tokenset = null;

if (IsConstParamDependent(p)) {
p += alternateSetOffset;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see line 119

NonTerminals.cs Outdated
p += alternateSetOffset;
} /* end if */
if (IsVariantRecordDependent(p) && !CompilerOptions.VariantRecords()) {
p += alternateSetOffset;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see line 119

NonTerminals.cs Outdated
lastOptionDependent = lastNoVariantRecDependent;
public static int alternateSetOffset = (int)lastOptionDependent - (int)firstOptionDependent + 1;

/* --------------------------------------------------------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a private array with the initialisers for all FIRST and FOLLOW sets here. Ideally all initialisation will take place at compile time. The FIRST() and FOLLOW() functions should then compute the index of the appropriate set in the private array, fetch it from the array and return it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where to pull the values for the first and follow sets from. It's possible I'm just completely missing something obvious. Can you point me where I need to be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to write standalone programs that generate the code. The C versions are

https://bitbucket.org/trijezdci/m2c-rework/src/tip/gen_first_sets.c
https://bitbucket.org/trijezdci/m2c-rework/src/tip/gen_follow_sets.c

The C versions of the generated code are

https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-first-set-inits.h
https://bitbucket.org/trijezdci/m2c-rework/src/tip/m2-follow-set-inits.h

In C we pull them in via #include directives, but in C# there is no such thing. We could use templates but for now let's simply copy-paste the generated code in.

return null;
} /* end if */
return token.ToString();
} /* end LexemeForResword */
Copy link
Member

@trijezdci trijezdci May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this matters less for reserved words, for we use interned strings for all lexemes within the compiler. For consistency we should also apply this to reserved words.

The reason for interning strings is that at any point within the compiler we can compare two strings with a single identity test instead of having to string-compare them.

For example, the identifiers of modules and functions in Modula-2 are repeated at the end of their declarations:

PROCEDURE Foobar ( baz : Bam );
...
END Foobar;

When checking that the END part matches the procedure header, we can simply test for identity:

ident1 = Lexer.CurrentLexeme();
...
ident2 = Lexer.CurrentLexeme();
...
if (ident1 == ident2) {
  /* identifiers match */
}

If we use the To.String() method, the strings won't be interned.

In the C version, I defined private arrays that contain the strings, then calculate the index for the array, fetch the string and return it. Even though C# has methods for interning strings, it is preferable to use the same approach as in the C version. Ideally, the string arrays should be initialised at compile time.

Copy link
Member Author

@SamDelaney SamDelaney May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created the array lexemeTable in TokenSet.cs to do the same job as your string arrays. I simply referenced lexemeTable in my fix, but because it's static I could easily move lexemeTable into Terminals.cs and then reference it as Terminals.lexemeTable in TokenSet if you would rather.

Copy link
Member

@trijezdci trijezdci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work thus far, but we need initialisers for the sets. For details see my comments within the code.

TokenSet.cs Outdated

/* out-of-range guard */

"\0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this guard is only needed in C, not in C#.

TokenSet.cs Outdated
public static int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1;

public struct opaque
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the style we use throughout the project is ...

foobar {
  ...
} /* end foobar */

TokenSet.cs Outdated

public static int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1;

public struct opaque
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use a more descriptive name here, possibly also change the struct to something else.

An opaque type is a type whose name is defined in a public interface and whose implementation is hidden. Instances of the type are opaque pointers to hidden data, that is, the pointer is exposed to clients but the structure that it points to is unknown to them.

The C syntax to declare opaque types is not self explanatory. It isn't immediately obvious what the type declaration means. For this reason we use a naming convention in C that makes it explicit in the name that the type is an opaque type. Thus a type we might otherwise have called tokenset_t is then called tokenset_opaque_t instead.

Both the structure and the naming needs to be appropriately adjusted in the C# version. In C# the primary means to hide data is the class. The struct is simply a completely hidden type, it is not exposed to the client. The instance of the hidden type is an instance variable.

The advantage of keeping the struct is that it can be mapped to the initialisation data which we want to be struct based so we can initialise it using a structured literal like { 0x0, 0x0, 0x3, 2 }. Otherwise we could simply declare separate instance vars, one for each segment, one for the count.

private struct TokenSetBits {
  uint bits64to95;
  uint bits32to63;
  uint bits0to31;
  uint count;
} /* end TokenSetBits */

alternatively

private struct TokenSetBits {
  uint[SegmentCount] segments;
  uint count;
} /* end TokenSetBits */

which allows us to address the segments by index (desirable), but then SegmentCount has to be a compile time constant and the value needs to be calculated at compile time (absolute must), not at runtime.

Terminals.cs Outdated
} /* end switch */
break;

case /* length 6 */ 6 :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had added reserved word OPAQUE here. You need to copy the section for all the reswords of length 6.


TokenSet[] followSet = new TokenSet[Count()];
TokenSet[] firstSet = new TokenSet[Count()];
public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constants firstConstParamDependent, lastConstParamDependent, firstNoVariantRecDependent, lastNoVariantRecDependent, firstOptionDependent, lastOptionDependent and alternateSetOffset are only used by the methods of this class, we don't need to expose them to clients. They ought to be private.

TokenSet[] followSet = new TokenSet[Count()];
TokenSet[] firstSet = new TokenSet[Count()];
public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please set your editor to expand tab into two spaces [ASCII(32)], do not let it insert any TABs [ASCII(9)].

Lines should be broken as follows

public Foobar foobar =
  foo = Foobar.Foo,
  bar = Foobar.Bar,
  baz = Foobar.Baz;

etc

public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1;

#region followSets
TokenSet[] followSets = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIRST sets should come first, FOLLOW sets last.


new TokenSet( /* bits: */ 0x58000004, 0x00000242, 0x00000050, /* counter: */ 9 ), /* typeDeclarationTail */
};
#endregion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all the //SAM CHANGE comments. They have no documentary value for users and maintainers. For any temporary comments during review, we can use Github annotation features.

imp/TokenSet.cs Outdated
@@ -175,6 +175,20 @@ private TokenSet()
// no operation
} /* end TokenSet */

/* ---------------------------------------------------------------------------
* private constructor TokenSet (uint[])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have constructor methods newFromList() and newFromUnion(). As you can see in line 170, I have made the default constructor TokenSet() private in order to prevent clients of the class from ever being able to use any other constructors than newFromList() and newFromUnion().

Your comment says "private" but your code declares it public. But even as a private method it makes no sense because the two constructors newFromList() and newFromUnion() use the private default constructor in lines 201 and 248.

imp/TokenSet.cs Outdated
dataStored.segments[2] = TSB[2];
dataStored.elemCount = TSB[3];
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 205 says /* allocate new set */ but that is not true because newSet is already allocated in line 201.

imp/TokenSet.cs Outdated
dataStored.segments[2] = TSB[2];
dataStored.elemCount = TSB[3];
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 206 allocates a new array of segments, but the array is already allocated in line 161.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like the name dataStored. It is entirely non-descriptive.

If you had no idea what this code is about and all you saw was a single line of code that references the struct like

newSet.dataStored.segments[segmentIndex] = 0;

does the "dataStored" part in there add any information or does it add confusion?

Either use a descriptive name, like bits ...

newSet.bits.segments[segmentIndex] = 0;
newSet.bits.elemCount = 0;

... or get rid of the struct and use just segments and elemCount ...

newSet.segments[segmentIndex] = 0;
newSet.elemCount = 0;

The latter is generally preferable UNLESS we need to assign a compound literal directly to the struct, like so

newSet.bits = { 0x0, 0x0, 0x11, 2 };

I have not seen your code representing gen_first_sets.c and gen_follow_sets.c. It depends on how you implement those whether you want segments and elemCount to be within a struct or separate instance variables.

Copy link
Member

@trijezdci trijezdci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me how you generated the FIRST and FOLLOW set literals. Did you copy-paste them from the C code? If so, that is not what we want. The project needs to be self contained, thus it needs to have its own standalone programs to generate the FIRST and FOLLOW set literals.

imp/TokenSet.cs Outdated
@@ -154,7 +154,7 @@ public class TokenSet : ITokenSet {

}; /* end m2c_token_name_table */

public static const int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1;
public const int segmentCount = (Enum.GetNames(typeof(Token)).Length / 32) + 1;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constants and variables within an implementation of a class should always be private. If a client of the class needs to read the constant or variable, then there should be a public function method to return its value. If a client of the class needs to write to a variable, then there should be a public mutator method to overwrite its value. Such public methods will then need to be added to the public interface of the class.

imp/TokenSet.cs Outdated
* ---------------------------------------------------------------------------
* Used to instantiate the prefabricated first and follow lists.
* Expects TSB[segmentCount] to be the number of tokens, and returns null if
* the count does not match.
* ------------------------------------------------------------------------ */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* --------------------------------------------------------------------------
 * constructor newFromRawData( bits95to64, bits63to32, bits31to0, counter )
 * --------------------------------------------------------------------------
 * Returns a newly allocated tokenset object from raw data passed in as
 * three data segments of 32 bits from highest to lowest, followed by a bit
 * counter. Returns null if the input data is inconsistent.
 * ----------------------------------------------------------------------- */

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from highest to lowest

do you want me to change my code to run through the segments in reverse, or do you just want the bit values to be printed out in a different order?

imp/TokenSet.cs Outdated
}
TokenSet newSet = new TokenSet();
uint counter = 0;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First we need to test if the number of arguments is correct, return null if it is not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, we check the last argument against the largest number of bits that can be represented by segmentCount*32 bits. If the counter is greater, then we can already tell it is wrong, forego any further expensive checks and return null.

imp/TokenSet.cs Outdated
uint counter = 0;

for (int i = 0; i < segmentCount; i++) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next, we iterate only over the first count-1 arguments, because the last argument is the counter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check;

        if(args.Length != segmentCount + 1) {
            return null;
        } /* end if */

so segmentCount should always be the same as args.Length - 1.

Regardless, changed the code to:

        for (int segmentIndex = 0; segmentIndex < args.Length - 1; segmentIndex++) {

imp/TokenSet.cs Outdated
uint counter = 0;

for (int i = 0; i < segmentCount; i++) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next, we only iterate over the first count-1 arguments because the last argument is the counter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, we should use self explanatory names. The loop variant is an index to address segments. A self explanatory name would be segmentIndex.

imp/TokenSet.cs Outdated

for (int i = 0; i < segmentCount; i++) {

for (int j = 0; j < 32; j++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop variant is an index to address individual bits, a self explanatory name would be bitIndex.

imp/TokenSet.cs Outdated
* ------------------------------------------------------------------------ */

public TokenSet(params uint[] TSB)
public static TokenSet newFromRawData(params uint[] TSB)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TSB is a non-descriptive name, also it goes against C# naming conventions to have a parameter in all-uppercase. Since the variadic argument list includes both the bit patterns and a bit counter, there isn't really any single name that will be specific, so we have to settle for a generic name. I generally use args in such cases.


new TokenSet( /* bits: */ 0x58000004, 0x00000242, 0x00000050, /* counter: */ 9 ), /* typeDeclarationTail */
TokenSet.newFromRawData( /* bits: */ 0xb0000008, 0x00000484, 0x00000140, /* counter: */ 9 ), /* typeDeclarationTail */
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the bit pattern values have doubled relative to the C version, indicating that they were left-shifted by one. A probably cause might be an off-by-one index to address the bits. That is to say, any bit stored at position n should have been stored at position n-1.

Copy link
Member Author

@SamDelaney SamDelaney May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm mistaken, this is because the ARGLIST token does not exist in the C compiler. The addition of said token early on in the enumeration shifts every following token's index in the enum by one.
Am I correct?

Copy link
Member Author

@SamDelaney SamDelaney May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the values also shift left again after the index of Backslash, another token which isn't included in the C compiler.

MOD, MODULE, NOT, OF, OPAQUE, OR, POINTER, PROCEDURE, QUALIFIED, RECORD,
REPEAT, RETURN, SET, THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH,
MOD, MODULE, NOT, OF, OPAQUE, OR, POINTER, PROCEDURE, QUALIFIED, RECORD, REPEAT,
RETURN, SET, THEN, TO, TYPE, UNTIL, VAR, WHILE, WITH,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep the original formatting -- the sources are formatted to 79 columns to ensure that there will be no vertical scrollbars even when viewed on a small screen laptop.

public static uint alternateSetOffset = (uint)lastOptionDependent - (uint)firstOptionDependent + 1;

#region followSets
TokenSet[] followSets = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIRST sets should come first, FOLLOW sets last.


namespace org.m2sf.m2sharp {

using System;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please follow the original formatting:

  • strictly NO TABs, only spaces.
  • two spaces indentation.
  • maximum 79 characters per line

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants