-
Notifications
You must be signed in to change notification settings - Fork 14
parser implementation notes
- Alphabet = Token
- Grammar rules are stored in a file
- Non-terminals are stored without
<>
-
->
is omitted from rules- It is clear that first string is LHS and rest is RHS
-
Handle Nullable non-terminals
- will help to calculate follow sets
-
Enumerate non-terminals also so that
FIRST[statements]
can be implemented since array indices should be integers.typedef enum {statements, index, expression, ....} nonterminals
-
Structure of
mapping table
-
String Enumerated value FLAG "statement" STATEMENT NT "switch" SWITCH T -
String => What I am reading from file
-
Flag => terminals or nonterminals
-
Enumerated value =>
typedef union{ terminal t; non_terminal nt; } symbol;
-
typedef struct{ symbol s; char str[MAX_LEXEME_LEN]; int tag; //terminal or non-terminal } mapping table[SIZE]; //SIZE is number of terminals + non-terminals
-
But since implementation will be in form of a hash table, keep size as a
prime number
near to actual value of SIZE.
-
-
RHS of any production is not of fixed length so should be implemented as a linked list
-
So set of productions will be implemented as an array of linked lists
- instead of linked list of linked lists, prefer array of linked lists
- will help in random access
-
structure for linked list node for RHS
-
typedef enum {T, NT} type_of_sym; struct rhsnode{ symbol s; type_of_sym flag; struct rhsnode *next; }; typedef struct rhsnode RHSNODE;
-
-
structure for LHS
-
typedef struct cell{ nonterminal sym; RHSNODE *head; } GRAMMAR[ NUM_OF_RULES ];
-
-
Grammar is an array of cells
-
Doing
GRAMMAR G;
allocates memory -- static memory allocation -
To get Dynamic memory allocation ( on heap ), define GRAMMAR as
*GRAMMAR
instead -
PARSE TABLE
- Rows are non-terminals
- Columns are terminals
- Entries are Rules
- So keep an
2d array of pointers
, pointing to rule's head node
- Implementing FIRST set.
- Symbols are enumerated now.
- Union is required to be done in O(1) time
- Use Bit masks
- So say FIRST(A) = {a}
- Write it as
FIRST[A] = 0; //Initialize as an empty set
mask = 1;
FIRST[A] |= (mask << a);
- It works because A and a (terminals and nonterminals) are enumerated
- I will also need the
positions set as 1
becuase I will need what elements are there in the FIRST set.
- One of nonterminals
- Done above
- RHS of a rule
-
A -> aBcd
andA -> DE
, both rules go in some other colums in the parse table
-
-
Predective, top-down parser
-
Table driven
-
Table creation done
-
Algo
-
Stack implementation
-
push(stk, S)
- push start symbol on Stack -
while(){ check top; if top(stk) is a terminal{ match with lookahead character from lexer if(match found){ get_next_token(): } } else if(nonterminal) { // say NT is "A" // and lookahead character is "a" check table T[A][a] if( there exists a valid rule){ pop off top of stack; push rule's RHS on to stack. } } }
-
While pushing on stack, care is to be taken to make parse tree successfully
-
-
Parse tree structure
-
trie - number of children aren't fixed, so dynamically allocate memort
-
need to use an auxilliary stack for the purpose
- A -> BCD, needs to push D, C , followed by B on main stack
- but node addition to trie is done in manner of B, C, followed by D.
-
Main stack stores our PDA
-
- have both head and tail pointer in cell. This will make insert o(1) instead of o(n)
- allocate grammar_t statically. Why use malloc unnecessarily and waste heap space?
- use hash table in get_symbol_val for string to enum value matching ( exactly like what we did for keyword lookup table)
- use a rhs_node for lhs symbol too. This makes the code uniform and cleaner. At the moment the code was very confusing to understand
- assign tag of node in get_symbol_val instead of get_rules. Testing for terminal/ non-terminal is happening twice and is hence redundant.
- we are doing a toupper(str(i)) in get_symbol_val. This is causing the tags of all symbols to be NT. That is why seg fault is occurring. So make the strings in terminal_string array lower case