parser implementation notes

Grammar

Alphabet = Token
Grammar rules are stored in a file
Non-terminals are stored without <>
-> is omitted from rules
- It is clear that first string is LHS and rest is RHS

Implementation

Handle Nullable non-terminals
- will help to calculate follow sets
Enumerate non-terminals also so that FIRST[statements] can be implemented since array indices should be integers. typedef enum {statements, index, expression, ....} nonterminals
Structure of mapping table
- String Enumerated value FLAG
  
  "statement" STATEMENT NT
  
  "switch" SWITCH T
- String => What I am reading from file
- Flag => terminals or nonterminals
- Enumerated value =>
```
    typedef union{
        terminal t;
        non_terminal nt;
    } symbol;
```
- ```
  typedef struct{
      symbol s;
      char str[MAX_LEXEME_LEN];
      int tag;    //terminal or non-terminal
  } mapping table[SIZE];  //SIZE is number of terminals + non-terminals
```
- But since implementation will be in form of a hash table, keep size as a prime number near to actual value of SIZE.
RHS of any production is not of fixed length so should be implemented as a linked list
So set of productions will be implemented as an array of linked lists
- instead of linked list of linked lists, prefer array of linked lists
- will help in random access

structure for linked list node for RHS

  typedef enum {T, NT} type_of_sym;
  
  struct rhsnode{
      symbol s;
      type_of_sym flag;
      struct rhsnode *next;
  };

  typedef struct rhsnode RHSNODE;

structure for LHS

  typedef struct cell{
      nonterminal sym;
      RHSNODE *head;
  } GRAMMAR[ NUM_OF_RULES ];

Grammar is an array of cells
Doing GRAMMAR G; allocates memory -- static memory allocation
To get Dynamic memory allocation ( on heap ), define GRAMMAR as *GRAMMAR instead
PARSE TABLE
- Rows are non-terminals
- Columns are terminals
- Entries are Rules
- So keep an 2d array of pointers , pointing to rule's head node

FIRST and FOLLOW Set Implementation

Implementing FIRST set.
Symbols are enumerated now.
Union is required to be done in O(1) time
Use Bit masks
So say FIRST(A) = {a}
- Write it as
- FIRST[A] = 0; //Initialize as an empty set
- mask = 1;
- FIRST[A] |= (mask << a);
- It works because A and a (terminals and nonterminals) are enumerated
I will also need the positions set as 1 becuase I will need what elements are there in the FIRST set.

Two types of FIRSt sets

One of nonterminals
- Done above
RHS of a rule
- A -> aBcd and A -> DE , both rules go in some other colums in the parse table

Parse tree creation

Predective, top-down parser
Table driven
Table creation done

Algo

Stack implementation
push(stk, S) - push start symbol on Stack

  while(){
      check top;
      if top(stk) is a terminal{
          match with lookahead character from lexer
          if(match found){
              get_next_token():
          }
      }
      else if(nonterminal) {
          // say NT is "A"
          // and lookahead character is "a"

          check table T[A][a]

          if( there exists a valid rule){
              pop off top of stack;
              push rule's RHS on to stack.
          }

      }

  }

While pushing on stack, care is to be taken to make parse tree successfully

Parse tree structure
- trie - number of children aren't fixed, so dynamically allocate memort
- need to use an auxilliary stack for the purpose
  - A -> BCD, needs to push D, C , followed by B on main stack
  - but node addition to trie is done in manner of B, C, followed by D.
- Main stack stores our PDA

Issues

have both head and tail pointer in cell. This will make insert o(1) instead of o(n)
allocate grammar_t statically. Why use malloc unnecessarily and waste heap space?
use hash table in get_symbol_val for string to enum value matching ( exactly like what we did for keyword lookup table)
use a rhs_node for lhs symbol too. This makes the code uniform and cleaner. At the moment the code was very confusing to understand
assign tag of node in get_symbol_val instead of get_rules. Testing for terminal/ non-terminal is happening twice and is hence redundant.
we are doing a toupper(str(i)) in get_symbol_val. This is causing the tags of all symbols to be NT. That is why seg fault is occurring. So make the strings in terminal_string array lower case

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

parser implementation notes

Grammar

Implementation

FIRST and FOLLOW Set Implementation

Two types of FIRSt sets

Parse tree creation

Issues

Clone this wiki locally

String	Enumerated value	FLAG
"statement"	STATEMENT	NT
"switch"	SWITCH	T