Skip to content

parser implementation notes

Vishal Mittal edited this page Feb 20, 2020 · 2 revisions

Grammar

  • Alphabet = Token
  • Grammar rules are stored in a file
  • Non-terminals are stored without <>
  • -> is omitted from rules
    • It is clear that first string is LHS and rest is RHS

Implementation

  • Handle Nullable non-terminals

    • will help to calculate follow sets
  • Enumerate non-terminals also so that FIRST[statements] can be implemented since array indices should be integers. typedef enum {statements, index, expression, ....} nonterminals

  • Structure of mapping table

    • String Enumerated value FLAG
      "statement" STATEMENT NT
      "switch" SWITCH T
    • String => What I am reading from file

    • Flag => terminals or nonterminals

    • Enumerated value =>

          typedef union{
              terminal t;
              non_terminal nt;
          } symbol;
      
    •   typedef struct{
            symbol s;
            char str[MAX_LEXEME_LEN];
            int tag;    //terminal or non-terminal
        } mapping table[SIZE];  //SIZE is number of terminals + non-terminals
      
    • But since implementation will be in form of a hash table, keep size as a prime number near to actual value of SIZE.

  • RHS of any production is not of fixed length so should be implemented as a linked list

  • So set of productions will be implemented as an array of linked lists

    • instead of linked list of linked lists, prefer array of linked lists
    • will help in random access
  • structure for linked list node for RHS

    •   typedef enum {T, NT} type_of_sym;
        
        struct rhsnode{
            symbol s;
            type_of_sym flag;
            struct rhsnode *next;
        };
      
        typedef struct rhsnode RHSNODE;
      
  • structure for LHS

    •   typedef struct cell{
            nonterminal sym;
            RHSNODE *head;
        } GRAMMAR[ NUM_OF_RULES ];
      
  • Grammar is an array of cells

  • Doing GRAMMAR G; allocates memory -- static memory allocation

  • To get Dynamic memory allocation ( on heap ), define GRAMMAR as *GRAMMAR instead

  • PARSE TABLE

    • Rows are non-terminals
    • Columns are terminals
    • Entries are Rules
    • So keep an 2d array of pointers , pointing to rule's head node

FIRST and FOLLOW Set Implementation

  • Implementing FIRST set.
  • Symbols are enumerated now.
  • Union is required to be done in O(1) time
  • Use Bit masks
  • So say FIRST(A) = {a}
    • Write it as
    • FIRST[A] = 0; //Initialize as an empty set
    • mask = 1;
    • FIRST[A] |= (mask << a);
    • It works because A and a (terminals and nonterminals) are enumerated
  • I will also need the positions set as 1 becuase I will need what elements are there in the FIRST set.

Two types of FIRSt sets

  • One of nonterminals
    • Done above
  • RHS of a rule
    • A -> aBcd and A -> DE , both rules go in some other colums in the parse table

Parse tree creation

  • Predective, top-down parser

  • Table driven

  • Table creation done

  • Algo

    • Stack implementation

    • push(stk, S) - push start symbol on Stack

    •   while(){
            check top;
            if top(stk) is a terminal{
                match with lookahead character from lexer
                if(match found){
                    get_next_token():
                }
            }
            else if(nonterminal) {
                // say NT is "A"
                // and lookahead character is "a"
      
                check table T[A][a]
      
                if( there exists a valid rule){
                    pop off top of stack;
                    push rule's RHS on to stack.
                }
      
            }
      
        }
      
    • While pushing on stack, care is to be taken to make parse tree successfully

  • Parse tree structure

    • trie - number of children aren't fixed, so dynamically allocate memort

    • need to use an auxilliary stack for the purpose

      • A -> BCD, needs to push D, C , followed by B on main stack
      • but node addition to trie is done in manner of B, C, followed by D.
    • Main stack stores our PDA

Issues

  1. have both head and tail pointer in cell. This will make insert o(1) instead of o(n)
  2. allocate grammar_t statically. Why use malloc unnecessarily and waste heap space?
  3. use hash table in get_symbol_val for string to enum value matching ( exactly like what we did for keyword lookup table)
  4. use a rhs_node for lhs symbol too. This makes the code uniform and cleaner. At the moment the code was very confusing to understand
  5. assign tag of node in get_symbol_val instead of get_rules. Testing for terminal/ non-terminal is happening twice and is hence redundant.
  6. we are doing a toupper(str(i)) in get_symbol_val. This is causing the tags of all symbols to be NT. That is why seg fault is occurring. So make the strings in terminal_string array lower case