\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{geometry}
\geometry{ hmargin=2cm, vmargin=2cm }
\usepackage{hyperref}
\hypersetup{
colorlinks=true, % false: boxed links; true: colored links
linkcolor=black, % color of internal links
citecolor=black, % color of links to bibliography
filecolor=black, % color of file links
urlcolor=blue % color of external links
}
\title{\texttt{dypgen} User's Manual}
\author{Emmanuel Onzon}
\newcommand{\comment}[1]{}
\begin{document}
\maketitle
\section*{Overview}
\texttt{dypgen} is a parser generator for Objective Caml.
To use it you need to learn the BNF syntax for grammars
which is briefly explained in section \ref{BNF}.
Its main features are:
\begin{itemize}
%\item But you should not
% need to learn the various notions of stack automaton (LR(0), LALR(1),
% \dots).
\item \texttt{dypgen} is a GLR parser. This means it can handle ambiguous
  grammars and output the list of all possible parse trees. Even for
  non-ambiguous grammars, GLR parsing allows defining the grammar in
  a more natural way. It is possible to extract a definition suitable
  for the documentation directly from the parser source file.
\item Ambiguities can be removed by introducing {\em priorities} and
relations between them. This gives a very natural way to express a
grammar (the examples in this documentation illustrate this).
\item Grammars are self-extensible, i.e. an action can add new rules to the
current grammar. Moreover, the modifications can be local. The new
grammar is valid only for a well delimited section of the parsed input.
\item \verb#dypgen# provides management of local and global immutable data.
The user actions can access it and return it modified. These mechanisms
address the problem posed by global mutable data with GLR parsing and let
the user control the scope of the data.
Modifications of local data are preserved when traveling from right to
left in a rule or when going down in the parse tree. Modifications of
global data are preserved across the complete traversal of the parse
tree.
These data may be used for instance to do type-checking at parsing
time in an elegant and acceptable way. The local data may
contain the environment that associates a type to each variable while the
global data would contain the substitution over types that is
usually produced by unification.
\item Pattern matching for symbols in right-hand sides of rules is possible.
In particular this allows guarded reductions and binding names
to the arguments of actions.
\item \verb#dypgen# allows {\em early actions}, which are semantic actions
performed before the end of a rule.
\item It is possible to use dypgen as the lexer generator too. In this case regular expressions that match character strings are allowed in the right-hand sides of grammar rules and it is possible to extend the lexer by introducing such rules.
\item The operators \verb|*|, \verb|+| and \verb|?| of regular expressions are available for non terminals and nested rules too.
\end{itemize}
\newpage
\tableofcontents
\section{Introduction to dypgen}
\subsection{The calculator example}
It is traditional to start the documentation of a parser generator
with the calculator example. Here we only give the grammar file:
\texttt{dypgen} takes a file ending with \texttt{.dyp} as input and
generates a \texttt{.ml} file and a \texttt{.mli} file.
For the program to be complete, one also needs to generate a lexical
analyser, which can be done with ocamllex, dypgen or other lexer generators. The complete source for this example lies in the directory \verb|demos/calc_ocamllex| of the distribution.
Here is the file defining the calculator grammar:
\newcommand{\verbatimfile}[1]{\expandafter\beginverb\inputfile{#1}}
\newcommand{\beginverb}{\begin{verbatim}}
\newcommand{\inputfile}[1]{\input{#1}}
\verbatimfile{demos/calc_ocamllex/calc_parser.dyp}
\end{verbatim}
%\end{verbatim}
Let us comment it briefly. More details are available later in this
documentation.
\begin{itemize}
\item The first two lines starting with \verb#%token# define the {\em
tokens} also called {\em terminal symbols} of the grammar. The lexical analyser
is supposed to transform a character stream into a token stream.
For instance, the token \verb#PLUS# will probably (this is defined by
the lexical analyser) correspond to the character
\verb#+# in the input stream.
On the second line, we also indicate that the \verb#INT# token comes
with a value of type \verb#int#. They correspond to integer values in
the input stream.
\item The third line starting with \verb#%relation# defines three
priority levels, which intuitively will correspond to atomic
expressions (\verb#pi#), products (\verb#pt#) and sums (\verb#pp#).
\item The line starting with \verb#%start# gives the entry point of
the grammar. This means that one can parse an input stream using the
generated function \verb#Calc_parser.main#, which is the function you
call to actually parse something.
\item The characters \verb|%%| tell dypgen where the grammar of the parser begins. They are equivalent to the keyword \verb|%parser|.
\item All the other lines define the grammar. The next section will
explain how to write grammars. Briefly, it is a BNF grammar
(explained below) decorated with the new concept of priority and with
semantic actions between curly braces.
\end{itemize}
\subsection{BNF grammars}\label{BNF}
A BNF grammar is a concise way of defining a language, that is, a set of
sequences of characters. However, it is traditional and may be more efficient to define a language in two steps: lexical analysis and grammatical analysis.
We will not describe lexical analysis here. We will assume an already
defined lexer (for instance using \texttt{ocamllex}) which defines
some sets of words denoted using capital letters. For instance,
in the calculator example above, \verb#PLUS# denotes the single word
``\verb#+#'' while \verb#INT# denotes the set of all words representing
an integer.
These sets of words defined by the lexer are usually called {\em
tokens} or {\em terminal symbols}.
Then a BNF grammar describes the language as a set of sequences of
terminal symbols
(they sometimes have to be separated by {\em spaces}).
The calculator grammar in this context is
\begin{verbatim}
expr: INT | MINUS expr | LEFTPAR expr RIGHTPAR
| expr PLUS expr | expr MINUS expr
| expr TIMES expr | expr DIV expr
\end{verbatim}
This in fact just defines the language \texttt{expr} as the smallest language
containing \texttt{INT} and closed by the following construction
rules:
\begin{itemize}
\item If $w \in \mathtt{expr}$ then $\mathtt{-}w \in \mathtt{expr}$ (we assume
here that $\texttt{MINUS}$ contains only the word \verb#-#, and
similar assumptions for the other tokens).
\item If $w \in \mathtt{expr}$ then $(w) \in \mathtt{expr}$
\item If $w,w' \in \mathtt{expr}$ then $w\mathtt{+}w' \in \mathtt{expr}$
\item If $w,w' \in \mathtt{expr}$ then $w\mathtt{-}w' \in \mathtt{expr}$
\item If $w,w' \in \mathtt{expr}$ then $w\mathtt{*}w' \in \mathtt{expr}$
\item If $w,w' \in \mathtt{expr}$ then $w\mathtt{/}w' \in \mathtt{expr}$
\end{itemize}
In general, a grammar is defined by a finite number of non-terminal
symbols (the calculator grammar has only one non-terminal:
\texttt{expr}) and a set of rules describing the elements of each
non-terminal symbol.
A rule associates to one non-terminal symbol a sequence of symbols
(terminal or non-terminal).
Then the languages corresponding to the non-terminals are defined
simultaneously as the smallest languages satisfying every rule.
Let us consider another example:
\begin{verbatim}
a: INT | a MINUS b
b: INT | b PLUS a
\end{verbatim}
This means that \verb#a# and \verb#b# are the smallest languages
containing the integers and such that:
\begin{itemize}
\item If $w \in \mathtt{a}$ and $w' \in \mathtt{b}$ then $w\mathtt{-}w' \in \mathtt{a}$
\item If $w \in \mathtt{b}$ and $w' \in \mathtt{a}$ then $w\mathtt{+}w' \in \mathtt{b}$
\end{itemize}
Then, it is easy to see that \verb#0-0+0-0# is in the language
\verb#a#: \verb#0# is both in \verb#a# and \verb#b#, which
implies
that \verb#0-0# is in \verb#a#, from which we deduce that \verb#0+0-0#
is in \verb#b#, and finally that \verb#0-0+0-0# (that is, \verb#0# minus
\verb#0+0-0#) is in \verb#a#. However,
\verb#0-0+0-0# is not in \verb#b# (an easy exercise).
\subsection{Priorities}\label{priorities}
The problem with our calculator grammar as written in the previous section
is that it is ambiguous and wrong because, for instance, there
are two ways to parse \verb#3-2+1#: one equivalent to \verb#(3-2)+1#
and the other leading to \verb#3-(2+1)#.
The second way is not the usual way to read this expression and will
give a wrong answer when we compute the value of the expression
in the semantic action.
We forgot to say that our operators should associate to the left
and also that product and division have priority over addition and subtraction.
To say so, \texttt{dypgen} provides priority constants and relations
over them. In the case of the calculator, we define three priority
constants: \verb#pi#, \verb#pt# and \verb#pp#. We define the relation
\verb#<# by \verb#pi<pt#, \verb#pi<pp# and \verb#pt<pp#.
For each rule, we state to which priority it belongs,
and for each non terminal in a rule,
we give the priorities it accepts.
The calculator grammar in this context is
\begin{verbatim}
expr: INT pi
| MINUS expr(<=pi) pi
| LEFTPAR expr RIGHTPAR pi
| expr(<=pp) PLUS expr(<pp) pp
| expr(<=pp) MINUS expr(<pp) pp
| expr(<=pt) TIMES expr(<pt) pt
| expr(<=pt) DIV expr(<pt) pt
\end{verbatim}
Let us comment some rules: in the rule
\verb#expr: expr(<=pp) PLUS expr(<pp) pp#,
we say that an expression produced by this rule will be associated with
the priority constant \verb#pp#. We also say that on the left of the
\verb#PLUS#
token, only an expression of priority less than or equal to \verb#pp# can
appear, while on the right we are more restrictive since we only
accept a priority strictly less than \verb#pp#.
For the rule \verb#expr: LEFTPAR expr RIGHTPAR pi#, we associate the
smallest priority to the resulting expression and we give no
constraint for the expression between parentheses.
More details about priorities are given in section \ref{priority}.
\subsection{Semantic actions}\label{actions}
Now, parsing is not just defining acceptable sequences. One has to
produce something from the parsed sequence. This is performed using
semantic actions, given after each rule.
An action is a piece of
\verb#ocaml# code returning data associated with the parsed sequence; it can
be used to build a parse tree or, as with the calculator, to compute a value.
Actions can access the semantics (that is, the associated data) of
each non-terminal. Terminal symbols also can have
semantics associated with them by the lexical analyser.
In the code of
an action, we access the semantic data of each symbol in the rule
using the notation \verb#$1, $2, $3... # (\verb#$3# is the semantics of
the third symbol in the rule).
The action must be placed after each rule, between curly braces and before
the priority of the rule.
Let us look again at the calculator grammar, but with the
semantic actions added:
\begin{verbatim}
expr:
| INT { $1 } pi
| MINUS expr(<=pi) { -$2 } pi
| LPAREN expr RPAREN { $2 } pi
| expr(<=pp) PLUS expr(<pp) { $1 + $3 } pp
| expr(<=pp) MINUS expr(<pp) { $1 - $3 } pp
| expr(<=pt) TIMES expr(<pt) { $1 * $3 } pt
| expr(<=pt) DIV expr(<pt) { $1 / $3 } pt
\end{verbatim}
Here, the actions compute the value of the numerical expression and
the example is self-explanatory.
Remark: a non terminal can accept the empty sequence; to this end, write no
symbol before the opening curly brace of the action.
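For instance, a hypothetical non terminal \verb#opt_sign# with an empty
alternative could be written as follows; it yields \verb#1# when it derives
the empty sequence and \verb#-1# when it derives a \verb#MINUS# token:
\begin{verbatim}
opt_sign:
  | { 1 }
  | MINUS { -1 }
\end{verbatim}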
\subsection{The calculator with dypgen lexer generator}
The calculator example above assumes that we use an external lexer generator like ocamllex. One can use instead the lexer generator provided by dypgen and define both the lexer and the parser in the .dyp file. Here is a version of the calculator that uses dypgen's own lexer generator:
\verbatimfile{demos/calc/calc_parser.dyp}
\end{verbatim}
%\end{verbatim}
You can find this example in the directory \verb|demos/calc|. Here are the differences with the previous example:
\begin{itemize}
\item The line starting with \verb|%layout| defines which characters are considered simple layout and must be discarded by the lexer. They are defined with a regular expression. Here the lexer discards any space and tab character.
\item Regular expressions can be written directly on the right-hand sides of grammar rules, making it possible to define the lexer and the parser at the same time.
\end{itemize}
In this simple example no terminal symbol appears anymore. However dypgen's lexer generator can produce terminal symbols too. Here is another version of the calculator, closer to the first example, but still using dypgen's lexer generator:
\begin{verbatim}
%start main
%relation pi<pt<pp
%lexer
main lexer =
[' ' '\t'] ->
['0'-'9'] -> INT { int_of_string (Dyp.lexeme lexbuf) }
"+" -> PLUS
"-" -> MINUS
"*" -> TIMES
"/" -> DIV
"(" -> LPAREN
")" -> RPAREN
%parser
main : expr EOL { $1 }
expr :
| INT { $1 } pi
| MINUS expr(=pi) { -$2 } pi
| LPAREN expr RPAREN { $2 } pi
| expr(<=pp) PLUS expr(<pp) { $1 + $3 } pp
| expr(<=pp) MINUS expr(<pp) { $1 - $3 } pp
| expr(<=pt) TIMES expr(<pt) { $1 * $3 } pt
| expr(<=pt) DIV expr(<pt) { $1 / $3 } pt
\end{verbatim}
\section{Lexer generators}
You can use dypgen to generate a lexer for your parser or you can use an external lexer generator like ocamllex or ulex.
\subsection {dypgen lexer generator}\label{dyplex}
If you use dypgen to generate your lexer then the entry functions (declared with \verb|%start|) have the following type:
\begin{verbatim}
?global_data:global_data_type ->
?local_data:local_data_type ->
obj Dyp.dyplexbuf ->
(ast_type * string) list
\end{verbatim}
The nature of \verb|global_data| and \verb|local_data| is explained in section \ref{data}; \verb|obj Dyp.dyplexbuf| is the type of the lexer buffer (the type \verb|obj| is explained in section \ref{obj}). You can make a lexer buffer from a string, an input channel or a function with the following functions of the module \verb|Dyp| of the library:
\begin{verbatim}
val from_string :
('token,'obj,'global_data,'local_data,'lexbuf) parser_pilot ->
string ->
'obj dyplexbuf
val from_channel :
('token,'obj,'global_data,'local_data,'lexbuf) parser_pilot ->
in_channel ->
'obj dyplexbuf
val from_function :
('token,'obj,'global_data,'local_data,'lexbuf) parser_pilot ->
(string -> int -> int) ->
'obj dyplexbuf
\end{verbatim}
These functions are similar to the functions of the same names of the module \verb|Lexing| from the standard library. They expect an argument of type \verb|parser_pilot|; this argument is yielded by a function named \verb|pp| that is generated in the \verb|.ml| file. For instance if you want to parse the string \verb|s| you can use:
\begin{verbatim}
let lexbuf = Dyp.from_string (Parser.pp ()) s
let result = Parser.main lexbuf
\end{verbatim}
assuming your parser is defined in the file \verb|parser.dyp| and the starting non terminal is \verb|main|. The result is of type:
\begin{verbatim}
(ast_type * string) list
\end{verbatim}
\verb|ast_type| denotes the type of the values yielded by the non terminal \verb|main|, and the string is the name of the associated priority. The list contains one pair for each interpretation of the input string; it contains just one pair if the input is unambiguous.
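Continuing the sketch above (with the assumed module \verb|Parser| and start
symbol \verb|main|), the result list can be consumed like this:
\begin{verbatim}
match Parser.main lexbuf with
| [(ast, _)] -> ast     (* unambiguous input: a single pair *)
| (ast, _) :: _ -> ast  (* ambiguous input: arbitrarily keep the first tree *)
| [] -> assert false    (* not expected: a syntax error raises an exception *)
\end{verbatim}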
The general structure of a lexer definition is as follows:
\begin{verbatim}
%lexer
let ident = regexp ...
rule aux_lexer_1 [arg1 ... argn] =
parse regexp { action }
| ...
| regexp { action }
and aux_lexer_2 [arg1 ... argn] =
parse ...
and ...
main lexer =
regexp -> [Token] [{ action }]
regexp -> [Token] [{ action }]
...
\end{verbatim}
This definition takes place before the definition of the parser. It begins with declarations of aliases for regular expressions. Then come the definitions of auxiliary lexers; these can be used for instance to parse strings or comments. The syntax of their definitions is similar to that of ocamllex. Finally the main lexer is defined after the line
\begin{verbatim}
main lexer =
\end{verbatim}
A line
\begin{verbatim}
regexp -> Token { action }
\end{verbatim}
means that when \verb|regexp| is matched, the lexer returns the token \verb|Token| with the associated value returned by \verb|action|. The action is optional. The token is optional too. If no token name is written then the text matched by the regular expression is considered to be layout characters and is discarded by the lexer. An action can be stated in this case too for its side-effect.
Layout characters can also be declared before the definition of the lexer with the keyword \verb|%layout| followed by the regular expression and, if needed, an action code inside curly braces. Auxiliary lexers can be called from the user action code of the main lexer and of the auxiliary lexers. They are called with their names followed by their arguments, the last one being the lexer buffer. The current lexer buffer is always available inside the action code with the name \verb|lexbuf|. You can use a lexer defined by ocamllex as an auxiliary lexer and call it from a user action. To do this you must convert the lexer buffer to the lexer buffer type of ocamllex with the function:
\begin{verbatim}
val std_lexbuf : 'obj dyplexbuf -> Lexing.lexbuf
\end{verbatim}
of the module \verb|Dyp|. For instance say you want to match strings with a lexer \verb|string| defined in a file \verb|lex_string.mll|, you can use the following entry in your main lexer definition:
\begin{verbatim}
'"' -> STRING { Buffer.clear Lex_string.string_buf;
Lex_string.string (Dyp.std_lexbuf lexbuf);
Buffer.contents Lex_string.string_buf }
\end{verbatim}
assuming the lexer \verb|string| stores the matched string in the buffer \verb|string_buf| defined in the file \verb|lex_string.mll|.\\
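For reference, here is a minimal sketch of what such a \verb|lex_string.mll|
could contain (the names \verb|string_buf| and \verb|string| match the
assumptions above; a real string lexer would also handle escape sequences):
\begin{verbatim}
{ let string_buf = Buffer.create 64 }

rule string = parse
  | '"'    { () }  (* closing quote: stop *)
  | _ as c { Buffer.add_char string_buf c; string lexbuf }
\end{verbatim}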
When using dypgen's lexer generator you can write regular expressions directly in the right-hand sides of the production rules of the grammar of the parser. They return a value of type \verb|string| that is available in the action code in the corresponding variable such as \verb|$1|. A sequence of several regular expressions yields as many variables. For instance in the action of the rule:
\begin{verbatim}
nt: "hello" "world"
\end{verbatim}
the variable \verb|$1| has the value \verb|"hello"| and \verb|$2| the value \verb|"world"|. Any layout character can be matched between \verb|"hello"| and \verb|"world"|. If you enclose the sequence inside parentheses then it is considered one regular expression. In the action of the rule:
\begin{verbatim}
nt: ("hello" "world")
\end{verbatim}
the variable \verb|$1| has the value \verb|"helloworld"|. The string \verb|"helloworld"| must be matched to be able to reduce with this rule.\\
Regular expressions have the same syntax as those of ocamllex except that the operator \verb|#| of character set difference is not available and it is not possible to bind sub-patterns (which is done with the keyword \verb|as| in ocamllex).\\
As with ocamllex, the lexer always yields a unique token when there is at least one match. A lexer generated by dypgen deals with ambiguity a bit differently than ocamllex does. Here is how dypgen chooses the match of the lexer:
The matches that are not expected, taking into account what has been parsed so far, are discarded. Among those which are expected, the longest are selected, then those belonging to the most recent lexer (if the lexer has been extended), then the one generated by the regular expression that is higher in the precedence order. The order in which the regular expressions appear in the file is their precedence order. The precedence of regular expressions in right-hand sides of grammar rules is unspecified but lower than that of the regular expressions defined in the section \verb|main lexer =|, except for those that are just a string: those have higher precedence instead. The lexer can be extended when new grammar rules containing regular expressions are introduced; in this case the precedence of these new regular expressions follows the same rule. The precedence of regular expressions that match layout characters is lower than that of other regular expressions (including those that match fewer characters) and is left unspecified among them. Contrary to ocamllex, it is not possible to select the shortest match instead of the longest.\\
You can also keep all the tokens of maximum length (and which are expected). To do so define the following variable:
\begin{verbatim}
let dypgen_choose_token = `all
\end{verbatim}
in the header of the parser definition. And when you call the function \verb|lexparse| (see section \ref{parse} for more information about this function) use the optional argument:
\begin{verbatim}
~choose_token:`all
\end{verbatim}
Note that in this case, if there are several matched entries for the same terminal symbol in the lexer, then this terminal symbol is yielded that many times, each time with its corresponding action. Suppose these lines are in the lexer:
\begin{verbatim}
['0'-'9']+ -> INT { int_of_string (Dyp.lexeme lexbuf) }
['0'-'9']+ -> INT { -int_of_string (Dyp.lexeme lexbuf) }
\end{verbatim}
Then when the lexer reads the integer \verb|17| it returns two tokens to the parser: \verb|INT(17)| and \verb|INT(-17)|.\\
Note that when using \verb|`all| the behaviour of the lexer pertaining to layout regular expressions stays the same as the default one. If the lexer matches several layout regular expressions (all of maximal length), then there are two possible cases:
\begin{itemize}
\item Only layout regular expressions are matched. Then one action is chosen among them and executed.
\item In addition to layout regular expressions, some non-layout regular expression is matched. Then no action associated to layout regular expression is executed.
\end{itemize}
Remark that even when returning several tokens the lexer only returns the tokens of maximal length and drops the ones with a shorter length. This has counter-intuitive consequences. For example consider the following grammar:
\begin{verbatim}
{ let dypgen_choose_token = `all }
%start main
%lexer
main lexer =
['a'-'z']+ -> ID
%parser
main:
| 'a'? d { "ab" }
| ID 'c' { "id" }
d: 'b' { }
\end{verbatim}
Parsing fails on the string \verb|b| even with
\begin{verbatim}
let dypgen_choose_token = `all
\end{verbatim}
When parsing \verb|b| the lexer first matches \verb|'a'?| with the empty string, then it matches \verb|ID| with the string \verb|b|, and the match with \verb|'a'?| is thrown away because it is of shorter length. Afterwards the rule \verb|main: 'a'? d| is disqualified.\\
A nested rule can solve this problem. Change the first rule with this one:
\begin{verbatim}
main:
| ["a"]? d { "ab" }
\end{verbatim}
and the parsing of \verb|b| does not fail anymore. For more information about nested rules see section \ref{nested}.\\
When parsing a file you may want to track line numbers and possibly the file from which the parsed code originates (before any preprocessing). To do so, the two following functions are available:
\begin{verbatim}
val set_newline : 'obj dyplexbuf -> unit
val set_fname : 'obj dyplexbuf -> string -> unit
\end{verbatim}
These functions update the fields \verb|pos_lnum| and \verb|pos_bol|, and the field \verb|pos_fname| of the lexer buffer, respectively. If \verb|lexbuf| is your lexer buffer, you can access these fields with:
\begin{verbatim}
(Dyp.std_lexbuf lexbuf).lex_curr_p.pos_lnum
(Dyp.std_lexbuf lexbuf).lex_curr_p.pos_bol
(Dyp.std_lexbuf lexbuf).lex_curr_p.pos_fname
\end{verbatim}
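For instance, with dypgen's lexer generator, a plausible way to keep these
fields up to date is to call \verb|set_newline| from a layout entry matching
a newline (a sketch):
\begin{verbatim}
%lexer
main lexer =
  '\n' -> { Dyp.set_newline lexbuf }
\end{verbatim}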
\subsection{External lexer generators}\label{other-lexer}
If you use an external lexer generator like ocamllex to generate your lexer then the entry functions (declared with \verb|%start|) have the following type:
\begin{verbatim}
?global_data:global_data_type ->
?local_data:local_data_type ->
(lexbuf_type -> token) ->
lexbuf_type ->
(ast_type * string) list
\end{verbatim}
Compared with the use of dypgen as a lexer generator, the function expects one extra argument: a function of type:
\begin{verbatim}
lexbuf_type -> token
\end{verbatim}
This function is the lexer that produces tokens for the parser. The type \verb|lexbuf_type| denotes the type of the lexer buffer (with ocamllex it is \verb|Lexing.lexbuf|) and \verb|token| is a type declared by dypgen. The type \verb|token| has one constructor for each token that is declared with the keyword \verb|%token|.\\
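For instance, with the token declarations of the calculator example, the
generated type \verb|token| would look along these lines (a sketch; a token
declared with a value type carries an argument):
\begin{verbatim}
type token =
  | PLUS | MINUS | TIMES | DIV
  | LPAREN | RPAREN | EOL
  | INT of int
\end{verbatim}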
Note that you cannot use regular expressions in the right-hand sides of the
rules of the parser for lexing purposes if you do not use dypgen as lexer generator. More precisely: if you write a regular expression in a right-hand side, then dypgen will assume that you use dypgen as the lexer generator too.
\subsection{Using a lexer generator different from ocamllex or dypgen}
If you want to use a lexer generator other than ocamllex or dypgen, then you have to define the function \verb|dypgen_lexbuf_position| in the header. If you don't need the position of the lexer you may use the line:
\begin{verbatim}
let dypgen_lexbuf_position = Dyp.dummy_lexbuf_position
\end{verbatim}
The function \verb|dummy_lexbuf_position| is:
\begin{verbatim}
let dummy_lexbuf_position _ = (Lexing.dummy_pos, Lexing.dummy_pos)
\end{verbatim}
With this function you don't have access to relevant information about the position of the lexer while parsing. If you need the position of the lexer in your action code, then you have to define the function \verb|dypgen_lexbuf_position| in the header accordingly (see \ref{position}).\\
If the type of the lexer buffer is constructed with some types defined in the header of the parser definition, then you must include them in the \verb|.mli| (see section \ref{mli}).\\
The demo program \verb|tinyML_ulex| uses Ulex as its lexer generator.\\
The demo program \verb|position_parser_token_list| uses ocamllex but a different lexer buffer than \verb|Lexing.lexbuf|.
\subsection{Position of the lexer}\label{position}
The following functions make it possible to know the location of the part of the input that is reduced to the symbols that appear in the rule currently applied. They are available from the action code.
\begin{verbatim}
val symbol_start : unit -> int
val symbol_start_pos : unit -> Lexing.position
val symbol_end : unit -> int
val symbol_end_pos : unit -> Lexing.position
val rhs_start : int -> int
val rhs_start_pos : int -> Lexing.position
val rhs_end : int -> int
val rhs_end_pos : int -> Lexing.position
\end{verbatim}
These functions tell which part of the input is reduced to a given non terminal. They should behave the same way as the functions of the same names of the module \texttt{Parsing} do in \texttt{ocamlyacc}.
These functions are fields of the record \verb|dyp|; to use them you must go through this record, as in:
\begin{verbatim}
dyp.symbol_start ()
\end{verbatim}
The demo program \verb|position| illustrates the use of these functions with dypgen as lexer generator, and the program \verb|position_ocamllex| with ocamllex.\\
If you want these functions to return relevant information when you use a lexer generator different from ocamllex or dypgen, or when you do not use the interface of ocamllex directly with dypgen, then you have to define the function \verb|dypgen_lexbuf_position| in the header of the parser accordingly. The type of this function is:
\begin{verbatim}
val dypgen_lexbuf_position : lexbuf_type -> (Lexing.position * Lexing.position)
\end{verbatim}
where \verb|lexbuf_type| is the type of the lexer buffer (it is inferred by Caml). The two positions are those of the beginning and of the end of the last token read by the parser. The type \verb|Lexing.position| is:
\begin{verbatim}
type position = {
pos_fname : string;
pos_lnum : int;
pos_bol : int;
pos_cnum : int;
}
\end{verbatim}
The demo program \verb|position_token_list| is the same as \verb|position| except that it first uses ocamllex to make a list of tokens and then uses dypgen to parse this list of tokens. It makes use of \verb|dypgen_lexbuf_position|.\\
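As an illustration, if the lexer buffer were a plain \verb|Lexing.lexbuf|, a
definition mirroring the usual ocamllex behaviour could be written with the
standard fields of \verb|Lexing.lexbuf| (a sketch):
\begin{verbatim}
let dypgen_lexbuf_position (lexbuf : Lexing.lexbuf) =
  (lexbuf.Lexing.lex_start_p, lexbuf.Lexing.lex_curr_p)
\end{verbatim}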
Note: the following abbreviations are available in the \verb|.dyp| file: \verb|$<| for \verb|dyp.Dyp.rhs_start_pos| and \verb|$>| for \verb|dyp.Dyp.rhs_end_pos|.\\
Remark that in some instances, the position returned by \verb|symbol_start| and \verb|symbol_start_pos| looks inconsistent with the position returned by the functions for the right-hand side. This may be the case when the right-hand side of the rule used to reduce begins with nullable non terminals. In such a case the position returned by \verb|rhs_start| and \verb|rhs_end| for these symbols may be smaller than the position returned by \verb|symbol_start|. This happens when there are layout characters before the characters matched by the first non-nullable symbol of the right-hand side of the rule. Here is an example.
\begin{verbatim}
%start main
%layout ' '
%parser
main: eps "abc"
{ Printf.printf "eps start: %d, eps end: %d, main start: %d\n"
(dyp.Dyp.rhs_start 1) (dyp.Dyp.rhs_end 1)
(dyp.Dyp.symbol_start ());
0 }
eps: { 0 }
\end{verbatim}
Parsing the string \verb|" abc"|, where the string begins with 3 spaces, returns
\begin{verbatim}
eps start: 0, eps end: 0, main start: 4
\end{verbatim}
The reason for this inconsistency is that no layout character is read before reducing with a rule with an empty right-hand side, therefore the 3 spaces are not read. But those layout characters are not included when computing the start of \verb|main|. This seems to be the most useful convention.
\subsection{Extending the lexer}
It is possible to indirectly extend the main lexer generated by dypgen. The way to do it is to introduce new grammar rules for the \emph{parser} with regular expressions in the right-hand sides. It is not possible to extend the definition of an existing terminal symbol, nor to create a new terminal symbol at runtime. However this does not prevent you from introducing new tokens. If you want a new token at runtime, you can introduce a new rule with a new non terminal in the left-hand side. This new non terminal will represent the new token. The right-hand side of the new rule contains the regular expression that matches the lexemes for the new token. Section \ref{grammar-extension} explains how to extend the grammar at runtime.
\section{Resolving ambiguities}\label{ambiguities}
There are two main mechanisms to handle ambiguities in \texttt{dypgen}: a system of priorities with relations between them, and the merge functions, which can decide which parse tree to keep when a given part of the input is reduced to the same non terminal in two different ways. Two other secondary mechanisms make it possible to give up a reduction with a rule (by raising the exception \texttt{Dyp.Giveup}) or to prevent a shift (i.e. the parser is prevented from reading more input without performing a reduction).
\subsection{Priorities with relations}\label{priority}
Each time a reduction by a rule happens, the corresponding parse-tree is yielded with a value which is called a priority.
Priorities are named and declared, along with the relations that hold between them, with the keyword \texttt{\%relation}. The symbol for the relation is \texttt{<} (but this does not mean that it has to be an order). A declaration of priorities can be for instance:
\begin{verbatim}
%relation pi<pt pt<pp pi<pp
\end{verbatim}
It is possible to declare a relation which is transitive on a subset of the priorities with a more compact syntax.
\begin{verbatim}
%relation p1<p2<...<pn
\end{verbatim}
means that the following relations hold: \texttt{p1<p2}, ... , \texttt{p1<pn}, \texttt{p2<p3}, ... , \texttt{p2<pn}, ... , \texttt{p(n-1)<pn}. Thus the first example of relations declaration can also be written:
\begin{verbatim}
%relation pi<pt<pp
\end{verbatim}
The declarations can use several lines, the following declaration is valid:
\begin{verbatim}
%relation pi<pt
pt<pp
%relation pi<pp
\end{verbatim}
Each rule in the grammar returns a priority value when it is used to reduce. This priority is stated by the user after the action code. For instance:
\begin{verbatim}
expr: INT { $1 } pi
\end{verbatim}
If the parser reduces with this rule then it returns the value associated with the token \texttt{INT} and the priority \texttt{pi}. If the user states no priority, then the default priority \texttt{default\_priority} is returned each time a reduction with this rule happens. The value \verb|default_priority| is part of the module \verb|Dyp_priority_data| available in the parser code.\\
Each non terminal in the right-hand side of a rule is associated with a set of priorities that it accepts to perform a reduction. This set of priorities is denoted using the following symbols \texttt{<}, \texttt{>} and \texttt{=} and a priority \texttt{p}.\\
\texttt{(<p)} denotes the set of all priorities \texttt{q} such that \texttt{q<p} holds. \texttt{(<=p)} denotes the previous set to which the priority \texttt{p} is added. \texttt{(>p)} is the set of all priorities \texttt{q} such that \texttt{p<q} holds. Obviously \texttt{(>=p)} denotes the previous set to which the priority \texttt{p} is added and \texttt{(=p)} is the set of just \texttt{p}. Note that when declaring relations between priorities, the symbols \texttt{>} and \texttt{=} cannot be used.\\
If no set of priorities is stated after a non terminal in the right-hand side of a rule, then it accepts any priority. Thus stating no priority set is equivalent to stating the set of all priorities.\\
A basic example: you have the following rules:
\begin{verbatim}
expr: INT { $1 } pi
expr: MINUS expr(<=pi) { -$2 }
\end{verbatim}
You parse the string: `\texttt{-1}'. First \texttt{1} is reduced to \texttt{expr} with the first rule. \texttt{1} is the returned value and \texttt{pi} is the returned priority. Then the reduction with the second rule can happen because the non terminal \texttt{expr} in its right-hand side accepts the priority \texttt{pi}. This reduction is performed, the returned value is \texttt{-1} and the returned priority is \texttt{default\_priority}. Now if we want to parse the string `\texttt{--1}' a syntax error will happen, because when \texttt{-1} is reduced, the priority \texttt{default\_priority} is yielded, which is not accepted by the second rule and therefore a reduction by this rule cannot happen a second time.\\
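To accept `\texttt{--1}', it suffices to make the second rule return a
priority that its own right-hand side accepts, as the full calculator
grammar does:
\begin{verbatim}
expr: INT { $1 } pi
expr: MINUS expr(<=pi) { -$2 } pi
\end{verbatim}
Now reducing \texttt{-1} returns the priority \texttt{pi}, which
\texttt{expr(<=pi)} accepts, so the rule can apply repeatedly.\\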
Another example: we have the relations \texttt{pi<pt<pp} and the following grammar:
\begin{verbatim}
expr:
| INT { $1 } pi
| LPAREN expr RPAREN { $2 } pi
| expr(<=pp) PLUS expr(<pp) { $1 + $3 } pp
| expr(<=pt) TIMES expr(<pt) { $1 * $3 } pt
\end{verbatim}
and we parse the following input: \texttt{1 + 2 * 3}\\
\texttt{1} and \texttt{2} are reduced to \texttt{expr}, each returning the priority \texttt{pi}, then the parser explores the shift and the reduction.
The parser can reduce with the third rule because \texttt{pi<pp} holds, the priority \texttt{pp} is returned. After \texttt{3} is reduced to \texttt{expr}, the parser cannot reduce with rule 4 after the reduction by rule 3 because \texttt{pp<pt} does not hold.
But the exploration of the possibility of a shift of \texttt{*} after reducing \texttt{2} to \texttt{expr} leads to a successful parsing (which respects the precedence of \texttt{*} over \texttt{+}).\\
The user can declare a priority without stating any relation for it by adding it in the section after \texttt{\%relation}. You should declare any priority that has no relation: if you use, in a rule, a priority that is neither declared nor appears in a relation, then \texttt{dypgen} will emit a warning.\\
\comment{When a priority is declared a corresponding Caml value of type \texttt{priority} is introduced with the same name. You should beware of identifier collision and not name other variables with the names of your priorities or \texttt{default\_priority}. This value can be used in action codes. All the information about relations are stored in a structure of type \texttt{priority\_data}. This structure is accessible from the action code with the record field \texttt{dyp.priority\_data}. The following function allows to know whether the relation holds between two priorities:
\begin{verbatim}
val is_relation: priority_data -> priority -> priority -> bool
\end{verbatim}
\texttt{is\_relation dyp.priority\_data p q} returns true if \texttt{p<q} holds and false otherwise. The type of the record \verb|dyp| is defined in the module \verb|Dyp| of the library \verb|dyp.cm[x]a|, see the section \ref{dyp} for more information about it. Other functions pertaining to priorities are available, see section \ref{dynamic priority} about changing the priorities at runtime.}
\subsection{Merge functions and keeping several trees}\label{merge}
\subsubsection{Specific merge functions}
When a part of the input is reduced in several different ways to the same non terminal \texttt{nt}, the parser must decide whether to keep all the abstract syntax trees (ASTs) yielded, to choose just one or a few of them, or to make a new tree from them. By default only one tree is kept and the other ones are discarded; it is not possible to know which one is kept. The user can override the default choice by defining a function \verb|dyp_merge_Obj_nt| of type:
\begin{verbatim}
val dyp_merge_Obj_nt:
(nt_type * global_data_t * local_data_t) list ->
nt_type list * global_data_t * local_data_t
\end{verbatim}
The type names stated here are not actually defined anywhere but are used here to convey the nature of the argument and result of the merge functions. The type \verb|nt_type| represents the type of the value returned when one reduces to the non terminal \verb|nt|.
The argument of the function is the list of the ASTs which are the different interpretations that were yielded and kept for the non terminal \verb|nt| for a given part of the input. These ASTs are associated with their corresponding \verb|global_data| and \verb|local_data| in a triple (see section \ref{data} for more information about \verb|global_data| and \verb|local_data|). The default behavior of dypgen is not to merge when the data are different. Therefore by default all the global data are actually the same, as well as all the local data. This behavior can be changed (see \ref{keep_data}).\\
The merge function returns a result like:
\begin{verbatim}
(tree_list, global_data, local_data)
\end{verbatim}
where \verb|tree_list| is the list of ASTs kept as the different interpretations of the non terminal \verb|nt| for the considered part of the input, and \verb|global_data| and \verb|local_data| are the data kept for the parsing.\\
The user can define such a merge function for each non terminal in the header of the parser.\\
For example the following merge function keeps all the ASTs for the non terminal \verb|nt|, without considering \verb|global_data| and \verb|local_data|:
\begin{verbatim}
let dyp_merge_Obj_nt l = match l with
| (_,gd,ld)::_ -> List.map (fun (o,_,_) -> o) l, gd, ld
| [] -> assert false
\end{verbatim}
A merge function is only called on ASTs which were yielded with the same priority. If a part of the input is reduced to the same non terminal in two different ways but yielding two distinct priorities, then each AST is kept and used independently for the parsing; they are not merged.
\subsubsection{Example}
Here is an example of using a merge function to enforce the precedence of the multiplication over the addition. Suppose we have the following grammar:
\begin{verbatim}
expr:
| INT { Int $1 }
| expr PLUS expr { Plus ($1,$3) }
| expr TIMES expr { Times ($1,$3) }
\end{verbatim}
And the following string: \texttt{3+10*7} should be parsed as \texttt{Plus (Int 3, Times (Int 10, Int 7))}; more generally the parse result of any string should respect the precedence of \texttt{*} over \texttt{+}. We can do this by defining the following merge function:
\begin{verbatim}
let rec dyp_merge_Obj_expr = function
| (o1,gd,ld)::(o2,_,_)::t ->
begin match o1 with
| Times ((Plus _),_)
| Times (_,(Plus _)) -> dyp_merge_Obj_expr ((o2,gd,ld)::t)
| _ -> dyp_merge_Obj_expr ((o1,gd,ld)::t)
end
| [o,gd,ld] -> [o],gd,ld
| _ -> assert false
\end{verbatim}
You can find this example implemented in the directory \verb|demos/merge_times_plus|.\\
Note that the argument of the merge function (the list) always has at least two elements when called by dypgen.\\
Since it is not possible to know which tree is kept by the default merge function, you should only rely on it when all the trees yielded by the ambiguous non terminal are actually the same, or when it does not matter which one is chosen.
\subsubsection{Global and generic merge functions}
In addition to these merge functions which are specific to one non terminal, the user can also define one global merge function called \verb|dyp_merge|, and several generic merge functions. A generic merge function can be the merge function of several different non terminals. The type of the global merge function is the following:
\begin{verbatim}
type 'a merge_function =
('a * global_data_t * local_data_t) list ->
'a list * global_data_t * local_data_t
\end{verbatim}
The global merge function can be defined in the header of the parser definition, for instance to define a global merge function which keeps all the parse trees:
\begin{verbatim}
let dyp_merge l = match l with
| (_,gd,ld)::_ -> List.map (fun (o,_,_) -> o) l, gd, ld
| [] -> assert false
\end{verbatim}
Then it will be the merge function of any non terminal \texttt{nt} which does not have its own function \verb|dyp_merge_Obj_nt| and has no generic merge function assigned to it.\\
Generic merge functions are defined in the header. Then, to assign a merge function to one or several non terminals, one uses the keyword \verb|%merge| in the following way:
\begin{verbatim}
%merge my_merge_function Obj_nt1 Obj_nt2 Obj_nt3 Obj_nt4 Obj_nt5
\end{verbatim}
where \verb|my_merge_function| is the name of the generic merge function which has been defined in the header and \verb|nt1| ... \verb|nt5| are the names of the non terminals that are assigned this generic merge function.\\
Note that the global merge function must be polymorphic and accept the type of values yielded by any non terminal. This is not the case for a generic merge function as long as it is used for non terminals that return values of the same type. In the special case when all the non terminals return values of the same type, the global merge function does not have to be polymorphic.\\
There are two predefined generic merge functions available to the user in the module \verb|Dyp| of the library:
\begin{verbatim}
val keep_all : 'a merge_function
val keep_one : 'a merge_function
\end{verbatim}
They keep, respectively, all the ASTs and one of the ASTs; they choose \verb|global_data| and \verb|local_data| arbitrarily. If no global merge function is defined then by default it is \verb|keep_one|.\\
Note that you can use the predefined merge functions as generic functions and as the global merge function as well. If you want to keep all the ASTs for all the non terminals, just define the following in the header:
\begin{verbatim}
let dyp_merge = keep_all
\end{verbatim}
You can find a very simple example using \verb|keep_all| in the directory \texttt{demos/forest} where a parse forest is yielded.\\
Note that the merge functions are in fact associated with the constructors of the type \verb|obj| (see section \ref{obj}) rather than with the non terminals, which means that any non terminal that shares the same constructor is assigned the same specific merge function. Moreover, using \verb|%merge| you can assign a generic merge function to constructors introduced with the keyword \verb|%constructor| (see section \ref{obj} for more information). Keep in mind that, although merge is associated with a constructor, merge is only applied to different parsings of the \emph{same non terminal} (and the same part of the input, yielding the same priority), never to distinct non terminals even if they have the same constructor.
\subsubsection{Merging on different global and local data}\label{keep_data}
You may want to merge ASTs even if their corresponding global data or local data are different. Then you have to define the variable:
\begin{verbatim}
val dypgen_keep_data : [`both|`global|`local|`none]
\end{verbatim}
in the header of the parser, or use the functions \verb|parse| or \verb|lexparse| (discussed in section \ref{parse}) with the optional argument:
\begin{verbatim}
?keep_data : [`both|`global|`local|`none]
\end{verbatim}
\begin{itemize}
\item\verb|`both| is the default behavior: if global data is different or local data is different then the ASTs are not merged.
\item\verb|`global| allows merging only when global data are equal but local data may be different.
\item\verb|`local| allows merging only when local data are equal but global data may be different.
\item\verb|`none| makes merging happen disregarding whether the data are equal or not.
\end{itemize}
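For instance, to allow merging regardless of any difference in the data, one
could put in the header:
\begin{verbatim}
let dypgen_keep_data = `none
\end{verbatim}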
\subsubsection{Self-derivable non terminals and empty reductions}
In the general case, when a merge function is called for a non terminal \verb|nt| and a given part of the input, it means that this part will never be reduced to \verb|nt| again. The arguments passed to the merge function are all the ASTs that are possible when reducing this part of the input to \verb|nt|.\\
As a special exception this is not true in two cases:
\begin{itemize}
\item When the part of the input that is considered is the empty string
\item When the non terminal can derive itself. For instance with the rules:
\begin{verbatim}
a: b c
b: a
c:
\end{verbatim}
the non terminal \verb|a| can derive itself (this is also true for \verb|b|).
\end{itemize}
In such cases, as soon as two possible ASTs are yielded for a part of the input and a given non terminal, the corresponding merge function is called on them. If afterward a third AST is yielded then the merge function is called again on this third AST and on the result of the previous call to the merge function.\\
In the general case, the reductions are performed in an order such that an AST is never merged \emph{after} being used by another reduction. This ensures, in particular, that a reduction that spans a bigger part of the input does not miss the results of reductions of subparts of the input when they are relevant.\\
However it is not always possible to avoid this situation when reducing to a self-derivable non terminal. In such a case, a possible solution is to use a reference to a list of all possible ASTs for the nodes corresponding to self-derivable non terminals, and to update the list when a merge happens.
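Here is a minimal sketch of that workaround, assuming a hypothetical AST type
whose nodes can carry a mutable list of alternative interpretations:
\begin{verbatim}
type ast = Leaf of int | Alts of ast list ref

let dyp_merge_Obj_a l = match l with
  | (Alts r, gd, ld) :: t ->
      (* fold the newly yielded interpretations into the shared list *)
      List.iter (fun (o, _, _) -> r := o :: !r) t;
      [Alts r], gd, ld
  | (o, gd, ld) :: t ->
      (* first merge: create the shared list of alternatives *)
      [Alts (ref (o :: List.map (fun (o, _, _) -> o) t))], gd, ld
  | [] -> assert false
\end{verbatim}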
\subsubsection{Merge warning}
To know whether a merge happens you can use the command line option \texttt{--merge-warning} with \texttt{dypgen}. Then the generated parser will emit a warning on the standard output each time a merge is added to its merge working list and print the name of the non terminal which will be merged. The number of merge warnings is equal to the length of the list passed as argument to the merge function, minus one.\\
Warning: when there is a type error in the arguments or the result of a merge function, Caml reports it and points to the \texttt{.ml} file generated by \texttt{dypgen}, not to the \texttt{.dyp} input file, which may be puzzling. In particular make sure your merge functions return a list of ASTs as the first value of the returned tuple and not just an AST, otherwise you will have an error message pointing to the generated \verb|.ml| file.
\subsection{Giving up a reduction}
When a reduction occurs this reduction is given up if the exception \texttt{Dyp.Giveup} is raised in the corresponding action code.
\begin{verbatim}
expr:
| INT { $1 }
| expr DIV expr { if $3=0 then raise Dyp.Giveup else ($1 / $3) }
\end{verbatim}
This is an example where a division by zero is not syntactically correct: the parser refuses to reduce a division by zero. We can also imagine a language with input files being parsed and typed at the same time, where an action would give up a reduction if it detected a type incompatibility. Let us assume we have the following input for such a language:
\begin{verbatim}
exception Exception
let head l = match l with
| x::_ -> x
| _ -> raise Exception
let a = head 1::[2]
\end{verbatim}
Then the parser tries to reduce \texttt{head 1} to an expression, but typing detects a type incompatibility. Hence an exception \texttt{Giveup} is raised, which tells the parser not to explore this possibility any further. Then the parser reduces \texttt{1::[2]} to an expression and thereafter \texttt{head 1::[2]} is reduced to an expression without type error.
\subsection{Preventing a shift}\label{will_shift}
When a reduction occurs it is possible to prevent the shift with the action code. You have to state the character `\verb|@|' before the left brace which begins the action code and return the following list along with the return value of your action in a pair:
\begin{verbatim}
@{ returned_value, [Dyp.Dont_shift] }
\end{verbatim}
The list after \verb|returned_value| is called the parser commands list, see section \ref{dyp_action} for more information about it.
Here is an example, we have the following rules in a grammar:
\begin{verbatim}
expr:
| INT { Int $1 }
| expr COMMA expr { action_comma $1 $3 }
\end{verbatim}
Assuming we have the following input: \texttt{1,2,3}, there are two ways to reduce it to \texttt{expr}. First one: \texttt{1,2} is reduced to \texttt{expr} then \texttt{expr,3} is reduced to \texttt{expr}. Second one: \texttt{2,3} is reduced to \texttt{expr} then \texttt{1,expr} is reduced to \texttt{expr}. Now if we have the input \texttt{1,2,...,n} there are as many ways to reduce it to \texttt{expr} as there are binary trees with n leaves. But we can use the following action code instead:
\begin{verbatim}
expr:
| INT { Int $1 }
| expr COMMA expr @{ action_comma $1 $3, [Dyp.Dont_shift] }
\end{verbatim}
Then there is only one way to reduce \texttt{1,2,3} to \texttt{expr}: the first one, because when \verb|[Dyp.Dont_shift]| is returned the parser will not shift the comma without reducing. And there is only one way to reduce \verb|1,...,n| to \verb|expr|.\\
The constructor \verb|Dont_shift| is part of the type \verb|dyp_action| (see section \ref{dyp_action}), which is defined in the module \verb|Dyp| of the library.
\section{Auxiliary data}\label{data}
\subsection{The fields \texttt{global\_data} and \texttt{local\_data}}
With GLR, the parsing can follow different interpretations independently and simultaneously if there are local ambiguities. As a consequence if there are accesses and changes to data
through side-effects for each of these parsings, there can be unwanted
interactions between them. For this reason, using side effects for the purpose of storing data should be avoided during the parsing. If you want to build and store data while parsing and access this data from within the action code then you should use immutable data and the record fields \verb|dyp.global_data| or \verb|dyp.local_data|. The record \verb|dyp| is available in any user action code, its type is defined in the module \verb|Dyp| of the library \verb|dyp.cm[x]a|, see the section \ref{dyp} for more information about it. If \verb|dyp.global_data| or \verb|dyp.local_data| are used, then the user must define their initial values in the header with the names:
\begin{verbatim}
let global_data = some_initial_data
let local_data = some_other_data
\end{verbatim}
\verb|global_data| and \verb|local_data| can be passed to the parser as arguments since they are optional arguments of the parser. For instance if `\verb|main|' is a start symbol of the grammar and it returns values of type \verb|main_type| then the following function is defined in the generated code:
\begin{verbatim}
val main :
?global_data:gd_type ->
?local_data:ld_type ->
(lexbuf_type -> token) ->
lexbuf_type ->
(main_type * string) list
\end{verbatim}
where \verb|gd_type| and \verb|ld_type| represent the types of the global and local data. \verb|lexbuf_type| is the type of the lexing buffer; if you use ocamllex then it is \verb|Lexing.lexbuf|.\\
If you use dypgen to generate your lexer then the type of \verb|main| is:
\begin{verbatim}
val main :
?global_data:global_data_type ->
?local_data:local_data_type ->
obj Dyp.dyplexbuf ->
(main_type * string) list
\end{verbatim}
The record \verb|dyp| is available in the action code and you can access the values of \verb|global_data| and \verb|local_data| through the fields of \verb|dyp| that have the corresponding names. You can return new values for \verb|global_data| and \verb|local_data| using the list of values of type \verb|dyp_action|. To do that you state the character `\verb|@|' before the action code and you return the list
\begin{verbatim}
[Dyp.Global_data new_global_data]
\end{verbatim}
along with the returned value of your action if you want to change \verb|global_data|. Or the list
\begin{verbatim}
[Dyp.Local_data new_local_data]
\end{verbatim}
if you want to change \verb|local_data|. Or the list
\begin{verbatim}
[Dyp.Global_data new_global_data; Dyp.Local_data new_local_data]
\end{verbatim}
if you want to change both \verb|global_data| and \verb|local_data|. For more information about the type of the constructors \verb|Global_data| and \verb|Local_data| and the list of parser commands, see section \ref{dyp_action}.
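For instance, a rule that counts reductions in \verb|global_data| and records a binding in \verb|local_data| at the same time could look like this (a sketch; it assumes \verb|global_data| is an integer counter, \verb|local_data| an association list, and a constructor \verb|Let| in the user-defined AST):
\begin{verbatim}
expr: "let" IDENT "=" expr "in" expr
  @{ Let ($2, $4, $6),
     [Dyp.Global_data (dyp.global_data + 1);
      Dyp.Local_data (($2, $4) :: dyp.local_data)] }
\end{verbatim}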
%The initial values can also be accessed from outside the parser definition if you declare them with their type in the \verb|.mli| file (by using \verb|%mli|, see section \ref{mli}). If you change them before calling the parser, then these changes will be taken into account as new initial values for \verb|global_data| and \verb|local_data|.\\
The data accessible with \verb|dyp.global_data| follows the parsing through the reductions and the shifts. If at some point the parser follows different alternatives, the data evolves independently in each of them (but side effects may cause unwanted interactions between the alternatives, which is why they should be avoided).\\
The same holds for \texttt{dyp.local\_data}, except that when a reduction happens, it is `forgotten' and restored to its previous value. More precisely: when you return a new value for \verb|local_data| in an action which yields a non-terminal \texttt{nt}, this \verb|local_data| is not forgotten (unless you overwrite it) in any action which follows, until this non-terminal \texttt{nt} is used in another reduction. When this happens, \verb|dyp.local_data| is forgotten and restored to its previous value just before the execution of the action code of this reduction.\\
Here is an example:
\begin{verbatim}
a: TOK_1 b TOK_2 c TOK_3 d EOF @{ action_8, [Local_data some_ld] }
b: e f @{ action_3, [Local_data local_data_3] }
e: A1 @{ action_1, [Local_data local_data_1] }
f: A2 @{ action_2, [Local_data local_data_2] }
c: g h @{ action_6, [Local_data local_data_6] }
g: B1 @{ action_4, [Local_data local_data_4] }
h: B2 @{ action_5, [Local_data local_data_5] }
d: C @{ action_7, [Local_data local_data_7] }
\end{verbatim}
We parse the string:
\begin{verbatim}
TOK_1 A1 A2 TOK_2 B1 B2 TOK_3 C EOF
\end{verbatim}
\begin{itemize}
\item Assume that at the beginning of the parsing
\verb|dyp.local_data| has some initial value \verb|local_data_0|.
\item The first action to be performed is \verb|action_1|, it has access to \verb|local_data_0| and it returns \verb|local_data_1|,
\item then the next action is \verb|action_2|, it has access to \verb|local_data_1| and returns \verb|local_data_2|, although this is useless in this case because it is about to be forgotten.
\item Then the reduction of \verb|e f| to \verb|b| happens. \verb|dyp.local_data| comes back to its value before \verb|action_1| was performed, that is \verb|local_data_0|. \verb|action_3| is performed and it returns the value \verb|local_data_3| for \verb|local_data|.
\item The next action to be performed is \verb|action_4|, \verb|dyp.local_data| has the value \verb|local_data_3| and it returns \verb|local_data_4|,
\item then \verb|action_5| has access to this new value and returns \verb|local_data_5| but it is useless in this case.
\item The reduction of \verb|g h| to \verb|c| happens and the value of \verb|dyp.local_data| is again \verb|local_data_3|, the value it had just after \verb|action_3| was applied. \verb|action_6| returns the value \verb|local_data_6| for \verb|local_data|.
\item The next action is \verb|action_7| which has access to \verb|local_data_6|. It returns \verb|local_data_7| but it is useless since it is about to be forgotten.
\item The reduction with the first rule is performed, the value of \verb|dyp.local_data| comes back to \verb|local_data_0| and the last action \verb|action_8| is performed.
\end{itemize}
\subsection{Example with \texttt{local\_data}}
\texttt{dyp.local\_data} is useful, for instance, to build a symbol table, and makes it possible to use that table to disambiguate. Assume the following grammar and action code:
\begin{verbatim}
main: expr "\n" { $1 }
expr: ['0'-'9']+ { Int (int_of_string $1) }
| IDENT { if is_bound dyp.local_data $1 then Ident $1
else raise Giveup }
| "(" expr ")" { $2 }
| expr "+" expr { Plus ($1,$3) }
| "let" binding "in" expr { Let ($2,$4) }
binding: IDENT "=" expr
@{ Binding ($1,$3),
[Local_data (insert_binding dyp.local_data $1 $3)] }
\end{verbatim}
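The helpers \verb|is_bound| and \verb|insert_binding| are not shown above; a minimal sketch using an association list as symbol table could be put in the header (assuming \verb|expr| is the type of the abstract syntax tree):
\begin{verbatim}
(* hypothetical symbol table: an association list from
   identifiers to the expressions they are bound to *)
let local_data = ([] : (string * expr) list)
let is_bound table ident = List.mem_assoc ident table
let insert_binding table ident value = (ident, value) :: table
\end{verbatim}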
If we keep all the parse trees (see section \ref{merge} about merge functions), then the following input string: \begin{verbatim}
let x = 1 in x+1
\end{verbatim}
yields the two following parse trees:
\begin{verbatim}
(let x = 1 in (x+1))
((let x = 1 in x)+1)
\end{verbatim}
But this input string:
\begin{verbatim}
let x = 1 in 1+x
\end{verbatim}
yields only one parse tree:
\begin{verbatim}
(let x = 1 in (1+x))
\end{verbatim}
Moreover, some inputs are detected as invalid because of unbound identifiers before the whole string has been parsed, such as:
\begin{verbatim}
(let x = 1 in y+2) + ...
\end{verbatim}
This example is available in the directory \texttt{demos/local\_data}.
\subsection{Equality between data}
An equality function between two global data, named \verb|global_data_equal|, can be defined; otherwise it is by default the physical equality \texttt{(==)}. The same holds for \verb|local_data|, where the function is called \verb|local_data_equal|. These equality functions are used by the parser for optimization purposes. In some cases two distinct GLR parsings may share their \verb|global_data| and \verb|local_data| if both of them are deemed equal by \verb|global_data_equal| and \verb|local_data_equal| respectively. In this case one of the two data is kept and the other is discarded, for each of \verb|global_data| and \verb|local_data|.
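For instance, to let the parser merge parsings whose data are structurally equal rather than only physically equal, one may define in the header (a sketch; whether this is appropriate depends on your data):
\begin{verbatim}
let global_data_equal gd1 gd2 = (gd1 = gd2)
let local_data_equal ld1 ld2 = (ld1 = ld2)
\end{verbatim}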
\subsection{Example with \texttt{global\_data}}
Here is a very simple example of the use of \verb|dyp.global_data|: it counts the number of reductions.
\verbatimfile{demos/global_data/global_data_parser.dyp}
\end{verbatim}
For instance, when we parse \texttt{5*7+3*4} and \texttt{4} is reduced, \verb|global_data| is incremented from 4 to 5 (note that it does not count the real total number of reductions, but only the number of reductions made for the first interpretation of the input). This example is available in the directory \verb|demos/global_data|.\\
Here is a less basic example where \verb|dyp.global_data| can be useful. Suppose we have the following grammar:
\begin{verbatim}
expr:
| LIDENT
| INT
| FLOAT
| LPAREN expr COMMA expr RPAREN
| expr PLUS expr
| LPAREN expr RPAREN
| expr PLUSDOT expr
\end{verbatim}
We parse an input, and the following string is part of this input:
\begin{verbatim}
(x+.3.14,(x+1,(x+5, ... )))
\end{verbatim}
where \texttt{...} stands for something long. Suppose we type the expressions in parallel with their parsing and we want to reject the string above as early as possible: we do not want to wait until the whole string is reduced to detect the type incompatibility of \texttt{x}. We can use \verb|dyp.global_data| for that purpose: when reducing \texttt{x+.3.14} we store in \verb|global_data| that \texttt{x} has type float; when we later reduce \texttt{x+1}, this information is still stored in \verb|global_data|, which is accessible from the action code, and we can detect the type incompatibility without having to parse more input.
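A possible sketch of this idea follows; the type \verb|ty|, the helper \verb|check_var|, the constructors \verb|Ident| and \verb|PlusDot| and the rule shown are illustrative assumptions, not part of dypgen. The types of the identifiers are recorded in \verb|global_data|, and a contradiction makes the parser give up this interpretation:
\begin{verbatim}
{
  type ty = Tint | Tfloat
  let global_data = ([] : (string * ty) list)
  (* record the type of identifier x, or give up if it
     contradicts what is already stored in the environment *)
  let check_var env x t =
    try if List.assoc x env = t then env else raise Dyp.Giveup
    with Not_found -> (x, t) :: env
}
...
expr:
  | LIDENT PLUSDOT expr
    @{ PlusDot (Ident $1, $3),
       [Dyp.Global_data (check_var dyp.global_data $1 Tfloat)] }
\end{verbatim}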