% compile with lualatex and biber
% run makeglossaries on command line if glossary changed
\documentclass[
a4paper,
12pt,
twoside,
BCOR=12mm,
parskip=half,
chapterprefix,
numbers=noenddot,
bibliography=totoc
]{scrbook}
\input{config}
\makeatletter
\newcommand\listofillustrations{%
\begingroup
\setlength{\parskip}{0pt}%
\chapter*{List of Illustrations}%
\section*{Figures}%
\@starttoc{lof}%
\bigskip
\section*{Tables}%
\@starttoc{lot}%
\endgroup}
\makeatother
\begin{document}
\frontmatter
%\listoftodos
\input{titlepage}
\cleardoublepage
\setcounter{tocdepth}{1}
\tableofcontents
\listofillustrations
\renewcommand{\lstlistlistingname}{Code Listings}
\lstlistoflistings
\cleardoublepage
\thispagestyle{empty}
\section*{Abstract}
Search spaces of combinatorial optimization problems sometimes have a natural representation as derivation trees of formal grammars, such as in the fields of natural language processing and biological sequence analysis.
Amongst others, the algebraic dynamic programming method by Giegerich et al. was introduced to leverage this representation for automatically deriving efficient top-down and bottom-up dynamic programming algorithms. However, its limitation to context-free languages prevents its use for certain problems, for example, RNA pseudoknotted secondary structure prediction, or Viterbi parsing of natural languages with cross-serial dependencies.
This work redefines algebraic dynamic programming for the larger class of multiple context-free languages, as generated by multiple context-free grammars. A recursive formulation for problem instances is given from which top-down and bottom-up algorithms can be derived. Based on that, a prototype top-down implementation -- realized in Haskell as an embedded domain-specific language with support for one- and two-dimensional nonterminals -- is made available.
The usage of this redefined method is demonstrated by the RNA pseudoknotted secondary structure prediction problem.
\cleardoublepage
\thispagestyle{empty}
\section*{Acknowledgements}
I would like to thank Peter Stadler from the University of Leipzig for welcoming me to his group with open arms and introducing me to the world of bioinformatics. I also thank him for introducing me to Markus Nebel from the University of Kaiserslautern who pointed me to the book \emph{Parsing Beyond Context-Free Grammars} by Laura Kallmeyer, and to Christian Höner zu Siederdissen from the University of Vienna who reviewed the final version of this thesis and answered all my questions about bioinformatics and ADP implementations.
I would also like to thank Robert Giegerich from Bielefeld University for inviting me to his group and allowing me to present my work to a number of interested people.
I owe my deepest gratitude to my supervisor Johannes Waldmann for many insightful discussions and practical advice. I also thank him for pushing me to give a presentation at the \emph{10. Herbstseminar der Bioinformatik} where I met Sebastian Wild and Stefan Janssen who shared an interest in my work and with whom I had interesting discussions during hiking tours and evening beers.
Finally, I greatly thank my fellow student Michael Schmeißer for his very helpful and accurate reviews.
\mainmatter
\chapter{Introduction}
\linespread{1.02}\selectfont
One of the main tasks of a computer scientist is to define meaningful abstractions. A classic example is the development of programming language concepts. For example, depending on the language, there are different ways to transform elements of a sequence. In assembly languages, a sequence is a block of memory which is traversed and transformed by incrementing memory addresses, employing jump operations, and storing new values. In imperative languages, the concepts of loops and arrays abstract over the details of memory addressing for traversal. In languages supporting higher-order functions, a loop can be separated from its action -- the loop being the traversal of a sequence, and the action being the transformation of one element. Thus, higher-order functions make it possible to abstract over looping conditions.
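As a minimal illustration -- in Haskell, the language also used for the prototype developed later in this work -- the higher-order function \texttt{map} fixes the traversal once, while the per-element action is passed as an argument (the function names here are merely illustrative):
\begin{lstlisting}[language=Haskell]
-- The traversal is captured once by map; only the action varies.
doubleAll :: [Int] -> [Int]
doubleAll = map (* 2)

negateAll :: [Int] -> [Int]
negateAll = map negate
\end{lstlisting}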
Abstractions have the purpose of improving the efficiency of a programmer by avoiding duplication. Another purpose is to allow reasoning about higher level concepts -- to aid understanding and to reduce programming errors, but also to enable automatic program optimizations. Prominent examples of the latter are compilers for domain-specific languages.
In this work, an abstraction for a class of dynamic programming algorithms is provided. Put differently, a formalism for specifying and solving certain combinatorial optimization problems is given.
Although not constrained to it, the driver of this work was the field of bioinformatics. Dynamic programming can be considered the bread and butter of this field. Writing and implementing dynamic programming equations which relate smaller problems to bigger problems is often regarded as tedious and error-prone work. One particular source of programming errors originates from the vast amount of indices that are used for reading from and writing to dynamic programming tables and for checking looping and other conditions.
Many problems in bioinformatics have been identified as parsing problems \citep{searls_linguistic_1997}, and hence several grammar formalisms were used or newly introduced with the goal of reducing the amount of work necessary to produce error-free dynamic programming algorithms.
A common technique when only simple scoring schemes are needed is the use of \emph{weighted context-free grammars} in conjunction with \emph{semiring parsing} \citep{goodman_semiring_1999}. This approach was employed several times using a stochastic scoring for the problems of RNA alignment and secondary structure prediction \citep{brown_small_2000,eddy_rna_1994,sakakibara_recent_1994} -- although to the knowledge of the author no implementation exists which allows using this abstraction in a generic way.
For more sophisticated scoring schemes, the two similar techniques \emph{S-attributed context-free grammars} \citep{lefebvre_optimized_1995} and \emph{algebraic dynamic programming} \citep{giegerich_discipline_2004} can be used. The abstractions that both techniques provide were made available in the form of \glspl{DSL} (see glossary) -- for \gls{ADP} both within existing programming languages (Haskell and Scala) and externally via dedicated compilers, and for S-attributed context-free grammars only by the latter.
S-attributed context-free grammars and \gls{ADP} are restricted to optimization problems whose search spaces can be described by context-free grammars. Several problems in bioinformatics have been identified which fall outside this class, most prominently the problem of RNA pseudoknotted secondary structure prediction \citep{nebel_algebraic_2012}. For the latter problem, the search space includes elements equivalent to words like $\ts{a}^m \ts{b}^n \ts{c}^m \ts{d}^n$ -- violating the pumping lemma for context-free languages \citep[p. 279]{hopcroft_introduction_2001}. The general problem of RNA pseudoknotted secondary structure prediction is NP-complete. Therefore, its search space is often approximated to limit computational complexity.
As a logical consequence, some of the existing techniques were adapted so that search spaces can be used which require some form of context-sensitivity for their generation. In particular, \emph{weighted multiple context-free grammars}\footnote{To avoid confusion: The word ``multiple'' does not refer to multiple grammars but multiple dimensions inside a single grammar.} coupled with semiring parsing \citep{kato_rna_2006,kato_grammatical_2009}, and \emph{multi-tape S-attributed context-free grammars} \citep{lefebvre_grammar-based_1996} serve this purpose. While the former benefits from the higher generative power of \glspl{MCFG}, the latter allows more sophisticated scoring schemes. To the knowledge of the author, only the latter was generically implemented -- again, as an external \gls{DSL} with corresponding compiler.
\looseness+1 The class of languages generated by \glspl{MCFG} -- called \emph{\glspl{MCFL}} -- has been identified \citep{seki_multiple_1991,kato_generative_2005} as being a proper superset of several other classes that are larger than the class of context-free languages. Like \glspl{MCFL}, these other classes are proper subsets of the context-sensitive languages, designed with the goal of being parseable in polynomial time. On the other hand, several grammar formalisms were proven to generate a language class equivalent to \glspl{MCFL} \citep[cf.][]{kallmeyer_parsing_2010}. Although an even more powerful polynomial-time language class exists -- generated by \emph{range concatenation grammars} \citep{boullier_proposal_1998} and equal to PTIME -- there is no evidence yet, to the knowledge of the author, that this added expressivity is actually needed for practical problems in the world of dynamic programming, particularly in bioinformatics.
\glspl{MCFG} introduce the notion of nonterminal dimension to conveniently adjust the level of context-sensitivity that is required. An \gls{MCFG} with a maximal nonterminal dimension $k$ is called a $k$-\gls{MCFG}. In bioinformatics, \glspl{MCFG} have so far been used, to the knowledge of the author, exclusively for describing the search spaces of RNA pseudoknotted secondary structure \citep[e.g.][]{kato_rna_2006,reidys_topology_2011} and RNA-RNA interaction structure problems \citep[e.g.][]{kato_grammatical_2009,andersen_topology_2012}, where the latter problem is closely related to the former. In all cases, 2-\glspl{MCFG} have been used.
This suggests that grammars generating the class of \glspl{MCFL} provide the right amount of context-sensitivity for similar optimization problems in this field.
At the time of writing, there is a lack of generic implementations to solve combinatorial optimization problems with search spaces generated by \glspl{MCFG} or equivalent formalisms. The consequence is that custom implementations are written which are based on the dynamic programming equations of the given problem \citep[e.g.][]{reidys_topology_2011}. Alternatively, existing solutions such as ADP are extended ad-hoc by allowing low-level access to the dynamic programming tables \citep[e.g.][]{reeder_design_2004}. Both approaches hinder reusability and reintroduce the main source of programming errors that was eliminated by using grammars in the first place -- indices.
In this work, \gls{ADP} is redefined for \glspl{MCFL} to support search spaces that can be described by \glspl{MCFG}.
Together with this new type of abstraction, a prototype implementation supporting 2-MCFG search spaces and realized as a domain-specific language within Haskell is described and made available\footnote{\url{http://hackage.haskell.org/package/adp-multi}}. The RNA pseudoknotted secondary structure prediction problem is used to demonstrate the approach.
\looseness+1 Apart from the main contribution, also a more rigorous mathematical definition of \gls{ADP} is given. In particular -- compared to its original definition -- standard concepts from the field of universal algebra are consistently used, and notations borrowed from programming languages are removed from the definitions.
\linespread{1}\selectfont
\pagebreak
\section{Problem Statement}
In this work, two main problems shall be solved. The first problem is to redefine ADP for \glspl{MCFL}, such that search spaces generated by \glspl{MCFG} can be used. This redefinition is further named ADP-MCFL. Part of the redefinition is to also provide (a) a recursive formulation of solutions of ADP-MCFL problem instances showing that dynamic programming can be applied, and (b) a theorem for determining the asymptotic time complexity for a dynamic programming algorithm solving ADP-MCFL problem instances.
An additional goal is that it should be as easy as possible to translate \glspl{MCFG} to the ADP-MCFL formalism -- due to their ongoing use in bioinformatics.
The second problem is to develop an easy-to-use ADP-MCFL prototype supporting 2-MCFG search spaces.
The more similar the syntax for describing problems using the prototype is to the corresponding formal notation, the easier to use it shall be considered.
Apart from being a tool for solving optimization problems, it should also serve the purpose of being an educational resource for future ADP-MCFL implementers. For this, it should be modular, small, and easily understandable. Decreasing the constant runtime factor and reducing the space usage are problems not to be solved in this work.
\section{Related Work}
\label{sec:relwork}
In the general setting of dynamic programming, several formalisms and frameworks have been developed which ease writing dynamic programming algorithms but which do not prevent the use of matrix indices. An overview of those is given in \citet{sauthoff_bellmans_2011-1}.
In this section, the focus instead lies on related work that eliminates indices completely and also allows search spaces which require some form of context-sensitivity for their generation.
\subsection*{Multi-Tape S-Attributed Grammars}
Multi-tape S-attributed (context-free) grammars rely on multi-tape context-free grammars, both introduced in \citet{lefebvre_grammar-based_1996}. While the latter clearly generate a proper superset of context-free and a proper subset of context-sensitive languages, there exists no work which shows the relation to other similar language classes. To the knowledge of the author, neither of the two formalisms has been used in other studies for modelling optimization problems.
Nevertheless, attributed (context-free) grammars are widely known and -- at least academically -- used in compiler construction \citep{dijkstra_architecture_2009}. They are defined with evaluation in mind and can be used within compilers to derive additional values (attributes) for nodes of an abstract syntax tree. It is common in this field that for every input only a single derivation tree exists, that is, the grammars are unambiguous.
In the work of \citet{lefebvre_optimized_1995} for S-attributed context-free grammars (and then later adapted for the multi-tape variant), the grammars -- which for optimization problems are ambiguous -- are augmented with a constraint that defines the optimization objective. This augmentation is not part of the formalism itself but rather exists in the supplied implementation. Compared to \gls{ADP}, little attention was paid to the effects that optimization objectives have and how they can be leveraged to generate standard top-down or bottom-up dynamic programming algorithms. In \citet{lefebvre_optimized_1995,lefebvre_grammar-based_1996} an \emph{agenda-based parsing algorithm} (see glossary) was implemented which showed additional overhead compared to hand-written dynamic programming code.
It should be remarked that the work in this thesis could very likely have been realized using a variant of S-attributed grammars using \glspl{MCFG} as basis. Nevertheless, \gls{ADP} was chosen due to its primary focus on dynamic programming, and its continuous research and usage in bioinformatics for over a decade.
\subsection*{Dyna}
Parsing algorithms can be described in compact form as inference rules \citep{pereira_parsing_1983}. Dyna \citep{eisner_compiling_2005} is a \emph{weighted logic programming language} (see glossary) that allows specifying semiring parsing algorithms using inference rules and then generates agenda-based parsers in C++ from them.
On the one hand, the generality of Dyna makes it possible to implement virtually any parsing algorithm, which makes it an excellent tool for experimentation, particularly in the field of natural language processing from where it originated. On the other hand, generality comes at the price of an agenda-based parser. For example, a stochastic CYK parser can be manually implemented as a standard dynamic programming algorithm with very little runtime overhead, yet in Dyna an agenda-based parser with an integrated garbage collector is generated which increases the constant runtime factor.
At the time of writing, only the first generation of Dyna is available, with the second one currently in development \citep{eisner_dyna:_2011,filardo_flexible_2012}. The former was released in 2005 and only supports semirings as scoring schemes, while the latter lifts this restriction. As arbitrary scoring schemes should be supported in this work, Dyna was not chosen at this time.
Of course, the choice of not using Dyna is unrelated to the possibility of specifying the top-down parser developed in this work using inference rules -- if only for the sake of completeness. After some initial attempts did not lead to satisfactory results, this path was not pursued further. Instead, a higher-level description using recursions is given from which top-down and bottom-up parsing algorithms can be derived.
\subsection*{ADP for Sequences and Trees}
At the time of writing, a yet unpublished but submitted work by \citet{giegerich_modeling_2013} describes a reformulation of \gls{ADP} for optimization problems over sequences and trees. It makes use of a restricted form of term rewriting systems and supports multiple input dimensions. The work focuses on providing a new modelling formalism -- leaving the questions of formal expressivity and implementation open.
At this point, only little can be said about how this approach compares with the work developed here. Some initial attempts at transferring problems from the formalism of \citeauthor{giegerich_modeling_2013} to the one of this work succeeded though. In particular, inputs which are trees could be encoded as terms and further to character sequences which made them usable for standard parsing. The other way around -- transferring problems to the formalism of \citeauthor{giegerich_modeling_2013} -- was not pursued and is left for future work.
\section{Chapter Organization}
The remaining part of this work is organized as follows. \Cref{ch:notion_notation} introduces common mathematical notions and notations, most importantly the concept of multisets. After that, \cref{ch:running_ex} describes a set of problems from the field of bioinformatics which are used throughout this work: RNA secondary structure prediction. In \cref{ch:adp-cfl} the foundation for the main contribution of this work is laid out. The existing formalism of \gls{ADP} is described in detail while also introducing related or required concepts like context-free and regular tree grammars. \Cref{ch:adp-mcfl,ch:adp-multi} present the main contribution of this work by introducing the formalism of \gls{ADP} for \glspl{MCFL} and a corresponding prototypical implementation in Haskell named \emph{adp-multi}. Finally, \cref{ch:concl} gives a summary and provides directions for possible future work.
\chapter{Notions and Notations}
\label{ch:notion_notation}
This chapter describes notions and notations that are used throughout this work. Notations which are used only once or which are very specific are explained in the relevant sections of this work.
\section{Sets}
A set is a collection of distinct objects. Objects in a set are called \emph{elements}. If an object $o$ is an element of a set $S$, one writes $o \in S$. Finite sets with elements $o_1,\ldots,o_n$ are written as $\{o_1,\ldots,o_n\}$. The set without elements is called the \emph{empty set} and is denoted $\{\}$ or $\emptyset$. The number of elements in a set $S$ is denoted as $|S|$ and is called its \emph{cardinality}. The \emph{power set} of a set $S$ is denoted as $\mathbb{P}(S)$ and describes the set of all subsets of $S$, including $S$ and the empty set. As an example, $\mathbb{P}(\{1,2\})$ is equal to $\{\{1,2\},\{1\},\{2\},\emptyset\}$.
The set of natural numbers is denoted as $\mathbb{N}$ and includes 0. The set of real numbers is denoted as $\mathbb{R}$.
A convenient way to describe a set with certain properties is by using \emph{set-builder notation}. It has the form $\{x \mid P(x)\}$ where $P(x)$ is a predicate that every element must satisfy. As an example, $\{x \mid x \in \mathbb{N} \wedge 5 < x < 10\}$ describes the set $\{6,7,8,9\}$. In the example, $\mathbb{N}$ is the \emph{universe of discourse} over which $x$ ranges and can in general be any set. It can (a) be omitted if it is clear from the context, and (b) included in the left part as in $\{x \in \mathbb{N} \mid 5 < x < 10\}$. The latter notation is read as ``the set containing those natural numbers which are greater than 5 and less than 10''. If a function should be applied to each element, this function can be included in the left part: instead of writing $\{y \mid \exists x.(5 < x < 10 \wedge y = \op{f}(x))\}$, one can write $\{\op{f}(x) \mid 5 < x < 10\}$.
\section{Tuples}
An $n$-tuple is a sequence of $n$ elements with $n \geq 0$. Parentheses are used to denote tuples, for example $(5,2,4.6)$. Two tuples are equal if they have the same size and the elements at each position are equal.
Each element of a tuple can be regarded as part of a set. For example, the first and second element of $(5,2,4.6)$ are part of the set of natural numbers, and the last of the set of real numbers. A tuple itself is then part of a set formed from the Cartesian product of those element sets. For example, $(5,2,4.6) \in \mathbb{N} \times \mathbb{N} \times \mathbb{R}$.
\section{Functions}
Functions are used in the usual way. A function symbol $f$ together with its domain $A$ and codomain $B$ is denoted $f: A \to B$. Function arguments are modelled as tuples.
A function $f: A \to B$ is a constant function if $f(x) = f(y)$ for all $x,y \in A$. In this work, constants are modelled as constant functions with the domain $\{()\}$ and are applied with $f()$. Instead of $f() = b$, constants can also be defined as $f = b$. Similarly, if used in the context of function application, constants can be referred to as $f$ instead of $f()$.
\section{Multisets}
A multiset \citep{blizard_multiset_1988} is a collection of objects. It generalizes the concept of sets by allowing elements to appear more than once.
A multiset over a set $S$ can be formally defined as a function from $S$ to $\mathbb{N}$.
A finite multiset $f$ is such a function for which only finitely many $x$ satisfy $f(x) > 0$.
The notation $[e_1,\ldots,e_n]$ is used to distinguish multisets from sets. The empty multiset is denoted $[]$. The number of times an element occurs in a multiset is called its \emph{multiplicity}. If $e$ is an element of a multiset $f$ with multiplicity $n$, one writes $e \in^n f$. If $n$ is greater than zero, one writes $e \in f$. The cardinality $|f|$ of a multiset is the sum of all multiplicities. The union $f \cup g$ of two multisets is the multiset $(f \cup g)(x)=\max \{f(x),g(x)\}$. The additive union $f \uplus g$ of two multisets is the multiset $(f \uplus g)(x)=f(x)+g(x)$.
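For example, $[1,1,2] \cup [1,2,2] = [1,1,2,2]$ and $[1,1,2] \uplus [1,2,2] = [1,1,1,2,2,2]$.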
The \emph{multiset powerset} $\mathbb{P}_m(S)$ of a set $S$ is the set of all multisets on $S$, that is, $\mathbb{P}_m(S) = \{ f \mid f : S \to \mathbb{N} \}$, $\mathbb{P}_m(\emptyset)=\emptyset$.
For example, $\mathbb{P}_m(\{1,2\})=\{ [],[1],[1,1],\ldots,[2],\allowbreak[2,2],\ldots,[1,2],[1,1,2],\ldots,[1,2,2],\ldots\}$. \citep[Def. 75]{kotowski_constructive_2011}
As for sets, a builder notation for multisets is introduced. The first notation of two, here called the \emph{explicit} notation, has the form $[x_{[n]} \mid P(x,n)]$ where $n$ is the multiplicity of an element and $P(x,n)$ is a predicate that every element must satisfy. In some cases, it might be too inconvenient to specify the multiplicities explicitly, in particular when elements from several multisets should be combined into a cross-product with additional conditions. To circumvent this, a second notation, here called \emph{implicit} notation, is introduced and has the form $[x \mid \exists y.P(x,y)]$ where the multiplicity of $x$ is either (a) $|\{y \mid P(x,y)\}|$ if the domain of $y$ is not a multiset, or (b) $|[y \mid P(x,y)]|$ otherwise. In (b), for an arbitrary multiset $f$ the rule $[y \mid y \in f \wedge P(y)]=[y_{[n]} \mid y \in^n f \wedge P(y)]$ is used. As in set-builder notation, a function to be applied to each element can be directly inserted into the left part.
\begin{example}[Multiset-builder notation]
\begin{equation*}
M_1 = \{1,2,3\}, M_2 = [1,3,3], M_3 = [2,2]
\end{equation*}
\begin{align*}
&[x_{[y]} \mid x \in M_1 \wedge y=2x] &&= [1,1,2,2,2,2,3,3,3,3,3,3]\\
&[x_{[2n]} \mid x \in^n M_3] &&= [2,2,2,2]\\
&[x \bmod 2 \mid x \in M_1] &&= [0,1,1]\\
&[(x,y) \mid x \in M_2, y \in M_3] &&= [(1,2),(1,2),(3,2),(3,2),(3,2),(3,2)]
\end{align*}
\end{example}
\section{Conditional Expressions}
An expression whose value depends on some conditions is written in the usual mathematical notation
\[
f(x)= \begin{cases}
x &\text{if $x \geq 0$,}\\
-x &\text{else.}
\end{cases}
\]
For reasons of space-efficiency, sometimes an alternative notation is used which is borrowed from Haskell's guard notation:
\begin{alignat*}{2}
f&(x) \\
&\mid x \geq 0 &&\quad= x\\
&\mid \text{else} &&\quad= -x
\end{alignat*}
The semantics of both notations are the same.
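For reference, the Haskell definition from which this notation is borrowed reads:
\begin{lstlisting}[language=Haskell]
-- Haskell guards corresponding to the conditional expression above
f :: Int -> Int
f x | x >= 0    = x
    | otherwise = -x
\end{lstlisting}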
\chapter{Running Example: RNA Secondary Structure Prediction}
\label{ch:running_ex}
One of the problems solved in \gls{bioinformatics} is the prediction of secondary and tertiary \gls{RNA} structures. This chapter introduces the problem of predicting \gls{RNA} secondary structures which serves as the running example throughout this work. Before motivating this specific topic, a brief description of \gls{RNA} and its three levels of structural abstraction are given, following \citet{meyers_rna_2000}. After that, the secondary structure prediction problem is defined as a mathematical optimization problem.
\section{Introduction and Motivation}
\label{sec:rnaintro}
\Gls{RNA} is a family of molecules that have various biological functions in organisms such as humans. The function of an \gls{RNA} molecule, e.g. translating \gls{DNA} to proteins (see glossary) or performing chemical reactions, depends on its (three-dimensional) structure -- which is why this aspect is of such importance to researchers.
The building blocks of \gls{RNA} and \gls{DNA} are nucleotides, that is, molecules consisting of a sugar, a phosphate group, and one of the four RNA or DNA nucleobases (or short \emph{bases}): adenine (A), guanine (G), and cytosine (C), complemented by uracil (U) in RNA and thymine (T) in DNA. An \gls{RNA} (or \gls{DNA}) molecule is then a sequence of nucleotides. The sugar and phosphate parts of all nucleotides together are called the \emph{backbone} of a sequence. Due to the chemical layout of the nucleotides, the two ends of \gls{RNA} (and \gls{DNA}) sequences can be distinguished and are denoted as $3'$ and $5'$. By convention, single strands of sequences are written in $5' \to 3'$ direction. In the following, three different abstractions of the structure of \gls{RNA} molecules are briefly described: the primary, secondary, and tertiary structure.
\begin{figure}
\centering
\includegraphics[width=1.00\textwidth]{imgs/dna_transcription.pdf}
\caption[Transcription of DNA to RNA]{Transcription of DNA to RNA. The direction $3' \to 5'$ (or $5' \leftarrow 3'$, respectively) denotes the possible transcription direction on the \gls{DNA} strand. (adapted from \url{http://commons.wikimedia.org/wiki/File:DNA_transcription.svg}, public domain)}
\label{fig:transcription}
\end{figure}
\gls{RNA} is produced with the help of \gls{DNA} using an enzyme (see glossary) called \emph{RNA polymerase}, illustrated in \cref{fig:transcription}. In this process, called \emph{transcription}, the \gls{RNA} is produced by forming the backbone bonds between newly created nucleotides. Their identity is determined by the sequence of nucleotides on the \gls{DNA}. Hence, the \gls{RNA} sequence is a faithful ``transcription'' of the \gls{DNA} sequence. More precisely, one of the two strands of the \gls{DNA} double strand is transcribed in direction $3' \to 5'$ such that the \gls{RNA} nucleotides are complements to the \gls{DNA} nucleotides: A $\to$ U, G $\to$ C, C $\to$ G, T $\to$ A. As seen in \cref{fig:transcription}, the transcription process produces reversed nucleotides, which means the $5'$ and $3'$ ends are opposite to those of the \gls{DNA} strand. After finishing the transcription process, the first level of organization can be described as the sequence of bases -- the primary structure. \Cref{fig:primary_struc} shows an example of such a structure.
\begin{figure}
\centering
\begin{tikzpicture}[rna secondary structure]
\begin{scope}[dot bracket]
\node {C};
\node {A};
\node {A};
\node {U};
\node {U};
\node {U};
\node {U};
\node {C};
\node {U};
\node {G};
\node {A};
\node {A};
\node {A};
\node {A};
\node {U};
\node {U};
\node {U};
\node {U};
\node {C};
\node {A};
\node {C};
\end{scope}
\end{tikzpicture}
\caption[Example of a primary structure]{Primary structure of a part of the tobacco etch virus (see \url{http://www.ekevanbatenburg.nl/PKBASE/PKB00279.HTML}).}
\label{fig:primary_struc}
\end{figure}
\begin{figure}
\centering
\begin{tikzpicture}[rna secondary structure]
\begin{scope}[natural]
\node {C};
\node[yshift=10pt,xshift=5pt] {A};
\node[yshift=22pt,xshift=-5pt] {A};
\node[yshift=23pt,xshift=-3pt] {U};
\node[yshift=20pt,xshift=-1pt] {U};
\node[yshift=16pt,xshift=0pt] {U};
\node[yshift=10pt,xshift=5pt] {U};
\node[yshift=-5pt,xshift=5pt] {C};
\node[yshift=-15pt,xshift=0pt] {U};
\node[continue chain=going below] {G};
\node[yshift=-10pt,continue chain=going left] {A};
\node[xshift=-5pt,join=with chain-6 by {basepair edge}] {A};
\node[yshift=-10pt,join=with chain-5 by {basepair edge}] {A};
\node[yshift=-20pt,xshift=5pt,join=with chain-4 by {basepair edge}] {A};
\node[continue chain=going below,join=with chain-3 by {basepair edge}] {U};
\node[yshift=-15pt,continue chain=going right,join=with chain-2 by {basepair edge}] {U};
\node[xshift=5pt] {U};
\node[yshift=10pt,xshift=5pt,join=with chain-11 by {basepair edge}] {U};
\node[yshift=10pt,xshift=5pt,join=with chain-10 by {basepair edge}] {C};
\node[yshift=6pt,xshift=5pt,join=with chain-9 by {basepair edge}] {A};
\node[yshift=-6pt,xshift=5pt] {C};
\end{scope}
\end{tikzpicture}
\caption[Example of a spring-model-style graph representing a secondary structure]{Spring-model-style graph representing the secondary structure belonging to \cref{fig:primary_struc}. Thin edges mark the backbone, while thick edges represent base pairs.}
\label{fig:classical_sec_struc}
\end{figure}
\begin{figure}
\centering
\begin{tikzpicture}[rna secondary structure]
\begin{scope}[linear]
\node {C};
\node {A};
\node {A};
\node {U};
\node {U};
\node {U};
\node {U};
\node {C};
\node {U};
\node {G};
\node {A};
\node[join=with chain-6 by {basepair edge}] {A};
\node[join=with chain-5 by {basepair edge}] {A};
\node[join=with chain-4 by {basepair edge}] {A};
\node[join=with chain-3 by {basepair edge}] {U};
\node[join=with chain-2 by {basepair edge}] {U};
\node {U};
\node[join=with chain-11 by {basepair edge}] {U};
\node[join=with chain-10 by {basepair edge}] {C};
\node[join=with chain-9 by {basepair edge}] {A};
\node {C};
\end{scope}
\end{tikzpicture}
\caption[Example of a linearized graph representing a secondary structure]{(Backbone-)Linearized graph representing the secondary structure belonging to \cref{fig:primary_struc}. As in \cref{fig:classical_sec_struc}, thin edges mark the backbone, while thick edges represent base pairs.}
\label{fig:linear_sec_struc}
\end{figure}
\begin{figure}
\centering
\begin{tikzpicture}[rna secondary structure]
\begin{scope}[dot bracket primary]
\node {C};
\node {A};
\node {A};
\node {U};
\node {U};
\node {U};
\node {U};
\node {C};
\node {U};
\node {G};
\node {A};
\node {A};
\node {A};
\node {A};
\node {U};
\node {U};
\node {U};
\node {U};
\node {C};
\node {A};
\node {C};
\end{scope}
\begin{scope}[dot bracket,yshift=-0.5cm]
\node {.};
\node {(};
\node {(};
\node {(};
\node {(};
\node {(};
\node {.};
\node {.};
\node {[};
\node {[};
\node {[};
\node {)};
\node {)};
\node {)};
\node {)};
\node {)};
\node {.};
\node {]};
\node {]};
\node {]};
\node {.};
\end{scope}
\end{tikzpicture}
\caption[Example of the dot-bracket notation for a secondary structure]{Dot-bracket notation derived from \cref{fig:linear_sec_struc}. Dots are unpaired bases, while matching brackets are base pairs.}
\label{fig:dot_bracket_sec_struc}
\end{figure}
In the salty cytoplasm of a cell -- or in vitro in salty water -- certain bases of the \gls{RNA} chain move together to form hydrogen bonds between each other. A bond is called a \emph{base pair} and can occur between A and U, G and C, or G and U. The second level of organization is therefore the set of base pairs that have formed -- the secondary structure. \Cref{fig:classical_sec_struc,fig:linear_sec_struc,fig:dot_bracket_sec_struc} show common ways for visualizing secondary structures. One distinctive property of a secondary structure is whether it is \emph{pseudoknotted}. A secondary structure is pseudoknotted if there are base pair edges that cross each other when displayed as a linearized graph. \Cref{fig:linear_sec_struc} shows such a structure. The majority of RNA molecules are not pseudoknotted, yet a growing number of pseudoknotted structures have been found which are crucial for biological function \citep[cf.][]{reidys_topology_2011}.
Once the secondary structure has formed and some other conditions like a certain temperature \citep[cf.][]{ignacio_how_1999} are satisfied, the \gls{RNA} molecule transitions to a three-dimensional structure\footnote{In cells, the formation of a three-dimensional structure happens together with forming the secondary structure, as the conditions for both processes are always satisfied.}. Therefore, the third level of organization is the set of all atomic positions -- the tertiary structure.
It should be noted that the secondary and tertiary structures are formed according to chemical thermodynamics, meaning that each possible structure has a certain \emph{free energy} and that the structure formation process converges to those structures that have the \emph{lowest} free energy -- being the energetically \emph{most stable} structures \citep{ignacio_how_1999}.
This section is concluded by briefly describing two related areas of research where secondary structure prediction plays an important role.
As already mentioned, the function of an \gls{RNA} molecule depends on its tertiary structure. A common task is finding recurrent motifs in it -- as identical motifs could lead to identical functions. There are various physical methods to determine the tertiary structure, for example, X-ray crystallography or nuclear magnetic resonance spectroscopy. As the usage of such methods can be expensive and time-consuming, theoretical models have been developed that approximate tertiary structures, often with the help of secondary structures -- called \emph{hierarchical folding}. These models resulted in tools that have been used to describe certain motifs long before they had been validated by physical methods. See \citet[sect. 5.3]{meyers_rna_2000} for some examples.
A relatively new field of research lies in \emph{designing} \gls{RNA} molecules which have certain functions (e.g. cleaving \gls{RNA} virus particles to prevent infections). The problem here is finding an RNA primary structure that folds into a given tertiary (or as approximation secondary) structure. This problem is called \emph{inverse RNA folding} and is solved by enumerating all base pair matching primary structures and choosing the one which folds into a tertiary (secondary) structure with least free energy. For the latter part, existing tertiary (secondary) structure prediction algorithms can be employed. See \citet{andronescu_new_2004} for an introduction.
\section{Formalization as an Optimization Problem}
\label{sec:rna_optimization_problem}
Predicting secondary structures can be formalized as a combinatorial optimization problem. Solving it means enumerating all secondary structures of a search space, rating each structure using a certain free energy model, and returning those with the lowest free energy.
Although ultimately a single problem, secondary structure prediction is a family of optimization problems -- as many models exist that define the search space and scoring functions in certain ways with the goal of achieving better approximations or limiting computational complexity. The following overview provides a textual description of this family.
\begin{description}
\item[Search space] All secondary structures which satisfy certain constraints:
\begin{itemize}
\item Match the given primary structure
\item Allow only G-C, A-U, G-U base pairs
\end{itemize}
\end{description}
Additional constraints further restricting the search space size are often used to lower the computational effort -- while still trying to consider most of the structures occurring in nature \citep{reidys_topology_2011}. Pseudoknotted structures play an important role here. As many \gls{RNA} molecules are in fact not pseudoknotted, the search space is often restricted to pseudoknot-free structures. Considering all pseudoknotted structures and using nontrivial\footnote{A trivial scoring function is the number of base pairs -- equivalent to the \emph{maximum weighted matching} graph problem which is solvable in polynomial time. Nontrivial scoring functions take the context of base pairs into account, e.g. stacks of base pairs are scored higher.} scoring functions leads to optimization problems that are NP-complete \citep{lyngso_rna_2000}. In this work, pseudoknot-free structures but also a restricted class of pseudoknotted structures, here named $C2u$ structures, are considered. It should be mentioned that, by convention, classes of pseudoknotted structures always include all pseudoknot-free structures as well. The restricted class of pseudoknotted structures that is used in this work is defined in the following. Beginning with the class of unrestricted structures, a series of subsequent restrictions eventually leads to the class of $C2u$ structures.
Let $A = a_1 a_2 \ldots a_n$ be an RNA primary structure. Then, all base pairs that can theoretically form according to the base pairing rules are expressed as a set of pairs of indices:
\begin{equation}
P = \{ (i,j) \mid 1 \leq i < j \leq n \wedge \text{$(a_i,a_j)$ is a base pair} \}.
\end{equation}
Given a set $s \in \mathbb{P}(P)$ of pairs of indices, $\op{degree}(s,i)$ denotes how often the index $i$ occurs in $s$.
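That is, $\op{degree}(s,i) = |\{ (k,l) \in s \mid k = i \vee l = i \}|$.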
The class of unrestricted structures contains all pseudoknotted and pseudoknot-free structures and is defined as the set of all combinations of base pairs such that each base takes part in at most one base pairing:
\begin{equation}
\label{eq:class_u}
S_U = \{ s \in \mathbb{P}(P) \mid \forall (i,j) \in s. \op{degree}(s,i)=\op{degree}(s,j)=1 \}.
\end{equation}
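For example, for the primary structure $A = \ts{gcau}$, the base pairing rules yield $P = \{(1,2),(1,4),(3,4)\}$, and thus $S_U = \{\emptyset,\{(1,2)\},\{(1,4)\},\{(3,4)\},\{(1,2),(3,4)\}\}$ -- the combinations $\{(1,2),(1,4)\}$ and $\{(1,4),(3,4)\}$ are excluded because one base would take part in two base pairings.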
In order to simplify further definitions, the concept of a \emph{shadow} is borrowed from \citet{reidys_topology_2011}. The shadow of any secondary structure displayed as a linearized graph is obtained by removing all unpaired bases and noncrossing base pairs, and collapsing all uninterrupted stacks of base pairs into a single base pair. In the following, the function $\op{shadow} : \mathbb{P}(\mathbb{N}^2) \to \mathbb{P}(\mathbb{N}^2)$ is used as the operation transforming a secondary structure into its shadow. See \cref{fig:shadow} for an example. Using shadows, the class of pseudoknot-free structures can be defined as
\begin{equation}
S_{PKF} = \{ s \in S_U \mid \op{shadow}(s) = \emptyset \}.
\end{equation}
\begin{figure}
\centering
\begin{tikzpicture}[rna secondary structure,transform shape,scale=0.55]
\begin{scope}[linear]
\node {};
\node {};
\node {};
\node {};
\node[join=with chain-3 by {basepair edge}] {};
\node[join=with chain-4 by {basepair edge}] {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node[join=with chain-7 by {basepair edge}] {};
\node[join=with chain-2 by {basepair edge}] {};
\node[join=with chain-1 by {basepair edge}] {};
\node {};
\node {};
\node[join=with chain-19 by {basepair edge}] {};
\node[join=with chain-18 by {basepair edge}] {};
\node {};
\node {};
\node[join=with chain-13 by {basepair edge}] {};
\node[join=with chain-12 by {basepair edge}] {};
\node[join=with chain-11 by {basepair edge}] {};
\end{scope}
\begin{scope}[linear,xshift=20cm]
\node {};
\node {};
\node {};
\node[join=with chain-2 by {basepair edge}] {};
\node[join=with chain-3 by {basepair edge}] {};
\node {};
\node {};
\node[join=with chain-6 by {basepair edge}] {};
\node[join=with chain-1 by {basepair edge}] {};
\node[join=with chain-7 by {basepair edge}] {};
\end{scope}
\end{tikzpicture}
\caption[Example of a secondary structure and its shadow]{Example of a secondary structure (left) and its shadow (right).}
\label{fig:shadow}
\end{figure}
Before introducing the next class of secondary structures, the notion of \emph{connected components} is introduced \citep{reidys_topology_2011}. A set $C$ of base pairs is called a connected component if for any two base pairs $c,c' \in C$ there is a sequence of base pairs $c c_1 \ldots c_n c'$ with $c_1,\ldots,c_n \in C$ and $n \geq 0$ such that consecutive base pairs cross each other when displayed as a linearized graph. Every base pair in a secondary structure or a shadow belongs to exactly one such component. Applied to the secondary structure in \cref{fig:shadow}, there are four such components (remember that each uncrossed base pair counts as a component), while its shadow only contains two components. Given a secondary structure or shadow, the function $\op{connectedComponents} : \mathbb{P}(\mathbb{N}^2) \to \mathbb{P}(\mathbb{P}(\mathbb{N}^2))$ returns those components.
The next class of secondary structures -- here named $C2$ -- is now introduced. $C2$ structures are those where the shadows of all connected components are only allowed to be the simplest ones -- those having at most two base pairs:
\begin{equation}
S_{C2} = \{ s \in S_U \mid \forall c \in \op{connectedComponents}(s).|\op{shadow}(c)| \leq 2 \}.
\end{equation}
The secondary structure in \cref{fig:shadow} is an example of a $C2$ structure. The connected components of $C2$ structures are either single uncrossed base pairs or two groups of base pairs crossing each other. The functions $\op{leftGroup}$ and $\op{rightGroup}$ with signature $\mathbb{P}(\mathbb{N}^2) \to \mathbb{P}(\mathbb{N}^2)$ access the base pairs of those two groups of the latter type of components:
\begin{align}
\op{leftGroup}(s) = \{ (i,j) \in s \mid \exists (k,l) \in s. i<k<j<l \}\\
\op{rightGroup}(s) = \{ (k,l) \in s \mid \exists (i,j) \in s. i<k<j<l \}
\end{align}
Finally, the class $C2u$ is defined whose structures only contain connected components that are either single uncrossed base pairs or two groups of uninterrupted stacks of base pairs:
\begin{equation}
S_{C2u} = \left\{ s \in S_{C2} \;\middle\vert\;
\begin{aligned}
&\forall c \in \op{connectedComponents}(s).\\
& |c| \geq 2 \to\\
&\forall g \in \{ \op{leftGroup}(c),\op{rightGroup}(c) \}.\\
& \max_{(i,j) \in g} i - \min_{(i,j) \in g} i
= |g|-1 =
\max_{(i,j) \in g} j - \min_{(i,j) \in g} j
\end{aligned}
\right\}
\end{equation}
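For instance, the connected component $\{(1,10),(2,9),(5,14),(6,13)\}$ satisfies this condition: its left group $\{(1,10),(2,9)\}$ and its right group $\{(5,14),(6,13)\}$ are both uninterrupted stacks, so $\max i - \min i = 1 = |g|-1 = \max j - \min j$ holds for each group $g$.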
\begin{figure}
\centering
\begin{tikzpicture}[rna secondary structure,transform shape,scale=0.55]
\begin{scope}[linear]
\node {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node[join=with chain-6 by {basepair edge}] {};
\node[join=with chain-7 by {basepair edge}] {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node {};
\node[join=with chain-14 by {basepair edge}] {};
\node[join=with chain-16 by {basepair edge}] {};
\node[join=with chain-15 by {basepair edge}] {};
\node[join=with chain-4 by {basepair edge}] {};
\node[join=with chain-3 by {basepair edge}] {};
\node[join=with chain-2 by {basepair edge}] {};
\node[join=with chain-1 by {basepair edge}] {};
\node {};
\node {};
\node {};
\node {};
\node[join=with chain-25 by {basepair edge}] {};
\node {};
\node[join=with chain-24 by {basepair edge}] {};
\node[join=with chain-13 by {basepair edge}] {};
\node[join=with chain-12 by {basepair edge}] {};
\node[join=with chain-11 by {basepair edge}] {};
\end{scope}
\end{tikzpicture}
\caption[Example of a $C2u$ secondary structure]{Example of a $C2u$ secondary structure.}
\label{fig:c2u}
\end{figure}
\Cref{fig:c2u} as well as the earlier \cref{fig:linear_sec_struc} represent members of this class. The class of $C2u$ structures is a superset of the class of \emph{canonized simple recursive pseudoknot} structures from \citet{reeder_design_2004}\footnote{In fact, $C2u$ structures equal \emph{simple recursive pseudoknot} structures with only the first canonization rule from \citet{reeder_design_2004} applied.}.
\begin{description}
\item[Scoring function] Each candidate of the search space is scored, e.g. by:
\begin{itemize}
\item Thermodynamic models \citep{zuker_optimal_1981,mathews_expanded_1999}
\item Number of base pairs \citep{nussinov_algorithms_1978}
\end{itemize}
\end{description}
The scoring function is the heart of the optimization problem and determines the accuracy of the solution. Free energy is modelled as a sum of contributions for small features in a structure. Giving weight to base pairs was the first and simplest model developed \citep{nussinov_algorithms_1978}, using the approximation that free energy gets lower when the number of base pairs gets higher. Newer, more realistic models, called thermodynamic models, take certain motifs into account and actually calculate an approximation of the real free energy. Throughout this work, the number of base pairs is used as scoring function. Although no longer used for biological research, this method is easily understandable and known to anyone developing secondary structure prediction algorithms.
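To make base pair counting concrete, the following Haskell sketch computes the maximal number of base pairs of a primary structure via the classic recurrence of \citet{nussinov_algorithms_1978}. It is purely illustrative: the function names and the tabulation via a lazily memoized array are choices of this sketch, not of any of the implementations discussed in this work.
\begin{lstlisting}[language=Haskell]
import Data.Array

-- Illustrative sketch: maximal number of base pairs for a primary
-- structure over the alphabet {'a','g','c','u'}.
canPair :: Char -> Char -> Bool
canPair x y = (x, y) `elem`
  [('a','u'), ('u','a'), ('g','c'), ('c','g'), ('g','u'), ('u','g')]

maxBasePairs :: String -> Int
maxBasePairs [] = 0
maxBasePairs s  = table ! (1, n)
  where
    n = length s
    a = listArray (1, n) s
    -- Lazily memoized table of scores for all subwords a_i ... a_j.
    table = array ((1, 1), (n, n))
      [ ((i, j), score i j) | i <- [1 .. n], j <- [1 .. n] ]
    score i j
      | i >= j    = 0
      | otherwise = maximum $
             [ table ! (i + 1, j)                    -- a_i left unpaired
             , table ! (i, j - 1) ]                  -- a_j left unpaired
          ++ [ table ! (i + 1, j - 1) + 1            -- a_i pairs with a_j
             | canPair (a ! i) (a ! j) ]
          ++ [ table ! (i, k) + table ! (k + 1, j)   -- split into two parts
             | k <- [i + 1 .. j - 1] ]
\end{lstlisting}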
\pagebreak
\begin{description}
\item[Solution] Finding those candidates whose score is minimal/maximal, e.g.:
\begin{itemize}
\item Minimization if scoring function returns free energy
\item Maximization if scoring function returns number of base pairs
\end{itemize}
\end{description}
The solution of a secondary structure prediction problem is not always as simple as above. Due to the approximations used, biologists need to analyse suboptimal candidates too. Given a rated and ordered list of candidates, the solution might consist of the first ten candidates. In this work, only the simple types of solutions are considered -- as the extension to complex ones is not specific to the new approach of this thesis and directly carries over from existing work by \citet{sauthoff_bellmans_2011-1}.
\chapter{\texorpdfstring{Algebraic Dynamic Programming for Context-Free Languages}{ADP-CFL}}
\label{ch:adp-cfl}
\Gls{ADP} was introduced by \citet{giegerich_discipline_2004} and described as a discipline of dynamic programming over sequence data. It uses grammars and algebras as higher level abstractions to circumvent writing dynamic programming equations and corresponding algorithms while providing transformation rules to automatically derive both artefacts again. Working only on sequences, it can also be described as a way of (context-free) parsing over algebras (cf. semiring parsing by \citet{goodman_semiring_1999}).
\Gls{ADP} has been used primarily in the world of \gls{bioinformatics} to simplify writing software like \gls{RNA} secondary structure prediction tools \citep{reeder_design_2004,giegerich_abstract_2004,theis_prediction_2010}. Several software libraries exist that implement \gls{ADP} and thus provide a way to write grammars and algebras in source code. Depending on the implementation, the grammars can either be directly executed (interpreters) or are translated into dynamic programming equations which are further translated into a target language like C (compilers). At the time of writing, there exist five known implementations of \gls{ADP}: two interpreters (\emph{Haskell-ADP} by \citet{giegerich_discipline_2004}, and \emph{ADPfusion} by \citet{honer_zu_siederdissen_sneaking_2012}), two compilers (\emph{ADPC} by \citet{steffen_compiling_2006}, and \emph{GAPC} by \citet{sauthoff_bellmans_2011}), and a run-time code generator and compiler (\emph{DynaProg} by \citet{coppey_dynaprog_2012}).
This chapter introduces the concepts of \gls{ADP} and serves as the base for the following chapter where \gls{ADP} is redefined. To distinguish between the original definition of \gls{ADP} and its redefinition, the former is further named \gls{ADP-CFL}, and the latter \gls{ADP-MCFL}, designating the class of formal languages that each formalism is able to handle.
\section{Context-Free Grammars}
\label{sec:cfgs}
ADP-CFL has a direct correspondence to \glspl{CFG}. As \cref{sec:yield_parsing} details, the class of \emph{yield languages} that ADP-CFL recognizes is equivalent to the class of languages generated by \glspl{CFG}. In addition, it is trivial to transform a \gls{CFG} to the grammar form used in ADP-CFL (regular tree grammars) and vice versa -- the only difference being that \glspl{CFG} only generate derivation trees while ADP-CFL, imprecisely speaking, also evaluates each tree and aggregates their values. In order to have a common understanding of \glspl{CFG}, a short introduction follows.
\begin{definition}[Alphabet, words]
\label{def:alphabetwords}
An \emph{alphabet} $\mathcal{A}$ is a finite set of symbols. Sequences of symbols are called \emph{words}. The symbol $\epsilon$, which must not be part of an alphabet, denotes the empty word.
The set of words with length $n$ over an alphabet $\mathcal{A}$ is defined by $\mathcal{A}^0 = \{\epsilon\}$, $\mathcal{A}^1 = \mathcal{A}$, and $\mathcal{A}^{n+1} = \{ ax \mid a \in \mathcal{A}, x \in \mathcal{A}^n \}$. The set of words with arbitrary lengths is $\mathcal{A}^* = \bigcup_{n \geq 0} \mathcal{A}^n$.
\end{definition}
A \gls{CFG} -- developed by \citet{chomsky_certain_1959} -- can be described \citep[sect. 6.1]{kastens_modellierung:_2008} as a tuple $\mathcal{G} = (V, \mathcal{A}, Z, P)$ where $V$ is an alphabet of nonterminal symbols, $\mathcal{A}$ an alphabet of terminal symbols disjoint from $V$, $Z \in V$ the start symbol, and $P$ a finite set of productions of the form $v \to r$ with $v \in V$ and $r \in (\mathcal{A} \cup V)^*$. By convention, multiple productions $v \to r_1, \ldots, v \to r_n$ with the same left-hand side can be written in compact form as $v \to r_1 \mid \ldots \mid r_n$.
Given a \gls{CFG} $\mathcal{G} = (V, \mathcal{A}, Z, P)$, then a nonterminal $v \in V$ in an arbitrary word $m = a v c \in (V \cup \mathcal{A})^*$ can be replaced by the right-hand side of a production $v \to b$ leading to $n = a b c \in (V \cup \mathcal{A})^*$. This is called a \emph{derivation step} and is written as $m \Rightarrow_\mathcal{G} n$. For any $m,n \in (V \cup \mathcal{A})^*$, $m \Rightarrow^*_\mathcal{G} n$ is called a \emph{derivation} if $m=n$ or if derivation steps $m \Rightarrow_\mathcal{G} d_1 \Rightarrow_\mathcal{G} \ldots \Rightarrow_\mathcal{G} d_k \Rightarrow_\mathcal{G} n$ exist for $k \geq 0$. Thus, the derivation relation $\Rightarrow^*_\mathcal{G}$ is the reflexive transitive closure of $\Rightarrow_\mathcal{G}$. When clear from the context, the subscript $\mathcal{G}$ can be omitted from $\Rightarrow_\mathcal{G}$ and $\Rightarrow^*_\mathcal{G}$.
\begin{definition}[Formal language]
\label{def:formallanguage}
A \emph{formal language}, or short \emph{language} in the context of words, is a set of words.
\end{definition}
\glsreset{CFL}
The language of a \gls{CFG} $\mathcal{G} = (V, \mathcal{A}, Z, P)$ is the set $\mathcal{L}(\mathcal{G}) = \{ w \in \mathcal{A}^* \mid Z \Rightarrow^* w \}$. A language $\mathcal{L}$ is said to be a \gls{CFL} if a \gls{CFG} $\mathcal{G}$ exists such that $\mathcal{L} = \mathcal{L}(\mathcal{G})$.
\begin{definition}[Tree]
\label{def:tree}
A \emph{tree} is a cycle-free \emph{directed graph} (see glossary) with a designated \emph{root} node, such that (a) only the root has in-degree 0, (b) every node is accessible from the root, and (c) all nodes except the root have in-degree 1.
Nodes with out-degree 0 are called \emph{leaves}, all others \emph{inner nodes}.
A tree is \emph{ordered} if the set of children of each node is assigned a total order.
A tree is \emph{labelled} if each node is assigned a label.
\end{definition}
\begin{definition}[Derivation tree of \glspl{CFG}]
\label{def:derivtree_cfg}
An ordered tree $T$ with labels $m : T \to \mathcal{A} \cup \{\epsilon\} \cup V$ is a derivation tree for a \gls{CFG} $\mathcal{G}=(V, \mathcal{A}, Z, P)$, if
\begin{itemize}[noitemsep]
\item every inner node $k$ of $T$ satisfies $m(k) \in V$,
\item every leaf $l$ of $T$ satisfies $m(l) \in \mathcal{A} \cup \{\epsilon\}$,
\item the root $r$ of $T$ satisfies $m(r)=Z$,
\item every inner node $k$ of $T$ with children $k_1,\ldots,k_n$ satisfies\\ $m(k) \to m(k_1) \ldots m(k_n) \in P$, and
\item every inner node $k$ of $T$ with a single child $k_1$ with $m(k_1)=\epsilon$ satisfies $m(k) \to \epsilon \in P$.\qedhere
\end{itemize}
\end{definition}
A derivation tree can represent multiple derivations, depending on the order in which the derivation steps are applied. Therefore, a derivation tree abstracts from the order in which it was constructed. A \gls{CFG} is called \emph{ambiguous} if multiple derivation trees exist that yield the same word.
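As a minimal Haskell sketch (again with illustrative names), derivation trees can be modelled as ordered labelled trees; the derived word is then the tree's \emph{frontier}, the left-to-right concatenation of its terminal leaves:
\begin{verbatim}
-- Labels of derivation trees: nonterminals, terminals, or the empty word.
data Label = NT String | Term Char | Eps deriving (Eq, Show)

-- Ordered labelled trees: a label and an ordered list of children.
data DTree = Node Label [DTree] deriving (Eq, Show)

-- The word a derivation tree yields: its terminal leaves, concatenated.
frontier :: DTree -> String
frontier (Node (Term c) _) = [c]
frontier (Node Eps      _) = ""
frontier (Node _       ts) = concatMap frontier ts
\end{verbatim}
In these terms, a \gls{CFG} is ambiguous if two distinct trees in its tree language have the same frontier.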
\begin{definition}[Formal tree language]
\label{def:formaltreelanguage}
A \emph{formal tree language} -- or simply \emph{language} in the context of trees -- is a set of trees.
\end{definition}
The language $\mathcal{L}_T(\mathcal{G})$ of a \gls{CFG} $\mathcal{G}= (V, \mathcal{A}, Z, P)$ is the set of all derivation trees that can be created from derivations $Z \Rightarrow^* w$ with any $w \in \mathcal{A}^*$. The set of derivation trees that can be created from derivations $Z \Rightarrow^* w$ with a given $w \in \mathcal{A}^*$ is denoted $\mathcal{Q}_\mathcal{G}(w)$.
\begin{example}[\gls{CFG} for pseudoknot-free \gls{RNA} secondary structures]
\label{ex:cfg_pkf}
When building grammars for \gls{RNA} secondary structures, one has to take the exact opposite approach to that of \autoref{sec:rna_optimization_problem}. In that section, the sets of structures were constructed by gradually restricting the search space. Here, the search space -- which equals the set of possible derivation trees -- has to be built from the ground up by defining exactly those productions that \emph{generate} the desired derivation trees.
When creating grammars for \gls{RNA} secondary structures, it is often easier to think in terms of dot-bracket notation words. Then, the search space is the set of derivable words instead of the set of derivation trees. The dot-bracket grammar that recognizes all pseudoknot-free secondary structures is readily given by $\mathcal{G}_{PKF-DB}=(\{Z\},\{\ts{(},\ts{)},\ts{.}\},Z,P_{PKF-DB})$ with
\[ P_{PKF-DB}=\{Z \to \ts{.}Z \mid \ts{(}Z\ts{)}Z \mid \epsilon\}. \]
For each word in $\mathcal{L}(\mathcal{G}_{PKF-DB})$, exactly one derivation tree exists. The derivation trees for \ts{.....} and \ts{(.())} are shown in the following:
\begin{spreadTrees}
% ..... -> Z(.,Z(.,Z(.,Z(.,Z(.,Z(\epsilon))))))
\Tree [.$Z$ \ts{.} [.$Z$ \ts{.} [.$Z$ \ts{.} [.$Z$ \ts{.} [.$Z$ \ts{.} [.$Z$ $\epsilon$ ] ] ] ] ] ]
% (.()) -> Z('(',Z(.,Z('(',Z(\epsilon),')',Z(\epsilon))),')',Z(\epsilon))
\Tree [.$Z$ \ts{(} [.$Z$ \ts{.} [.$Z$ \ts{(} [.$Z$ $\epsilon$ ] \ts{)} [.$Z$ $\epsilon$ ] ] ] \ts{)} [.$Z$ $\epsilon$ ] ]
\end{spreadTrees}
To give an example of a derivation, one possible derivation corresponding to the right derivation tree is given:
\begin{equation*}
Z \Rightarrow
\ts{(} Z \ts{)} Z \Rightarrow
\ts{(} \ts{.} Z \ts{)} Z \Rightarrow
\ts{(} \ts{.} \ts{(} Z \ts{)} Z \ts{)} Z \Rightarrow
\ts{(} \ts{.} \ts{(} \ts{)} Z \ts{)} Z \Rightarrow
\ts{(} \ts{.} \ts{(} \ts{)} \ts{)} Z \Rightarrow
\ts{(.())}
\end{equation*}
This (unambiguous) grammar merely recognizes secondary structures instead of generating all such structures from a given primary structure. The transformation to a suitable (ambiguous) grammar, however, is easy: productions containing a dot have to be expanded into multiple productions where the dot is replaced by each base, and productions containing a pair of brackets have to be expanded into productions with base pairs. This transformation has been applied in the grammar $\mathcal{G}_{PKF'} = (\{Z\}, \mathcal{A}_{RNA},Z,P_{PKF'})$ with $\mathcal{A}_{RNA} = \{\ts{a},\ts{g},\ts{c},\ts{u}\}$:
\begin{align*}
P_{PKF'} = &\{&&\\
&&Z &\to \ts{a}Z \mid \ts{u}Z \mid \ts{g}Z \mid \ts{c}Z \mid \ts{a}Z\ts{u}Z \mid \ts{u}Z\ts{a}Z \mid\\
&&&\mathrel{\phantom{\to}} \ts{g}Z\ts{c}Z \mid \ts{c}Z\ts{g}Z \mid \ts{g}Z\ts{u}Z \mid \ts{u}Z\ts{g}Z \mid \epsilon\\
&\}.&&
\end{align*}
Now, a single word can have multiple derivation trees. As an example, the following two trees are derived from the word \ts{agcgu}. Eight other derivation trees for \ts{agcgu} exist.
\begin{spreadTrees}
% ..... -> Z(a,Z(g,Z(c,Z(g,Z(u,Z(\epsilon))))))
\Tree [.$Z$ \ts{a} [.$Z$ \ts{g} [.$Z$ \ts{c} [.$Z$ \ts{g} [.$Z$ \ts{u} [.$Z$ $\epsilon$ ] ] ] ] ] ]
% (.()) -> Z(a,Z(g,Z(c,Z(\epsilon),g,Z(\epsilon))),u,Z(\epsilon))
\Tree [.$Z$ \ts{a} [.$Z$ \ts{g} [.$Z$ \ts{c} [.$Z$ $\epsilon$ ] \ts{g} [.$Z$ $\epsilon$ ] ] ] \ts{u} [.$Z$ $\epsilon$ ] ]
\end{spreadTrees}
Again, to give an example of a derivation, one possible derivation corresponding to the right derivation tree is given:
\begin{equation*}
Z \Rightarrow
\ts{a} Z \ts{u} Z \Rightarrow
\ts{a} \ts{g} Z \ts{u} Z \Rightarrow
\ts{a} \ts{g} \ts{c} Z \ts{g} Z \ts{u} Z \Rightarrow
\ts{a} \ts{g} \ts{c} \ts{g} Z \ts{u} Z \Rightarrow
\ts{a} \ts{g} \ts{c} \ts{g} \ts{u} Z \Rightarrow
\ts{agcgu}
\end{equation*}
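The count of ten derivation trees can be verified mechanically. The following Haskell sketch -- a naive recursion without memoization, using hypothetical names, sufficient for short words -- counts the derivation trees of $\mathcal{G}_{PKF'}$ for a given word:
\begin{verbatim}
-- Whether two bases may form a pair in G_PKF'.
pairable :: Char -> Char -> Bool
pairable x y = [x, y] `elem` ["au", "ua", "gc", "cg", "gu", "ug"]

-- Number of derivation trees of G_PKF' for a word.
trees :: String -> Integer
trees []      = 1                          -- Z -> eps
trees (b : w) =
    trees w                                -- Z -> bZ (b left unpaired)
  + sum [ trees (take i w)                 -- Z -> b Z b' Z
            * trees (drop (i + 1) w)       --   (b pairs with w !! i)
        | i <- [0 .. length w - 1]
        , pairable b (w !! i) ]
\end{verbatim}
Indeed, \verb|trees "agcgu"| evaluates to \verb|10|.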
Introducing new nonterminals for base pairs and bases, the final grammar which is further used is given by $\mathcal{G}_{PKF} = (\{Z,P,B\}, \mathcal{A}_{RNA},Z,P_{PKF})$ with
\begin{align*}
P_{PKF} = &\{&&\\
&&Z &\to BZ \mid PZ \mid \epsilon,\\
&&P &\to \ts{a}Z\ts{u} \mid \ts{u}Z\ts{a} \mid \ts{g}Z\ts{c} \mid \ts{c}Z\ts{g} \mid \ts{g}Z\ts{u} \mid \ts{u}Z\ts{g},\\
&&B &\to \ts{a} \mid \ts{u} \mid \ts{g} \mid \ts{c}\\
&\}.&&
\end{align*}
Of course, this whole transformation process would be pointless if the secondary structure that is hidden somewhere in the derivation tree -- and which was explicit in the dot-bracket notation -- could not be reconstructed. An easy way is to regard derivation trees as terms and assign each nonterminal a function that constructs a part of the secondary structure as a dot-bracket notation word:
\begin{align*}
Z(x,y) &= xy\\
Z(\epsilon) &= \epsilon\\
P(b_1,z,b_2) &= \ts{(}z\ts{)}\\
B(x) &= \ts{.} \qedhere
\end{align*}
\end{example}
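These interpretation functions translate directly into executable code. In the following Haskell sketch (hypothetical constructor names), derivation trees of $\mathcal{G}_{PKF}$ are modelled as an algebraic data type and the four equations above are implemented verbatim:
\begin{verbatim}
-- Derivation trees of G_PKF as terms.
data PKFTree = Z2 PKFTree PKFTree    -- Z -> BZ | PZ
             | Z0                    -- Z -> eps
             | Pr Char PKFTree Char  -- P -> b1 Z b2
             | Ba Char               -- B -> b

-- Z(x,y) = xy, Z(eps) = eps, P(b1,z,b2) = (z), B(x) = .
dotBracket :: PKFTree -> String
dotBracket (Z2 x y)   = dotBracket x ++ dotBracket y
dotBracket Z0         = ""
dotBracket (Pr _ z _) = "(" ++ dotBracket z ++ ")"
dotBracket (Ba _)     = "."
\end{verbatim}
For instance, \verb|dotBracket (Z2 (Pr 'a' (Z2 (Ba 'g') (Z2 (Pr 'c' Z0 'g') Z0)) 'u') Z0)| returns \verb|"(.())"|.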
\section{Regular Tree Grammars}
\label{sec:regular_tree_grammars}
The main component of ADP-CFL is a class of grammars called \emph{regular tree grammars} \citep{brainerd_tree_1969} -- a device to describe a set of ordered trees. Regular tree grammars are usually defined using ranked alphabets -- an alphabet, and a function assigning an arity to each symbol -- but for ADP-CFL it is useful to extend this concept to algebraic signatures. Using the latter, a fine-grained subclass of regular tree grammars can be defined by restricting grammars to only use certain algebraic signatures or classes thereof. In common definitions of ADP-CFL, this approach is used and specialized. In this section, regular tree grammars over algebraic signatures are introduced, while the following section defines the class of regular tree grammars used in ADP-CFL. From now on, the notions (ordered labelled) tree and term are used interchangeably.
In programming languages, types are used to guarantee the structural correctness of a term. For example, in Java, \verb|1 - true| is an invalid term as the minus operator is defined for numbers only. A similar concept is used in the field of universal algebra. A set of terms can be divided into disjoint sets where each set corresponds to a \emph{sort}. By restricting each argument of an operator to a sort, a basic (type) system for guaranteeing structural correctness is formed. Note that sorts are merely names -- in \cref{sec:adp_problem_solution}, when meaning is added to terms, sorts are assigned to domains.
\begin{definition}[Signature {\citep[cf.][Def. 3.1, 3.3]{kastens_modellierung:_2008}}]
\label{def:signature}
A (many-sorted) signature $\Sigma$ is a tuple $(S,F)$ where $S$ is a set of sorts, and $F$ a set of structural descriptions. A structural description describes an $n$-ary function or relation symbol $op$ and has the form $op: s_1 \times \ldots \times s_n \to s_0$ with $s_i \in S$.
\end{definition}
\begin{definition}[Algebraic signature]
\label{def:alg_signature}
A signature without relation symbols is called an \emph{algebraic signature}.
\end{definition}
\begin{definition}[Correct terms over a signature {\citep[Def. 3.4]{kastens_modellierung:_2008}}]
\label{def:correct_terms}
Given a signature $\Sigma =(S,F)$, then the set of \emph{correct terms} over $\Sigma$ contains the term $t$ of sort $s \in S$, if (a) $t$ is the name of a variable of sort $s$, or (b) $t = f (t_1,\ldots,t_n)$ with $n \geq 0$ where $F$ contains a structural description $f : s_1 \times \ldots \times s_n \to s_0$ such that every $t_i, 1 \leq i \leq n$ is a correct term of sort $s_i$.
\end{definition}
\begin{definition}[Ground term {\citep[Def. 3.5]{kastens_modellierung:_2008}}]
\label{def:ground_term}
A correct term without variables is called a \emph{ground term}.
\end{definition}
$T_\Sigma$ is the language containing all ground terms over $\Sigma$.
$T_\Sigma(V)$ is the language containing all correct terms over $\Sigma$ where all variables are from a set $V$.
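These notions can be made concrete by checking well-sortedness mechanically. The following minimal Haskell sketch (illustrative names; it assumes the standard \verb|containers| library) stores the structural descriptions of a signature in a map and computes the sort of a ground term, if the term is correct:
\begin{verbatim}
import qualified Data.Map as Map

type Sort = String

-- A signature: each function symbol maps argument sorts to a result sort.
type Signature = Map.Map String ([Sort], Sort)

-- Ground terms: a function symbol applied to subterms (constants: none).
data Term = App String [Term] deriving Show

-- The sort of a correct ground term over the signature, or Nothing.
sortOf :: Signature -> Term -> Maybe Sort
sortOf sig (App f ts) = do
  (argSorts, resSort) <- Map.lookup f sig
  sorts <- mapM (sortOf sig) ts
  if sorts == argSorts then Just resSort else Nothing
\end{verbatim}
Applied to the term $\op{s}(0)$ over a signature with the descriptions $0 : \to N$ and $\op{s} : N \to N$, \verb|sortOf| returns \verb|Just "N"|.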
\begin{definition}[Position of a variable]
Each variable in a correct term is assigned a unique number, called its \emph{position}, such that all variables in a term have distinct numbers.
\end{definition}
\pagebreak
\begin{definition}[Replacement of a variable]
The replacement of a variable at position $p$ in term $t$ by $u$ is denoted $t[u]_p$.
\end{definition}
\begin{definition}[Regular tree grammar over an algebraic signature {\citep[cf.][sect. 3.4]{giegerich_code_1988}}]
\label{def:regular_tree_grammar}
A regular tree grammar $\mathcal{G}$ over an algebraic signature $\Sigma$ is defined as a tuple $(V,\Sigma,Z,P)$, where $V$ is an alphabet of nonterminal symbols disjoint from function symbols in $\Sigma$, $Z \in V$ is the start symbol, and $P$ a finite set of productions which have the form $v \to t$ where $v \in V$ and $t \in T_\Sigma(V)$. All terms $t$ in productions $v \to t$ of the same nonterminal $v \in V$ must have the same sort.
\end{definition}
\begin{remark}
In expression $T_\Sigma(V)$ of \cref{def:regular_tree_grammar}, the alphabet of nonterminals $V$ is regarded as a set of variables where the sort of each variable is determined by the right-hand sides of its corresponding productions.
\end{remark}
Given a regular tree grammar $\mathcal{G}=(V,\Sigma,Z,P)$, then a nonterminal $v \in V$ at position $p$ in a term $t \in T_\Sigma(V)$ can be replaced by the right-hand side of a production $v \to v' \in P$ leading to a new term $t'=t[v']_p$. This is called a \emph{derivation step} and is written as $t \Rightarrow_\mathcal{G} t'$. For any terms $m,n \in T_\Sigma(V)$, $m \Rightarrow^*_\mathcal{G} n$ is called a \emph{derivation} if $m=n$ or if derivation steps $m \Rightarrow_\mathcal{G} d_1 \Rightarrow_\mathcal{G} \ldots \Rightarrow_\mathcal{G} d_k \Rightarrow_\mathcal{G} n$ exist for $k \geq 0$. Thus, the derivation relation $\Rightarrow^*_\mathcal{G}$ is the reflexive transitive closure of $\Rightarrow_\mathcal{G}$. When clear from the context, the subscript $\mathcal{G}$ can be omitted from $\Rightarrow_\mathcal{G}$ and $\Rightarrow^*_\mathcal{G}$.
The language of a term $t \in T_\Sigma(V)$ is the set $\mathcal{L}(\mathcal{G},t) = \{ t' \in T_\Sigma \mid t \Rightarrow^* t' \}$. The language of $\mathcal{G}$ is the set $\mathcal{L}(\mathcal{G}) = \mathcal{L}(\mathcal{G},Z) = \{ t \in T_\Sigma \mid Z \Rightarrow^* t \}$. $\mathcal{L}(\mathcal{G})$ is a subset of $T_\Sigma$.
\begin{example}[Regular tree grammar for sequences of natural numbers]
A regular tree grammar is given by $\mathcal{G}_{Nat} = (\{Seq,Nat\},\Sigma_{Nat},Seq,P_{Nat})$ where
\begin{align*}
P_{Nat} = &\{&&\\
&&Seq &\to \op{nil} \mid \op{cons}(Nat,Seq), \\
&&Nat &\to 0 \mid \op{s}(Nat)\\
&\},&&
\end{align*}
and $\Sigma_{Nat}=(\{N,N^*\},F_{Nat})$ with
\begin{align*}
F_{Nat} = &\{&&&&\\
&&\op{nil} &: \{()\} &&\to N^*,\\
&&\op{cons} &: N \times N^* &&\to N^*,\\
&&0 &: \{()\} &&\to N,\\
&&\op{s} &: N &&\to N\\
&\}.&&&&
\end{align*}
Some terms in $\mathcal{L}(\mathcal{G}_{Nat})$ are
\begin{align*}
&\op{nil},\\
&\op{cons}(0,\op{nil}),\\
&\op{cons}(\op{s}(\op{s}(\op{s}(0))),\op{cons}(\op{s}(0),\op{nil})).
\end{align*}
One possible derivation of the second term is given for illustration:
\begin{equation*}
Seq \Rightarrow
\op{cons}(Nat,Seq) \Rightarrow
\op{cons}(0,Seq) \Rightarrow
\op{cons}(0,\op{nil})
\end{equation*}
The relation to trees in tree grammars is made by imagining that instead of terms, trees are
generated:
\begin{spreadTrees}
\Tree [.nil ]
\Tree [.cons 0 nil ]
\Tree [.cons [.s [.s [.s 0 ] ] ] [.cons [.s 0 ] nil ] ]
\end{spreadTrees}
\end{example}
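In a statically typed functional language, this grammar has a direct counterpart: each sort becomes a data type and each function symbol a constructor, so the type checker enforces exactly the structural descriptions of $\Sigma_{Nat}$. A minimal Haskell sketch (illustrative names), including a bounded enumeration of $\mathcal{L}(\mathcal{G}_{Nat})$:
\begin{verbatim}
-- The sorts N and N* of Sigma_Nat as data types; the constructors
-- correspond to the function symbols and their descriptions.
data Nat = Zero | S Nat        deriving Show
data Seq = Nil  | Cons Nat Seq deriving Show

-- Terms derivable from the nonterminals, up to a derivation depth d.
nats :: Int -> [Nat]
nats 0 = []
nats d = Zero : map S (nats (d - 1))

seqs :: Int -> [Seq]
seqs 0 = []
seqs d = Nil : [ Cons n s | n <- nats (d - 1), s <- seqs (d - 1) ]
\end{verbatim}
\verb|seqs 3| contains, among others, the term \verb|Cons (S Zero) Nil| -- the counterpart of $\op{cons}(\op{s}(0),\op{nil})$.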
An important lesson is that while trees exist only implicitly as derivation trees for \glspl{CFG}, they have now been made explicit in regular tree grammars. The concept of words, though, has been lost. In the next section, this concept is reintroduced to fuse the worlds of words and trees in a very explicit way.
\section{ADP-CFL Grammars}
As already mentioned, common definitions of ADP-CFL -- e.g. \citet{honer_zu_siederdissen_sneaking_2012,sauthoff_bellmans_2011,giegerich_discipline_2004} -- use a subclass of regular tree grammars. This is achieved by restricting the forms of algebraic signatures and productions that are allowed. The common usage of this subclass is not an inherent restriction of ADP-CFL but rather a way to reduce the complexity of further definitions. A remark at the end of this section elaborates on this further.
The simplified definition of ADP-CFL assumes that all nonterminals of a regular tree grammar generate terms of the same sort. It will later become clear that this restriction implies that only a single objective function can be used.
First, a subclass of algebraic signatures is introduced, in this work named \emph{grammar signatures}, following \citet[sect. 3.1]{giegerich_discipline_2004}. One of the purposes of this class is to introduce signatures with exactly two sorts where one sort is designated as the (terminal) \emph{alphabet sort}, and the other as the \emph{result sort}. Through this distinction, further definitions can reference both sorts explicitly.
\begin{definition}[Grammar signature]
A grammar signature $\Sigma$ is a tuple $(\mathcal{A},\mathcal{S},F)$ which defines an algebraic signature $(\{\mathcal{A},\mathcal{S}\},F \cup F_\mathcal{A})$ where $\mathcal{S}$ is the result sort, and $\mathcal{A}$ is an alphabet of terminal symbols that is formally used as the alphabet sort but also for implicitly defining a set of structural descriptions $F_\mathcal{A}$ describing constant function symbols $a: \{()\} \to \mathcal{A}$ for each $a \in \mathcal{A}$. Additionally, each structural description $f \in F$ must have the form $f: s_1 \times \ldots \times s_n \to \mathcal{S}$ with $s_i \in \{\mathcal{A},\mathcal{S}\}$.\footnote{In \citet{giegerich_discipline_2004}, a non-standard definition of (single-sorted) signatures was used in which $\mathcal{A}$ is just a set of symbols rather than a proper sort. In this chapter, ADP-CFL is defined using regular definitions from the field of universal algebra.}
\end{definition}
\begin{definition}[Production term over a grammar signature]
Given a grammar signature $\Sigma=(\mathcal{A},\mathcal{S},F)$, then a correct term over $\Sigma$ of sort $\mathcal{S}$ is called a \emph{production term}.
\end{definition}
Given a grammar signature $\Sigma$, then $T^p_\Sigma$ is the language containing all production terms over $\Sigma$ which are also ground terms.
$T^p_\Sigma(V)$ is the language containing all production terms over $\Sigma$ where all variables are from a set $V$.
Finally, a definition of the type of restricted regular tree grammars used in further definitions of ADP-CFL can be given.
\begin{definition}[ADP-CFL grammar over $\Sigma$ {\citep[cf.][Def. 1]{giegerich_discipline_2004}}]
\label{def:tree_grammar}
An ADP-CFL grammar $\mathcal{G}$ over a grammar signature $\Sigma=(\mathcal{A},\mathcal{S},F)$ is a tuple $(V,\mathcal{A},Z,P)$ and describes a regular tree grammar $(V,\Sigma,Z,P)$, where the productions in $P$ must have the form $v \to t$ with $v \in V$ and $t \in T^p_\Sigma (V)$.
\end{definition}
Observation: A regular tree grammar $\mathcal{G}$ over an algebraic signature $\Sigma$ with a single sort $S$ is an ADP-CFL grammar. This follows when the result sort of the corresponding ADP-CFL grammar is mapped to $S$ and the alphabet is empty, so that the alphabet sort stays unused and can effectively be ignored. But it is the alphabet sort that is of utmost importance for the concept of ADP-CFL, which is why an ADP-CFL grammar is not simply defined as a regular tree grammar with a single sort. The following section explains the reason for this importance.
\begin{remark}
Although the restriction of each nonterminal generating terms of a single sort $\mathcal{S}$ seems to produce more formal work than necessary at the moment, it vastly reduces the complexity of definitions and methods in the rest of this work. The formal extension to multiple result sorts (and with that, multiple objective functions) is very briefly discussed in \citet[sect. 2.2.2]{sauthoff_bellmans_2011-1}. Also, note that all known implementations of ADP-CFL support multiple result sorts.
\end{remark}
Before providing an example of an ADP-CFL grammar for RNA secondary structures, the notion of yield parsing has to be introduced which relates terms of ADP-CFL grammars to words of a context-free language. With this relation, an ADP-CFL grammar which corresponds to the context-free grammar $\mathcal{G}_{PKF}$ of \cref{ex:cfg_pkf} can be constructed.
\section{Yield Parsing for ADP-CFL Grammars}
\label{sec:yield_parsing}
Given an ADP-CFL grammar $\mathcal{G}=(V,\mathcal{A},Z,P)$ over $\Sigma=(\mathcal{A},\mathcal{S},F)$, then its yield function $\op{y}_\mathcal{G} : T_\Sigma \to \mathcal{A}^*$ is defined as
\begin{equation}
\label{eq:y}
\op{y}_\mathcal{G}(z) =
\begin{cases}
z & \text{if $z \in \mathcal{A}$,}\\
\op{y}_\mathcal{G}(x_1)\ldots\op{y}_\mathcal{G}(x_n) & \text{if $z = f(x_1,\ldots,x_n), f \in F$.}
\end{cases}
\end{equation}
By constructing ADP-CFL grammars whose generated trees carry elements of $\mathcal{A}$ at their leaves, it is now possible to extract those elements and concatenate them into words.
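As a minimal Haskell sketch (illustrative names), the yield function of \cref{eq:y} can be written over a generic term type whose leaves carry alphabet symbols:
\begin{verbatim}
-- Terms over a grammar signature: alphabet symbols at the leaves,
-- function symbols at the inner nodes.
data GTerm a = Leaf a | Fun String [GTerm a] deriving Show

-- A leaf yields itself; a function symbol node yields the
-- concatenated yields of its arguments.
yield :: GTerm a -> [a]
yield (Leaf a)   = [a]
yield (Fun _ ts) = concatMap yield ts
\end{verbatim}
For instance, \verb|yield (Fun "cons" [Fun "number" [Leaf '0'], Fun "nil" []])| evaluates to \verb|"0"|, matching $\op{y}_{\mathcal{G}_{Char}}(\op{cons}(\op{number}(\ts{0}),\op{nil})) = \ts{0}$ in the example below.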
\begin{example}[ADP-CFL grammar for sequences of characters]
\label{ex:adp_cfl_chars}
An ADP-CFL grammar is given by $\mathcal{G}_{Char} = (\{Seq,Char\},\mathcal{A},Seq,P)$ over $\Sigma$ where $\mathcal{A}=\{\ts{0},\ts{1},\ts{a},\ts{b}\}$,
\begin{align*}
P = &\{&&\\
&&Seq &\to \op{nil} \mid \op{cons}(Char,Seq), \\
&&Char &\to \op{number}(\ts{0}) \mid \op{number}(\ts{1}) \mid \op{letter}(\ts{a}) \mid \op{letter}(\ts{b})\\
&\},&&
\end{align*}
and $\Sigma=(\mathcal{A},\mathcal{S},F)$ with
\begin{align*}
F = &\{&&&&\\
&&\op{nil} &: \{()\} &&\to \mathcal{S},\\
&&\op{cons} &: \mathcal{S} \times \mathcal{S} &&\to \mathcal{S},\\
&&\op{number} &: \mathcal{A} &&\to \mathcal{S},\\
&&\op{letter} &: \mathcal{A} &&\to \mathcal{S}\\
&\}.&&&&
\end{align*}
Some terms in $\mathcal{L}(\mathcal{G}_{Char})$ with $\op{y}_{\mathcal{G}_{Char}}$ applied are:
\begin{align*}
&\op{y}_{\mathcal{G}_{Char}}(\quad\op{nil} \quad) &&= \epsilon\\
&\op{y}_{\mathcal{G}_{Char}}(\quad\op{cons}(\op{number}(\ts{0}),\op{nil})\quad) &&= \ts{0}\\
&\op{y}_{\mathcal{G}_{Char}}(\quad\op{cons}(\op{number}(\ts{0}),\op{cons}(\op{letter}(\ts{a}),\op{nil}))\quad) &&= \ts{0a} \qedhere
\end{align*}
\end{example}
The yield language of a term $t \in T_\Sigma(V)$ is defined as
\begin{equation}
\mathcal{L}_{\op{y}}(\mathcal{G},t) = \{ \op{y}_\mathcal{G}(t') \mid t' \in \mathcal{L}(\mathcal{G},t) \}.
\end{equation}
The yield language of $\mathcal{G}$ is defined as
\begin{equation}
\mathcal{L}_{\op{y}}(\mathcal{G}) = \mathcal{L}_{\op{y}}(\mathcal{G},Z) = \{ \op{y}_\mathcal{G}(t) \mid t \in \mathcal{L}(\mathcal{G}) \}.
\end{equation}
The class of yield languages of ADP-CFL grammars is equivalent to the class of languages generated by context-free grammars \citep[Theorem 6]{giegerich_implementing_2002}.
\begin{example}[continues=ex:adp_cfl_chars]
The yield language of the term $Char$ is equal to the set $\{\ts{0},\ts{1},\ts{a},\ts{b}\}$.
The yield language of the term $Seq$ is equal to the set $\{\ts{0},\ts{1},\ts{a},\ts{b}\}^*$. The yield language of $\mathcal{G}_{Char}$ is the yield language of $Seq$.
\end{example}
Computing the inverse of the yield function is called yield parsing. The yield
parser $\mathcal{Q}_\mathcal{G}$ of an ADP-CFL grammar $\mathcal{G}$ computes the search space of all possible yield parses:
\begin{equation}
\label{eq:yieldparsercfl}
\mathcal{Q}_\mathcal{G}(x) = \{ t \in \mathcal{L}(\mathcal{G}) \mid \op{y}_\mathcal{G}(t) = x \}.
\end{equation}
\begin{example}[continues=ex:adp_cfl_chars]
As each term of $\mathcal{G}_{Char}$ yields a unique word, the yield parser $\mathcal{Q}_{\mathcal{G}_{Char}}$ returns exactly the single term associated with the word.
For the earlier examples:
\begin{align*}
&\mathcal{Q}_{\mathcal{G}_{Char}}(\epsilon) &&= \{\op{nil}\},\\
&\mathcal{Q}_{\mathcal{G}_{Char}}(\ts{0}) &&= \{\op{cons}(\op{number}(\ts{0}),\op{nil})\},\\
&\mathcal{Q}_{\mathcal{G}_{Char}}(\ts{0a}) &&= \{\op{cons}(\op{number}(\ts{0}),\op{cons}(\op{letter}(\ts{a}),\op{nil}))\}.
\end{align*}
\end{example}
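$\mathcal{G}_{Char}$ is simple enough that its yield parser can be sketched directly in Haskell (hypothetical names): every input symbol determines its subterm, so each word has exactly one parse.
\begin{verbatim}
-- Terms of G_Char as a data type mirroring the function symbols.
data CharTerm = NilT
              | ConsT CharTerm CharTerm
              | Number Char
              | Letter Char
              deriving Show

-- The yield parser Q_{G_Char}: all terms whose yield equals the input.
parseSeq :: String -> [CharTerm]
parseSeq ""       = [NilT]
parseSeq (c : cs) = [ ConsT t ts | t <- parseChar c, ts <- parseSeq cs ]

parseChar :: Char -> [CharTerm]
parseChar c
  | c `elem` "01" = [Number c]
  | c `elem` "ab" = [Letter c]
  | otherwise     = []
\end{verbatim}
\verb|parseSeq "0a"| returns the singleton list containing \verb|ConsT (Number '0') (ConsT (Letter 'a') NilT)|.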
The following example, which continues the running example of RNA secondary structure prediction, uses the concept of yield parsing to generate multiple terms for each word. The search space that was created by derivation trees of $\mathcal{G}_{PKF}$ is now replicated in the formalism of regular tree grammars -- or more specifically, ADP-CFL grammars. Only in the next section will it become clear what the actual benefit of this representation of the search space is.
\begin{example}[ADP-CFL grammar for pseudoknot-free \gls{RNA} secondary structures]
\label{ex:treegrammar_pkf}
In this example, the \gls{CFG} $\mathcal{G}_{PKF}$ from \cref{ex:cfg_pkf} is transformed to an ADP-CFL grammar $\mathcal{G}_{PKF-T}$ where, given a primary structure, the sets of derivable secondary structures from both grammars are equal. More formally, an ADP-CFL grammar needs to be found such that for every $x \in \mathcal{A}^*_{RNA}$ the sets of terms $\mathcal{Q}_{\mathcal{G}_{PKF-T}}(x)$ and $\mathcal{Q}_{\mathcal{G}_{PKF}}(x)$ are equivalent. In this context, two term sets are considered equivalent if both sets are equal when each term has been replaced by its dot-bracket notation word.
Constructing such a grammar with equivalent term sets is straightforward. Each production in $\mathcal{G}_{PKF}$ is transformed by using the symbols of the right-hand side as arguments of a new function symbol. As the argument sorts are either $\mathcal{A}$ or $\mathcal{S}$, the empty word is represented by an empty list of arguments instead of a single argument carrying the empty word. Although each production could be given its own new function symbol, in this example a small set of symbols is reused to group similar cases. In the following \cref{sec:adp_problem_solution} -- when terms are evaluated -- the rationale behind this will become clear.
The resulting ADP-CFL grammar is given by
\[ \mathcal{G}_{PKF-T} = (\{Z,P,B\},\mathcal{A},Z,P_{PKF-T}) \text{ over } \Sigma_{PKF-T} \] where $\mathcal{A} = \mathcal{A}_{RNA}$,
\begin{align*}
P_{PKF-T} = &\{&&\\
&&Z &\to \op{f}_1(B,Z) \mid \op{f}_2(P,Z) \mid \op{f}_3,\\
&&P &\to \op{f}_4(\ts{a},Z,\ts{u}) \mid \op{f}_4(\ts{u},Z,\ts{a}) \mid \op{f}_4(\ts{g},Z,\ts{c}) \mid\\
&& &\mathrel{\phantom{\to}} \op{f}_4(\ts{c},Z,\ts{g}) \mid \op{f}_4(\ts{g},Z,\ts{u}) \mid \op{f}_4(\ts{u},Z,\ts{g}),\\
&&B &\to \op{f}_5(\ts{a}) \mid \op{f}_5(\ts{u}) \mid \op{f}_5(\ts{g}) \mid \op{f}_5(\ts{c})\\
&\},&&
\end{align*}
and $\Sigma_{PKF-T}=(\mathcal{A},\mathcal{S},F_{PKF-T})$ with \pagebreak
\begin{align*}
F_{PKF-T} = &\{&&&&\\
&&\op{f}_1 &: \mathcal{S} \times \mathcal{S} &&\to \mathcal{S},\\
&&\op{f}_2 &: \mathcal{S} \times \mathcal{S} &&\to \mathcal{S},\\
&&\op{f}_3 &: \{()\} &&\to \mathcal{S},\\
&&\op{f}_4 &: \mathcal{A} \times \mathcal{S} \times \mathcal{A} &&\to \mathcal{S},\\
&&\op{f}_5 &: \mathcal{A} &&\to \mathcal{S}\\
&\}.&&&&
\end{align*}
Two of ten trees of the set $\mathcal{Q}_{\mathcal{G}_{PKF-T}}(\ts{agcgu})$ are given:
\begin{spreadTrees}
% ..... -> f_1(f_5(a),f_1(f_5(g),f_1(f_5(c),f_1(f_5(g),f_1(f_5(u),f_3)))))
\Tree [.$\op{f}_1$ [.$\op{f}_5$ \ts{a} ] [.$\op{f}_1$ [.$\op{f}_5$ \ts{g} ] [.$\op{f}_1$ [.$\op{f}_5$ \ts{c} ] [.$\op{f}_1$ [.$\op{f}_5$ \ts{g} ] [.$\op{f}_1$ [.$\op{f}_5$ \ts{u} ] $\op{f}_3$ ] ] ] ] ]
% (.()) -> f_2(f_4(a,f_1(f_5(g),f_2(f_4(c,f_3,g),f_3)),u),f_3)
\Tree [.$\op{f}_2$ [.$\op{f}_4$ \ts{a} [.$\op{f}_1$ [.$\op{f}_5$ \ts{g} ] [.$\op{f}_2$ [.$\op{f}_4$ \ts{c} $\op{f}_3$ \ts{g} ] $\op{f}_3$ ] ] \ts{u} ] $\op{f}_3$ ]
\end{spreadTrees}
The corresponding trees for $\mathcal{G}_{PKF}$ are also given for comparison:
\begin{spreadTrees}
% ..... -> Z(B(a),Z(B(g),Z(B(c),Z(B(g),Z(B(u),Z(\epsilon))))))
\Tree [.$Z$ [.$B$ \ts{a} ] [.$Z$ [.$B$ \ts{g} ] [.$Z$ [.$B$ \ts{c} ] [.$Z$ [.$B$ \ts{g} ] [.$Z$ [.$B$ \ts{u} ] [.$Z$ $\epsilon$ ] ] ] ] ] ]
% (.()) -> Z(P(a,Z(B(g),Z(P(c,Z(\epsilon),g),Z(\epsilon))),u),Z(\epsilon))
\Tree [.$Z$ [.$P$ \ts{a} [.$Z$ [.$B$ \ts{g} ] [.$Z$ [.$P$ \ts{c} [.$Z$ $\epsilon$ ] \ts{g} ] [.$Z$ $\epsilon$ ] ] ] \ts{u} ] [.$Z$ $\epsilon$ ] ]
\end{spreadTrees}
With this one-to-one translation, it is easy to see that the terms produced by both grammars differ only in using separate function symbols instead of nonterminals as node labels, and vice versa. The empty word symbol in derivation trees of \glspl{CFG} can be ignored for this comparison. The only task left is to map the transformation functions given in \cref{ex:cfg_pkf} to the function symbols used in the ADP-CFL grammar -- again, to make the hidden secondary structures explicit in the form of dot-bracket notation words:
\begin{align*}
\op{f}_1(b,z) &= bz\\
\op{f}_2(p,z) &= pz\\
\op{f}_3 &= \epsilon\\
\op{f}_4(b_1,z,b_2) &= \ts{(}z\ts{)}\\
\op{f}_5(b) &= \ts{.} \qedhere
\end{align*}
\end{example}
With the concept of yield parsing, words have been reintroduced to regular tree grammars in a way similar to how they exist in \glspl{CFG}. The main benefit compared to \glspl{CFG} is that derivation trees were made explicit in the form of derivable terms which can be processed further. Viewed from the outside, introducing and adapting regular tree grammars may seem like too much effort just to gain an explicit form of terms. Nevertheless, it is this explicit form which greatly simplifies the processing and evaluation of derivation trees. By using this form, the use of standard notions from the field of universal algebra becomes possible -- e.g. \emph{structures} as introduced in the next section. This basis is particularly important when considering how to develop software implementations of ADP-CFL. Many-sorted signatures directly lead to the field of type theory, which in turn is closely related to programming languages with static type systems. Using this strong analogy, it was possible for the first implementation of ADP-CFL (Haskell-ADP) to consist of merely around 100 lines of code. More arguments for using regular tree grammars can be found in \citet[sect. 3.10]{giegerich_discipline_2004}.
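This analogy can be made concrete. Under the (illustrative) assumption that the result sort $\mathcal{S}$ is rendered as a data type and the alphabet sort as \verb|Char|, the grammar signature $\Sigma_{PKF-T}$ of \cref{ex:treegrammar_pkf} becomes a single Haskell declaration, and the type checker enforces exactly the well-sortedness of production terms:
\begin{verbatim}
-- Sigma_PKF-T as an algebraic data type (a sketch): the result sort S
-- becomes the type, the alphabet sort becomes Char, and the function
-- symbols f1,...,f5 become constructors.
data S = F1 S S | F2 S S | F3 | F4 Char S Char | F5 Char
  deriving Show
\end{verbatim}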
\section{Solving an ADP-CFL Problem}
\label{sec:adp_problem_solution}
The previous two sections described how the candidate search space is constructed. This section shows how each candidate of the search space is evaluated and a solution for the optimization problem is found.
Evaluation of a term -- in mathematical logic called \emph{interpretation} -- is achieved using \emph{structures}. A structure contains interpretations for the symbols of a signature and defines a domain for each sort. Structures for algebraic signatures are also called algebraic structures, or \emph{algebras} for short. An algebra for an algebraic signature $\Sigma$ is called a $\Sigma$-algebra. Examples of common algebras used in mathematics include the algebras of groups and rings over a given set, and the Boolean algebra.
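For signatures of the kind used here, a structure can be sketched in a functional language as a record of interpretation functions, one per function symbol, parameterized over the domain chosen for the result sort. The following is a minimal Haskell sketch for $\Sigma_{PKF-T}$ (illustrative names, not the Haskell-ADP encoding); the second value re-states the dot-bracket interpretation from \cref{ex:treegrammar_pkf}:
\begin{verbatim}
-- An algebra for Sigma_PKF-T: a domain s for the result sort S, and
-- an interpretation for each function symbol f1,...,f5.
data PKFAlgebra s = PKFAlgebra
  { f1 :: s -> s -> s
  , f2 :: s -> s -> s
  , f3 :: s
  , f4 :: Char -> s -> Char -> s
  , f5 :: Char -> s
  }

-- The dot-bracket interpretation given above.
dotBracketAlg :: PKFAlgebra String
dotBracketAlg = PKFAlgebra
  { f1 = (++)
  , f2 = (++)
  , f3 = ""
  , f4 = \_ z _ -> "(" ++ z ++ ")"
  , f5 = \_ -> "."
  }
\end{verbatim}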
\begin{example}[Signatures and algebras for semirings]
\label{ex:semirings}
The signature of a specific semiring depends on the set over which it is defined. Given a set $S$, then $\Sigma=(\{S\},F)$ with
\begin{align*}
F = &\{&&&&\\
&&+ &: S \times S &&\to S,\\