\documentclass[11pt,bookmarks,bookmarksnumbered,naturalnames,plainpages=false,pdftex,colorlinks=true,urlcolor=blue,bookmarksdepth=subsection,plainpages=false]{paper}
\usepackage[T2A,T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[a4paper,margin=1in]{geometry}
% \newcommand{\Be}{\mbox{\usefont{T2A}{\rmdefault}{m}{n}\CYRB}}
\usepackage{times}
\usepackage{amssymb}
\usepackage{transparent}
% \usepackage{makeidx}
\usepackage{tikz}
%\usepackage{wrapfig}
%\usetikzlibrary{snakes,arrows,shapes,automata}
%\usepackage{mlbook}
%\usepackage[round]{natbib}
%\usepackage{multicol}
\usepackage{epsfig}
\usepackage{graphicx}
% \usepackage{xypic}
\usepackage[matrix,arrow]{xy}
\usepackage[pdftex,colorlinks=true,urlcolor=blue,bookmarksdepth=subsection,plainpages=false]{hyperref}
\usepackage[backend=biber,natbib,style=authoryear,uniquename=false,uniquelist=false]{biblatex}
\def\stackrel#1#2{\mathrel{\mathop{#2}\limits^{#1}}}
\makeatletter
\newcommand{\dashedrightarrow}[1][2pt]{%
\settowidth{\@tempdima}{$\rightarrow$}\rightarrow% typeset arrow
\makebox[-\@tempdima]{\hskip-1.5ex\color{white}\rule[0.5ex]{#1}{2pt}}% typeset overlay
\phantom{\rightarrow}% advance appropriate horizontal distance
}
\makeatother
\newcommand{\andras}[1]{{\color{magenta}{AK: #1}}}
\newcommand{\zalan}[1]{{\color{red}{MM: #1}}}
\addbibresource{ml.bib}
\begin{document}
\title{The syntax of 4lang definitions}
\author{Andr\'as Kornai}
%\small HAS Computer and Automation Research Institute\\
%\small H-1111 Budapest Kende u 13-17, Hungary\\
%\small \href{mailto:andras@kornai.com}{andras@kornai.com}}
\date{}
\maketitle
\begin{abstract}
We describe how the basic lexicographic principles of 4lang are reflected in
the syntax of definitions.
\end{abstract}
\section*{Background}
4lang is a concept dictionary, intended to be universal in a sense made more
precise below. The main motivation was spelled out in \citep{Kornai:2010}
as follows:
``In creating a formal model of the lexicon the key difficulty is the
circularity of traditional dictionary definitions -- the first English
dictionary, \cite{Cawdrey:1604} already defines {\it heathen}
as {\tt gentile} and {\it gentile} as {\tt heathen.} The problem has already
been noted by Leibniz (quoted in \cite{Wierzbicka:1985}):
\begin{quote}
Suppose I make you a gift of a large sum of money saying you can collect it
from Titius; Titius sends you to Caius; and Caius, to Maevius; if you continue
to be sent like this from one person to another you will never receive
anything.
\end{quote}
\noindent
One way out of this problem is to come up with a small list of primitives, and
define everything else in terms of these.''
To take the first tentative steps towards language-independence, the system
was set up with bindings in four languages, representative samples of the
major language families spoken in Europe: Germanic (English), Slavic (Polish),
Romance (Latin), and Finno-Ugric (Hungarian). Today, bindings exist in over 40
languages \citep{Acs:2013} but the user should keep in mind that these
bindings provide only rough semantic correspondence to the intended concept.
4lang can be used for a variety of purposes: for better understanding of
deep cases \citep{Makrai:2014}; for producing state of the art results on
analogy tasks \citep{Recski:2016c}; for studies of the importance of
individual concepts \citep{Makrai:2013a}; for investigating spreading
activation \citep{Nemeskey:2013}; and of course for investigating various
issues of lexicography \citep{Kornai:2015a}.
Unfortunately, these papers don't always reference the same version of the
slowly evolving 4lang, and the principles undergirding the system are not
available in a single, convenient source. The goal of this document is to
provide a single entry point for both the linguistic and the computational
aspects.
The main {\tt 4lang} file is divided into 9 tab-separated fields, of which the
last is reserved for comments. The percent sign is also used to delimit
comments, but this usage is not (yet) consistent. A typical entry (written as one line
in the file but here broken up in three for legibility) would be
\begin{verbatim}
attention figyelem animi_attentio uwaga 801 u V
listen, see, =AGT think =PAT[interesting, important]
%light verb, "pay attention"
\end{verbatim}
\noindent
As can be seen, the first four columns are the 4 language bindings given in
EHLP order, with provisions for keeping the entire file in 7-bit ASCII. The
fifth is a unique number per concept, most important when the English bindings
coincide:
\begin{verbatim}
cook fo3z coquo gotowac1 825 u V =AGT make <food>, INSTRUMENT heat
cook szaka1cs coquus kucharz 2152 u N person, <profession>, make food
\end{verbatim}
\noindent
The sixth column is technical (representing current update status) and will be
ignored here, as it is relevant only for the maintainers of the
dictionary. The seventh column is a rough lexical category symbol, see
Section~\ref{lexcat} for further discussion. The main subject of this note is
the 8th column, which gives the 4lang definition, see Section~\ref{8thcol},
but before turning to this, we discuss some of the lexicographic principles,
and take the opportunity to introduce some of the machinery informally first.
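
For orientation, the nine fields of the {\tt attention} entry shown above can
be laid out one per row. This is only a sketch: the row labels are informal
glosses of the description in this section and are not part of the file
format, and the actual file separates the fields by tabs rather than by the
spaces used in the display above.
\begin{center}
\begin{tabular}{rll}
field & role & content in the example \\
\hline
1 & English binding & \verb|attention| \\
2 & Hungarian binding & \verb|figyelem| \\
3 & Latin binding & \verb|animi_attentio| \\
4 & Polish binding & \verb|uwaga| \\
5 & concept number & \verb|801| \\
6 & update status & \verb|u| \\
7 & lexical category & \verb|V| \\
8 & definition & \verb|listen, see, =AGT think =PAT[interesting, important]| \\
9 & comment & \verb|%light verb, "pay attention"| \\
\end{tabular}
\end{center}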
%ring cseng tinnit dzwonic1 2735 u U bell make sound/993, FOR/2782 attention, <bell PART_OF telephone>
\section{Lexicographic principles}
\subsection{Reductivity}
In many ways, 4lang is a logical outgrowth of modern, computationally oriented
lexicographic work beginning with Collins-COBUILD \citep{Sinclair:1987}, the
Longman Dictionary of Contemporary English (LDOCE) \citep{Boguraev:1989},
WordNet \citep{Miller:1995}, FrameNet \citep{Fillmore:1998}, and VerbNet
\citep{Kipper:2000}.
The key step in minimizing circularity was taken in LDOCE, where a small
(about 2,200 words) defining vocabulary (called LDV, Longman Defining
Vocabulary) was created, and strictly adhered to in the definitions with one
trivial exception: words that often appear in definitions (e.g. the word {\it
planet} is common to the definition of Mercury, Mars, Venus, \ldots) can be
used as long as their definition is strictly in terms of the LDV. Since {\it
planet} is defined `a large body in space that moves around a star' and
{\it Jupiter} is defined as `the largest planet of the Sun' it is easy to
substitute one definition in the other to obtain for Jupiter the definition
`the largest body in space that moves around the Sun'.
4lang generalizes this process, starting with a core list of NNN primitives,
defining a larger set in terms of these, a yet larger set in terms of these,
and so on until the entire vocabulary is in scope. As a practical matter we
started from the opposite direction, with a seed list of approximately 3,500
entries composed of the LDV (2,200 entries), the most frequent 2,000 words
according to the Google unigram count \citep{Brants:2006} and the BNC
\citep{Burnard:1998}, as well as the most frequent 2,000 words from Polish
\citep{Halacsy:2008} and Hungarian \citep{Kornai:2006}. Since Latin is one of
the four languages supported by 4lang, we added the classic
\cite{Diederich:1939} list and \cite{Whitney:1885}.
Based on these 3,500 words, we reduced the defining vocabulary by means of a
heuristic graph search algorithm \citep{Acs:2013} that eliminated all words
that were definable in terms of the remaining ones. The end-stage is a
vocabulary with the {\it uroboros property}, i.e. one that is minimal wrt this
elimination process. This list (1,200 words, counting different senses with
multiplicity) was published as Appendix~4.8 of \cite{Kornai:2019} and was used
in several subsequent studies including \citep{Nemeskey:2018}. (The last
remnant of the fact that we started with over 3k words is that numbers in the
5th column are still in the 1-3,500 range, as we decided against renumbering
the set.)
Importantly, since the uroboros vocabulary was obtained by systematic
reduction of a superset of the LDV, it is still guaranteed that every sense of
every word listed in LDOCE (over 82k entries) is definable in terms of
these. Since the defining vocabularies of even larger dictionaries such as
Webster's 3rd \citep{Merriam:1961} are generally included in LDOCE, we have
every reason to believe that the entire vocabulary of English, indeed the
entire vocabulary of any language, is still definable in terms of these 1,200
concepts.
Unfortunately, such redefinition generally requires more than string
substitution: for example if you substitute `a large body in space that moves
around a star' into `the largest \_\_ of the Sun' you would obtain `the
largest a large body in space that moves around a star of the Sun' and it
takes a great deal of sophistication for the substitution algorithm to realize
that {\it a large} is subsumed by {\it the largest} or that {\it a star} is
instantiated by {\it the Sun}. People perform these operations with ease,
without conscious effort, but for now we lack parsers of the requisite
syntactic and semantic sophistication to do this automatically. Part of our
goal with the strict definition syntax that replaces English syntax on the
right-hand side (rhs) of definitions is to study the mechanisms required by an
automated parser for doing this.
\subsection{Morphological prerequisites}\label{morphology}
The LDV contains a few dozen bound morphemes, the suffixes {\it -able -al -an
-ance -ar -ate -ation -dom -ed -en -ence -er -ery -ess -est -ful -hood -ible -ic
-ical -ing -ion -ish -ist -ity -ive -ization -ize -less -like -ly -ment
-ness -or -ous -ry -ship -th -ure -ward -wards -work -y} and the prefixes
{\it counter- dis- en- fore- im- in- ir- mid- mis- non- re- self- un- vice-
well-}. These are tremendously useful both in reducing the size of the
defining vocabulary (since {\it eat} and {\it eating} no longer need to be
listed separately) and in making the definitions less complicated.
While we obviously cannot cover the entirety of English morphology as part of
4lang, we do not consider the problems raised by bound forms to be
qualitatively different from those raised by lexical semantics in
general. Also, languages are not uniform in where they draw the bound/free
boundary: many concepts that are expressed by affixation in one are expressed
by free forms in another, and dictionary definitions often contain these.
We will illustrate our methods on the suffix {\it -ize}, which means something
like `to cause to become', so {\it Americanize} `cause to become American',
{\it carbonize} `cause to become carbon' and so forth. There are cases that do
not fit this analysis ({\it agonize} doesn't mean `cause to become agony' the
same way {\it colonize} means `cause to become colony') and there are other
subregularities one may wish to consider, but the majority of the 200--300
English words ending in {\it -ize} fit this pattern well enough to consider it
the leading candidate for a semantic definition. What we wish to state is a
lexical rule roughly of the following form: for stem X, stem+ize means `cause
to become (like) X'. Anticipating several notational conventions that will
only be explained subsequently, we could write this as
\begin{verbatim}
-ize CAUSE {become <like/1701> stem}, "-ize" MARK stem
\end{verbatim}
\noindent
From here on in examples we keep only the English binding for the definiendum
(the first field in a dictionary record), followed by the definiens (the 8th
field). Here {\tt CAUSE} is a primitive (part of our eventual uroboros set)
written in CAPS because it is one of the few binary relations we admit as
primitives (primitive unaries and variadic predicates are written lowercase).
The curly braces (see Section~\ref{parens}) denote a single hypergraph node
(pictorially, all formulas will correspond to hypergraphs) and the angled
brackets signify optionality, enclosing the default option (see
Section~\ref{default}). {\tt MARK} is another primitive, standing for the
relation between signifier (a string, given in doublequotes) and the relevant
element to be substituted (see \ref{naivegrammar}), here the node {\tt stem}
which is analogous to the variable X used above.
However, neither {\tt like/1701} nor {\tt become} are primitives (for the
four-digit disambiguation number following the English binding see
Section~\ref{comma}). {\tt like/1701} `sicut' is defined as {\tt similar} (as
opposed to {\tt like/3382} `amo') and {\tt become} is defined as {\tt
=AGT[=PAT[after]]} which for now we will paraphrase as `afterwards, agent
IS\_A patient' (thematic roles are discussed in Sections~\ref{agtpat} and
\ref{deepcase}). For something like {\it John caramelized the sugar} this
would be `John caused the sugar to be <similar to> caramel afterwards'.
Here, for the sake of readability, we made some concessions to English syntax,
by adding agreement morphology, an article, a copula, and a preposition, but
eventually the reader will get familiar with the syntax of definitions that
lacks all these niceties, and would read {\tt John CAUSE after \{sugar $<$similar$>$
caramel\}}.
Since {\tt similar} is not a primitive of the formal language of definitions,
we can take this further by substituting its definition
\begin{verbatim}
similar =AGT HAS property, =PAT HAS property, "to" MARK =PAT
\end{verbatim}
\noindent
Since named nodes are unique in definitions, what this means is that in the
construction {\it X (is) similar to Y} the agent will have the {\it same}
property as the patient. As expected, the {\tt MARK} relation is
language-specific: for Hungarian we would want to say that the allative case
{\it hoz/hez/h\"{o}z} marks the patient. (4lang currently gives the MARKs only
for English.)
At this point we can omit the default (since it is a binary relation, this
means substituting {\tt IS\_A}) or we can expand it, to yield
\begin{verbatim}
-ize CAUSE {after {=PAT HAS property, stem HAS property}}, "-ize" MARK stem
\end{verbatim}
\noindent At this point, all our notions are primitives, including the
metalinguistic placeholder {\tt stem} and the term {\tt property}, which is really
underspecified as to what property it refers to, as befits the definition of
{\it similar} which is underspecified exactly in this respect (compare {\it
similar consequences} to {\it similar balloons}). {\tt HAS} again is
primitive, the causative element in {\it -ize} is well known
\citep{Lieber:1992,Plag:1998}, and the idea that we define certain verbs by
their result state is standard. Temporal structure can refer to some state
{\tt before} or {\tt after} the event, but this has to be explicitly
stated. Comma-separated linear order, as in {\tt =PAT HAS property, stem HAS
property} simply means conjunction (see Section~\ref{comma}), and as such
it is independent of the order of the conjuncts.
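
To make the expanded rule concrete, it can be instantiated for {\it
caramelize}. The two lines below are only a sketch: they assume that {\tt
stem} is simply replaced by {\tt caramel} (dropping the {\tt MARK} clause once
the suffix has been recognized), and that the subject and {\tt =PAT} are then
filled by {\tt John} and {\tt sugar}. Neither line is taken from the 4lang
file.
\begin{verbatim}
caramelize   CAUSE {after {=PAT HAS property, caramel HAS property}}
John CAUSE {after {sugar HAS property, caramel HAS property}}
\end{verbatim}
\noindent
Under this reading the result state only asserts that sugar and caramel share
some unspecified property, which matches the paraphrase `John caused the sugar
to be similar to caramel afterwards' given earlier.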
In the fourth edition \citep{Bullon:2003} LDOCE defines {\it caramelize} as
`if sugar caramelizes, it becomes brown and hard when it is heated'. The
first edition of LDOCE \citep{Procter:1978} does not define {\it caramelize}
and has no self-recursion. The self-recursive definitions added to later
editions may be a feature from the perspective of the human language learner,
but they are definitely a bug from the computational perspective. To parse this
definition would lead us nowhere, since the definiendum is part of the
definiens, and we don't have a theory for finding a minimal fixed point in
{\it if sugar if sugar if sugar \ldots X, X becomes brown and hard when it is
heated X becomes brown and hard when it is heated \ldots}. What happens when
it's not heated? Is it brown? Will it become brown? Is it hard when it's
caramelized? Or will it become hard only when heated? How about caramelizing
something other than sugar, say onions? This definition says nothing about the
`if not sugar' case, whereas the definition we derived above at least tells us
that if onion is caramelized it will share some properties with caramel.
\subsection{Encyclopedic knowledge}
The first edition of LDOCE \citep{Procter:1978} defines {\it caramel} as
`burnt sugar used for giving food a special taste and colour'. In 4lang this
could be recast as
\begin{verbatim}
caramel sugar[burnt], CAUSE {food HAVE {taste[special], colour[special],
<taste[sweet]>, <colour[brown]>}}
\end{verbatim}
\noindent
where quite a bit of the syntax is implicit, such as the fact that {\tt
caramel} is the subject of {\tt CAUSE}, see Section~\ref{subjobj}, and we
sneaked in some real world knowledge that the special taste is (in the default
case) sweet, and the special color is brown.
As the preceding makes clear, we could track further {\it special} (defined in
4lang as {\tt LACK common}), or {\it food}, or {\it burnt}, or any term, but
here we will concentrate on {\it sugar} `a sweet white or brown substance that
is obtained from plants and used to make food and drinks sweet'. Remarkably,
this definition would also cover xylitol $(CH_2OH(CHOH)_3CH_2OH)$ or stevia
$(C_{20}H_{30}O_3)$ which are used increasingly as replacements for common
household sugar $(C_{12}H_{22}O_{11})$.
This is not to say that the editors should have been aware in 1978 that a few
decades later their definition will no longer be specific enough to
distinguish sugar from other sweeteners. Yet the clause `obtained from
plants' is indicative of awareness about saccharine $(C_7H_5NO_3S)$ which is
also sweet, but is not obtained from plants.
4lang takes the line that encyclopedic knowledge has no place in the
lexicon. Instead of worrying about how to write clever definitions that will
distinguish sugar not just from saccharine but also from xylitol, stevia, and
whatever new sweeteners the future may bring, it embraces simplicity and
provides definitions like the following:
\begin{verbatim}
rottweiler dog
greyhound dog
\end{verbatim}
\noindent
This means that we fail to fully characterize the competent adult speaker's
ability to use the word {\it rottweiler} or {\it greyhound}, but this does not
seem to be a critical point of language use, especially as many adult speakers
seem to get along just fine without a detailed knowledge of dog breeds. To
quote \cite{Kornai:2010}:
\begin{quote}
So far we discussed the {\it lexicon}, the repository of linguistic knowledge
about words. Here we must say a few words about the {\it encyclopedia}, the
repository of world knowledge. While our goal is to create a formal theory of
lexical definitions, it must be acknowledged that such definitions can often
elude the grasp of the linguist and slide into a description of world
knowledge of various sorts. Lexicographic practice acknowledges this fact by
providing, somewhat begrudgingly, little pictures of flora, fauna, or
plumbers' tools. A well-known method of avoiding the shame of publishing a
picture of the yak is to make reference to {\tt Bos grunniens} and thereby
point the dictionary user explicitly to some encyclopedia where better
information can be found. We will collect such pointers in a set {\bf E}
\end{quote}
\noindent
Today, we use Wikipedia for our encyclopedia, and denote pointers to it by a
prefixed @ sign, see Section~\ref{atsign}. Our definitions are
\begin{verbatim}
sugar sweet, IN food, IN drink
sweet taste, good, pleasant, sugar HAS taste, honey HAS taste
\end{verbatim}
\noindent
Instead of sophisticated scientific taxonomies, 4lang supports a naive
world-view \citep{Hayes:1979,Gordon:2017}. We learn that {\it sugar} is sweet,
and {\it sweet} IS\_A taste -- the system actually makes no distinction
between predicative (is) and attributive (is\_a) usage. We learn that sugar is
to be found in food and drink, but not where exactly.
One place where the naive view is very evident is the treatment of high-level
abstractions. For example, the definition of {\it color} has nothing to do
with photons, frequency ranges in the electromagnetic spectrum, or anything of
the sort -- what we have instead is {\tt sensation, light/739, red IS\_A,
green IS\_A, blue IS\_A} and when we turn to e.g. {\it red} we find {\tt
colour, warm, fire HAS colour, blood HAS colour}. Another field where we
support only a naive theory is grammar, see \ref{naivegrammar}.
As with {\it sugar} and {\it sweet}, we posit something approaching a mutual
defining relation between {\it red} and {\it blood}, but this is not entirely
like Titius and Caius sending you further on: actually {\it blood} gets
eliminated early in the uroboros search as we iteratively narrow the defining
set, while {\it red} stays on. Eventually, we have to have some primitives,
and we consider {\it red}, a Stage II color in the \cite{Berlin:1969}
hierarchy, a very reasonable candidate for a cross-linguistic primitive.
So far, we have discussed the fact that separating the encyclopedia from the
lexicon leaves us with a clear class of lexical entries, exemplified so far by
colors and flavors, where the commonly understood meaning is anchored entirely
outside the lexicon. There are also cases where this anchoring is partial,
such as the suffix {\it -shaped}. The meaning of {\it guitar-shaped, C-shaped,
U-shaped, \ldots} is clearly compositional, and relies, on the other hand, on
cultural primitives such as {\it guitar, C, U, \ldots} that will remain at
least partially outside the lexicon. According to \citet{Rosch:1975}, lexical
entries may contain pointers to non-verbal material, not just primary
perceptions like color or taste, but also prototypical images. We can say
that {\it guitar} is a stringed musical instrument, or that $C$ and $U$ are
letters of the alphabet, and this is certainly part of the meaning of these
words, but it is precisely in the image aspect highlighted by {\it -shaped}
that words fail us. Again anticipating notation that we will fully define only
in subsequent sections, we can define {\it guitar-shaped} as {\tt HAS shape,
guitar HAS shape} and in general
\begin{verbatim}
-shaped stem HAS shape, =AGT HAS shape, "_-shaped" MARK stem
\end{verbatim}
\noindent
and leave it to the general unification mechanism we discussed in
\ref{morphology} to guarantee that it is the same shape that the stem and the
subject of the compound adjective will share.
\subsection{Lexical categories and subcategories}\label{lexcat}
Whether a universal system of lexical categories exists is still a widely
debated question. \cite{Bloomfield:1933}, and more recently
\cite{Kaufman:2009} argued that certain languages like Tagalog have only one
category, but the notion that there are at least three major categories that
are universal, nouns, verbs, and adjectives, has been broadly defended
\citep{Baker:2003,Chung:2012}. 4lang subdivided verbs into two categories:
intransitive U and transitive V; retaining the standard N for noun; A for
adjective; and also used D for aDverb; and G for Grammatical formative.
While this rough categorization has proven useful for seeking bindings in the
original 4 and in other languages, there is no theoretical claim associated to
these categories, neither the universal claim that all languages would
manifest these categories (or at least, or at most, these), nor the
(four)language-particular claim that these categories are somehow
necessary/sufficient for capturing the data. In fact, 4lang is a semantic
system, and it says remarkably little about the system of lexical categories
and subcategories, be they defined by morphological or syntactic
cooccurrences. If anything, our findings lend support to the thesis of
\cite{Wierzbicka:2000} that cross-linguistic identification of lexical
categories is to be achieved via prototypes rather than by abstract class
meanings.
To the extent that none of the six lexical categories U,V,N,A,D,G is ever
referred to by any definition or rule, 4lang holds fast to the autonomy of
syntax thesis. In particular, we refrain from stating the categorial signature
of elements even when it is obvious, e.g. that {\it -ize} is N$\rightarrow$V
(see \cite{Lieber:1992} for the observation that in productive uses the resulting verb must be
transitive), and we feel free to add in English paraphrases formatives such as
{\it be, that, a/an, the, to, -ly \ldots} which serve only to make the English
syntax come out right. For lexicographic completeness, we have entries
e.g. for infinitival {\it to}, but 4lang does not encode any difference
between the meaning of {\it eat} and {\it to eat} (see \ref{naivegrammar} for
details). This is in sharp contrast to locative {\it to}, which we see as
contentful and define as {\tt after(=AGT AT =PAT)}.
The syntax of 4lang definitions countenances only two basic types (lexical
categories in the metalanguage): unaries and binaries, and permits lexical
entries to be ambiguous between these two. The basic unary type is seen in
most nouns, especially proper names, adjectives, and adverbials, and the basic
binary type is seen in transitive verbs and adpositions. As an example of the
latter, consider the preposition {\it at}, defined in LDOCE as `used to say
exactly where something or someone is, or where something happens'. Clearly,
{\it at} is a binary relation (we write these in SVO order) x AT y, where y is
strongly subtyped for location, be it spatial or temporal, so strongly that
otherwise unspecified entities like {\it Jim's} have to be typecast to location
if we are to make sense of expressions like {\it We meet at Jim's}. In
contrast, x is left untyped: it could be a physical object, a person, or even
an event. Either way, y provides the origin of the coordinate system where we
anchor x. 4lang has the means to express the selectional restriction on the
second argument (see Section~\ref{linking}), but considers the use of {\it
where} inappropriate, given that {\it at} has no question component. % (and in
%fact {\it where} will be defined as {\tt AT wh}).
Since other ideas about what
{\it at} means, be they cast in terms of some geometric coordinate system or
in terms of figure/ground, are far too complex to serve as the basis of some
reductive theory, we again bite the bullet and take AT to be a semantic
primitive. This yields the definition
\begin{verbatim}
at AT, =PAT[place/1026], "at" MARK =PAT
\end{verbatim}
\noindent
(In Hungarian, we would have {\tt "n\'al/n\'el" MARK =PAT} -- the MARK-clauses
in column 8 are always specific to English.) Whereas adpositions are generally
hard to decompose in terms of more primitive notions, transitive verbs are
much easier: as a classic example (and to show our indebtedness to the
generative semantic tradition) we provide
\begin{verbatim}
kill =AGT CAUSE {=PAT[die]}
\end{verbatim}
\noindent
In the metalanguage of the definition syntax there are only three binary
relations: {\it subject}, depicted in graphs by an arrow labeled by `1' and
pointing toward the subject; {\it object}, depicted in graphs by an arrow
labeled by `2' and pointing toward the object (see Section~\ref{subjobj}); and
{\it is/is\_a}, depicted in graphs by an arrow labeled `0' running from
subclass to superclass (see Section~\ref{isa}). These are not to be confused
with binary semantic relations such as {\tt AT, CAUSE}, or {\tt kill}, of which
there are a handful of primitives and thousands of derived ones, see \ref{rel}.
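
As an illustration of how the three metalanguage relations would be read off a
definition, the entry for {\tt kill} can be listed edge by edge. This is a
sketch under our reading of the conventions just described: in particular, it
assumes that the bracket in {\tt =PAT[die]} contributes a `0' edge, following
the `IS\_A' paraphrase used for brackets earlier. The arrow notation is ad hoc
and is not part of 4lang.
\begin{verbatim}
CAUSE --1--> =AGT           subject of CAUSE
CAUSE --2--> {=PAT[die]}    object of CAUSE, a single complex node
=PAT  --0--> die            inside the braces: patient IS_A die
\end{verbatim}
\noindent
The semantic relation {\tt CAUSE} appears here as a node, while `0', `1', and
`2' appear only as edge labels, which is one way to keep the two kinds of
binaries apart.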
Two important semantic primitives worth mentioning here are {\tt gen} and {\tt
wh}. {\tt gen} is a generic quantifier-like element that is neutral between
{\it somebody/something} and {\it anybody/anything}. In a system of formal
logic {\tt gen} would be just a variable-binding term operator, without
universal or existential import. {\tt wh}, also a VBTO, provides the
semantics of the interrogative morpheme. Note that both {\tt gen} and {\tt wh}
are unaries, and have no scope.
Unaries can serve both as predicates and as arguments. Altogether, the
metalanguage is rather loosely typed, in that anything can serve as an
argument (when an argument slot is filled by something complex, this complex
formula is surrounded by $\{ \}$ for the sake of clarity, see
Section~\ref{parens}) and neither argument of a binary needs to be filled
obligatorily.
Another pair of unaries, {\tt after} and {\tt before}, constitute all that
4lang currently offers in the way of temporal semantics. These refer to the state
after (resp. before) the event (verb) they characterize: think of these as
the final (resp. initial) still from a short movie depicting the event. So we have
\begin{verbatim}
die after(dead)
dead LACK live, before(live)
\end{verbatim}
\noindent
These are not to be confused with {\tt AFTER} and {\tt BEFORE}, which are the
usual temporal primitives (by duality, only one of them needs to be left
undefined) with two arguments, e.g. {\tt Tuesday BEFORE Wednesday}.
\subsection{Relations}\label{rel}
In most systems of semantic representation there is something of a squish
between categories of the language and categories of the metalanguage. Our
goal is to avoid this entirely, but of course there exist important metalanguage
categories that are modeled on linguistic concepts, often bearing the same
name, so a great deal of caution is called for. In \ref{lexcat} we discussed
the lexical categories of the language (N,V,A,...) and mentioned that the
metalanguage has only two, unaries and binaries, with most elements belonging
to both of these. We could stretch the analogy, and consider the comma and the
various parentheses used in the metalanguage as G (grammatical formatives) but
the central ideas of natural language syntax are not very helpful in
describing the metalanguage, so we will not pursue this here.
Here we discuss those elements that are always binary, and to avoid confusing
these with the contentful (linguistic) elements, we call them {\it relations}
(in the ordinary mathematical sense `subset of direct product'). We have argued
elsewhere \citep{Kornai:2012} that linguistic analysis doesn't require ternary
or higher arity relations, and here we will see that the metalanguage uses
binaries only. The central binaries we rely on are `0', `1', and `2'. Of
these, `1' and `2' are the familiar grammatical functions {\it subject} and
{\it object} respectively, while 0 `being' is used indiscriminately for {\it
is} and IS\_A. For legibility of formulas, a 0 arrow (binary relation) from
{\tt b} to {\tt a} can be written as {\tt b[a]} or as {\tt a(b)}. Both
correspond to Plato's notion of the subject `partaking' or `participating in'
the predicate. We maintain the Aristotelian distinction between accidental and
essential participation, using only essential properties in definitions (but
not in the semantic representation of more complex expressions).
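As a minimal sketch of the equivalence between the two notations (in Python,
not part of the 4lang distribution; the function name {\tt zero\_edge} is
hypothetical, and only flat, non-nested expressions are handled), both
spellings can be reduced to the same 0 arrow:

```python
import re

def zero_edge(expr):
    """Return (source, target) of the 0 arrow written as b[a] or a(b).

    Illustrative only: handles flat expressions, not nested subgraphs.
    """
    m = re.fullmatch(r"(\S+?)\[(\S+)\]", expr)
    if m:
        # b[a]: the 0 arrow runs from b to a
        return (m.group(1), m.group(2))
    m = re.fullmatch(r"(\S+?)\((\S+)\)", expr)
    if m:
        # a(b): the same 0 arrow, again from b to a
        return (m.group(2), m.group(1))
    raise ValueError("not a flat b[a] or a(b) expression: " + expr)
```

Under these assumptions {\tt zero\_edge("b[a]")} and {\tt zero\_edge("a(b)")}
yield the same pair, which is all the notational convention claims.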
There are no type restrictions on what can be connected with what by means of
0, 1, and 2 relations, except that relations cannot appear as subjects or
objects (grammatical functions are discussed further in \ref{linking}). They
can, however, appear connected by IS\_A, since IS\_A is not conceptualized as
a relation in the metalanguage, only `0', a supercategory of `participation'
and `inherence' is. In fact, IS\_A is treated as epiphenomenal, {\tt x IS\_A
y} being equivalent to `x has all the essential properties of y and perhaps
some others as well'. In other words, the subsumption hierarchy can be deduced
from a model, but it is rarely overtly marked, except for abbreviatory
purposes.
Besides IS\_A, there are some other link types conspicuously missing from
4lang. Systems of Knowledge Representation (KR) such as Cyc \citep{Lenat:1990}
often insist on finer distinctions than 4lang, e.g. by distinguishing an
individual poet from the class Poet. This makes an individual such as Allen
Ginsberg an {\sc InstanceOf} the Poet class, and Cyc actually demands that a
distinction be made between this and {\sc SubsetOf}, as between {\sc Stuff} and
{\sc StuffType}, and so on. In 4lang, ``we make purposely very little
distinction between an individual fox, the species Vulpes vulpes, the set of
foxes in the world, or the class of potential foxes in all possible worlds''
\citep{Kornai:2018}, and treat {\it gold} as a unary. (Such unaries often have
a binary sense as well: English is lucky to have {\it gilt} as a separate verbal
base, but the unary-binary conversion is very general, cf. {\it Babe Ruth
homered his way into the hearts of America} \citep{Jackendoff:1990}.)
Other link types commonly seen include prepositions with clear spatial meaning
such as IN; FROM; OUT; BETWEEN; AT; FOLLOW; ON; and ABOUT. Many of these
govern cases in certain languages, and many become case markers themselves in
others, and we see this as sufficient reason to treat them in the metalanguage
as ordinary binaries with their own 1st and 2nd arguments (denoted by `1' and
`2' as with transitive verbs). The capitalization is an admission of the fact
that currently 4lang is not endowed with a sufficiently general theory of
spatiotemporal relations to settle on these as true primitives -- it is
getting close, but it is not there yet. In many cases, such as {\tt under}, we
felt comfortable that the 4lang analysis captured the semantics sufficiently
well that capitalization has been dropped.
In those relations where the printname is not indicative of argument order, we
settled on one variant. For example we use {\tt x PART\_OF y} in preference to
{\tt y HAS\_PART x}, but in fact {\tt part} would work just as well -- the
fact that it governs {\it of} in this sense is again indicative of some
near-adposition near-case status. The system could be more consistent: the
other major near-case, INSTRUMENT, is used in order {\tt x HAS\_INSTRUMENT y}
rather than {\tt y INSTRUMENT\_OF x}. It is not clear what sort of facts could
be used to demonstrate that one argument order is better than the other.
Remarkably, only one of these relations lacks an easily expressible obverse:
for {\tt x IS\_ABOUT y} there is no relation REL expressing {\tt y REL x} in
approximately the same situations -- this may point at some irreducibly
nonspatial sense. The remaining relations that we currently leave unanalyzed
include purposive FOR (no obvious obverse, but no obvious spatial meaning
either); possessive HAS; and causative CAUSE; these three appear frequently in
definitions, and for the last two the passive works reasonably well to
express the obverse. This is not logically strict: {\it metal fatigue causes
deadly accidents} is not strictly equivalent to {\it deadly accidents are
caused by metal fatigue}, but the cause/effect relation is still clear from
the passive. (For scope effects, and for the dyadic negation relation LACK see
\cite{Kornai:2020a}.) We also keep as primitive the abstract comparative ER
`>', which also governs some case/adposition in many languages.
With this, our list of relations is complete, except for one element, MARK,
which we defer to \ref{naivegrammar}. Completeness, we should emphasize here,
is not the same as finality: it may well be possible to reduce the list
further, or to shift some of the basis from unary/binary (the auxiliary-style
analysis currently used) to purely binary, because relational treatment may be
warranted for modals such as desideratives or imperatives, and even for
standard lexical redundancy rules such as reflexivization, causativization or
locative inversion. This is obviously a prime testing ground for universal
grammar, but a systematic treatment will require a better cross-linguistic
understanding of relational elements than the author can lay claim to.
There are other important issues that we cannot pursue here, such as
collective subjects \citep{Scha:1981}. In general, we find verbs where the
subject and the object are well differentiated: {\it water quenches fire} but
not {\it *water and fire quench}. Cases of free alternation are rare ({\it
John marries Sandra, Sandra and John marry}), and cases where the subject
must be the collective are even more so: {\it John and Peter are brothers,
*John brothers Peter, Marseille is between Nice and Montpellier, *Marseille
betweens Nice and Montpellier}. Remarkably, in the defining vocabulary we
see a disproportionate number of predicates that make only, or the most, sense
when one of the arguments is collective: {\it close, between, through,
\ldots}.
\subsection{Linking}\label{linking}
In terms of the amount of fully analyzed text available, Universal
Dependencies (UD) is the single most influential cross-linguistic framework of
grammatical description \citep{Nivre:2018s}. While many other schools of
grammatical description offer a broader variety of analyses, these (with the
possible exception of tagmemics) rarely extend to a broad selection of
languages, the dominant style of linguistic analysis being the in-depth study
of a restricted range of syntactic phenomena (ideally across many
typologically diverse languages, but quite often restricted to a single
language) rather than the in-breadth analysis of an entire language. Here we
assume the reader is familiar with UD, and compare 4lang to UD, pointing at
other frameworks only in a few places. Generally, 4lang is on the sparse or
`lumping' side of the comparison, not just in relation to UD, but also in
relation to other well-developed theories like LFG, HPSG, or MP.
As far as {\bf grammatical functions} are concerned, we only assume two,
subject and object. As argued in \citep{Kornai:2012}, ditransitives and higher
arity predicates are unnecessary for semantic purposes. Since 4lang doesn't
have an indirect object (UD {\tt iobj}) function, ditransitives are always
modeled by decomposition:
\begin{verbatim}
give =AGT CAUSE[recipient HAS =PAT], "dative" MARK recipient
buy =AGT receive =PAT, =AGT pay seller, "from" MARK seller
sell =AGT CAUSE[buyer HAS =PAT], buyer CAUSE[=AGT HAS <money>],
"dative" MARK buyer
\end{verbatim}
\noindent
Since UD distinguishes dependency links by the category of the head and the
dependent, it naturally keeps notions like {\tt nsubj} and {\tt csubj}
(nominal and clausal subjects) separate, and similarly for {\tt obj} and {\tt
ccomp}. 4lang, with its roots in the theory of Knowledge Representation,
where the proliferation of link types has emerged as a significant problem
early on \citep{Woods:1975}, admits only one other link type, `0' (subject
links are `1' and object links `2'), which subsumes most of the other link
types used in UD, such as {\tt amod, appos, nummod} and {\tt advmod}. In a
strictly link-based system such as UD it is a practical necessity to have a
separate link type for coordination: in 4lang we just use comma-separated
concatenation (see Section~\ref{comma}).
Aside from attribution and predication, which are both denoted by a `0' link,
two cardinal links, `1' and `2', are used for all binaries, including those
marked in many grammatical systems by {\bf deep cases} such as {\tt
INSTRUMENT}.
\begin{verbatim}
cheque paper, write ON, amount ON, signature ON, pay/812 INSTRUMENT
say communicate, INSTRUMENT sound/993, recipient hear sound/993,
"dative" MARK recipient
\end{verbatim}
What these definitions mean (by virtue of the convention spelled out in
Section~\ref{subjobj}) is that cheques are instruments in paying, and sounds
are instruments of communicating. INSTRUMENT, as a binary relation, has a
subject (what has the instrument) and an object (what is the instrument). In
effect, a single link type, which could be labeled `INS' is replaced by an
INSTRUMENT node that has the same two arrows as are used for other binaries,
subject and object. Whether other deep cases are called for is not clear from
a lexical semantics perspective, but (broadly speaking) subjects are agents,
objects are patients, AT, ON, IN, are locatives, FROM is source (rather than
ablative), and FOR is goal (purposives, rather than end of motion). 4lang
compromises on dative, which is treated as a {\bf surface case} rather than as
a deep case `recipient' or as an adposition (though in English it is one).
%\bigskip\noindent
%\begin{tabular}{llll}
%Agent & {\it kart\d{r}} & the independent one & (1.4.54)\\
%Goal & {\it karman} & what is primarily desired by the agent & (1.4.49)\\
%Recipient & {\it sa\d{m}prad\={a}na} & the one in view when giving &
%(1.4.32)\\
%Instrument & {\it kara\d{n}a} & the most effective means & (1.4.42)\\
%Locative & {\it adhikara\d{n}a} & the locus & (1.4.45)\\
%Source & {\it ap\={a}d\={a}na} & the fixed point that movement is away from
% & (1.4.24)\\
%\end{tabular}
\bigskip
Finally, we consider {\bf thematic roles} of which we have exactly two, {\tt
=AGT} and {\tt =PAT}. These are constitutive elements in definitions of
binaries where the subject or object needs to be named. Normally, it is the
definiendum that appears in subject or object position of a defining clause:
\begin{verbatim}
soil ground, plant grow IN
fault CAUSE problem
\end{verbatim}
\noindent
but time and again we need to say something about the subject or object of the
definiendum itself:
\begin{verbatim}
protest show [=AGT think [=PAT[wrong]]]
\end{verbatim}
\noindent
{\it x protests y} means that x shows that x thinks that y is wrong. The
thematic roles simply reify the protester (=AGT) and the thing protested
(=PAT). Agents and patients of definiendum and definiens are automatically
shared, so it is really {\tt =AGT protest =PAT} that is being defined by {\tt
=AGT show [=AGT think [=PAT[wrong]]]}. For ease of human reading,
redundancies like this are suppressed, but the parser supplies =AGT and =PAT
automatically.
\subsection{Naive grammar}\label{naivegrammar}
The central subject of this paper, the {\it metalanguage} we use to describe
the semantics, has its own syntax, which drives the parser {\tt
def\_ply\_parser.pl}. But before we can turn to this in \ref{8thcol}, we need
to emphasize that 4lang is a formal system in its own right, not intended as a
proposal about natural language syntax (the parser will parse the definitions,
not natural language), and discuss some cases where considerations of syntax
nevertheless creep in.
Since the issue is central to the development of generative grammar, we should
make clear here that our position is not intended as an argument for, or
against, the autonomy of syntax thesis. As a research strategy, we prefer a
semantic formalism that is as autonomous as feasible, since this promotes
modularity not just in the sense of \cite{Fodor:1983}, but also in the sense
of enabling independent experimentation and research for both syntacticians
and semanticists. We do not feel qualified to take sides in the debate, but if
those who believe only in a limited autonomy of syntax are to mount arguments
capable of convincing the opposing side, these arguments need to be cast in
terms of the inadequacy of well-modularized systems, so even for those
refusing to entertain full modularity the first order of business is to look
at modular architectures.
Our main contribution to this area is that we only make reference to a {\it
naive} theory of grammar, just as we see the need to link to naive
probability \citep{Gyenis:2019}, naive planning \citep{Gordon:2017}, and
believe that many of the issues discussed in \ref{rel} would be considerably
simplified by reference to the appropriate naive theory of space and time.
The fundamental elements (primitives) of naive grammar are {\it words}. We
don't go anywhere near the issues of how a word is, or should be, defined in
phonology, morphology, orthography, syntax, semantics, or lexicography (though
we assume that the reader is somewhat familiar with the main proposals). For
our purposes {\it word} is defined as {\tt sign, speech}, and {\it sign} as
{\tt gen perceive, information, show, HAS meaning}.
We have seen in \ref{lexcat} the use of the generic pronoun {\tt gen}, and
could detail the naive theory implicit in our system of definitions, but the
system of these is not intended as the final word on this subject, and would
not bring us much closer to the focal point of the lexical semantics/naive
grammar interface, which we take to be the (primitive) relation {\tt MARK}.
As used in 4lang, {\tt MARK} is the relation connecting form and meaning. This
corresponds well to the Saussurean notion of the sign, and will be sufficient
for our purposes, even though a more sophisticated theory of signs
\citep{Kracht:2011} is available for the non-naive theory of grammar. Our main
use of {\tt MARK} is with function words (including bound grammatical formatives) as
in
\begin{verbatim}
-ing stem-ing IS_A event, "_-ing" MARK stem
\end{verbatim}
Operationally, whatever precedes the formative {\tt ing} is considered a stem,
and the whole form {\tt stem+ing} is considered an event. There is clearly a
great deal more that could be said about {\it -ing} suffixation, the notion of
stems, the classification of junctures, or the conceptual classification of
certain matters as events, but we make no apologies for not developing these
notions as part of naive grammar, especially as {\tt MARK} is used only in
0.5\% of the vocabulary, and often with other similarly undeveloped notions of
naive grammar such as cases. As an example, consider {\it appear} `pareo'
defined by {\tt after[=AGT AT location], "locative" MARK location} as in
{\it A deer appeared in the garden}.
What can/should be considered a locative case in a given language, and what
can be considered a location are weighty questions, well beyond the naive
theory of grammar that we rely on here. On the whole, our approach only
provides a lower bound: to the extent anybody wishes to engage in a fuller
analysis of the vocabulary, they need to introduce some terminology that takes
them outside the bounds of the naive theory. In this regard, the system is
full of promissory notes: we define {\it to/3600} by {\tt "to/3600 \_" MARK
infinitive}, as in {\it John went to eat}, and {\it to/12} as {\tt
after(=AGT AT =PAT)}, as in {\it John went to Chicago}. Here we must
leave notions like {\it infinitive} or agent/patient undefined (we have seen
in \ref{linking} some examples of their intended use).
In this paper we do not take on board the issue of expanding this to a fuller
theory of (naive) grammar, but as we see, a non-eliminable grammatical
core emerges from semantic definitions, and these offer a rich interface for
connecting 4lang to issues that modern grammar (in this case, starting with
\cite{Fillmore:1968}) has much to say on. For example, we will need a theory
that connects intransitives to transitives, as in {\it The fire spread} and
{\it The wind spread the fire}. In English, it is obvious that there is some
relation between these two verbs, but in Latin there is no obvious reason to
relate {\it distendo} and {\it sterno}. In Hungarian, the stems are derived
from the same root by productive suffixation, so we have {\it ter-\"ul} and
{\it ter-\'{\i}t}. 4lang brings out the similarity by taking the intransitive
form as basic, {\tt after(=AGT AT wide)}, and deriving the transitive effect
as one of causation: {\tt =AGT CAUSE \{=PAT spread\}}.
We expressly disavow any idea of the naive grammar being the ultimate grammar,
or even the metalanguage being the ultimate metalanguage. The system is
designed to support one thing, and one thing only, natural language
semantics. There are many other semiotic systems from music to mathematics
that would have very different semantics, and 4lang is simply not equipped to
deal with these. Also, experience shows that naive theories are superseded by
more sophisticated ones for a reason, as the sophisticated theories are simply
better. But they often rely on key components, such as arithmetic, or the
analytic theory of continuous variables, that are out of scope for 4lang,
expressly designed to deal with ordinary (as opposed to technical or
scientific) language.
\section{The syntax of definitions}\label{8thcol}
\subsection{Coordination}\label{comma}
A 4lang definition always contains one or more (hypergraph) nodes, of which
one is distinguished as the {\it head} (related to, but not exactly the same
as the {\it root} in dependency graphs). All these are interpreted as graph
edges with label 0 running from the definiendum to the definiens. The simplest
definitions are therefore of the form x, where x is a single node. Example
(all examples are taken from 4lang/Reform):
\noindent
{\tt aim purpose}
\noindent
that is, the word {\it aim} is defined as {\it purpose}. Somewhat more complex
definitions are given by a comma-separated list. Here the head is always the
first element. Examples:
\begin{verbatim}
board artefact, long, flat
boat ship, small, open/1814
\end{verbatim}
The number following the `/', if present, serves to disambiguate among various
definitions, in this case adjectival {\it open} `apertus' from verbal
{\it open} `aperio'. These numbers are in column 5 of the 4lang file.
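The comma convention can be sketched as follows. This is a minimal Python
illustration under our own assumptions, not the actual parser: the function
name {\tt parse\_simple\_definition} is hypothetical, and only definitions
whose elements are single (optionally numbered) words are handled.

```python
def parse_simple_definition(entry):
    """Split 'definiendum elem1, elem2, ...' into its parts.

    Illustrative only: each element must be a single word, optionally
    followed by '/' and a disambiguating number.
    """
    definiendum, _, body = entry.strip().partition(" ")
    nodes = []
    for element in body.split(","):
        name, _, index = element.strip().partition("/")
        nodes.append((name, int(index) if index else None))
    head = nodes[0]  # the head is always the first element
    # every element is linked to the definiendum by an edge labeled 0
    edges = [(definiendum, 0, name) for name, _ in nodes]
    return definiendum, head, nodes, edges
```

Applied to the {\it boat} entry above, the sketch returns {\tt ship} as head
and keeps the index 1814 with {\tt open}, leaving the other elements
unnumbered.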
\subsection{External pointers}\label{atsign}
Sometimes (here 42 cases in 1,200) a concept doesn't fully belong in the
lexicon, but rather in the encyclopedia. In the formal language defined here,
such {\it external pointers} are marked by a prefixed @. Examples:
\begin{verbatim}
Africa land, @Africa
London city, @London
Muhammad man/744, @Muhammad
U letter/278, @U
\end{verbatim}
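\noindent
A consumer of such entries might separate lexical material from encyclopedic
pointers along the following lines. This is only a sketch in Python; the
function name {\tt split\_external} is hypothetical, and the input is assumed
to be an already tokenized list of definition elements.

```python
def split_external(elements):
    """Separate lexical elements from external (@-prefixed) pointers.

    Illustrative only: 'elements' is a list of already tokenized
    definition elements such as ["city", "@London"].
    """
    lexical = [e for e in elements if not e.startswith("@")]
    # strip the '@' so the pointer can be looked up in the encyclopedia
    external = [e[1:] for e in elements if e.startswith("@")]
    return lexical, external
```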
\subsection{Subjects and objects}\label{subjobj}
In addition to 0 links, definitions often explain the definiendum in terms of
it being the subject or object of some binary relation. When not defined by
some more primitive term, such binary relations are given in CAPS. For
example:
\begin{verbatim}
April month, FOLLOW march/1563, may/1560 FOLLOW
bank institution, money IN
\end{verbatim}
\noindent
The intended graph for April will have a 0 link from the head to month, a 1
link to march/1563 and a 2 link to may/1560. Often, what is at the other side
of the binary is unspecified, in which case the open slot is ``plugged up''
with the {\tt gen} symbol. Examples:
\begin{verbatim}
vegetable plant, gen EAT
sign information, gen PERCEIVE, show, HAS meaning
\end{verbatim}
\noindent
Thus, {\it vegetable} is a plant that someone (not specified who) can eat (it
is the object of eating, subject unspecified), and {\it sign} is\_a
information, is the object of perception, is\_a show (nominal, something that
is or can be shown) and has meaning.
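The positional convention can be sketched in Python as follows. The function
name {\tt read\_clause} is hypothetical, and the sketch assumes that each
clause is either a single node or a two-token clause in which the relation is
the token written in CAPS; it records (subject, relation, object) triples and
does not commit to how the arrows are drawn.

```python
def read_clause(definiendum, clause):
    """Interpret one clause of a definition as a (subject, relation, object) triple.

    Illustrative only: relations are recognized by being written in CAPS,
    and a bare node is recorded as a "0" link from the definiendum.
    """
    tokens = clause.split()
    if len(tokens) == 1:
        return (definiendum, "0", tokens[0])
    if len(tokens) == 2 and tokens[0].isupper():
        # REL x: the definiendum fills the subject slot
        relation, obj = tokens
        return (definiendum, relation, obj)
    if len(tokens) == 2 and tokens[1].isupper():
        # x REL: the definiendum fills the object slot
        subj, relation = tokens
        return (subj, relation, definiendum)
    raise ValueError("clause outside the scope of this sketch: " + clause)
```

With these assumptions, {\tt gen EAT} in the entry for {\it vegetable} yields
a triple in which {\tt gen} is the subject and {\it vegetable} the object of
eating, matching the paraphrase above.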
\subsection{Direct predication}\label{isa}
In a formula, {\tt A[B]} means that there is a 0-link from A to B. This is used
only to make the notation more compact. The notation {\tt B(A)} means the same
thing; it is also just syntactic sugar. Both brackets and parens can contain
full subgraphs.
\begin{verbatim}
tree plant, HAS material[wood], HAS trunk/2759, HAS crown
\end{verbatim}
\noindent
That trees also have roots is not part of the definition, not because it is
inessential, but because trees are defined as plants, and plants all have
roots, so the property of having roots will be inherited.
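Inheritance along 0 links can be sketched as a simple graph traversal. The
Python below is an illustration under stated assumptions: the function name
{\tt inherited\_properties} and the dictionary layout are hypothetical, and
the entry for {\it plant} is invented for the example rather than quoted from
4lang.

```python
def inherited_properties(concept, zero_links, own_properties):
    """Collect the properties of a concept and of everything reachable by 0 links."""
    seen, stack, result = set(), [concept], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against cycles in the definitions
        seen.add(node)
        result |= own_properties.get(node, set())
        stack.extend(zero_links.get(node, []))
    return result

# tree is defined as a plant (0 link); only its own HAS clauses are listed
zero_links = {"tree": ["plant"]}
own_properties = {
    "tree": {("HAS", "trunk/2759"), ("HAS", "crown")},
    "plant": {("HAS", "root")},  # hypothetical entry, for illustration only
}
```

Calling {\tt inherited\_properties("tree", zero\_links, own\_properties)}
returns the two properties listed for {\it tree} together with the one
assumed for {\it plant}, so having roots need not be restated in the
definition of {\it tree}.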
\subsection{Defaults}\label{default}
In principle, all definitional elements are strict (can be defeased only under
exceptional circumstances) but time and again we find it expedient to collapse
strongly related entries by means of defaults that appear in angle brackets.
\begin{verbatim}
ride travel, =AGT ON <horse>, INSTRUMENT <horse>
\end{verbatim}
\noindent
These days, a more generalized {\it ride} is common ({\it riding the bus,
catching a ride, \ldots}), so the definition {\tt travel} should be sufficient
as is. The historically prevalent mode of traveling, on horseback, is kept as
a default. Note that these two entries often get translated by different
words: for example Hungarian distinguishes {\it utazik} `travel' and {\it
lovagol} `rides a horse', a verb that cannot appear with an object or
instrument the same way as English `ride a bike' can.
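One way to separate the strict part of such a definition from its defaults is
sketched below in Python. The function name {\tt split\_strict\_default} is
hypothetical, and treating every clause that contains an angle-bracketed
element as defeasible in its entirety is our assumption for the sketch, not a
documented property of the parser.

```python
import re

# an element enclosed in angle brackets, e.g. <horse>
DEFAULT = re.compile(r"<([^<>]+)>")

def split_strict_default(body):
    """Split the body of a definition into strict and default clauses.

    Illustrative only: a clause counts as a default if it contains
    at least one angle-bracketed element; the brackets are removed.
    """
    strict, default = [], []
    for clause in (c.strip() for c in body.split(",")):
        if DEFAULT.search(clause):
            default.append(DEFAULT.sub(r"\1", clause))
        else:
            strict.append(clause)
    return strict, default
```

For the body of the {\it ride} entry this leaves {\tt travel} as the only
strict clause, with the two horse-related clauses set aside as defaults.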
\subsection{Agents, patients}\label{agtpat}
The relationship between horseback riding (which is, as exemplified in
Section~\ref{default} above, just a form of travelling) and its defining
element, the horse, is indirect. The horse is neither the subject nor the
object of travel. Rather, it is the rider who is the subject of the
definiendum and the definiens alike, corresponding to a graph node that has a
1 arrow leading to it from both. This node is labelled by {\tt =AGT}, so when
we wish to express the semantic fact that Hungarian {\it lovagol} means travel
on a horse we write
\begin{verbatim}
lovagol travel, =AGT ON horse
\end{verbatim}
\noindent
Note that the horse is not optional for this verb in Hungarian: it is
syntactically forbidden ({\it lovagol} is intransitive) and semantically
obligatory. (Morphologically it is already expressed, as the verb is derived
from the stem {\it l\'o} `horse' though this derivation is not by productive
suffixation.) Remarkably, when the object is\_a horse (e.g. a colt is a young
horse, or a specific horse like Kincsem) we can still use {\it lovagol} as in
{\it J\'anos a csik\'ot lovagolta meg} or {\it Elijah Madden Kincsemet
lovagolta}.
For the patient role, consider the word {\it know}, defined as `has
information about'. For this to work, the expression {\tt x know y} has to be
equivalent to {\tt x HAS information ABOUT y}. To achieve this, we need to
express the fact that the subject of HAS is the same as the subject of {\it
know} (this is done by the {\tt =AGT} placeholder) and that the object of