
\documentclass[11pt,bookmarks,bookmarksnumbered,naturalnames,plainpages=false,pdftex,colorlinks=true,urlcolor=blue,bookmarksdepth=subsection,plainpages=false]{paper}
\usepackage[T2A,T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[a4paper,margin=1in]{geometry}
% \newcommand{\Be}{\mbox{\usefont{T2A}{\rmdefault}{m}{n}\CYRB}}
\usepackage{times}
\usepackage{amssymb}
\usepackage{transparent}
% \usepackage{makeidx}
\usepackage{tikz}
%\usepackage{wrapfig}
%\usetikzlibrary{snakes,arrows,shapes,automata}
%\usepackage{mlbook}
%\usepackage[round]{natbib}
%\usepackage{multicol}        
\usepackage{epsfig}
\usepackage{graphicx}
% \usepackage{xypic}
\usepackage[matrix,arrow]{xy}
\usepackage[pdftex,colorlinks=true,urlcolor=blue,bookmarksdepth=subsection,plainpages=false]{hyperref}
\usepackage[backend=biber,natbib,style=authoryear,uniquename=false,uniquelist=false]{biblatex}
\def\stackrel#1#2{\mathrel{\mathop{#2}\limits^{#1}}}
\makeatletter
\newcommand{\dashedrightarrow}[1][2pt]{%
  \settowidth{\@tempdima}{$\rightarrow$}\rightarrow% typeset arrow
  \makebox[-\@tempdima]{\hskip-1.5ex\color{white}\rule[0.5ex]{#1}{2pt}}% typeset overlay
  \phantom{\rightarrow}% advance appropriate horizontal distance
}
\makeatother

\newcommand{\andras}[1]{{\color{magenta}{AK: #1}}}
\newcommand{\zalan}[1]{{\color{red}{MM: #1}}}

\addbibresource{ml.bib}

\begin{document} 
\title{The syntax of 4lang definitions}
\author{Andr\'as Kornai}
%\small HAS Computer and Automation Research Institute\\
%\small H-1111 Budapest Kende u 13-17, Hungary\\
%\small \href{mailto:andras@kornai.com}{andras@kornai.com}}
\date{}
\maketitle

\begin{abstract}
We describe how the basic lexicographic principles of 4lang are reflected in
the syntax of definitions.
\end{abstract}

\section*{Background}

4lang is a concept dictionary, intended to be universal in a sense made more
precise below. The main motivation was spelled out in \citep{Kornai:2010}
as follows:

``In creating a formal model of the lexicon the key difficulty is the
circularity of traditional dictionary definitions -- the first English
dictionary, \cite{Cawdrey:1604} already defines {\it heathen}
as {\tt gentile} and {\it gentile} as {\tt heathen.} The problem has already
been noted by Leibniz (quoted in \cite{Wierzbicka:1985}):

\begin{quote}
Suppose I make you a gift of a large sum of money saying you can collect it
from Titius; Titius sends you to Caius; and Caius, to Maevius; if you continue
to be sent like this from one person to another you will never receive
anything.
\end{quote}

\noindent
One way out of this problem is to come up with a small list of primitives, and
define everything else in terms of these.'' 

To take the first tentative steps towards language-independence, the system
was set up with bindings in four languages, representative samples of the
major language families spoken in Europe: Germanic (English), Slavic (Polish),
Romance (Latin), and Finno-Ugric (Hungarian). Today, bindings exist in over 40
languages \citep{Acs:2013}, but the user should keep in mind that these 
bindings provide only rough semantic correspondence to the intended concept. 

4lang can be used for a variety of purposes: for better understanding of
deep cases \citep{Makrai:2014}; for producing state of the art results on
analogy tasks \citep{Recski:2016c}; for studies of the importance of
individual concepts \citep{Makrai:2013a}; for investigating spreading
activation \citep{Nemeskey:2013}; and of course for investigating various
issues of lexicography \citep{Kornai:2015a}.

Unfortunately, these papers don't always reference the same version of the
slowly evolving 4lang, and the principles undergirding the system are not
available in a single, convenient source. The goal of this document is to
provide a single entry point for both the linguistic and the computational
aspects. 

The main {\tt 4lang} file is divided into 9 tab-separated fields, of which the
last is reserved for comments. The percent sign is also used to delimit
comments, though this usage is not (yet) consistent. A typical entry (written as one line
in the file but here broken up in three for legibility) would be

\begin{verbatim}
attention	figyelem	animi_attentio	uwaga	801	u	V    
listen, see, =AGT think =PAT[interesting, important] 
%light verb, "pay attention"
\end{verbatim}

\noindent
As can be seen, the first four columns are the 4 language bindings given in
EHLP order, with provisions for keeping the entire file in 7-bit ASCII. The
fifth is a unique number per concept, most important when the English bindings
coincide: 

\begin{verbatim}
cook	fo3z	coquo	gotowac1	825	u	V	=AGT make <food>, INSTRUMENT heat	
cook	szaka1cs	coquus	kucharz	2152	u	N	person, <profession>, make food		
\end{verbatim}

\noindent
The sixth column is technical (representing current update status) and will be
ignored here, as it is relevant only for the maintainers of the
dictionary. The seventh column is a rough lexical category symbol, see
Section~\ref{lexcat} for further discussion. The main subject of this note is
the 8th column, which gives the 4lang definition, see Section~\ref{8thcol}, 
but before turning to this, we discuss some of the lexicographic principles,
and take the opportunity to introduce some of the machinery informally first. 
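
\noindent
As a rough illustration of this record layout, the following Python sketch
splits one line of the file into its nine fields. The names {\tt Entry} and
{\tt parse\_entry} are hypothetical, and both the exact placement of tabs in
the sample line and the padding of missing trailing fields are assumptions
rather than documented properties of the file.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    english: str      # columns 1-4: bindings in EHLP order
    hungarian: str
    latin: str
    polish: str
    concept_id: int   # column 5: unique number per concept
    status: str       # column 6: update status (maintainers only)
    category: str     # column 7: rough lexical category
    definition: str   # column 8: the 4lang definition
    comment: str      # column 9: free-form comment


def parse_entry(line: str) -> Entry:
    """Split one tab-separated record into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: trailing empty fields may be missing, so pad to nine.
    fields += [""] * (9 - len(fields))
    if len(fields) != 9:
        raise ValueError(f"expected at most 9 fields, got {len(fields)}")
    en, hu, la, pl, num, status, cat, definition, comment = (
        f.strip() for f in fields
    )
    # Comments may carry a leading percent sign; strip it if present.
    comment = comment.lstrip("%").strip()
    return Entry(en, hu, la, pl, int(num), status, cat, definition, comment)


line = "cook\tfo3z\tcoquo\tgotowac1\t825\tu\tV\t=AGT make <food>, INSTRUMENT heat\t"
entry = parse_entry(line)
```

Keeping the concept number as an integer makes it easy to tell apart records
whose English bindings coincide, as with the two {\it cook} entries above.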

%ring	cseng	tinnit	dzwonic1	2735	u	U	bell make sound/993, FOR/2782 attention, <bell PART_OF telephone>	


\section{Lexicographic principles}

\subsection{Reductivity}

In many ways, 4lang is a logical outgrowth of modern, computationally oriented
lexicographic work beginning with Collins-COBUILD \citep{Sinclair:1987}, the
Longman Dictionary of Contemporary English (LDOCE) \citep{Boguraev:1989},
WordNet \citep{Miller:1995}, FrameNet \citep{Fillmore:1998}, and VerbNet
\citep{Kipper:2000}. 

The key step in minimizing circularity was taken in LDOCE, where a small
(about 2,200 words) defining vocabulary (called LDV, Longman Defining
Vocabulary) was created, and strictly adhered to in the definitions with one
trivial exception: words that often appear in definitions (e.g. the word {\it
  planet} is common to the definition of Mercury, Mars, Venus, \ldots) can be
used as long as their definition is strictly in terms of the LDV.  Since {\it
  planet} is defined `a large body in space that moves around a star' and
{\it Jupiter} is defined as `the largest planet of the Sun' it is easy to
substitute one definition in the other to obtain for Jupiter the definition
`the largest body in space that moves around the Sun'. 

4lang generalizes this process, starting with a core list of NNN primitives,
defining a larger set in terms of these, a yet larger set in terms of these,
and so on until the entire vocabulary is in scope. As a practical matter we
started from the opposite direction, with a seed list of approximately 3,500
entries composed of the LDV (2,200 entries), the most frequent 2,000 words
according to the Google unigram count \citep{Brants:2006} and the BNC
\citep{Burnard:1998}, as well as the most frequent 2,000 words from Polish
\citep{Halacsy:2008} and Hungarian \citep{Kornai:2006}.  Since Latin is one of
the four languages supported by 4lang, we added the classic
\cite{Diederich:1939} list and \cite{Whitney:1885}.

Based on these 3,500 words, we reduced the defining vocabulary by means of a
heuristic graph search algorithm \citep{Acs:2013} that eliminated all words
that were definable in terms of the remaining ones. The end-stage is a
vocabulary with the {\it uroboros property}, i.e. one that is minimal with respect to this
elimination process. This list (1,200 words, counting different senses with
multiplicity) was published as Appendix~4.8 of \cite{Kornai:2019} and was used
in several subsequent studies including \citep{Nemeskey:2018}. (The last
remnant of the fact that we started with over 3k words is that numbers in the
5th column are still in the 1-3,500 range, as we decided against renumbering
the set.)
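
\noindent
The elimination process can be pictured with the toy Python sketch below. It
is not the heuristic graph search of \cite{Acs:2013}: the greedy strategy, the
alphabetical processing order, the function names, and the four-word sample
dictionary are all assumptions made for illustration. A word is dropped from
the defining set only if every word dropped so far can still be unfolded into
the remaining words without circularity.

```python
def definable(word, core, defs, path=()):
    """True if `word` unfolds into words of `core` without circularity."""
    if word in core:
        return True
    if word in path or word not in defs:
        return False  # circular definition, or an undefined word
    return all(definable(w, core, defs, path + (word,)) for w in defs[word])


def reduce_vocabulary(defs):
    """Greedily shrink the defining set; the result depends on the order."""
    core = set(defs) | {w for body in defs.values() for w in body}
    for word in sorted(defs):
        candidate = core - {word}
        eliminated = [w for w in defs if w not in candidate]
        if all(definable(w, candidate, defs) for w in eliminated):
            core = candidate
    return core


# Toy dictionary: definiendum -> words used in its definition.
defs = {
    "heathen": ["gentile"],
    "gentile": ["heathen"],
    "planet": ["large", "body", "space", "move", "star"],
    "jupiter": ["large", "planet", "sun"],
}
core = reduce_vocabulary(defs)
```

In this toy run one member of the circular pair {\it heathen}/{\it gentile}
has to stay in the defining set, while {\it planet} and {\it jupiter} are both
eliminated. Which member of a circular pair survives depends on the processing
order, which is why such a search is heuristic.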

Importantly, since the uroboros vocabulary was obtained by systematic
reduction of a superset of the LDV, it is still guaranteed that every sense of
every word listed in LDOCE (over 82k entries) is definable in terms of
these. Since the defining vocabularies of even larger dictionaries such as
Webster's 3rd \citep{Merriam:1961} are generally included in LDOCE, we have
every reason to believe that the entire vocabulary of English, indeed the
entire vocabulary of any language, is still definable in terms of these 1,200
concepts. 

Unfortunately, such redefinition generally requires more than string
substitution: for example if you substitute `a large body in space that moves
around a star' into `the largest \_\_ of the Sun' you would obtain `the
largest a large body in space that moves around a star of the Sun' and it
takes a great deal of sophistication for the substitution algorithm to realize
that {\it a large} is subsumed by {\it the largest} or that {\it a star} is
instantiated by {\it the Sun}. People perform these operations with ease,
without conscious effort, but for now we lack parsers of the requisite
syntactic and semantic sophistication to do this automatically. Part of our
goal with the strict definition syntax that replaces English syntax on the
right-hand side (rhs) of definitions is to study the mechanisms required by an
automated parser for doing this. 
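
\noindent
The failure of plain string substitution is easy to reproduce. The Python
sketch below uses the two LDOCE definitions quoted above; the dictionary and
the function name {\tt substitute} are hypothetical.

```python
definitions = {
    "planet": "a large body in space that moves around a star",
    "Jupiter": "the largest planet of the Sun",
}


def substitute(definiendum: str, word: str) -> str:
    """Replace `word` inside the definition of `definiendum` by its definition."""
    return definitions[definiendum].replace(word, definitions[word])


naive = substitute("Jupiter", "planet")
print(naive)
```

The printed string is the ungrammatical `the largest a large body in space
that moves around a star of the Sun': nothing in the substitution step knows
that {\it a large} should be absorbed by {\it the largest}.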

\subsection{Morphological prerequisites}\label{morphology}

The LDV contains a few dozen bound morphemes, the suffixes {\it -able -al -an
  -ance -ar -ate -ation -dom -ed -en -ence -er -ery -ess -est -ful -hood -ible -ic
  -ical -ing -ion -ish -ist -ity -ive -ization -ize -less -like -ly -ment
  -ness -or -ous -ry -ship -th -ure -ward -wards -work -y} and the prefixes
{\it counter- dis- en- fore- im- in- ir- mid- mis- non- re- self- un- vice-
  well-}.  These are tremendously useful both in reducing the size of the
defining vocabulary (since {\it eat} and {\it eating} no longer both need to
be listed) and in making the definitions less complicated.

While we obviously cannot cover the entirety of English morphology as part of
4lang, we do not consider the problems raised by bound forms to be
qualitatively different from those raised by lexical semantics in
general. Also, languages are not uniform in where they draw the bound/free
boundary: many concepts that are expressed by affixation in one are expressed
by free forms in another, and dictionary definitions often contain these.

We will illustrate our methods on the suffix {\it -ize}, which means something
like `to cause to become', so {\it Americanize} `cause to become American',
{\it carbonize} `cause to become carbon' and so forth. There are cases that do
not fit this analysis ({\it agonize} doesn't mean `cause to become agony' the
same way {\it colonize} means `cause to become colony') and there are other
subregularities one may wish to consider, but the majority of the 200--300
English words ending in {\it -ize} fit this pattern well enough to consider it
the leading candidate for a semantic definition. What we wish to state is a
lexical rule roughly of the following form: for stem X, stem+ize means `cause
to become (like) X'. Anticipating several notational conventions that will
only be explained subsequently, we could write this as 

\begin{verbatim}
-ize CAUSE {become <like/1701> stem}, "-ize" MARK stem
\end{verbatim}

\noindent 
From here on in examples we keep only the English binding for the definiendum
(the first field in a dictionary record), followed by the definiens (the 8th
field).  Here {\tt CAUSE} is a primitive (part of our eventual uroboros set)
written in CAPS because it is one of the few binary relations we admit as
primitives (primitive unaries and variadic predicates are written lowercase).
The curly braces (see Section~\ref{parens}) denote a single hypergraph node
(pictorially, all formulas will correspond to hypergraphs) and the angled
brackets signify optionality, enclosing the default option (see
Section~\ref{default}). {\tt MARK} is another primitive, standing for the
relation between signifier (a string, given in doublequotes) and the relevant
element to be substituted (see \ref{naivegrammar}), here the node {\tt stem}
which is analogous to the variable X used above.

However, neither {\tt like/1701} nor {\tt become} is a primitive (for the
four-digit disambiguation number following the English binding see
Section~\ref{comma}). {\tt like/1701} `sicut' is defined as {\tt similar} (as
opposed to {\tt like/3382} `amo') and {\tt become} is defined as {\tt
  =AGT[=PAT[after]]} which for now we will paraphrase as `afterwards, agent
IS\_A patient' (thematic roles are discussed in Sections~\ref{agtpat} and
\ref{deepcase}). For something like {\it John caramelized the sugar} this 
would be `John caused the sugar to be <similar to> caramel afterwards'. 
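
\noindent
The disambiguation numbers can be handled mechanically. The following Python
sketch separates a token such as {\tt like/1701} into its English binding and
its concept number; the function name {\tt split\_sense} and the small lookup
table (which records only the Latin bindings mentioned above) are hypothetical.

```python
def split_sense(token: str):
    """Split 'like/1701' into ('like', 1701); 'become' gives ('become', None)."""
    name, sep, number = token.partition("/")
    if sep and number.isdigit():
        return name, int(number)
    return token, None


# Latin bindings of the two senses of `like' mentioned in the text.
latin = {("like", 1701): "sicut", ("like", 3382): "amo"}

sense = split_sense("like/1701")
gloss = latin[sense]
```

Tokens whose slash is not followed by digits are returned unchanged, so an
unnumbered binding is never mistaken for a numbered sense.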

Here, for the sake of readability, we made some concessions to English syntax,
by adding agreement morphology, an article, a copula, and a preposition, but
eventually the reader will get familiar with the syntax of definitions that
lacks all these niceties, and would read {\tt John CAUSE after \{sugar $<$similar$>$
    caramel\}}. 


Since {\tt similar} is not a primitive of the formal language of definitions,
we can take this further by substituting its definition 

\begin{verbatim}
similar =AGT HAS property, =PAT HAS property, "to" MARK =PAT
\end{verbatim}

\noindent
Since named nodes are unique in definitions, what this means is that in the
construction {\it X (is) similar to Y} the agent will have the {\it same}
property as the patient. As expected, the {\tt MARK} relation is
language-specific: for Hungarian we would want to say that the allative case
{\it hoz/hez/h\"{o}z} marks the patient. (4lang currently gives the MARKs only
for English.) 

At this point we can omit the default (since it is a binary relation, this
means substituting {\tt IS\_A}) or we can expand it, to yield 

\begin{verbatim}
-ize CAUSE {after {=PAT HAS property, stem HAS property}}, "-ize" MARK stem
\end{verbatim}

\noindent At this point, all our notions are primitives, including the
metalinguistic placeholder {\tt stem} and the term {\tt property}, which is really
underspecified as to what property it refers to, as befits the definition of
{\it similar} which is underspecified exactly in this respect (compare {\it
  similar consequences} to {\it similar balloons}). {\tt HAS} again is
primitive, the causative element in {\it -ize} is well known
\citep{Lieber:1992,Plag:1998}, and the idea that we define certain verbs by
their result state is standard. Temporal structure can refer to some state
{\tt before} or {\tt after} the event, but this has to be explicitly
stated. Comma-separated linear order, as in {\tt =PAT HAS property, stem HAS
  property} simply means conjunction (see Section~\ref{comma}), and as such 
it is independent of the order of the conjuncts. 
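
\noindent
The order-independence of comma-separated conjuncts can be checked with a few
lines of Python. The sketch below splits a definiens at top-level commas only,
leaving commas inside brackets alone, and compares the results as sets. The
function names are hypothetical, and the sketch assumes that brackets are
balanced and never occur inside quoted strings.

```python
OPEN, CLOSE = "[{<(", "]}>)"


def split_conjuncts(definiens: str):
    """Split at commas that are not nested inside any kind of bracket."""
    parts, current, depth = [], [], 0
    for ch in definiens:
        if ch in OPEN:
            depth += 1
        elif ch in CLOSE:
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def same_conjunction(a: str, b: str) -> bool:
    """Two definientia are equal as conjunctions if their conjunct sets agree."""
    return frozenset(split_conjuncts(a)) == frozenset(split_conjuncts(b))


conjuncts = split_conjuncts("listen, see, =AGT think =PAT[interesting, important]")
```

Here the comma inside {\tt =PAT[interesting, important]} does not start a new
conjunct, so the definiens of {\it attention} has three conjuncts rather than
four.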

In the fourth edition \citep{Bullon:2003} LDOCE defines {\it caramelize} as
`if sugar caramelizes, it becomes brown and hard when it is heated'.  The
first edition of LDOCE \citep{Procter:1978} does not define {\it caramelize}
and has no self-recursion. The self-recursive definitions added to later
editions may be a feature from the perspective of the human language learner,
but they are definitely a bug from the computational perspective.  To parse this
definition would lead us nowhere, since the definiendum is part of the
definiens, and we don't have a theory for finding a minimal fixed point in
{\it if sugar if sugar if sugar \ldots X, X becomes brown and hard when it is
  heated X becomes brown and hard when it is heated \ldots}. What happens when
it's not heated? Is it brown? Will it become brown? Is it hard when it's
caramelized? Or will it become hard only when heated? How about caramelizing
something other than sugar, say onions? This definition says nothing about the
`if not sugar' case, whereas the definition we derived above at least tells us
that if onion is caramelized it will share some properties with caramel.

\subsection{Encyclopedic knowledge}

The first edition of LDOCE \citep{Procter:1978} defines {\it caramel} as
`burnt sugar used for giving food a special taste and colour'. In 4lang this
could be recast as 

\begin{verbatim}
caramel sugar[burnt], CAUSE {food HAVE {taste[special], colour[special], 
<taste[sweet]>, <colour[brown]>}}
\end{verbatim}

\noindent
where quite a bit of the syntax is implicit, such as the fact that {\tt
  caramel} is the subject of {\tt CAUSE}, see Section~\ref{subjobj}, and we
sneaked in some real world knowledge that the special taste is (in the default
case) sweet, and the special color is brown.

As the preceding makes clear, we could track further {\it special} (defined in
4lang as {\tt LACK common}), or {\it food}, or {\it burnt}, or any term, but
here we will concentrate on {\it sugar} `a sweet white or brown substance that
is obtained from plants and used to make food and drinks sweet'. Remarkably,
this definition would also cover xylitol $(CH_2OH(CHOH)_3CH_2OH)$ or stevia
$(C_{20}H_{30}O_3)$ which are used increasingly as replacements for common
household sugar $(C_{12}H_{22}O_{11})$.

This is not to say that the editors should have been aware in 1978 that a few
decades later their definition would no longer be specific enough to
distinguish sugar from other sweeteners. Yet the clause `obtained from
plants' is indicative of awareness about saccharin $(C_7H_5NO_3S)$ which is
also sweet, but is not obtained from plants.  

4lang takes the line that encyclopedic knowledge has no place in the
lexicon. Instead of worrying about how to write clever definitions that will
distinguish sugar not just from saccharin but also from xylitol, stevia, and
whatever new sweeteners the future may bring, it embraces simplicity and
provides definitions like the following:

\begin{verbatim} 
rottweiler dog
greyhound dog
\end{verbatim}

\noindent
This means that we fail to fully characterize the competent adult speaker's
ability to use the word {\it rottweiler} or {\it greyhound}, but this does not
seem to be a critical point of language use, especially as many adult speakers
seem to get along just fine without a detailed knowledge of dog breeds. To
quote \cite{Kornai:2010}:

\begin{quote}
So far we discussed the {\it lexicon}, the repository of linguistic knowledge
about words. Here we must say a few words about the {\it encyclopedia}, the
repository of world knowledge. While our goal is to create a formal theory of
lexical definitions, it must be acknowledged that such definitions can often
elude the grasp of the linguist and slide into a description of world
knowledge of various sorts.  Lexicographic practice acknowledges this fact by
providing, somewhat begrudgingly, little pictures of flora, fauna, or
plumbers' tools. A well-known method of avoiding the shame of publishing a
picture of the yak is to make reference to {\tt Bos grunniens} and thereby
point the dictionary user explicitly to some encyclopedia where better
information can be found. We will collect such pointers in a set {\bf E}
\end{quote}

\noindent
Today, we use Wikipedia for our encyclopedia, and denote pointers to it by a
prefixed @ sign, see Section~\ref{atsign}. Our definitions are 

\begin{verbatim} 
sugar	sweet, IN food, IN drink
sweet	taste, good, pleasant, sugar HAS taste, honey HAS taste
\end{verbatim}

\noindent
Instead of sophisticated scientific taxonomies, 4lang supports a naive
world-view \citep{Hayes:1979,Gordon:2017}. We learn that {\it sugar} is sweet,
and {\it sweet} IS\_A taste -- the system actually makes no distinction
between predicative (is) and attributive (is\_a) usage. We learn that sugar is
to be found in food and drink, but not where exactly.

One place where the naive view is very evident is the treatment of high-level
abstractions. For example, the definition of {\it color} has nothing to do
with photons, frequency ranges in the electromagnetic spectrum, or anything of
the sort -- what we have instead is {\tt sensation, light/739, red IS\_A,
  green IS\_A, blue IS\_A} and when we turn to e.g. {\it red} we find {\tt
  colour, warm, fire HAS colour, blood HAS colour}. Another field where we
support only a naive theory is grammar, see \ref{naivegrammar}.

As with {\it sugar} and {\it sweet}, we posit something approaching a mutual
defining relation between {\it red} and {\it blood}, but this is not entirely
like Titius and Caius sending you further on: actually {\it blood} gets
eliminated early in the uroboros search as we iteratively narrow the defining
set, while {\it red} stays on. Eventually, we have to have some primitives,
and we consider {\it red}, a Stage II color in the \cite{Berlin:1969}
hierarchy, a very reasonable candidate for a cross-linguistic primitive.

So far, we have discussed the fact that separating the encyclopedia from the
lexicon leaves us with a clear class of lexical entries, exemplified so far by
colors and flavors, where the commonly understood meaning is anchored entirely
outside the lexicon.  There are also cases where this anchoring is partial,
such as the suffix {\it -shaped}. The meaning of {\it guitar-shaped, C-shaped,
  U-shaped, \ldots} is clearly compositional, and relies, on the other hand, on
cultural primitives such as {\it guitar, C, U, \ldots} that will remain at
least partially outside the lexicon. According to \citet{Rosch:1975}, lexical
entries may contain pointers to non-verbal material, not just primary
perceptions like color or taste, but also prototypical images.  We can say
that {\it guitar} is a stringed musical instrument, or that $C$ and $U$ are
letters of the alphabet, and this is certainly part of the meaning of these
words, but it is precisely in the image aspect highlighted by {\it -shaped}
that words fail us. Again anticipating notation that we will fully define only
in subsequent sections, we can define {\it guitar-shaped} as {\tt HAS shape,
  guitar HAS shape} and in general 

\begin{verbatim}
-shaped  stem HAS shape, =AGT HAS shape, "_-shaped" MARK stem
\end{verbatim}

\noindent
and leave it to the general unification mechanism we discussed in
\ref{morphology} to guarantee that it is the same shape that the stem and the
subject of the compound adjective will share. 
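
\noindent
Applying such a suffix rule to a concrete word can be sketched in Python as
follows. The function {\tt instantiate} is hypothetical: it strips the suffix,
drops the language-specific {\tt MARK} clause, and replaces the placeholder
{\tt stem} by the actual stem. It assumes a rule body without nested commas
and says nothing about how the shared {\tt shape} node is unified.

```python
import re

RULE = 'stem HAS shape, =AGT HAS shape, "_-shaped" MARK stem'


def instantiate(word: str, suffix: str, rule: str) -> str:
    """Instantiate a suffix rule for `word`, e.g. 'guitar-shaped'."""
    if not word.endswith(suffix):
        raise ValueError(f"{word!r} does not end in {suffix!r}")
    stem = word[: -len(suffix)]
    clauses = [c.strip() for c in rule.split(",")]
    # The MARK clause only ties the signifier to the stem; drop it here.
    kept = [c for c in clauses if " MARK " not in c]
    return ", ".join(re.sub(r"\bstem\b", lambda m: stem, c) for c in kept)


guitar_shaped = instantiate("guitar-shaped", "-shaped", RULE)
print(guitar_shaped)
```

For {\it guitar-shaped} this yields {\tt guitar HAS shape, =AGT HAS shape},
matching the definition given above up to the explicit {\tt =AGT}.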

\subsection{Lexical categories and subcategories}\label{lexcat}

Whether a universal system of lexical categories exists is still a widely
debated question. \cite{Bloomfield:1933}, and more recently
\cite{Kaufman:2009} argued that certain languages like Tagalog have only one
category, but the notion that there are at least three major categories that
are universal, nouns, verbs, and adjectives, has been broadly defended
\citep{Baker:2003,Chung:2012}. 4lang subdivides verbs into two categories,
intransitive U and transitive V; retains the standard N for noun and A for
adjective; and also uses D for aDverb and G for Grammatical formative.

While this rough categorization has proven useful for seeking bindings in the
original 4 and in other languages, there is no theoretical claim associated with
these categories, neither the universal claim that all languages would
manifest these categories (or at least, or at most, these), nor the
(four)language-particular claim that these categories are somehow
necessary/sufficient for capturing the data. In fact, 4lang is a semantic
system, and it says remarkably little about the system of lexical categories
and subcategories, be they defined by morphological or syntactic
cooccurrences. If anything, our findings lend support to the thesis of
\cite{Wierzbicka:2000} that cross-linguistic identification of lexical
categories is to be achieved via prototypes rather than by abstract class
meanings. 

To the extent that none of the six lexical categories U,V,N,A,D,G is ever
referred to by any definition or rule, 4lang holds fast to the autonomy of
syntax thesis. In particular, we refrain from stating the categorial signature
of elements even when it is obvious, e.g. that {\it -ize} is N$\rightarrow$V
(see \cite{Lieber:1992} for the observation that in productive uses the resulting verb must be
transitive), and we feel free to add in English paraphrases formatives such as
{\it be, that, a/an, the, to, -ly \ldots} which serve only to make the English
syntax come out right. For lexicographic completeness, we have entries
e.g. for infinitival {\it to}, but 4lang does not encode any difference
between the meaning of {\it eat} and {\it to eat} (see \ref{naivegrammar} for
details). This is in sharp contrast to locative {\it to}, which we see as
contentful and define as {\tt after(=AGT AT =PAT)}.

The syntax of 4lang definitions countenances only two basic types (lexical
categories in the metalanguage): unaries and binaries, and permits lexical
entries to be ambiguous between these two.  The basic unary type is seen in
most nouns, especially proper names, adjectives, and adverbials, and the basic
binary type is seen in transitive verbs and adpositions.  As an example of the
latter, consider the preposition {\it at}, defined in LDOCE as `used to say
exactly where something or someone is, or where something happens'. Clearly,
{\it at} is a binary relation (we write these in SVO order) x AT y, where y is
strongly subtyped for location, be it spatial or temporal, so strongly that
otherwise unspecified entities like {\it Jim's} have to be typecast to location
if we are to make sense of expressions like {\it We meet at Jim's}. In
contrast, x is left untyped: it could be a physical object, a person, or even
an event. Either way, y provides the origin of the coordinate system where we
anchor x. 4lang has the means to express the selectional restriction on the
second argument (see Section~\ref{linking}), but considers the use of {\it
  where} inappropriate, given that {\it at} has no question component. % (and in
%fact {\it where} will be defined as {\tt AT wh}).
Since other ideas about what
{\it at} means, be they cast in terms of some geometric coordinate system or
in terms of figure/ground, are far too complex to serve as the basis of some
reductive theory, we again bite the bullet and take AT to be a semantic
primitive. This yields the definition

\begin{verbatim}
at AT, =PAT[place/1026], "at" MARK =PAT
\end{verbatim}

\noindent
(In Hungarian, we would have {\tt "n\'al/n\'el" MARK =PAT} -- the MARK-clauses
in column 8 are always specific to English.) Whereas adpositions are generally
hard to decompose in terms of more primitive notions, transitive verbs are
much easier: as a classic example (and to show our indebtedness to the
generative semantic tradition) we provide

\begin{verbatim}
kill =AGT CAUSE {=PAT[die]} 
\end{verbatim}

\noindent
In the metalanguage of the definition syntax there are only three binary
relations: {\it subject}, depicted in graphs by an arrow labeled by `1' and
pointing toward the subject; {\it object}, depicted in graphs by an arrow
labeled by `2' and pointing toward the object (see Section~\ref{subjobj}); and
{\it is/is\_a}, depicted in graphs by an arrow labeled `0' running from
subclass to superclass (see Section~\ref{isa}). These are not to be confused
with binary semantic relations such as {\tt AT, CAUSE}, or {\tt kill}, of which
there are a handful of primitives and thousands of derived ones, see \ref{rel}.

Two important semantic primitives worth mentioning here are {\tt gen} and {\tt
  wh}.  {\tt gen} is a generic quantifier-like element that is neutral between
{\it somebody/something} and {\it anybody/anything}. In a system of formal
logic {\tt gen} would be just a variable-binding term operator, without
universal or existential import.  {\tt wh}, also a VBTO, provides the
semantics of the interrogative morpheme. Note that both {\tt gen} and {\tt wh}
are unaries, and have no scope. 

Unaries can serve both as predicates and as arguments.  Altogether, the
metalanguage is rather loosely typed, in that anything can serve as an
argument (when an argument slot is filled by something complex, this complex
formula is surrounded by $\{ \}$ for the sake of clarity, see
Section~\ref{parens}) and neither argument of a binary needs to be filled
obligatorily. 

Another pair of unaries, {\tt after} and {\tt before}, constitute all that
4lang currently offers in the way of temporal semantics. These refer to the state 
after (resp. before) the event (verb) they characterize: think of these as
the final and initial stills from a short movie depicting the event. So we have 

\begin{verbatim}
die after(dead)
dead LACK live, before(live)
\end{verbatim}

\noindent
These are not to be confused with {\tt AFTER} and {\tt BEFORE}, which are the
usual temporal primitives (by duality, only one of them needs to be left
undefined) with two arguments, e.g. {\tt Tuesday BEFORE Wednesday}.


\subsection{Relations}\label{rel}

In most systems of semantic representation there is something of a squish
between categories of the language and categories of the metalanguage. Our
goal is to avoid this entirely, but of course there exist important metalanguage
categories that are modeled on linguistic concepts, often bearing the same
name, so a great deal of caution is called for. In \ref{lexcat} we discussed
the lexical categories of the language (N, V, A, \ldots) and mentioned that the
metalanguage has only two, unaries and binaries, with most elements belonging
to both of these. We could stretch the analogy, and consider the comma and the
various parentheses used in the metalanguage as G (grammatical formatives) but
the central ideas of natural language syntax are not very helpful in
describing the metalanguage, so we will not pursue this here.

Here we discuss those elements that are always binary, and to avoid confusing
these with the contentful (linguistic) elements, we call them {\it relations}
(in the ordinary mathematical sense `subset of direct product'). We have argued
elsewhere \citep{Kornai:2012} that linguistic analysis doesn't require ternary
or higher arity relations, and here we will see that the metalanguage uses
binaries only. The central binaries we rely on are `0', `1', and `2'. Of
these, `1' and `2' are the familiar grammatical functions {\it subject} and
{\it object} respectively, while 0 `being' is used indiscriminately for {\it
  is} and IS\_A. For legibility of formulas, a 0 arrow (binary relation) from
{\tt b} to {\tt a} can be written as {\tt b[a]} or as {\tt a(b)}. Both
correspond to Plato's notion of the subject `partaking' or `participating in'
the predicate. We maintain the Aristotelian distinction between accidental and
essential participation, using only essential properties in definitions (but
not in the semantic representation of more complex expressions).
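
\noindent
One possible way to store such labeled arrows is as a list of triples. In the
Python sketch below the type {\tt Edge} and the helpers {\tt zero} and
{\tt binary} are hypothetical. The sketch also assumes that the `1' and `2'
arrows start at the relation node, which the text implies only by saying that
they point toward the subject and the object.

```python
from typing import List, NamedTuple, Optional


class Edge(NamedTuple):
    source: str
    label: int   # 0 = is/is_a, 1 = subject, 2 = object
    target: str


def zero(b: str, a: str) -> List[Edge]:
    """b[a], equivalently a(b): a 0 arrow running from b to a."""
    return [Edge(b, 0, a)]


def binary(relation: str,
           subject: Optional[str] = None,
           obj: Optional[str] = None) -> List[Edge]:
    """x REL y: a 1 arrow toward the subject, a 2 arrow toward the object.

    Either argument may be left unfilled.
    """
    edges = []
    if subject is not None:
        edges.append(Edge(relation, 1, subject))
    if obj is not None:
        edges.append(Edge(relation, 2, obj))
    return edges


graph = zero("sugar", "burnt") + binary("BEFORE", "Tuesday", "Wednesday")
```

The list {\tt graph} then holds one 0 arrow for {\tt sugar[burnt]} and a pair
of 1 and 2 arrows for {\tt Tuesday BEFORE Wednesday}.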

There are no type restrictions on what can be connected with what by means of
0, 1, and 2 relations, except that relations cannot appear as subjects or
objects (grammatical functions are discussed further in \ref{linking}). They
can, however, appear connected by IS\_A, since IS\_A is not conceptualized as
a relation in the metalanguage, only `0', a supercategory of `participation'
and `inherence' is. In fact, IS\_A is treated as epiphenomenal, {\tt x IS\_A
  y} being equivalent to `x has all the essential properties of y and perhaps
some others as well'. In other words, the subsumption hierarchy can be deduced
from a model, but it is rarely overtly marked, except for abbreviatory
purposes.

Besides IS\_A, there are some other link types conspicuously missing from
4lang. Systems of Knowledge Representation (KR) such as Cyc \citep{Lenat:1990}
often insist on finer distinctions than 4lang e.g. by distinguishing an
individual poet from the class Poet. This makes an individual such as Allen
Ginsberg an {\sc InstanceOf} the Poet class, and Cyc actually demands that a
distinction be made between this and {\sc SubsetOf}, as between {\sc Stuff} and
{\sc StuffType}, and so on. In 4lang, ``we make purposely very little
distinction between an individual fox, the species Vulpes vulpes, the set of
foxes in the world, or the class of potential foxes in all possible worlds''
\citep{Kornai:2018}, and treat {\it gold} as a unary. (Such unaries often have
a binary sense as well: English is lucky to have {\it gild} as a separate verbal
base, but the unary-binary conversion is very general, cf.\ {\it Babe Ruth
  homered his way into the hearts of America} \citep{Jackendoff:1990}.)

Other link types commonly seen include prepositions with clear spatial meaning
such as IN; FROM; OUT; BETWEEN; AT; FOLLOW; ON; and ABOUT. Many of these
govern cases in certain languages, and many become case markers themselves in
others, and we see this as sufficient reason to treat them in the metalanguage
as ordinary binaries with their own 1st and 2nd arguments (denoted by `1' and
`2' as with transitive verbs). The capitalization is an admission of the fact
that currently 4lang is not endowed with a sufficiently general theory of
spatiotemporal relations to settle on these as true primitives -- it is
getting close, but it is not there yet. In many cases, such as {\tt under}, we
felt comfortable that the 4lang analysis captured the semantics sufficiently
well that capitalization has been dropped.

In those relations where the printname is not indicative of argument order, we
settled on one variant. For example we use {\tt x PART\_OF y} in preference to
{\tt y HAS\_PART x}, but in fact {\tt part} would work just as well -- the
fact that it governs {\it of} in this sense is again indicative of some
near-adposition near-case status. The system could be more consistent: the
other major near-case, INSTRUMENT, is used in order {\tt x HAS\_INSTRUMENT y}
rather than {\tt y INSTRUMENT\_OF x}. It is not clear what sort of facts could
be used to demonstrate that one argument order is better than the other.

Remarkably, only one of these relations lacks an easily expressible obverse:
for {\tt x IS\_ABOUT y} there is no relation REL expressing {\tt y REL x} in
approximately the same situations -- this may point at some irreducibly
nonspatial sense. The remaining relations that we currently leave unanalyzed
include purposive FOR (no obvious obverse, but no obvious spatial meaning
either); possessive HAS; and causative CAUSE; these three appear frequently in
definitions, and for the last two the passive works reasonably well to
express the obverse. This is not logically strict: {\it metal fatigue causes
  deadly accidents} is not strictly equivalent to {\it deadly accidents are
  caused by metal fatigue}, but the cause/effect relation is still clear from
the passive. (For scope effects, and for the dyadic negation relation LACK see
\cite{Kornai:2020a}.) We also keep as primitive the abstract comparative ER
`>', which also governs some case/adposition in many languages.

With this, our list of relations is complete, except for one element, MARK,
which we defer to \ref{naivegrammar}. Completeness, we should emphasize here,
is not the same as finality: it may well be possible to reduce the list
further, or to shift some of the basis from unary/binary (the auxiliary-style
analysis currently used) to purely binary, because relational treatment may be
warranted for modals such as desideratives or imperatives, and even for
standard lexical redundancy rules such as reflexivization, causativization or
locative inversion. This is obviously a prime testing ground for universal
grammar, but a systematic treatment will require a better cross-linguistic
understanding of relational elements than the author can lay claim to.

There are other important issues that we cannot pursue here, such as
collective subjects \citep{Scha:1981}. In general, we find verbs where the
subject and the object are well differentiated: {\it water quenches fire} but
not {\it *water and fire quench}. Cases of free alternation are rare ({\it
  John marries Sandra, Sandra and John marry}), and cases where the subject
must be the collective are even more so: {\it John and Peter are brothers,
  *John brothers Peter, Marseille is between Nice and Montpellier, *Marseille
  betweens Nice and Montpellier}. Remarkably, in the defining vocabulary we
see a disproportionate number of predicates that make only, or the most, sense
when one of the arguments is collective: {\it close, between, through,
  \ldots}.


\subsection{Linking}\label{linking}

In terms of the amount of fully analyzed text available, Universal
Dependencies (UD) is the single most influential cross-linguistic framework of
grammatical description \citep{Nivre:2018s}. While many other schools of
grammatical description offer a broader variety of analyses, these (with the
possible exception of tagmemics) rarely extend to a broad selection of
languages, the dominant style of linguistic analysis being the in-depth study
of a restricted range of syntactic phenomena (ideally across many
typologically diverse languages, but quite often restricted to a single
language) rather than the in-breadth analysis of an entire language. Here we
assume the reader is familiar with UD, and compare 4lang to UD, pointing at
other frameworks only in a few places. Generally, 4lang is on the sparse or
`lumping' side of the comparison, not just in relation to UD, but also in
relation to other well-developed theories like LFG, HPSG, or MP.

As far as {\bf grammatical functions} are concerned, we only assume two,
subject and object. As argued in \citet{Kornai:2012}, ditransitives and higher
arity predicates are unnecessary for semantic purposes. Since 4lang doesn't
have an indirect object (UD {\tt iobj}) function, ditransitives are always
modeled by decomposition:

\begin{verbatim}
give =AGT CAUSE[recipient HAS =PAT], "dative" MARK recipient
buy =AGT receive =PAT, =AGT pay seller, "from" MARK seller
sell =AGT CAUSE[buyer HAS =PAT], buyer CAUSE[=AGT HAS <money>], 
     "dative" MARK buyer
\end{verbatim}

\noindent
Since UD distinguishes dependency links by the category of the head and the
dependent, it naturally keeps notions like {\tt nsubj} and {\tt csubj}
(nominal and clausal subjects) separate, and similarly for {\tt obj} and {\tt
  ccomp}. 4lang, with its roots in the theory of Knowledge Representation,
where the proliferation of link types has emerged as a significant problem
early on \citep{Woods:1975}, admits only one other link type, `0' (subject
links are `1' and object links `2'), which subsumes most of the other link
types used in UD, such as {\tt amod, appos, nummod} and {\tt advmod}.  In a
strictly link-based system such as UD it is a practical necessity to have a
separate link type for coordination: in 4lang we just use comma-separated
concatenation (see Section~\ref{comma}). 

Aside from attribution and predication, which are both denoted by a `0' link,
two cardinal links, `1' and `2', are used for all binaries, including those
marked in many grammatical systems by {\bf deep cases} such as {\tt
  INSTRUMENT}. 

\begin{verbatim}
cheque paper, write ON, amount ON, signature ON, pay/812 INSTRUMENT
say communicate, INSTRUMENT sound/993, recipient hear sound/993, 
    "dative" MARK recipient
\end{verbatim}

What these definitions mean (by virtue of the convention spelled out in
Section~\ref{subjobj}) is that cheques are instruments in paying, and sounds
are instruments of communicating. INSTRUMENT, as a binary relation, has a
subject (what has the instrument) and an object (what is the instrument). In
effect, a single link type, which could be labeled `INS' is replaced by an
INSTRUMENT node that has the same two arrows as are used for other binaries,
subject and object. Whether other deep cases are called for is not clear from
a lexical semantics perspective, but (broadly speaking) subjects are agents;
objects are patients; AT, ON, and IN are locatives; FROM is source (rather than
ablative); and FOR is goal (purposive, rather than end of motion). 4lang
compromises on dative, which is treated as a {\bf surface case} rather than as
a deep case `recipient' or as an adposition (though in English it is one).

%\bigskip\noindent
%\begin{tabular}{llll}
%Agent & {\it kart\d{r}} & the independent one & (1.4.54)\\
%Goal & {\it karman} & what is primarily desired by the agent & (1.4.49)\\
%Recipient & {\it sa\d{m}prad\={a}na} & the one in view when giving &
%(1.4.32)\\
%Instrument & {\it kara\d{n}a} & the most effective means & (1.4.42)\\
%Locative & {\it adhikara\d{n}a} & the locus & (1.4.45)\\
%Source & {\it ap\={a}d\={a}na} & the fixed point that movement is away from
%  & (1.4.24)\\
%\end{tabular}


\bigskip
Finally, we consider {\bf thematic roles}, of which we have exactly two, {\tt
  =AGT} and {\tt =PAT}. These are constitutive elements in definitions of
binaries where the subject or object needs to be named. Normally, it is the
definiendum that appears in subject or object position of a defining clause:

\begin{verbatim}
soil ground, plant grow IN
fault CAUSE problem
\end{verbatim}

\noindent
but time and again we need to say something about the subject or object of the
definiendum itself: 

\begin{verbatim}
protest show [=AGT think [=PAT[wrong]]]
\end{verbatim}

\noindent
{\it x protests y} means that x shows that x thinks that y is wrong. The
thematic roles simply reify the protester (=AGT) and the thing protested
(=PAT). Agents and patients of definiendum and definiens are automatically
shared, so it is really {\tt =AGT protest =PAT} that is being defined by {\tt
  =AGT show [=AGT think [=PAT[wrong]]]}. For ease of human reading,
redundancies like this are suppressed, but the parser supplies =AGT and =PAT
automatically.

\subsection{Naive grammar}\label{naivegrammar}

The central subject of this paper, the {\it metalanguage} we use to describe
the semantics, has its own syntax, which drives the parser {\tt
  def\_ply\_parser.py}. But before we can turn to this in \ref{8thcol}, we need
to emphasize that 4lang is a formal system in its own right, not intended as a
proposal about natural language syntax (the parser will parse the definitions,
not natural language), and discuss some cases where considerations of syntax
nevertheless creep in.

Since the issue is central to the development of generative grammar, we should
make clear here that our position is not intended as an argument for, or
against, the autonomy of syntax thesis. As a research strategy, we prefer a
semantic formalism that is as autonomous as feasible, since this promotes
modularity not just in the sense of \cite{Fodor:1983}, but also in the sense
of enabling independent experimentation and research for both syntacticians
and semanticists. We do not feel qualified to take sides in the debate, but if
those who believe only in a limited autonomy of syntax are to mount arguments
capable of convincing the opposing side, these arguments need to be cast in
terms of the inadequacy of well-modularized systems, so even for those
refusing to entertain full modularity the first order of business is to look
at modular architectures.

Our main contribution to this area is that we only make reference to a {\it
  naive} theory of grammar, just as we see the need to link to naive
probability \citep{Gyenis:2019} and naive planning \citep{Gordon:2017}, and
believe that many of the issues discussed in \ref{rel} would be considerably
simplified by reference to the appropriate naive theory of space and time.
The fundamental elements (primitives) of naive grammar are {\it words}. We
don't go anywhere near the issues of how a word is, or should be, defined in
phonology, morphology, orthography, syntax, semantics, or lexicography (though
we assume that the reader is somewhat familiar with the main proposals). For
our purposes {\it word} is defined as {\tt sign, speech}, and {\it sign} as
{\tt gen perceive, information, show, HAS meaning}.

We have seen in \ref{lexcat} the use of the generic pronoun {\tt gen}, and
could detail the naive theory implicit in our system of definitions, but the
system of these is not intended as the final word on this subject, and would
not bring us much closer to the focal point of the lexical semantics/naive
grammar interface, which we take to be the (primitive) relation {\tt MARK}.

As used in 4lang, {\tt MARK} is the relation connecting form and meaning. This
corresponds well to the Saussurean notion of the sign, and will be sufficient
for our purposes, even though a more sophisticated theory of signs
\citep{Kracht:2011} is available for the non-naive theory of grammar. Our main
use of {\tt MARK} is with function words (including bound grammatical formatives) as
in

\begin{verbatim}
-ing    stem-ing IS_A event, "_-ing" MARK stem
\end{verbatim}

Operationally, whatever precedes the formative {\tt ing} is considered a stem,
and the whole form {\tt stem+ing} is considered an event. There is clearly a
great deal more that could be said about {\it -ing} suffixation, the notion of
stems, the classification of junctures, or the conceptual classification of
certain matters as events, but we make no apologies for not developing these
notions as part of naive grammar, especially as {\tt MARK} is used only in
0.5\% of the vocabulary, and often with other similarly undeveloped notions of
naive grammar such as cases. As an example, consider {\it appear} `pareo'
defined by {\tt after[=AGT AT location], "locative" MARK location} as in
{\it A deer appeared in the garden}.

What can/should be considered a locative case in a given language, and what
can be considered a location are weighty questions, well beyond the naive
theory of grammar that we rely on here. On the whole, our approach only
provides a lower bound: to the extent anybody wishes to engage in a fuller
analysis of the vocabulary, they need to introduce some terminology that takes
them outside the bounds of the naive theory. In this regard, the system is
full of promissory notes: we define {\it to/3600} by {\tt "to/3600 \_" MARK
  infinitive}, as in {\it John went to eat}, and {\it to/12} as {\tt
  after(=AGT AT =PAT)}, as in {\it John went to Chicago}. Here we must
leave notions like {\it infinitive} or agent/patient undefined (we have seen
in \ref{linking} some examples of their intended use).

In this paper we do not take on board the issue of expanding this to a fuller
theory of (naive) grammar, but as we see, a non-eliminable grammatical
core emerges from semantic definitions, and these offer a rich interface for
connecting 4lang to issues that modern grammar (in this case, starting with
\cite{Fillmore:1968}) has much to say on. For example, we will need a theory
that connects intransitives to transitives, as in {\it The fire spread} and
{\it The wind spread the fire}. In English, it is obvious that there is some
relation between these two verbs, but in Latin there is no obvious reason to
relate {\it distendo} and {\it sterno}. In Hungarian, the stems are derived
from the same root by productive suffixation, so we have {\it ter-\"ul} and
{\it ter-\'{\i}t}. 4lang brings out the similarity by taking the intransitive
form as basic, {\tt after(=AGT AT wide)}, and deriving the transitive effect
as one of causation: {\tt =AGT CAUSE \{=PAT spread\}}.

We expressly disavow any idea of the naive grammar being the ultimate grammar,
or even the metalanguage being the ultimate metalanguage. The system is
designed to support one thing, and one thing only, natural language
semantics. There are many other semiotic systems from music to mathematics
that would have very different semantics, and 4lang is simply not equipped to
deal with these. Also, experience shows that naive theories are superseded by
more sophisticated ones for a reason, as the sophisticated theories are simply
better. But they often rely on key components, such as arithmetic, or the
analytic theory of continuous variables, that are out of scope for 4lang,
expressly designed to deal with ordinary (as opposed to technical or
scientific) language.

\section{The syntax of definitions}\label{8thcol}

\subsection{Coordination}\label{comma}

A 4lang definition always contains one or more (hypergraph) nodes, of which
one is distinguished as the {\it head} (related to, but not exactly the same
as the {\it root} in dependency graphs). All these are interpreted as graph
edges with label 0 running from the definiendum to the definiens.  The simplest
definitions are therefore of the form x, where x is a single node. Example
(all examples are taken from 4lang/Reform):

\noindent
{\tt aim   purpose}

\noindent
that is, the word {\it aim} is defined as {\it purpose}. Somewhat more complex
definitions are given by a comma-separated list. Here the head is always the
first element. Examples:

\begin{verbatim}
board   artefact, long, flat    
boat    ship, small, open/1814  
\end{verbatim}

The number following the `/', if present, serves to disambiguate among various
definitions, in this case adjectival {\it open} `apertus' from verbal
{\it open} `aperio'. These numbers are in column 5 of the 4lang file. 
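
How such simple entries might be read can be sketched in a few lines of
Python. This is not the parser used for 4lang: the function names {\tt
  parse\_atom} and {\tt parse\_entry} are hypothetical, and only definitions
made up of plain or numbered atoms are handled (no binaries, no brackets).

\begin{verbatim}
def parse_atom(text):
    """Split 'open/1814' into ('open', 1814); plain atoms get None."""
    name, sep, number = text.partition("/")
    return (name, int(number) if sep else None)

def parse_entry(line):
    """Return (definiendum, head, remaining clauses) for one entry."""
    definiendum, definiens = line.split(None, 1)
    clauses = [parse_atom(c.strip()) for c in definiens.split(",")]
    # By convention the head is always the first element of the list.
    return definiendum, clauses[0], clauses[1:]
\end{verbatim}

\noindent
For the {\it boat} entry above this yields the head {\tt ship} and two further
clauses, the second carrying the disambiguating number 1814.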

\subsection{External pointers}\label{atsign}

Sometimes (here 42 cases in 1,200) a concept doesn't fully belong in the
lexicon, but rather in the encyclopedia. In the formal language defined here,
such {\it external pointers} are marked by a prefixed @. Examples:

\begin{verbatim}
Africa	land, @Africa	
London	city, @London	
Muhammad  man/744, @Muhammad	
U letter/278, @U
\end{verbatim}

\subsection{Subjects and objects}\label{subjobj}

In addition to 0 links, definitions often explain the definiendum in terms of
it being the subject or object of some binary relation. When not defined by
some more primitive term, such binary relations are given in CAPS.  For
example:

\begin{verbatim}
April   month, FOLLOW march/1563, may/1560 FOLLOW
bank    institution, money IN
\end{verbatim}

\noindent
The intended graph for April will have a 0 link from the head to month, a 1
link to march/1563 and a 2 link to may/1560. Often, what is on the other side
of the binary is unspecified, in which case the open slot is ``plugged up''
with the {\tt gen} symbol. Examples:

\begin{verbatim}
vegetable  plant, gen EAT
sign  information, gen PERCEIVE, show, HAS meaning
\end{verbatim}

\noindent
Thus, {\it vegetable} is a plant that someone (not specified who) can eat (it
is the object of eating, subject unspecified), and {\it sign} is\_a
information, is the object of perception, is\_a show (nominal, something that
is or can be shown) and has meaning.
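
The convention can be made concrete with a minimal Python sketch. It adopts
one possible encoding, in which each binary is a node sending a 1 edge to its
subject and a 2 edge to its object (as described for INSTRUMENT in
Section~\ref{linking}); the function name {\tt clause\_to\_edges} is
hypothetical, binaries are assumed to be exactly the all-caps tokens, and only
clauses of one or two tokens are covered.

\begin{verbatim}
def clause_to_edges(definiendum, clause):
    """Turn one simple clause into (source, label, target) edges.

    Assumption: binaries are exactly the all-caps tokens, and each
    binary node sends a 1 edge to its subject, a 2 edge to its object.
    """
    tokens = clause.split()
    if len(tokens) == 1:
        # 'month': plain 0 link from the definiendum
        return [(definiendum, 0, tokens[0])]
    first, second = tokens
    if first.isupper():
        # 'FOLLOW march/1563': the definiendum is the subject
        return [(first, 1, definiendum), (first, 2, second)]
    # 'may/1560 FOLLOW': the definiendum is the object
    return [(second, 1, first), (second, 2, definiendum)]
\end{verbatim}

\noindent
Applied to {\tt gen EAT} in the entry for {\it vegetable}, the sketch makes
{\tt gen} the subject and {\tt vegetable} the object of {\tt EAT}.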

\subsection{Direct predication}\label{isa}

In a formula, {\tt A[B]} means that there is a 0 link from A to B. This is used
only to make the notation more compact. The notation {\tt B(A)} means the same
thing; it is also just syntactic sugar. Both brackets and parentheses can
contain full subgraphs.

\begin{verbatim}
tree plant, HAS material[wood], HAS trunk/2759, HAS crown 
\end{verbatim}

\noindent
That trees also have roots is not part of the definition, not because it is
inessential, but because trees are defined as plants, and plants all have
roots, so the property of having roots will be inherited.
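
Inheritance of this kind can be sketched in Python as follows. The entry for
{\it plant} used here (a single {\tt HAS root} clause) is an assumption made
for the illustration rather than a quotation from 4lang, and the names {\tt
  ISA}, {\tt HAS} and {\tt inherited\_has} are hypothetical.

\begin{verbatim}
# Toy fragment: the entry for plant is assumed, not quoted from 4lang.
ISA = {"tree": ["plant"], "plant": []}
HAS = {
    "tree": ["material[wood]", "trunk/2759", "crown"],
    "plant": ["root"],
}

def inherited_has(word):
    """HAS clauses of word plus those of everything it IS_A."""
    found = list(HAS[word])
    for parent in ISA[word]:
        found += inherited_has(parent)
    return found
\end{verbatim}

\noindent
Here {\tt root} is absent from the entry for {\it tree} itself but is returned
by {\tt inherited\_has("tree")}.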

\subsection{Defaults}\label{default}

In principle, all definitional elements are strict (can be defeated only under
exceptional circumstances) but time and again we find it expedient to collapse 
strongly related entries by means of defaults that appear in angled brackets. 

\begin{verbatim}
ride travel, =AGT ON <horse>, INSTRUMENT <horse>
\end{verbatim}

\noindent
These days, a more generalized {\it ride} is common ({\it riding the bus,
  catching a ride, \ldots}), so the definition {\tt travel} should be sufficient
as is. The historically prevalent mode of traveling, on horseback, is kept as
a default. Note that these two entries often get translated by different
words: for example, Hungarian distinguishes {\it utazik} `travel' and {\it
  lovagol} `ride a horse', a verb that cannot appear with an object or
instrument the same way as English `ride a bike' can.
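
One way to think of defaults operationally is as slots that keep their
bracketed value unless something more specific is supplied. The Python sketch
below follows this reading, which is an assumption and not a description of
how 4lang processes defaults; the names {\tt DEFAULT} and {\tt resolve} are
hypothetical.

\begin{verbatim}
import re

# A default is any material enclosed in angled brackets.
DEFAULT = re.compile(r"<([^<>]+)>")

def resolve(definiens, override=None):
    """Fill each <default> slot with override if given, else keep
    the default value itself."""
    return DEFAULT.sub(lambda m: override or m.group(1), definiens)

RIDE = "travel, =AGT ON <horse>, INSTRUMENT <horse>"
\end{verbatim}

\noindent
With no override, {\tt resolve(RIDE)} gives the horseback reading, and {\tt
  resolve(RIDE, "bus")} gives the reading appropriate for {\it riding the
  bus}.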

\subsection{Agents, patients}\label{agtpat}

The relationship between horseback riding (which is, as exemplified in
Section~\ref{default} above, just a form of travelling) and its defining
element, the horse, is indirect. The horse is neither the subject nor the
object of travel.  Rather, it is the rider who is the subject of the
definiendum and the definiens alike, corresponding to a graph node that has a
1 arrow leading to it from both. This node is labelled by {\tt =AGT}, so when
we wish to express the semantic fact that Hungarian {\it lovagol} means travel
on a horse we write

\begin{verbatim}
lovagol travel, =AGT ON horse
\end{verbatim} 

\noindent
Note that the horse is not optional for this verb in Hungarian: it is
syntactically forbidden ({\it lovagol} is intransitive) and semantically
obligatory. (Morphologically it is already expressed, as the verb is derived
from the stem {\it l\'o} `horse' though this derivation is not by productive
suffixation.) Remarkably, when the object is\_a horse (e.g. a colt is a young
horse, or a specific horse like Kincsem) we can still use {\it lovagol} as in
{\it J\'anos a csik\'ot lovagolta meg} or {\it Elijah Madden Kincsemet
  lovagolta}.

For the patient role, consider the word {\it know}, defined as `has
information about'. For this to work, the expression {\tt x know y} has to be
equivalent to {\tt x HAS information ABOUT y}. To achieve this, we need to
express the fact that the subject of HAS is the same as the subject of {\it
  know} (this is done by the {\tt =AGT} placeholder) and that the object of
ABOUT is the same as the object of knowing -- this will be done by the {\tt
  =PAT} placeholder. 
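
The role of the two placeholders can be illustrated by a small Python sketch
in which a definiens is instantiated by plain string substitution. The string
{\tt KNOW} is simply the paraphrase above with the placeholders written in,
not the entry as it stands in the 4lang file, and the function name {\tt
  expand} is hypothetical.

\begin{verbatim}
# Paraphrase of 'know' with placeholders; not the literal 4lang entry.
KNOW = "=AGT HAS information ABOUT =PAT"

def expand(definiens, agent, patient):
    """Instantiate the =AGT and =PAT placeholders of a definiens."""
    return definiens.replace("=AGT", agent).replace("=PAT", patient)
\end{verbatim}

\noindent
Thus {\tt expand(KNOW, "x", "y")} returns {\tt x HAS information ABOUT y}, the
intended equivalent of {\tt x know y}.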

\subsection{Deep cases}\label{deepcase}

Of the 1,200 initially unreduced primitives, about 15\% each refer to the major
thematic roles {\tt =AGT} and {\tt =PAT} discussed above. The treatment of the
others (altogether less than 10\%) is discussed here. \citet{Makrai:2014} used
several thematic role-like constructs (numbers in the table give their
occurrence frequencies in the 1,200 set):

\begin{tabular}{rr}
name & freq\\
\hline
=AGT & 178\\
=PAT & 174\\
=REL & 34\\
=POS & 26\\
=TO & 19\\
=DAT & 19\\
=FROM & 7\\
=OBL & 5\\
=FOR & 1\\
\end{tabular}

Of these, only {\tt =AGT} and {\tt =PAT} are still in use, all other thematic
role-like elements, be they deep cases, surface cases, or prepositions, have
been redefined to rely only on these two, see Reform/notes.
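
The percentages quoted at the beginning of this section can be rechecked from
the table with a few lines of Python, on the assumption that each frequency
counts entries among the 1,200. The names {\tt FREQ}, {\tt TOTAL} and {\tt
  share} are hypothetical.

\begin{verbatim}
# Frequencies copied from the table above.
FREQ = {"=AGT": 178, "=PAT": 174, "=REL": 34, "=POS": 26, "=TO": 19,
        "=DAT": 19, "=FROM": 7, "=OBL": 5, "=FOR": 1}
TOTAL = 1200  # assumed size of the defining set

def share(names):
    """Percentage of the defining set covered by the given roles."""
    return 100 * sum(FREQ[n] for n in names) / TOTAL

others = [n for n in FREQ if n not in ("=AGT", "=PAT")]
\end{verbatim}

\noindent
On this count {\tt =AGT} and {\tt =PAT} each come to a little under 15\%, and
{\tt share(others)} is 9.25.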

\subsection{More complex notation}\label{parens}

When using [] or (), both can contain not just single nodes but entire
subgraphs. For subgraphs we also use \{ \}. This will have to be rethought in
terms of a consistent IRTG-friendly version of hypergraphs. Example:

\begin{verbatim}
stock document, company HAS,
      {person HAS stock} prove {person HAS PART_OF company}
\end{verbatim}

\noindent
`stocks are documents that companies have; if a person has stock, it proves
that the person owns a part of the company'.
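
Once brackets may enclose whole subgraphs, a definiens can no longer be split
on every comma. The Python sketch below shows one way of separating the
top-level clauses while leaving bracketed material intact. It is an
illustration only, not the 4lang parser: the function name {\tt
  split\_top\_level} is hypothetical, and the sketch does not check that
opening and closing brackets match in kind.

\begin{verbatim}
OPEN, CLOSE = "[({", "])}"

def split_top_level(definiens):
    """Split on commas that are not nested inside [], () or {}."""
    parts, depth, current = [], 0, ""
    for ch in definiens:
        if ch in OPEN:
            depth += 1
        elif ch in CLOSE:
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    parts.append(current.strip())
    return parts
\end{verbatim}

\noindent
For the {\it stock} entry this returns three clauses, the last one being the
whole {\tt prove} statement with both of its curly-bracketed arguments.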


\section*{Appendix: the computational substratum}

One critical issue not discussed in the foregoing is the computational
substratum, the kind of abstract machine this calculus is implemented on. This
is different from the formula parser, written by Adam Kovacs in Python,
obviously a Turing-complete language.

The individual components of the abstract machine are the morphemes and words,
conceptualized as small, in isolation rather limited finite-state devices
loosely coupled in an IS\_A network. This network is a DAG but not necessarily a tree: 
undirected cycles are common, as in the classic Nixon diamond
\citep{Reiter:1983}. Edges of this network are labeled 0. There are two other
networks, with edges labeled 1 and 2. In these, no undirected or directed 
cycles have been found, but confluences (directed edges originating in
different nodes but terminating in the same node) are not rare. 
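
These structural properties are easy to test mechanically. The Python sketch
below checks a set of directed edges for directed cycles and lists
confluences. The function names are hypothetical, and the Nixon diamond is
given in a simplified form that ignores the polarity of the links
(Republicans are typically {\it not} pacifists), since only the shape of the
graph matters here.

\begin{verbatim}
from collections import Counter

def is_dag(edges):
    """True if the directed graph given as (src, dst) pairs is acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state = {}  # node -> "open" (on the current path) or "done"

    def visit(node):
        if state.get(node) == "done":
            return True
        if state.get(node) == "open":  # back edge: directed cycle
            return False
        state[node] = "open"
        ok = all(visit(n) for n in graph[node])
        state[node] = "done"
        return ok

    return all(visit(n) for n in graph)

def confluences(edges):
    """Nodes in which more than one directed edge terminates."""
    indegree = Counter(dst for _, dst in edges)
    return sorted(n for n, k in indegree.items() if k > 1)

# Simplified Nixon diamond: an undirected cycle, no directed cycle.
NIXON = [("Nixon", "Quaker"), ("Nixon", "Republican"),
         ("Quaker", "pacifist"), ("Republican", "pacifist")]
\end{verbatim}

\noindent
The diamond passes the {\tt is\_dag} test even though it contains an
undirected cycle, and its only confluence is the node {\tt pacifist}.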

This is not to say that the elementary components (nodes) are devoid of
non-linguistic content: they may refer to (have pointers pointing to) all
kinds of encyclopedic (verbal) knowledge as well as non-verbal memory (sounds,
images, smell) and further, activation of such may bring activation of the
nodes (so these pointer links are often bidirectional, or better yet,
directionless). 

The elementary operations the nodes are capable of performing involve (i)
activation, to various degrees, of themselves and adjacent edges; (ii) copying
of themselves (triggered by the keyword {\tt other}); and (iii) unification of
nodes under certain conditions (here the element {\tt gen} is
distinguished, as it is capable of unification with anything).
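
The special status of {\tt gen} in unification can be shown in a minimal
Python sketch. The conditions under which ordinary nodes unify are not spelled
out here, so the sketch assumes the simplest case, identity of labels; the
function name {\tt unify} is hypothetical.

\begin{verbatim}
def unify(a, b):
    """Return the unified label, or None if a and b do not unify.

    Only identity of labels and the wildcard gen are modelled."""
    if a == "gen":
        return b
    if b == "gen" or a == b:
        return a
    return None
\end{verbatim}

\noindent
Under this assumption {\tt gen} unifies with any node and takes on its label,
while two distinct ordinary labels fail to unify.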

Nodes can also dynamically exhibit higher-level organization (IS\_A links are
static) meaning that some of them, during the process of linguistic
comprehension and production, may group together in a single (node-like)
unit. Such temporary configurations, best thought of as ``macros'' or
``hypergraph nodes'' are denoted in the syntax by curly brackets, see
Section~\ref{parens}.

Finally, the entire set of nodes is viewed as adiabatically changing: new
nodes are added as the individual whose linguistic capabilities are being
modeled acquires new words/morphemes.


%http://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/OverviewKR_491_2017-4x1.pdf


\printbibliography

\end{document}


Add stuff on tojo1
Acc cum inf


small grammar suitable for parsing. In the grammar, () stands for optional,
$|$ stands for choice, and $^*$ for zero or more.


\subsection{Iterated links} 

The definition graph will often have subgraphs denoted as one. In such cases,
parentheses are used with the lead predicate. Examples: 

\begin{verbatim}
weight  physical(quantity), heavy
water  liquid, HAS lack(colour), HAS lack(taste), HAS lack(smell), life NEED
\end{verbatim}


\noindent
Definition $\rightarrow$ Definiendum Definiens (\% Comment)\\
Definiendum $\rightarrow$ Atom\\
Definiens $\rightarrow$ Clause (``,'' Clause)$^*$\\
Comment  $\rightarrow$ (Arbitrary string)\\
Atom  $\rightarrow$ PlainAtom$|$NumberedAtom\\
NumberedAtom  $\rightarrow$ PlainAtom ``/'' Number\\
Clause  $\rightarrow$ 0Clause$|$1Clause$|$2Clause$|$ComplexClause\\
0Clause  $\rightarrow$ Atom\\
1Clause  $\rightarrow$ BIN '\\
2Clause   $\rightarrow$ ' BIN\\
ComplexClause  $\rightarrow$ Atom(Clause)

\end{document}


Here we introduce some terminology by paraphrase, describing the intended
meaning before offering more formal definitions. Our goal is to stay close to
the standard meaning of these terms, but we do not intend to fully recreate
every aspect of the theories where they originate. 

{\bf Atoms} are intended in the original sense of Democritus, to be small,
indivisible, indestructible units. They are the building blocks of solid
physical objects, but (i) it's not enough to have atoms to have a physical
object, you also need some kind of glue to hold them together and (ii) we
permit other physical objects which are not atomic, in particular liquids and
gases/emanations of various sorts (think of force fields or sounds) which are
thought of as infinitely divisible. Atoms come in a few predefined sortal
types. [We don't do modern physics, quarks, etc. and flatly ignore the fact
  that what contemporary language calls atoms are divisible and destructible.]

{\bf Space} is ordinary 3D Euclidean space, {\bf time} will generally be
considered discrete (so that the arrow paradox etc. need full
discussion). Liquids and gases may fill various volumes but need not be
stationary even in a closed volume, and can have varying density. 

{\bf Objects} are given by their atoms and some glue relations that hold them
together. Objects can have various properties, such as shape, color,
temperature, taste etc. Certain properties are like liquids or gases
infusing/pervading the object (temperature, density), others (shape in
particular) pertain to their boundaries, yet others can be imperceptible
(e.g. being expensive, being cursed, etc).

There can be various (n-place) {\bf relations} obtaining between objects but,
importantly, relations can also hold between things {\it construed as}
objects, such as geometrical points with no atomic content, e.g.\ `the corner of
the room {\it is next to} the window', complex motion predicates, e.g.\ `the
flood {\it caused} the breaking of the dam', and so on. Arguments of relations
will be called {\bf matters}, but they need not be material. One particularly
important relation is {\bf identity in essence, idest}, which can hold across
time between objects (including solids) even if they are not atom-by-atom
identical. The idest relation we are interested in is the relation holding
between anaphors and their antecedents, as in {\it Kim cooked dinner$_i$ and
  Sandy ate it$_i$}. This normally assumes a preponderance of atoms staying
identical, but there are important exceptions, as in {\it Our weekend sailboat
  was completely destroyed by fire but we rebuilt it}, where construal
identity overrides physical identity. Note that idest is not transitive,
because it requires a vantage point, and if A idest B from vantage P and B
idest C from vantage Q, there is no guarantee that there is an R such that A
idest C from vantage R. Note also that our interest in idest comes mostly
from establishing the vantage argument that's implicit in it. 

We do not follow situation theory \citep{Barwise:1983,Devlin:1991} closely,
but we do retain some of the techniques, and much of the motivation (reasoning
about common-sense and real world situations).  For us, a {\bf situation} is
composed of some matters (these may include emotions, intentions, and all
kinds of intangibles that exist only in the mind of some observer) and some
relations (which again may be purely notional ones like `forming a
triangle'). None of the relational tuples are negative.

Compared to the basic sortal types used in situation theory, we will have {\bf
  location}, conceived of as connected subsets (in the limiting case, points,
lines, or surfaces) in 3D Euclidean space, but also including
nonexistent/inaccessible notional locations like {\it Heaven}. We note that
locations, both spatial and temporal, are subject to the same loose identity
conditions as the idest we discussed above for objects. We will also make use
of the {\bf individual} type, again making room for nonexistent/inaccessible
individuals as well as notional ones (e.g. personifications of countries or
emotions). We will have no use for types necessitated by the technical
machinery of situation theory such as PARameter, POLarity, or TYPe, nor will
we use infons. Even the use of RELn (n-place relations) will be different, as
we have seemingly n-place relations that have hidden variables, so are
actually m-place relations, and this includes the case when we can't easily
name (or quantify over) these variables.

Altogether, we assume a large number (about $10^5$) of ur-objects, roughly one
per morpheme or word, some of them defeasibly typed as LOC (location) or PER
(a special type of individual, a {\bf person}). In addition to these, we will
have a few technical elements such as the empty node $\cdot$, three directed
connectives `0' (is, isa), `1' (subject), and `2' (object), {\bf truth
  values} from a small set $B$ (not necessarily a boolean algebra), and {\bf
  scores} from a small linear order $L$.

The directed connectives are conceptualized as colored edges from nodes to
nodes or larger structures in a graph. The graphs are built up recursively:
(i) urelements, including the empty node, are simple graphs, with their one
element considered their root. (ii) a simple node x, with a 0/1/2 numbered
link to another graph, is a complex graph with root x. (iii) A complex graph,
with root x, with a 0/1/2 numbered link {\it from its root} is again a graph,
with root x.

To each graph corresponds a {\bf formula} built up in a similar fashion:
urelements (including the empty element) are simple formulas headed by
themselves; if f is a simple formula and F a formula, then f0F, f1F, and f2F
are complex formulas, still headed by f; and if xrC is a complex formula
headed by x (here r is a metalinguistic abbreviation for 0, 1, or 2) and G
some other complex formula, then xrC0G, xrC1G, and xrC2G are again complex
formulas, still headed by x.
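
As a rough illustration, the correspondence between graphs and formulas can
be sketched in Python. The encoding of a graph as a pair (head, list of
labeled links) and the function names {\tt formula} and {\tt head} are
assumptions made for this example only. The output is the flat,
unparenthesized notation used in the text, so nested arguments are not
unambiguously recoverable from it.

\begin{verbatim}
def formula(graph):
    """Serialize (head, [(label, subgraph), ...]) as f0F, f1F, f2F."""
    h, links = graph
    out = h
    for label, sub in links:
        if label not in (0, 1, 2):
            raise ValueError("label must be 0, 1, or 2")
        out += str(label) + formula(sub)
    return out

def head(graph):
    """A formula is headed by the root of its graph."""
    return graph[0]

breakfast = ("breakfast", [])
shower = ("shower", [])
before = ("before", [(1, breakfast), (2, shower)])
print(formula(before))   # prints before1breakfast2shower
\end{verbatim}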

{\bf Valuations} are partial mappings from graphs (or, equivalently, from
formulas) to $L$. There is no analogous `truth assignment' because in the
inner models that are central to the theory, everything is true by virtue of
being present. On occasion we may be able to reason based on missing
signifiers, the dog that didn't bark, but this is atypical and left for later
study. A {\bf situation} is simply a (conjunctive) collection of
formulas/graphs. Whether we need to consider a fixed valuation as part of the
situation will be discussed later. 
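
A minimal Python sketch of these notions follows. The three-point order, the
example formulas, and the function names are hypothetical; the text only
requires that $L$ be a small linear order and that valuations be partial.

\begin{verbatim}
L = ("low", "mid", "high")     # assumed three-point linear order

# a situation: a conjunctive collection of formulas
situation = {"before1breakfast2shower", "dog0animal"}

# a partial valuation: dog0animal receives no score
valuation = {"before1breakfast2shower": "high"}

def score(formula):
    """Score of a formula, or None where the valuation is undefined."""
    return valuation.get(formula)

def at_least(a, b):
    """Compare two scores in the linear order L."""
    return L.index(a) >= L.index(b)
\end{verbatim}

Note that there is no truth assignment here: membership in {\tt situation}
is all that presence amounts to in this sketch.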

A formula of the form x1a2b is equivalent to x2b1a; both mean, in traditional
terms, that x is a binary relation whose first argument is a and whose second
is b. As syntactic sugar, these could be written axb, but once we do this, we
need parentheses, so the price of the sugar may be too high. As a compromise,
binary connectives of the kind we are interested in (these come from a small,
closed list that has spatial AT, IN, OVER, SOURCE, GOAL, temporal BEFORE and
AFTER, possessive HAS, and perhaps a handful of others like CAUSE or
INSTRUMENT) are given in small caps, and will be written infix. So `breakfast
BEFORE shower' abbreviates before1breakfast2shower or the equivalent
before2shower1breakfast, and we will say that in such cases the formula has a
{\bf main connective} x. Note that not all formulas have main connectives in
this sense.  (Another set of sugary devices concerns reification of missing
arguments, especially in relative clauses; these will not be discussed here.)
An {\bf unfolding situation} is one where the formulas include statements with
main connectives BEFORE or AFTER (one of these two could be eliminated, but
this is not exciting).
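
The infix sugar and the equivalence of x1a2b and x2b1a can be sketched as
follows. The function names are hypothetical, and the sketch handles only
statements whose two arguments are single words, which sidesteps the
parenthesization problem noted above.

\begin{verbatim}
CONNECTIVES = {"AT", "IN", "OVER", "SOURCE", "GOAL", "BEFORE",
               "AFTER", "HAS", "CAUSE", "INSTRUMENT"}

def desugar(expr):
    """Turn 'a REL b' into the flat formula rel1a2b."""
    a, rel, b = expr.split()
    if rel not in CONNECTIVES:
        raise ValueError(rel + " is not a listed binary connective")
    return rel.lower() + "1" + a + "2" + b

def normalize(rel, args):
    """Sort numbered arguments so x1a2b and x2b1a coincide."""
    return rel + "".join(str(n) + arg for n, arg in sorted(args))

def is_unfolding(situation):
    """True if some statement has main connective BEFORE or AFTER."""
    for statement in situation:
        parts = statement.split()
        if len(parts) == 3 and parts[1] in ("BEFORE", "AFTER"):
            return True
    return False
\end{verbatim}

For instance, {\tt desugar("breakfast BEFORE shower")} yields
before1breakfast2shower, and {\tt normalize} maps both argument orders of
that formula to the same string.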

{\it Extension} of a situation can be by adding further formulas (this is
standard), and more importantly, by adding further specifications to existing
formulas. One of these is adding further, hitherto unspecified arguments to
the formulas already in a situation; for example, `the guttersnipe killed the
officer {\it with a long-range gun}' extends kill1guttersnipe2officer to kill
INSTR lrg, which would be, if we undo the syntactic sugar,
instr1kill1guttersnipe2officer2lrg. Note that this kind of elaboration leaves
the two situations (with or without the instrument clause) idest, but it may
change the valuation.


\section{Definitions}


{\bf Definition 1.} A model instance is a collection of finitely many {\it
  reachable} objects and an unspecified number (possibly zero) of {\it
  unreachable} ones. It is assigned a rational number called its {\it
  timestamp}. Here and in what follows `object' is a term neutral between what
are traditionally considered objects (e.g. a table) and what are traditionally
called `events' (e.g. a soccer match).  \andras{More on atomic and compound
  objects}
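
One way to picture Definition 1 is the following Python sketch. The class and
field names are hypothetical; using {\tt None} for an unspecified collection
of unreachable objects is a modeling choice made for this example, not
something the definition prescribes.

\begin{verbatim}
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional

@dataclass(frozen=True)
class ModelInstance:
    reachable: frozenset     # finitely many reachable objects
    timestamp: Fraction      # a rational number
    # None means the unreachable objects are left unspecified
    unreachable: Optional[frozenset] = None

# objects and events are treated alike
m = ModelInstance(frozenset({"table", "soccer_match"}),
                  Fraction(3, 2))
\end{verbatim}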

{\bf Definition 2.} {\it Unary} terms select a subset (possibly empty) of
objects in each model instance. For example, {\it person(x)} selects those
objects that are people (reachable in all but human-uninhabited models),
whereas {\it unicorn(x)} selects only unreachable objects (reachable only
in fictional/mythical models).

{\it Binary} terms, written infix, are directed graph edges connecting two
objects/terms.  For example {\it dream\_of} is a binary term connecting a word
to an object and {\it Kim dream\_of unicorn} is an edge, colored `dream\_of',
between a (possibly reachable) object {\it Kim} and a (possibly unreachable)
object {\it unicorn}. Note that edges can terminate (and/or start) in other
edges as well, as in {\it Kim dream\_of eat icecream} where the end of the
{\it dream\_of} edge is the edge (Kim eat icecream). 
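
Definition 2 can be sketched in Python as follows. The names {\tt Edge} and
{\tt select}, and the tagging of objects by sets of strings, are assumptions
made for illustration only.

\begin{verbatim}
from collections import namedtuple

# a binary term is a colored, directed edge
Edge = namedtuple("Edge", "source color target")

def select(term, tags):
    """Unary term: the subset of objects carrying the given tag."""
    return {obj for obj, ts in tags.items() if term in ts}

# hypothetical instance: one person, one unicorn
tags = {"Kim": {"person"}, "u1": {"unicorn"}}
people = select("person", tags)

# an edge may terminate in another edge
inner = Edge("Kim", "eat", "icecream")
outer = Edge("Kim", "dream_of", inner)
\end{verbatim}

Here {\tt outer.target} is itself an edge, mirroring the case where the end
of the {\it dream\_of} edge is the edge (Kim eat icecream).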

\section{Preliminaries}

We need two special relations, {\it name} and {\it exist}, and two special
types, {\it object} and {\it expression}. Expressions (words and more complex
linguistic expressions) are mapped by the interpretation relation to terms.
Some objects have names, many don't. An unreachable object can also have a
name. `Bilbo never heard of Palantir' is true. 


\section{Temporal structure}
 We need to define a model stream (a list of instances with increasing
 timestamps, largely invariant objects, and causal continuity).
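
Pending a full definition, the timestamp condition alone can be sketched in
Python. The function name {\tt is\_stream} and the pairing of each timestamp
with a set of objects are hypothetical; `increasing' is read here as strictly
increasing, which is an assumption, and neither object invariance nor causal
continuity is checked.

\begin{verbatim}
from fractions import Fraction

def is_stream(instances):
    """Check only that the timestamps are strictly increasing."""
    stamps = [t for t, _ in instances]
    return all(a < b for a, b in zip(stamps, stamps[1:]))

stream = [(Fraction(0), {"Kim"}),
          (Fraction(1, 2), {"Kim"}),
          (Fraction(3, 2), {"Kim", "cake"})]
\end{verbatim}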

  know =AGT HAS information, information ABOUT =PAT
And we learn a bit more: 

\begin{verbatim} 
cake food, sweet, [ ' make] INSTRUMENT bake, FROM/2742 flour, FROM/2742 butter, FROM/2742 sugar, FROM/2742 egg	
\end{verbatim}


\begin{quote}
1a: potassium carbonate, esp. that obtained in colored impure form by leaching
 wood ashes, evaporating the lye usu. in an iron pot, and calcinating the
 residue -- compare pearl ash. b: potassium hydroxide. 2a : potassium oxide
 K$_2$O in combined form as determined by analysis (as of fertilizers)
 $\langle$ soluble $\sim$ $\rangle$ b: potassium -- not used systematically
 $\langle$ $\sim$ salts $\rangle$ $\langle$ sulfate of $\sim$ $\rangle$ 3: any
 of several potassium salts (as potassium chloride or potassium sulfate) often
 occurring naturally and used esp. in agriculture and industry $\langle$
 $\sim$ deposits $\rangle$ $\langle$ $\sim$ fertilizers $\rangle$
\end{quote}


Muslim: deleted believe (more consistent with other religions)
Ural: added mountain, more consistent with Himalayas etc. 
sign: head is information, not ' perceive

bad CAUSE hurt
cup container, small, open/1814, drink IN, HAS flat(bottom), <HAS handle> %classic default case!


"HAS lack" replaced by "lack"

Now part of syntax:

\begin{verbatim}

=AGT	178
=PAT	174


=REL	34

-ness  =AGT IS\_A stem, "-ness" MARK stem
about ABOUT
across before(=AGT ON side), =PAT HAS side, after(=AGT ON other(side)) =PAT HAS side[other] 
after =PAT BEFORE =AGT
along AT/2744 =PAT[long]    
around AT/2744 =PAT[round]
because =PAT CAUSE =AGT
before =AGT BEFORE =PAT
between =AGT AT/2744 location, location separate =PAT
by INSTRUMENT
condition =AGT CAUSE[=PAT[possible]]
fairly =AGT[=PAT ER average]
follow =PAT BEFORE =AGT
for AT/2744 exchange, "for" MARK price
for =PAT get =AGT %many other meanings of "for"
from before(=AGT IN/2758 =PAT)
from period, "from" MARK period[begin]
if =PAT before =AGT, lack(` know =PAT), "if" MARK =PAT
in location, =AGT AT/2744 location, location contain =AGT, "in" MARK location
instead =AGT
necessary lack(=AGT) CAUSE (=PAT fail), "for" MARK =PAT
of =PAT HAS =AGT
of location, before(=AGT IN/2744 location), "of" MARK location
of ABOUT %many other meanings of "of"
on AT, =AGT touch =PAT, <over>
over high(=AGT ER =PAT)
rather good(=AGT 'ER =PAT), "than" MARK =PAT
shall IN/2758 future

through before(=AGT ON side), =PAT HAS side, after(=AGT ON other(side)) =PAT HAS side[other]
to after(=AGT AT/2744 =PAT)
towards	after near(=AGT ER =PAT)
under high(=PAT ER =AGT)
until BEFORE end, "until" MARK end
while AT/2744 time, "while" MARK time	
with "with" MARK instrument
within =AGT IN =PAT


=POS	26

attention V listen, see, =AGT think =PAT[interesting, important] %light verb, "pay attention"
base PART_OF whole, AT/2744 bottom, whole HAS bottom, CAUSE whole[fix]
bottom PART_OF whole, position, deep( ER whole)
duty must[=AGT DO], <society/2285> want[=AGT DO] 
fault CAUSE problem	  
finish PART_OF event, event lack after
friend person, =AGT has, =AGT know, =AGT LIKE/3382, =AGT TRUST %possessors are agents, possessed elements are patients
handle PART_OF object, FOR/2782 hold(object IN/2758 hand)
home place/1026, =AGT AT/2744, "poss" MARK =AGT
interest =AGT want, "poss" MARK =AGT
interest desire, =AGT want[=AGT know], "poss" MARK =AGT %note agentless forms: the discovery has/holds great interest for geographers
key instrument, metal, lock HAS, key[turn] CAUSE[lock open/1814]
meaning information IN/2758 mind, sign represent
member group HAS, IN/2758 group
opinion thought, person HAS, person[confident], person lack proof
order relation, more(item) HAS, first PART_OF
pay money, =AGT give, =PAT receive, FOR work
period time, HAS start, HAS end
piece thing, small, PART_OF thing[large]
quality property, characteristic, inherent, <good>
reason CAUSE thing, ` understand thing
skill ability, =AGT can/1246[=AGT[act]], =AGT HAS practice,  act[good]
start act AFTER
stock document, company HAS, [person HAS stock] prove [person HAS [PART_OF] company]
subject conversation HAS, HAS more(part), more(part) connect more(part)
tent home, temporary, can/1246(` move), HAS wall[lack(rigid),<cloth>, tense], rope PART_OF 


=DAT	19


allow =AGT[lack[=AGT stop =PAT]]
appear after[=AGT AT location], "locative" MARK location
difficult act need large(effort), "to/inf" MARK act
explain =AGT CAUSE[recipient understand =PAT], "dative" MARK recipient
express =AGT show[=AGT feel  =PAT]
give =AGT CAUSE[recipient HAS =PAT], "dative" MARK recipient
have HAS
help =AGT CAUSE[=PAT[succeed]], =AGT WITH =PAT
let give, =AGT CAUSE[lessee use =PAT], "dative" MARK lessee
pass =AGT allow[=PAT AT place], "to/2743" MARK place
pay =AGT CAUSE[money AT place], "to/2743" MARK place
please =AGT make[=PAT HAS joy]
present =AGT CAUSE[recipient SEE =PAT], "dative" MARK recipient
say communicate, INSTRUMENT sound/993, recipient HEAR sound/993, "dative" MARK recipient
seem gen think[=AGT IS\_A =PAT], "dative" MARK =PAT
sell =AGT CAUSE[buyer HAS =PAT], buyer CAUSE[=AGT HAS <money>], "dative" MARK buyer
show =AGT CAUSE[gen see =PAT]
teach =AGT CAUSE[student know =PAT], "dative" MARK student
thank =AGT EXPRESS[=AGT feel grateful], polite


=TO	19


able =AGT can/1246[=AGT[act]], "to/inf" MARK act
add =AGT CAUSE[=PAT IN/2758 place], "to/2743" MARK place
addition change, add

belong =PAT HAS =AGT, "to/2743" MARK =AGT
gentle careful, kind
include =AGT CAUSE[=PAT IN/2758 place], "in/10" MARK place
invite =AGT say[=AGT want[=PAT AT/2744 place]], "to/2743" MARK place
join after(together) %implies before(separate)
law rule, system, society/2285 HAS, official, norm
listen =AGT CAUSE [=AGT hear], "to/2743" MARK =PAT
mix =AGT CAUSE[=PAT IN/2758 place], place[material<liquid>], =PAT[material, before(separate)], "in/10" MARK place
necessary gen need
occasion N event
put =AGT CAUSE[=PAT AT/2744 place], =AGT move =PAT, "to/2743" MARK place
ready  =AGT can/1246[=AGT[act]], now, "to/inf" MARK act
similar =AGT HAS property, =PAT HAS property, "to/2743" MARK =PAT
skill ability, =AGT can/1246[=AGT[act]], =AGT HAS practice,  act[good]

=FROM	7


accept receive, =AGT think[=PAT[right/1191] for =AGT]
buy =AGT receive =PAT, =AGT pay seller, "from" MARK seller
%of I think this is a bug, we need instead 
from  -to1l a/ab od 58	   u  G	   "from" MARK after[far]
remove =AGT CAUSE[=PAT FAR source], "from" MARK source
rubber material, flexible
eraser instrument, remove [mark ON paper]
separate V =AGT CAUSE [=PAT FAR source], "from" MARK source %this is the definition of V not A
separate A different
take =AGT CAUSE[=PAT AT/2744 =AGT]

=OBL	5

appear observer THINK [=AGT is_a =PAT], "dat" MARK =PAT
equal HAS quantity, =PAT HAS quantity, "dat" MARK =PAT
full lack(space/2327)

shoot =AGT cause[bullet AT/2744 =PAT], =AGT use gun

=FOR	1

use	haszna1l	utor	uz1ywac1	1008	u	V =AGT HAS purpose, =PAT help purpose,  "for/to" MARK purpose 
\end{verbatim}
%-er, -est

``a lot'' lot\_of 

guitar-shaped, U-shaped, C-shaped
noble/nobility
``the quality or state''

divinity	the quality or state of being a god
purity	the quality or state of being pure
youth	the quality or state of being young


``is or about''


ON 44 

wh 7
gen 90

II causative/intensive
VI reciprocal/pretend
IV causative
imperative
V reflexive
VIII reflexive/passive
IX colors/defects
X desiderative
location of
II N->V

The indirect object of form I is the direct object of form III.