montague
is a little CCG semantic parsing library for Scala.
You can build on this code to translate English-into-SQL, English-into-API commands, etc. To do so, you need to build a lexicon for your specific application. We include a few sample lexicons, as described in the Getting Started section of this README.
The code currently implements boolean (non-probabilistic) CCG parsing, using a CKY-based parse search strategy.
Note that the repo history doesn't accurately reflect authorship, because much of the code was ported from the original UPSHOT repo.
At UPSHOT (acquired by Workday), we built a semantic parser that translated English into SQL, and — later — English into SOQL (the Salesforce query language). This was packaged in a mobile application with the following architecture:
This package extracts and open-sources the core CCG-based semantic parser component of UPSHOT, in a form that is general-purpose and self-contained. We hope that that other people find it useful. We plan to clean up the SQL-generation code and release that too. If you have more requests, please email us.
Joseph Turian and Alex Nisnevich gave a talk at Strata 2016 introducing montague
(video here).
"Oh, get ahold of yourself. Nobody's proposing that we parse English." — Larry Wall in
<199709032332.QAA21669@wall.org>
montague
takes its name from Montague
semantics, the
idea that human language can be expressed through formal logic and
lambda-calculus. The process of inferring this formal representation
from natural language is called "semantic parsing". Specifically,
montague
implements Combinatory Categorial Grammar
(CCG), a
particular grammar formalism that has become popular recently for
semantic parsing.
Let's break it down.
Here's an example of a definition in montague
:
("plus" -> ((N\N)/N, λ {y: Int => λ {x: Int => x + y}}))
Here's what it means:
- There's a term called "plus".
- It has the syntactic category
(N\N)/N
. This means that it's something that attaches to a noun (N
) after it to form aN\N
, which is a thing that attaches to a noun before it to form another noun. In other words, "plus" must be between two nouns, and the end result of(Noun) plus (Noun)
syntactically is just another noun. So far, so good! - It has the semantic definition
λ {y: Int => λ {x: Int => x + y}}
. In other words, it's a function that takes an integer and returns a function that another integer, adding the first integer to it. Uncurrying it (because in Montague semantics, all functions must be curried) simply yieldsλ {x: Int, y: Int => x + y}
. Well, that's pretty straightforward.
Looking through the code of the examples below, you'll notice that not all definitions look quite like this. Some of them have multiple synonyms for one definition, or multiple definitions for a single term (in that case, we say that the term is ambiguous). Some of them don't operate on single terms at all, but on matchers (functions that try to find certain kinds of strings). Some of them don't have semantic definitions at all and only describe syntactic categories. But the general idea for all of these is the same.
The way you use montague
is by defining your own lexicon of terms with syntactic
and semantic definitions. And the semantic parser does the rest.
montague
comes with a few simple examples demonstrating some applications of its semantic parsing features.
(See ArithmeticParser
.)
In this example, English is parsed into a semantic form, which is then realized as arithmetic operations.
> sbt "runMain example.ArithmeticParser (3 + 5) * 2"
Input: (3 + 5) * 2
Output: 16
Our current implementation treats all grammatical rule applications as either possible (true) or impossible (false). For this reason, the parser cannot currently discriminate between rules of different precedence:
> sbt "runMain example.ArithmeticParser 3 + 5 * 2"
Input: 3 + 5 * 2
Output: Ambiguous(13, 16)
Besides ambiguity arising from the inability to discriminate between
different rule applications, we might want intentionally to encode
ambiguity into our language. For example, the +/-
operation is
intentionally ambiguous, and has multiple valid semantic
interpretations:
> sbt "runMain example.ArithmeticParser (3 +/- 5) * 2"
Input: (3 +/- 5) * 2
Output: Ambiguous(16, -4)
We ignore all unrecognized tokens by adding an Else
clause in the
lexicon, which matches all tokens that wouldn't match otherwise, and in
this case produces semantically null parses:
> sbt "runMain example.ArithmeticParser Could you please tell me, what is 100 + 100 ?"
Input: Could you please tell me, what is 100 + 100 ?
Output: 200
(See CcgBankParser
.)
Using the CCGBank lexicon, we parse English sentences into syntax dependency trees.
If you don't have the CCGBank lexicon, you can use an older version of it that we downloaded from
Julia Hockenmaier's site. The lexicon is located at
data/lexicon.wsj02-21.gz
.
You can then parse sentences using the old CCGBank lexicon as follows:
> sbt "runMain example.OldCcgBankParser Thom and Alex and Joseph are writing a parser"
Input: Thom and Alex and Joseph are writing a parser
Output:
are
writing
a
parser
and
joseph
and
alex
thom
If you do have the CCGBank lexicon and would like to use it, put
CCGbank.00-24.lexicon
into subdirectory data/
, and invoke the
parser as follows:
> sbt "runMain example.CcgBankParser Thom and Alex and Joseph are writing a parser"
(See InformationStore
.)
InformationStore
uses SemanticRepl
to implement a very basic information storage and retrieval system, by parsing statements into Define
constructs and queries into Query
constructs, then executing them accordingly.
Important: This example uses the older-style CCGBank lexicon - see the above example for download instructions.
An example interactive session with InformationStore
:
> sbt "runMain example.InformationStore"
>> Joseph is a programmer
Ok
>> Joseph is pretty weird
Ok
>> Who is Joseph?
{a(programmer), pretty(weird)}
>> Who are Joseph and Ted?
I don't know
>> Joseph and Ted are Alex's bandmates
Ok
>> Who are Joseph and Ted?
alex's(bandmates)
SemanticParser
is the main entry point into montague. To instantiate a SemanticParser
, you need a syntactic scheme (CcgCat
for our purposes) and a lexicon, stored in a ParserDict
.
Once you've instantiated a SemanticParser
, .parse(text, tokenizer)
yields a SemanticParseResult
, which you can unpack to find the parse tree and resulting semantic representation (if the parse succeeded). See SemanticParser.main
for an example of how to extract results.
To build up a lexicon, you can create a new ParserDict()
(or load syntactic entries from a CCGbank lexicon, if you have one, with ParserDict.fromCcgBankLexicon
) and add entries to it with the +
operator.
A lexicon entry looks like (matcher -> meaning)
, where
matcher
can be- a term (String),
- a Seq of terms,
- an instance of
TokenMatcher[T]
(aString => Seq[T]
function), or Else
, which matches any otherwise un-matched token; and
meaning
can be- a syntactic category,
- a
(
syntactic category,
semantic representation)
pair, - a Seq of either of the above (in which case the meaning of the term is ambiguous), or
- (only if the
matcher
is aTokenMatcher
orElse
), a function of the matched object that produces of Seq of syntactic categories or a(
syntactic category,
semantic representation)
pairs
SemanticRepl
is a wrapper on top of SemanticParser
that's useful for making REPLs that repeatedly read input, parse it into an "action", and pattern-match that action to perform some operation against an internal state. For an example of SemanticRepl
at work, see the InformationStore
example above.
montague supports arbitrary syntactic schemes, but the only one built-in is CcgCat
, representing CCG categories. Here are the categories available in the ccg
package:
- Terminal syntactic categories are ones that can appear at the top of the parse tree and cannot consume adjacent terms. Built-in terminals are
S
("sentence"),N
("noun"),NP
("noun phrase"), andPP
("prepositional phrase"), but others are easy to add, depending on your application. - Non-terminal syntactic categories are ones that can (and must) consume adjacent terms. A parse cannot succeed if there is a non-terminal at the top of the parse tree. Types of non-terminal categories are:
- Forward application:
A/B
consumes aB
in front of it to become anA
. - Backward application:
A\B
consumes aB
behind it to become anA
. - Bidirectional application:
A|B
consumes an adjacentY
to become anA
. - Identity categories:
X|X
,X/X
, andX\X
are special cases of the above -- they can consume a term of any category to become that category. Conj
, the conjunction category, is a short-hand for(X\X)/X
. (Exercise: Why is this called the "conjunction" category?)
- Forward application:
- Additionally, a category assigned to a term may have a probability attached to it:
Cat % prob
. For example, if the term apple has categoriesN % 0.9
and(NP/N) % 0.1
, that means that it's 9 times more likely to be a noun than an adjective, and the parser will score potential parses accordingly. Probabilities default to1.0
if unspecified. - A category may also have a label:
Cat("label")
.A/B("somelabel")
can consume aB("somelabel")
but not a regularB
.
A SemanticState
is generally one of two things (in each case, LF
corresponds to the type of the objects we're dealing with -- in the examples below, LF = Int
):
- A form
Form[LF]
represents a semantic state that is "complete" (i.e. doesn't consume any arguments) -- for example,Form(4)
represents the number four. - A lambda
λ {x: LF => <semantic state>}
represents a semantic state that must consume arguments -- for example,λ {x: Int => Form(x + 5)}
represents the function that adds 5 to any integer. Implicit conversions allow us to write this more concisely toλ {x: Int => x + 5}
. Multi-argument functions are represented by currying (e.g.λ {y: Int => λ {x: Int => x + y}}
).
There are a few other possible SemanticStates
that generally shouldn't be specified directly in the lexicon, but can appear in parse results:
Nonsense
represents a parse with no valid semantic outputs.Ambiguous(Set[SemanticState])
represents a parse with more than one valid semantic output.Ignored
represents a parse that ignored semantics entirely (i.e. you didn't specify semantic representations in the lexicon).
(In rough order of difficulty.)
- While all of our examples involve CCG parsing, montague supports alternative syntactic schemes. Create your own semantic scheme (that is, a type hierarchy that inherits from
SyntacticLabel
), and parse something with it. - Composition. montague's CCG implementation currently supports only one of the three CCG combinators: the application combinator (
X/Y Y -> X
,X X\Y -> Y
). Extend it to also support the composition combinator (X/Y Y/Z -> X/Z
). - Type-raising. As above, but for the type-raising combinator (
X -> T/(T\X)
). - Probabilistic parsing. The parser currently supports boolean parsing: a parse of the input string is either possible (true) or impossible (false). (This corresponds to implementing the boolean semiring parser of Goodman, 1999 -- see Figure 5.) Extend the code so that it supports probabilistic or weighted parsing. (The existing probabilistic implementation of English parsing, based upon multiplying out lexicon weights, is a hack that including the token weights within the CCG category.)
- Speed improvements. The parser implements a CKY search strategy, which is bottom-up. If the parser had weight implemented, we could parse faster using agenda-based parsing: You use an agenda to order nodes by some priority. For example, the priority can be the cumulative probability of applying the rules (best-first parsing). Alternately, instead of agenda-based parsing, beam pruning could be used to reduce the size of the search space. In this case, only the top k weighted nodes are kept in any parse cell.
- † Fuzzy matching. What if a user enters a phrase that doesn't parse successfully, but adding or removing one word (or perhaps correcting a misspelling) would fix it? Create a "Did You Mean?" feature that implements this efficiently.
- Add more features to the
ArithmeticParser
example. For example, improve the tokenizer to correctly handle infix expressions without spaces (e.g.(1+2)*3
), or add more operations. - Add more features to the
InformationStore
example. For example, add other types of relations, or support more kinds of expressions. - IFTTT. Parse English phrases like "When (this happens), (do this)" into semantic forms corresponding to IFTTT API calls. Try building a REPL on top of
SemanticRepl
for communicating with IFTTT via natural language. - Slack bot. Similar to the above, but do something cool with a Slack bot instead.
- † Game semantics. Come up with a semantic scheme for representing rule descriptions for a simple card game (think Magic, Hearthstone, etc., but simplify!) For example, a card may say something like "Whenever your opponent loses life, draw a card". Then write a parser for it.
- † English to Freebase. Parse English phrases into Freebase queries. For example (borrowing an example from SEMPRE), "Which college did Obama go to?" →
(and (Type University) (Education BarackObama))
→ "Occidental College, Columbia University". (Hint: You'll have to generate most of the lexicon programmatically using Freebase as well.) - † English to SQL. Parse English questions (such as "How many customers in Europe made a purchase last month?") into SQL statements. Assume that you have all relevant table structure information available. (Hints: Generate an abstract structure for the full parse first, and then worry about generating SQL out of it. Much of the lexicon will have to be generated from table and column names, as well as entries for categorical columns. There will be a lot of ambiguous definitions.)
- SEMPRE is a toolkit for training semantic parsers.
- Cornell Semantic Parsing Framework is an open source research software package. It includes a semantic parsing algorithm, a flexible meaning representation language and learning algorithms.