This is the code for the generation system described in 'Assessing Composition in Sentence Vector Representations' (COLING 2018), by Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik.
The probing task datasets and classifier code for the experiments described in the paper can be found here.
@inproceedings{ettinger2018assessing,
author = {Allyson Ettinger and Ahmed Elgohary and Colin Phillips and Philip Resnik},
title = {Assessing Composition in Sentence Vector Representations},
booktitle = {Proceedings of COLING},
year = {2018},
}
- NumPy
- NLTK
You can generate sentences that will vary freely within the built-in system parameters using the command below
python gen_from_meaning.py --setname sent --setdir . --configfile config1.example.json --mpo 10
This will generate 10 sentences for each structural category that the system uses, and write them to sent.txt
in the generation_system
directory, with annotation objects written to sent-annot.json
.
setname
is used in naming the output files and will also be used in sentence IDssetdir
specifies the location to write the output filesconfigfile
specifies the config file, described in more detail belowmpo
specifies the maximum number of sentences to be generated for a given structural categoryadv
you can also include this option with an integer argument, in which case the system will insert a variable number of adverbs before verbs (as long as your vocabulary includes adverbs), with the integer specifying the maximum number of adverbs to precede a given verb
When using the needEv
and avoidEv
constraints in the config files, you will use EVENT objects. config2.example.json
and config3.example.json
give examples of using these objects to specify these constraints. EVENT objects allow specification of the following properties:
EVENT objects can specify the following:
name
: the lemma describing the event (verb)tense
: tense of the event (past
orpres
, e.g., 'sleeps' vs 'slept')aspect
: aspect of the event (neut
orprog
, e.g., 'sleeps' vs 'is sleeping')voice
: voice of the event (active
orpassive
, e.g., 'x chased y' vs 'y was chased by x')pol
: polarity of the event (pos
orneg
, e.g., 'slept' vs 'did not sleep')frame
: transitivity of the event (transitive
orintransitive
)participants
: participants in the event - an object with possible keys ofagent
andpatient
The participant keys (agent
or patient
) map to CHARACTER objects, within which you can specify the following:
name
: the lemma describing the participant (noun)num
: number of the participant (sg
orpl
, e.g., 'student' vs 'students')attributes
: attributes of the participant - another object. Currently the only supported attribute is a relative clause, the key for which isrc
, and the value for which is another object.
The RC object can specify the following:
rtype
: object-relative or subject-relative (orc
orsrc
)event
: the event described by the relative clause. The value here will be another EVENT object, as described above.
Note that at present only one event object can be specified for a given constraint.
Config files are used to specify a) the input constraints for the sentences to be generated, and b) possible structures for the sentences.
The three example config files give some illustrations of how you can specify constraints.
config1.example.json
specifies no input constraints. Sentences can vary freely within the built-in parameters of the system.
config2.example.json
uses all of the categories of input constraint, to illustrate usage of each.
- With
needEv
, it specifies an event that all sentences must include: in this case, an event with lawyer as AGENT of recommend. This is a JSON object. - With
avoidEv
, it specifies an event that no sentences can include: in this case, an event with lawyer as AGENT of shout. This is a JSON object. - With
needList
, it specifies lemmas that need to be present in every sentence. This is a JSON object with keys for 'noun','transitive' (verb), and 'intransitive' (verb), and arrays of lemmas as values. - With
avoidList
, it specifies lemmas that cannot be present in any sentence. This is a JSON object with keys for 'noun','transitive' (verb), and 'intransitive' (verb), and arrays of lemmas as values.
config3.example.json
demonstrates a more complex usage of the EVENT object in specifying the needEv
constraint. In this case, the constraint specifies that
These examples leave constant the role_rc_structures
object, which lists the different possible structures that the sentences can be made up of. The structures listed in this object specify whether different participants have relative clauses, so for example {"agent":"none","patient":"transitive"}
describes a sentence in which the agent has no relative clause attribute, and the patient has a transitive relative clause.
lexical/vocabulary.json
already contains a usable vocabulary (containing only human nouns and verbs that are compatible with human participants, to preserve plausibility).
If you want to modify the vocabulary, you can do this by first modifying lexical/vocabulary.json
. The vocabulary object in this file has four keys corresponding to lemma categories: nouns, transitive verbs, intransitive verbs, and adverbs. New vocabulary items must be placed in the array corresponding to the correct key, and must be lemmas (dictionary forms - e.g., 'sleep') and NOT inflected forms (e.g., 'sleeps').
You can then build the lexical variables that the system will use by running get_lexicon.py
.
python get_lexicon.py
This will read from lexical/vocabulary.json
and write a new gensys_lexvars.json
(which contains an inflection dictionary and other lexical variables for the system to use) based on the new vocabulary.
get_lexicon.py
uses the XTAG morphology database to look up inflections, and will remove any lemmas from the vocabulary that are missing inflections in that lookup.