DTD 1.1 #52

Open
arademaker opened this issue Jul 1, 2021 · 13 comments

Comments

@arademaker
Member

Why is partOfSpeech an attribute of Lemma and not an attribute of LexicalEntry?

@arademaker
Member Author

arademaker commented Jul 1, 2021

Why, in the lemon RDF vocabulary, does a LexicalEntry have a canonicalForm and otherForm, while in the XML a LexicalEntry has a Lemma and one or more Form elements?

@jmccrae
Member

jmccrae commented Jul 1, 2021

These are all related to the original formats.

The modelling of partOfSpeech on Lemma is due to Kyoto-LMF: http://kyoto-project.eu/xmlgroup.iit.cnr.it/kyoto/index6bfa.html?option=com_content&view=article&id=143&Itemid=129 Personally, I find this weird but there is no technical reason to change this.

I would see canonicalForm in OntoLex as equivalent to Lemma in LMF and otherForm as equivalent to Form from my understanding of these models.

@arademaker
Member Author

Thank you, I thought we could change the schemas and DTDs in this repo freely. It would make sense to use the same terminology on both if possible.

@jmccrae
Member

jmccrae commented Jul 1, 2021

We can change things, of course, but there need to be good reasons for changes, given the precedents set by previous formats. I guess I can close this issue?

@jmccrae jmccrae closed this as completed Jul 1, 2021
@arademaker
Member Author

arademaker commented Jul 1, 2021

Does no other GWA member want to comment?

I would vote to adapt the XML and RDF schemas to a single terminology.

@arademaker arademaker reopened this Jul 1, 2021
@goodmami
Member

goodmami commented Jul 1, 2021

I'm not a voting member but I'll add that I agree with John. We may not live in the best possible world, but we shouldn't break backward compatibility only for (effectively) aesthetics. You may be interested in #43, however.

@arademaker
Member Author

> but we shouldn't break backward compatibility only for (effectively) aesthetics.

The problem is how to define whether a modification is only aesthetic. But fine, good to have more opinions.

Thank you for the link to the other issue.

@goodmami
Member

goodmami commented Jul 1, 2021

> The problem is how to define if a modification is only aesthetic.

Fair point. For me, I'd ask if the change allows us to do something we couldn't do before, or prevents us from doing something we could do before. If not, it's aesthetic (or "non-functional", etc.).

For example, partOfSpeech on <Lemma> is effectively the same as putting it on <LexicalEntry> because every lexical entry must have exactly one lemma, so there will always be one partOfSpeech within a lexical entry. Moving it to be an attribute of <LexicalEntry> wouldn't change this. The gray area here is that one could argue that when it's on <Lemma> it is not clear that the part of speech also pertains to any other <Form> elements (i.e., siblings of the <Lemma>) within the <LexicalEntry>, but that's a matter of interpretation.
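
To make the equivalence concrete, here is a minimal Python sketch. The sample XML and the helper name `entry_pos` are made up for illustration and are not taken from any released wordnet. It reads partOfSpeech from the single <Lemma> child and treats it as the part of speech of the whole <LexicalEntry>, which is well defined only because each entry has exactly one lemma.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the current layout: partOfSpeech sits on <Lemma>.
SAMPLE = """
<Lexicon id="ex" language="en">
  <LexicalEntry id="ex-run-v">
    <Lemma writtenForm="run" partOfSpeech="v"/>
    <Form writtenForm="ran"/>
    <Form writtenForm="running"/>
  </LexicalEntry>
  <LexicalEntry id="ex-dog-n">
    <Lemma writtenForm="dog" partOfSpeech="n"/>
  </LexicalEntry>
</Lexicon>
"""


def entry_pos(entry):
    """Return the part of speech of a LexicalEntry via its single Lemma."""
    lemmas = entry.findall("Lemma")
    if len(lemmas) != 1:
        # The lookup is only unambiguous with exactly one Lemma per entry.
        raise ValueError(f"{entry.get('id')}: expected exactly one Lemma")
    return lemmas[0].get("partOfSpeech")


root = ET.fromstring(SAMPLE)
pos_by_entry = {e.get("id"): entry_pos(e) for e in root.iter("LexicalEntry")}
print(pos_by_entry)
```

Moving the attribute up to <LexicalEntry> would replace the helper with a plain `entry.get("partOfSpeech")`; the information recovered is the same either way.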

@1313ou

1313ou commented Jul 3, 2021

I more or less agree with you.

Instead of the vague "what you can do and can't", I'd suggest reasoning in terms of information.

Some changes are indeed cosmetic such as renames (no info brought in or removed).
Others, while not affecting the quantity of information, affect

  • redundancy,
  • how information is structured,
  • incidentally how information is maintained, retrieved and processed: you define the same info (can/can't do the same things) but it can be substantially harder or easier to maintain, retrieve, process ..., the latter depending on the application.

In this case, the PartOfSpeech attribute trickles down from the file's name to Lemmas and Synsets where it hardly brings new information (except for the tricky adjectives which can split into a or s). Of course you need it after merging but it can be derived and recorded then. Also, we want maintenance scripts to find it suspicious for wn-noun files to contain Lemmas with verb parts-of-speech.
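
The kind of maintenance check meant here could look roughly like the sketch below. The `ALLOWED` mapping, the function name `suspicious_lemmas`, and the sample data are assumptions for illustration; the actual file naming and scripts may differ.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the kind of per-POS file to the codes it may contain.
# Adjective files allow both "a" and "s" (the tricky split mentioned above).
ALLOWED = {"noun": {"n"}, "verb": {"v"}, "adj": {"a", "s"}, "adv": {"r"}}


def suspicious_lemmas(file_kind, xml_text):
    """List (entry id, partOfSpeech) pairs that do not fit the file kind."""
    allowed = ALLOWED[file_kind]
    root = ET.fromstring(xml_text)
    return [
        (entry.get("id"), lemma.get("partOfSpeech"))
        for entry in root.iter("LexicalEntry")
        for lemma in entry.findall("Lemma")
        if lemma.get("partOfSpeech") not in allowed
    ]


# Hypothetical content of a noun file with one stray verb entry.
NOUN_FILE = """
<Lexicon id="ex" language="en">
  <LexicalEntry id="ex-dog-n">
    <Lemma writtenForm="dog" partOfSpeech="n"/>
  </LexicalEntry>
  <LexicalEntry id="ex-run-v">
    <Lemma writtenForm="run" partOfSpeech="v"/>
  </LexicalEntry>
</Lexicon>
"""

problems = suspicious_lemmas("noun", NOUN_FILE)
print(problems)
```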

It is assumed that PartOfSpeech is propagated up from the (unique) Lemma to the LexicalEntry if need be, because we don't want to repeat it at both levels. But it is a "matter of interpretation", as you say, because inheritance does not usually flow from child to parent.

I've already argued that LexicalEntry and Lemma are in a one-to-one relation and that the tags should be merged. We don't need them separate. The current discussion only illustrates this point, and it is virtually endless: either the PartOfSpeech is propagated down from parent to child or propagated up from the unique child to the parent.

Who cares? But one may question whether we should have a parent-child pair here.

@goodmami
Member

goodmami commented Jul 4, 2021

@1313ou thanks for the further thoughts. While I only meant my definition as an informal rule of thumb, I agree that framing it in terms of encoded information instead of capabilities is better.

I'm not convinced that merging <Lemma> and <LexicalEntry> is a good solution because, for instance, what do we do if the <Lemma> has <Tag> child elements? Do they become siblings to other <Form> elements? That doesn't seem better to me.

@1313ou

1313ou commented Jul 4, 2021

Are they distinct entities?
Is it incorrect to say a lemma
1- has (i.e. is realized as) a number of forms and
2- has a number of senses (i.e. is a member of a number of synsets)?
(I leave aside syntactic behaviour, a non-issue here)

@goodmami
Member

goodmami commented Jul 9, 2021

> Are they distinct entities?

It may help if we think of <LexicalEntry> as representing an abstract lexeme with some set of realized forms, one of which is distinguished as the canonical form, or lemma. With this in mind I think the current situation is good, except for the placement of partOfSpeech.

> Is it incorrect to say a lemma
> 1- has (i.e. is realized as) a number of forms and

I think that is incorrect as the lemma is a realized form. It's just the canonical/dictionary/citation form. Also, not all wordnets use <Lemma>/<Form> to encode inflectional variants; namely the Japanese Wordnet, which uses it to encode alternative orthographies of the lemma.

> 2- has a number of senses (i.e. is a member of a number of synsets)?
> (I leave aside syntactic behaviour, a non-issue here)

I wonder if we're talking about different things, as this seems backwards. The senses shouldn't change for alternative forms of the same lexical entry, but we could imagine that the syntactic behaviour could change (e.g., plural nouns in English not requiring a determiner). Currently we do not have a way to encode relationships between <SyntacticBehaviour> and specific forms, though.

@1313ou

1313ou commented Jul 11, 2021

> > Is it incorrect to say a lemma has (i.e. is realized as) a number of forms
>
> I think that is incorrect as the lemma is a realized form. It's just the canonical/dictionary/citation form.

I'll give you that, though the DTD fails to capture this inheritance: it just copies the element definitions, so both have Pronunciations and Tags.

I should have said 'is inflected as' or dropped the 'i.e. ..' altogether. But as you note, a lemma acts as a name ("citation"), so it stands for what it names.

Having a parent and a unique child is aesthetic in your terms. It doesn't add information. But it is ineffective in that it scatters information and more steps are required to retrieve it.

Not collapsing them would make (more) sense if multiple lemmas were allowed for a lexical entry (for instance color + colour, realize + realise), following the practice of most dictionaries. The LexicalEntry tag could then group these lemmas and give substance to the feeling that they refer to one and the same entity. The current DTD leaves no option but to have multiple separate lexical entries that are grouped only through synset membership.

Mine is a database-design principle, as often here, that seeks effectiveness, but I grant you that a point of view based on fine-grained concepts is also legitimate.
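
As a rough illustration of what grouping through synset membership amounts to, the Python sketch below collects entries that share a part of speech and exactly the same set of synsets. The sample data and the function name `variant_groups` are hypothetical. Note that plain synonyms with identical synset sets would land in the same group, so this only yields candidates; nothing in the data says that color and colour are one lexeme.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: two spelling variants encoded as separate entries.
SAMPLE = """
<Lexicon id="ex" language="en">
  <LexicalEntry id="ex-color-n">
    <Lemma writtenForm="color" partOfSpeech="n"/>
    <Sense id="ex-color-n-1" synset="ex-s1"/>
  </LexicalEntry>
  <LexicalEntry id="ex-colour-n">
    <Lemma writtenForm="colour" partOfSpeech="n"/>
    <Sense id="ex-colour-n-1" synset="ex-s1"/>
  </LexicalEntry>
  <LexicalEntry id="ex-dog-n">
    <Lemma writtenForm="dog" partOfSpeech="n"/>
    <Sense id="ex-dog-n-1" synset="ex-s2"/>
  </LexicalEntry>
</Lexicon>
"""


def variant_groups(root):
    """Group lemma forms whose entries share POS and the same synset set."""
    groups = defaultdict(list)
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        synsets = frozenset(s.get("synset") for s in entry.findall("Sense"))
        key = (lemma.get("partOfSpeech"), synsets)
        groups[key].append(lemma.get("writtenForm"))
    # Keep only keys shared by more than one entry.
    return [forms for forms in groups.values() if len(forms) > 1]


groups = variant_groups(ET.fromstring(SAMPLE))
print(groups)
```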

SyntacticBehaviour/SyntacticBehavior

As I advocated elsewhere, SyntacticBehaviour is attached to senses. As such it shouldn't be here in the first place, but further down, under the Sense tag.

Added to that, the current DTD definition can't distinguish between a reference and a definition. So it merges them into one tag with

<!ATTLIST SyntacticBehaviour
  id ID #IMPLIED
  subcategorizationFrame CDATA #REQUIRED
  senses IDREFS #IMPLIED>

This makes it mandatory to repeat 'Somebody ----s somebody' 4525 times throughout the English WordNet database, for instance. And it's too permissive, because it fails to capture that either id OR senses is required.

Otherwise, if you want a bag to put just about anything, here is the perfect fit.
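
Since the DTD cannot express the id-or-senses constraint, it would have to be checked outside the DTD. A minimal Python sketch, assuming the constraint means "at least one of id or senses is present" and only looking at <SyntacticBehaviour> elements under <LexicalEntry>; the sample data and the function name `check_behaviours` are made up for illustration. It also counts how often each frame string is repeated.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the last SyntacticBehaviour has neither id nor senses,
# which the ATTLIST above accepts because both attributes are #IMPLIED.
SAMPLE = """
<Lexicon id="ex" language="en">
  <LexicalEntry id="ex-see-v">
    <Lemma writtenForm="see" partOfSpeech="v"/>
    <Sense id="ex-see-v-1" synset="ex-s1"/>
    <SyntacticBehaviour subcategorizationFrame="Somebody ----s somebody"
                        senses="ex-see-v-1"/>
  </LexicalEntry>
  <LexicalEntry id="ex-hear-v">
    <Lemma writtenForm="hear" partOfSpeech="v"/>
    <Sense id="ex-hear-v-1" synset="ex-s2"/>
    <SyntacticBehaviour subcategorizationFrame="Somebody ----s somebody"
                        senses="ex-hear-v-1"/>
    <SyntacticBehaviour subcategorizationFrame="Somebody ----s something"/>
  </LexicalEntry>
</Lexicon>
"""


def check_behaviours(root):
    """Return (frame repetition counts, behaviours lacking id and senses)."""
    frame_counts = Counter()
    unanchored = []
    for entry in root.iter("LexicalEntry"):
        for sb in entry.findall("SyntacticBehaviour"):
            frame = sb.get("subcategorizationFrame")
            frame_counts[frame] += 1
            if sb.get("id") is None and sb.get("senses") is None:
                unanchored.append((entry.get("id"), frame))
    return frame_counts, unanchored


frame_counts, unanchored = check_behaviours(ET.fromstring(SAMPLE))
print(frame_counts.most_common())
print(unanchored)
```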
