Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wiktionary: refactor/bugfix parsing all possible parts of speech #1443

Merged

Conversation

HumorBaby
Copy link
Contributor

@HumorBaby HumorBaby commented Jan 10, 2019

The refactor comes from changing the control flow, and the bugfix is the ability of the module to parse more parts of speech. While this may not be a bug per se, I'd think not being able to get definitions for a word/term that is actually in Wiktionary is not a preferred behavior (pardon the double negative 😅).

Regardless, this PR is opened to address the points that came up in #1405:

  • parse all parts of speech

  • give subdefinitions like those for -able their own sub-number (e.g., 1a., 1b., ...)

  • update formatting algo to be smarter about sub-definitions and such

  • re-order parts of speech list so that definitions can be prioritized... it would be a shame if some obscure circumposition (A circumposition is a discontinuous adposition... um wat?) definition led to the more common verb definition being truncated.

I'll put the commit messages as comments below to get a prettified output of what was done (and since I'll probably overwrite/summarize them in a squash, if/when it is time)

EDIT: The removed (strikethrough) enhancements will be moved to another issue.

EDIT 2 (@dgw): The removed enhancements were moved to #1593.

@HumorBaby
Copy link
Contributor Author

refactor(wiktionary): improve control flow

Move all parts of speech to constant to make iteration easier; this helps avoid the messy if-elif-else chain that results when parsing for defintions, and makes it easier to change acceptable parts of speech if/when Wiktionary changes specifications.

Note: not all parts of speech are actually parts of speech, but they are listed in the specification under the "Parts of speech" heading, so just easier to call them that. See the comments in the PARTS_OF_SPEECH constant for details.

For now, this refactor does not break any old features, but also does not yet correctly parse the newly added parts of speech (e.g., phrases, proper nouns).

@HumorBaby
Copy link
Contributor Author

fix(wiktionary): parse definitions for all parts of speech

With this commit, the wiktionary module now provides defintions for almost all "parts of speech" listed on the Wiktionary specifications page.

I say almost and quote "parts of speech" above because things that are not really parts of speech do not work well, e.g., the "Han character" type, which is likely due to a unicode translation issue that could happen anywhere from the client to the web request; have not specifically pinned it down. Symbols like Δ and ⰴ, and diacritical marks like ῀ work, however (at least from weechat).

Fun new definitions for things like "Prepositional phrases", "Proper nouns", "Articles", and more are now provided. Try some:

<HumorBaby> .wt Paul
<ro-bot-o> Paul — proper noun: 1. In the New Testament, Apostle to the Gentiles and author of fourteen epistles, 2. A male given name of biblical origin, 3. A patronymic surname​

<HumorBaby> .wt out of hand
<ro-bot-o> out of hand — adjective: 1. Not under control — adverb: 1. Without (further) thought or consideration, 2. (now  rare) Immediately, forthwith, or incontinently

<HumorBaby>.wt on the other hand
<ro-bot-o> on the other hand — prepositional phrase: 1. (sequence, idiomatic) From another point of view

<HumorBaby> .wt in time
<ro-bot-o> in time — prepositional phrase: 1. At or before the time assigned, 2. (with for) Sufficiently early, 3. As time passes, 4. In rhythm, 5. (in time with) At the same rhythm as

<HumorBaby> .wt the
<ro-bot-o> the — adverb: 1. With a comparative or more and a verb phrase, establishes a parallel with one or more other such comparatives, 2. With a comparative, and often with for it, indicates a result more like said comparative. This can be negated with none — article: 1. Definite grammatical article t[...]

<HumorBaby> .wt A
<ro-bot-o> A — letter: 1. The first letter of the basic modern Latin alphabet — symbol: 1. Marks the first item in a list, 2. A hypothetical item or person designated the first, usually when there are more than one

<HumorBaby> .wt ⰴ
 <ro-bot-o> ⰴ — letter: 1. The fifth letter of the Glagolitic alphabet

<HumorBaby> .wt !
<ro-bot-o> ! — punctuation mark: 1. Denoting excitement, surprise, or shock, 2. Expresses excitement, anger or other emotion — symbol: 1. (mathematics) The factorial: the result of a multiplication of all integers from 1 to the given number, 2. (chess) Indicates a good move

@HumorBaby HumorBaby force-pushed the refactor-wiktionary-pos-subdefinitions branch 4 times, most recently from 884af84 to 88ac121 Compare January 10, 2019 16:50
@dgw dgw added this to the 7.0.0 milestone Jan 29, 2019
@dgw dgw added Feature Bugfix Generally, PRs that reference (and fix) one or more issue(s) labels Jan 29, 2019
@dgw
Copy link
Member

dgw commented Apr 25, 2019

@HumorBaby This looks WIP, 'cause of the task list showing "1 of 4" in the PR list. Is it ready or did you intend to add more stuff before review?

@HumorBaby
Copy link
Contributor Author

This final result of this PR is to address an issue related to #1214, where definitions were not being returned unless the exact case was provided. While the relaxation of the case requirement has been addressed in #1442, the case where a word has different cases for different definitions was not directly addressed.

One such example is Paul vs paul. One is a proper noun, while the other is a common noun. Before this PR, only the definition of paul would be returned despite a query like .wt Paul (which before #1442 would have failed outright). Despite having a definition in Wiktionary, Paul would never return the desired definition because all the cataloged parts of speech were not parsed (proper nouns, being one). Now, all parts of speech tracked by Wiktionary are parsed (which not only solves the issue but has the side-effect of providing definitions for a whole new set of queries!)

So, how about a review @sopel-irc/rockstars??

Copy link
Member

@dgw dgw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looks good as-is, then.

Maybe another of @sopel-irc/rockstars will still get around to reviewing this, so I'll leave it as "Needs Review" for a bit—but you should be good to squash your commits if you still want to (as was mentioned in the top comment).

(If you don't want to squash them: you've a typo "defintions" in the refactor commit message. 😉)

Copy link
Contributor

@Exirel Exirel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 and 🔧 it later if needed.

@dgw dgw removed the Needs Review label Apr 25, 2019
Move all parts of speech to constant to make iteration easier; this
helps avoid the messy `if-elif-else` chain that results when parsing
for definitions, and makes it easier to change acceptable parts of
speech if/when Wiktionary changes specifications.

_Note: not all parts of speech are actually parts of speech, but they
are listed in the specification under the "Parts of speech" heading, so
just easier to call them that. See the comments in the_ `PARTS_OF_SPEECH`
_constant for details._

For now, this refactor does not break any old features, but also
does not yet correctly parse the newly added parts of speech
(e.g., phrases, proper nouns).
With this commit, the wiktionary module now provides definitions for
_almost_ all "parts of speech" listed on the [Wiktionary specifications
page](https://en.wiktionary.org/wiki/Wiktionary:Entry_layout#Part_of_speech).

I say _almost_ and quote "parts of speech" above because things that are
not really parts of speech do not work well, e.g., the "Han character"
type, which is likely due to a unicode translation issue that could
happen anywhere from the client to the web request; have not
specifically pinned it down. Symbols like Δ and ⰴ, and diacritical marks
like ῀ work, however (at least from weechat).

Fun new definitions for things like "Prepositional phrases", "Proper
nouns", "Articles", and more are now provided. Try some:

```
<HumorBaby> .wt Paul
<ro-bot-o> Paul — proper noun: 1. In the New Testament, Apostle to the Gentiles and author of fourteen epistles, 2. A male given name of biblical origin, 3. A patronymic surname​

<HumorBaby> .wt out of hand
<ro-bot-o> out of hand — adjective: 1. Not under control — adverb: 1. Without (further) thought or consideration, 2. (now  rare) Immediately, forthwith, or incontinently

<HumorBaby>.wt on the other hand
<ro-bot-o> on the other hand — prepositional phrase: 1. (sequence, idiomatic) From another point of view

<HumorBaby> .wt in time
<ro-bot-o> in time — prepositional phrase: 1. At or before the time assigned, 2. (with for) Sufficiently early, 3. As time passes, 4. In rhythm, 5. (in time with) At the same rhythm as

<HumorBaby> .wt the
<ro-bot-o> the — adverb: 1. With a comparative or more and a verb phrase, establishes a parallel with one or more other such comparatives, 2. With a comparative, and often with for it, indicates a result more like said comparative. This can be negated with none — article: 1. Definite grammatical article t[...]

<HumorBaby> .wt A
<ro-bot-o> A — letter: 1. The first letter of the basic modern Latin alphabet — symbol: 1. Marks the first item in a list, 2. A hypothetical item or person designated the first, usually when there are more than one

<HumorBaby> .wt ⰴ
<ro-bot-o> ⰴ — letter: 1. The fifth letter of the Glagolitic alphabet

<HumorBaby> .wt !
<ro-bot-o> ! — punctuation mark: 1. Denoting excitement, surprise, or shock, 2. Expresses excitement, anger or other emotion — symbol: 1. (mathematics) The factorial: the result of a multiplication of all integers from 1 to the given number, 2. (chess) Indicates a good move
```
@HumorBaby HumorBaby force-pushed the refactor-wiktionary-pos-subdefinitions branch from 88ac121 to 752b604 Compare April 25, 2019 23:12
@HumorBaby
Copy link
Contributor Author

Boy, I can't spell defintions definitions at all... fixed in both commit messages.

@Exirel
Copy link
Contributor

Exirel commented May 8, 2019

@dgw I'm shamelessly invoking you. No reason. At all. Nah just kidding: I'm scanning through oldest PR available for 7.x.

You can probably merge that "tomorrow" for your github streak. 😉

@dgw dgw merged commit a36a434 into sopel-irc:master May 10, 2019
@dgw dgw mentioned this pull request May 10, 2019
7 tasks
@HumorBaby HumorBaby deleted the refactor-wiktionary-pos-subdefinitions branch August 28, 2019 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bugfix Generally, PRs that reference (and fix) one or more issue(s) Feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants