-
-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wiktionary: refactor/bugfix parsing all possible parts of speech #1443
wiktionary: refactor/bugfix parsing all possible parts of speech #1443
Conversation
refactor(wiktionary): improve control flowMove all parts of speech to constant to make iteration easier; this helps avoid the messy Note: not all parts of speech are actually parts of speech, but they are listed in the specification under the "Parts of speech" heading, so just easier to call them that. See the comments in the For now, this refactor does not break any old features, but also does not yet correctly parse the newly added parts of speech (e.g., phrases, proper nouns). |
fix(wiktionary): parse definitions for all parts of speechWith this commit, the wiktionary module now provides defintions for almost all "parts of speech" listed on the Wiktionary specifications page. I say almost and quote "parts of speech" above because things that are not really parts of speech do not work well, e.g., the "Han character" type, which is likely due to a unicode translation issue that could happen anywhere from the client to the web request; have not specifically pinned it down. Symbols like Δ and ⰴ, and diacritical marks like ῀ work, however (at least from weechat). Fun new definitions for things like "Prepositional phrases", "Proper nouns", "Articles", and more are now provided. Try some:
|
884af84
to
88ac121
Compare
@HumorBaby This looks WIP, 'cause of the task list showing "1 of 4" in the PR list. Is it ready or did you intend to add more stuff before review? |
This final result of this PR is to address an issue related to #1214, where definitions were not being returned unless the exact case was provided. While the relaxation of the case requirement has been addressed in #1442, the case where a word has different cases for different definitions was not directly addressed. One such example is So, how about a review @sopel-irc/rockstars?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, looks good as-is, then.
Maybe another of @sopel-irc/rockstars will still get around to reviewing this, so I'll leave it as "Needs Review" for a bit—but you should be good to squash your commits if you still want to (as was mentioned in the top comment).
(If you don't want to squash them: you've a typo "defintions" in the refactor commit message. 😉)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢 and 🔧 it later if needed.
Move all parts of speech to constant to make iteration easier; this helps avoid the messy `if-elif-else` chain that results when parsing for definitions, and makes it easier to change acceptable parts of speech if/when Wiktionary changes specifications. _Note: not all parts of speech are actually parts of speech, but they are listed in the specification under the "Parts of speech" heading, so just easier to call them that. See the comments in the_ `PARTS_OF_SPEECH` _constant for details._ For now, this refactor does not break any old features, but also does not yet correctly parse the newly added parts of speech (e.g., phrases, proper nouns).
With this commit, the wiktionary module now provides definitions for _almost_ all "parts of speech" listed on the [Wiktionary specifications page](https://en.wiktionary.org/wiki/Wiktionary:Entry_layout#Part_of_speech). I say _almost_ and quote "parts of speech" above because things that are not really parts of speech do not work well, e.g., the "Han character" type, which is likely due to a unicode translation issue that could happen anywhere from the client to the web request; have not specifically pinned it down. Symbols like Δ and ⰴ, and diacritical marks like ῀ work, however (at least from weechat). Fun new definitions for things like "Prepositional phrases", "Proper nouns", "Articles", and more are now provided. Try some: ``` <HumorBaby> .wt Paul <ro-bot-o> Paul — proper noun: 1. In the New Testament, Apostle to the Gentiles and author of fourteen epistles, 2. A male given name of biblical origin, 3. A patronymic surname <HumorBaby> .wt out of hand <ro-bot-o> out of hand — adjective: 1. Not under control — adverb: 1. Without (further) thought or consideration, 2. (now rare) Immediately, forthwith, or incontinently <HumorBaby>.wt on the other hand <ro-bot-o> on the other hand — prepositional phrase: 1. (sequence, idiomatic) From another point of view <HumorBaby> .wt in time <ro-bot-o> in time — prepositional phrase: 1. At or before the time assigned, 2. (with for) Sufficiently early, 3. As time passes, 4. In rhythm, 5. (in time with) At the same rhythm as <HumorBaby> .wt the <ro-bot-o> the — adverb: 1. With a comparative or more and a verb phrase, establishes a parallel with one or more other such comparatives, 2. With a comparative, and often with for it, indicates a result more like said comparative. This can be negated with none — article: 1. Definite grammatical article t[...] <HumorBaby> .wt A <ro-bot-o> A — letter: 1. The first letter of the basic modern Latin alphabet — symbol: 1. Marks the first item in a list, 2. A hypothetical item or person designated the first, usually when there are more than one <HumorBaby> .wt ⰴ <ro-bot-o> ⰴ — letter: 1. The fifth letter of the Glagolitic alphabet <HumorBaby> .wt ! <ro-bot-o> ! — punctuation mark: 1. Denoting excitement, surprise, or shock, 2. Expresses excitement, anger or other emotion — symbol: 1. (mathematics) The factorial: the result of a multiplication of all integers from 1 to the given number, 2. (chess) Indicates a good move ```
88ac121
to
752b604
Compare
Boy, I can't spell |
@dgw I'm shamelessly invoking you. No reason. At all. Nah just kidding: I'm scanning through oldest PR available for 7.x. You can probably merge that "tomorrow" for your github streak. 😉 |
The refactor comes from changing the control flow, and the bugfix is the ability of the module to parse more parts of speech. While this may not be a bug per se, I'd think not being able to get definitions for a word/term that is actually in Wiktionary is not a preferred behavior (pardon the double negative 😅).
Regardless, this PR is opened to address the points that came up in #1405:
parse all parts of speech
give subdefinitions like those for-able
their own sub-number (e.g., 1a., 1b., ...)update formatting algo to be smarter about sub-definitions and suchre-order parts of speech list so that definitions can be prioritized... it would be a shame if some obscure circumposition (A circumposition is a discontinuous adposition... um wat?) definition led to the more common verb definition being truncated.I'll put the commit messages as comments below to get a prettified output of what was done (and since I'll probably overwrite/summarize them in a squash, if/when it is time)
EDIT: The removed (strikethrough) enhancements will be moved to another issue.
EDIT 2 (@dgw): The removed enhancements were moved to #1593.