Replies: 2 comments 1 reply
-
Category links can be created by templates or added manually as links in the wikitext. Some category links are hidden when viewing a page and are only displayed on the edit page, and some categories don't have any page. Category links added to the "senses" list are created from the wikitext of that sense's list item. The extractor code might also fail to add a category in some part of the page. First we need to find where the link was created, then find the code that processes that wikitext.
Wikitext category doc: https://www.mediawiki.org/wiki/Help:Categories
wiktextract/src/wiktextract/page.py Lines 397 to 409 in 0ee4b87
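To check whether a category was added manually, one option is to scan the raw wikitext for explicit category links. The following is only a rough sketch, not wiktextract's own code: the helper name `explicit_categories` and the regular expression are assumptions made for illustration, and categories produced by template expansion will not show up this way.

```python
import re

# Matches explicit links such as [[Category:English nouns]] or
# [[Category:English nouns|sort key]]. A link written with a leading colon
# ([[:Category:...]]) only links to the category page, so it is not matched.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def explicit_categories(wikitext: str) -> list[str]:
    """Return category names that are written directly in the wikitext.

    Hypothetical helper: it does not expand templates, so any category
    that a template adds is invisible to it.
    """
    return [match.group(1) for match in CATEGORY_LINK.finditer(wikitext)]
```

If a category shows up on the page but this returns nothing for the page's wikitext, the link most likely comes from a template.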
-
Every json object I have examined seems to have a category entry "English entries with incorrect language header." In wiktionary, this appears to be a group with no members. What is the purpose of this category? And why is it listed for most (all?) entries in the English extract? Thanks!
-
If there is any documentation about this, I would love to read it. Or, if the code isn't too complicated, maybe you can point me to the source code that creates the "category" entries?
How does the extract process determine when to add a category element to the top-level "categories" dict or to a category nested inside a sense entry? On the web page all categories are listed at the very bottom of the page. I had interpreted that to mean they are global (top-level) for that word.
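To see which categories sit at which level for a given word, something like the following could be run on one parsed JSON object from the extract. This is a sketch under assumptions: it assumes each entry is a dict with an optional top-level "categories" list and a "senses" list whose items may carry their own "categories", and that a category item is either a plain string or a dict with a "name" key. The function names are made up for illustration.

```python
def category_name(item):
    """Return the category name whether the item is a string or a dict.

    Assumption: dict-style items store the name under the "name" key.
    """
    return item["name"] if isinstance(item, dict) else item


def categories_by_level(entry: dict) -> dict[str, set[str]]:
    """Split an entry's categories into top-level and sense-level sets."""
    top = {category_name(c) for c in entry.get("categories", [])}
    in_senses = set()
    for sense in entry.get("senses", []):
        in_senses.update(category_name(c) for c in sense.get("categories", []))
    return {"top": top, "senses": in_senses}
```

Calling `categories_by_level(json.loads(line))` on each line of the extract would show at a glance where a given category ended up.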
Also, I have noticed a discrepancy in the category entries in the json compared to the wikt web page for a word.
First, every json object I have examined seems to have a category entry "English entries with incorrect language header." In wiktionary, this appears to be a group with no members. What is the purpose of this category? And why is it listed for most (all?) entries in the English extract?
I have also noticed that often there are other categories on the web page that are not present in the json, and there are categories in the json not present on the web page. Examples for two words (estrogen and fun) follow:
estrogen

json Categories not on Wikt page:
- English entries with incorrect language header
- Pages with entries
- Pages with 4 entries
- Entries with translation boxes

Wikt categories missing from json:
- English terms with audio pronunciation
- English 3-syllable words
- English terms with IPA pronunciation

fun

json Categories not on Wikt page:
- English entries with incorrect language header
- Entries with translation boxes
- Pages with 8 entries
- Pages with entries

Wikt categories missing from json:
- English 1-syllable words
- English terms with IPA pronunciation
- English terms with audio pronunciation
- English terms with quotations
- English informal terms
- English terms with usage examples
- English colloquialisms
Why are some wikt page categories omitted from the json objects?
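For reference, a comparison like the one above can be reproduced with plain set differences. This is a minimal sketch: the function name `compare_categories` is made up, and the two input lists are assumed to have been collected separately (one from the json object, one copied from the bottom of the web page).

```python
def compare_categories(json_categories, page_categories):
    """Report categories that appear on only one side of the comparison."""
    in_json = set(json_categories)
    on_page = set(page_categories)
    return {
        "only_in_json": sorted(in_json - on_page),
        "only_on_page": sorted(on_page - in_json),
    }


# Illustrative input: "English lemmas" stands in for a category shared by
# both sides; the other names are taken from the lists in this post.
report = compare_categories(
    ["English lemmas", "Pages with entries", "Entries with translation boxes"],
    ["English lemmas", "English 3-syllable words"],
)
```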
One of the ideas I have for this wikt data extract is to create a rhyming dictionary. Besides the IPA, the "English X-syllable words" might be useful to me for this purpose.
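As a rough sketch of how I would use that category, assuming it were present in an entry's category names: the pattern below follows the "English X-syllable words" naming seen on the web page, and the function name is just for illustration.

```python
import re

# Matches category names such as "English 3-syllable words".
SYLLABLE_CATEGORY = re.compile(r"^English (\d+)-syllable words$")


def syllable_count(category_names):
    """Return the syllable count encoded in a category name, or None."""
    for name in category_names:
        match = SYLLABLE_CATEGORY.match(name)
        if match:
            return int(match.group(1))
    return None
```

Words whose entries lack such a category would return None and would need a fallback, for example counting syllables from the IPA.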
Thank you!