Possibility of removing Arabic diacritics from the headwords #366

Closed
sobaee opened this issue Feb 26, 2022 · 22 comments
@sobaee

sobaee commented Feb 26, 2022

Hello Saeed

I have an Ar-En glossary whose headwords contain not only Arabic letters but also diacritics (a.k.a. tashkeel / harakat) above or below each letter. This makes searching the dictionary very hard.

I need an option to strip the Arabic diacritics from the headwords during conversion while keeping the letters (and without affecting the definitions).
I think there are ways to do this in Python, but I'm a beginner 😰.

This is the Ar-En Morphology dictionary:
https://mega.nz/file/bdFyAZQD#QgHPND-rqbsICW3rMKhn6UaA4qt10VQvwllN08VoGK4

See this:
Screenshot_20220226164233

@ilius ilius added the Feature label Feb 27, 2022
@ilius
Owner

ilius commented Feb 27, 2022

Should we keep the original headword after the trimmed headword, to prevent duplicate headwords?
For example:

أخذ (أَخَذ)

@sobaee
Author

sobaee commented Feb 27, 2022

No
Just:
أخذ

Why?

Because I will convert it first with "title headwords" then the 2nd step will be to remove the diacritics from the headwords

The final result will be like this:
Screenshot_20220227162555

Okay?

@ilius
Owner

ilius commented Feb 27, 2022

> Because I will convert it first with "title headwords" then the 2nd step will be to remove the diacritics from the headwords

You mean write-option word_title?
That only changes definition, not headword.

I mean what if there are two entries with headwords that will be the same if we remove diacritics. For example کَتَب (to write) and کُتُب (books).

@sobaee
Author

sobaee commented Feb 27, 2022

> > Because I will convert it first with "title headwords" then the 2nd step will be to remove the diacritics from the headwords
>
> You mean write-option word_title? That only changes definition, not headword.
>
> I mean what if there are two entries with headwords that will be the same if we remove diacritics. For example کَتَب (to write) and کُتُب (books).

Can they be alternatives for the same search word?

I mean, when I type كتب in the search box,
I find both كَتَب
and كُتِب

@ilius
Owner

ilius commented Feb 27, 2022

> Can they be alternatives for the same search word?

Yes, but some dictionaries / formats don't support alternates.

Also why two steps? You convert twice?
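
The collision and the "alternates" idea discussed above can be illustrated with a small sketch. This is not PyGlossary's code: `strip_harakat` and `group_by_bare_headword` are hypothetical helper names, and the character range used is an assumption about what counts as a diacritic here. It only shows that two differently vocalized headwords collapse to the same bare form, so they would have to be kept as alternates of one search word or they would clash.

```python
import re
from collections import defaultdict

# Assumption: "diacritics" means the Arabic marks U+064B..U+0652
# (tanwin, fatha, damma, kasra, shadda, sukun).
HARAKAT = re.compile("[\u064B-\u0652]")


def strip_harakat(word):
    """Return the word with the marks in HARAKAT removed (hypothetical helper)."""
    return HARAKAT.sub("", word)


def group_by_bare_headword(headwords):
    """Map each bare headword to the list of original headwords that reduce to it."""
    groups = defaultdict(list)
    for hw in headwords:
        groups[strip_harakat(hw)].append(hw)
    return dict(groups)


kataba = "\u0643\u064E\u062A\u064E\u0628"  # kaf+fatha, ta+fatha, ba ("to write")
kutub = "\u0643\u064F\u062A\u064F\u0628"   # kaf+damma, ta+damma, ba ("books")
bare = "\u0643\u062A\u0628"                # kaf, ta, ba without any marks

groups = group_by_bare_headword([kataba, kutub])
print(groups)
```

Both entries end up under the single key `bare`, which is exactly the duplicate-headword situation raised in this comment.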

@ilius
Owner

ilius commented Feb 27, 2022

Most dictionary apps do support prefix searching, I guess, no?
More of them than support alternates.

Because this change doesn't have to be format-specific.
I can add a global flag / config parameter.

@sobaee
Author

sobaee commented Feb 27, 2022

Because when I use multiple dictionaries and type أخذ
then press OK, I will not find أخذ (أُخِذ)

I know that I would find it if I scrolled with the arrow keys before pressing Enter!

@ilius
Owner

ilius commented Feb 27, 2022

You want to convert to slob?

@sobaee
Author

sobaee commented Feb 27, 2022

> You want to convert to slob?

Slob, or txt,

But I will need to convert it after that to mdx 🤷‍♂️

@ilius
Owner

ilius commented Feb 27, 2022

Okay.
We don't want to remove "tashdeed", do we?

@ilius
Owner

ilius commented Feb 27, 2022

What about Tanwin / Nunation?

@sobaee
Author

sobaee commented Feb 27, 2022

> What about Tanwin / Nunation?

Yes, remove them all, but keep "ؤ" "ئ" "ء" "ه" "ة"
Because they are letters themselves

As for آ and أ, replace them with "ا" (this is very important for improving Arabic word search across multiple Arabic dictionaries)
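
The rules requested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the eventual flag: `normalize_headword` is a hypothetical name, the diacritics are assumed to be the marks U+064B..U+0652 (so shadda and tanwin are removed along with the short vowels), and only the two alef forms named in this comment are mapped to bare alef.

```python
import re

# Assumption: marks to strip are U+064B..U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Alef with madda (U+0622) and alef with hamza above (U+0623) -> bare alef (U+0627).
ALEF_MAP = str.maketrans({"\u0622": "\u0627", "\u0623": "\u0627"})


def normalize_headword(word):
    """Strip diacritics, then unify the two alef variants (hypothetical helper)."""
    return DIACRITICS.sub("", word).translate(ALEF_MAP)


akhadha = "\u0623\u064E\u062E\u064E\u0630"       # alef-hamza+fatha, kha+fatha, dhal
kept_letters = "\u0624\u0626\u0621\u0647\u0629"  # waw-hamza, ya-hamza, hamza, ha, ta marbuta

print(normalize_headword(akhadha))
print(normalize_headword(kept_letters) == kept_letters)
```

The letters listed as "letters themselves" fall outside both the stripped range and the alef mapping, so they pass through unchanged.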

@sobaee
Author

sobaee commented Feb 27, 2022

It would be great if anyone could search the Arabic Mo'ajams without needing to type all these diacritics

I hope and I'm sure you can do this "In Sha'a Allah"

@ilius
Owner

ilius commented Feb 27, 2022

I pushed to this branch:
https://github.com/ilius/pyglossary/tree/arabic-diacritics

Add flag --trim-arabic-diacritics to your command.

@sobaee
Author

sobaee commented Feb 27, 2022

Great 🙏
Thank you very much

I will try it right now

Is "replacement of آ and أ with ا" included too?

@ilius
Owner

ilius commented Feb 27, 2022

> Is "replacement of آ and أ with ا" included too?

Yes

@sobaee
Author

sobaee commented Feb 27, 2022

Perfectly done

Thanks Saeed

I appreciate that

You are a lifesaver

Oh my god, you have just saved the lives of about 10 previously unusable Ar-En dictionaries 😍

@sobaee sobaee closed this as completed Feb 27, 2022
@sobaee
Author

sobaee commented Feb 27, 2022

Sorry Saeed

Yes, this worked with the Ar-En Morphology dictionary.

But with other dictionaries it fails with this error:

python main.py Almawrid_Plus_ar-en.mdx Almawrid_Plus_ar-en.txt --trim-arabic-diacritics
[INFO] Found 0 mdd files with 0 entries
[INFO] extracting links...
[INFO] extracting links done, sizeof(linksDict)=73816
[INFO] wordCount = 55572
[INFO] Failed to detect sourceLang and targetLang from glossary name "b'Almawrid_Plus_ar-en'"
[INFO] Writing to Tabfile file '/storage/emulated/0/pyglossary-master/Almawrid_Plus_ar-en.txt'
[ERROR] Exception while calling plugin's write function
Traceback (most recent call last):
File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 850, in _write
for entry in self:
File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 316, in _readersEntryGen
yield from self._applyEntryFiltersGen(reader)
File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 333, in _applyEntryFiltersGen
entry = entryFilter.run(entry)
File "/storage/emulated/0/pyglossary-master/pyglossary/entry_filters.py", line 341, in run
entry._word = [hw_t] + words
TypeError: can only concatenate list (not "tuple") to list
Traceback (most recent call last):
File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 850, in _write
for entry in self:
File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 316, in _readersEntryGen
yield from self._applyEntryFiltersGen(reader)
File "/storage/emulated/0/pyglossary-master/pyglossary/glossary.py", line 333, in _applyEntryFiltersGen
entry = entryFilter.run(entry)
File "/storage/emulated/0/pyglossary-master/pyglossary/entry_filters.py", line 341, in run
entry._word = [hw_t] + words
TypeError: can only concatenate list (not "tuple") to list

[CRITICAL] Writing file 'Almawrid_Plus_ar-en.txt' failed.

Please download this dictionary and try:

https://mega.nz/file/rUtCCB4T#ShmvtlDth0h_ANcNm1-Xq-laolz7g3lehRktCGmMF3I

Maybe the problem is that some headwords conflict with each other.

Please find a solution 🥺
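
The `TypeError` in the traceback above comes from the expression `[hw_t] + words`: Python's `+` on a list only accepts another list, so it fails whenever the existing headwords arrive as a tuple. The sketch below reproduces that failure in isolation and shows one way to make the concatenation tolerant of both types. It is an assumption about the cause based only on the traceback, `prepend_headword` is a hypothetical helper, and the actual fix pushed to the branch is not shown in this thread.

```python
def prepend_headword(trimmed, words):
    """Return a new list with the trimmed headword placed before the existing ones.

    `words` may be a list or a tuple; list(...) makes the concatenation valid for both.
    """
    return [trimmed] + list(words)


# A tuple holding one diacritized headword, as some readers might provide it.
words = ("\u0623\u064E\u062E\u064E\u0630",)
trimmed = "\u0627\u062E\u0630"

# Reproduce the failure mode from the traceback.
error_message = None
try:
    broken = [trimmed] + words
except TypeError as exc:
    error_message = str(exc)

fixed = prepend_headword(trimmed, words)
print(error_message)
print(fixed)
```

This would also explain why the first dictionary converted cleanly while this one did not: the two inputs may simply have produced their headwords in different container types.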

ilius added a commit that referenced this issue Feb 27, 2022
@ilius
Owner

ilius commented Feb 27, 2022

I updated the branch.

@sobaee
Author

sobaee commented Feb 27, 2022

Perfectly done

Many thanks

@ilius
Owner

ilius commented Feb 28, 2022

I pushed to master branch.

@sobaee
Author

sobaee commented Feb 28, 2022

Great
Thanks a lot 🙏
