Wrong matching for Arabic #36

curquiza · 2021-04-08T13:57:52Z

Related to meilisearch/meilisearch#1331

ahmedkrmn · 2021-10-25T15:38:21Z

Hi Clémentine, I've been trying to work on this issue. After some testing, I came to the following conclusion:

v0.1.2 [Good]: Works as expected. Searching with ا displays words with both ا and أ. Searching with أ displays words with both أ and ا. This is the expected behavior when searching Arabic words.
v0.1.3 [Bad]: Introduced in d9ee132. Searching with ا displays words with both ا and أ, but searching with أ displays neither.
v0.1.4 [Bad]: Same behavior as v0.1.3.
v0.2.0 through main [Bad]: Searching with ا displays words with ا only. Searching with أ displays words with أ only.

What do you suggest doing to fix this?

curquiza · 2021-11-03T09:55:48Z

Hello @ahmedkrmn, thanks for your interest! 😁

@ManyTheFish can help you with this when he has the time :)

ManyTheFish · 2021-11-03T13:00:22Z

Hello @ahmedkrmn, are you sure that deunicoding Arabic script is a good thing to do?
The sentence

المتعة والمرح في تعلم العربية

would be deunicoded as

lmt`@ wlmrH fy t`lm l`rby@

🤔

I can't write Arabic script, so I don't know what the correct behavior should be.

Reex11 · 2021-12-16T13:57:08Z

Hello @ManyTheFish,
I believe that the characters أ ا إ آ should be processed in a way similar to "lowercasing".
So when a user searches for a query containing, for example, احمد, they should receive all of these variations: أحمد احمد إحمد آحمد.
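
A minimal sketch of that folding, assuming the only goal is to map the three Hamza/Madda variants of Alef onto bare Alef before comparing. The names `fold_alef` and `alef_insensitive_eq` are hypothetical and this is not the tokenizer's actual code:

```rust
/// Folds the Alef variants onto bare Alef (U+0627).
/// U+0623 = Alef with Hamza above, U+0625 = Alef with Hamza below,
/// U+0622 = Alef with Madda above.
fn fold_alef(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0623}' | '\u{0625}' | '\u{0622}' => '\u{0627}',
            other => other,
        })
        .collect()
}

/// Compares two strings after folding, the same way a case-insensitive
/// comparison lowercases both sides first.
fn alef_insensitive_eq(a: &str, b: &str) -> bool {
    fold_alef(a) == fold_alef(b)
}
```

If the same folding is applied at indexing time and at query time, a query typed with any of the four forms would hit documents written with any of the others.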

ManyTheFish · 2021-12-20T14:27:14Z

Hello @Reex11, I will investigate your case. 🤔
I checked whether Rust's lowercase function could help us, but it does not:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=92777875ea531819640dfedea3d42395
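
The linked gist is not reproduced here. As a rough sketch of the same kind of check, assuming only the standard library's `str::to_lowercase` (the helper name `equal_after_lowercase` is hypothetical):

```rust
/// Returns true when lowercasing both sides is enough to make them equal.
fn equal_after_lowercase(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Arabic script has no letter case, so lowercasing leaves
/// Alef with Hamza above (U+0623) untouched instead of folding it to Alef (U+0627).
fn lowercase_changes_hamza_alef() -> bool {
    "\u{0623}".to_lowercase() != "\u{0623}"
}
```

Lowercasing works for Latin text ("Ahmed" vs "ahmed") but does nothing to distinguish or merge the Alef variants, which is why a dedicated normalization step is needed.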

Is there a name for this process that is similar to "lowercasing"? It could help me find a library or a function that does the job.

Thanks for your help 😁

Reex11 · 2021-12-20T18:14:16Z

Thank you for investigating this.
Actually, it's not literally lowercasing 😅. It's basically a kind of normalization process.
A new Arabic text processing library, Maha, was released recently. It should be very useful, as it has already done a lot of work on Arabic text normalization.

You need to know this first:
ا this letter is called Alef
ء this symbol is called Hamza
أ this letter is called Alef with Hamza above

This library calls the process normalization, which I believe is the right term.
Here you can find Alef Variations
And here you can find what is called Alef Variations Normalization

I'll dig around to see if there's anything else to consider.

Reex11 · 2021-12-20T18:32:33Z

Hi again,
I found the following:

Harakat, like these َ ِ ُ, should be ignored entirely, so removing them should be part of the normalization process.
The Ta' Marbota letter ة should be normalized to the Ha' letter ه.
The Waw letter و is usually a stop word (it means "and"), but there may be an issue here because Waw is not always a stop word.
For example, in سماء وأرض the Waw is a stop word (translation: "sky and earth").
In other cases it is not a stop word. For example, in كتاب وليد the Waw is part of an actual word (translation: "Waleed's book").
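
The first two rules above can be sketched as follows. This is only an illustration: it assumes Harakat means the code points U+064B to U+0652 (tanwin, short vowels, shadda, sukun), and the names `is_haraka` and `normalize_marks` are hypothetical:

```rust
/// True for the combining marks U+064B..=U+0652 (assumed Harakat range).
fn is_haraka(c: char) -> bool {
    ('\u{064B}'..='\u{0652}').contains(&c)
}

/// Drops Harakat and maps Ta' Marbota (U+0629) to Ha' (U+0647).
fn normalize_marks(input: &str) -> String {
    input
        .chars()
        .filter(|c| !is_haraka(*c))
        .map(|c| if c == '\u{0629}' { '\u{0647}' } else { c })
        .collect()
}
```

The Waw rule is deliberately left out here, since it cannot be decided character by character.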

I'll look for a solution to the Waw stop-word problem, and I already have some workarounds in mind.
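
One possible way to handle the ambiguity, offered only as an assumption and not as the workaround meant above, is to keep both readings of a token that starts with Waw and let the index decide which one exists. The function name `waw_candidates` and the minimum length of two remaining characters are hypothetical choices:

```rust
/// Returns the token itself plus, when it starts with Waw (U+0648) and
/// at least two characters remain, the token with that Waw removed.
fn waw_candidates(token: &str) -> Vec<String> {
    let mut candidates = vec![token.to_string()];
    if let Some(rest) = token.strip_prefix('\u{0648}') {
        if rest.chars().count() >= 2 {
            candidates.push(rest.to_string());
        }
    }
    candidates
}
```

With this approach both وأرض and وليد yield two candidates. Nothing is discarded, so a name such as وليد still matches in its full form.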
I understand that some parts of the language may be difficult to follow, so let me know if you need any help.

ManyTheFish · 2021-12-22T15:52:04Z

Hello @Reex11! Thanks for your help. We will have to design or find a specialized normalizer for this.
I have a question about tokenization: are words only space-separated?

Reex11 · 2021-12-22T23:17:08Z

Hi @ManyTheFish,
First, you should know that I have only basic knowledge of NLP.

I think there are a lot of cases where words are not space-separated.
But it's OK to start with space separation (I believe this is the general approach in the Arabic-capable tokenizers I have seen).
However, there are some important and common cases that need to be considered to improve the search results,
such as And => و , The => الـ

Example:
الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The and it's always connected (not space-separated) to the next word.
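
A naive sketch of that split, assuming a plain prefix check on Alef + Lam (U+0627 U+0644) and a hypothetical function name `split_definite_article`. A real segmenter would need a dictionary or morphological analysis, because some words simply begin with those two letters:

```rust
/// Splits a leading definite article (Alef + Lam) from the rest of the token.
/// Returns (None, token) when there is no article or when fewer than two
/// characters would remain after removing it.
fn split_definite_article(token: &str) -> (Option<&str>, &str) {
    const AL: &str = "\u{0627}\u{0644}";
    match token.strip_prefix(AL) {
        Some(rest) if rest.chars().count() >= 2 => (Some(&token[..AL.len()]), rest),
        _ => (None, token),
    }
}
```

Applied after whitespace splitting, this would let a query for شجرة also reach documents that contain الشجرة.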

I found a great Arabic NLP library called CAMeL Tools. I think it's the best so far.

curquiza · 2022-05-18T13:51:34Z

Closed in favor of meilisearch/product#139
Any contribution to add an Arabic normalizer and segmenter is welcome!

curquiza added the hacktoberfest label Oct 6, 2021

meili-bot removed the hacktoberfest label Nov 4, 2021

curquiza mentioned this issue Mar 2, 2022

Wrong matching in Arabic characters for vesions 0.18.1+ meilisearch/meilisearch#1331

Closed

curquiza closed this as completed May 18, 2022

ManyTheFish mentioned this issue Sep 29, 2022

Arabic script: Implement specialized Segmenter #133

Closed
