Wrong matching for Arabic #36

Closed
curquiza opened this issue Apr 8, 2021 · 10 comments

Comments

@curquiza
Member

curquiza commented Apr 8, 2021

Related to meilisearch/meilisearch#1331

@ahmedkrmn

Hi Clémentine, I've been trying to work on this issue. After some testing, I came to the following conclusion:

  • v0.1.2 [Good]: Works as expected. Searching with ا displays words with both ا and أ. Searching with أ displays words with both أ and ا. This is the expected behavior when searching Arabic words.
  • v0.1.3 [Bad]: Introduced in d9ee132. Searching with ا displays words with both ا and أ, but searching with أ displays neither.
  • v0.1.4 [Bad]: Same behavior as v0.1.3.
  • v0.2.0 through main [Bad]: Searching with ا displays words with ا only. Searching with أ displays words with أ only.
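
The three behaviors listed above can be reproduced with a toy matcher. This is an illustration only, not a diagnosis of d9ee132: `fold_alef`, `search`, and the two boolean switches are hypothetical names, and the assumption is that folding can be applied independently to the indexed text and to the query. Folding both sides gives the v0.1.2 behavior, folding only the indexed side gives the asymmetric v0.1.3 pattern, and folding neither gives the v0.2.0 pattern.

```rust
/// Hypothetical folding step: map Alef with Hamza above (U+0623) to bare Alef (U+0627).
fn fold_alef(text: &str) -> String {
    text.chars()
        .map(|c| if c == '\u{0623}' { '\u{0627}' } else { c })
        .collect()
}

/// Hypothetical substring matcher. `normalize_index` and `normalize_query`
/// decide whether folding is applied to the documents and to the query.
fn search<'a>(
    docs: &[&'a str],
    query: &str,
    normalize_index: bool,
    normalize_query: bool,
) -> Vec<&'a str> {
    let q = if normalize_query { fold_alef(query) } else { query.to_string() };
    docs.iter()
        .copied()
        .filter(|doc| {
            let indexed = if normalize_index { fold_alef(doc) } else { doc.to_string() };
            indexed.contains(&q)
        })
        .collect()
}
```

With the two documents احمد and أحمد, `search(&docs, query, true, true)` returns both for either query, while `search(&docs, query, true, false)` returns both for ا and nothing for أ.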

What do you suggest doing to fix this?

@curquiza
Member Author

curquiza commented Nov 3, 2021

Hello @ahmedkrmn thanks for your interest! 😁

@ManyTheFish can help you with this when he has the time :)

@ManyTheFish
Member

ManyTheFish commented Nov 3, 2021

Hello @ahmedkrmn, are you sure that deunicoding Arabic script is a good thing to do?
The sentence

المتعة والمرح في تعلم العربية

would be deunicoded as

lmt`@ wlmrH fy t`lm l`rby@

🤔

I can't write Arabic script, so I don't know what the correct behavior should be.

@Reex11

Reex11 commented Dec 16, 2021

Hello @ManyTheFish,
I believe that the characters أ ا إ آ should be processed in a way similar to "lowercasing".
So when a user searches with a query containing, for example, احمد, they should receive all of these variations: أحمد احمد إحمد آحمد.
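
A minimal sketch of the folding described here, assuming it is applied to both the indexed text and the query. The function name `normalize_alef` is hypothetical and is not part of the tokenizer's API; the code points are the standard Unicode ones for the four letters.

```rust
/// Fold the Alef variants to bare Alef (U+0627), much like lowercasing folds case.
fn normalize_alef(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            // U+0622 Alef with Madda above, U+0623 Alef with Hamza above,
            // U+0625 Alef with Hamza below
            '\u{0622}' | '\u{0623}' | '\u{0625}' => '\u{0627}',
            other => other,
        })
        .collect()
}
```

All four spellings of the example name normalize to the same string, so a query in any one form can match documents written in the others.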

@ManyTheFish
Member

Hello @Reex11, I will investigate your case. 🤔
I checked whether Rust's lowercase function could help us, but it doesn't:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=92777875ea531819640dfedea3d42395
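
A small check along the same lines (a sketch, not necessarily what the linked gist contains): Arabic letters have no case, so `str::to_lowercase` returns them unchanged and cannot merge أ with ا. The helper name `lowercase_is_noop` is made up for illustration.

```rust
/// Returns true when lowercasing leaves the text unchanged.
fn lowercase_is_noop(text: &str) -> bool {
    text.to_lowercase() == text
}
```

Calling it on أحمد returns `true`, while a cased Latin word such as `Ahmed` returns `false`.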

Is there a name for this process that is similar to "lowercasing"? It could help me find a library or a function that does the job.

Thanks for your help 😁

@Reex11

Reex11 commented Dec 20, 2021

Thank you for investigating this.
Actually, it's not literally lowercasing 😅. It's basically a kind of normalization process.
There is a recently released Arabic text processing library, Maha, that will be very useful because it has already done a lot of work on Arabic text normalization.

You need to know this first:
ا letter is called Alef
ء this symbol is called Hamza
أ this letter is Alef with Hamza above

This library calls the process normalization, which I believe is the right term.
Here you can find Alef Variations
And here you can find what is called Alef Variations Normalization

I'll dig around to see if there's anything else to consider.

@Reex11

Reex11 commented Dec 20, 2021

Hi again,
I found the following:

  • Harakat (diacritics such as َ ِ ُ) should be ignored entirely, so removing them should be part of the normalization process.
  • The Ta' Marbota letter ة should be normalized to the Ha' letter ه.
  • The Waw letter و is usually a stop word (it means "and"), but there may be an issue here because Waw is not always a stop word.
    For example, in سماء وأرض the Waw letter is a stop word (translation: Sky and Earth).
    In other cases it is not a stop word. For example, in كتاب وليد the Waw letter is part of an actual word (translation: Waleed's Book).
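
The first two rules can be sketched as follows. This assumes "Harakat" covers the Unicode range U+064B to U+0652 (the tanween marks, Fatha, Damma, Kasra, Shadda, and Sukun); `is_haraka` and `normalize_arabic` are hypothetical names. Waw is deliberately left out because, as noted, it cannot be dropped blindly.

```rust
/// True for the Arabic short-vowel and related marks U+064B..=U+0652.
/// The exact range is an assumption about what should count as Harakat.
fn is_haraka(c: char) -> bool {
    ('\u{064B}'..='\u{0652}').contains(&c)
}

/// Drop Harakat and map Ta' Marbota (U+0629) to Ha' (U+0647).
fn normalize_arabic(text: &str) -> String {
    text.chars()
        .filter(|c| !is_haraka(*c))
        .map(|c| if c == '\u{0629}' { '\u{0647}' } else { c })
        .collect()
}
```

For instance, a fully vowelled مَدْرَسَة and an unvowelled مدرسه both normalize to مدرسه.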

I'll look for a solution to the Waw stop-word problem. I already have some workarounds in mind.
I understand that you may have difficulty understanding some parts of the language, so let me know if you need any help.
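
One possible workaround, offered only as an assumption and not necessarily what is meant above: rather than deciding whether a leading Waw is a conjunction, emit both the full token and the token without the Waw, and let the search match either form. `waw_candidates` is a hypothetical helper, and the minimum remainder length of two characters is an arbitrary guard.

```rust
/// Return the token itself plus, when it starts with Waw (U+0648) and enough
/// characters remain, the token with that leading Waw removed.
fn waw_candidates(token: &str) -> Vec<String> {
    let mut candidates = vec![token.to_string()];
    let mut chars = token.chars();
    if chars.next() == Some('\u{0648}') {
        // After consuming the first char, `as_str` is the remainder of the token.
        let rest = chars.as_str();
        if rest.chars().count() >= 2 {
            candidates.push(rest.to_string());
        }
    }
    candidates
}
```

With this approach وأرض yields both وأرض and أرض, while وليد yields وليد and the spurious ليد; the cost is some extra noise in the index in exchange for not having to classify each Waw.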

@ManyTheFish
Member

Hello @Reex11! Thanks for your help. We will have to design or find a specialized normalizer for this.
I have a question about tokenization: are words only space-separated?

@Reex11

Reex11 commented Dec 22, 2021

Hi @ManyTheFish,
First, you should know that I only have basic knowledge of NLP.

I think there are a lot of cases that are not space-separated.
But it's OK to start with space separation (and I believe this is the general case in the Arabic-supporting tokenizers I have seen).
However, there are some important and common cases that need to be considered to improve the search results,
such as And => و , The => الـ

Example:
الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The, and it's always connected (not space-separated) to the next word.
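
A naive sketch of that starting point: split on whitespace, then peel a leading ال off each word. The names `DEFINITE_ARTICLE` and `tokenize` are hypothetical, and the heuristic is an assumption with known gaps: it will also split words that merely begin with the letters ال without carrying the article, and it does not handle an article preceded by another prefix such as و.

```rust
/// The definite article Alef + Lam (U+0627 U+0644), without the tatweel used for display.
const DEFINITE_ARTICLE: &str = "\u{0627}\u{0644}";

/// Split on whitespace, then separate a leading definite article from its word
/// when at least two characters remain after it.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    for word in text.split_whitespace() {
        match word.strip_prefix(DEFINITE_ARTICLE) {
            Some(stem) if stem.chars().count() >= 2 => {
                tokens.push(DEFINITE_ARTICLE.to_string());
                tokens.push(stem.to_string());
            }
            _ => tokens.push(word.to_string()),
        }
    }
    tokens
}
```

Applied to الشجرة, this produces the two tokens ال and شجرة, so a search for شجرة can match the word with or without the article.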

I found a great Arabic NLP library, and I think it's the best so far. It's called CAMeL Tools.

@curquiza
Member Author

Closed in favor of meilisearch/product#139
Any contribution that adds an Arabic normalizer and segmenter is welcome!

5 participants