improvement of InformalNormalizer #214

Merged 5 commits on Mar 12, 2022

@riasati (Contributor) commented Feb 19, 2022

No description provided.

@imani (Contributor) left a comment

Please remove the test .txt files and any other files that do not need to be in the repo.

I have left some recommendations below to improve speed and code clarity.

Please measure the speed before and after modifying the code.
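One way to compare speed before and after a change is a small `timeit` harness. The sketch below is only an illustration: `normalize_stub` is a hypothetical stand-in for whichever function is being optimized, and `best_time` is a helper name invented here, not part of the project.

```python
import timeit


def normalize_stub(text):
    """Hypothetical stand-in for the function whose speed is being checked."""
    return text.replace("\u200c", " ")


def best_time(func, sample, number=1000, repeat=5):
    """Return the best average time per call, in seconds, over several runs."""
    timer = timeit.Timer(lambda: func(sample))
    # Taking the minimum reduces noise from other processes on the machine.
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    sample = "a\u200cb " * 100  # placeholder input, not real test data
    print(f"{best_time(normalize_stub, sample):.3e} seconds per call")
```

Running the same harness on the old and the new version of the function, with the same sample text, gives two numbers that can be compared directly.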

Comment on lines 1 to 20
# sourceFileAddress = "./output-test-formal.txt"
# destinationFileAddress = "./output-test-formal-space.txt"
sourceFileAddress = "./shekasteh-test.tok.formal"
destinationFileAddress = "./shekasteh-test-space.tok.formal"

def main(sourceAddress, destinationAddress):
    with open(sourceAddress, "r", encoding='utf-8') as readFile, open(destinationAddress, "w", encoding='utf-8') as writeFile:
        while True:
            line = readFile.readline().strip()
            # Stops at end of file (and also at the first blank line).
            if not line:
                break
            # Replace invisible formatting characters with a plain space.
            line = line.replace('‌', ' ')
            line = line.replace('‎', ' ')
            # Strip sentence-final punctuation.
            line = line.replace('.', '')
            line = line.replace('؟', '')
            line = line.replace('!', '')
            writeFile.write(line + "\n")


main(sourceFileAddress, destinationFileAddress)

This file is not necessary to include in the repository.

@@ -0,0 +1,917 @@
. باید جدا بشویم تا فضای بیشتری رو بتوانیم چک کنیم
(English, informal register: "We have to split up so that we can check more area.")

Not necessary to include in the repo.

@@ -0,0 +1,917 @@
. باید جدا بشویم تا فضای بیشتری را بتوانیم چک کنیم
(English, formal register: "We have to split up so that we can check more area.")

Not necessary to include in the repo.

@@ -0,0 +1,1012 @@
من مگر این را بهت نگفتم ؟
(English: "Didn't I tell you this?")

Not necessary to include in the repo.

@@ -0,0 +1,1012 @@
من مگر این را بهت نگفتم ؟
(English: "Didn't I tell you this?")

Not necessary to include in the repo.

def appendSuffixToWord(OneCollectionOfWordAndSuffix):
    mainWord = OneCollectionOfWordAndSuffix["word"]
    suffixList = OneCollectionOfWordAndSuffix["suffix"]
    adhesiveAlphabet = ["ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "س", "ش", "ص", "ض", "ع", "غ", "ف", "ق", "ک", "گ", "ل", "م", "ن", "ه", "ی"]

Convert it to a set for speedup.
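A minimal sketch of the set-based idea, assuming the list is only used to test whether a word ends with one of these letters. The names `ADHESIVE_LETTERS` and `ends_with_adhesive_letter` are hypothetical and not taken from the PR; the letters are copied from the list above.

```python
# Built once at module level, so it is not recreated on every call.
ADHESIVE_LETTERS = frozenset([
    "ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "س", "ش", "ص", "ض",
    "ع", "غ", "ف", "ق", "ک", "گ", "ل", "م", "ن", "ه", "ی",
])


def ends_with_adhesive_letter(word):
    """Return True if the last character of word is in the letter set."""
    # A set lookup replaces a loop over all letters calling endswith each time.
    return bool(word) and word[-1] in ADHESIVE_LETTERS
```

An alternative that keeps the `endswith` call is to pass a tuple of the letters, since `str.endswith` accepts a tuple of suffixes.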

hazm/InformalNormalizer.py
Comment on lines 444 to 449
# if suffixList[i] == "هاست":
# for alphabet in adhesiveAlphabet:
# if returnWord.endswith(alphabet):
# returnWord += "‌"
# break
# returnWord += "ها است"

Please remove the commented-out code if it is no longer needed.

hazm/InformalNormalizer.py
@@ -0,0 +1,146 @@
# from break_words import *

This file should not be added to the repository.

@imani merged commit fd5c140 into roshan-research:master on Mar 12, 2022
This pull request was closed.