added handling for abbreviations #47

Merged
thorunna merged 9 commits into master from correct-spaces
Nov 13, 2023

Conversation

thorunna
Contributor

@thorunna thorunna commented Nov 2, 2023

This pull request improves correct_spaces() so that it handles abbreviations correctly. The previous handling incorrectly split, for example, 't.d.' into 't. d.'; such abbreviations are now left intact.
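
For readers unfamiliar with the problem, here is a minimal sketch of the behaviour described above. It is not the actual correct_spaces() implementation: the helper names `naive_space_after_period` and `abbreviation_aware` are hypothetical, and the pattern is only modeled on the one shown in this PR's diff. The idea is that a rule inserting a space after every period followed by a letter will break 't.d.', so multi-part abbreviations are matched first and copied through unchanged.

```python
import re

# Hypothetical pattern, modeled on the regex in this PR's diff:
# two or more letter groups, each followed by a period (e.g. "t.d.", "a.m.k.").
ABBREV = re.compile(
    r"[a-záðéíóúýþæö]+\.(?:[a-záðéíóúýþæö]+\.)+",
    re.IGNORECASE,
)


def naive_space_after_period(text: str) -> str:
    """Insert a space after every period that is directly followed by a letter."""
    # [^\W\d_] matches any Unicode letter in Python's stdlib re.
    return re.sub(r"\.(?=[^\W\d_])", ". ", text)


def abbreviation_aware(text: str) -> str:
    """Apply the naive rule, but copy multi-part abbreviations through untouched."""
    out = []
    pos = 0
    for m in ABBREV.finditer(text):
        # Fix spacing in the text before the abbreviation...
        out.append(naive_space_after_period(text[pos:m.start()]))
        # ...and keep the abbreviation itself exactly as written.
        out.append(m.group(0))
        pos = m.end()
    out.append(naive_space_after_period(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(naive_space_after_period("t.d."))
    print(abbreviation_aware("Hann kom.Hún fór t.d. heim"))
```

The naive rule alone reproduces the old splitting of 't.d.'; the abbreviation-aware variant leaves it alone while still repairing the missing space after an ordinary sentence-ending period.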

@thorunna
Contributor Author

thorunna commented Nov 2, 2023

The only change that is not related to formatting, which I can't seem to reverse, is in lines 3059 and 3060.

@vthorsteinsson
Member

Looks really good! But it would be great to add tests for this to the test suite.

@vthorsteinsson
Member

The formatting changes are due to Black being set to a line length of 120 instead of 88, which was the original default. I'm not sure that it's a good idea to change the line length.

@@ -3169,17 +3168,16 @@ def valid_sent(sent: Optional[List[Tok]]) -> bool:
# The following regex catches English numbers with a dot only
r"|([\+\-\$€]?\d+\.\d+(?!\,\d))" # -1234.56
# The following regex catches Icelandic abbreviations, e.g. a.m.k., A.M.K., þ.e.a.s.
r"|(\p{L}+\.(?:\p{L}+\.)+)(?!\p{L}+\s)"
# r"|(\p{L}+\.(?:\p{L}+\.)+)(?!\p{L}+\s)"
r"|([a-záðéíóúýþæö]+\.(?:[a-záðéíóúýþæö]+\.)+)(?![a-záðéíóúýþæö]+\s)"
Member

Did you decide against supporting uppercase letters as well? It is actually an open question whether mixed upper- and lowercase should be allowed, or whether there should simply be two expressions here, one for uppercase only and the other for lowercase only. There is also the question of whether we should hard-code an exception for the abbreviation "þ.á m.", which is quite a special case ;-)

Contributor Author

No, I added re.IGNORECASE on line 3180 so that this expression could be shorter. By allowing mixed case we catch an abbreviation at the start of a sentence, and it is also useful to be able to catch odd capitalization even when it is grammatically incorrect.
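
A small sketch of the case-handling point in this thread, using only the standard-library `re` module. The pattern is copied from the diff above; the names `LETTERS`, `PATTERN`, and `is_abbreviation` are hypothetical and exist only for this illustration. It compares the lowercase-only character class with and without `re.IGNORECASE`, and shows that "þ.á m." is not covered either way because of the space.

```python
import re

# Lowercase Icelandic letters, as in the character class from the diff.
LETTERS = "a-záðéíóúýþæö"

# Same shape as the regex in the diff: letter groups each ending in a period,
# not followed by further letters and whitespace.
PATTERN = rf"[{LETTERS}]+\.(?:[{LETTERS}]+\.)+(?![{LETTERS}]+\s)"

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)


def is_abbreviation(token: str, regex: re.Pattern = case_insensitive) -> bool:
    """Return True if the whole token matches the abbreviation pattern."""
    return regex.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["t.d.", "T.d.", "A.M.K.", "Þ.E.A.S.", "þ.á m."]:
        print(
            token,
            "| with IGNORECASE:", is_abbreviation(token),
            "| without:", is_abbreviation(token, case_sensitive),
        )
```

With the flag, sentence-initial forms such as "T.d." and all-caps forms such as "A.M.K." match the single short expression; without it, only the all-lowercase forms do. Handling "þ.á m." would need a separate rule, as the reviewer suggests.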

@thorunna
Contributor Author

Ok to merge?

@vthorsteinsson
Member

Yes, looks good!

@thorunna thorunna merged commit 5524df4 into master Nov 13, 2023
@thorunna thorunna deleted the correct-spaces branch November 13, 2023 11:29