added handling for abbreviations #47
The only change that is not related to formatting, which I can't seem to reverse, is in lines 3059 and 3060.
Looks really good! But it would be great to add tests for this to the test suite.
The formatting changes are due to Black being set to a line length of 120 instead of 88, which was the original default. I'm not sure that it's a good idea to change the line length.
src/tokenizer/tokenizer.py
@@ -3169,17 +3168,16 @@ def valid_sent(sent: Optional[List[Tok]]) -> bool:
# The following regex catches English numbers with a dot only
r"|([\+\-\$€]?\d+\.\d+(?!\,\d))" # -1234.56
# The following regex catches Icelandic abbreviations, e.g. a.m.k., A.M.K., þ.e.a.s.
r"|(\p{L}+\.(?:\p{L}+\.)+)(?!\p{L}+\s)"
# r"|(\p{L}+\.(?:\p{L}+\.)+)(?!\p{L}+\s)"
r"|([a-záðéíóúýþæö]+\.(?:[a-záðéíóúýþæö]+\.)+)(?![a-záðéíóúýþæö]+\s)"
Did you drop support for uppercase letters as well? It's actually an open question whether mixed upper- and lowercase should be allowed, or whether there should simply be two expressions here, one for uppercase only and the other for lowercase only. - Then there's the question of whether we should hard-code an exception for the abbreviation "þ.á m.", which is quite a special case ;-)
No, I added re.IGNORECASE on line 3180 so that this expression could be shorter. By allowing a mix we catch an abbreviation at the start of a sentence, and it's also nice to be able to catch odd capitalization even if it is grammatically incorrect.
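For reference, a minimal standalone sketch of how the lowercase-only character class behaves once re.IGNORECASE is applied. It uses the abbreviation alternative from the diff on its own, compiled with the standard `re` module; the names `ABBREV` and `find_abbreviations` are made up for this illustration and are not part of tokenizer.py, where the expression is only one branch of a larger pattern.

```python
import re

# The abbreviation branch from the diff, compiled in isolation.
# ABBREV and find_abbreviations are hypothetical names for this sketch only.
ABBREV = re.compile(
    r"([a-záðéíóúýþæö]+\.(?:[a-záðéíóúýþæö]+\.)+)(?![a-záðéíóúýþæö]+\s)",
    re.IGNORECASE,  # lets the lowercase-only class match uppercase and mixed case too
)


def find_abbreviations(text):
    """Return every substring of text that the abbreviation pattern matches."""
    return [m.group(1) for m in ABBREV.finditer(text)]


print(find_abbreviations("Hann kom, t.d. í gær."))
print(find_abbreviations("A.M.K. tveir"))
print(find_abbreviations("Þ.e.a.s. ekki núna"))
print(find_abbreviations("þ.á m."))
```

In this sketch 't.d.', 'A.M.K.' and the sentence-initial 'Þ.e.a.s.' are each matched as a single abbreviation, while "þ.á m." is not matched at all because of the space inside it, which is why it would need the separate exception mentioned above.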
Ok to merge?
Yes, looks good!
Pull request to improve correct_spaces() so that it handles abbreviations correctly. Previously it incorrectly split e.g. 't.d.' into 't. d.'; it should now leave such abbreviations intact.