added handling for abbreviations #47

thorunna · 2023-11-02T13:47:04Z

This pull request improves correct_spaces() so that it handles abbreviations correctly. Previously, an abbreviation such as 't.d.' was incorrectly split into 't. d.'; it is now left intact.
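
To make the bug concrete, here is a toy sketch of the behaviour described above. It is not how correct_spaces() is implemented in the library; the helper names `naive_fix` and `abbrev_aware_fix` and both patterns are assumptions made up for illustration. The naive version inserts a space after every period that is directly followed by a letter, which breaks 't.d.'; the second version leaves dotted abbreviations untouched.

```python
import re

# Icelandic lowercase letters; re.IGNORECASE below also covers uppercase.
LETTERS = "a-záðéíóúýþæö"

# Hypothetical pattern for dotted abbreviations such as "t.d." or "a.m.k."
ABBREV = re.compile(rf"[{LETTERS}]+\.(?:[{LETTERS}]+\.)+", re.IGNORECASE)

# A period that is directly followed by a letter (no space in between).
PERIOD_LETTER = re.compile(rf"\.(?=[{LETTERS}])", re.IGNORECASE)


def naive_fix(text: str) -> str:
    """Insert a space after every period that is followed by a letter."""
    return PERIOD_LETTER.sub(". ", text)


def abbrev_aware_fix(text: str) -> str:
    """Like naive_fix, but copy dotted abbreviations through unchanged."""
    out = []
    pos = 0
    for m in ABBREV.finditer(text):
        out.append(naive_fix(text[pos:m.start()]))  # text before the abbreviation
        out.append(m.group(0))                      # the abbreviation itself
        pos = m.end()
    out.append(naive_fix(text[pos:]))               # remaining text
    return "".join(out)


print(naive_fix("t.d."))         # t. d.
print(abbrev_aware_fix("t.d."))  # t.d.
```

The real function works on tokens rather than raw strings, so this only mirrors the before/after behaviour, not the mechanism.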

thorunna · 2023-11-02T13:55:42Z

I can't seem to reverse the formatting changes. The only change that is not formatting-related is in lines 3059 and 3060.

vthorsteinsson · 2023-11-02T14:05:13Z

Looks really good! But it would be great to add tests for this to the test suite.

vthorsteinsson · 2023-11-02T14:06:45Z

The formatting changes are due to Black being set to a line length of 120 instead of 88, which was the original default. I'm not sure that it's a good idea to change the line length.

src/tokenizer/tokenizer.py

test/test_tokenizer.py

vthorsteinsson · 2023-11-03T11:33:14Z

src/tokenizer/tokenizer.py

@@ -3169,17 +3168,16 @@ def valid_sent(sent: Optional[List[Tok]]) -> bool:
    # The following regex catches English numbers with a dot only
    r"|([\+\-\$€]?\d+\.\d+(?!\,\d))"  # -1234.56
    # The following regex catches Icelandic abbreviations, e.g. a.m.k., A.M.K., þ.e.a.s.
-    r"|(\p{L}+\.(?:\p{L}+\.)+)(?!\p{L}+\s)"
+    # r"|(\p{L}+\.(?:\p{L}+\.)+)(?!\p{L}+\s)"
+    r"|([a-záðéíóúýþæö]+\.(?:[a-záðéíóúýþæö]+\.)+)(?![a-záðéíóúýþæö]+\s)"


Did you drop support for uppercase letters as well? It is actually an open question whether mixed upper- and lowercase should be allowed, or whether there should simply be two expressions here, one for uppercase only and one for lowercase only. There is also the question of whether we should hard-code an exception for the abbreviation "þ.á m.", which is quite a special case ;-)

thorunna · 2023-11-03

No, I added re.IGNORECASE in line 3180 so that this expression could be shorter. By allowing mixed case we catch an abbreviation at the start of a sentence, and it is also useful to catch odd capitalization even when it is grammatically incorrect.
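
As a minimal sketch of the point about re.IGNORECASE, the snippet below compiles the abbreviation alternative from the diff on its own, once with and once without the flag. The names `PATTERN`, `ABBREV`, `CASE_SENSITIVE` and `find_abbreviations` are assumptions for illustration; in the tokenizer the expression is one branch of a larger regex.

```python
import re
from typing import List, Pattern

# Character class taken from the diff; lowercase Icelandic letters only.
LETTERS = "a-záðéíóúýþæö"
PATTERN = rf"([{LETTERS}]+\.(?:[{LETTERS}]+\.)+)(?![{LETTERS}]+\s)"

# With the flag, the lowercase class also matches uppercase letters.
ABBREV = re.compile(PATTERN, re.IGNORECASE)
# Without the flag, only all-lowercase abbreviations match.
CASE_SENSITIVE = re.compile(PATTERN)


def find_abbreviations(text: str, pattern: Pattern[str] = ABBREV) -> List[str]:
    """Return every dotted abbreviation that the pattern finds in text."""
    return [m.group(1) for m in pattern.finditer(text)]


print(find_abbreviations("T.d. er stytting."))                  # ['T.d.']
print(find_abbreviations("T.d. er stytting.", CASE_SENSITIVE))  # []
print(find_abbreviations("A.M.K. eitt, þ.e.a.s. tvö"))          # ['A.M.K.', 'þ.e.a.s.']
```

A sentence-initial 'T.d.' is only caught when the flag is set, which is the case the reply describes. An abbreviation containing a space, such as "þ.á m.", is not matched by this pattern.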

thorunna · 2023-11-13T11:10:53Z

Ok to merge?

vthorsteinsson · 2023-11-13T11:20:43Z

Yes, looks good!

added handling for abbreviations

b123fbc

thorunna added 2 commits November 2, 2023 14:23

reversed formatting

15d54ad

added test cases for abbreviations

7098dd3

vthorsteinsson reviewed Nov 2, 2023

src/tokenizer/tokenizer.py (outdated)

src/tokenizer/tokenizer.py (outdated)

test/test_tokenizer.py

thorunna added 4 commits November 2, 2023 15:54

improved handling for abbreviations and degree symbols

7696a27

added installation for regex module

25649f6

reversed change

877df42

went back to re module

e84e442

vthorsteinsson reviewed Nov 3, 2023

thorunna added 2 commits November 3, 2023 11:40

old regex string, which was commented out, removed

8a60543

updated regex string

66ffce0

thorunna merged commit 5524df4 into master Nov 13, 2023

thorunna deleted the correct-spaces branch November 13, 2023 11:29
