Properly split synonyms #195

Merged
merged 3 commits into from
Feb 25, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ Breaking:
 New features:

 * Add `pronunciation_audio_url` property #183 by @mundanevision20
+* Synonyms are now properly split #195

 Bugfixes:

2 changes: 1 addition & 1 deletion docs/usage.md
@@ -36,7 +36,7 @@ This example showcases the most useful functions of the `DudenWord` class.
 'barmherziges Wesen, Verhalten'

 > w.synonyms
-'[Engels]güte, Milde, Nachsicht, Nachsichtigkeit; (gehoben) Herzensgüte, Mildtätigkeit, Seelengüte; (bildungssprachlich) Humanität, Indulgenz; (veraltend) Wohltätigkeit; (Religion) Gnade'
+['[Engels]güte', 'Milde', 'Nachsicht', 'Nachsichtigkeit']

 > w.origin
 'mittelhochdeutsch barmherzekeit, barmherze, althochdeutsch armherzi, nach (kirchen)lateinisch misericordia'
50 changes: 41 additions & 9 deletions duden/word.py
@@ -239,16 +239,17 @@ def synonyms(self):
         """
         Return the structure with word synonyms
         """
-        try:
-            section = self.soup.find("div", id="synonyme")
-            section = copy.copy(section)
-            if section.header:
-                section.header.extract()
-            return recursively_extract(
-                section, maxdepth=2, exfun=lambda x: x.text.strip()
-            )
-        except AttributeError:
+        section = self.soup.find("div", id="synonyme")
+        if section is None:
             return None
+        section = copy.copy(section)
+        if section.header:
+            section.header.extract()
+        more_nav = section.find("nav", class_="more")
+        if more_nav:
+            more_nav.extract()
+
+        return split_synonyms(section.text.strip())

     @property
     def origin(self):
@@ -430,3 +431,34 @@ def alternative_spellings(self):
             return None

         return [spelling.get_text() for spelling in alternative_spellings]
+
+
+def split_synonyms(text):
+    """
+    Properly split strings like
+
+    meaning1, (commonly) meaning2; (formal, distant) meaning3
+    """
+    # split by ',' and ';'
+    comma_splits = text.split(",")
+    fine_splits = []
+    for split in comma_splits:
+        fine_splits.extend(split.split(";"))
+
+    # now join back parts which are inside of parentheses
+    final_splits = []
+    inside_parens = False
+    for split in fine_splits:
+        if inside_parens:
+            final_splits[-1] = final_splits[-1] + "," + split
+        else:
+            final_splits.append(split)
+
+        if "(" in split and ")" in split:
+            # still inside parentheses only if the last "(" follows the last ")"
+            inside_parens = split.rindex("(") > split.rindex(")")
+        elif "(" in split:
+            inside_parens = True
+        elif ")" in split:
+            inside_parens = False
+
+    return [split.strip() for split in final_splits]
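
A possible alternative to splitting first and re-joining afterwards, offered only as a reviewer-style sketch: scan the string once and keep a parenthesis depth counter, splitting on `,` or `;` only at depth zero. The name `split_synonyms_by_depth` is hypothetical and not part of this PR. Like the PR's function, the sketch turns a `;` inside parentheses into `,`, which is the behaviour the new test in `tests/test_word.py` expects.

```python
def split_synonyms_by_depth(text):
    """Hypothetical single-pass variant: split on ',' and ';' outside parentheses."""
    parts = []
    current = []
    depth = 0  # number of currently open parentheses
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            # tolerate unbalanced input instead of going negative
            depth = max(depth - 1, 0)

        if char in ",;" and depth == 0:
            # top-level separator: close the current item
            parts.append("".join(current).strip())
            current = []
        elif char == ";":
            # separator inside parentheses: keep it, normalised to ','
            current.append(",")
        else:
            current.append(char)

    parts.append("".join(current).strip())
    return parts


print(split_synonyms_by_depth("a, b (b, c); d (d; e, f) g, h"))
```

A depth counter also handles a segment that closes one bracket and opens another, such as `x (a, b) y (c, d)`, without any index comparison.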
5 changes: 4 additions & 1 deletion tests/test_data/Barmherzigkeit.yaml
@@ -16,7 +16,10 @@ origin: mittelhochdeutsch barmherzekeit, barmherze, althochdeutsch armherzi, nac
 grammar_overview: 'die Barmherzigkeit; Genitiv: der Barmherzigkeit'
 compounds: null
 synonyms:
-- '[Engels]güte, Milde, Nachsicht, Nachsichtigkeit'
+- '[Engels]güte'
+- Milde
+- Nachsicht
+- Nachsichtigkeit
 words_before:
 - barmen
 - Barmen
4 changes: 3 additions & 1 deletion tests/test_data/Feiertag.yaml
@@ -43,7 +43,9 @@ compounds:
 - verbringen
 - öffnen
 synonyms:
-- Festtag, Gedenktag, Ehrentag
+- Festtag
+- Gedenktag
+- Ehrentag
 words_before:
 - Feierlichkeit
 - Feiermodus
3 changes: 2 additions & 1 deletion tests/test_data/Kragen.yaml
@@ -21,7 +21,8 @@ grammar_overview: 'der Kragen; Genitiv: des Kragens, Plural: die Kragen, süddeu
   österreichisch, schweizerisch: Krägen'
 compounds: null
 synonyms:
-- Gurgel, Kehle
+- Gurgel
+- Kehle
 words_before:
 - Kraftwerksbetreiberin
 - Kraftwort
5 changes: 3 additions & 2 deletions tests/test_data/Petersilie.yaml
@@ -17,8 +17,9 @@ origin: mittelhochdeutsch pētersil(je), althochdeutsch petersilie, petrasile <
 grammar_overview: 'die Petersilie; Genitiv: der Petersilie, Plural: die Petersilien'
 compounds: null
 synonyms:
-- (schweizerisch) Peterli; (bayrisch, österreichisch umgangssprachlich) Petersil;
-  (südwestdeutsch und schweizerisch mundartlich) Peterle
+- (schweizerisch) Peterli
+- (bayrisch, österreichisch umgangssprachlich) Petersil
+- (südwestdeutsch und schweizerisch mundartlich) Peterle
 words_before:
 - Peter-Paul-Kirche
 - Petersburg
5 changes: 4 additions & 1 deletion tests/test_data/einfach.yaml
@@ -45,7 +45,10 @@ compounds:
 - stimmen
 - werden
 synonyms:
-- einmal, nicht doppelt, nicht mehrfach, bequem
+- einmal
+- nicht doppelt
+- nicht mehrfach
+- bequem
 words_before:
 - Eineuromünze
 - Eineurostück
5 changes: 4 additions & 1 deletion tests/test_data/laufen.yaml
@@ -63,7 +63,10 @@ compounds:
 - Vertrag
 - Vorbereitung
 synonyms:
-- eilen, fegen, hetzen, jagen
+- eilen
+- fegen
+- hetzen
+- jagen
 words_before:
 - Laufbekleidung
 - Laufbrett
12 changes: 12 additions & 0 deletions tests/test_word.py
@@ -0,0 +1,12 @@
+"""Test word functions"""
+
+from duden.word import split_synonyms
+
+
+def test_split_synonyms():
+    """Test one-line list splitting"""
+    assert split_synonyms("") == [""]
+    assert split_synonyms("a, b ,c") == ["a", "b", "c"]
+
+    expected = ["a", "b (b, c)", "d (d, e, f) g", "h"]
+    assert split_synonyms("a, b (b, c); d (d; e, f) g, h") == expected
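
The YAML fixtures touched by this PR could double as extra cases for the splitter. The following is a minimal standalone sketch under a few assumptions: it uses a condensed local copy of `split_synonyms` (comparing the last `(` with the last `)` in a segment) instead of importing from `duden.word`, and `CASES` and `check_cases` are hypothetical names. The inputs are the old one-line fixture values and the expected outputs are the new lists from the same files.

```python
def split_synonyms(text):
    """Condensed local copy of the splitter so this snippet runs on its own."""
    pieces = [p for chunk in text.split(",") for p in chunk.split(";")]
    result = []
    inside_parens = False
    for piece in pieces:
        if inside_parens:
            # re-join a piece that was cut off inside parentheses
            result[-1] += "," + piece
        else:
            result.append(piece)

        if "(" in piece and ")" in piece:
            # still open only if the last "(" comes after the last ")"
            inside_parens = piece.rindex("(") > piece.rindex(")")
        elif "(" in piece:
            inside_parens = True
        elif ")" in piece:
            inside_parens = False
    return [piece.strip() for piece in result]


# (old one-line fixture value, new list from the updated fixture)
CASES = [
    ("Festtag, Gedenktag, Ehrentag", ["Festtag", "Gedenktag", "Ehrentag"]),
    (
        "einmal, nicht doppelt, nicht mehrfach, bequem",
        ["einmal", "nicht doppelt", "nicht mehrfach", "bequem"],
    ),
    (
        "(schweizerisch) Peterli; (bayrisch, österreichisch umgangssprachlich) "
        "Petersil; (südwestdeutsch und schweizerisch mundartlich) Peterle",
        [
            "(schweizerisch) Peterli",
            "(bayrisch, österreichisch umgangssprachlich) Petersil",
            "(südwestdeutsch und schweizerisch mundartlich) Peterle",
        ],
    ),
]


def check_cases():
    """Run every case and return how many were checked."""
    for text, expected in CASES:
        assert split_synonyms(text) == expected, text
    return len(CASES)


check_cases()
```

The Petersilie entry is the interesting one, since its labels contain commas inside parentheses that must not be treated as separators.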