I wrote code that generates numbers as Dutch text strings and vice versa #3197

PolyglotOpenstreetmap · 2019-01-26T07:30:11Z

PolyglotOpenstreetmap
Jan 26, 2019

Feature description

This code can convert a text string to the corresponding integer for Dutch text written as a single word.

For the reverse it can go up to 999999, splitting after the word 'duizend'.

I don't know how to go about writing the test code or how to properly integrate it into spaCy.

import re

units_and_tens_re = re.compile(
    r"(?P<units>(eene|tweeë|drieë|viere|vijfe|zese|zevene|achte|negene))n(?P<tens>(twin|der|veer|vijf|zes|zeven|tach|negen)tig)")
numeric_lookup = {0:  'nul',
                  1:  'een',
                  2:  'twee',
                  3:  'drie',
                  4:  'vier',
                  5:  'vijf',
                  6:  'zes',
                  7:  'zeven',
                  8:  'acht',
                  9:  'negen',
                  10: 'tien',
                  11: 'elf',
                  12: 'twaalf',
                  13: 'dertien',
                  14: 'veertien',
                  20: 'twintig',
                  30: 'dertig',
                  40: 'veertig',
                  50: 'vijftig',
                  60: 'zestig',
                  70: 'zeventig',
                  80: 'tachtig',
                  90: 'negentig',
                  1000: 'duizend',
                  }

text_lookup = {}
for number, text in numeric_lookup.items():
    text_lookup[text] = number

def split_of_hundreds(number):
    h = 0
    # 200 - 999
    for hundreds in range(2, 10):
        honderd_ = numeric_lookup[hundreds] + "honderd"
        if honderd_ in number:
            number = number.replace(honderd_ + "en", "")
            number = number.replace(honderd_, "")
            h = hundreds
            break
    # 100 - 199
    if 'honderd' in number:
        honderd_ = "honderd"
        number = number.replace(honderd_ + "en", "")
        number = number.replace(honderd_, "")
        h = 1
    return h, number


def parse_number(number: str) -> ([None,int], [None,bool]):
    """Accepts Dutch text representing a number which can be written as a single word
       in the range of 0-999 and their multiples of 1 thousand.
       After 'duizend' a space is required
       :param number: text string that may be a number
       :return: a tuple consisting of
                * either None or the value number represents as an int
                * None:  This string cannot be converted to a valid number
                  False: This string represents a cardinal number
                  True:  This string represents an ordinal number"""

    # ('tal',) must be a tuple; a bare string would turn this into a substring test
    if number[-3:] in ('tal',):
        return (None, None)
    if number[-4:] in ('maal', 'hoog', 'hoge', 'voud', 'werf', 'poot'):
        return (None, None)
    if number[-5:] in ('jarig', 'delig', 'potig', 'armig', 'benig'):
        return (None, None)
    if number[-6:] in ('jarige', 'delige', 'potige', 'armige', 'benige',
                       'koppig', 'voudig', 'tallig'):
        return (None, None)
    if number[-7:] in ('koppige', 'voudige', 'tallige'):
        return (None, None)
    if number in ['honderdhout', 'honderdman', 'honderdponder', 'honderduit',
                  'duizendblad', 'duizenddingendoekje', 'duizenddollarvis', 'duizenderlei',
                  'duizendguldenkruid', 'duizendklapper', 'duizendknoop',
                  'duizendkunstenaar', 'duizendschoon']:
        return (None, None)

    ordinal = False
    if number == 'één':
        return (1, ordinal)
    elif number == 'nul':
        return (0, ordinal)
    elif number == 'duizend':
        return (1000, ordinal)
    t=0
    u=0
    # Can it be an ordinal number?
    if number.endswith("ste"):
        ordinal = True
        number = number[:-3]
    elif number.endswith("de"):
        ordinal = True
        number = number[:-2]

    # Is it a multiple of 1000?
    if number.endswith("duizend"):
        k = 1000
        number = number.replace("duizend", "")
    else:
        k = 1
    value = None
    h, number = split_of_hundreds(number)
    if h:
        value = 0
    if number:
        if number in text_lookup:
            # 01 - 14
            value = int(text_lookup[number])
            t = int(value/10)
            u = value - t * 10
        elif number.endswith('tien'):
            # 15 - 19
            unit = number.replace('tien','')
            if unit in ['vijf', 'zes', 'zeven', 'acht', 'negen']:
                u = text_lookup[unit]
                t = 1
                value = t * 10 + u
        if not(value):
            ut = units_and_tens_re.match(number)
            if ut:
                # 20 - 99
                t = ut['tens']
                u = ut['units'][:-1]
                value = text_lookup[t] + text_lookup[u]
    if not(value or h):
        return (None, None)
    return ((h*100 + value) * k, ordinal)


def ordinal_or_cardinal(cardinal_number:str, ordinal:bool, modulo_1000:str):
    if ordinal:
        if modulo_1000:
            last_letter = modulo_1000[-1:]
            modulo_1000 = " " + modulo_1000
        else:
            last_letter = cardinal_number[-1:]
        if last_letter in ['t', 'g', 'd'] or cardinal_number == "miljoen":
            return cardinal_number + modulo_1000 + 'ste'
        else:
            return cardinal_number + modulo_1000 + 'de'
    else:
        return cardinal_number + modulo_1000

def convert_number_to_text(n:int, ordinal:bool = False, _complete_number:int = 0):
    """
    Converts an integer number between 0 and 999 and their multiples of 1000
    to a Dutch text string
    :param n:
    :param ordinal returns ordinal number when set to True
    :param _complete_number: only used when called recursively to cater for standalone "één"
    :return: full word representation of n in Dutch,
             which may contain "(en)" between honderd/duizend and 1 to 12
             You need to either remove the () or (en)
    """
    modulo_1000 = ""
    if n == 1:
        if ordinal:
            return "eerste"
        else:
            if _complete_number == 0:
                return 'één'
            else:
                return 'een'
    if n == 3 and ordinal:
        return "derde"
    elif n == 1000:
        return ordinal_or_cardinal('duizend', ordinal, modulo_1000)
    elif n > 1000 and int(n/1000) == n/1000:
        # multiples of 1000 are written as a single word in Dutch
        thousand = "duizend"
        mod1000 = n % 1000
        if 1 < mod1000 < 13:
            en = " (en) "
        else:
            en = ""
        n = int(n / 1000)
        if mod1000:
            modulo_1000 = en + convert_number_to_text(mod1000,
                                                      ordinal)
    else:
        thousand = ""
    if n in numeric_lookup:
        return ordinal_or_cardinal(numeric_lookup[n] + thousand,
                                   ordinal, modulo_1000)
    hundreds = int(n / 100)
    tens = int((n-hundreds*100) / 10)
    units = int((n-hundreds*100-tens*10))
    if hundreds == 1:
        result = "honderd"
    elif hundreds != 0:
        result = f"{convert_number_to_text(hundreds, _complete_number=n)}honderd"
    else:
        result = ""
    tens_units = tens * 10 + units
    if hundreds and tens_units < 13:
        if tens_units:
            return ordinal_or_cardinal(result + "(en)" + convert_number_to_text(tens_units,
                                                                                _complete_number=n) + thousand,
                                       ordinal, modulo_1000)
        else:
            return ordinal_or_cardinal(result + thousand,
                                       ordinal, modulo_1000)
    if tens == 1 and units in (1, 2, 3, 4):
        return ordinal_or_cardinal(result + convert_number_to_text(tens_units,
                                                                   _complete_number=n) + thousand,
                                   ordinal, modulo_1000)
    elif tens == 1:
        return ordinal_or_cardinal(result + f"{convert_number_to_text(units, _complete_number=n)}tien" + thousand,
                                   ordinal, modulo_1000)
    elif tens !=0:
        if units in (2,3):
            result += convert_number_to_text(units, _complete_number=n) + "ën"
        elif units !=0:
            result += convert_number_to_text(units, _complete_number=n) + "en"
        return ordinal_or_cardinal(result + convert_number_to_text(tens*10,
                                                                   _complete_number=n) + thousand,
                                   ordinal, modulo_1000)
    else:
        return ordinal_or_cardinal(result + thousand,
                                   ordinal, modulo_1000)


if __name__ == "__main__":
    numbers = {}
    for i in range(0,1000):
        complete = convert_number_to_text(i)
        #print(parse_number(convert_number_to_text(i, ordinal=True)))
        if complete:
            if "(en)" in complete:
                numbers[complete.replace("(en)", "")] = parse_number(complete.replace("(en)", ""))[0]
                assert numbers[complete.replace("(en)", "")] == i
                numbers[complete.replace("(", "").replace(")", "")] = parse_number(complete.replace("(", "").replace(")", ""))[0]
                assert numbers[complete.replace("(", "").replace(")", "")] == i
            else:
                numbers[complete] = parse_number(complete)[0]
                print(i, numbers[complete])
                assert numbers[complete] == i

    for i in range(1,1000):
        complete = convert_number_to_text(i*1000)
        if complete:
            if "(en)" in complete:
                numbers[complete.replace("(en)", "")] = parse_number(complete.replace("(en)", ""))[0]
                assert numbers[complete.replace("(en)", "")] == i * 1000
                numbers[complete.replace("(", "").replace(")", "")] = parse_number(complete.replace("(", "").replace(")", ""))[0]
                assert numbers[complete.replace("(", "").replace(")", "")] == i * 1000
            else:
                numbers[complete] = parse_number(complete)[0]
                print('complete',complete, i*1000)
                assert numbers[complete] == i * 1000

    #print(numbers)
    print(parse_number("test"))
    print(parse_number("nonsense"))
    print(parse_number("brie"))
    print(parse_number("beste"))
    print(parse_number("honderdmaal"))
    print(parse_number("drieduizend vijfhonderdvierentachtig")) # wrong (not a single word, so to be expected)
    print(parse_number("drieduizend vierentachtig")) # wrong (not a single word, so to be expected)
    print(parse_number("vijfhonderdvierentachtig"))
    print(parse_number("vierendelen"))
    print(parse_number("vieren")) # plural of 4 or "to celebrate" / "to release (a rope)"
    print(parse_number("klavertjevier"))
    print(parse_number(""))

PolyglotOpenstreetmap · 2019-01-26T14:21:50Z

PolyglotOpenstreetmap
Jan 26, 2019
Author

I adapted it for German:

```python
import re

units_and_tens_re = re.compile(
    r"(?P<units>(ein|zwei|drei|vier|fünf|sechs|sieben|acht|neun))und"
    r"(?P<tens>(zwanzig|dreißig|vierzig|fünfzig|sechzig|siebzig|achtzig|neunzig))")
numeric_lookup = {0: 'null',
                  1: 'ein',
                  2: 'zwei',
                  3: 'drei',
                  4: 'vier',
                  5: 'fünf',
                  6: 'sechs',
                  7: 'sieben',
                  8: 'acht',
                  9: 'neun',
                  10: 'zehn',
                  11: 'elf',
                  12: 'zwölf',
                  16: 'sechzehn',
                  17: 'siebzehn',
                  20: 'zwanzig',
                  30: 'dreißig',  # spelled with ß, unlike the other tens
                  40: 'vierzig',
                  50: 'fünfzig',
                  60: 'sechzig',
                  70: 'siebzig',
                  80: 'achtzig',
                  90: 'neunzig',
                  1000: 'tausend',
                  }

text_lookup = {}
for number, text in numeric_lookup.items():
    text_lookup[text] = number

def split_of_hundreds(number):
    h = 0
    # 200 - 999
    for hundreds in range(2, 10):
        hundert_ = numeric_lookup[hundreds] + "hundert"
        if hundert_ in number:
            number = number.replace(hundert_, "")
            h = hundreds
            break
    # 100 - 199
    if 'hundert' in number:
        hundert_ = "hundert"
        number = number.replace(hundert_, "")
        h = 1
    return h, number

def parse_number(number: str) -> ([None,int], [None,bool]):
    """Accepts German text representing a number which can be written as a single word
       in the range of 0-999 and their multiples of 1 thousand.
       After 'tausend' a space is required
       :param number: text string that may be a number
       :return: a tuple consisting of
                * either None or the value number represents as an int
                * None:  This string cannot be converted to a valid number
                  False: This string represents a cardinal number
                  True:  This string represents an ordinal number"""

    # Suffixes that turn a number word into something else
    if number[-3:] in ('mal', 'hog'):
        return (None, None)
    if number[-4:] in ('zahl', 'höhe', 'falt'):
        return (None, None)
    if number[-6:] in ('jährig', 'teilig', 'faltig', 'zählig'):
        return (None, None)
    if number[-7:] in ('jährige', 'teilige', 'faltige', 'zählige'):
        return (None, None)
    if number in []:
        return (None, None)

    ordinal = False
    if number == 'eins':
        return (1, ordinal)
    elif number == 'erste':
        return (1, True)
    elif number == 'dritte':
        return (3, True)
    elif number == 'siebte':
        return (7, True)
    elif number == 'achte':
        return (8, True)
    elif number == 'null':
        return (0, ordinal)
    elif number == 'tausend':
        return (1000, ordinal)
    t=0
    u=0
    # Can it be an ordinal number?
    if number.endswith("ste"):
        ordinal = True
        number = number[:-3]
    elif number.endswith("te"):
        ordinal = True
        number = number[:-2]

    # Is it a multiple of 1000?
    if number.endswith("tausend"):
        k = 1000
        number = number.replace("tausend", "")
    else:
        k = 1
    value = None
    h, number = split_of_hundreds(number)
    if h:
        value = 0
    if number:
        if number in text_lookup:
            # 01 - 12, 16, 17
            value = int(text_lookup[number])
            t = int(value/10)
            u = value - t * 10
        elif number.endswith('zehn'):
            # 13 - 19
            unit = number.replace('zehn','')
            if unit in ['drei', 'vier', 'fünf', 'acht', 'neun']:
                u = text_lookup[unit]
                t = 1
                value = t * 10 + u
        if not(value):
            ut = units_and_tens_re.match(number)
            if ut:
                # 20 - 99
                t = ut['tens']
                u = ut['units']
                value = text_lookup[t] + text_lookup[u]
    if not(value or h):
        return (None, None)
    return ((h*100 + value) * k, ordinal)

def ordinal_or_cardinal(cardinal_number:str, ordinal:bool, modulo_1000:str):
    if ordinal:
        if modulo_1000:
            last_letter = modulo_1000[-1:]
            modulo_1000 = " " + modulo_1000
        else:
            last_letter = cardinal_number[-1:]
        if last_letter in ['t', 'g', 'd'] or cardinal_number in ["million", "billion"]:
            return cardinal_number + modulo_1000 + 'ste'
        else:
            return cardinal_number + modulo_1000 + 'te'
    else:
        return cardinal_number + modulo_1000

def convert_number_to_text(n:int, ordinal:bool = False, _complete_number:int = 0):
    """
    Converts an integer number between 0 and 999 and their multiples of 1000
    to a German text string
    :param n:
    :param ordinal returns ordinal number when set to True
    :param _complete_number: only used when called recursively to cater for standalone "eins"
    :return: full word representation of n in German.
    """
    modulo_1000 = ""
    if n == 1:
        if ordinal:
            return "erste"
        else:
            if _complete_number == 0:
                return 'eins'
            else:
                return 'ein'
    if n == 3 and ordinal:
        return "dritte"
    if n == 7 and ordinal:
        return "siebte"
    if n == 8 and ordinal:
        return "achte"
    elif n == 1000:
        return ordinal_or_cardinal('tausend', ordinal, modulo_1000)
    elif n > 1000 and int(n/1000) == n/1000:
        # multiples of 1000 are written as a single word in German
        thousand = "tausend"
        mod1000 = n % 1000
        n = int(n / 1000)
        if mod1000:
            # the original referenced an undefined name 'en' here (left over from Dutch)
            modulo_1000 = convert_number_to_text(mod1000,
                                                 ordinal)
    else:
        thousand = ""
    if n in numeric_lookup:
        return ordinal_or_cardinal(numeric_lookup[n] + thousand,
                                   ordinal, modulo_1000)
    hundreds = int(n / 100)
    tens = int((n-hundreds*100) / 10)
    units = int((n-hundreds*100-tens*10))
    if hundreds == 1:
        result = "hundert"
    elif hundreds != 0:
        result = f"{convert_number_to_text(hundreds, _complete_number=n)}hundert"
    else:
        result = ""
    tens_units = tens * 10 + units
    if hundreds and tens_units < 13:
        if tens_units:
            return ordinal_or_cardinal(result + convert_number_to_text(tens_units,
                                                                       _complete_number=n) + thousand,
                                       ordinal, modulo_1000)
        else:
            return ordinal_or_cardinal(result + thousand,
                                       ordinal, modulo_1000)
    if tens == 1 and units in (1, 2, 6, 7):
        return ordinal_or_cardinal(result + convert_number_to_text(tens_units,
                                                                   _complete_number=n) + thousand,
                                   ordinal, modulo_1000)
    elif tens == 1:
        return ordinal_or_cardinal(result + f"{convert_number_to_text(units, _complete_number=n)}zehn" + thousand,
                                   ordinal, modulo_1000)
    elif tens !=0:
        if units != 0:
            result += convert_number_to_text(units, _complete_number=n) + "und"
        return ordinal_or_cardinal(result + convert_number_to_text(tens*10,
                                                                   _complete_number=n) + thousand,
                                   ordinal, modulo_1000)
    else:
        return ordinal_or_cardinal(result + thousand,
                                   ordinal, modulo_1000)

if __name__ == "__main__":
    numbers = {}
    for i in range(0,1000):
        complete = convert_number_to_text(i)
        print(parse_number(convert_number_to_text(i, ordinal=True)))
        if complete:
            numbers[complete] = parse_number(complete)[0]
            print(i, numbers[complete], complete)
            assert numbers[complete] == i

    for i in range(1,1000):
        complete = convert_number_to_text(i*1000)
        if complete:
            # German has no optional "(en)", so a plain round trip suffices
            numbers[complete] = parse_number(complete)[0]
            print('complete',complete, i*1000)
            assert numbers[complete] == i * 1000

    #print(numbers)
    print(parse_number("test"))
    print(parse_number("frei"))
    print(parse_number("beste"))
    print(parse_number("hunderdfalt"))
    print(parse_number("fünfhundertvierundachtzig"))
    print(parse_number(""))
```

ines · 2019-01-26T15:37:27Z

ines
Jan 26, 2019
Maintainer

Thanks, this is super cool!

A slightly simplified version of this could be used for the like_num function in the lexical attributes (available as Token.like_num and Lexeme.like_num). See here for more details and here for an example of the English lex_attrs.py.

The like_num function takes a string and returns a boolean indicating whether the text resembles a number (e.g. "vierentachtig" or 123).
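As a rough illustration of what such a function could look like for Dutch, here is a minimal sketch. The word list and the compound regex are assumptions made for this example and are far from complete; in a real `lex_attrs.py` the function would be registered under the `LIKE_NUM` attribute, which is left out here so the snippet runs without spaCy.

```python
import re

# Hypothetical, deliberately incomplete word list for illustration only.
_num_words = {
    "nul", "een", "één", "twee", "drie", "vier", "vijf", "zes", "zeven",
    "acht", "negen", "tien", "elf", "twaalf", "dertien", "veertien",
    "twintig", "dertig", "veertig", "vijftig", "zestig", "zeventig",
    "tachtig", "negentig", "honderd", "duizend",
}

# Compounds such as "vierentachtig": unit + "en"/"ën" + tens.
_compound_re = re.compile(
    r"(een|twee|drie|vier|vijf|zes|zeven|acht|negen)(en|ën)"
    r"(twintig|dertig|veertig|vijftig|zestig|zeventig|tachtig|negentig)"
)


def like_num(text: str) -> bool:
    """Return True if text resembles a number (digits or a Dutch number word)."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "1/2"
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    text = text.lower()
    return text in _num_words or _compound_re.fullmatch(text) is not None
```

Because it only answers yes or no, this version does not need the hundreds and thousands handling of `parse_number`; extending the regex to cover those would be the next step.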

You could also consider offering your scripts as a spaCy plugin that sets custom extension attributes on tokens and spans. This way, you won't have to worry about spaCy's internals while still giving users who want to try it an easy way to install it and plug it into their spaCy pipeline 🙂

PolyglotOpenstreetmap · 2019-01-26T18:58:35Z

PolyglotOpenstreetmap
Jan 26, 2019
Author

Hi Ines,

If the second part of the returned tuple is True or False, like_num can be set to True. If it's None it's extremely unlikely it's a number. Since these strings have very specific semantic meaning, I wanted to go that additional step, the way it would make sense to do for strings representing date and time values.

Of course, if you have 2 adjacent like_num strings, you'd still have to add them up:

achthonderdvierenzestigduizend driehonderdentwaalf

864000 + 312 = 864312

That needs to be done at another level.
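A minimal sketch of what that other level might look like, assuming the tokens are already split on whitespace. The helper `sum_adjacent_numbers` and the stand-in `demo_parse` are hypothetical names for this example; in practice the `parse` argument would wrap `parse_number` and return its first element.

```python
from typing import Callable, List, Optional


def sum_adjacent_numbers(
    tokens: List[str],
    parse: Callable[[str], Optional[int]],
) -> List[object]:
    """Merge runs of adjacent number words into a single summed integer.

    Non-numeric tokens are passed through unchanged.
    """
    merged: List[object] = []
    running: Optional[int] = None
    for token in tokens:
        value = parse(token)
        if value is None:
            # A non-number ends the current run, if there is one.
            if running is not None:
                merged.append(running)
                running = None
            merged.append(token)
        else:
            running = value if running is None else running + value
    if running is not None:
        merged.append(running)
    return merged


# Stand-in parser for the example only: a lookup table instead of parse_number.
_demo_values = {
    "achthonderdvierenzestigduizend": 864000,
    "driehonderdentwaalf": 312,
}


def demo_parse(word: str) -> Optional[int]:
    return _demo_values.get(word)


print(sum_adjacent_numbers(
    ["achthonderdvierenzestigduizend", "driehonderdentwaalf", "euro"],
    demo_parse,
))
# → [864312, 'euro']
```

Blindly adding neighbours is naive: two separate numbers that happen to stand next to each other would be merged as well, so a real implementation would need to check that the first word ends in 'duizend' before adding.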

I see that lemma contains "be" for "am" in English, would that be a good place to store the value represented for each number concatenated as a single word? Or is there a better place to keep track of the semantics?

Jo

DuyguA · 2019-01-29T16:45:28Z

DuyguA
Jan 29, 2019

Hello,
This is called a reverse normalizer; it is used mostly in ASR and TTS.
For the issue above, you can do very simple CFG parsing or even simpler regex parsing.
To see how the parsing can generate the addition code, and since as far as I can see you're also interested in date-time strings, you can check:

https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/

Coming to your question: no, the lemmatizer files serve a different purpose. Lemmatization is one thing and charging numbers with meaning is another; one usually handles different types of semantics in different resources. Your task consists of 2 parts:

1. Parse a number string that is more than 2 words
2. Attach a semantic meaning to the parsed string above (i.e. add all the numbers in the string)

Lemmatization is about the semantic root of the word, something completely different 🏭

For details, you can reach me on Gitter or by email, @PolyglotOpenstreetmap; I can help with the details if you wish.

PolyglotOpenstreetmap · 2019-01-30T06:54:43Z

PolyglotOpenstreetmap
Jan 30, 2019
Author

Hi DuyGuy,

You made me realise it would indeed be possible to solve it entirely with a regex. It became a monster, but it also showed my previous solution had some flaws for the exceptions eerste and derde of ordinal numbers and for their wrong counterparts eende and driede.

#3200
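To make the ordinal problem concrete, here is a small sketch limited to 1 through 10. It is not the regex from the linked issue; the table names and `parse_ordinal` are made up for this example. The irregular forms are looked up first, and the regular formation is refused for any number that has an irregular form.

```python
from typing import Optional

# Cardinals 1-10 only, to keep the sketch short.
_cardinals = {
    "een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5,
    "zes": 6, "zeven": 7, "acht": 8, "negen": 9, "tien": 10,
}

# Irregular ordinals that cannot be derived by stripping a suffix.
_irregular_ordinals = {"eerste": 1, "derde": 3}


def parse_ordinal(word: str) -> Optional[int]:
    """Return the value of a Dutch ordinal from 1 to 10, or None."""
    if word in _irregular_ordinals:
        return _irregular_ordinals[word]
    for stem, value in _cardinals.items():
        if value in _irregular_ordinals.values():
            # "eende" and "driede" are not valid, so skip regular formation.
            continue
        # Same suffix rule as ordinal_or_cardinal: 'ste' after t, g or d.
        suffix = "ste" if stem[-1] in ("t", "g", "d") else "de"
        if word == stem + suffix:
            return value
    return None
```

Matching the whole word against stem plus expected suffix, rather than stripping 'ste' or 'de' first, also rejects forms with the wrong suffix such as "achtde".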

I will check out the link you provided.

Thanks,

Jo
