I wrote code that generates numbers as Dutch text strings and vice versa #3197
Replies: 5 comments
-
I adapted it for German. The script is built from these top-level pieces:

- `units_and_tens_re = re.compile(...)`
- `text_lookup = {}`
- `def split_of_hundreds(number):`
- `def parse_number(number: str) -> ([None,int], [None,bool]):`
- `def ordinal_or_cardinal(cardinal_number:str, ordinal:bool, modulo_1000:str):`
- `def convert_number_to_text(n:int, ordinal:bool = False, _complete_number:int = 0):`
- `if __name__ == "__main__":`
-
Thanks, this is super cool! A slightly simplified version of this could be used for the […]

You could also consider offering your scripts as a spaCy plugin that sets custom extension attributes on tokens and spans. This way, you won't have to worry about spaCy's internals while still giving users who want to try it an easy way to install it and plug it into their spaCy pipeline 🙂
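A minimal sketch of the plugin idea, assuming a spaCy version that provides `Token.set_extension` and `spacy.blank("nl")`. The attribute name `numeric_value` and the tiny lookup table are hypothetical stand-ins for the converter discussed in this thread, not part of spaCy or of the original script.

```python
import spacy
from spacy.tokens import Token

# Hypothetical stand-in for the real converter from this thread.
DEMO_VALUES = {"twaalf": 12, "driehonderdtwaalf": 312}


def numeric_value(token):
    """Return the integer a token spells out, or None if it is not a number word."""
    return DEMO_VALUES.get(token.text.lower())


# Register the custom attribute once; it is then available as token._.numeric_value.
if not Token.has_extension("numeric_value"):
    Token.set_extension("numeric_value", getter=numeric_value)

nlp = spacy.blank("nl")  # blank Dutch pipeline, tokenizer only
doc = nlp("Ik heb twaalf appels")
values = [token._.numeric_value for token in doc]
```

Because the value lives under `token._`, nothing in spaCy's own attributes (such as `lemma_`) has to be repurposed.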
-
Hi Ines,

If the second part of the returned tuple is True or False, like_num can be set to True. If it's None, the string is extremely unlikely to be a number. Since these strings have a very specific semantic meaning, I wanted to go that additional step, the way it would make sense to do for strings representing date and time values.

Of course, if you have 2 adjacent like_num strings, you'd still have to add them up:

achthonderdvierenzestigduizend driehonderdentwaalf
864000 + 312 = 864312

That needs to be done at another level.

I see that lemma contains "be" for "am" in English. Would that be a good place to store the value represented by each number written as a single word? Or is there a better place to keep track of the semantics?

Jo
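The tuple convention and the adding-up of adjacent words described above could be sketched as follows. The `parse_number` here is only a hypothetical stand-in that knows the two example words, and `like_num` and `combine_adjacent` are illustrative names. Plain summing is a simplifying assumption; it would also merge unrelated neighbours such as "twee drie".

```python
from typing import List, Optional, Tuple, Union

# Stand-in lookup for the real parser, which is not shown in this thread.
_DEMO_VALUES = {
    "achthonderdvierenzestigduizend": 864000,
    "driehonderdentwaalf": 312,
}


def parse_number(word: str) -> Tuple[Optional[int], Optional[bool]]:
    """Return (value, is_ordinal); (None, None) when the word is not a number."""
    value = _DEMO_VALUES.get(word.lower())
    if value is None:
        return None, None
    return value, False


def like_num(word: str) -> bool:
    """True when the second tuple element is True or False, i.e. not None."""
    _, is_ordinal = parse_number(word)
    return is_ordinal is not None


def combine_adjacent(words: List[str]) -> List[Union[str, int]]:
    """Sum runs of adjacent cardinal number words; pass other words through."""
    result: List[Union[str, int]] = []
    pending: Optional[int] = None
    for word in words:
        value, is_ordinal = parse_number(word)
        if value is not None and is_ordinal is False:
            pending = value if pending is None else pending + value
            continue
        if pending is not None:
            result.append(pending)
            pending = None
        result.append(word)
    if pending is not None:
        result.append(pending)
    return result


total = combine_adjacent(["achthonderdvierenzestigduizend", "driehonderdentwaalf"])
```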
-
Hello,

https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/

Coming to your question: no, lemmatizer files are for other purposes. Lemmatization is one thing and charging numbers with meaning is another; one usually handles different types of semantics in different resources. Your task consists of 2 parts:

Lemmatization is about the semantic root of the word, something completely different 🏭

For details, you can gitter/email me @PolyglotOpenstreetmap, I can help with the details if you wish.
-
Hi DuyGuy,

You made me realise it would indeed be possible to solve it entirely with a regex. It became a monster, but it also showed that my previous solution had some flaws for the ordinal exceptions eerste and derde, and for their wrong counterparts eende and driede.

I will check out the link you provided.

Thanks,

Jo
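To illustrate the kind of exception handling mentioned above, here is a much smaller sketch than the regex described in this post. It only covers 1 to 99, and all names (`UNITS_AND_TENS_RE`, `parse_below_hundred`, the lookup tables) are assumptions made for the example. It accepts eerste and derde and rejects eende and driede.

```python
import re
import unicodedata

CARDINALS = {
    "een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5, "zes": 6,
    "zeven": 7, "acht": 8, "negen": 9, "tien": 10, "elf": 11, "twaalf": 12,
    "dertien": 13, "veertien": 14, "vijftien": 15, "zestien": 16,
    "zeventien": 17, "achttien": 18, "negentien": 19,
}
TENS = {
    "twintig": 20, "dertig": 30, "veertig": 40, "vijftig": 50,
    "zestig": 60, "zeventig": 70, "tachtig": 80, "negentig": 90,
}
# Ordinals that are not simply cardinal + "de".
IRREGULAR_ORDINALS = {"eerste": 1, "derde": 3, "achtste": 8}

# Optional unit + linking "en"/"ën", a multiple of ten, optional ordinal "ste".
UNITS_AND_TENS_RE = re.compile(
    r"(?:(?P<unit>een|twee|drie|vier|vijf|zes|zeven|acht|negen)(?:en|ën))?"
    r"(?P<ten>twintig|dertig|veertig|vijftig|zestig|zeventig|tachtig|negentig)"
    r"(?P<ordinal>ste)?"
)


def parse_below_hundred(word):
    """Return (value, is_ordinal) for 1..99, or (None, None) if not recognised."""
    word = unicodedata.normalize("NFC", word.lower())
    if word in IRREGULAR_ORDINALS:
        return IRREGULAR_ORDINALS[word], True
    if word in CARDINALS:
        return CARDINALS[word], False
    # Regular ordinals below 20: cardinal + "de", e.g. "vierde", "tiende".
    if word.endswith("de") and word[:-2] in CARDINALS:
        value = CARDINALS[word[:-2]]
        if value in IRREGULAR_ORDINALS.values():
            return None, None  # rejects "eende", "driede", "achtde"
        return value, True
    match = UNITS_AND_TENS_RE.fullmatch(word)
    if match is None:
        return None, None
    value = TENS[match["ten"]]
    if match["unit"]:
        value += CARDINALS[match["unit"]]
    return value, match["ordinal"] is not None
```

Keeping the irregular forms in a lookup checked before the general rule is one way to make sure the regular pattern never produces or accepts the wrong counterparts.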
-
Feature description
This code can convert a text string to the corresponding integer for Dutch text written as a single word.
For the reverse it can go up to 999999, splitting after the word 'duizend'.
I don't know how to go about writing the test code or how to properly integrate it into spaCy.
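For readers who want a feel for the integer-to-text direction, here is a minimal sketch, not the code from this post. It assumes the common spelling without the optional "en" after "honderd", so 312 comes out as driehonderdtwaalf rather than driehonderdentwaalf, and it covers cardinals only. The function names are made up for the example.

```python
UNITS = [
    "nul", "een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht",
    "negen", "tien", "elf", "twaalf", "dertien", "veertien", "vijftien",
    "zestien", "zeventien", "achttien", "negentien",
]
TENS = {
    2: "twintig", 3: "dertig", 4: "veertig", 5: "vijftig",
    6: "zestig", 7: "zeventig", 8: "tachtig", 9: "negentig",
}


def below_hundred(n):
    """Spell out 1..99; units come before tens, joined by 'en'."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # After "twee" and "drie" the linking "en" is written with a diaeresis.
    link = "ën" if unit in (2, 3) else "en"
    return UNITS[unit] + link + TENS[tens]


def below_thousand(n):
    """Spell out 1..999 as a single word."""
    hundreds, rest = divmod(n, 100)
    text = ""
    if hundreds:
        # 100 is "honderd", not "eenhonderd".
        text = ("" if hundreds == 1 else UNITS[hundreds]) + "honderd"
    if rest:
        text += below_hundred(rest)
    return text


def number_to_dutch(n):
    """Spell out 0..999999, with a space after 'duizend'."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported")
    if n == 0:
        return UNITS[0]
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        prefix = "" if thousands == 1 else below_thousand(thousands)
        parts.append(prefix + "duizend")
    if rest:
        parts.append(below_thousand(rest))
    return " ".join(parts)
```

A first set of tests could simply assert a handful of known pairs in both directions, for example that 864312 and its spelled-out form round-trip through the two converters.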