Some edge cases for nld rules #46
OK, I tried to reproduce this, but that proved to be hard... Or with the -v option:
This looks OK imho, so the question arises: do you run ucto with the correct language selected? If so, then please collect a few examples in a file and attach it here, together with the exact command line you use to test it. Thanks! |
The sentences with the *ma examples that seem so strange are as follows: These outputs were gathered directly using |
I still cannot reproduce this:
Output:
I also tried frog on this file:
puzzling |
I think I know what the problem is here. You are probably using the released uctodata version 0.5 from Oct 18.
This might explain this drama :) |
I will try to fix some of these sub-issues. Then release a new uctodata and make new separate issues for the unresolved ones. |
It looks like I obtained LaMachine at the beginning of November, but the uctodata is indeed v0.5 in the VERSION file. |
LaMachine is normally built on the stable releases of our software.
You could argue about the use of smileys in a lot of texts. ?: |
uctodata v0.6 is released now. |
Thanks for your efforts. I'm updating to v2 now, will let you know if there's more trouble. |
Pay close attention to Flemish political parties, they like messing with symbols in their abbreviations: N-VA is another like SP.A ;) There's also CD&V, and a minor one called Red! (which reminds me, make sure Yahoo! also keeps its exclamation point... I bet that's frustrated so many programmers over the years...) |
Well, we can't make everybody happy... Maybe we can add an 'exceptions' list to ucto. As an extra parameter. Maybe that is feasible. Need some thinking. |
Yea, I guess it's a stretch to account for all cases like that, because there's no end to that path... The fun part is that I'm also running ucto on text containing XML tags, which seems to confuse it somewhat, e.g. plural 's is chopped off when the token is enclosed in a tag: |
No, regarding parsing XML with regexps I gladly refer to this link. |
I understand XML documents are not "regular" in the sense of regular expressions... I wasn't trying to parse XML using regular expressions, I was trying to tokenize text that contains XML tags :) |
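To illustrate the distinction made above between parsing XML and tokenizing text that happens to contain tags, here is a minimal Python sketch. It is not how ucto works; the function name `tokenize_with_tags` and the sample sentence are made up for illustration, and the pattern only copes with simple tags that contain no `<` or `>` inside attribute values. Tags are set aside as single tokens and only the text between them is split, so a plural 's inside a tag stays attached to its word.

```python
import re

# Hypothetical helper, not part of ucto: one capturing group so that
# re.split() keeps the matched tags in its result list.
TAG = re.compile(r"(<[^<>]+>)")

def tokenize_with_tags(text):
    """Keep simple XML-like tags whole; split the remaining text on whitespace."""
    tokens = []
    for index, segment in enumerate(TAG.split(text)):
        if index % 2 == 1:
            # Odd positions hold the captured tags.
            tokens.append(segment)
        else:
            # Even positions hold plain text (possibly empty).
            tokens.extend(segment.split())
    return tokens

print(tokenize_with_tags("twee <b>auto's</b> rijden"))
```

A real tokenizer would still apply its own rules to the text segments; the whitespace split here only stands in for that step.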
I added an experimental option to ucto: --add-tokens. This option should be given a file containing words/tokens that should stay untokenized. The example file is:
When these words/tokens appear in a text, they stay untouched. When running ucto on this file like this:
|
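The idea of a list of tokens that must stay untokenized can be sketched in a few lines of Python. This is only an illustration of the concept, not ucto's implementation and not the file format that --add-tokens expects; the `tokenize` function and its fallback rules are assumptions made for the example, and the protected words are the ones mentioned earlier in this thread.

```python
import re

def tokenize(text, protected=()):
    """Split text into word and punctuation tokens, keeping protected strings whole."""
    # Try longer protected strings first so they win over their own prefixes.
    # The lookahead stops a protected string from matching the start of a longer word.
    alternatives = [
        re.escape(p) + r"(?!\w)"
        for p in sorted(protected, key=len, reverse=True)
    ]
    # Fallback rules (an assumption for this sketch): a run of word
    # characters, or a single non-space symbol.
    alternatives += [r"\w+", r"[^\w\s]"]
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(text)

protected = ["SP.A", "N-VA", "CD&V", "Yahoo!"]
sentence = "SP.A en N-VA lezen Yahoo!"
print(tokenize(sentence))             # every symbol is split off
print(tokenize(sentence, protected))  # listed tokens stay untouched
```

Without the list, the fallback rules split `SP.A` into three tokens; with it, the abbreviation survives as one token.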
Closing this issue. It's too general anyway. |
I think I may have found a few edge cases where the rules for Dutch split words incorrectly: