Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

values() bug with unicode characters #20

Open
andresriancho opened this issue Mar 27, 2014 · 12 comments
Open

values() bug with unicode characters #20

andresriancho opened this issue Mar 27, 2014 · 12 comments

Comments

@andresriancho
Copy link

>>> m = Trie([chr(x) for x in range(0, 255)])
>>> m[u'core/encoding_utf8/\xe9.py'] = 1
>>> m[u'core/encoding_utf8/\u6539.py'] = 2
>>> m.values(u'core/encoding_utf8')
[1]
>>> u'core/encoding_utf8/\u6539.py' in m
True

I would expect this output instead:

...
>>> m.values(u'core/encoding_utf8')
[1, 2]
...

Using datrie==0.6.1

andresriancho added a commit to andresriancho/django-moth that referenced this issue Mar 27, 2014
* Better handling for unicode
* Found pytries/datrie#20
andresriancho added a commit to andresriancho/django-moth that referenced this issue Mar 27, 2014
@kmike
Copy link
Member

kmike commented Mar 27, 2014

This is likely happens because of lack of validation. You alphabet is 0-255, but the data contain values outside this alphabet.

Unfortunately, libdatrie (which is this wrapper using) doesn't really support unicode; it only supports mapping a subset of Unicode (alphabet you're passing) to a single-byte 0-255 range.

So, to fix that the following is needed:

  1. make validation better;
  2. add notes to README about how alphabet work;
  3. maybe ditch libdatrie unicode support and do it ourselves - encode everything to utf8 and use a dummy 0-255 -> 0-255 mapping. The tricky part would be custom iteration because some chars will become multibyte.

I don't have plans to work on this in near future, so contributions are welcome.

What datrie version are you using? There was validation improvements in datrie 0.7.

andresriancho added a commit to andresriancho/w3af that referenced this issue Mar 27, 2014
* Failed to complete encoding tests due to pytries/datrie#20
* Enable these tests @ CircleCI #1368
@kmike
Copy link
Member

kmike commented Mar 27, 2014

I've taken a quick look at your code, and it seems you're building a trie at the startup and then just using it. If so, you may give https://github.com/kmike/DAWG a shot - it supports unicode properly, and it is faster and more memory efficient than datrie. The downside is that you can't modify DAWG after building.

@andresriancho
Copy link
Author

This is likely happens because of lack of validation. You alphabet is 0-255, but the data contain values outside this alphabet.

Are you sure? I mean... those special characters are translated to multi-byte somewhere, and those bytes are always 0-255.

I had to change to [chr(x) for x in range(0, 255)] because of this "false positive":

>>> from datrie import Trie
>>> import string
>>> m = Trie(string.printable)
>>> m[u'a'] = 1
>>> m[u'á'] = 1
>>> m[u'é'] = 1
>>> u'ì' in m
True

This looks like a similar/related bug.

So, to fix that the following is needed...

I believe 3) is the best option. Before sending anything to the library you simply encode('utf-8') and that that's it. It would be at wrapper level and potentially fix all these issues.

What datrie version are you using? There was validation improvements in datrie 0.7.

Using datrie==0.6.1

@andresriancho
Copy link
Author

you may give https://github.com/kmike/DAWG a shot

Will try it out, thanks for the link.

If nothing works, I might end up working on 3), does that make sense to you?

@andresriancho
Copy link
Author

After some minor tests: DAWG won't be useful in my case since my values are instances, not bytes.

>>> class A():
...     pass
... 
>>> a = A()
>>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', a)]
>>> bytes_dawg = dawg.BytesDAWG(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dawg.pyx", line 473, in dawg.BytesDAWG.__init__ (src/dawg.cpp:7904)
  File "dawg.pyx", line 288, in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5795)
  File "dawg.pyx", line 37, in dawg.DAWG.__init__ (src/dawg.cpp:1982)
  File "dawg.pyx", line 472, in genexpr (src/dawg.cpp:7725)
TypeError: Expected bytes, got instance
>>>

@andresriancho
Copy link
Author

Source required to do 3) is in C. Haven't touched that in ages. Won't be able to help.

@kmike
Copy link
Member

kmike commented Mar 27, 2014

Even better than (3) is to talk to libdatrie author and make its "unicode support" optional. Currently, alphabet is a range of unicode chars that datrie support. It doesn't use multibyte characters, it always encode char using a single byte. For other wrappers I usually encode/decode data using utf8, but with libdatrie it means extra work/slowdowns as encoding/decoding will happen twice (in a wrapper and in a library).

You can use IntDAWG with indices and store values in a list (this is what datrie doing for you). Another option is https://github.com/kmike/marisa-trie - it is slower, but it is even more memory efficient, and it gives you indices for free.

@andresriancho
Copy link
Author

Running in the latest version of datrie (from git clone), gives a different result:

>>> from datrie import Trie
>>> m = Trie([chr(x) for x in range(0, 255)])
>>> m[u'core/encoding_utf8/\xe9.py'] = 1
>>> m[u'core/encoding_utf8/\u6539.py'] = 2
>>> m.values(u'core/encoding_utf8')
[1]
>>> u'core/encoding_utf8/\u6539.py' in m
False
>>>

Note the last "False", which differs from the initially reported "True". This makes the datrie a little bit more consistent, but still a bug IMHO. At least it should raise an exception when trying to add something that afterwards it will tell you that it is not there.

@andresriancho
Copy link
Author

Another option is https://github.com/kmike/marisa-trie

Oh man... you 💌 wrappers huh?

@kmike
Copy link
Member

kmike commented Mar 27, 2014

Here is one more for your pleasure: https://github.com/kmike/hat-trie :) I was looking for a perfect data structure at one point of my life..

I agree it should raise an exception. That said, it is documented that what you're trying to do won't work (see a warning here https://pypi.python.org/pypi/datrie/).

@andresriancho
Copy link
Author

You can use IntDAWG with indices and store values in a list (this is what datrie doing for you)

Thanks for the hint, finally I ended up using the same idea but with IntCompletionDAWG and works like a charm!

andresriancho/django-moth@038419d

@eromoe
Copy link

eromoe commented Dec 19, 2017

Current unicode support is still buggy, #48

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants