[question] Different values on other implementations (murmur3 32 bits) #24

Closed
thor27 opened this issue Jul 11, 2018 · 8 comments

thor27 commented Jul 11, 2018

Hi!

I'm seeing different values from pyhash than from other murmur libraries in Python. Take a look:

In [1]: import pyhash

In [2]: import mmh3 # https://github.com/hajimes/mmh3

In [3]: import pymmh3 # https://github.com/wc-duck/pymmh3

In [4]: pyhash.murmur3_32()('foo')
Out[4]: 2085578581

In [5]: mmh3.hash('foo', signed=False)
Out[5]: 4138058784

In [10]: pymmh3.hash('foo') + 2**32
Out[10]: 4138058784

I've also tried this online tool: http://murmurhash.shorelabs.com/, which gives the same hash value, 4138058784. Why do the values from pyhash differ from those of other implementations? Is it possible to get the same result?

Thanks!
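
A side note on the `+ 2**32` above: the pymmh3 call returns the hash as a signed 32-bit integer, so a negative result has to be mapped into the unsigned range before it can be compared with the other libraries. A minimal sketch of that conversion, using only the values shown in this thread (the helper name `to_unsigned32` is made up for illustration). Masking also leaves non-negative values unchanged, which adding `2**32` unconditionally would not:

```python
def to_unsigned32(value: int) -> int:
    """Map a signed 32-bit integer onto the unsigned range 0 .. 2**32 - 1."""
    return value & 0xFFFFFFFF


# The signed value implied by `pymmh3.hash('foo') + 2**32 == 4138058784`.
signed_hash = 4138058784 - 2**32

print(signed_hash)                 # -156908512
print(to_unsigned32(signed_hash))  # 4138058784
print(to_unsigned32(4138058784))   # 4138058784 (already unsigned, unchanged)
```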

Owner

flier commented Jul 12, 2018

It seems you are using Python 3.x?

I have reproduced the issue in Python 2.7: the unicode string is hashed to 2085578581L.

In [1]: import pyhash

In [2]: pyhash.murmur3_32()('foo')
Out[2]: 4138058784L

In [3]: pyhash.murmur3_32()(u'foo')
Out[3]: 2085578581L

flier added the bug label Jul 12, 2018
flier added a commit that referenced this issue Jul 12, 2018
Owner

flier commented Jul 12, 2018

Please try the latest git commit; the results should now be consistent between Python 2.x and 3.x.

Author

thor27 commented Jul 12, 2018

Yes, I was using Python 3. I will update and test here. Thanks.

Author

thor27 commented Jul 12, 2018

Hi, the syntax for numbers with an L suffix does not work on Python 3:

In [1]: import pyhash
Traceback (most recent call last):

  File "/home/thomaz/projetos/thumbor/vpython-pyfasthash/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-1-6937f29f5d7b>", line 1, in <module>
    import pyhash

  File "/home/thomaz/projetos/thumbor/pyfasthash/pyhash.py", line 89
    bytes_hash=3698262380L,
                         ^
SyntaxError: invalid syntax
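
For context, the `L` suffix is Python 2's long-integer literal syntax. Python 3 has a single arbitrary-precision `int` type and rejects the suffix at parse time, which is what the traceback shows. A small sketch that reproduces this under Python 3 without importing pyhash, using the standard `ast` module (the helper `parses` is a made-up name):

```python
import ast


def parses(source: str) -> bool:
    """Return True if `source` is valid syntax for the running interpreter."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Python 2 long literal: rejected by Python 3.
print(parses("bytes_hash = 3698262380L"))  # False on Python 3
# Plain int literal: accepted, and large values need no suffix.
print(parses("bytes_hash = 3698262380"))   # True
```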

Author

thor27 commented Jul 12, 2018

I've tried simply removing every "L" suffix, and it seems to import fine, but it still doesn't work:

In [1]: import pyhash

In [2]: pyhash.murmur3_32()('foo'.encode('ascii'))
Out[2]: 4138058784

In [3]: pyhash.murmur3_32()('foo')
Out[3]: 2085578581

Owner

flier commented Jul 13, 2018

Sure. After you call 'foo'.encode('ascii') or use b'foo', the value is a bytes string in Python 3.x (or a str in Python 2.x), so its hash value (4138058784) is expected to differ from that of a unicode string (2085578581).

    # https://github.com/flier/pyfasthash/issues/24
    def testDefaultStringType(self):
        hasher = murmur3_32()

        self.assertEqual(hasher('foo'), hasher(u'foo'))
        self.assertNotEqual(hasher('foo'), hasher(b'foo'))

Besides, the L suffix will be automatically removed by the 2to3 conversion tool during the setup steps.
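
To see why the input bytes matter, here is a pure-Python sketch of the public MurmurHash3 x86 32-bit algorithm. It is not pyhash's code, and `murmur3_32_reference` is a name invented for this example. Fed the ASCII/UTF-8 bytes of 'foo' with seed 0, it gives the 4138058784 reported by mmh3 and pymmh3 earlier in this thread. The UTF-16 and UTF-32 lines only illustrate that the same text can map to different byte sequences; whether pyhash hashes either of those layouts for unicode input is an assumption this thread does not confirm.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32_reference(data: bytes, seed: int = 0) -> int:
    """Illustrative MurmurHash3 x86_32 over raw bytes (unsigned result)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4

    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, also read little-endian.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length, then avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


print(murmur3_32_reference(b"foo"))                  # 4138058784
print(murmur3_32_reference("foo".encode("utf-8")))   # same bytes, same hash
# Same text, different byte layouts (assumed encodings, for illustration):
print(murmur3_32_reference("foo".encode("utf-16-le")))
print(murmur3_32_reference("foo".encode("utf-32-le")))
```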

Author

thor27 commented Jul 13, 2018

Hi!
That is OK, I understand now, but is having different values for strings and bytes the desired behaviour? It is at least very error prone. It would be helpful to have at least a note about this in the README. For example, because of this I indexed a 35 TB (file-system-based) data structure incorrectly and had to reindex everything (a 3-day process). Anyway, now that I understand the issue, it's working as desired in my codebase. Thanks a lot for the support!
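
One way to guard against this kind of mix-up in calling code is to normalise every input to bytes with an explicit encoding before it reaches the hasher. A minimal sketch, assuming UTF-8 is the encoding you want; `hash_text` is a hypothetical helper, not part of pyhash, and `zlib.crc32` stands in for the hasher only so the example runs without extra dependencies:

```python
import zlib


def hash_text(hasher, value, encoding="utf-8"):
    """Hash `value` as bytes, encoding text explicitly first."""
    if isinstance(value, str):
        # Text: pick the byte representation ourselves.
        value = value.encode(encoding)
    elif isinstance(value, (bytearray, memoryview)):
        value = bytes(value)
    elif not isinstance(value, bytes):
        raise TypeError("expected str or bytes-like, got %s" % type(value).__name__)
    return hasher(value)


# Text and bytes now agree, because both reach the hasher as the same bytes.
print(hash_text(zlib.crc32, "foo") == hash_text(zlib.crc32, b"foo"))  # True
```

Going by the outputs earlier in this thread, passing `pyhash.murmur3_32()` as `hasher` should give 4138058784 for both 'foo' and b'foo', since the hasher then only ever sees the encoded bytes.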

Owner

flier commented Jul 24, 2018

In Python, str and unicode are totally different types, so the hash values are definitely different. Please check the discussion for more details :)

Sorry about the wasted time. I will update the README soon. Thanks.
