Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows build and wheel #492

Merged
merged 21 commits into from
Nov 6, 2015
Merged

Windows build and wheel #492

merged 21 commits into from
Nov 6, 2015

Conversation

tmylk
Copy link
Contributor

@tmylk tmylk commented Oct 19, 2015

TODO:

  • upload wheel to a fileshare
  • add steps to release wiki to download wheel from the share and upload to PyPi using wheelhouse-uploader and twine

@tmylk
Copy link
Contributor Author

tmylk commented Nov 2, 2015

Blocked by #470

Tests fail with "ValueError: Maximum allowed dimension exceeded"
https://ci.appveyor.com/project/tmylk/gensim-ahs80

@tmylk
Copy link
Contributor Author

tmylk commented Nov 3, 2015

Python 2 builds.

Python 3 Fails with "UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character".

Can't find a file with non-Ascii name though. Regex search returns nothing:

find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'

@gojomo
Copy link
Collaborator

gojomo commented Nov 3, 2015

The cfgdata being encoded by the offending line comes from get_initdata(), which appears to include not just a filename but other package metadata, including author name: https://github.com/python/cpython/blob/7fae13e4889ca44bbf22b61af277aca6d5dc13b9/Lib/distutils/command/bdist_wininst.py#L209

@tmylk
Copy link
Contributor Author

tmylk commented Nov 3, 2015

@gojomo Thanks a lot, that was the problem.

Next step is Python3.5 on 64-bit windows

=====================================================================
ERROR: Test DBOW doc2vec training.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python35-x64\lib\site-packages\gensim\test\test_doc2vec.py", line 160, in test_dbow_hs
    model = doc2vec.Doc2Vec(list_corpus, dm=0, hs=1, negative=0, min_count=2, iter=20)
  File "C:\Python35-x64\lib\site-packages\gensim\models\doc2vec.py", line 584, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 501, in build_vocab
    self.finalize_vocab()  # build tables & arrays
  File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 631, in finalize_vocab
    self.reset_weights()
  File "C:\Python35-x64\lib\site-packages\gensim\models\doc2vec.py", line 604, in reset_weights
    super(Doc2Vec, self).reset_weights()
  File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 953, in reset_weights
    self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
  File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 965, in seeded_vector
    once = random.RandomState(uint32(self.hashfxn(seed_string)))
OverflowError: Python int too large to convert to C long

@gojomo
Copy link
Collaborator

gojomo commented Nov 3, 2015

I've seen that error reported before by Windows users, and they found a workaround. (Maybe in old github issues or forum threads? I even think the unint32() call may be an attempted workaround.)

But also, that error makes me suspicious as to whether that build is really 64-bit in all important ways. Could some 32-bit libraries/steps have snuck in, and the 'x64' in the name only reflects intent and not actuality?

To really understand what's going wrong, I might also try unpacking that line into 3...

h = self.hashfxn(seed_string) 
ui = uint32(h)
once = random.RandomState(ui)

...to know exactly which one is triggering the error, and (with an extra print or catch) exactly what Python int value is breaking things.

@tmylk
Copy link
Contributor Author

tmylk commented Nov 5, 2015

I guess that the built-in hash function returns more than uint32 number on a 64-bit platform.
We need a uint32 to pass into random.RandomState.

@gojomo Is there anything wrong with this hack from the mailing list?

hashfxn = lambda value:hash(value) & 0xffffffff

Otherwise have to add a new dependency on hashlib.

Duplicate of #321

ERROR: test_zero_workers_mode (gensim.test.test_word2vec.TestWord2VecModel)

Traceback (most recent call last):
File "C:\Python35-x64\lib\site-packages\gensim\test\test_word2vec.py", line 124, in test_zero_workers_mode
model = word2vec.Word2Vec(sentences, min_count=1)
File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 435, in init
self.build_vocab(sentences, trim_rule=trim_rule)
File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 501, in build_vocab
self.finalize_vocab() # build tables & arrays
File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 631, in finalize_vocab
self.reset_weights()
File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 953, in reset_weights
self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
File "C:\Python35-x64\lib\site-packages\gensim\models\word2vec.py", line 966, in seeded_vector
ui = uint32(h)
OverflowError: Python int too large to convert to C long
-------------------- >> begin captured logging << --------------------

@gojomo
Copy link
Collaborator

gojomo commented Nov 5, 2015

Yes, the built-in hash function will give larger-than-32-bit numbers by design, and numpy.uint32 is already supposed to be doing the casting/clipping down to the range numpy.random.RandomState needs.

So, the core problem is numpy.uint32() isn't working on this Windows config. On other 64-bit platforms, it works. For example, try numpy.uint32(sys.maxsize+1).

We can do the range-clipping ourselves with a bit-mask like in that workaround. However, since allowing a user-defined hashfxn is intended, simply putting the fix there is fine for an individual user's workaround but non-ideal for the general case. I'd instead change the line in seeded_vector() from:

    once = random.RandomState(uint32(self.hashfxn(seed_string)))

...to...

    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)

That eliminates the dependency on uint32() (there) entirely, and will keep working if the user has some reason to set a custom hashfxn that gives full-range hashes.

(That uint32() can't be relied upon to do the conversion it's supposed to might still be a time-bomb for anyone using numpy on such a Windows setup, but this one particular function should be OK...)

@piskvorky
Copy link
Owner

@gojomo if that is the case, it may be worth reporting to the NumPy devs as well.

I'm surprised such a bug would be there though -- lots of people use NumPy on Windows 64 AFAIK.

@gojomo
Copy link
Collaborator

gojomo commented Nov 6, 2015

I'm not the best reporter to Numpy, since I can't directly reproduce... but I gave it a try at numpy/numpy#6637. (I've seen mentions that 64-bit Windows isn't officially supported by Numpy, so unsure how much attention it'll get.)

tmylk added a commit that referenced this pull request Nov 6, 2015
@tmylk tmylk merged commit 06f9e13 into develop Nov 6, 2015
@piskvorky piskvorky deleted the appveyor branch November 6, 2015 13:20
@tmylk
Copy link
Contributor Author

tmylk commented Nov 6, 2015

Windows build is now green.
Thanks to @gojomo and @piskvorky for making it possible.

Windows wheels coming soon!

@piskvorky
Copy link
Owner

Great :)

@ghost ghost mentioned this pull request Nov 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants