word-vectors outside train()-update tokens being changed; also, vectors_lockf 0.0-values don't suppress changes #3100
I'd still like to know & fix this if there's a real problem, but digging in would require more detail:
Is there a particular reason you've picked the range 550000:551000 for special treatment? (Off the top of my head, that doesn't seem a threshold meaningful to potential classes of bugs, like pointer overflows.)
That any word-vectors outside the used-tokens of the training-update change at all, regardless of any lockf setting, would be its own problem.
Also, the potential size of this model means you're touching perhaps hundreds of GB of RAM (and maybe, but hopefully not, any virtual-memory IO). It'd be helpful to do one or both of: (1) a memory-test on the current machine; (2) see if this reproduces similarly on any other system. |
Thanks for the fast response. In summary, there are two distinct problems: one is that word-vectors outside the update tokens change at all, the other is that vectors_lockf 0.0-values don't suppress changes.
I re-tested with gensim 4.0.1 (the most recent version) without setting vectors_lockf.
To check for bias among the updated words, I additionally tested their distribution:

from collections import Counter

updated_words = [idx for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
Counter([int(x/10000000)*10000000 for x in updated_words])

Counter({0: 59072,
         10000000: 62699,
         20000000: 62937,
         30000000: 62400,
         40000000: 61929,
         50000000: 62698,
         60000000: 62667,
         70000000: 62672,
         80000000: 62810,
         90000000: 62591})

The updated words are distributed almost uniformly.
Nope, I just picked arbitrary numbers. I believe changing them should not affect any of the results above.
(1) After rebuilding this system last December, I tested the entire memory with memtest86, and there were no problems over multiple iterations.
I use the following iterator/generator to generate random sequences (not the entire code, but it should be enough to reproduce the problem).

import random
import numpy as np

class RandomnumberIterator():
    def __init__(self, generator_function, max_number, random_seed, max_iteration, max_walking_dist):
        self.random_seed = random_seed
        self.max_iteration = max_iteration
        self.generator_function = generator_function
        self.max_walking_dist = max_walking_dist
        self.max_number = max_number
        self.generator = self.generator_function(random_seed, max_number, max_iteration, max_walking_dist)

    def __iter__(self):
        # Restart the generator so the corpus can be iterated multiple times (epochs).
        self.generator = self.generator_function(self.random_seed, self.max_number,
                                                 self.max_iteration, self.max_walking_dist)
        return self

    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        else:
            return result

    def __len__(self):
        return self.max_iteration

def RandomWalkGenerator(random_seed, max_number, max_iteration, max_walking_dist):
    random.seed(random_seed)
    np.random.seed(random_seed)
    for it in range(max_iteration):
        # Each "sentence" is max_walking_dist random integer tokens, rendered as strings.
        yield list(np.random.randint(0, high=max_number, size=max_walking_dist).astype(str))
from copy import deepcopy
from gensim.models import Word2Vec

# Initial training
vector_dim = 100
windowsize = 5
vocab_max_number = 100000000
rand_seed = 0
num_sentences = 50000000
sentence_length = 30
now_it = RandomnumberIterator(RandomWalkGenerator, vocab_max_number, rand_seed, num_sentences, sentence_length)
base_model = Word2Vec(now_it, vector_size=vector_dim, window=5, min_count=1, workers=8)

# Update step
num_additional_sentences = 1000000  # additional training sentences
additional_sentence_length = 10  # sentence length for the update
additinal_sentences_vocabs = 551000
rand_seed = 0
tunned_model = deepcopy(base_model)
now_it = RandomnumberIterator(RandomWalkGenerator, additinal_sentences_vocabs, rand_seed, num_additional_sentences, sentence_length)  # note: passes sentence_length (30), not additional_sentence_length
tunned_model.train(now_it, total_examples=num_additional_sentences, epochs=5)
It is also suspicious. |
Thanks for the MRE code - but can you show your code that concludes out-of-range vectors change, too? What's the smallest setup (vocab size/model) that still shows the problem?
That a part of your config is Intel-MKL on an officially-unsupported processor is a concern. Is that necessary to reproduce the problem? (Won't other officially-supported libraries only lose a little bit of performance?) |
Here it is:

temp_diffs_list = []              # comparison starts from an empty list
base_vector = base_model.wv       # presumably the models' KeyedVectors
tunned_vector = tunned_model.wv
for vocab in range(vocab_max_number):
    now_paperid = str(vocab)
    if now_paperid in tunned_vector.key_to_index and now_paperid in base_vector.key_to_index:
        diff = np.sum(base_vector[now_paperid] - tunned_vector[now_paperid])
        temp_diffs_list.append(diff)
    else:
        temp_diffs_list.append(0)  # for words never used in either the initial training or the update
Maybe somewhere between 10,000,000 and 50,000,000; I think it requires about 100GB of RAM. There was no problem at a vocab size of 10,000,000.
I am not sure, but I will try on another system. |
Never mind. I was confused. In this test, I used
|
As you requested, I tested on another machine with the following spec.
Detailed dependency versions are as follows:
I changed the random seed, so the numbers do not exactly match the earlier test (on the Threadripper 3990X).
The updated words are also distributed uniformly:

updated_words = [idx for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
Counter([int(x/10000000)*10000000 for x in updated_words])

Counter({0: 59156,
         10000000: 62747,
         20000000: 62260,
         30000000: 62610,
         40000000: 62138,
         50000000: 62313,
         60000000: 62403,
         70000000: 62477,
         80000000: 62456,
         90000000: 62200})

Regarding gensim's internal index, the abnormally updated vectors are distributed as follows:

updated_words = [base_vector.key_to_index[str(idx)] for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
sorted(Counter([int(x/1000000)*1000000 for x in updated_words]).items())
[(0, 21719),
(1000000, 21691),
(2000000, 21572),
(3000000, 21438),
(4000000, 22119),
(5000000, 21374),
(6000000, 21599),
(7000000, 21405),
(8000000, 21797),
(9000000, 21686),
(10000000, 22173),
(11000000, 21424),
(12000000, 22106),
(13000000, 21628),
(14000000, 12259),
(15000000, 10942),
(16000000, 10906),
(17000000, 10800),
(18000000, 11011),
(19000000, 10853),
(20000000, 11012),
(21000000, 10880),
(22000000, 10846),
(23000000, 11265),
(24000000, 10643),
(25000000, 10781),
(26000000, 11137),
(27000000, 11080),
(28000000, 11209),
(29000000, 10926),
(30000000, 10890),
(31000000, 11084),
(32000000, 10898),
(33000000, 10743),
(34000000, 10543),
(35000000, 10953),
(36000000, 11065),
(37000000, 10823),
(38000000, 10797),
(39000000, 10812),
(40000000, 10881),
(41000000, 10715),
(42000000, 10275)]

In summary, the problem persists on the other machine (with a different CPU vendor). I think this will be my last test; I have already tested quite a lot :> |
Last shot: AMD 3990X with Intel MKL.

blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
None
Linux-5.4.0-70-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
Bits 64
NumPy 1.19.2
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 1
updated_words = [idx for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
Counter([int(x/10000000)*10000000 for x in updated_words])
Counter({0: 59156,
10000000: 62747,
20000000: 62260,
30000000: 62610,
40000000: 62138,
50000000: 62313,
60000000: 62403,
70000000: 62477,
80000000: 62456,
90000000: 62200})

It seems that both MKL and OpenBLAS show a similar problem. |
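For anyone reproducing this, the BLAS backend being compared above can be confirmed with NumPy's and SciPy's standard config helpers (a generic check, not code from this thread):

```python
import numpy as np
import scipy

np.__config__.show()      # prints blas_mkl_info / openblas_info etc.
scipy.__config__.show()
```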
Thanks for all your work reproducing this under multiple configurations! The main thing that would help now, given the rarity of debugging machines with over 100GB RAM, is to determine the true minimum RAM in which this problem can be shown - even if, for example, it takes a larger random dataset. It seemed above it was reliably triggered at a 50,000,000-size vocabulary, with 100-dimensional vectors. Can you try 50M/100M with toy-sized 2-dimensional vectors, & see if that might also show the problem (in a lot less time/space)? |
Interestingly, there is no problem with 2-dim vectors.
|
Thanks! That a (50M * 2 dims * 4 bytes) 400MB set of vectors doesn't show the problem, but a (50M * 100 dims * 4 bytes) 20GB set does, is informative. Given the kinds of power-of-2 thresholds often involved with pointer-type errors, it'd similarly be useful to probe around the 3GB-or-5GB, or 15GB-or-17GB thresholds. Specifically:
around 3GB: (50M words * 16 dims * 4 bytes), and/or (7.5M words * 100 dims * 4 bytes)
around 15GB: (50M words * 75 dims * 4 bytes), and/or (37.5M words * 100 dims * 4 bytes)
If some of these reliably show erroneous updates, while others definitively don't, we'll have extra hints of the source of the problem, and possibly a much-smaller-in-RAM-demands minimum-reproducible-example. |
Here are the results @gojomo requested:
Dims: 16, Vocab Size: 50000000, OPENBLAS
Dims: 24, Vocab Size: 50000000, OPENBLAS
Dims: 75, Vocab Size: 50000000, OPENBLAS
Dims: 84, Vocab Size: 50000000, OPENBLAS
Add-on: Dims: 90, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS |
This will be the last test on my side.
Dims: 84, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 85, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 86, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 87, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 88, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 89, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Please note that your memory-usage calculation may have been divided by 1000 rather than 1024 (for 84 dimensions, the memory usage is certainly below 16GB by my calculation). Here is a summary of the new test:
As memory use grew beyond 16GB, the errors increased proportionally. |
Thank you; this is great focusing-in on the threshold of the problem. I'd been mistakenly using decimal GB when binary GiB (pedantically, 'gibibytes') are the relevant numbers for this sort of addressing. That the errors begin as soon as the array is larger than 2^32 total floats strongly hints some erroneously-narrow pointer calcs are happening, and/or some high-bytes in a properly 64-bit pointer are being corrupted. I'll comb over the relevant code soon. |
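To make that threshold concrete, here is a small sanity-check of the arithmetic (my own, not from the thread): a 50M-word float32 array crosses 2^32 total floats, i.e. 16 GiB, between 85 and 86 dimensions.

```python
# Quick arithmetic check: where does a 50M-word float32 vector array
# exceed 2**32 total floats (16 GiB)?
VOCAB = 50_000_000
for dims in (84, 85, 86, 87, 88, 89, 90, 100):
    total_floats = VOCAB * dims
    gib = total_floats * 4 / 2**30            # 4 bytes per float32
    print(f"dims={dims:3d}  floats={total_floats:,}  {gib:6.2f} GiB  over 2^32: {total_floats > 2**32}")
```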
In trying to understand the proposed fix in #3136, it occurs to me that the triggering code here may not be showing quite what we thought. Notably, because the vocabulary is by default frequency-sorted, a low number (like 551000) used as a string key does not necessarily correspond to a low index position in the vectors array. There may still be a bug in the pointer types & arithmetic, but it's hard for me to see that kind of bug causing updates at later array positions. (An improperly truncated result type would typically cause extra updates at smaller index positions.) So at least some of the symptom here may be mis-interpreted. Is it truly the case that the fix in #3136 eliminates the position >551000 changes shown by this test code? |
The numbers (like 551000) are used as string vocab keys, not as positions (indexes).

for vocab in range(vocab_max_number):
    now_paperid = str(vocab)
    if now_paperid in tunned_vector.key_to_index and now_paperid in base_vector.key_to_index:
        diff = np.sum(base_vector[now_paperid] - tunned_vector[now_paperid])
        temp_diffs_list.append(diff)
    else:
        temp_diffs_list.append(0)

The numbered words < 551000 certainly can be anywhere in terms of position, of course. I generated random sequences of numbers simply for convenience.
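To make the key-vs-index distinction concrete, a tiny illustrative snippet (using the same model names as above):

```python
# The vocabulary is frequency-sorted, so the string key "551000" can land at
# any row of the vectors array; key_to_index gives its actual position.
key = "551000"
if key in base_model.wv.key_to_index:
    pos = base_model.wv.key_to_index[key]
    print(key, "-> index", pos)   # pos is generally unrelated to int(key)
```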
No - it eliminates the changes to words > "551000" (as keys), which indeed can be at any position. And note that it also fixed the issue on the real data with a ~65,000,000-word vocabulary at 100 dimensions. |
Problem description
I have a task to update a small number of words in a trained word2vec model.
Only a small number of target words should be updated. Thus, I trained the base model, then tried to suppress updates of non-target vectors during the additional training with trainables.vectors_lockf=0.
However, the behavior is weird and does not work as I expected. Even if I set vectors_lockf to 0.0 for the non-target words, the vectors of a small number of non-target words were still updated.
Steps/code/corpus to reproduce
I first found the problem while training on real data, which consists of ~60,000,000 unique words.
To exclude the possibility that the error comes from the dataset, I reproduced the problem with randomly generated sequences.
For the test, I made a vocab of integers ranging from 0 to max_val, which I can adjust.
Base training sentences are drawn from the vocab above, with a length of 30.
First, I trained the word2vec model on the randomly generated sentences.
Then, I updated the model with sequences of length 10, generated from integers from 0 to 551000. I suppressed the updates by setting vectors_lockf=0 for all words except odd numbers between 550000 and 551000.
I may conclude that vectors_lockf works abnormally for large vocab sizes.
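A minimal sketch of the locking step described above, assuming the gensim 3.8 attributes named in this report (model.trainables.vectors_lockf and the wv.vocab dict); this is illustrative, not the author's exact code, and the model names follow the MRE shown earlier in the thread:

```python
import numpy as np

# 0.0 freezes a word's vector during further training; 1.0 leaves it trainable.
wv = tunned_model.wv                                  # the copy that receives the extra train() call
lockf = np.zeros(len(wv.vocab), dtype=np.float32)
for token in map(str, range(550001, 551000, 2)):      # odd numbers between 550000 and 551000
    if token in wv.vocab:
        lockf[wv.vocab[token].index] = 1.0
tunned_model.trainables.vectors_lockf = lockf
```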
I use an AMD Threadripper 3990X with 256GB of DDR4 RAM (without ECC).
I haven't tested on other systems.
I faced a similar situation in Gensim 4 with the real data, but I have not yet tested Gensim 4 with the randomly generated sequences.
Thanks!
Versions
Please provide the output of:
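The report below appears to come from the standard version-reporting snippet in gensim's issue template, roughly as follows (reconstructed here, so treat it as an approximation):

```python
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
```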
Linux-5.4.0-66-generic-x86_64-with-glibc2.10
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]
Bits 64
NumPy 1.19.2
SciPy 1.5.2
gensim 3.8.3
FAST_VERSION 1