word-vectors outside train()-update tokens being changed; also, vectors_lockf 0.0-values don't suppress changes #3100

Closed
bluekura opened this issue Apr 1, 2021 · 15 comments · Fixed by #3136

Comments

@bluekura
Contributor

bluekura commented Apr 1, 2021

Problem description

I have a task that requires updating only a small number of words in a trained word2vec model.
Since only the target words should be updated, I trained the base model and then tried to suppress updates of the non-target vectors during the additional training by setting trainables.vectors_lockf=0 for them.
However, the behavior is weird and does not work as I expected.
Even though I set vectors_lockf to 0.0 for the non-target words, the vectors of a small number of non-target words were still updated.
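
Roughly, the workflow I'm after looks like this (a minimal sketch, assuming the gensim 3.8.x attribute layout; base_model, target_words, and new_sentences are placeholders for my own objects):

import numpy as np

# Placeholders: base_model (trained Word2Vec), target_words (list of words to tune),
# new_sentences (the additional training corpus).
# Freeze every word (0.0 = locked), then unlock only the target words (1.0 = trainable).
lockf = np.zeros(len(base_model.wv.index2word), dtype=np.float32)
for word in target_words:
    if word in base_model.wv.vocab:
        lockf[base_model.wv.vocab[word].index] = 1.0
base_model.trainables.vectors_lockf = lockf

# Continue training on the new sentences; only unlocked rows should move.
base_model.train(new_sentences, total_examples=len(new_sentences), epochs=5)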

Steps/code/corpus to reproduce

I first found the problem while training on real data, which consists of ~60,000,000 unique words.

To rule out the possibility that the error comes from the dataset, I reproduced the problem with randomly generated sequences.

For the test, I built a vocabulary of integers ranging from 0 to max_val, which I can adjust.
Each base training sentence is drawn from this vocabulary and has a length of 30.

First, I trained the word2vec model on the randomly generated sentences.
Then, I updated the model with sequences of length 10 generated from integers from 0 to 551000. I suppressed updates by setting vectors_lockf=0 for all words except the odd numbers between 550000 and 551000.

  • Thus, the expected differences in word vectors between the original model and the additionally tuned model are as follows:
[0:550000): no vectors should change, because vectors_lockf = 0
[550000:551000), odd numbers: all vectors may change, because vectors_lockf = 1
[550000:551000), even numbers: no vectors should change, because vectors_lockf = 0
[551000:max_val): no vectors should change, because these words are not in the additional training sequences
  • With vocab size of 100,000,000, vectors_lockf = 0 does not work normally.
[0:550000) : 4 vectors changed
[550000:551000), odd numbers:  212 vectors changed (as expected)
[550000:551000), even numbers: no vectors changed (as expected)
[551000:max_val): 572 vectors changed
  • With vocab size of 50,000,000, vectors_lockf = 0 does not work normally.
[0:550000) : no vectors changed (as expected)
[550000:551000), odd numbers:  439 vectors changed (as expected)
[550000:551000), even numbers: no vectors changed (as expected)
[551000:max_val): 118 vectors changed
  • With vocab size of 10,000,000, vectors_lockf = 0, it works well.
[0:550000) : no vectors changed (as expected)
[550000:551000), odd numbers:  500 vectors changed (as expected)
[550000:551000), even numbers: no vectors changed (as expected)
[551000:max_val): no vectors changed (as expected)
  • With vocab size of 5,000,000, vectors_lockf = 0, it works well.
[0:550000) : no vectors changed (as expected)
[550000:551000), odd numbers:  500 vectors changed (as expected)
[550000:551000), even numbers: no vectors changed (as expected)
[551000:max_val): no vectors changed (as expected)

I conclude that vectors_lockf misbehaves for large vocabulary sizes.

I use an AMD Threadripper 3990X with 256GB of DDR4 RAM (without ECC).
I haven't tested on other systems.
I saw a similar problem in Gensim 4 with the real data, but I have not yet tested Gensim 4 with the randomly generated sequences.

Thanks!

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

Linux-5.4.0-66-generic-x86_64-with-glibc2.10
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]
Bits 64
NumPy 1.19.2
SciPy 1.5.2
gensim 3.8.3
FAST_VERSION 1

@gojomo
Collaborator

gojomo commented Apr 1, 2021

Note that the lockf functionality is an experimental & advanced feature, essentially "use at your own risk, with source-code-level understanding". Further, in gensim-4.0.0, it's been stripped down further, to avoid memory-overhead costs in the overwhelmingly common case where it's not being used. It might not survive in future versions at all, unless there's published work reporting that it offers clear benefits unattainable in other ways.

I'd still like to know & fix if it's not suitable for experimentation, but digging in would require:

  • reproducing the problem in gensim-4.0.0, where .trainables has been eliminated and the _lockf arrays are no longer pre-allocated for the user - you'd have to expand the tiny array that's put there by default before tampering with per-word values (see the sketch after this list).
  • a full recipe showing the code that reproduces & verifies the error. (Any further info about what happens in gensim-3.8.3 is irrelevant, given the changes around this feature, unless it somehow showed the problem to be worse in gensim-4.0.0, and thus a possible new bug.)
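
For reference, expanding that default array might look something like this (a minimal sketch, assuming the gensim-4.x layout where the lockf array lives on model.wv; words_to_freeze is a placeholder):

import numpy as np

# gensim 4.x keeps only a tiny default vectors_lockf; grow it to one slot per
# word before setting per-word values. `words_to_freeze` is a placeholder list.
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32)
for word in words_to_freeze:
    model.wv.vectors_lockf[model.wv.key_to_index[word]] = 0.0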

Is there a particular reason you've picked the range 550000:551000 for special treatment? (Off the top of my head, that doesn't seem a threshold meaningful to potential classes of bugs, like pointer overflows.)

That any word-vectors outside the used tokens of the training-update change, regardless of any _lockf setting, is surprising, as in your 100,000,000-vocab example. So it'd be interesting, in gensim-4.0.0 reproduction attempts, to not mess with lockf settings/expectations at all. Just check: does training using only tokens 550000:551000 result in any changes outside that range? If so, there's a problem unrelated to lockf (as well as possible problems with lockf).

Also, the potential size of this model means you're touching perhaps hundreds of GB of RAM (and maybe, but hopefully not, any virtual-memory IO). It'd be helpful to do one or both of: (1) a memory-test on the current machine; (2) see if this reproduces similarly on any other system.

@bluekura
Contributor Author

bluekura commented Apr 2, 2021

Thanks for the fast response.

In summary, there are two distinct problems: one concerns the lockf function, and the other concerns updates to words that were not used in training.

reproducing the problem in gensim-4.0.0, where the .trainables has been eliminated, and the _lockf arrays are no longer pre-allocated for the user - you'd have to expand the tiny array that's put there by default for your own uses before tampering with per-word values.

I re-tested with gensim 4.0.1 (the most recent version) without setting vectors_lockf.
I generated random numbers in [0, 551000) for the update, so words numbered 551000 and above should not be updated. However, in this trial, 858,825 vectors were updated.
Specifically, the vectors changed as follows:

[0:551000): 236,350 vectors changed (as expected)
[551000:100,000,000): 622,475 vectors changed (~0.626% of those words updated)

To check for bias among the updated words, I also examined their distribution:

from collections import Counter

updated_words = [idx for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
Counter([int(x/10000000)*10000000 for x in updated_words])

Counter({0: 59072,
         10000000: 62699,
         20000000: 62937,
         30000000: 62400,
         40000000: 61929,
         50000000: 62698,
         60000000: 62667,
         70000000: 62672,
         80000000: 62810,
         90000000: 62591})

The updated words are distributed almost uniformly.

Is there a particular reason you've picked the range 550000:551000 for special treatment? (Off the top of my head, that doesn't seem a threshold meaningful to potential classes of bugs, like pointer overflows.)

Nope. I just picked arbitrary numbers. I believe changing them should not affect any of the results above.

Also, the potential size of this model means you're touching perhaps hundreds of GB of RAM (and maybe, but hopefully not, any virtual-memory IO). It'd be helpful to do one or both of: (1) a memory-test on the current machine; (2) see if this reproduces similarly on any other system.

(1) After rebuilding this system last December, I tested the entire memory with memtest86; there were no problems over multiple iterations.
(2) Storing and updating the weights requires more than 200GB of memory for the 100,000,000-word vocab. I do have other machines that meet the spec, but they are fully loaded for now.

a full recipe showing the code that reproduces & verifies the error. (Any further info about what happens in gensim-3.8.3 is irrelevant, given the changes around this feature, unless it somehow showed the problem to be worse in gensim-4.0.0, and thus a possible new bug.)

I use the following iterator/generator to produce random sequences (not the entire code, but it should be enough to reproduce the issue).

import random
from copy import deepcopy

import numpy as np
from gensim.models import Word2Vec

class RandomnumberIterator():
    def __init__(self, generator_function, max_number, random_seed, max_iteration, max_walking_dist):
        self.random_seed = random_seed
        self.max_iteration = max_iteration
        self.generator_function = generator_function
        self.max_walking_dist = max_walking_dist
        self.max_number = max_number
        self.generator = self.generator_function(random_seed, max_number, max_iteration, max_walking_dist)

    def __iter__(self):
        # Re-create the generator so the corpus can be iterated multiple times (for multiple epochs).
        self.generator = self.generator_function(self.random_seed, self.max_number, 
                                                 self.max_iteration, self.max_walking_dist)
        return self

    def __next__(self):
        result = next(self.generator)
        if result is None:
            raise StopIteration
        else:
            return result

    def __len__(self): 
        return self.max_iteration

def RandomWalkGenerator(random_seed, max_number, max_iteration, max_walking_dist):
    random.seed(random_seed)
    np.random.seed(random_seed)
    for it in range(max_iteration):
        # Each "sentence" is a list of random integers rendered as string tokens.
        yield list(np.random.randint(0, high=max_number, size=max_walking_dist).astype(str))

# Initial training
vector_dim = 100
windowsize = 5
vocab_max_number = 100000000
rand_seed = 0
num_sentences = 50000000
sentence_length = 30

now_it = RandomnumberIterator(RandomWalkGenerator, vocab_max_number, rand_seed, num_sentences, sentence_length)
base_model = Word2Vec(now_it, vector_size=vector_dim, window=windowsize, min_count=1, workers=8)

# Update step
num_additional_sentences = 1000000  # additional training sentences
additional_sentence_length = 10  # sentence length for the update
additinal_sentences_vocabs = 551000  # update tokens drawn from [0, 551000)
rand_seed = 0

tunned_model = deepcopy(base_model)
now_it = RandomnumberIterator(RandomWalkGenerator, additinal_sentences_vocabs, rand_seed, num_additional_sentences, additional_sentence_length)
tunned_model.train(now_it, total_examples=num_additional_sentences, epochs=5)

This result also looks suspicious to me.

@gojomo
Collaborator

gojomo commented Apr 2, 2021

Thanks for the MRE code - but can you also show the code that concludes out-of-range vectors changed?

What's the smallest vocab_max_number you can find that shows this (out-of-range) problem? (Spinning up a 200GB RAM machine for verification is harder than if it can be seen anywhere smaller.)

That a part of your config is Intel-MKL on an officially-unsupported processor is a concern. Is that necessary to reproduce the problem? (Won't other officially-supported libraries only lose a little bit of performance?)

@bluekura
Contributor Author

bluekura commented Apr 2, 2021

Thanks for the MRE code - but can you also show the code that concludes out-of-range vectors changed?

Here it is:

Here, base_vector is the original KeyedVectors and tunned_vector is the updated one.

import numpy as np

temp_diffs_list = []
for vocab in range(vocab_max_number):
    now_paperid = str(vocab)
    if now_paperid in tunned_vector.key_to_index and now_paperid in base_vector.key_to_index:
        diff = np.sum(base_vector[now_paperid] - tunned_vector[now_paperid])
        temp_diffs_list.append(diff)
    else:
        temp_diffs_list.append(0)  # for words never used in either the initial training or the update

What's the smallest vocab_max_number you can find that shows this (out-of-range) problem? (Spinning up a 200GB RAM machine for verification is harder than if it can be seen anywhere smaller.)

Probably somewhere between 10,000,000 and 50,000,000. I think it requires about 100GB of RAM. There was no problem at a vocab size of 10,000,000.

That a part of your config is Intel-MKL on an officially-unsupported processor is a concern. Is that necessary to reproduce the problem? (Won't other officially-supported libraries only lose a little bit of performance?)

I am not sure yet; I will try on another system.

@bluekura
Contributor Author

bluekura commented Apr 3, 2021

That a part of your config is Intel-MKL on an officially-unsupported processor is a concern. Is that necessary to reproduce the problem? (Won't other officially-supported libraries only lose a little bit of performance?)

I am not sure yet; I will try on another system.

Never mind - I was confused. In this test, I used plain OpenBLAS and LAPACK, not MKL or ATLAS.

blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']

@bluekura
Contributor Author

bluekura commented Apr 3, 2021

As you requested, I tested on another machine with the following spec.

Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (10 core) * 2
DDR4 512GB ECC-REG

Detailed dependency versions are as follows:

Linux-5.4.0-66-generic-x86_64-with-glibc2.31
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0]
Bits 64
NumPy 1.20.2
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 0

I changed the random seed, so the numbers do not exactly match the old test (on the Threadripper 3990X).

[0:551000): 237,083 vectors changed (as expected)
[551000:100,000,000): 620,760 vectors changed (~0.626% of those words updated)

The updated words are also distributed almost uniformly:

updated_words = [idx for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
Counter([int(x/10000000)*10000000 for x in updated_words])

Counter({0: 59156,
         10000000: 62747,
         20000000: 62260,
         30000000: 62610,
         40000000: 62138,
         50000000: 62313,
         60000000: 62403,
         70000000: 62477,
         80000000: 62456,
         90000000: 62200})

In terms of gensim's internal indices, the abnormally updated vectors are distributed as follows:

updated_words = [base_vector.key_to_index[str(idx)] for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
sorted(Counter([int(x/1000000)*1000000 for x in updated_words]).items())

[(0, 21719),
 (1000000, 21691),
 (2000000, 21572),
 (3000000, 21438),
 (4000000, 22119),
 (5000000, 21374),
 (6000000, 21599),
 (7000000, 21405),
 (8000000, 21797),
 (9000000, 21686),
 (10000000, 22173),
 (11000000, 21424),
 (12000000, 22106),
 (13000000, 21628),
 (14000000, 12259),
 (15000000, 10942),
 (16000000, 10906),
 (17000000, 10800),
 (18000000, 11011),
 (19000000, 10853),
 (20000000, 11012),
 (21000000, 10880),
 (22000000, 10846),
 (23000000, 11265),
 (24000000, 10643),
 (25000000, 10781),
 (26000000, 11137),
 (27000000, 11080),
 (28000000, 11209),
 (29000000, 10926),
 (30000000, 10890),
 (31000000, 11084),
 (32000000, 10898),
 (33000000, 10743),
 (34000000, 10543),
 (35000000, 10953),
 (36000000, 11065),
 (37000000, 10823),
 (38000000, 10797),
 (39000000, 10812),
 (40000000, 10881),
 (41000000, 10715),
 (42000000, 10275)]

In summary, the problem persists on the other machine (from a different CPU vendor).

I think this will be my last test - I have already tested a lot :>

@gojomo gojomo changed the title trainables.vectors_lockf does not work as expected for the large number of words word-vectors outside train()-update tokens being changed; also, vectors_lockf 0.0-values don't suppress changes Apr 3, 2021
@bluekura
Contributor Author

bluekura commented Apr 4, 2021

Last shot: AMD 3990X with Intel MKL

blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/include']
None

Linux-5.4.0-70-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Feb 24 2021, 21:46:12) 
[GCC 7.3.0]
Bits 64
NumPy 1.19.2
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 1
  • With a vocab size of 100,000,000:
    [0:551000): 237,083 vectors changed
    [551000:max_val): 620,760 vectors changed
updated_words = [idx for idx, val in enumerate(temp_diffs_list) if val != 0 and idx > 551000]
Counter([int(x/10000000)*10000000 for x in updated_words])

Counter({0: 59156,
         10000000: 62747,
         20000000: 62260,
         30000000: 62610,
         40000000: 62138,
         50000000: 62313,
         60000000: 62403,
         70000000: 62477,
         80000000: 62456,
         90000000: 62200})

It seems that both MKL and OpenBLAS show a similar problem.

@gojomo
Collaborator

gojomo commented Apr 5, 2021

Thanks for all your work reproducing this under multiple configurations! The main thing that would help now, given the rarity of debugging machines with over 100GB RAM, is to determine the true minimum RAM in which this problem can be shown - even if, for example, it takes a larger random dataset.

It seemed above that it was reliably triggered at a 50,000,000-word vocabulary with 100-dimensional vectors. Can you try 50M/100M with toy-sized 2-dimensional vectors, & see if that might also show the problem (in a lot less time/space)?

@bluekura
Contributor Author

bluekura commented Apr 6, 2021

  • OpenBLAS, 2 dims, 50,000,000 unique words

    [0:551000): 551000 vectors changed
    [551000:50,000,000): no vectors changed

  • OpenBLAS, 100 dims, 50,000,000 unique words
    [0:551000): 473421 vectors changed
    [551000:50,000,000): 152512 vectors changed

Interestingly, there is no problem with 2-dim vectors.

blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/anaconda/anaconda3/envs/gensim-4.0.0/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
None

Linux-5.4.0-70-generic-x86_64-with-glibc2.10
Python 3.8.8 (default, Feb 24 2021, 21:46:12) 
[GCC 7.3.0]
Bits 64
NumPy 1.19.2
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 0

@gojomo
Collaborator

gojomo commented Apr 6, 2021

Thanks! That a (50M * 2 dims * 4 bytes) 400MB set of vectors doesn't show the problem, but a (50M * 100 dim * 4 bytes) 20GB set does is informative.

Given the kinds of power-of-2 thresholds often involved with pointer-type errors, it'd similarly be useful to probe around the 3GB-or-5GB, or 15GB-or-17GB thresholds. Specifically:

~3GB: (50M words * 16 dims * 4 bytes), and/or (7.5M words * 100 dims * 4 bytes)
~5GB: (50M words * 24 dims * 4 bytes), and/or (12.5M words * 100 dims * 4 bytes)

~15GB: (50M words * 75 dims * 4 bytes), and/or (37.5M words * 100 dims * 4 bytes)
~17GB: (50M words * 84 dims * 4 bytes), and/or (42.5M words * 100 dims * 4 bytes)

If some of these reliably show erroneous updates, while others definitively don't, we'll have extra hints of the source of the problem, and possibly a much-smaller-in-RAM-demands minimum-reproducible-example.
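
For reference, the array sizes involved can be sanity-checked quickly (a small sketch computing both decimal GB and binary GiB, since it isn't yet clear which threshold matters):

# Size of the main vectors array for a given configuration, in decimal GB and binary GiB.
def vectors_size(vocab_size, dims, bytes_per_float=4):
    total_bytes = vocab_size * dims * bytes_per_float
    return total_bytes / 1e9, total_bytes / 2**30

for vocab, dims in [(50_000_000, 16), (50_000_000, 24),
                    (50_000_000, 75), (50_000_000, 84), (50_000_000, 100)]:
    gb, gib = vectors_size(vocab, dims)
    print(f"{vocab} words x {dims} dims: {gb:.2f} GB / {gib:.2f} GiB")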

@bluekura
Contributor Author

bluekura commented Apr 9, 2021

Here are the results @gojomo requested:

Dims: 16, Vocab Size: 50000000, OPENBLAS
[0: 551000): 551000 vectors changed
[551000: vocab_max_number): 0 vectors changed

Dims: 24, Vocab Size: 50000000, OPENBLAS
[0: 551000): 551000 vectors changed
[551000: vocab_max_number): 0 vectors changed

Dims: 75, Vocab Size: 50000000, OPENBLAS
[0: 551000): 551000 vectors changed
[551000: vocab_max_number): 0 vectors changed

Dims: 84, Vocab Size: 50000000, OPENBLAS
[0: 551000): 551000 vectors changed
[551000: vocab_max_number): 0 vectors changed

Add-on:

Dims: 90, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 90, [0: 551000): 525802 vectors changed
Dims: 90, [551000: vocab_max_number): 49528 vectors changed
~18GB: (50M words * 90 dims * 4 bytes)

@bluekura
Contributor Author

bluekura commented Apr 13, 2021

@gojomo

This will be the last test on my side.

Dims: 84, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 84, [0: 551000): 551000 vectors changed
Dims: 84, [551000: vocab_max_number): 0 vectors changed

Dims: 85, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 85, [0: 551000): 551000 vectors changed
Dims: 85, [551000: vocab_max_number): 0 vectors changed

Dims: 86, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 86, [0: 551000): 550335 vectors changed
Dims: 86, [551000: vocab_max_number): 1295 vectors changed

Dims: 87, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 87, [0: 551000): 543961 vectors changed
Dims: 87, [551000: vocab_max_number): 13861 vectors changed

Dims: 88, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 88, [0: 551000): 537657 vectors changed
Dims: 88, [551000: vocab_max_number): 26249 vectors changed

Dims: 89, Vocab Size: 50000000, RealVocabSize: 50000000, OPENBLAS
Dims: 89, [0: 551000): 531744 vectors changed
Dims: 89, [551000: vocab_max_number): 37863 vectors changed

Please note that your memory-usage estimates appear to be in decimal GB; converting to binary GiB (dividing by 1024^3 rather than 1000^3), the 84-dimension case is certainly below 16 GiB by my calculation.

Here is a summary of the new tests.

| dims | Mem usage (GiB) | Errors | Memory over 16 GiB (GiB) | # of vectors beyond 16 GiB |
|------|-----------------|--------|--------------------------|----------------------------|
| 84   | 15.64621925     | 0      | 0                        |                            |
| 85   | 15.83248377     | 0      | 0                        |                            |
| 86   | 16.01874828     | 1295   | 0.018748283              | 58519.81395                |
| 87   | 16.2050128      | 13861  | 0.205012798              | 632559.8161                |
| 88   | 16.39127731     | 26249  | 0.391277313              | 1193553.455                |
| 89   | 16.57754183     | 37863  | 0.577541828              | 1741940.494                |
| 90   | 16.76380634     | 49528  | 0.763806343              | 2278141.156                |
| 100  | 18.62645149     | 152512 | 2.626451492              | 7050327.04                 |

(figure: number of erroneously changed vectors plotted against memory usage over 16 GiB)

As more memory was used over 16GB, the errors increased proportionally.
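
A quick worked check of the 86-dimension row (my own arithmetic) ties the threshold to a power of two:

# 50M words x 86 dims is the first configuration whose vectors array exceeds
# 2**32 floats, i.e. 16 GiB at 4 bytes per float.
floats = 50_000_000 * 86
print(floats, 2**32)          # 4_300_000_000 vs 4_294_967_296
print(floats * 4 / 2**30)     # ~16.02 GiB, matching the table above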

@gojomo
Collaborator

gojomo commented Apr 13, 2021

Thank you; this is great focusing-in on the threshold of the problem. I'd been mistakenly using decimal GB when binary GiB (pedantically, 'gibibytes') are the relevant numbers for this sort of addressing.

That the errors begin as soon as the array is larger than 2^32 total floats strongly hints some erroneously-narrow pointer calcs are happening, and/or some high-bytes in a properly 64-bit pointer are being corrupted. I'll comb over the relevant code soon.
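
To illustrate the class of bug suspected here (a hypothetical sketch, not gensim's actual code): if the flat row offset into the big vectors array were computed with 32-bit arithmetic, it would wrap modulo 2^32 once the array exceeds 2^32 floats, so updates for high-index words would land on unrelated rows:

import numpy as np

# Hypothetical illustration of a 32-bit offset wrap, not gensim's real indexing code.
vocab_size, dims = 50_000_000, 86    # ~4.3e9 floats, just past 2**32
word_index = 49_999_000              # a high-index word

correct_offset = np.int64(word_index) * dims   # 64-bit arithmetic: 4_299_914_000
wrapped_offset = (word_index * dims) % 2**32   # what a wrapped 32-bit offset would be
print(correct_offset, wrapped_offset, wrapped_offset // dims)
# 4299914000 4946704 57519 -> the update would hit word 57519's row instead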

@gojomo
Copy link
Collaborator

gojomo commented May 12, 2021

In trying to understand the proposed fix in #3136, it occurs to me that the triggering code here may not be showing quite what we thought. Notably, because the vocabulary is by default frequency-sorted, a low number (like '1') is unlikely to actually be in position 1. In fact, given the flat random distribution of synthetic terms, it may be equally likely to be anywhere. Thus I think that the probe that's attempting to show an error will in fact be correctly changing some vectors outside of the 0:551000 range.

There may still be a bug in the pointer types & arithmetic, but it's hard for me to see that kind of bug causing updates at later array positions. (An improperly truncated result type would typically cause extra updates at smaller index positions.) So at least some of the symptom here may be mis-interpreted. Is it truly the case that the fix in #3136 eliminates the position >551000 changes shown by this test code?
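
A quick way to see this in the synthetic setup (a sketch using the variables from the reproduction code above):

# With a frequency-sorted vocab built from near-uniform random tokens, a token's
# string value says nothing about its row position in the vectors array.
for token in ["1", "550999", "99999999"]:
    if token in base_model.wv.key_to_index:
        print(token, "-> index", base_model.wv.key_to_index[token])
# e.g. "550999" could map to any index in [0, len(base_model.wv)), not 550999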

@bluekura
Contributor Author

bluekura commented May 12, 2021

The numbers (like 551000) are used as string vocabulary keys, not as positions (indices).
As you can see in the test code (already shown above), I tested using the number as the "word".

for vocab in range(vocab_max_number):
    now_paperid = str(vocab)
    if(now_paperid in tunned_vector.key_to_index and now_paperid in base_vector.key_to_index):
        diff = np.sum(base_vector[now_paperid] - tunned_vector[now_paperid]) 
        temp_diffs_list.append(diff)
    else:
        temp_diffs_list.append(0)

The words numbered <551000 can of course be anywhere in terms of position. I generated random sequences with numbers just for convenience.

Is it truly the case that the fix in #3136 eliminates the position >551000 changes shown by this test code?

No - it eliminates the movement of words >"551000", which indeed can be at any position.

And note that it also fixed the issue on my real data, with a ~65,000,000-word vocab and 100-dimensional vectors.
