
get_latest_training_loss returns 0 #3228

Open
ginward opened this issue Sep 8, 2021 · 8 comments


ginward commented Sep 8, 2021

Problem description

It seems that the get_latest_training_loss function in FastText only ever returns 0. Neither gensim 4.1.0 nor 4.0.0 works.

from gensim.models.callbacks import CallbackAny2Vec
from pprint import pprint as print
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

class callback(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''

    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        print('Loss after epoch {}: {}'.format(self.epoch, loss))
        self.epoch += 1

# Set file names for train and test data
corpus_file = datapath('lee_background.cor')

model = FastText(vector_size=100, callbacks=[callback()])

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words,
    callbacks=model.callbacks, compute_loss=True,
)

print(model)
'Loss after epoch 0: 0.0'
'Loss after epoch 1: 0.0'
'Loss after epoch 2: 0.0'
'Loss after epoch 3: 0.0'
'Loss after epoch 4: 0.0'

If FastText does not currently support get_latest_training_loss, the documentation here needs to be removed:

https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.get_latest_training_loss

Versions

I have tried this in three different environments and none of them works.

First environment:

[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
>>> import sys; print("Python", sys.version)
Python 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48)
[GCC 9.3.0]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.21.2
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.1.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

Second environment:

Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
macOS-10.16-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.20.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.1
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.1.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

Third environment:

Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
macOS-10.16-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.20.3
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.1
>>> import gensim; print("gensim", gensim.__version__)
/Users/jinhuawang/miniconda3/lib/python3.9/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
  warnings.warn(msg)
gensim 4.0.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

ginward commented Sep 10, 2021

This is related to #2658, which probably should not be closed. @gojomo
It seems that currently fasttext would not return the correct loss using get_latest_training_loss.


gojomo commented Sep 10, 2021

#2658 is closed as a duplicate, because #2617 is a more comprehensive discussion of what's broken (or simply never implemented) in the *2Vec models.

The docs are wrong to imply there's any loss-tallying in FastText - it's never been implemented. That could be corrected right away, by overriding the superclass method with another that documents/warns that there's no loss-tracking yet for the FastText model. Actually adding loss-tracking to FastText (and Doc2Vec) will require a bit more design & work, as hinted in #2617 (& some of the other issues it references).
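That immediate fix could look something like the following sketch. These are standalone stand-in classes (the names only mirror gensim's, this is not gensim's actual code): the subclass overrides the inherited accessor so it fails loudly instead of returning a misleading 0.

```python
class Word2VecLike:
    """Stand-in for the superclass, which does keep a running loss tally."""
    def __init__(self):
        self.running_training_loss = 0.0

    def get_latest_training_loss(self):
        return self.running_training_loss


class FastTextLike(Word2VecLike):
    """Stand-in for FastText, where loss-tallying was never implemented."""
    def get_latest_training_loss(self):
        # Override the inherited accessor: no tallying happens in this
        # model, so returning the (always-zero) counter would mislead.
        raise NotImplementedError(
            "Loss tracking is not implemented for FastText; see gensim issue #2617."
        )
```

With an override like this, the callback in the original report would raise on the first epoch instead of silently printing zeros.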


ginward commented Sep 10, 2021

#2658 is closed as a duplicate, because #2617 is a more comprehensive discussion of what's broken (or simply never implemented) in the *2Vec models.

The docs are wrong to imply there's any loss-tallying in FastText - it's never been implemented. That could be corrected right away, by overriding the superclass method with another that documents/warns that there's no loss-tracking yet for the FastText model. Actually adding loss-tracking to FastText (and Doc2Vec) will require a bit more design & work, as hinted in #2617 (& some of the other issues it references).

I see. But if loss tracking has never been implemented, how do we know whether training should be stopped early, or whether it needs more epochs?


gojomo commented Sep 10, 2021

I see. But if loss tracking has never been implemented, how do we know whether training should be stopped early, or whether it needs more epochs?

You'd have to use other heuristics. AFAIK, neither the original Google word2vec.c (on which Gensim's original implementation of the word2vec algorithm was closely based) nor the Facebook fasttext tool even offer early-stopping as an option: you pick your epochs & live with it until either training finishes or you destructively interrupt the training-in-progress. If you later suspect it was too little or too much, you try another value in a wholly-separate run.

They do each, however, show a running loss that a user can watch for hints.

It's definitely a desirable feature to have - hence the many requests, & partial/buggy implementation inside Gensim's Word2Vec, & the open #2617 expressing a goal of fixing/completing the work! It's just not been done, or urgently-required by someone who was sufficiently skilled & motivated to contribute/fund the necessary work, yet.

(Note, though, running-loss is also somewhat prone to misinterpretation, with some people thinking it's an accurate measure of model quality for other purposes, and that, of a set of candidate models, the one with the lowest loss will work best for outside purposes. That's not inherently the case, as it's just a report on the model's internal training goal. That internal goal is, if all sorts of other things are also done right, at best only an approximation of fitness for the real external purposes where people use word-vectors. For example, a massively-'overfit' model can have an arbitrarily low training loss, while being entirely useless for other tasks.)

@cpuodzius

After reading some replies here and on Stack Overflow, I'm aware that loss-tallying is yet to be implemented.
However, the running loss after each epoch - contrary to what was said here - is also always zero for me.

I'm running gensim==4.0.1 and my example code is:

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec

class LossLogger(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''

    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        print('Loss after epoch {}: {}'.format(self.epoch, loss))
        self.epoch += 1

callbacks = [LossLogger()]

# `documents` is assumed to be an iterable of TaggedDocument objects
doc2vec_model = Doc2Vec(
    documents,
    vector_size=128,
    window=0,
    min_count=5,
    dm=0,
    sample=0.0001,
    workers=4,
    epochs=10,
    alpha=0.025,
    seed=42,
    compute_loss=True,
    callbacks=callbacks,
)
Loss after epoch 1: 0.0
Loss after epoch 2: 0.0
Loss after epoch 3: 0.0
Loss after epoch 4: 0.0
Loss after epoch 5: 0.0
Loss after epoch 6: 0.0
Loss after epoch 7: 0.0
Loss after epoch 8: 0.0
Loss after epoch 9: 0.0

Why does model.get_latest_training_loss() always return 0, even though the model was initialized with compute_loss=True?


gojomo commented Sep 15, 2021

Gensim *2Vec model loss-tallying is...
...in Word2Vec, buggy/incomplete (but somewhat usable w/ workarounds).
...in Doc2Vec, never yet implemented, hence always 0.
...in FastText, never yet implemented, hence always 0.

But since Doc2Vec & FastText inherit fragments of the Word2Vec implementation (the initialization options & the accessor method), it looks like it should work - yet there's no tallying behind the scenes.


mpenkov commented Dec 4, 2021

Shouldn't we raise NotImplementedError instead of returning zero? It'd be less surprising for the user.


gojomo commented Dec 5, 2021

Shouldn't we raise NotImplementedError instead of returning zero? It'd be less surprising for the user.

That'd be better than the current mysteriously-incomplete behavior! But such hard failures should start as soon as the user takes any step guaranteed to disappoint - such as initializing a model that can't track loss with compute_loss=True. And of course the FastText doc-comments also shouldn't be describing compute_loss and get_latest_training_loss() as if they were functional while they're not.
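The fail-fast idea could be sketched like this (a hypothetical stub, not a patch against gensim): reject compute_loss=True at construction time, the first step guaranteed to disappoint.

```python
class FastTextStub:
    """Hypothetical stub showing a fail-fast constructor check."""
    def __init__(self, compute_loss=False, **kwargs):
        if compute_loss:
            # Refuse up front rather than silently accepting a flag
            # that no training code path will ever honor.
            raise NotImplementedError(
                "compute_loss=True is not supported: "
                "FastText does not tally training loss."
            )
        self.compute_loss = False


FastTextStub()  # constructing without the flag is fine
```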
