FTRL algo does not work properly on views #1502

Closed
goldentom42 opened this issue Dec 21, 2018 · 19 comments · Fixed by #1505
@goldentom42

Hi,

I'm trying to use datatable's FTRL-Proximal algorithm on a dataset and it behaves strangely:
the log loss increases with the number of epochs.

Here is the code I use:

train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
features = [f for f in train_dt.names if f not in ['HasDetections']]
for n in range(10):
    ftrl = Ftrl(nepochs=n+1)
    ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
    print(log_loss(np.array(train_dt[trn_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[trn_, features]))))

The output is:

0.6975873940617929
0.7004277294410224
0.7030339011892597
0.705290424565774
0.7072685897773024
0.7091474008277487
0.7108282513596036
0.7123130263929156
0.713890830846544
0.7151695514165213

My own version of FTRL trains correctly, with the following output:

time_used:0:00:01.026606	epoch: 0   rows:10001	t_logloss:0.59638
time_used:0:00:01.715622	epoch: 1   rows:10001	t_logloss:0.52452
time_used:0:00:02.436984	epoch: 2   rows:10001	t_logloss:0.48113
time_used:0:00:03.158367	epoch: 3   rows:10001	t_logloss:0.44260
time_used:0:00:03.851369	epoch: 4   rows:10001	t_logloss:0.39633
time_used:0:00:04.553488	epoch: 5   rows:10001	t_logloss:0.38197
time_used:0:00:05.264179	epoch: 6   rows:10001	t_logloss:0.35380
time_used:0:00:05.973398	epoch: 7   rows:10001	t_logloss:0.32839
time_used:0:00:06.688121	epoch: 8   rows:10001	t_logloss:0.32057
time_used:0:00:07.394217	epoch: 9   rows:10001	t_logloss:0.29917
  • Your environment?
    I'm on Ubuntu 16.04, clang+llvm-7.0.0-x86_64-linux-gnu-ubuntu-16.04, Python 3.6;
    datatable is compiled from source.

Let me know if you need more.

I guess I'm missing something, but I could not find anything in the unit tests.

Thanks for your help.

P.S.: the make test results and the dataset I use are attached.
datatable_make_test_results.txt
dt_ftrl_test_set.csv.gz

@oleksiyskononenko
Contributor

oleksiyskononenko commented Dec 21, 2018

Dear @goldentom42, thanks for your interest and feedback!

We are in an active phase of development now, so any suggestions or bug reports are more than welcome. Unfortunately, I couldn't run your code, because I'm not sure what trn_ is in this context (1) and what implementation of the log_loss function you're using (2).

However, I was able to modify the code like this:

import datatable as dt
from datatable.models import Ftrl
import numpy as np
from sklearn.metrics import log_loss

train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
features = [f for f in train_dt.names if f not in ['HasDetections']]

for n in range(10):
    ftrl = Ftrl(nepochs=n+1)
    ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
    print(log_loss(np.array(train_dt[:, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[:, features]))))

and run it. Below is the output I get on my local Mac:

0.6641336801418288
0.6534124523789995
0.6457434856528689
0.6400894173743356
0.6351222889652768
0.6308302448134359
0.6271074339230708
0.6237886948743143
0.6205773324214058
0.6177462879344261

So the logloss seems to go down (the learning rate alpha is 0.005 by default, so you may want to adjust it depending on your data). This is also consistent with the average logloss we output during the learning process.

To move forward in resolving your particular problem, could you please provide me with details on the mentioned points (1) and (2)? Meanwhile, you may also want to try the code above and see what you get. Just one thing to keep in mind: our FTRL implementation is parallelized with OpenMP and Hogwild, i.e. results may vary slightly from run to run and between different systems.
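
For illustration, here is a minimal sketch of both points above (a larger learning rate and a pinned thread count). It assumes that alpha is accepted as an Ftrl constructor argument (the 0.005 default mentioned above) and that the standard OMP_NUM_THREADS variable is honoured by datatable's OpenMP build:

import os
os.environ["OMP_NUM_THREADS"] = "1"   # single thread: less Hogwild run-to-run variation (assumption)

import numpy as np
from sklearn.metrics import log_loss

import datatable as dt
from datatable.models import Ftrl

train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
features = [f for f in train_dt.names if f != 'HasDetections']

ftrl = Ftrl(alpha=0.05, nepochs=5)    # try a learning rate larger than the 0.005 default
ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
preds = np.array(ftrl.predict(train_dt[:, features]))[:, 0]
print(log_loss(np.array(train_dt[:, 'HasDetections'])[:, 0], preds))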

@goldentom42
Author

Hi @oleksiyskononenko,

Thanks for your kind reply and sorry for the missing parts in the code ... :( I should have double-checked.

Your implementation is lightning fast, but here's what I get (on the command line; the previous post was from a notebook):

(tf_gpu) goldentom@Ubuntu-1604-xenial-64-minimal ~/kaggle/msft/src $ python ftrl_dt_test.py
/home/goldentom/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Training epoch: 0       Row: 1000       Prediction: 0.514756    Current loss: 0.723103  Average loss: 0.689654
Training epoch: 0       Row: 2000       Prediction: 0.539826    Current loss: 0.776150  Average loss: 0.793482
Training epoch: 0       Row: 3000       Prediction: 0.426119    Current loss: 0.555333  Average loss: 0.809531
Training epoch: 0       Row: 4000       Prediction: 0.507924    Current loss: 0.709123  Average loss: 0.815606
Training epoch: 0       Row: 5000       Prediction: 0.417661    Current loss: 0.540702  Average loss: 0.818705
Training epoch: 0       Row: 6000       Prediction: 0.460405    Current loss: 0.616936  Average loss: 0.820067
Training epoch: 0       Row: 7000       Prediction: 0.494487    Current loss: 0.682181  Average loss: 0.819666
Training epoch: 0       Row: 8000       Prediction: 0.518727    Current loss: 0.731321  Average loss: 0.820048
Training epoch: 0       Row: 9000       Prediction: 0.550760    Current loss: 0.800199  Average loss: 0.744723
Training epoch: 0       Row: 10000      Prediction: 0.422169    Current loss: 0.548474  Average loss: 0.678556
Row: 1000       Prediction: 0.537012
Row: 2000       Prediction: 0.494364
Row: 3000       Prediction: 0.408376
Row: 4000       Prediction: 0.480858
Row: 5000       Prediction: 0.409675
Row: 6000       Prediction: 0.443320
Row: 7000       Prediction: 0.483151
Row: 8000       Prediction: 0.505205
Row: 9000       Prediction: 0.536927
Row: 10000      Prediction: 0.418759
OMP: Error #15: Initializing libiomp5.so, but found libomp.so already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Aborted
(tf_gpu) goldentom@Ubuntu-1604-xenial-64-minimal ~/kaggle/msft/src $

Do you have any idea what could cause the issue? As I said, I had to build from source and that may be the problem, but the tests looked OK and datatable seems to work fine.

Thanks for your help,
Olivier

@st-pasha
Contributor

Olivier, the OMP error that you are seeing is caused by several OpenMP libraries being loaded at runtime. Here's what this means:

  • When datatable is imported, it loads the dynamic library libomp.so (unless it is already loaded);
  • If another library is loaded afterwards (such as numpy or scikit-learn), it may also want OpenMP support and try to load its own OpenMP library. Normally this process works smoothly if that other library is compiled with dynamic loading of OpenMP. However, if the library has OpenMP compiled in statically, then problems occur: OpenMP detects that two copies of itself are present and aborts execution.

The right solution to this problem is to recompile all libraries with dynamic loading of OpenMP, although I understand this might not be an easy thing to do. Possible workarounds (sketched after this list) are:

  • Load the "bad" library first, so that it brings its own OpenMP, and only then import datatable, which will use the OpenMP that is already present;
  • Set the environment variable KMP_DUPLICATE_LIB_OK=TRUE, although I don't know what the consequences might be.
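
A minimal sketch of the first workaround, assuming scikit-learn is the package that brings its own (statically linked) OpenMP:

# Import the packages that carry their own OpenMP runtime first ...
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# ... and import datatable last, so it reuses the OpenMP runtime that is already loaded.
import datatable as dt
from datatable.models import Ftrl

# Alternative (unsafe, as noted above): set the variable before starting Python, e.g. in the shell:
#   export KMP_DUPLICATE_LIB_OK=TRUE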

@oleksiyskononenko
Contributor

oleksiyskononenko commented Dec 21, 2018

@goldentom42 Olivier, if your original Python code works and you clarify points (1) and (2), I can also try to reproduce the problem on my side. I suspect the difference may be caused by the fact that I measure the logloss on all the rows (:), while you use only a given subset trn_.

@goldentom42
Author

Thank you so much @st-pasha and @oleksiyskononenko

This code works:

from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, roc_auc_score
import numpy as np

import datatable as dt
from datatable.models import Ftrl


def main():
    
    # Read data
    train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
    features = [f for f in train_dt.names if f not in ['HasDetections']]
    for n in range(10):
        ftrl = Ftrl(nepochs=n+1)
        ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
        print('Current loss : %.6f' % log_loss(np.array(train_dt[:, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[:, features]))))

        
if __name__ == '__main__':
    main()

So I need to make sure I import datatable on the very last line of the imports!

@oleksiyskononenko for the points to clarify:
(1) trn_ was something left over from a previous cross-validation loop. When I decided to open this issue I wanted to make the script simpler... once again my bad :(
(2) The log_loss I'm using is the one in scikit-learn.

I'll be using your Ftrl implementation for the MSFT malware competition on Kaggle. This is a large dataset: 8+ million rows. Arno said it was lightning fast and I have to say it is :)

Again thank you for your help.

P.S.: I may advertise datatable.ftrl on Kaggle over the weekend, hope you don't mind!

@oleksiyskononenko
Contributor

oleksiyskononenko commented Dec 22, 2018

Dear Olivier @goldentom42, you're welcome.

I see, so the reason the logloss went up in your original script was the subset of rows used for validation, which was not relevant for this particular dataset? At least when running the latest version of your code, the logloss seems to go down in a similar way to what I posted above. If that is the case, I will close the ticket.

Good luck with the competition; I hope our implementation works well for you. Of course, you're more than welcome to advertise it :)

@goldentom42
Author

Well, the logloss did go up with the cross-validation in place, for both the training and validation sets.
The trn_ was left over from when I opened the ticket and decided to drop the cross-validation loop.

I believe the problem I encountered came from not importing datatable in the last position (I'm using pandas, numpy and all sorts of other packages).

But now that the import is last, all works well, and 7 seconds for almost 9 million rows!!! WOW

That's amazing, so I'll surely advertise it ;-)

Again, thanks for your kind support, and you're welcome to close the ticket!
Olivier

@goldentom42
Author

Oops, I have just tried the following code:

from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, roc_auc_score
import numpy as np

import datatable as dt
from datatable.models import Ftrl


def main():
    
    # Read data
    train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
    features = [f for f in train_dt.names if f not in ['HasDetections']]
    folds = KFold(5, True, 1)
    for trn_, val_ in folds.split(train_dt[:, 'HasDetections']):
        ftrl = Ftrl(nepochs=1)
        for n in range(3):
            ftrl.fit(train_dt[trn_, features], train_dt[trn_, 'HasDetections'])
            trn_preds = np.array(ftrl.predict(train_dt[trn_, features]))
            val_preds = np.array(ftrl.predict(train_dt[val_, features]))
            trn_log_score = log_loss(np.array(train_dt[trn_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[trn_, features])))
            val_log_score = log_loss(np.array(train_dt[val_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[val_, features])))
            trn_auc_score = roc_auc_score(np.array(train_dt[trn_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[trn_, features])))
            val_auc_score = roc_auc_score(np.array(train_dt[val_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[val_, features])))
            print('TRN logloss %.6f auc %.6f VAL logloss %.6f auc %.6f '
                  % (trn_log_score, trn_auc_score, val_log_score, val_auc_score))
        break

        
if __name__ == '__main__':
    main()

And the training logloss goes up:

TRN logloss 0.696683 auc 0.505145 VAL logloss 0.698137 auc 0.501001
TRN logloss 0.699392 auc 0.504855 VAL logloss 0.701723 auc 0.497402
TRN logloss 0.701745 auc 0.504746 VAL logloss 0.704953 auc 0.495593

I would normally expect the training loss to decrease across runs.

So I decided to fit ftrl on the whole dataset, changing these lines:

ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
trn_preds = np.array(ftrl.predict(train_dt[trn_, features]))
val_preds = np.array(ftrl.predict(train_dt[val_, features]))

full_log_score = log_loss(np.array(train_dt[:, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[:, features])))
trn_log_score = log_loss(np.array(train_dt[trn_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[trn_, features])))
val_log_score = log_loss(np.array(train_dt[val_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[val_, features])))
print('Full logloss %.6f TRN logloss %.6f VAL logloss %.6f'
      % (full_log_score, trn_log_score, val_log_score))

What I get is:

Full logloss 0.663574 TRN logloss 0.697486 VAL logloss 0.698919
Full logloss 0.652990 TRN logloss 0.700649 VAL logloss 0.703362
Full logloss 0.645516 TRN logloss 0.703176 VAL logloss 0.706823

So the global log_loss decreases, but not on slices...

Sorry to be a pain.
Olivier

@goldentom42
Author

To add to the mystery, ftrl.predict(train_dt[0:10, features]) returns
[screenshot of the predicted values]

but ftrl.predict(train_dt[1:10, features]) returns
[screenshot of the predicted values]

So it looks like the predictions are output in reverse order; is that expected?

@goldentom42
Author

FYI, I managed to make the cross-validation loop work with the following code:

from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, roc_auc_score
import numpy as np
import pandas as pd

import datatable as dt
from datatable.models import Ftrl


def main():
    
    # Read data
    train_df = pd.read_csv('dt_ftrl_test_set.csv.gz')
    train_df['HasDetections'] = train_df['HasDetections'].astype(bool)
    features = [f for f in train_df.columns if f not in ['HasDetections']]
    folds = KFold(5, True, 1)
    for trn_, val_ in folds.split(train_df['HasDetections']):
        ftrl = Ftrl(nepochs=1)
        for n in range(3):
            ftrl.fit(
                dt.Frame(train_df[features].iloc[trn_]), # .reset_index(drop=True)), 
                dt.Frame(train_df['HasDetections'].iloc[trn_])
            )
            trn_preds = np.array(ftrl.predict(dt.Frame(train_df[features].iloc[trn_]))) # .reset_index(drop=True))))
            val_preds = np.array(ftrl.predict(dt.Frame(train_df[features].iloc[val_]))) # .reset_index(drop=True))))

            trn_log_score = log_loss(train_df['HasDetections'].iloc[trn_], trn_preds)
            val_log_score = log_loss(train_df['HasDetections'].iloc[val_], val_preds)
            print('TRN logloss %.6f VAL logloss %.6f'
                  % (trn_log_score, val_log_score))

        break

        
if __name__ == '__main__':
    main()

Not sure what's going on, but it seems I found a workaround.

Best, Olivier

@oleksiyskononenko
Contributor

Hi Olivier @goldentom42, it could be that there really is a bug in the predict method. When you pass it something like df[r1:r2, :], it scores on df[0:r2-r1, :], i.e. it always scores on the beginning of the original frame, no matter what your r1 is. I will update you as soon as I fix it.
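
For illustration, a minimal sketch of the symptom (hypothetical row offsets, reusing the ftrl, train_dt and features objects from the scripts above):

import numpy as np

p_view = np.array(ftrl.predict(train_dt[5:15, features]))   # a view starting at row 5
p_head = np.array(ftrl.predict(train_dt[0:10, features]))   # the first 10 rows of the frame

# The buggy behaviour described above: the two arrays come out identical,
# because the view's starting offset (r1 = 5) is ignored.
print(np.allclose(p_view, p_head))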

@goldentom42
Author

goldentom42 commented Dec 23, 2018

Thanks @oleksiyskononenko, yeah, I came to the same conclusion. However, the workaround I proposed does work when you create new datatable frames for the predict, which is weird but may make sense to you.

Please let me know if I can be of any help, I really like your FTRL implementation ;-)

Thanks for your help,
Olivier

@oleksiyskononenko
Contributor

@goldentom42 thanks! Yep, nice catch: your last version does work because then you don't need to specify r1. It may be overkill from a performance and memory point of view, though, so we should definitely fix the bug ASAP. It shouldn't be a big deal, as it is not related to the algo itself, but more to the way we treat the incoming data.

@oleksiyskononenko added the bug label on Dec 23, 2018
@oleksiyskononenko changed the title from "[bug] FTRL log loss increases with number of epochs" to "FTRL log loss increases with number of epochs" on Dec 23, 2018
@oleksiyskononenko changed the title from "FTRL log loss increases with number of epochs" to "FTRL algo does not work properly on views" on Dec 23, 2018
@goldentom42
Author

Cool. Let me know if you want me to pull the PR when it's ready, build it, and check whether it works.

Have an enjoyable Christmas ;-)

@oleksiyskononenko
Contributor

Thanks, you too! You should automatically receive a message when this ticket is closed. I will refer to this issue when I make a PR with the fix.

@goldentom42
Author

Fix pulled, built and tested.
Thanks

@oleksiyskononenko
Contributor

@goldentom42 thanks, did it help?

@oleksiyskononenko added the views label on Dec 26, 2018
@goldentom42
Author

@oleksiyskononenko, compared to the workaround I used, the results are the same. I have not tried using interactions yet; with 82 features that sounds like a good challenge!
Again, thanks for your help and swift fix!

@oleksiyskononenko
Contributor

oleksiyskononenko commented Dec 27, 2018

@goldentom42 Yep, if it now gives the same results as your workaround, it means the bug is fixed. Thanks for your help!

As for the feature interactions, we currently do a full second order, which may need decent computer resources and could take a significant amount of time (when there are many features). At some point we will allow arbitrary feature interactions that the user could specify based, for instance, on feature importance. More details here: #1397
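
For a rough sense of scale (a back-of-the-envelope count, not from the thread): with 82 features, a full second-order interaction set means 82*81/2 = 3321 feature pairs on top of the original columns.

# Number of pairwise (second-order) feature interactions for 82 features;
# whether self-interactions are also counted is left aside here as an assumption.
n = 82
print(n * (n - 1) // 2)   # 3321 pairs, in addition to the 82 original features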

@st-pasha added this to the Release 0.8.0 milestone on Jan 10, 2019