FTRL algo does not work properly on views #1502
Dear @goldentom42, thanks for your interest and feedback! We are in an active phase of development now, so any suggestions or bug reports are more than welcome. Unfortunately, I couldn't run your code as-is, because I'm not sure what some of it refers to. However, I was able to modify the code like this
and run it. Below is the output I get on my local Mac:
So logloss seems to go down (with this learning rate). To move forward in resolving your particular problem, could you please provide me with details on the mentioned points (1) and (2)? Meanwhile, you may also want to try the code above and see what you get. Just one thing to keep in mind: our FTRL implementation is parallelized with OpenMP and Hogwild, so results may vary slightly from run to run and between different systems. |
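Since the whole discussion revolves around tracking logloss per epoch, here is a small stdlib-only helper for computing it (an illustrative sketch, not the code used in the thread; the clipping constant `eps` is my own choice):

```python
import math

def logloss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from
    0 and 1 so that log() never sees an exact zero."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Coin-flip predictions give the classic baseline ln(2) ~ 0.6931:
print(round(logloss([1, 0], [0.5, 0.5]), 4))  # 0.6931
```

If this number goes up across epochs on the training rows themselves, something is wrong with either the model state or the rows being scored.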
Thanks for your kind reply, and sorry for the missing parts in the code ... :( I should have checked twice. Your implementation is lightning fast, but here's what I get (on the command line; the previous post was from a notebook):
Do you have any idea what could cause the issue? As I said, I had to build from source, and that may be the problem, but the tests looked OK and datatable seems to work fine otherwise. Thanks for your help, |
Olivier, the OMP error that you are seeing is caused by several OMP libraries being linked at runtime. Here's what this means:
The right solution to this problem is to recompile all the libraries with dynamic loading of OpenMP, although I understand that this might not be an easy thing to do. Possible workarounds are:
|
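The specific workarounds listed above were not captured here, but one commonly used stopgap for the "multiple OpenMP runtimes" abort (an assumption on my part that this is the error being hit; Intel documents the flag as unsupported and potentially unsafe) is to set `KMP_DUPLICATE_LIB_OK` before any OMP-linked library is imported:

```python
import os

# Must be set BEFORE importing numpy, pandas, datatable, or anything
# else that drags in an OpenMP runtime: it tells Intel's libiomp to
# tolerate a second OpenMP runtime in the same process instead of
# aborting. Treat it as a stopgap, not a fix.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

print(os.environ["KMP_DUPLICATE_LIB_OK"])  # TRUE
```

The same effect can be had by exporting the variable in the shell before launching Python.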
@goldentom42 Olivier, if your original Python code works and you clarify points (1) and (2), I can also try to reproduce the problem on my side. I suspect the difference may be caused by the fact that I measure logloss for all the rows |
Thank you so much @st-pasha and @oleksiyskononenko. This code works:
So I need to make sure I import datatable on the very last line of the imports! @oleksiyskononenko, for the points to clarify: I'll be using your FTRL implementation for the MSFT malware competition on Kaggle. It's a large dataset: 8+ million rows. Arno said it was lightning fast, and I have to say it is :) Again, thank you for your help. P.S.: I may advertise datatable's FTRL on Kaggle over the weekend, hope you don't mind! |
Dear Olivier @goldentom42, you're welcome. I see, so the reason for the logloss going up in your original script was the subset of rows used for validation, which was not relevant for this particular dataset? At least running the latest edition of your code, logloss seems to go down in a similar way to what I posted above. If that is the case, I will close the ticket. Good luck with the competition; I hope our implementation works well for you. Of course, you're more than welcome to advertise it :) |
Well, logloss did go up with the cross-validation in place, for both the training and validation sets. I believe the problem I encountered came from not importing datatable last (I'm using pandas, numpy and all sorts of other packages). But now that the import is last, all works well: 7s for almost 9 million rows!!! WOW, that's amazing, so I'll surely advertise it ;-) Again, thanks for your kind support, and you're welcome to close the ticket! |
Oops, I have just tried the following code:
And the training logloss goes up:
I would normally expect the training loss to decrease over runs. So I decided to fit FTRL on the whole dataset, changing these lines:
What I get is :
So the global log_loss decreases, but not on slices ... Sorry to be a pain. |
FYI, I managed to make the cross-validation loop work with the following code :
Not sure what's going on, but it seems I found a workaround. Best, Olivier |
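The actual workaround code isn't shown above, but the overall shape of such a cross-validation loop can be sketched with a stdlib-only fold generator (function name and fold scheme are my own, purely illustrative; in the thread, each fold's rows were materialized into fresh frames rather than passed as views):

```python
import random

def kfold_indices(n, k, seed=42):
    """Yield (train_idx, valid_idx) pairs; the k validation slices
    partition the n row indices after a single shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for i in range(k):
        valid = idx[i::k]                 # every k-th shuffled index
        vset = set(valid)
        train = [j for j in idx if j not in vset]
        yield train, valid

for train_idx, valid_idx in kfold_indices(10, 3):
    print(len(train_idx), len(valid_idx))  # 6 4 / 7 3 / 7 3
```

Each `(train_idx, valid_idx)` pair would then be used to build the per-fold training and validation frames.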
Hi Olivier @goldentom42, it could be that there is really a bug in the |
Thanks @oleksiyskononenko, yeah, I came to the same conclusion. However, the workaround I proposed does work when you create new datatable frames for the predict, which is weird but may make sense to you. Please let me know if I can be of any help; I really like your FTRL implementation ;-) Thanks for your help, |
@goldentom42 thanks! Yep, nice catch: your last version does work because then you don't need to specify |
Cool, let me know if you want me to pull the PR when it's ready, build it, and check that it works. Have an enjoyable Christmas ;-) |
Thanks, you too! You should automatically receive a message when this ticket is closed. I will refer to this issue when I make a PR with the fix. |
Fix pulled, built and tested. |
@goldentom42 thanks, did it help? |
@oleksiyskononenko, compared to the workaround I used, the results are the same. I have not tried using interactions yet; with 82 features, that sounds like a good challenge! |
@goldentom42 Yep, if it now gives the same results as your workaround, it means the bug is fixed. Thanks for your help! As for the feature interactions, we currently do a full second order, which may need decent computer resources and could take a significant amount of time when there are many features. At some point we will allow arbitrary feature interactions that the user could specify based, for instance, on feature importance. More details here: #1397 |
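For a sense of scale: a full second-order expansion means all unordered feature pairs, so 82 features give 82·81/2 = 3321 interaction columns before any training starts. The pair generation itself is a one-liner over the stdlib (a sketch; the `a:b` token naming is hypothetical, not datatable's scheme):

```python
from itertools import combinations

def second_order(features):
    """All unordered pairs of feature names as interaction tokens."""
    return [f"{a}:{b}" for a, b in combinations(features, 2)]

print(second_order(["f0", "f1", "f2"]))  # ['f0:f1', 'f0:f2', 'f1:f2']
print(len(second_order([f"f{i}" for i in range(82)])))  # 3321
```

The quadratic growth in pairs is exactly why arbitrary user-specified interactions (e.g. restricted to the most important features) are attractive.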
Hi,
I'm trying to use datatable's FTRL proximal algorithm on a dataset, and it behaves strangely:
LogLoss increases with the number of epochs.
Here is the code I use:
The output is:
My own version of FTRL trains correctly, with the following output:
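For readers comparing against their own implementation, the standard FTRL-proximal per-coordinate update for logistic loss (McMahan et al., 2013) can be sketched in pure Python. This is an illustrative toy, not datatable's implementation, and the hyperparameter defaults here are arbitrary:

```python
import math

class FtrlProximal:
    """Toy FTRL-proximal learner for logistic loss on sparse inputs.
    An input row is a list of (feature_index, value) pairs."""

    def __init__(self, nfeatures, alpha=1.0, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * nfeatures   # accumulated "adjusted" gradients
        self.n = [0.0] * nfeatures   # accumulated squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0               # L1 shrinks small coordinates to zero
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict_one(self, x):
        wx = sum(self._weight(i) * v for i, v in x)
        return 1.0 / (1.0 + math.exp(-max(-35.0, min(35.0, wx))))

    def fit_one(self, x, y):
        p = self.predict_one(x)      # predict with the current weights
        for i, v in x:
            g = (p - y) * v          # gradient of logloss w.r.t. w_i
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
        return p
```

On any trivially separable toy set, the mean per-epoch training logloss of this sketch decreases run after run, which is the behaviour under discussion in this issue.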
I'm on ubuntu 16.04, clang+llvm-7.0.0-x86_64-linux-gnu-ubuntu-16.04, python 3.6,
datatable is compiled from source.
Let me know if you need more.
I guess I'm missing something, but could not find anything in the unit tests.
Thanks for your help.
P.S.: the make test results and the dataset I use are attached.
datatable_make_test_results.txt
dt_ftrl_test_set.csv.gz