Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make FTRL algo to work properly on views #1505

Merged
merged 9 commits into from
Dec 26, 2018
Merged

Make FTRL algo to work properly on views #1505

merged 9 commits into from
Dec 26, 2018

Conversation

oleksiyskononenko
Copy link
Contributor

@oleksiyskononenko oleksiyskononenko commented Dec 24, 2018

  • take into account rowindices in column hashers, so that FTRL algo trains and predicts on correct data even if rowindex is not empty;
  • add a relevant test.

Closes #1502

@oleksiyskononenko oleksiyskononenko added the bug Any bugs / errors in datatable; however for severe bugs use [segfault] label label Dec 24, 2018
@oleksiyskononenko oleksiyskononenko self-assigned this Dec 24, 2018
c/extras/dt_ftrl.cc Outdated Show resolved Hide resolved

ft.fit(df_train[rows,:], df_target[rows,:])
predictions = ft.predict(df_train[rows,:])
model = dt.Frame(ft.model.to_dict())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@st-pasha, this is something I don't understand. If I just do model = dt.Frame(ft.model), then model is still a reference to ft.model. Why is it so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dt.Frame(DT) creates a shallow copy of DT. This is equivalent to DT.copy(). What makes you think it doesn't work?

Copy link
Contributor Author

@oleksiyskononenko oleksiyskononenko Dec 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I don't say it doesn't work. I just didn't expect dt.Frame(DT) to return a shallow copy. What is the best way to make a deep copy then? I managed to do it as dt.Frame(ft.model.to_dict()), but it doesn't seem to be an elegant solution. Is there any datatable way to do it, or I better just do copy.deepcopy(DT)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe copy.deepcopy() just pickles and then unpickles the object. The more important question is why do you need a deep copy? The datatable should work correctly even with shallow copies.

Copy link
Contributor Author

@oleksiyskononenko oleksiyskononenko Dec 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I need a copy of the model here, otherwise if I simply do

model = dt.Frame(ft.model)
ft.reset()

model also resets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked into this a little, and I think there could be a problem with pointers z and n. Even though they are acquired as column->data_w() pointers, they are kept for an arbitrary long period of time. During that time the pointers may become invalid (in particular, they may become invalid for writing if you create a copy of the Frame). It might be better to re-acquire these pointers at the beginning of each python call that needs them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've made changes to re-acquire them at each Python call. Didn't solve the problem with this test though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? Here's what I'm getting:

>>> import random
>>> import datatable as dt
>>> from datatable.models import Ftrl
>>> ft = Ftrl(d=4)
>>> df_train = dt.Frame([random.random() for _ in range(ft.d)])
>>> df_target = dt.Frame([bool(random.getrandbits(1)) for _ in range(ft.d)])
>>> ft.fit(df_train, df_target)
>>> model = ft.model
>>> model.to_list()
[[0.9553244374605763, 0.0, 0.0, 0.0], [1.0001815001713965, 0.0, 0.0, 0.0]]
>>> ft.reset()
>>> model.to_list()
[[0.9553244374605763, 0.0, 0.0, 0.0], [1.0001815001713965, 0.0, 0.0, 0.0]]
>>> print(ft.model)
None

Copy link
Contributor Author

@oleksiyskononenko oleksiyskononenko Dec 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've recompiled datatable from scratch and it seems like this problem has been fixed. I will update tests. Thanks!

Actually there was also another problem. When I just did df_train_range = dt.Frame(df_train[rows,:]), it kind of inherited rowindex and df_train_range was still a view. That is why I had to do df_train_range = dt.Frame(df_train[rows,:].to_list()). Is it something we expect due to shallow copying? If yes, then what is the best way to create a real frame based on the view?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for any frame you can write frame.materialize(), and it will be converted from "view" into a "real" data frame, in-place.


ft.fit(df_train[rows,:], df_target[rows,:])
predictions = ft.predict(df_train[rows,:])
model = dt.Frame(ft.model.to_dict())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dt.Frame(DT) creates a shallow copy of DT. This is equivalent to DT.copy(). What makes you think it doesn't work?

c/extras/dt_ftrl.cc Outdated Show resolved Hide resolved
c/extras/dt_ftrl.cc Outdated Show resolved Hide resolved
@st-pasha st-pasha merged commit 25cd0c8 into master Dec 26, 2018
@st-pasha st-pasha deleted the ftrl-views branch December 26, 2018 11:22
@st-pasha st-pasha added this to the Release 0.8.0 milestone Jan 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Any bugs / errors in datatable; however for severe bugs use [segfault] label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FTRL algo does not work properly on views
2 participants