Make FTRL algo work properly on views #1505

Merged: 9 commits, merged on Dec 26, 2018

Changes from 3 commits
43 changes: 19 additions & 24 deletions c/extras/dt_ftrl.cc
43 changes: 19 additions & 24 deletions c/extras/dt_ftrl.cc
@@ -64,33 +64,28 @@ void Ftrl::fit(const DataTable* dt_X, const DataTable* dt_y) {
auto c_y = static_cast<BoolColumn*>(dt_y->columns[0]);
auto d_y = c_y->elements_r();

// Do training for `nepochs`.
for (size_t i = 0; i < params.nepochs; ++i) {
double total_loss = 0;
const RowIndex ri_X = dt_X->rowindex;
const RowIndex ri_y = c_y->rowindex();

// Do training for `nepochs`.
for (size_t e = 0; e < params.nepochs; ++e) {
#pragma omp parallel num_threads(config::nthreads)
{
// Array to store hashed column values and their interactions.
uint64ptr x = uint64ptr(new uint64_t[nfeatures]);
size_t ith = static_cast<size_t>(omp_get_thread_num());
size_t nth = static_cast<size_t>(omp_get_num_threads());

for (size_t j = ith; j < dt_X->nrows; j += nth) {
if (ISNA<int8_t>(d_y[j])) continue;
bool y = d_y[j];
hash_row(x, j);
double p = predict_row(x);
update(x, p, y);

double loss = logloss(p, y);
#pragma omp atomic update
total_loss += loss;
if ((j+1) % REPORT_FREQUENCY == 0) {
printf("Training epoch: %zu\tRow: %zu\tPrediction: %f\t"
"Current loss: %f\tAverage loss: %f\n",
i, j+1, p, loss, total_loss / (j+1));
ri_X.iterate(ith, dt_X->nrows, nth,
[&](size_t i, size_t j) {
if (!ISNA<int8_t>(d_y[ri_y[i]])) {
bool y = d_y[ri_y[i]];
hash_row(x, j);
double p = predict_row(x);
update(x, p, y);
}
}
}
);
}
}
model_trained = true;
@@ -121,14 +116,14 @@ dtptr Ftrl::predict(const DataTable* dt_X) {
uint64ptr x = uint64ptr(new uint64_t[nfeatures]);
size_t ith = static_cast<size_t>(omp_get_thread_num());
size_t nth = static_cast<size_t>(omp_get_num_threads());
RowIndex ri = dt_X->rowindex;

for (size_t j = ith; j < dt_X->nrows; j += nth) {
hash_row(x, j);
d_y[j] = predict_row(x);
if ((j+1) % REPORT_FREQUENCY == 0) {
printf("Row: %zu\tPrediction: %f\n", j+1, d_y[j]);
ri.iterate(ith, dt_X->nrows, nth,
[&](size_t i, size_t j) {
hash_row(x, j);
d_y[i] = predict_row(x);
}
}
);
}
return dt_y;
}
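The core of the fix above is that iteration now goes through the frame's rowindex: `ri.iterate` hands the lambda both the logical position `i` and the physical row `j` in the underlying data, so hashing and prediction read the correct rows even when `dt_X` is a view. A minimal Python sketch of this idea (the names and signature here are illustrative, not datatable's actual C++ API):

```python
# Sketch: iterating a "view" by translating logical positions into
# physical rows via a rowindex, as ri.iterate does in the patch.
# Purely illustrative -- not datatable's real implementation.

def iterate(rowindex, ith, nrows, nth, fn):
    """Visit logical rows ith, ith+nth, ... and call fn(logical, physical)."""
    for i in range(ith, nrows, nth):
        j = rowindex[i] if rowindex is not None else i  # physical row
        fn(i, j)

data = [10, 20, 30, 40, 50]     # underlying column data
view_index = [4, 2, 0]          # a view selecting physical rows 4, 2, 0

seen = []
iterate(view_index, 0, len(view_index), 1,
        lambda i, j: seen.append((i, data[j])))
# seen == [(0, 50), (1, 30), (2, 10)]
```

Iterating `j` directly from `0` to `nrows`, as the old code did, would read `data[0..2]` here and silently process the wrong rows, which is exactly the bug this PR fixes.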
20 changes: 20 additions & 0 deletions tests/extras/test_ftrl.py
@@ -606,6 +606,26 @@ def test_ftrl_fit_predict_from_setters():
assert_equals(target1, target2)


def test_ftrl_fit_predict_view():
ft = Ftrl(d=100)
df_train = dt.Frame([random.random() for _ in range(ft.d)])
df_target = dt.Frame([bool(random.getrandbits(1)) for _ in range(ft.d)])
rows = range(ft.d//2, ft.d)

ft.fit(df_train[rows,:], df_target[rows,:])
predictions = ft.predict(df_train[rows,:])
model = dt.Frame(ft.model.to_dict())
Contributor Author:

@st-pasha, this is something I don't understand. If I just do model = dt.Frame(ft.model), then model is still a reference to ft.model. Why is it so?

Contributor:

dt.Frame(DT) creates a shallow copy of DT. This is equivalent to DT.copy(). What makes you think it doesn't work?

Contributor Author (@oleksiyskononenko), Dec 24, 2018:

Ah, I see. I'm not saying it doesn't work; I just didn't expect dt.Frame(DT) to return a shallow copy. What is the best way to make a deep copy then? I managed to do it as dt.Frame(ft.model.to_dict()), but that doesn't seem like an elegant solution. Is there a datatable way to do it, or should I just use copy.deepcopy(DT)?

Contributor:

I believe copy.deepcopy() just pickles and then unpickles the object. The more important question is why do you need a deep copy? The datatable should work correctly even with shallow copies.
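For reference, the shallow-vs-deep distinction being debated can be shown with the stdlib `copy` module on plain containers (toy data, not a datatable Frame):

```python
import copy

model = [[0.95, 0.0], [1.00, 0.0]]   # toy stand-in for model columns

shallow = copy.copy(model)           # new outer list, shared inner lists
deep = copy.deepcopy(model)         # fully independent copy

model[0][0] = -1.0                   # mutate the original in place

# The shallow copy observes the mutation; the deep copy does not:
# shallow[0][0] == -1.0, deep[0][0] == 0.95
```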

Contributor Author (@oleksiyskononenko), Dec 24, 2018:

Because I need a copy of the model here, otherwise if I simply do

model = dt.Frame(ft.model)
ft.reset()

model also resets.

Contributor:

I've looked into this a little, and I think there could be a problem with pointers z and n. Even though they are acquired as column->data_w() pointers, they are kept for an arbitrary long period of time. During that time the pointers may become invalid (in particular, they may become invalid for writing if you create a copy of the Frame). It might be better to re-acquire these pointers at the beginning of each python call that needs them.

Contributor Author:

Thanks, I've made changes to re-acquire them at each Python call. Didn't solve the problem with this test though.

Contributor:

Are you sure? Here's what I'm getting:

>>> import random
>>> import datatable as dt
>>> from datatable.models import Ftrl
>>> ft = Ftrl(d=4)
>>> df_train = dt.Frame([random.random() for _ in range(ft.d)])
>>> df_target = dt.Frame([bool(random.getrandbits(1)) for _ in range(ft.d)])
>>> ft.fit(df_train, df_target)
>>> model = ft.model
>>> model.to_list()
[[0.9553244374605763, 0.0, 0.0, 0.0], [1.0001815001713965, 0.0, 0.0, 0.0]]
>>> ft.reset()
>>> model.to_list()
[[0.9553244374605763, 0.0, 0.0, 0.0], [1.0001815001713965, 0.0, 0.0, 0.0]]
>>> print(ft.model)
None
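The session above is ordinary Python name binding: `ft.reset()` rebinds the `model` attribute (here to `None`), but the name `model` still references the old Frame object, whose data is untouched. A plain-Python analogue of the same behavior (`Holder` is a hypothetical stand-in, not datatable's Ftrl class):

```python
# Illustrates why `model` kept its values after ft.reset() in the
# session above: reset() rebinds the attribute, it does not mutate
# the object the other name still points at.

class Holder:
    def __init__(self):
        self.model = [0.955, 1.000]   # stand-in for a fitted model
    def reset(self):
        self.model = None             # rebinds the attribute only

ft = Holder()
model = ft.model     # `model` now names the same list object
ft.reset()

# The old object survives through the second reference:
# model == [0.955, 1.000] while ft.model is None
```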

Contributor Author (@oleksiyskononenko), Dec 26, 2018:

I've recompiled datatable from scratch and it seems like this problem has been fixed. I will update tests. Thanks!

Actually there was also another problem. When I just did df_train_range = dt.Frame(df_train[rows,:]), it inherited the rowindex, so df_train_range was still a view. That is why I had to do df_train_range = dt.Frame(df_train[rows,:].to_list()). Is that expected behavior with shallow copying? If so, what is the best way to create a real frame from a view?

Contributor:

For any frame you can call frame.materialize(), and it will be converted from a "view" into a "real" data frame, in place.
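A rough sketch of the view-vs-materialized distinction described here, using plain Python containers (purely illustrative, not datatable's internals):

```python
# A "view" keeps a reference to the parent's data plus a rowindex;
# "materializing" copies the selected rows into the frame itself.
# ToyFrame is a hypothetical stand-in, not datatable's Frame.

class ToyFrame:
    def __init__(self, data, rowindex=None):
        self.data = data
        self.rowindex = rowindex      # None => real (non-view) frame

    def rows(self):
        if self.rowindex is None:
            return list(self.data)
        return [self.data[j] for j in self.rowindex]

    def materialize(self):
        """Convert a view into a real frame, in place."""
        self.data = self.rows()
        self.rowindex = None

parent = ToyFrame([1.0, 2.0, 3.0, 4.0])
view = ToyFrame(parent.data, rowindex=[2, 3])   # selects rows 2 and 3
view.materialize()
# view.rows() == [3.0, 4.0] and view.rowindex is None
```

After `materialize()`, the frame owns its own data and no longer needs the rowindex indirection, which is why it sidesteps the `to_list()` workaround discussed above.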


ft.reset()
df_train_half = dt.Frame(df_train[rows,:].to_list())
df_target_half = dt.Frame(df_target[rows,:].to_list())
ft.fit(df_train_half, df_target_half)
predictions_half = ft.predict(df_train_half)

assert_equals(model, ft.model)
assert_equals(predictions, predictions_half)


#-------------------------------------------------------------------------------
# Test feature importance
#-------------------------------------------------------------------------------