
Add domain adaptation example and gradient reversal layer #4031

Closed
wants to merge 8 commits

Conversation


@zumpchke commented Oct 12, 2016

Add an example implementing the 'Domain-Adversarial Training of Neural Networks' paper (https://arxiv.org/abs/1505.07818).

This allows domain adaptation in an unsupervised manner by forcing the net to learn features that are invariant between the training and target domains, using the concept of gradient reversal.
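
For readers unfamiliar with the technique, a minimal backend-agnostic sketch of the gradient reversal idea (an editorial illustration, not the code added by this PR; it assumes the backend exposes K.stop_gradient): the forward pass is the identity, while the backward pass multiplies incoming gradients by -hp_lambda.

from keras import backend as K
from keras.layers import Lambda

def reverse_gradient(x, hp_lambda=1.0):
    # Forward: (1 + hp_lambda) * x - hp_lambda * x == x, i.e. the identity.
    # Backward: only the second term is differentiated, so incoming
    # gradients are scaled by -hp_lambda.
    return K.stop_gradient((1.0 + hp_lambda) * x) - hp_lambda * x

# Possible use in a functional-API model, in front of the domain classifier branch
# (shared_features is a placeholder name):
# domain_branch = Lambda(reverse_gradient, arguments={'hp_lambda': 0.5})(shared_features)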

Further:

  • Nontrivial example of the functional API.
  • Shows how to visualize the activations of an intermediate layer.
  • Example of a broken-out training loop.

Credits:

[Attached images: figure_1, figure_2, figure_3, figure_4, model]

@fchollet (Collaborator) left a comment
Incomplete review for now. I would appreciate it if someone else could jump in and review this PR.



class GradientReversal(Layer):
    def __init__(self, l, **kwargs):
Collaborator:

Use a variable with a complete name, not l.

Author:

fixed

        return input_shape

    def get_config(self):
        config = {'lambda': self.l}
Collaborator:

The entries in config should match the arguments in __init__.

Author:

fixed

        if K._BACKEND == 'theano':
            self.op = K.ReverseGradient(self.l)
        elif K._BACKEND == 'tensorflow':
            self.op = K.ReverseGradientBuilder()
Collaborator:

The interface should be the same for Theano and TensorFlow.

Author:

fixed

@fchollet (Collaborator)

It looks like you are including many changes that are unrelated to this PR, which makes it difficult to review. Please rebase from master.

@zumpchke (Author)

@fchollet done

@zumpchke (Author)

@fchollet Any further feedback on this PR?

@@ -983,7 +983,8 @@ def _standardize_user_data(self, x, y,
         sample_weights = [standardize_weights(ref, sw, cw, mode)
                           for (ref, sw, cw, mode)
                           in zip(y, sample_weights, class_weights, self.sample_weight_modes)]
-        check_array_lengths(x, y, sample_weights)
+        if check_batch_dim:
+            check_array_lengths(x, y, sample_weights)
Collaborator:

I don't get this. Can you clarify what you are trying to do?

Author:

In the training process for domain adaptation, the input batch contains batch_size/2 samples from the source domain and batch_size/2 samples from the target domain. Per the model (picture added to the PR), the full batch is used in the left branch to do "domain classification", whereas the first half of the batch is sliced out for feature extraction in the right branch for the source domain.

The purpose of this is to bypass a sanity check in Keras by propagating a kwarg from train_on_batch, allowing unequal batch lengths for the X and y arguments.
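
A rough illustration of the intended call (the array names and shapes are made up, and check_batch_dim is the kwarg proposed by this PR, not part of released Keras):

# The full batch feeds the domain output, while only the source half has class
# labels, so the targets have unequal first dimensions and the usual length
# check has to be relaxed.
x_batch = np.concatenate([x_source_half, x_target_half])  # shape (batch_size, ...)
y_batch = [y_domain,                                       # shape (batch_size,)
           y_class_source]                                 # shape (batch_size // 2, ...)
model.train_on_batch(x_batch, y_batch, check_batch_dim=False)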

@@ -1215,6 +1216,7 @@ def train_on_batch(self, x, y,
                 from this class during training.
                 This can be useful to tell the model to "pay more attention" to
                 samples from an under-represented class.
+            check_batch_dim: Whether to check batch dimensions for sanity.
Collaborator:

Description not consistent with actual behavior...

Author:

Tried to make this clearer (as above).


class ReverseGradient(theano.Op):
    '''Flips the sign of incoming gradient during training.'''
    view_map = {0: [0]}
Collaborator:

This probably doesn't do what you think it does. Do not use global class attributes, especially not ones that are pointers.

Author:

Fixed

        self.trainable_weights = []

    def call(self, x, mask=None):
        return self.op(x)
Collaborator:

If all you do with ReverseGradient is call it, why should it be a class? Everything in the backend is a function.

@@ -1924,3 +1925,24 @@ def ctc_decode(y_pred, input_length, greedy=True, beam_width=100,
                      for st in decoded]

     return (decoded_dense, log_prob)
+
+
+class ReverseGradient(object):
Collaborator:

This should definitely be a function, taking two arguments.

@pumpikano:

I can clarify this a bit since I wrote the original here. The grad reversal layer overrides the gradient of identity with an expression that has a hyperparam lambda. Since different instances can have different lambdas, we need to register a new gradient op for each instance of the grad reversal layer. This is accomplished by assigning each new gradient op a unique name via num_calls. The only reason this is a callable class is to avoid num_calls being a global.

A couple comments on the implementation here:

  1. It doesn't make sense for the lambda hyperparam to be an arg to __init__ here - it defeats the entire purpose of implementing this as a class.
  2. The only reason this is a class is to organize num_calls, so the class should be instantiated once right here to make an op function, and the class itself need not be exported from the module, e.g. https://github.com/pumpikano/tf-dann/blob/master/flip_gradient.py#L22. Alternately, num_calls can be a global or captured in a closure - this is a style decision that is up to @fchollet.

Hope this helps.
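
For reference, the pattern described above looks roughly like this in TensorFlow 1.x graph mode (a sketch adapted from the idea in the linked flip_gradient.py, not a verbatim copy):

import tensorflow as tf

class FlipGradientBuilder(object):
    '''Forward pass is the identity; backward pass multiplies gradients by -l.
    A uniquely named gradient op is registered per call, so each use of the
    reversal can carry its own lambda multiplier.'''

    def __init__(self):
        self.num_calls = 0

    def __call__(self, x, l=1.0):
        grad_name = 'FlipGradient%d' % self.num_calls

        @tf.RegisterGradient(grad_name)
        def _flip_gradients(op, grad):
            return [tf.negative(grad) * l]

        g = tf.get_default_graph()
        with g.gradient_override_map({'Identity': grad_name}):
            y = tf.identity(x)

        self.num_calls += 1
        return y

# instantiated once at module level; only the resulting callable is exported
flip_gradient = FlipGradientBuilder()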


Another option: generate a unique gradient name with e.g. system time and forget num_calls altogether.

@@ -1319,6 +1319,7 @@ def switch(condition, then_expression, else_expression):
         condition: scalar tensor.
         then_expression: TensorFlow operation.
         else_expression: TensorFlow operation.
+        lazy: Unused (compatibility with Theano backend)
Collaborator:

Is this argument really indispensable then?

Author:

Not sure what you mean, but the purpose here was to ensure that we don't get a TypeError when switching backend implementations while passing the lazy kwarg to switch.

@zumpchke (Author)

Hey @pumpikano @fchollet, thanks for the feedback. I've tried to simplify this PR to fit more into the ethos of 'everything in the backend is a function'. Let me know what you think.

@jmhessel (Contributor)

Does this have the potential to get merged soon? :-) It would be a super helpful feature, I think!

# When building the DANN model, route the first half of the batch (source examples)
# to the label classifier, and route the full batch (half source, half target)
# to the domain classifier.
net = Lambda(lambda x: K.switch(K.learning_phase(), x[:int(batch_size / 2), :], x, lazy=True),
@calanchue commented Nov 28, 2016

Do we need the switch operation (and the related core modifications) and lazy to do gradient reversal training? How about just making two models: a source model which includes both classifier_output and domain_output, and a target model which only includes domain_output? Then train the source model with only source batch data and the target model with only target batch data in each epoch, i.e. call train_on_batch twice, once for the source model and once for the target model.
This would reduce the Keras core modifications that start from this new 'switch' operation, and would also make the trained model independent of the training batch size when it is evaluated.
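
A very rough sketch of this two-model idea under the functional API (all names below - input_dim, feature_extractor, label_classifier, domain_classifier, x_src, y_src, x_tgt - are illustrative placeholders assumed to be defined elsewhere):

import numpy as np
from keras.layers import Input
from keras.models import Model

inp = Input(shape=(input_dim,))
features = feature_extractor(inp)            # shared layers
class_out = label_classifier(features)       # label predictor head
domain_out = domain_classifier(features)     # domain classifier head

source_model = Model(inp, [class_out, domain_out])
target_model = Model(inp, domain_out)        # shares weights with source_model
source_model.compile('sgd', ['categorical_crossentropy', 'binary_crossentropy'])
target_model.compile('sgd', 'binary_crossentropy')

# alternate updates: a labelled source batch, then an unlabelled target batch
source_model.train_on_batch(x_src, [y_src, np.zeros(len(x_src))])
target_model.train_on_batch(x_tgt, np.ones(len(x_tgt)))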

@rykov8:

Keras allows supplying the train methods with sample weights, so I think it is possible to just pass a dictionary of sample weights. For classifier_output it should be the vector

classifier_output_w = np.ones(batch_size)
classifier_output_w[batch_size//2:] = 0

and for domain_output the sample weights are just

domain_output_w = np.ones(batch_size)

Actually, I am trying to train a net (not on MNIST, though) with this approach and it seems to work.
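
In other words, a sketch assuming a two-output functional model whose outputs are named classifier_output and domain_output (x_batch, y_class and y_domain are placeholder arrays):

# Per-output sample weights are passed as a dict keyed by output name; the zero
# weights on the target half keep the fictive class labels out of the loss.
model.train_on_batch(x_batch,
                     {'classifier_output': y_class, 'domain_output': y_domain},
                     sample_weight={'classifier_output': classifier_output_w,
                                    'domain_output': domain_output_w})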

@calanchue:

@rykov8 How can I control each output's weight for each sample? Do you mean training like below?

        metrics = dann_model.train_on_batch(
            sample_weights=classifier_output_w)
        metrics = dann_model.train_on_batch(
            sample_weights=domain_output_w)

But in this way, the model sees twice as much source data as target data.

@rykov8 commented Dec 1, 2016

@calanchue No, actually, I mean that we can avoid this Lambda layer (in this example script): build a model exactly as it is constructed here, but without this layer. Then the idea is the following:
we generate a batch that contains 1/2 labelled data and 1/2 auxiliary data (used for adaptation). Moreover, we generate classification labels that are 1/2 true labels and 1/2 fictive labels (they don't actually matter), and also domain labels to denote whether a sample comes from the labelled data or from the auxiliary data. Last but not least are the sample weights described above. Then we just pass this data to any train method of our model. Some pseudo-code (not a working sample, I didn't test it, just to show the idea):

def generator(X, labels, aux_data):
    # not a real generator, outputs the same batch, just to show the idea
    while True:
        train_data = X[:batch_size].copy()  # copy so the source array is not modified in place
        train_data[batch_size//2:] = aux_data[:batch_size//2]
        y = np.zeros(batch_size)
        y[:batch_size//2] = labels[:batch_size//2]
        domain_y = np.zeros(batch_size)
        domain_y[:batch_size//2] = 1
        classifier_output_w = np.ones(batch_size)
        classifier_output_w[batch_size//2:] = 0
        domain_output_w = np.ones(batch_size)
        feed_dict = ({'main_input': train_data},
                     {'classifier_output': y,
                      'domain_output': domain_y},
                     {'classifier_output': classifier_output_w,
                      'domain_output': domain_output_w})
        yield feed_dict

...
gen = generator(X, labels, aux_data)
x_batch, y_batch, w_batch = next(gen)
model.train_on_batch(x_batch, y_batch, sample_weight=w_batch)

I use the model.fit_generator method in my experiments, but I think that is not very important. Here we can see that the fictive labels don't influence the main classifier because of the sample weights and, moreover, we train the main classifier and the domain classifier in one forward-backward pass without any changes to the Keras backend or other stuff.
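
For completeness, a sketch of driving training with such a generator via the Keras 1.x fit_generator signature (the numeric arguments are placeholders):

gen = generator(X, labels, aux_data)
model.fit_generator(gen,
                    samples_per_epoch=100 * batch_size,  # placeholder value
                    nb_epoch=50)                         # placeholder value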

@calanchue:

@rykov8 I have tested it with train_on_batch and it works. Thank you! :+1:

@rykov8 commented Dec 2, 2016

You're welcome! I hope it helps to get this PR merged into the main branch. I believe that the gradient reversal layer and the domain adaptation method that uses it could be quite useful for the community.

@calanchue commented Nov 28, 2016

Got an error while running dann.py, using the Theano backend.

Traceback (most recent call last):
  File "D:/dev/CODE/dann_keras/keras/examples/dann.py", line 224, in <module>
    dann_model = builder.build_dann_model(main_input, hp_lambda)
  File "D:/dev/CODE/dann_keras/keras/examples/dann.py", line 189, in build_dann_model
    branch = self.grl(net, hp_lambda)
  File "D:\dev\CODE\dann_keras\keras\keras\engine\topology.py", line 514, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "D:\dev\CODE\dann_keras\keras\keras\engine\topology.py", line 572, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "D:\dev\CODE\dann_keras\keras\keras\engine\topology.py", line 149, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
TypeError: call() takes at least 3 arguments (3 given)

Maybe hp_lambda should be a parameter of __init__(), not call().

I have tested your older version 262f9d0 and it doesn't have the problem.

@calanchue commented Nov 29, 2016

I made a modified version of DANN and got more accurate results. If it is good enough, we can reduce the core modifications to Keras.

@calanchue commented Nov 29, 2016

There were some mistakes. Ignore 'Two fit' and please see 'Two fit dann'.

l = 2. / (1. + np.exp(-10. * p)) - 1   # lambda schedule from the DANN paper
lr = 0.01 / (1. + 10 * p)**0.75        # learning-rate annealing schedule
hp_lambda = l
builder.opt.lr = lr
@rykov8 commented Dec 1, 2016

I'm not sure, but it seems that here you just override hp_lambda and builder.opt.lr with float numbers but don't change the actual gradient multiplier and learning rate in the graph. I believe you need to use K.set_value(), as is done in the LearningRateScheduler callback.


Thanks for pointing this out! Yes, these changes did not affect the learning rate and hp_lambda. I tried with Keras 2 (TensorFlow). In order to change the hp_lambda and learning rate parameters, I need to use K.set_value.
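
A minimal sketch of the K.set_value approach (assuming hp_lambda was created as a backend variable with K.variable, builder.opt is a Keras optimizer, and p is the training progress in [0, 1]):

import numpy as np
from keras import backend as K

# update the values inside the graph instead of rebinding the Python names
K.set_value(hp_lambda, 2. / (1. + np.exp(-10. * p)) - 1.)
K.set_value(builder.opt.lr, 0.01 / (1. + 10. * p) ** 0.75)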

@keunwoochoi (Contributor)

Is there any update on this PR? It seems like the issues are more about the example than the layer/backend implementations. Perhaps we could merge the layer and backend changes first and then discuss how to deal with the examples.

@fchollet (Collaborator)

Closing outdated PR. If you still care about the content of the PR, please submit a new PR to master, updated for the Keras 2.0 API.
