Add domain adaptation example and gradient reversal layer #4031
Conversation
Incomplete review for now. I would appreciate it if someone else could jump in and review this PR.
class GradientReversal(Layer):
    def __init__(self, l, **kwargs):
Use a variable with a complete name, not l.
fixed
        return input_shape

    def get_config(self):
        config = {'lambda': self.l}
The entries in config should match the arguments in __init__.
fixed
        if K._BACKEND == 'theano':
            self.op = K.ReverseGradient(self.l)
        elif K._BACKEND == 'tensorflow':
            self.op = K.ReverseGradientBuilder()
The interface should be the same for Theano and TensorFlow.
fixed
It looks like you are including in the PR many changes that are unrelated to the PR. This makes the PR difficult to review. Please rebase from master.
Force-pushed from f7c65ca to 262f9d0
@fchollet done
@fchollet Any further feedback on this PR?
@@ -983,7 +983,8 @@ def _standardize_user_data(self, x, y,
        sample_weights = [standardize_weights(ref, sw, cw, mode)
                          for (ref, sw, cw, mode)
                          in zip(y, sample_weights, class_weights, self.sample_weight_modes)]
-       check_array_lengths(x, y, sample_weights)
+       if check_batch_dim:
+           check_array_lengths(x, y, sample_weights)
I don't get this. Can you clarify what you are trying to do?
In the training process for domain adaptation, the input batch contains batch_size/2 samples from the source domain and batch_size/2 samples from the target domain. Per the model (see the picture added to the PR), the full batch is used in the left branch for domain classification, whereas the first half of the batch is sliced out for feature extraction in the right branch (source domain only).
The purpose of this change is to bypass a sanity check in Keras by propagating a kwarg from train_on_batch, allowing unequal batch lengths for the X and y arguments.
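For context, a hedged sketch of what such a call might look like under this PR's proposed train_on_batch signature (the array names and shapes are illustrative, not taken from the example script):

# Hypothetical call, assuming the check_batch_dim kwarg added in this PR.
# X_batch: (batch_size, ...) - source samples in the first half, target samples in the second.
# y_class: (batch_size / 2, nb_classes) - labels for the source half only.
# y_domain: (batch_size, 2) - source-vs-target labels for the full batch.
model.train_on_batch(X_batch,
                     {'classifier_output': y_class, 'domain_output': y_domain},
                     check_batch_dim=False)  # skip the equal-length sanity check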
@@ -1215,6 +1216,7 @@ def train_on_batch(self, x, y,
                from this class during training.
                This can be useful to tell the model to "pay more attention" to
                samples from an under-represented class.
+           check_batch_dim: Whether to check batch dimensions for sanity.
Description not consistent with actual behavior...
Tried to make this clearer (as above).
class ReverseGradient(theano.Op):
    '''Flips the sign of incoming gradient during training.'''
    view_map = {0: [0]}
This probably doesn't do what you think it does. Do not use global class attributes, especially not ones that are pointers.
Fixed
        self.trainable_weights = []

    def call(self, x, mask=None):
        return self.op(x)
If all you do with ReverseGradient is call it, why should it be a class? Everything in the backend is a function.
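As an aside, one way to express gradient reversal as a single backend-agnostic function is the stop-gradient trick; a rough sketch, assuming K.stop_gradient is available in both backends (the function name is illustrative):

from keras import backend as K

def reverse_gradient(x, hp_lambda):
    # Forward pass: the two terms sum to x, so this is the identity.
    # Backward pass: only the first term has a gradient, so incoming
    # gradients are scaled by -hp_lambda.
    return -hp_lambda * x + K.stop_gradient((1. + hp_lambda) * x)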
@@ -1924,3 +1925,24 @@ def ctc_decode(y_pred, input_length, greedy=True, beam_width=100,
                     for st in decoded]
    return (decoded_dense, log_prob)


+class ReverseGradient(object):
This should definitely be a function, taking two arguments.
I can clarify this a bit since I wrote the original here. The grad reversal layer overrides the gradient of identity with an expression that has a hyperparam lambda. Since different instances can have different lambdas, we need to register a new gradient op for each instance of the grad reversal layer. This is accomplished by assigning each new gradient op a unique name via num_calls. The only reason this is a callable class is to avoid num_calls being a global.
A couple comments on the implementation here:
- It doesn't make sense for the lambda hyperparam to be an arg to __init__ here - it defeats the entire purpose of implementing this as a class.
- The only reason this is a class is to organize num_calls, so the class should be instantiated once right here to make an op function, and the class itself need not be exported from the module, e.g. https://github.com/pumpikano/tf-dann/blob/master/flip_gradient.py#L22 (see the sketch below). Alternately, num_calls can be a global or captured in a closure - this is a style decision that is up to @fchollet.
Hope this helps.
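For reference, the pattern in the linked flip_gradient.py looks roughly like the sketch below (assuming TF 1.x graph-mode APIs such as tf.RegisterGradient and gradient_override_map; this is not the exact code in the PR):

import tensorflow as tf

class FlipGradientBuilder(object):
    '''Callable that wraps tf.identity with a per-call reversed gradient.'''
    def __init__(self):
        self.num_calls = 0

    def __call__(self, x, hp_lambda=1.0):
        # Register a uniquely named gradient so each call can capture
        # its own hp_lambda.
        grad_name = 'FlipGradient%d' % self.num_calls

        @tf.RegisterGradient(grad_name)
        def _flip_gradients(op, grad):
            return [-hp_lambda * grad]

        g = tf.get_default_graph()
        with g.gradient_override_map({'Identity': grad_name}):
            y = tf.identity(x)

        self.num_calls += 1
        return y

# Instantiated once at module level, so callers just use it like a function.
flip_gradient = FlipGradientBuilder()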
Another option: generate a unique gradient name with e.g. system time and forget num_calls altogether.
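A minimal sketch of that alternative (using uuid rather than the system time, to avoid collisions; illustrative only):

import uuid
# A gradient name that is unique per call, so no shared counter is needed:
grad_name = 'ReverseGradient_' + uuid.uuid4().hex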
@@ -1319,6 +1319,7 @@ def switch(condition, then_expression, else_expression):
        condition: scalar tensor.
        then_expression: TensorFlow operation.
        else_expression: TensorFlow operation.
+       lazy: Unused (compatibility with Theano backend)
Is this argument really indispensable then?
Not sure what you mean, but the purpose here was to ensure that we don't get a TypeError upon switching implementations and using the lazy kwarg to switch.
Hey @pumpikano @fchollet, thanks for the feedback. I've tried to simplify this PR to fit more into the ethos of 'everything in the backend is a function'. Let me know what you think.
Does this have the potential to get merged soon? :-) It would be a super helpful feature, I think!
# When building the DANN model, route the first half of the batch (source examples)
# to the label classifier, and route the full batch (half source, half target)
# to the domain classifier.
net = Lambda(lambda x: K.switch(K.learning_phase(), x[:int(batch_size / 2), :], x, lazy=True),
Do we need those switch operations (and the related core modification) and lazy to do gradient reversal training? How about just making two models: a source model which includes both classifier_output and domain_output, and a target model which only includes domain_output. After that, train the source model with only source batch data and the target model with only target batch data in each epoch, i.e. call train_on_batch twice, once for the source model and once for the target model.
This would reduce the Keras core modification that starts with this new 'switch' operation, and also make the trained model independent of the training batch size when it is evaluated.
Keras allows you to supply train methods with sample weights. So, I think it is possible just to pass a dictionary of sample weights. For classifier_output it should be the vector
classifier_output_w = np.ones(batch_size)
classifier_output_w[batch_size//2:] = 0
and for domain_output the sample weights are just
domain_output_w = np.ones(batch_size)
Actually, I am trying to train a net (not for MNIST, though) with this approach and it seems to work.
@rykov8 How can I control each output's weight for each sample?
Do you mean training like below?
metrics = dann_model.train_on_batch(
    ..., sample_weight=classifier_output_w
)
metrics = dann_model.train_on_batch(
    ..., sample_weight=domain_output_w
)
But in this way, the model sees twice as much source data as target data.
@calanchue No, actually, I mean that we can avoid this Lambda layer (in this example script): build a model exactly like it is constructed here, but without this layer. Then the idea is the following:
we generate a batch that has 1/2 labelled data and 1/2 auxiliary data (used for adaptation). Moreover, we generate classification labels that have 1/2 true labels and 1/2 fictive labels (they don't actually matter), and also domain labels to denote whether a sample comes from the labelled data or from the auxiliary data. Last but not least are the sample weights described above. Then we just pass this data to any train method of our model. Some pseudo-code (not a working sample, I didn't test it, just to explain the idea):
def generator(X, labels, aux_data):
    # not a real generator, outputs the same batch, just to show the idea
    while True:
        train_data = X[:batch_size].copy()  # copy so the source array is not modified in place
        train_data[batch_size//2:] = aux_data[:batch_size//2]
        y = np.zeros(batch_size)
        y[:batch_size//2] = labels[:batch_size//2]
        domain_y = np.zeros(batch_size)
        domain_y[:batch_size//2] = 1
        classifier_output_w = np.ones(batch_size)
        classifier_output_w[batch_size//2:] = 0
        domain_output_w = np.ones(batch_size)
        yield ({'main_input': train_data},
               {'classifier_output': y,
                'domain_output': domain_y},
               {'classifier_output': classifier_output_w,
                'domain_output': domain_output_w})
...
x, y, sample_weight = next(generator(X, labels, aux_data))
model.train_on_batch(x, y, sample_weight=sample_weight)
I use the model.fit_generator method in my experiments, but I think it is not very important. Here we can see that the fictive labels don't influence the main classifier because of the sample weights and, moreover, we train the main classifier and the domain classifier in one forward-backward pass without any changes to the Keras core or other stuff.
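For illustration, a hedged sketch of how the generator above could be plugged into fit_generator (Keras 1.x-style arguments; model, X, labels and aux_data are the names from the pseudo-code):

model.fit_generator(generator(X, labels, aux_data),
                    samples_per_epoch=X.shape[0],
                    nb_epoch=10)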
I have tested it with train_on_batch and it works. Thank you! :+1:
You're welcome! Hope it will help to merge this PR into the main branch. I believe that the gradient reversal layer and the domain adaptation method that uses it could be quite useful for the community.
Got some error while running dann.py (using the Theano backend).
Maybe hp_lambda should be a parameter of __init__(), not call(). I have tested your older version 262f9d0 and it doesn't have the problem.
I made a modified version of dann and got a more accurate result. If it is good enough, we can reduce the core modification to Keras.
There were some mistakes. Ignore 'Two fit' and please see 'Two fit dann'.
l = 2. / (1. + np.exp(-10. * p)) - 1
lr = 0.01 / (1. + 10 * p)**0.75
hp_lambda = l
builder.opt.lr = lr
I'm not sure, but it seems that here you just override hp_lambda and builder.opt.lr to become float numbers, but don't change the real gradient multiplier and learning rate in the graph. I believe that you need to use K.set_value() as it is done in the LearningRateScheduler callback.
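For illustration, a rough sketch of that suggestion, assuming hp_lambda was created as a backend variable (e.g. K.variable(1.0)) that the reversal op reads; the schedule variables here are hypothetical:

import numpy as np
from keras import backend as K

p = float(step) / num_steps              # training progress in [0, 1]
l = 2. / (1. + np.exp(-10. * p)) - 1     # gradient reversal multiplier schedule
lr = 0.01 / (1. + 10 * p) ** 0.75        # learning rate schedule

# Update the actual graph variables instead of rebinding Python floats:
K.set_value(hp_lambda, l)
K.set_value(model.optimizer.lr, lr)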
Thanks for pointing this out! Yes, these changes do not affect the learning rate and hp_lambda. I tried with Keras 2 (TensorFlow). In order to change the hp_lambda and learning rate parameters, I need to use K.set_value.
Is there any update on this PR? It seems like the issues are more about the example than the layer/backend implementations. Probably we could first have the layer and backend and then discuss how to deal with the examples.
Closing outdated PR. If you still care about the content of the PR, please submit a new PR to
Add an example implementing the 'Domain-Adversarial Training of Neural Networks' paper (https://arxiv.org/abs/1505.07818).
This allows domain adaptation in an unsupervised manner by forcing the net to learn features that are domain-invariant between the training and target domains, using the concept of gradient reversal.
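For readers skimming the thread, a minimal sketch of how the proposed layer would typically be wired into a DANN-style model (hypothetical usage with Keras 1.x functional-API names; the GradientReversal constructor argument is an assumption):

from keras.layers import Input, Dense
from keras.models import Model
# GradientReversal is the layer proposed in this PR (import path assumed).

inputs = Input(shape=(784,))
features = Dense(128, activation='relu')(inputs)

# Label predictor: trained on source samples only.
classifier_output = Dense(10, activation='softmax',
                          name='classifier_output')(features)

# Domain classifier behind the gradient reversal layer: the forward pass is
# the identity, the backward pass multiplies gradients by -hp_lambda, pushing
# the feature extractor toward domain-invariant features.
reversed_features = GradientReversal(hp_lambda=1.0)(features)
domain_output = Dense(2, activation='softmax',
                      name='domain_output')(reversed_features)

model = Model(input=inputs, output=[classifier_output, domain_output])
model.compile(optimizer='sgd',
              loss={'classifier_output': 'categorical_crossentropy',
                    'domain_output': 'categorical_crossentropy'})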
Further
Credits:
- a sketch of the implementation (in TF) and utility functions.
- for the Theano implementation (op) for gradient reversal.