
Pull in branch - Rak/resolve issue #41 per class metrics and #47 global metrics #46

Merged
merged 37 commits into master from rak/resolve_issue41_class_metrics on Jan 27, 2020

Conversation

@rak5216 (Collaborator) commented Jan 8, 2020

Down to polishing and final details. See issues #41 and #47.

@rak5216 rak5216 requested a review from Josh-Joseph January 8, 2020 04:38
@rak5216 rak5216 self-assigned this Jan 8, 2020
@rak5216 (Collaborator Author) commented Jan 8, 2020

[image]

@rak5216 (Collaborator Author) commented Jan 8, 2020

[image]

@rak5216 (Collaborator Author) commented Jan 8, 2020

@Josh-Joseph the major step before approval is to review the per-class metric/loss implementation and determine whether other metrics should also be included (e.g., F1-score, precision, recall).

@rak5216 (Collaborator Author) commented Jan 8, 2020

Batch metrics written to a new line each batch -- issue: https://stackoverflow.com/questions/41442276/keras-verbose-training-progress-bar-writing-a-new-line-on-each-batch-issue

Adding import ipykernel in train_segmentation_model.py solved the issue. There were too many metrics to fit on one console line (with the addition of per-class metrics), so new-line printing was auto-triggered by Keras, even with verbose=1 in fit_generator.
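
For reference, a minimal sketch of the workaround (assuming ipykernel is installed in the environment; the side-effect import is the entire fix):

# top of train_segmentation_model.py -- workaround from the linked Stack Overflow thread
import ipykernel  # noqa: F401  (side-effect import; keeps the Keras progress bar on one updating line)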

[image]

@rak5216 (Collaborator Author) commented Jan 8, 2020

Global metrics (incorrect if computed per batch): precision, recall, F-score (and dice loss), IoU (and Jaccard loss)
Batch-capable metrics (a running average of batch values is correct globally): accuracy

We can compute global metrics correctly per batch only if batch size = dataset size.

https://datascience.stackexchange.com/questions/45165/how-to-get-accuracy-f1-precision-and-recall-for-a-keras-model
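
A toy illustration of the distinction, with made-up counts (plain Python, not project code):

# two equally sized batches with very different numbers of predicted positives
batches = [
    {"tp": 9, "fp": 1, "correct": 90, "total": 100},   # precision 0.90, accuracy 0.90
    {"tp": 1, "fp": 19, "correct": 70, "total": 100},  # precision 0.05, accuracy 0.70
]

mean_batch_precision = sum(b["tp"] / (b["tp"] + b["fp"]) for b in batches) / len(batches)
global_precision = sum(b["tp"] for b in batches) / sum(b["tp"] + b["fp"] for b in batches)
print(mean_batch_precision, global_precision)  # 0.475 vs. 0.333... -- the running average is wrong

mean_batch_accuracy = sum(b["correct"] / b["total"] for b in batches) / len(batches)
global_accuracy = sum(b["correct"] for b in batches) / sum(b["total"] for b in batches)
print(mean_batch_accuracy, global_accuracy)  # 0.8 vs. 0.8 -- accuracy averages cleanly (equal batch sizes)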

@rak5216 (Collaborator Author) commented Jan 9, 2020

@Josh-Joseph (all classes in the top row) Note the quantitative differences between keras cat_ce and segmentation_models cat_ce; qualitatively it looks okay. Also note the differences between the three accuracy types from keras. See #41 for equation details and discussion.

[image]

@Josh-Joseph (Collaborator) commented:
Hmm, interesting! While it's a bit unsettling that the keras and sm values are different, I don't think it's an issue for what we're using it for. And just defaulting to the keras one makes sense to me.

@rak5216 (Collaborator Author) commented Jan 11, 2020

> Hmm, interesting! While it's a bit unsettling that the keras and sm values are different, I don't think it's an issue for what we're using it for. And just defaulting to the keras one makes sense to me.

@Josh-Joseph, for the same dataset in train and val, the train loss is different from the val loss. This makes sense since the validation weights are not the same as the weights used during training. But the difference in proportionality still seems to be an issue.

[image]

[image]

@rak5216 (Collaborator Author) commented Jan 11, 2020

@Josh-Joseph, the one-hot implementation in models.py is ready for a look. I'll report some results tables and plots here soon. The implementation is meant to be as simple as possible. I initially had a big issue instantiating class OneHotMetricWrapper(object) because the keras code needed to access object.__name__, which the instance doesn't have default access to (if I could change the keras code in saving.py, then object.__class__.__name__ would work). Since I can't change that code, I needed to give the class instance access to the class name, and after much tinkering I found that one solution is to inherit the KerasObject base class from segmentation_models, which is also well suited to the implementation of my wrapper class. The key line to make things work with class OneHotMetricWrapper(KerasObject): is super().__init__(name=self.name). This form is how both keras and segmentation_models metric and loss classes plug easily into the keras framework.
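
A hedged sketch of that wrapper idea (names and details are illustrative rather than the exact models.py code, and it assumes KerasObject is importable from segmentation_models.base):

from keras import backend as K
from segmentation_models.base import KerasObject


class OneHotMetricWrapper(KerasObject):
    """Illustrative wrapper: one-hot the predictions, then delegate to the wrapped metric."""

    def __init__(self, metric_fn, name=None):
        self.metric_fn = metric_fn
        # registering a name with the base class is what gives the instance the
        # __name__ that keras' saving/compile code expects to find
        super().__init__(name=name or 'one_hot_' + getattr(metric_fn, '__name__', 'metric'))

    def __call__(self, y_true, y_pred):
        n_classes = K.int_shape(y_pred)[-1]
        y_pred = K.one_hot(K.argmax(y_pred, axis=-1), n_classes)  # hard one-hot predictions
        return self.metric_fn(y_true, y_pred)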

[image]

@rak5216 (Collaborator Author) commented Jan 21, 2020

> So I've been trying to dig through the keras codebase and I still don't have a great answer to if/where/how val_loss is made/not made stateful... but, in the meantime, let me see if I can summarize some of this. If:
>
>   • loss is automatically made stateless, and
>   • val_loss (and all other metrics in both train and validate) are automatically made stateful,
>   • then that explains all the strange results we've seen where the train and val ratios don't line up.
>
> Despite not being able to confirm this in the code -- if it's true that those are automatically set that way, is the conclusion correct?

@Josh-Joseph I cannot find where val would be made stateful. Contrary to what I thought before, in the BaseLogger and ProgbarLogger callback settings in training_generator.py, stateful_metrics are defined as .metric_names[1:], which does not actually include the val_ names. So I'm not sure yet whether stateful/stateless behavior is preserved in the val_ copy of the metric.

I can say that val_loss acts stateful based on github.com/keras-team/keras/issues/13641

@rak5216 (Collaborator Author) commented Jan 21, 2020

Just updating that I ran some verification last night and I'm still not understanding a few things. One is why the callbacks treat val_loss as stateful, showing only the last batch value (contrary to what I thought; I was overlooking the fact that the val metric names are not explicitly passed as stateful_metrics). Another is why I can't reproduce the loss/val_loss discrepancy in stateful status. I passed keras.losses.CategoricalCrossentropy as a metric, which should not inherit the MeanMetricWrapper class and is declared stateful, yet its training and val values match those of keras.metrics.CategoricalCrossentropy passed as a metric, which does inherit MeanMetricWrapper. I was expecting keras.losses.CategoricalCrossentropy (which inherits Loss and LossFunctionWrapper, and only averages over one batch, not all batches) to show the last batch value, but something is causing it to be averaged over all batches. I couldn't find any safety mechanism in Keras that would convert losses.CatCE to metrics.CatCE when passed as a metric. When keras.losses.CategoricalCrossentropy is passed as the loss, epoch averaging is done in training by BaseLogger (the only stateless metric); however, the val_loss is again oddly treated as stateful.

@rak5216 (Collaborator Author) commented Jan 22, 2020

> Another thing I don't understand is why I can't reproduce the loss/val_loss discrepancy in stateful status. [...] I couldn't find any safety mechanism in Keras that would convert losses.CatCE to metrics.CatCE when passed as a metric.

@Josh-Joseph I think I found out why I can't achieve stateful loss values by defining metric=keras.losses.CatCE. It turns out that during compile, keras effectively converts metric=keras.losses.fcn to metric=keras.metrics.fcn so that it inherits MeanMetricWrapper, meaning all metric functions are by default compatible with the automatic BaseLogger; it can't be tricked into giving last-batch values for functions that originally lack MeanMetricWrapper. In general, any metric function that is not already part of the keras.metrics.Metric class hierarchy will be forcibly wrapped in MeanMetricWrapper, including custom metrics and s_m metrics (assuming they are of some non-keras metric class), which is not a good thing for global metrics. For reference, keras metrics all inherit MeanMetricWrapper(Mean(Reduce(Metric))). Thus, if we want a custom metric or s_m metric to remain stateful, we can (1) see if adding a custom BaseLogger call works to overwrite the automatic callback, or (2) write a custom wrapper that lets the non-Keras metric inherit the Metric class from keras, which should bypass MeanMetricWrapper since Metric is the parent class.

The exact sequence responsible for the conversion from an arbitrary non-keras.metrics.fcn to a version of the fcn inheriting MeanMetricWrapper is:

training.py :

  • (L33) compile ->
  • (L211) self._cache_output_metric_attributes(metrics, weighted_metrics) ->
  • (L737) self._per_output_metrics = training_utils.collect_per_output_metric_info(...) ->

training_utils.py (L945):

    # If the metric function is not stateful, we create a stateful version.
    if not isinstance(metric_fn, metrics_module.Metric):
        metric_fn = metrics_module.MeanMetricWrapper(
            metric_fn, name=metric_name)

@rak5216 (Collaborator Author) commented Jan 22, 2020

@Josh-Joseph, I think we can focus fully on the stateful aspect (required by global metrics). My most recent comment was just about getting stateful capability for its own sake, by circumventing the automatic MeanMetricWrapper conversion when the metric isn't an instance of keras.metrics.Metric. But solving that problem isn't really so helpful, because ultimately we don't want last-batch values, we want global metrics, which means addressing issue #47 in parallel. This may involve returning to keras global metrics that, to my surprise, actually allow per-class capability (like s_m), so per-class metrics and correct global metrics (issues #41 and #47) are linked. The bottom line is that we need global metrics to not inherit MeanMetricWrapper; however, Metric is an abstract base class with abstract methods, so we can't outright instantiate something that inherits from it without overriding those methods, namely update_state and result. In fact, the global metrics buried in keras/metrics.py show how this can be done: for example, class Precision(Metric) defines both of these methods. Ideally, for IoU, precision, recall, and F-score (and Jaccard and dice losses by extension), we want running totals of TP, TN, FP, and FN, and then we can piece together the metrics at the end of each epoch, val, or test run. I'm not quite sure whether this is how the global metrics currently operate, but we could write the two methods ourselves if needed.
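
A hedged sketch of option (2), assuming that subclassing keras.metrics.Metric and accumulating confusion counts across batches is enough to avoid the MeanMetricWrapper conversion (the class name GlobalF1Score and its details are illustrative, not project code):

from keras import backend as K
from keras.metrics import Metric


class GlobalF1Score(Metric):
    """Accumulate TP/FP/FN over every batch; assemble the score only in result()."""

    def __init__(self, threshold=0.5, name='global_f1_score', **kwargs):
        super().__init__(name=name, **kwargs)
        self.threshold = threshold
        self.tp = self.add_weight('tp', initializer='zeros')
        self.fp = self.add_weight('fp', initializer='zeros')
        self.fn = self.add_weight('fn', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred_pos = K.cast(K.greater(y_pred, self.threshold), K.floatx())
        y_true_pos = K.cast(y_true, K.floatx())
        # returning the update ops lets the wrapper added in Metric.__new__ collect them
        # (in tf2's tf.keras, update_state would instead return nothing)
        return [K.update_add(self.tp, K.sum(y_true_pos * y_pred_pos)),
                K.update_add(self.fp, K.sum((1. - y_true_pos) * y_pred_pos)),
                K.update_add(self.fn, K.sum(y_true_pos * (1. - y_pred_pos)))]

    def result(self):
        precision = self.tp / (self.tp + self.fp + K.epsilon())
        recall = self.tp / (self.tp + self.fn + K.epsilon())
        return 2. * precision * recall / (precision + recall + K.epsilon())

Because an instance of this class already is a keras.metrics.Metric, the isinstance check during compile (quoted a few comments up) should leave it unwrapped.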

@rak5216 rak5216 changed the title Pull in branch - Rak/resolve issue #41 class metrics Pull in branch - Rak/resolve issue #41 per class metrics and #47 global metrics Jan 22, 2020
@rak5216 (Collaborator Author) commented Jan 24, 2020

A note on global metrics: the keras-based precision, recall, F-score, and IoU are not computing class averages. The global metrics only need averaging when they are first computed per class and then averaged into a single score. In our case we compute them differently: all classes are blended globally (i.e., all simultaneously accumulated) together. Both cases still range from 0 to 1. In the posted results, you'll see differences between class-averaged IoU and F1-score and their global counterparts.

A note on thresholding for metric computation: it's been found that one hot is typically equivalent to the results obtained with threshold=0.5 (the standard default), depending on the metric of course. However, adjusting the threshold away from 0.5 can create differences from the one-hot based scores, and this is dataset dependent. Ultimately, the one-hot metrics, where relevant, are the truest predictors of performance, since inference is done this way.
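
A numerical toy example of that difference, with made-up pixel counts:

import numpy as np

# hypothetical per-class intersection and union pixel counts for a 2-class problem
intersection = np.array([90.0, 10.0])
union = np.array([100.0, 40.0])

class_averaged_iou = np.mean(intersection / union)  # (0.90 + 0.25) / 2 = 0.575
global_iou = intersection.sum() / union.sum()       # 100 / 140 ~= 0.714
print(class_averaged_iou, global_iou)  # both in [0, 1], but generally not equal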

@rak5216 (Collaborator Author) commented Jan 24, 2020

Model run VI:

  • identical datasets for train, val, and test
  • 358 batches per epoch (max)
  • finalize one hot; manually check metric instantiation memory location and inheritance chain
  • understand callbacks (do not customize BaseLogger or ProgbarLogger)
  • understand stateful (inherits keras.metrics.Metric) vs. stateless (inherits keras.metrics.MeanMetricWrapper -- the default behavior if Metric is not a parent class)
  • ignore val_loss
  • rewrite PerClassBinaryAccuracy
  • verify global metrics, knowing they're not class-averaged but globally accumulated, and test threshold effects (0.5)
  • rewrite per-class metric calls
  • verify val vs. test identity when given the same datasets
  • added metrics_utils.py to clean up structure
results = compiled_model.fit_generator(
    train_generator,
    steps_per_epoch=len(train_generator),
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=len(validation_generator),
    callbacks=[model_checkpoint_callback, tensorboard_callback,
               tensorboard_image_callback, csv_logger_callback],
)

[image: all metrics_analysis]

[image: models_segmentation-model-small_2cropClass_3class_VarMinPx_final-train3X_20200124T154724Z_plots_metrics_mosaic]

# manually check metric instantiation memory location and inheritance chain - before compile
    print(all_metrics)
    for m in all_metrics:
        if hasattr(m.__class__, '__name__'):
            print(m.name, m.__class__.__name__, m.__class__.__mro__)
    input("pre-compile - press enter")

[image: precompile-metrics]

# manually check metric instantiation memory location and inheritance chain - after compile (classes may be overridden by `MeanMetricWrapper` if not an instance of `Metric`)
    print(all_metrics)
    for m in all_metrics:
        if hasattr(m.__class__, '__name__'):
            print(m.name, m.__class__.__name__, m.__class__.__mro__)
    input("post-compile - press enter")

[image: postcompile-metrics]

cmd line for training and val:

[image: train val cmd line]

@Josh-Joseph (Collaborator) commented:
wow what a pull request this turned out to be :)

all seems good, feel free to merge/delete the branch!

@rak5216 (Collaborator Author) commented Jan 27, 2020

Issues #41 and #47 are now resolved. Some other notes will follow before closing this PR.

@rak5216 (Collaborator Author) commented Jan 27, 2020

val_loss issue: https://github.com/keras-team/keras/issues/13641

corresponding code: https://github.com/keras-team/keras/blob/master/keras/engine/training_generator.py#L420

Somewhere in the code, outs_per_batch ends up holding a stateful value for the first element (val_loss) and stateless values for the other elements ('val_' + metric names). This narrows the focus a lot, but I haven't found the exact spot yet.

@rak5216 (Collaborator Author) commented Jan 27, 2020

Also, after looking into the automatic stateful_metrics declaration, I discovered that tf.keras and keras are similar but not identical, including in how this particular code segment is organized (though tf.keras does something similar). The two are sometimes out of sync, which some GitHub commits call out.

@rak5216 (Collaborator Author) commented Jan 27, 2020

Yeah, the post-metric score-averaging approach would be similar to the one-hot approach, in which I would make a mean subclass. However, as with one hot, I would need to write a unique mean-wrapper subclass for each metric, and also a subclass of the one-hot subclass of each metric, each inheriting a given metric, since the inherited methods differ per metric. I spent more than a day working on different ways to dynamically define a subclass with changeable inheritance, via factory methods, nested classes, repeated "import as", and the type function, but there is seemingly no simple answer in Python for defining a class (not an instance) on the fly. type seemed close, but I couldn't get the control I needed, so I had to write out the one-hot subclass definition for each relevant metric. Otherwise, trying to write just one subclass definition would involve changing inheritance depending on the metric I am trying to one-hot, which changes the instance methods inherited from the parent class. It seems classes can't be redefined during the run anyway, so I guess that eliminates the worry about mixing up instance methods (@Josh-Joseph)? Thus, by writing a separate subclass for each metric that inherits from that metric, I can manipulate the prediction and label data either going into or coming out of the parent class's call method, while all other parent methods are preserved, so it integrates seamlessly into the keras machinery. This is easier to explain in person or over the phone if needed before the next meeting.

With all that said, doing all of this to achieve averaging seems unnecessary, since it can more easily be done in postprocessing given the mix of imported (fixed) and custom metrics. I think consistency is best handled here by simply noting in a code comment that the global metrics are not averaged (i.e., not divided by num_classes) and need to be averaged after program completion by dividing by num_classes.

I think the type(classname, superclasses, attributes_dict) function could be used to dynamically create subclass definitions that preserve all methods of a metric except for the data manipulation needed to output a mean (this could also be used to rewrite the one-hot approach: same results, just less code). Two class levels would be added to each metric. The metric's subclass is the mean (or one-hot) level, which I would somehow write to be agnostic with respect to inheritance (or use a factory function or dynamic creation). The subclass inherits __init__ from the parent (preservation), and its only other method is a tweak to __call__ that performs the data manipulation and then returns the parent's call. I basically copy/pasted this format for each metric I wanted to one-hot, so that only the class name, inheritance, and init attributes really changed. The sub-subclass would then be created through type and inherit the different combinations of methods needed for each metric. I could run a loop to create what's needed based only on what's defined in a global metrics list... hmm, it would definitely be less readable though, since the sub-subclasses would be created at runtime. I think the main obstacles are that I need to dynamically name the subclass and the sub-subclass (type is the only way I know to do this), but if I use type, then I'm not sure how to modify the newly created class's __call__ to be a mix of the data-manipulation treatment plus the inherited call. Possibly with a function like:

def func(self, y_true, y_pred):
    # data manipulation (e.g., the one-hot or mean treatment) would go here
    return super(self.__class__, self).__call__(y_true, y_pred)

I guess this line of thinking amounts to monkey patching:
https://www.python-course.eu/python3_classes_and_type.php
http://alexgaudio.com/2011/10/07/dynamic-inheritance-python.html
https://stackoverflow.com/questions/38541015/how-to-monkey-patch-a-call-method
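
A hedged sketch of that type()/monkey-patching idea (purely exploratory, not what this PR ships; it assumes the wrapped metric's __call__ accepts (y_true, y_pred)):

from keras import backend as K


def make_one_hot_subclass(metric_cls):
    """Dynamically build a subclass of metric_cls whose __call__ one-hots predictions first."""
    def patched_call(self, y_true, y_pred, **kwargs):
        y_pred = K.one_hot(K.argmax(y_pred, axis=-1), K.int_shape(y_pred)[-1])  # data manip
        return super(subclass, self).__call__(y_true, y_pred, **kwargs)  # inherited call

    subclass = type('OneHot' + metric_cls.__name__, (metric_cls,), {'__call__': patched_call})
    return subclass


# illustrative usage:
# OneHotCategoricalAccuracy = make_one_hot_subclass(keras.metrics.CategoricalAccuracy)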

And with all that said, none of it is actually useful (unless we want to make one hot more concise/less code, which isn't needed). The global metrics only need averaging when they are first computed per class and then averaged into a single score. In our case we compute them differently: all classes are blended globally (i.e., all simultaneously accumulated) together. Both cases still range from 0 to 1. In the posted results, you'll see differences between class-averaged IoU and F1-score and their global counterparts. I guess it's up to us to decide which is more useful. Either way, thankfully the mean subclass is unnecessary.

@rak5216 rak5216 merged commit e179190 into master Jan 27, 2020
@rak5216 rak5216 deleted the rak/resolve_issue41_class_metrics branch January 27, 2020 17:28
@rak5216 rak5216 restored the rak/resolve_issue41_class_metrics branch February 4, 2020 22:43
@rak5216 (Collaborator Author) commented Feb 5, 2020

Model run VII:

  • added 1H1H and 1HUS and more metrics to verify 1H is working and to try to understand accuracy
  • based on comparisons with tf2 in pull request #48, 1H in tf1 did not work in most cases, except when inheriting non-Keras classes and being converted to keras.MeanMetricWrapper
  • rerun the test from Model run VI
    ◦ reopened this branch based on the tf2 comparison (#48) showing that 1H was apparently not working in prior model runs in tf1

pipenv run python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-small_2cropClass_3class_VarMinPx_final-train3X --model-id segmentation-model-small_2cropClass_3class_VarMinPx_final-train3X_20200124T154724Z

[image: metrics]

Precursor discussion:

I refined my thinking on when 1H works and when it doesn't in tf1. Just speculating, but I think it has to do with the keras.Metric __new__ method (left window: https://github.com/keras-team/keras/blob/7a39b6c62d43c25472b2c2476bd2a8983ae4f682/keras/metrics.py#L67) and the timing of when the update_state method is set.

Case 1: 1H worked with custom metrics that originally inherit non-Keras classes (like s_m classes, not keras.Metric or keras.MeanMetricWrapper); an example is class ClassBinaryAccuracySM(MetricSM), where MetricSM is imported from s_m and has no connection to Keras classes. The 1H version, class OneHotClassBinaryAccuracySM(ClassBinaryAccuracySM), inherits this class and modifies __call__ with the 1H calculation followed by super().__call__; this is the only case that actually worked previously in tf1 alongside otherwise keras.metrics imports. Significantly, the custom class definition doesn't even include an update_state method (similar to all s_m metrics), because this keras compile code line (https://github.com/keras-team/keras/blob/7a39b6c62d43c25472b2c2476bd2a8983ae4f682/keras/engine/training_utils.py#L946) discovers that the custom metric does not inherit keras.Metric and then treats the whole class instance simply as a function, converting it to a keras.MeanMetricWrapper instance and thereby embedding the 1H in the call definition BEFORE update_state is constructed via keras.Metric.__new__. Note that TensorFlow 2.1 has removed the ability to simply pass plain functions as custom metrics (which Keras would wrap with MeanMetricWrapper); tf2 now requires custom metrics to be class definitions with all required methods defined (update_state, result, etc.). And we recently learned that custom metrics should not return anything from update_state.

Case 2: 1H does not work on stateless metrics like keras.Accuracy (this I just realized) when only modifying __call__ in tf1, even though keras.Accuracy also inherits MeanMetricWrapper. I guess this is because update_state is already "wrapped" (see https://github.com/keras-team/keras/blob/7a39b6c62d43c25472b2c2476bd2a8983ae4f682/keras/utils/metrics_utils.py#L30) during __new__, which happens before the 1H-revised __call__ definition takes effect. I'm not sure how the update_state_wrapper can effectively 'freeze' the update_state method. Note that update_state and result are abstract methods in the keras.Metric definition, so perhaps the wrappers act like placeholders until a subclass defines them.

Case 3: 1H does not work on global metrics like keras.TruePositives, which does not inherit keras.MeanMetricWrapper but only keras.Metric. I believe the reason is consistent with case 2: somehow update_state is being sealed off, and __call__ (which seems to be the only way to reach update_state during fit_generator or evaluate_generator) cannot pass the 1H values through to the seemingly frozen update_state.

In tf1, to pass 1H values to update_state, I suppose I could have just moved the 1H manipulation into that method and then called super().update_state, instead of doing it in call. Call just seemed like the best high-level place to put it, but update_state is really the only place that actually uses it.
In tf2, something has obviously changed with custom metrics and update_state (we don't have to return anything from this method), so somehow the update ops just keep working in the background.
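
For the record, a hedged sketch of what moving the 1H manipulation into update_state could look like in tf1 keras (the metric choice and class name are illustrative):

from keras import backend as K
from keras.metrics import Precision


class OneHotPrecision(Precision):
    def update_state(self, y_true, y_pred, sample_weight=None):
        # one-hot the soft predictions here, where fit/evaluate actually consume them,
        # then defer to the parent's (wrapped) update_state
        y_pred = K.one_hot(K.argmax(y_pred, axis=-1), K.int_shape(y_pred)[-1])
        return super().update_state(y_true, y_pred, sample_weight=sample_weight)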

[image]

@rak5216 (Collaborator Author) commented Feb 5, 2020

CONCLUSION: TF1 NEEDS ONE HOT IMPLEMENTED AT UPDATE_STATE NOT CALL
