Pull in branch - Rak/resolve issue #41 per class metrics and #47 global metrics #46
Conversation
@Josh-Joseph The major step before approval is to review the per-class metric/loss implementation and determine whether other metrics should also be included (e.g., F1-score, precision, recall).
Batch metrics being written to a new line is a known Keras progress-bar issue: https://stackoverflow.com/questions/41442276/keras-verbose-training-progress-bar-writing-a-new-line-on-each-batch-issue
Global metrics (incorrect if computed per batch): precision, recall, F-score (and Dice loss), IoU (and Jaccard loss). We can compute global metrics correctly only if batch size = dataset size (a stateful alternative is sketched below).
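To make the batch-size caveat concrete, here is a minimal sketch (not the code in this PR; the class name GlobalF1 and the 0.5 threshold are illustrative) of a stateful tf.keras metric that accumulates TP/FP/FN across every batch and only forms the F1 score in result(). Averaging per-batch F1 values instead only matches this when batch size = dataset size.

```python
import tensorflow as tf

class GlobalF1(tf.keras.metrics.Metric):
    """F1 computed from confusion counts accumulated across ALL batches."""

    def __init__(self, threshold=0.5, name="global_f1", **kwargs):
        super().__init__(name=name, **kwargs)
        self.threshold = threshold
        self.tp = self.add_weight(name="tp", initializer="zeros")
        self.fp = self.add_weight(name="fp", initializer="zeros")
        self.fn = self.add_weight(name="fn", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # accumulate counts; no per-batch F1 is ever computed
        y_pred = tf.cast(y_pred > self.threshold, tf.float32)
        y_true = tf.cast(y_true, tf.float32)
        self.tp.assign_add(tf.reduce_sum(y_true * y_pred))
        self.fp.assign_add(tf.reduce_sum((1.0 - y_true) * y_pred))
        self.fn.assign_add(tf.reduce_sum(y_true * (1.0 - y_pred)))

    def result(self):
        # F1 formed once from the global counts, so batch size no longer matters
        eps = tf.keras.backend.epsilon()
        precision = self.tp / (self.tp + self.fp + eps)
        recall = self.tp / (self.tp + self.fn + eps)
        return 2.0 * precision * recall / (precision + recall + eps)
```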
@Josh-Joseph (all classes in top row) Note the quantitative differences between keras cat_ce and segmentation_models cat_ce; it looks okay qualitatively. Also note the differences between the 3 accuracy types from keras (a side-by-side setup is sketched below). See #41 for equation details and discussion.
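For reference, a minimal sketch (toy model; metric names are illustrative, not the runs above) of logging the three keras accuracy variants side by side. Their values differ by construction: CategoricalAccuracy compares argmax indices, BinaryAccuracy thresholds each channel at 0.5, and Accuracy checks element-wise equality of y_true and y_pred as given.

```python
from tensorflow import keras

# toy stand-in for the segmentation model (illustrative only)
model = keras.Sequential([keras.layers.Dense(3, activation="softmax", input_shape=(4,))])

model.compile(
    optimizer="adam",
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=[
        keras.metrics.CategoricalAccuracy(name="cat_acc"),  # argmax match
        keras.metrics.BinaryAccuracy(name="bin_acc"),       # per-channel 0.5 threshold
        keras.metrics.Accuracy(name="raw_acc"),             # exact element-wise equality
    ],
)
```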
hmm interesting! While it's a bit unsettling that the keras and sm values are different, I don't think it's an issue for what we're using it for. And just defaulting to the keras one makes sense to me.
@Josh-Joseph, for the same dataset used in train & val, the train loss is different from the val loss. This makes sense since the weights used for validation are not the same as the (evolving) weights used across the training batches. But the difference in proportionality still seems to be an issue.
@Josh-Joseph, the one hot implementation in
@Josh-Joseph I cannot find where I can say
Just updating that I ran some verification last night and I'm still not understanding a few things. One is why the callbacks treat val_loss as stateful, showing only the last batch value (contrary to what I thought; I was overlooking the fact that the val metric names are not explicitly passed as stateful_metrics). Another is why I can't reproduce the loss/val_loss discrepancy in stateful status. I passed keras.losses.CategoricalCrossentropy as a metric, which should not inherit the MeanMetricWrapper class and is declared stateful, yet its training and val values match those of keras.metrics.CategoricalCrossentropy passed as a metric, which does inherit MeanMetricWrapper. I was expecting keras.losses.CategoricalCrossentropy (which inherits Loss and LossFunctionWrapper, and only averages over one batch, not all batches) to show the last batch value, but something is causing it to be averaged over all batches. I couldn't find any safety mechanism in Keras that would convert losses.CatCE to metrics.CatCE when passed as a metric. When keras.losses.CategoricalCrossentropy is passed as the loss, epoch averaging during training is done by BaseLogger (which only averages stateless metrics); however, again, the val_loss is oddly treated as stateful.
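For context, a rough reconstruction of the comparison described above (toy model; names are illustrative, not the verification script): the loss-class and metric-class cross-entropies are passed side by side as metrics so their reported training/val values can be compared directly.

```python
from tensorflow import keras

model = keras.Sequential([keras.layers.Dense(3, activation="softmax", input_shape=(4,))])

model.compile(
    optimizer="adam",
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=[
        # loss class passed as a metric (does NOT itself inherit MeanMetricWrapper)
        keras.losses.CategoricalCrossentropy(name="cce_loss_as_metric"),
        # metric class (inherits MeanMetricWrapper, so it is epoch-averaged)
        keras.metrics.CategoricalCrossentropy(name="cce_metric"),
    ],
)
# If compile() silently wraps the loss object like any other callable, both columns
# end up epoch-averaged and report (near) identical values, matching the observation above.
```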
@Josh-Joseph I think I found out why I can't achieve stateful loss values by defining metric=keras.losses.CatCE. It turns out that during compile, Keras indeed effectively converts metric=keras.losses.fcn to metric=keras.metrics.fcn so that it inherits MeanMetricWrapper's epoch averaging. The exact sequence responsible for the conversion from an arbitrary non-keras.metrics fcn to a version of the fcn inheriting MeanMetricWrapper happens in the compile-time metric handling (see the training_utils.py link in the later comment).
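To illustrate (this is not the Keras source, just the shape of the behavior described above; assuming MeanMetricWrapper is importable from keras.metrics, as in the metrics.py file linked in a later comment): any plain callable passed as a metric effectively ends up wrapped so that its per-batch values are accumulated into a running mean.

```python
from tensorflow import keras

def my_cce(y_true, y_pred):
    # plain callable passed as a metric (no Metric subclassing)
    return keras.losses.categorical_crossentropy(y_true, y_pred)

# roughly what compile() builds behind the scenes for such a callable:
wrapped = keras.metrics.MeanMetricWrapper(fn=my_cce, name="my_cce")
# wrapped.update_state() keeps a running mean of my_cce over every batch seen,
# which is why the loss-as-metric values come out epoch-averaged after all.
```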
@Josh-Joseph, I think we can fully focus on the stateful aspect (required by global metrics). My most recent statement was just trying to get stateful capability for the sake of it, by circumventing the MeanMetricWrapper conversion.
Note on global metrics: the keras-based precision, recall, F-score, and IoU are not computing class averages. The global metrics only need averaging when they are first computed per class and then averaged to get a single score. In our case we're computing differently: all classes are blended globally (i.e., all simultaneously accumulated) together. Both cases still have a range from 0 to 1. In the posted results, you'll see differences between class-avg'd IoU and F1-score and their global counterparts (a small sketch of the two conventions follows below).

Note on thresholding for metric computation: one hot is typically found to be equivalent to the results obtained with threshold=0.5 (the standard default), depending on the metric of course. However, adjusting the threshold away from 0.5 can create differences from the one-hot-based scores; this is dataset dependent. Ultimately, the one hot metrics, where relevant, are the truest predictors of performance, since inference is done this way.
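To make the class-avg'd vs. global distinction concrete, a small NumPy sketch with made-up labels (not data from these runs):

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 2])   # toy ground-truth class ids
y_pred = np.array([0, 0, 0, 1, 1, 2])   # toy predicted class ids
classes = np.unique(y_true)

# class-avg'd IoU: compute per class, then average
per_class = []
for c in classes:
    inter = np.sum((y_true == c) & (y_pred == c))
    union = np.sum((y_true == c) | (y_pred == c))
    per_class.append(inter / union)
macro_iou = np.mean(per_class)                     # (0.75 + 0.5 + 1.0) / 3 = 0.75

# global IoU: all classes blended into one accumulation before dividing
inter = sum(np.sum((y_true == c) & (y_pred == c)) for c in classes)
union = sum(np.sum((y_true == c) | (y_pred == c)) for c in classes)
global_iou = inter / union                         # 5 / 7 ≈ 0.714

print(macro_iou, global_iou)   # both in [0, 1], but generally not equal
```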
Model run VI:
cmd line for training and val:
wow what a pull request this turned out to be :) all seems good, feel free to merge/delete the branch!
corresponding code: Somewhere in the code,
Also, after looking into the automatic stateful_metrics declaration, I discovered that tf.keras and keras are similar but not identical, including in how this particular code segment is organized, though tf.keras does something similar. Sometimes, though, the two are out of sync, which some GitHub commits will occasionally call out.
Yeah, the post-metric score averaging approach would be similar to the one hot approach, in which I would make a mean subclass. However, similar to one hot, I would need to write a unique mean wrapper subclass for each metric, and also a subclass of the one-hot subclass of each metric, each inheriting a given metric, since the inherited methods for each metric are different. I spent more than a day working on different ways to build a dynamically defined subclass (including changing inheritance) via factory methods, nested classes, repeated import-as, and the type function, but there is seemingly no simple answer in Python for defining a class (not an instance) on the fly. With all that said, doing all of this just to achieve averaging seems unnecessary, since it can more easily be done in postprocessing, given the mix of imported (fixed) and custom metrics. I think consistency is best handled here by just noting in a code comment that global metrics are not averaged (i.e., scaled by num_classes) and need to be averaged after program completion by dividing by num_classes.

I think the type(classname, superclasses, attributes_dict) function could be used to dynamically create subclass definitions that preserve all methods of a metric except for the data manipulation needed to output the mean (this could also be used to rewrite the one hot approach: same results, just less code). Two class levels would be added to each metric. The metric subclass is the mean (or one-hot) level, which I would somehow write to be agnostic with respect to inheritance (or use a factory fcn or dynamic creation). The subclass inherits __init__ from the parent (preservation), and its only other method is a tweak to __call__ that performs the data manipulation and then returns the parent's __call__. I basically copy/pasted this format for each metric I wanted to one-hot, so that only the class name, inheritance, and init attributes really changed. Then the sub-subclass would be created through type and inherit the different combinations of methods needed for each metric. I could run a loop to create what's needed based only on what's defined in a global metrics list... hmm, it would definitely be less readable though, since sub-subclasses would be created at runtime. I think the main obstacles are that I need to dynamically name the subclass and the sub-subclass (type is the only way I know to do this), but if I use type, then I'm not sure how to modify the newly created class's __call__ to be a mix of the data-manipulation treatment plus the inherited call. Possibly with a fcn like
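For what it's worth, a hedged sketch of the type(classname, superclasses, attributes_dict) idea above (function and class names are illustrative, not code from this branch): a module-level function is bound as __call__, so the dynamically created subclass one-hots y_pred and then defers to the parent. As the later cases/conclusion note, Keras may still bypass a __call__-level hook for its own metric classes, which is why the conclusion further down moves the manipulation into update_state.

```python
import tensorflow as tf
from tensorflow import keras

def _one_hot_call(self, y_true, y_pred, **kwargs):
    # one-hot the softmax prediction, then defer to the parent metric's __call__
    y_pred_1h = tf.one_hot(tf.argmax(y_pred, axis=-1), depth=tf.shape(y_pred)[-1])
    return super(type(self), self).__call__(y_true, y_pred_1h, **kwargs)

def make_one_hot_metric(metric_cls):
    # dynamically define e.g. OneHotCategoricalAccuracy(CategoricalAccuracy)
    return type("OneHot" + metric_cls.__name__, (metric_cls,), {"__call__": _one_hot_call})

# build the 1H variants from a simple list of metric classes
one_hot_metrics = [make_one_hot_metric(cls)() for cls in
                   (keras.metrics.CategoricalAccuracy, keras.metrics.Precision)]
```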
I guess this thinking trajectory amounts to monkey patching, and with all that said, none of it is actually useful (unless we want to make one hot more concise/less code, which isn't needed). The global metrics only need averaging when they are first computed per class and then averaged to get a single score. In our case we're computing differently: all classes are blended globally (i.e., all simultaneously accumulated) together. Both cases still have a range from 0 to 1. In the posted results, you'll see differences between class-avg'd IoU and F1-score and their global counterparts. I guess it's up to us to decide which is more useful. Either way, thankfully the mean subclass is unnecessary then.
Model run VII:
Precursor discussion: I refined my thinking on when 1H (one hot) works and when it doesn't in tf1. Just speculating, but I think it has to do with the keras.Metric __new__ method (left window: https://github.com/keras-team/keras/blob/7a39b6c62d43c25472b2c2476bd2a8983ae4f682/keras/metrics.py#L67) and the timing of setting the update_state method.

Case 1: 1H worked with custom metrics that originally inherit non-Keras classes (like s_m, and not like keras.Metric or keras.MeanMetricWrapper); an example is class ClassBinaryAccuracySM(MetricSM), wherein MetricSM is imported from s_m and has no connection to Keras classes. The 1H version, class OneHotClassBinaryAccuracySM(ClassBinaryAccuracySM), inherits this class and modifies __call__ with the 1H calc followed by super().__call__; this is the only case that actually worked previously in tf1 alongside otherwise keras.metrics imports. Of significance, the custom class definition doesn't even include an update_state method (similar to all s_m metrics) because this keras compile code line (https://github.com/keras-team/keras/blob/7a39b6c62d43c25472b2c2476bd2a8983ae4f682/keras/engine/training_utils.py#L946) discovers that the custom metric does not inherit keras.Metric and then treats the whole class instance simply as a fcn and converts it to a keras.MeanMetricWrapper instance, thereby embedding the 1H in the call definition BEFORE update_state is constructed via keras.Metric.__new__. Note that tensorflow 2.1 has removed the ability to simply pass functions as custom metrics (which Keras would wrap with MeanMetricWrapper); tf2 custom metrics must now be class definitions with all required methods defined (update_state, result, etc.). And we recently learned that custom metrics should not return anything from update_state.

Case 2: 1H does not work on stateless metrics like keras.Accuracy (this I just realized) when only modifying __call__ in tf1, even though keras.Accuracy also inherits MeanMetricWrapper. I guess this is because update_state is already "wrapped" (see https://github.com/keras-team/keras/blob/7a39b6c62d43c25472b2c2476bd2a8983ae4f682/keras/utils/metrics_utils.py#L30) during __new__, which happens before the 1H-revised __call__ definition takes effect. I'm not sure how the update_state_wrapper can somehow 'freeze' the update_state method. Note that update_state and result are abstract methods in the keras.Metric definition, so perhaps the wrappers act like placeholders until a subclass defines them.

Case 3: 1H does not work on global metrics like keras.TruePositives, which does not inherit keras.MeanMetricWrapper but only keras.Metric. I believe the reason here is consistent with case 2: somehow update_state is being sealed off, and __call__ (which seems to be the only way to access update_state during fit_gen or eval_gen) cannot pass the 1H values through to the seemingly frozen update_state. In tf1, to pass 1H values to update_state, I suppose I could have just moved the 1H manipulation into that method and then called super().update_state, instead of doing it in __call__. __call__ just seemed like the best high-level place to put it, but really update_state is the only place that actually uses it.
CONCLUSION: TF1 NEEDS ONE HOT IMPLEMENTED AT UPDATE_STATE, NOT CALL
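A minimal tf.keras sketch of that conclusion (the class name is illustrative, not the code in this branch): the one-hot conversion lives inside update_state, the method Keras wraps and actually calls, and the override deliberately returns nothing, per the earlier note about custom-metric update_state.

```python
import tensorflow as tf
from tensorflow import keras

class OneHotTruePositives(keras.metrics.TruePositives):
    """TruePositives computed on one-hot (argmax) predictions instead of raw scores."""

    def update_state(self, y_true, y_pred, sample_weight=None):
        # apply the 1H manipulation here, where the values are actually consumed
        y_pred_1h = tf.one_hot(tf.argmax(y_pred, axis=-1), depth=tf.shape(y_pred)[-1])
        super().update_state(y_true, y_pred_1h, sample_weight=sample_weight)
        # no return, per the note above that custom metrics should not return from update_state
```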
Down to polishing and final details. See issues #41 and #47.