-
The problem can be decomposed into:
- Segmenting the 'before' image; and
- Classifying each pixel of identified buildings based on the 'after' image.
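A minimal sketch of how the two stages compose (names and the output encoding are illustrative, not the project's actual code):

```python
import numpy as np

def compose_predictions(building_mask: np.ndarray,
                        damage_class: np.ndarray) -> np.ndarray:
    """Combine the two stages: damage labels only where buildings were found.

    building_mask: (H, W) bool, output of the 'before' segmentation model.
    damage_class:  (H, W) int in 0..3, output of the 'after' classifier.
    Returns (H, W) int: 0 = no building, 1..4 = damage class + 1.
    """
    return np.where(building_mask, damage_class + 1, 0)
```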
-
Strategy: get MVP, break problem into sub-components (building classification etc.), see which approach works best.
-
Experiments that can (probably) be run on the reduced-resolution component: broad architecture, best losses, polygonisation experiments.
-
Does polygonisation help? Specifically, at the 'before' layer, we can polygonise to smooth the predictions and then use each polygon to predict the majority damage class. This should also help address some of the pixel drift. Polygonisation isn't differentiable, so we probably can't do this end-to-end.
- Current theory: the best approach is to polygonise the 'pre' image and then take the majority class of all pixels in each polygon – a solution to satellite drift (sketched below).
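A sketch of the majority-vote step, using connected components as a stand-in for true polygonisation (a real pipeline might vectorise the mask with e.g. rasterio.features.shapes instead):

```python
import numpy as np
from scipy import ndimage

def majority_vote_per_building(building_mask: np.ndarray,
                               damage_pred: np.ndarray) -> np.ndarray:
    """Assign each detected building its majority damage class.

    building_mask: (H, W) bool mask from the 'before' model.
    damage_pred:   (H, W) int per-pixel damage classes from the 'after' model.
    """
    out = np.zeros_like(damage_pred)
    labels, n = ndimage.label(building_mask)  # one label per building blob
    for i in range(1, n + 1):
        component = labels == i
        # Majority vote smooths noisy per-pixel predictions and absorbs
        # small registration (satellite drift) errors.
        out[component] = np.bincount(damage_pred[component]).argmax()
    return out
```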
-
What's the best setup for the building-damage loss? There are four ordinal damage classes (no damage/slightly damaged/major damage/destroyed), as well as an implicit no-building class in the "post" heatmap.
- How do we handle the 'no-building' case? Do we explicitly model it as a fifth class, or model only the damage classes and mask out the loss on non-building pixels?
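The mask-out option is a one-liner in PyTorch (a sketch, assuming per-pixel damage logits; the -1 sentinel for non-building pixels is my convention, not the project's):

```python
import torch.nn.functional as F

def masked_damage_loss(logits, target):
    """logits: (B, 4, H, W); target: (B, H, W) long, 0..3 or -1 = no building.

    ignore_index drops non-building pixels from the loss entirely, so the
    model only ever has to discriminate between the four damage classes.
    """
    return F.cross_entropy(logits, target, ignore_index=-1)
```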
-
How do we handle images with no polygons? These are penalised heavily in the loss.
- Stuff to try: class-context concatenation, or explicit edge categorisation in a separate CNN module (https://paperswithcode.com/paper/gated-scnn-gated-shape-cnns-for-semantic)
- Is a single model for the before/after images better, or two separate specialised models?
- Pros: we can concatenate/combine the before/after features (maybe adding deformable convolutions or attention mechanisms to account for pixel drift). Intuitively, seeing the 'before' picture helps you evaluate the extent of the damage more than the 'after' photo alone does.
- Cons: the combined model is harder to tune.
- What's the best way of combining? (One fusion option is sketched below.)
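A minimal sketch of the single-model option: a shared encoder over both images, with the two streams concatenated before the damage head. The module and all names are illustrative; the encoder is assumed to return a single (B, feat_ch, H', W') feature map.

```python
import torch
import torch.nn as nn

class TwoStreamSegModel(nn.Module):
    """Shared encoder for pre/post images; fused features feed the damage head."""

    def __init__(self, encoder: nn.Module, feat_ch: int, n_damage: int = 4):
        super().__init__()
        self.encoder = encoder                    # shared weights for both streams
        self.loc_head = nn.Conv2d(feat_ch, 1, 1)  # building mask from 'pre' features
        self.fuse = nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1)
        self.dmg_head = nn.Conv2d(feat_ch, n_damage, 1)

    def forward(self, pre: torch.Tensor, post: torch.Tensor):
        f_pre, f_post = self.encoder(pre), self.encoder(post)
        # Concatenation lets the damage head compare before/after appearance;
        # a deformable conv or attention block could replace the plain 3x3
        # fuse to better absorb pixel drift between the two images.
        fused = torch.relu(self.fuse(torch.cat([f_pre, f_post], dim=1)))
        return self.loc_head(f_pre), self.dmg_head(fused)
```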
- UNet vs. LinkNet – LinkNet looks like it performs well.
- FPN uses 83% memory.
- PSPNet uses 78% memory.
Conclusion: LinkNet has the best memory/performance profile, with UNet close behind (faster, but uses more memory).
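All four architectures are available off the shelf if we assume the segmentation_models.pytorch library (which uses exactly these names), e.g.:

```python
import segmentation_models_pytorch as smp

# LinkNet's decoder adds encoder features instead of concatenating them
# (as UNet does), which is where its memory saving comes from.
model = smp.Linknet(encoder_name="resnet34", classes=1, activation=None)
# For comparison: smp.Unet, smp.FPN, smp.PSPNet take the same arguments.
```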
- Is it better to use models pretrained for building segmentation, or to roll my own using a (potentially) nicer/more specialised architecture?
- Pretrained models:
- Most of the data the models were pretrained on is also publicly available – so the only other advantage is that the architectures demonstrably work.
- Not a lot of difference between initialising from scratch and using pretrained weights.
- The xdxd pretrained model has stability issues; selimsef_spacenet4_densenet121unet trains okay (after removing the first encoder layer and head due to an n_channels mismatch).
- The biggest densenet works best – selimsef_spacenet4_densenet121unet and selimsef_spacenet4_resnet... don't seem to work as well.
- EfficientNet outperforms the pretrained models.
Conclusion: train from scratch.
- What combo of Dice/Focal/BCE/Jaccard is best?
- Experiments with 4x Dice, 1x Focal inconclusive (https://app.wandb.ai/xvr-hlt/sky-eye/runs/mxknx2wr?workspace=user-xvr-hlt) vs (https://app.wandb.ai/xvr-hlt/sky-eye/runs/vppciq3g?workspace=user-xvr-hlt).
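A sketch of the weighted combination being tested, in plain PyTorch (assuming a binary building mask, with logits and target of shape (B, 1, H, W) and target as float):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1.0):
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

def focal_loss(logits, target, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = torch.exp(-bce)  # probability the model assigned to the true class
    return ((1 - p_t) ** gamma * bce).mean()

def combined_loss(logits, target, w_dice=4.0, w_focal=1.0):
    # The 4x Dice / 1x Focal weighting from the runs linked above.
    return w_dice * dice_loss(logits, target) + w_focal * focal_loss(logits, target)
```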
- What level of half precision should we use?
- Half-precision training using AMP in the default mode works best. Comparison: full precision forces a lower batch size; half precision (default) uses 82% memory at batch size 8; half precision (alternative) uses 86% memory at batch size 8, with more variance.
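A sketch of the training-loop changes, assuming NVIDIA apex AMP, and assuming 'default' means opt_level O1 and 'alternative' means O2 (torch.cuda.amp is the modern equivalent):

```python
import torch
import torch.nn as nn
from apex import amp

model = nn.Conv2d(3, 1, 3, padding=1).cuda()  # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# O1 patches individual ops to fp16 where safe; O2 casts the whole model
# and keeps fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 3, 256, 256, device="cuda")
loss = model(x).mean()  # stand-in for the real forward pass + loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # loss scaling avoids fp16 gradient underflow
optimizer.step()
```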
- Evaluation is 30% building localisation + 70% classification.
- However, in order to score a pixel correctly for classification, we first need to have localised it correctly.
- This potentially implies we should be more recall-oriented for localisation.
- F-scores are harmonic-meaned across classes.
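The implied scoring, as a sketch (assuming exactly the weighting described above; exact competition details not confirmed here):

```python
def harmonic_mean(scores):
    # The harmonic mean punishes a single weak class far more than an
    # arithmetic mean would, so class balance matters a lot.
    return len(scores) / sum(1.0 / s for s in scores)

def overall_score(f1_localisation, per_class_damage_f1s):
    # 30% building localisation + 70% damage classification.
    return 0.3 * f1_localisation + 0.7 * harmonic_mean(per_class_damage_f1s)

print(overall_score(0.80, [0.70, 0.60, 0.50, 0.65]))  # -> ~0.66
```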