This project investigates whether and how systematic subsampling can be applied to imbalanced learning. All details can be found in this Jupyter notebook, which is a good fit if you want a condensed, interactive version and like working with notebooks. For a better reading experience I recommend the more detailed HTML version. A quick overview is provided below.
The case for subsampling arises when n ≫ p, i.e. very large values of n. In such cases we may prefer to estimate the model coefficients β̂m from a subsample of size m, rather than the full-data estimator β̂n, where p ≤ m ≪ n and m is freely chosen by us. In practice this avoids the high computational cost associated with large n, as discussed above. The basic algorithm for estimating β̂m is simple (a minimal code sketch follows the list):
- Subsample with replacement from the data with sampling probabilities {πi}.
- Compute the least-squares estimator β̂m on the subsample.
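To make the two steps concrete, here is a minimal sketch in NumPy. The function name `subsample_ols` and its defaults (uniform probabilities, unweighted OLS on the subsample) are illustrative assumptions, not code taken from the notebook or the paper.

```python
import numpy as np

def subsample_ols(X, y, m, probs=None, rng=None):
    """Draw m rows with replacement using probabilities `probs`
    (uniform if None), then fit OLS on the subsample.
    Illustrative sketch only; names and defaults are assumptions."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    if probs is None:
        probs = np.full(n, 1.0 / n)  # uniform subsampling probabilities
    # Step 1: subsample with replacement according to {pi_i}.
    idx = rng.choice(n, size=m, replace=True, p=probs)
    # Step 2: least-squares fit on the subsample; lstsq is robust
    # to rank deficiency in the sampled design matrix.
    beta_m, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta_m

# Example usage: beta_m = subsample_ols(X, y, m=1_000)
```

Swapping in a different `probs` vector changes only step one, which is exactly where the methods discussed next differ.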
Here we examine several of the subsampling methods investigated and proposed in Zhu et al. (2015), which differ primarily in their choice of subsampling probabilities {πi}. The baseline results from that paper are replicated here and are consistent with the authors' findings: systematic subsampling can greatly improve model performance.
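As a hedged illustration of how the choices of {πi} can differ, the sketch below contrasts uniform probabilities with leverage-based probabilities (proportional to the diagonal of the hat matrix), a common choice in this subsampling literature; whether these match the exact variants studied in Zhu et al. (2015) should be checked against the notebook.

```python
import numpy as np

def uniform_probs(X):
    """Uniform probabilities: every row equally likely to be sampled."""
    return np.full(X.shape[0], 1.0 / X.shape[0])

def leverage_probs(X):
    """Probabilities proportional to statistical leverage scores,
    i.e. the diagonal of the hat matrix H = X (X'X)^{-1} X',
    computed stably via a thin QR decomposition."""
    Q, _ = np.linalg.qr(X)              # thin QR: X = QR
    lev = np.einsum('ij,ij->i', Q, Q)   # h_ii = squared norm of row i of Q
    return lev / lev.sum()              # normalize to sum to 1
```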