Subnetwork ensembling and data augmentation: Effects on calibration

Deep Learning models based on convolutional neural networks are known to be uncalibrated, that is, they are either overconfident or underconfident in their predictions. Safety-critical applications of neural networks, however, require models to be well-calibrated, and there are various methods in the literature to increase model performance and calibration. Subnetwork ensembling exploits the over-parametrization of modern neural networks by fitting several subnetworks into a single network, gaining the benefits of ensembling without additional computational cost. Data augmentation methods have also been shown to enhance model performance in terms of accuracy and calibration. However, ensembling and data augmentation seem orthogonal to each other, and the total effect of combining these two methods is not well known; in fact, the findings in the literature are inconsistent. Through an extensive set of empirical experiments, we show that combining subnetwork ensemble methods with data augmentation methods does not degrade model calibration.


| INTRODUCTION
Deep learning models are starting to be used widely in safety-critical tasks such as autonomous driving (Bojarski et al., 2016) and medical applications. However, to be safely deployed in the real world, these models should output "reliable" predictions, meaning that the distribution of predictions needs to match the empirical distribution of the data. Models that are neither overconfident nor underconfident are called well-calibrated, and calibration is an important property for safely deploying deep learning models (Guo et al., 2017). Besides, real-world data often have a different distribution than the data the models were trained on. This requires models to be both calibrated and resistant to distributional shifts. Both ensembling and data augmentation techniques have been shown to improve calibration, robustness, and model performance (Havasi et al., 2021; Lakshminarayanan et al., 2017; Shorten & Khoshgoftaar, 2019). However, we still do not fully understand the effects (positive or negative) of combining ensembles with data augmentation methods.
Even a simple averaging of the predictions can help reduce individual model misclassifications and other errors (Fort et al., 2019; Lakshminarayanan et al., 2017). There are different methods for ensembling models which have been shown to be effective in improving accuracy and robustness while not changing the total number of parameters significantly. Among others, subnetwork ensemble frameworks (subnetwork ensemble), BatchEnsemble (Wen et al., 2020) and its variants, and MC dropout (Gal & Ghahramani, 2016) are examples of efficient ensembling methods (Wen et al., 2021). The idea behind training subnetworks comes from sparsity and the fact that contemporary deep learning models have millions of parameters, that is, they are over-parameterized. The over-parametrization of deep learning models led to the lottery ticket hypothesis (Frankle & Carbin, 2019) and the introduction of model pruning methods (Li et al., 2017). Instead of pruning a model to obtain a subnetwork, subnetwork ensemble models take advantage of the available neurons and over-parametrization with minimal structural changes, turning a single network into an ensemble of subnetworks. This approach enables the generation of ensembles while increasing the total number of parameters by less than 1%.
However, training such a model and ensuring independent subnetworks while sharing the main network's parameters with no explicit structural difference is a challenge.
Data augmentation methods encompass a diverse set of techniques, from basic geometric transformations of images to the use of GANs (Shorten & Khoshgoftaar, 2019). Earlier augmentation methods include cropping, flipping, rotating, and so forth. Recent augmentation methods such as CutMix (Yun et al., 2019), MixUp (H. Zhang et al., 2018), and AugMix (Hendrycks et al., 2020) manipulate both the pixels of images and their labels. These augmentations are called Mixed Sample Data Augmentations, and they try to emulate the distribution mismatch between the training and test data by increasing diversity among training images. Increasing the quantity of image datasets with synthetically created images helps to reduce the errors of neural networks stemming from overconfidence. Consequently, models using data augmentation are less prone to overfitting and have better generalization capability (Hendrycks et al., 2020). Almost all state-of-the-art vision models use one or more data augmentation approaches.
In theory, data augmentation is orthogonal to ensembling (Havasi et al., 2021; Wen et al., 2021). Both ensembling and data augmentation increase accuracy, generalizability, and calibration. However, one cannot directly combine ensembling and data augmentation without further analysis. The findings analysing the interaction between ensembling and data augmentation are mixed in the literature. Wen et al. (2021) show that combining three ensembling methods (BatchEnsemble, MC dropout, and Deep Ensembles) with two data augmentation methods (MixUp and AugMix), without structural changes to these methods, can harm model calibration. Rahaman and Thiery (2021) show a similar effect when Deep Ensembles are used with MixUp augmentation. However, Rame et al. (2021) state that their findings do not confirm the pathology between ensembling and data augmentation; rather, combining the two methods improves calibration.
In this paper, we try to clarify this conflict in the literature by combining the MIMO (Havasi et al., 2021), MixMo (Rame et al., 2021), and Masksembles (Durasov et al., 2021) frameworks (subnetwork ensembles) with data augmentation, and illustrate that this combination does not harm model calibration while increasing accuracy. Moreover, combining ensembles with data augmentation also helps to achieve better uncertainty estimates. We confirmed this behaviour across three different subnetwork ensemble frameworks and two data augmentation methods on three datasets. We also test all models on corrupted Cifar-10 and Cifar-100 datasets (Krizhevsky, 2009) and find consistent results in the presence of corrupted data. This paper is structured as follows: Section 2 provides background on the topic and discusses recent approaches in subnetwork ensembles and related augmentation methodologies. Section 3 discusses our experimental approach and gives a brief commentary on key decisions in the experimental workflow. Section 4 presents the main findings and discusses their implications. Finally, Section 5 concludes the paper and discusses future work as well as how to interpret the results of this paper in future studies.
2 | RELATED WORK

2.1 | Ensemble learning

A variety of efficient ensembling strategies have been proposed in the literature. The MIMO, MixMo, and Masksembles frameworks are based on the same idea: training subnetworks that independently learn the task while utilizing a single model's capacity. The most distinctive feature of subnetwork ensembling is that these models encapsulate diverse subnetworks and train them all simultaneously. This structure allows them to flexibly exploit the base model's capacity stemming from over-parametrization. However, the exact procedure to train models under subnetwork ensemble frameworks and to combine the inputs into a shared representation are still active areas of research.

| Dropout
Dropout was first proposed as a neural network regularization method without an in-depth theoretical grounding (Srivastava et al., 2014). This technique "drops out" neurons in a neural network randomly with a pre-specified probability, hence the name. While computationally cheap and intuitively simple, it helps to stabilize training, reduce overfitting, and improve generalization performance by removing weights randomly. The original paper proposes that it be used only during training. However, Gal and Ghahramani (2016) associated this method with Bayesian methods and suggested a method they called MC dropout, which helps to produce better uncertainty estimates. MC dropout is based on the idea of using dropout at test time and can be viewed as an approximate Bayesian technique. Later, this technique was improved by other studies and several different variations were introduced (Srivastava et al., 2014; Gal et al., 2017; Shen et al., 2021).
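To make the test-time procedure concrete, a minimal PyTorch sketch of MC dropout is shown below. The helper name `mc_dropout_predict` and the choice of 20 stochastic passes are illustrative assumptions, not taken from the original papers.

```python
# A minimal sketch of MC dropout at test time: keep the dropout layers active and
# average several stochastic forward passes. `model` is any network with nn.Dropout layers.
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples: int = 20):
    model.eval()
    for mod in model.modules():               # re-enable only the dropout layers
        if isinstance(mod, torch.nn.Dropout):
            mod.train()
    probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    return probs.mean(0), probs.var(0)        # predictive mean and a simple spread estimate
```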

| Data augmentation
Data augmentation (DA) increases the training data by introducing small perturbations or transformations (Figure 1), allowing models to be trained on more data. DA helps capture invariant feature transformations and is also used to simulate out-of-distribution data. Therefore, models utilizing DA tend to have better calibration and accuracy, resulting in a large uptake of DA in the literature. DA methods also drive the state-of-the-art models for vision tasks (He et al., 2016a). In addition to simple data augmentation methods like random left-right flipping, cropping, and resizing, recent data augmentation methods introduce more complex pixel-wise operations and label manipulations. Cutout, for instance, masks out random square regions of the input image.

| Summary
The effects of ensembling and data augmentation on image classification tasks are well studied in the literature. However, we observe limited knowledge and guidance on the total effect when these two seemingly orthogonal methods are combined. As one of the more recent ensembling strategies, subnetwork ensembling fits diverse subnetworks inside a single base network. Recent data augmentation methods also use more complicated processes to generate a diverse set of new images. Accuracy is often the foremost metric targeted by studies.
Nevertheless, calibration is also an important metric for model deployment. Hence, in this paper, we seek to provide some clarity on the effects of combining subnetwork ensembles with data augmentation methods and whether this improves model accuracy without harming model calibration.
F I G U R E 1 Common data augmentation methods (Rame et al., 2021).

| METHODOLOGY
In this paper, we focus on supervised multiclass classification tasks using convolutional neural networks. Our models are based on the ResNet architecture (He et al., 2016a), which uses shortcut connections. We designed our experiments to test model performance and calibration in a variety of scenarios. A key part of this paper is that we are not trying to compare models against each other, but rather to explore the effects of augmentation on calibration. As such, we do not concern ourselves with whether a specific model is better calibrated than another model or whether there is a distinct advantage (e.g., better accuracy) of using one vs. the other. Instead, we seek to provide guidance on when (and where) augmentation does (or does not) improve model calibration.

| Experimental design
This paper seeks to understand the impact of combining subnetwork ensembles with data augmentation on calibration. Ensembling and data augmentation are thought to be independent of each other (Havasi et al., 2021; Wen et al., 2021), while both methods are used to enhance model performance. We try to verify Wen et al. (2021)'s hypothesis on the ensembling and data augmentation pathology. To do this, we perform a structured 3 × 3 × 2 factorial experimental design consisting of three factors: subnetwork ensemble frameworks (3), data augmentation methods (2), and datasets (3). Although they are not our main focus, we also report results for Deep Ensembles as a point of comparison. As subnetwork ensemble frameworks, we utilized Multi-input Multi-output (MIMO), MixMo, and Masksembles.

| MIMO
In MIMO (Figure 2), the network takes M inputs and produces M outputs (predictions), where M is the number of desired subnetworks in a single model. MIMO requires only two changes: the input layer takes M images, which are simply stacked, and the output layer has M prediction vectors instead of a single one. In this sense, MIMO uses channel-wise concatenation of the inputs in pixel space. These inputs are independently sampled from the training set and require no preprocessing. The base network is trained to predict the labels of all M images simultaneously, and each subnetwork learns to disregard features from the other images. This ensures the independence of the subnetworks. The loss is calculated against the corresponding labels. During testing, the same input is repeated M times, and the outputs are averaged to obtain the final prediction. Clearly, MIMO does not require large structural changes to the neural network. In terms of network structure, it is enough to change the first convolutional and last dense layers.
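To illustrate the two structural changes, below is a minimal PyTorch-style sketch of a MIMO-like wrapper. The class and argument names (`MIMOWrapper`, `backbone_features`, `feat_dim`) are hypothetical, and the backbone body is assumed to map the stem output to a feature map that flattens to `feat_dim`; this is a sketch of the idea, not the authors' implementation.

```python
# Minimal sketch of a MIMO-style wrapper around a shared backbone (PyTorch).
import torch
import torch.nn as nn

class MIMOWrapper(nn.Module):
    """Turns a single backbone into an M-input / M-output network."""
    def __init__(self, backbone_features: nn.Module, feat_dim: int,
                 num_classes: int, m: int = 3, in_channels: int = 3):
        super().__init__()
        self.m = m
        # Only two structural changes: the first conv accepts M stacked images ...
        self.stem = nn.Conv2d(in_channels * m, 64, kernel_size=3, padding=1)
        self.features = backbone_features            # unchanged shared body
        # ... and the head produces M prediction vectors.
        self.head = nn.Linear(feat_dim, num_classes * m)

    def forward(self, xs):                            # xs: (B, M, C, H, W)
        b = xs.size(0)
        x = xs.flatten(1, 2)                          # channel-wise concatenation
        z = self.features(self.stem(x)).flatten(1)    # shared representation
        return self.head(z).view(b, self.m, -1)       # (B, M, num_classes)

# Training: each of the M slots receives an independently sampled image and its own loss.
# Testing: repeat the same image M times and average the M softmax outputs.
def predict(model, x):
    xs = x.unsqueeze(1).repeat(1, model.m, 1, 1, 1)
    return model(xs).softmax(-1).mean(dim=1)
```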

| MixMo
MixMo (Figure 3) has a similar setting to MIMO but, instead of channel-wise concatenation of images in pixel space, it first encodes each image and then employs a mixing block to combine the inputs (Rame et al., 2021). Inspired by mixing data augmentation methods, MixMo uses a generalized multi-input mixing block to combine the inputs. Using identity encoding layers and choosing channel-wise concatenation turns the MixMo framework into MIMO. However, the mixing block is not limited to any specific augmentation method; changing the mixing block results in a different framework, and choosing different augmentations in the mixing block results in different MixMo variants. The two variants presented in the original paper are Cut-MixMo and Linear-MixMo, in which the CutMix and MixUp (see Figure 1) augmentation methods, respectively, are used to mix the input images. Following Rame et al. (2021), we chose Cut-MixMo, as it is the more performant variant, and refer to it as MixMo from now on.
F I G U R E 2 MIMO framework with M = 2. The network receives two input images, stacks them, and outputs a prediction for each image. All subnetworks share the same base network. At test time, the same input is repeated M times and predictions are averaged to obtain the final prediction.
Havasi et al. (2021) introduce input repetition and batch repetition during MIMO training. Input repetition helps subnetworks share the same features but degrades diversity among subnetworks. Following the MixMo paper (Rame et al., 2021), we do not utilize input repetition. Batch repetition has a regularizing effect on network training. MIMO finds that a batch repetition value of b = 4 is optimal, and MixMo also uses b = 4; hence, following both papers, we also used batch repetition b = 4. One of the core components of subnetwork ensemble frameworks is the number of subnetworks. Since the base network's capacity is limited, the performance of the network decreases once the number of subnetworks grows beyond an optimal value. Both MIMO and MixMo find that the optimal number of subnetworks is between 2 and 4 for the base models and datasets we utilized. Moreover, increasing the number of subnetworks also increases the training time. We set the number of subnetworks to M = 3 for all models.
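As a rough illustration of the Cut-MixMo mixing block, the sketch below combines two encoded feature maps with a CutMix-style binary patch mask. The patch-sampling and re-weighting details follow Rame et al. (2021) only loosely, and the function name is hypothetical; replacing the patch mask with a full linear interpolation of the two feature maps would correspond to the Linear-MixMo variant.

```python
# Illustrative sketch of a Cut-MixMo style mixing block: two encoded feature maps are
# combined with a binary patch mask (CutMix-like) before being passed to the shared core.
import torch

def cut_mixmo_block(feat_a, feat_b, lam):
    """feat_a, feat_b: (B, C, H, W) encoded inputs; lam in (0, 1) is the mixing ratio."""
    _, _, h, w = feat_a.shape
    cut_h, cut_w = int(h * lam ** 0.5), int(w * lam ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mask = torch.zeros(1, 1, h, w, device=feat_a.device)
    mask[..., y1:y2, x1:x2] = 1.0
    # One subnetwork "owns" the patch, the other the complement; scale by 2 to keep magnitude.
    return 2.0 * (mask * feat_a + (1.0 - mask) * feat_b)
```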

| Masksembles
F I G U R E 3 MixMo framework with M = 2. The network receives two input images, encodes them with convolutional layers, mixes them according to the mixing operation (CutMix or MixUp), and outputs a prediction for each image. All subnetworks share the same base network. At test time, the same input is repeated M times and predictions are averaged to obtain the final prediction.
Masksembles (Figure 4) uses parameter masks to introduce a structured way of dropping model parameters. The idea behind Masksembles is similar to MC dropout: it replaces the stochastic sampling of MC dropout with a fixed number of pre-determined random masks. The diversity of the resulting subnetworks is governed by the amount of overlap between these masks. As with dropout layers (Gal & Ghahramani, 2016), Masksembles layers are placed before all the convolutional and fully connected layers.
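A minimal sketch of a Masksembles-style layer is given below. The actual mask-generation procedure of Durasov et al. (2021) controls the overlap between masks via a scale parameter; here the masks are simply drawn once at construction time, which is an assumption made purely for illustration.

```python
# Minimal sketch of a Masksembles-style layer: a fixed set of pre-generated binary masks
# replaces the per-step stochastic sampling of MC dropout.
import torch
import torch.nn as nn

class MasksemblesLayer(nn.Module):
    def __init__(self, channels: int, n_masks: int = 4, keep_prob: float = 0.8):
        super().__init__()
        masks = (torch.rand(n_masks, channels) < keep_prob).float()
        self.register_buffer("masks", masks)   # fixed; never re-sampled during training
        self.n_masks = n_masks

    def forward(self, x, member: int):
        # x: (B, C, H, W); `member` selects which of the fixed masks to apply.
        return x * self.masks[member].view(1, -1, 1, 1)
```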

| Optimization
For all models, we used Stochastic Gradient Descent (SGD) with the same hyper-parameters as the corresponding original papers. For Deep Ensembles, we used four models to match the number of subnetworks for MIMO and MixMo. For each model specification, we trained three randomly initialized models and report the average of the metrics. We follow the original papers for the learning rate, optimization algorithm, and batch size.

| Augmentations
To combine with the subnetwork ensemble frameworks, we chose two common data augmentation methods: MixUp and CutMix. We go beyond simple data augmentations like flipping, rotation, and pixel padding and use recent data augmentation approaches. Indeed, our data augmentation methods fall under Mixed Sample Data Augmentation, the notion of manipulating both images and targets and creating virtual samples $(x_{\mathrm{new}}, y_{\mathrm{new}})$ given two pairs of input images $(x_i, y_i)$ and $(x_j, y_j)$ (see Section 2). Following the original MIMO, MixMo, and Masksembles papers, data augmentations are performed during training with a probability of 0.5 that a new training sample is generated.

| MixUp
MixUp is a simple data augmentation method that linearly interpolates pixels while manipulating the labels at the same time. The idea behind MixUp is that linear interpolations of feature vectors should lead to linear interpolations of the target labels (H. Zhang et al., 2018). By doing so, MixUp extends the training distribution. Given two random samples $(x_i, y_i)$ and $(x_j, y_j)$ from the training data, applying MixUp yields $(\tilde{x}, \tilde{y})$ by:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$

where $\lambda$ is sampled from the uniform distribution on $[0, 1]$.
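A minimal sketch of MixUp on a batch of images is shown below, with $\lambda$ drawn uniformly on $[0, 1]$ as in the text (the original MixUp paper samples $\lambda$ from a Beta distribution); the names and shapes are illustrative.

```python
# A minimal MixUp sketch: linearly interpolate images and their one-hot labels.
import torch

def mixup(x, y_onehot):
    """x: (B, C, H, W) images; y_onehot: (B, K) one-hot labels."""
    lam = torch.rand(1).item()
    perm = torch.randperm(x.size(0))            # pair each sample with a random partner
    x_new = lam * x + (1.0 - lam) * x[perm]
    y_new = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_new, y_new
```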

| CutMix
CutMix creates new images by cutting patches from one training image and pasting them onto another. CutMix also mixes the true labels in proportion to the area of the patches. A new training sample $(\tilde{x}, \tilde{y})$ is thus generated by combining two training samples $(x_a, y_a)$ and $(x_b, y_b)$ as:

$$\tilde{x} = \mathbf{M} \odot x_a + (\mathbf{1} - \mathbf{M}) \odot x_b, \qquad \tilde{y} = \lambda y_a + (1 - \lambda) y_b,$$

where $\mathbf{M}$ denotes a binary mask indicating where to drop out and fill in from the two images, $\mathbf{1}$ is a mask filled with ones, and $\odot$ is element-wise multiplication. Like MixUp, $\lambda$ is sampled from the uniform distribution on $(0, 1)$.
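The corresponding CutMix sketch below pastes a random rectangular patch from a partner image and mixes the labels in proportion to the kept area; the exact patch-sampling scheme is a simplified assumption for illustration.

```python
# A minimal CutMix sketch: paste a random rectangular patch from a partner image and
# mix the labels in proportion to the patch area.
import torch

def cutmix(x, y_onehot):
    b, _, h, w = x.shape
    lam = torch.rand(1).item()
    perm = torch.randperm(b)
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x_new = x.clone()
    x_new[:, :, y1:y2, x1:x2] = x[perm][:, :, y1:y2, x1:x2]
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)   # actual kept-area ratio
    y_new = lam_adj * y_onehot + (1.0 - lam_adj) * y_onehot[perm]
    return x_new, y_new
```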

| Datasets
We trained all models on the Cifar-10, Cifar-100 (Krizhevsky, 2009), and Tiny ImageNet (Chrabaszcz et al., 2017) datasets. The Cifar-10 and Cifar-100 datasets both have 60 k images (50 k training and 10 k test images) and 10 and 100 classes, respectively. To further stress the models, we use Tiny ImageNet. Tiny ImageNet (Chrabaszcz et al., 2017) is a downsampled variant of ImageNet designed as an alternative to the Cifar datasets, with 64 × 64 pixel images, 100 k total images, and 200 classes (500 training, 50 validation, and 50 test images per class).
Neural networks encounter a dramatic decrease in their performance when they are tested against out-of-distribution data. After training all models with the matching framework, in addition to the IID test sets, we tested all models on the corrupted Cifar-10 and Cifar-100 test sets (Hendrycks & Dietterich, 2019). Images in these datasets are perturbed with 19 different common corruption types (e.g., added blur, compression artefacts, frost effects) at five different severity levels. Thus, the Cifar-10 or Cifar-100 test set has 19 × 5 = 95 different unseen variations emulating out-of-distribution data. However, a model resilient to a specific type of image corruption at the highest severity level (severity level 5) will likely also be resilient at the lower severity levels of the same corruption, that is, severity levels 1-4. Likewise, a model that performs poorly against a specific type of image corruption at a low severity level will also perform poorly at higher severity levels of that corruption.
Therefore, we set the corruption level at 3 for all corruption types across all images to isolate the effect and prevent any over/under-statement of it. A model which improves performance on this should indicate general robustness gain and better calibration (Hendrycks & Dietterich, 2019).

| Performance metrics: Calibration and uncertainty estimates
Calibration measures how well a model's predicted probabilities match the empirical frequency of the true outcomes (Degroot & Fienberg, 1983). We focus on supervised multi-class classification problems. We say that a model is well calibrated when a prediction of a class with confidence p is correct p% of the time. More formally, a model f is calibrated if

$$P\big(Y = y \mid f(X) = p\big) = p_y \quad \text{for all } y \in \{1, \dots, k\} \text{ and } p \in \Delta^{k-1},$$

where f is a function mapping every input X to a categorical distribution over the k labels, that is, f(X) is a vector in the (k − 1)-dimensional simplex $\Delta^{k-1}$. A model can have high accuracy yet be miscalibrated; that is, calibration and accuracy are two distinct phenomena. Measuring predictive uncertainty estimates and how well a model is calibrated is a challenging task, since the ground truth is not known. Therefore, we utilize two different metrics to measure calibration: Expected Calibration Error (ECE) and Negative Log-Likelihood (NLL). We also use the corrupted Cifar test sets as out-of-distribution examples to evaluate model calibration from a domain-shift perspective.
Expected Calibration Error (ECE) (Naeini et al., 2015) measures the absolute difference between accuracy and predictive confidence and is widely used in the literature. The predictions are binned into M equally spaced confidence intervals and a weighted average of each bin's gap is taken:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \,\big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|,$$

where $\mathrm{acc}(B_m)$ is the average accuracy of the predictions in bin $B_m$ and $\mathrm{conf}(B_m)$ is the average confidence within $B_m$.
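A minimal NumPy sketch of this binned estimate, assuming equally spaced confidence bins and illustrative function names, might look as follows.

```python
# A minimal ECE sketch with M equally spaced confidence bins.
import numpy as np

def expected_calibration_error(probs, labels, n_bins: int = 15):
    """probs: (N, K) predicted probabilities; labels: (N,) true class indices."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(labels)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            acc_bin = correct[in_bin].mean()     # accuracy within the bin
            conf_bin = conf[in_bin].mean()       # average confidence within the bin
            ece += (in_bin.sum() / n) * abs(acc_bin - conf_bin)
    return ece
```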
Negative log-likelihood (NLL) is a proper scoring rule (Lakshminarayanan et al., 2017). Scoring rules measure the quality of predictive uncertainty and reward better-calibrated predictions (Gneiting & Raftery, 2007), so maximizing the likelihood (minimizing NLL) improves calibration. Given a probabilistic model $\pi$ and n samples, NLL is defined as:

$$\mathrm{NLL} = -\sum_{i=1}^{n} \log \pi(y_i \mid x_i).$$
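A corresponding short NumPy sketch (averaged over samples, with a small epsilon to guard against log(0); names are illustrative) is:

```python
# A minimal NLL sketch for a categorical model: the mean negative log-probability
# assigned to the true class.
import numpy as np

def negative_log_likelihood(probs, labels, eps: float = 1e-12):
    """probs: (N, K) predicted probabilities; labels: (N,) true class indices."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))
```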

| EVALUATION
After setting the experimental design and training all models, we tested all models on the respective test sets and grouped the results for the metrics we track according to dataset. Moreover, we tested all models on corrupted Cifar-10 and Cifar-100. We report the accuracy metrics as well as the ECE and NLL metrics discussed in Section 3, averaged over three independent runs. As our main concern is what happens to the calibration performance of subnetwork ensembling models when they are combined with data augmentation, we do not compare and contrast individual models, but rather discuss how individual models respond, in order to provide more general guidance.
4.1 | Results on Cifar-10/100 and TinyImageNet

Table 1 reports all model results tested on Cifar-10. Subnetwork ensembling frameworks show a performance boost in terms of accuracy compared to the base models. They also improve calibration (lower ECE) and have better uncertainty estimates (lower NLL). When MIMO and MixMo are trained with MixUp and CutMix, model performance across all three metrics also increases. That is, when ensemble models are combined with data augmentation, they better estimate uncertainty (lower NLL) and are better calibrated (lower ECE). We also see a similar trend with Masksembles: metric performance is at least as good, or better. Table 2 reports results for models trained and tested on Cifar-100. Improvement in the metrics for Cifar-100 is similar to Cifar-10.
Combining MixUp or CutMix with one of MIMO, MixMo, and Masksembles makes all models more performant (higher accuracy) and better calibrated (lower ECE & NLL). Combining ensemble models with data augmentation methods results in performance gains across all metrics. Table 3 reports results for models trained and tested on Tiny ImageNet. We see that the results still show the general tendency to improve when ensembling is combined with data augmentation(s). All three subnetwork ensemble frameworks have higher accuracy and lower calibration error when either of the MixUp and CutMix data augmentations is added to the training. These results provide strong support for the overall conclusion, since the Tiny ImageNet dataset has larger images, more classes, and more images than Cifar-10 and Cifar-100.
The test metrics for all three datasets imply that combining subnetwork ensemble frameworks with data augmentation improves accuracy, lowers NLL, and lowers ECE; that is, combining them results in better performance and better-calibrated models. There is a single exception to this conclusion: when combined with MixMo, MixUp augmentation results in a slight decrease in calibration (higher NLL & ECE). This is true for both Cifar datasets, although for Tiny ImageNet MixUp behaves in line with the general tendency. Apart from this exception, the behaviour is consistent across all combinations of subnetwork ensembles and data augmentations. Therefore, combining subnetwork ensembles with data augmentation methods does not harm calibration when tested on in-distribution data.
T A B L E 1 Performance results for WRN-28-10/CIFAR10.

| Models against image corruptions
Tables 4 and 5 report results when all models are tested against the corrupted Cifar datasets. Clearly, compared to the IID (uncorrupted) test sets, the performance of all models across all three metrics degrades. Nevertheless, ensemble models with data augmentations remain better calibrated than models without data augmentations.
When compared to the base model, subnetwork ensembles always improve model calibration (lower ECE), with the exception of MIMO (without augmentation). Using an augmentation in addition to a subnetwork ensemble almost always improves calibration. The only exception to this is using CutMix with MixMo on Cifar-10, where it also does not improve accuracy. Applying augmentation in addition to subnetwork ensembles can improve calibration by as much as 3.5× (e.g., MixMo + MixUp). One contrast to the IID datasets is that MixUp enhances both calibration and accuracy more than CutMix when tested against distribution shift. This implies that having an idea (when possible!) of the test data distribution would help to choose which combination to use in model deployment. This is also because different Mixed Sample Data Augmentation methods emulate different kinds of distribution shift.
In addition to ECE and NLL, we also report the squared kernel calibration error (SKCE) of Widmann et al. (2019). Provided that $E\big[\lVert k\big(g(X), g(X)\big)\rVert\big] < \infty$, where k is a matrix-valued kernel as in Definition 1 in Widmann et al. (2019), $(X', Y')$ is an independent copy of $(X, Y)$, and $e_i$ denotes the i-th unit vector, let us define a function h such that

$$h\big((g(X), Y), (g(X'), Y')\big) := \big(e_Y - g(X)\big)^{\top} k\big(g(X), g(X')\big) \big(e_{Y'} - g(X')\big).$$

Hence, the estimator below is a consistent and unbiased estimator of the squared kernel calibration error $\mathrm{SKCE}_k[g] := \mathrm{KCE}_{k,g}^2$:

$$\widehat{\mathrm{SKCE}}_{k,g} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} h\big((g(X_i), Y_i), (g(X_j), Y_j)\big).$$

Table 6 shows the SKCE values of each model on the Cifar-10 and Cifar-100 test sets. Here we see additional evidence that combining data augmentation and subnetwork ensembles does not harm model calibration. In fact, we see a marked difference in the KCE estimator (where lower is better) as defined by Widmann et al. (2019). Again, recall that our intention is not to compare across models, but rather to illustrate that, for the models we have experimented with, calibration improves in the presence of data augmentation methods. Note that at this stage we do not claim that the models are in any way "perfectly" calibrated, but that, in the search for better calibration, data augmentation approaches certainly seem to help subnetwork ensembles.
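For concreteness, a sketch of the unbiased SKCE estimator above is given below, using a scaled-identity matrix-valued kernel $k(p, q) = \exp(-\lVert p - q\rVert / \gamma)\, I$. The kernel choice and the bandwidth $\gamma$ are assumptions made for illustration, not necessarily the exact settings used in our experiments or in Widmann et al. (2019).

```python
# Sketch of the unbiased SKCE estimator with a scaled-identity matrix-valued kernel.
import numpy as np

def skce_unbiased(probs, labels, gamma: float = 1.0):
    """probs: (N, K) predicted distributions g(X_i); labels: (N,) true class indices."""
    n, k = probs.shape
    resid = np.eye(k)[labels] - probs                 # e_{Y_i} - g(X_i)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            kij = np.exp(-np.linalg.norm(probs[i] - probs[j]) / gamma)
            total += kij * resid[i].dot(resid[j])     # h((g(X_i), Y_i), (g(X_j), Y_j))
    return 2.0 * total / (n * (n - 1))
```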

| CONCLUSION
In this paper, we focused on multi-class classification problems and explored the effect on calibration of combining ensembles with data augmentation. Our extensive experiments illustrate that using subnetwork ensembles or data augmentation alone improves model calibration and robustness. More importantly, we find that combining subnetwork ensembles with MixUp or CutMix improves accuracy while not harming model calibration, thus adding some clarity to the literature on this point: we did not observe any trade-off between ensembling and data augmentation for subnetwork ensembles. Rather, in our experiments, combining subnetwork ensembles with data augmentation improved calibration and uncertainty estimates. Our experiments with benchmark corrupted datasets showed that the findings are also robust with respect to distribution shift.
T A B L E 6 Calibration error estimates (SKCE) for models on Cifar-10 & Cifar-100.
Hence, combining subnetwork ensembles with data augmentation methods for image classification tasks helps to improve performance without sacrificing calibration. This signals a divergence in the effects of combining different ensembling methods with data augmentation, and practitioners trying to boost performance should take this discrepancy into account. Exploring, as future research, this divergence in behaviour among ensembling methods when combined with data augmentation could yield a better understanding of these seemingly uncorrelated methods.

ACKNOWLEDGEMENT
Open access funding provided by IReL.

FUNDING INFORMATION
This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6183. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.