An interpretable ensemble method for deep representation learning

In the representation learning domain, the mainstream methods for model ensemble include "implicit" ensemble approaches, such as techniques like dropout, and "explicit" ensemble methods, such as voting or weighted averaging over multiple model outputs. Compared to implicit ensemble techniques, explicit ensemble methods allow more flexibility in combining models with different structures, thereby obtaining different perspectives on the representations. However, the representations produced by different models are not guaranteed to have a linear relationship, and simply combining multiple model outputs linearly may result in degraded performance. Meanwhile, employing non-linear fusion mechanisms such as distillation and meta-learning can be uninterpretable and time-consuming. To this end, we propose a hypothesis of linear fusion for the output representations of deep learning models and design an interpretable linear fusion method based on this hypothesis. The method applies a transform layer to map the output representations of different models onto the same classification centers. Experimental results demonstrate that, compared to directly averaging the representations, our method achieves better performance. Additionally, our method retains the convenience of direct averaging while offering better time and computational efficiency than non-linear fusion. Furthermore, we test the applicability of our method in both computer vision and natural language processing representation tasks, using supervised and semi-supervised approaches.

In the field of traditional machine learning, numerous ensemble learning methods already exist, such as random forest2 and AdaBoost.3 As ensemble learning methods have been successfully applied in machine learning, researchers have also turned their attention to ensemble methods in deep learning.4,5 Deep learning models, being over-parameterized, often encounter overfitting problems as model complexity increases, leading to a trade-off between the variance and bias of the model's prediction results.6 In this regard, ensemble learning serves as a solution to balance bias and variance in deep learning models, thereby achieving better robustness and generalization.
The ensemble methods used in deep learning models primarily include "implicit" techniques such as Dropout,7 DropConnect,8 Stochastic Depth,9 and Swapout.10 These methods construct multiple homogeneous models by randomly invalidating internal connections within the deep learning model, allowing these models to obtain more complementary knowledge of the data during training. Finally, through parameter sharing, the ensemble obtains results similar to the average or voting outcomes of the constituent models. However, these methods have limited flexibility, as they can only combine homogeneous models with similar structures. In contrast, "explicit" methods provide greater flexibility, enabling the ensemble of not only homogeneous models but also heterogeneous models with entirely different structural designs. This flexibility makes it possible to integrate more models with diverse perspectives.1 Common and feasible methods in the "explicit" category involve direct voting on heterogeneous model outputs or weighted averaging.1 While these ensemble methods are simple and efficient, a linear combination does not always guarantee superior fusion results compared to using a single model. To overcome this challenge, nonlinear ensemble methods have been developed, such as the method proposed by Shen et al.,11 which utilizes knowledge distillation.12 In this approach, pre-trained teacher models are first fixed, and a small network called the student model is trained using adversarial learning, in which a discriminator evaluates the quality of the student model's learning. The student model learns different knowledge from the various teacher models, forming an ensemble of the teachers. Unlike adversarial learning, meta-learning13 achieves a multi-model ensemble by learning a new network: in addition to the original base learners (also known as primary learners), meta-learning provides a secondary learner (a meta-learner) that learns the integration strategy of the base learners.
In the context of representation learning in computer vision (CV), linear fusion of model outputs often fails because the representations produced by different models do not share classification centers. For example, as shown in Figure 1, if model A is more sensitive to the presence of cats in an image, a specific part of its output representation, denoted y_1, will have a higher value. Conversely, model B may be more sensitive to the presence of flowers in the image, resulting in a higher value for the corresponding part of its output representation, denoted y_2. Simply averaging y_1 and y_2 to obtain the final representation, y_avg = (y_1 + y_2)/2, would lead to confusing results that cannot distinguish cats from flowers. Similar problems arise in natural language processing (NLP), where different models may focus on different aspects of the same sentence, resulting in conflicting representations. Forcefully applying a linear combination in such cases yields subpar performance due to the lack of a linear correlation among the results.
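To make this failure mode concrete, the following toy sketch (with made-up two-dimensional representations; the vectors and class assignments are purely illustrative and not taken from the paper) shows how averaging two models whose dimensions encode different concepts collapses the distinction between classes.

```python
import numpy as np

# Hypothetical 2-D representations: model A encodes "cat-ness" in dim 0,
# while model B encodes "cat-ness" in dim 1 and "flower-ness" in dim 0,
# so the dimensions of the two models are not aligned.
y1_cat    = np.array([0.9, 0.1])   # model A, image of a cat
y2_cat    = np.array([0.1, 0.9])   # model B, same cat image
y1_flower = np.array([0.1, 0.9])   # model A, image of a flower
y2_flower = np.array([0.9, 0.1])   # model B, same flower image

avg_cat    = (y1_cat + y2_cat) / 2        # -> [0.5, 0.5]
avg_flower = (y1_flower + y2_flower) / 2  # -> [0.5, 0.5]

# The averaged representations of the cat and the flower are identical,
# so the naive ensemble can no longer separate the two classes.
print(avg_cat, avg_flower)
```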
Although adversarial learning and meta-learning methods can nonlinearly fuse the results of multiple models through new networks and additional training, thereby mitigating the issues associated with linear fusion, the added network models and training processes lead to high training costs, especially given the challenging convergence of adversarial training. Furthermore, both the student network and the meta-learner are black-box models and lack interpretability.
To address the lack of a linear relationship among heterogeneous models' outputs while keeping computational efficiency and interpretability, we first retrain each pre-trained heterogeneous model using a specific loss function. On the top layer of each model, we obtain a classification layer composed of class centroid vectors, one per category. We select the classification centroid vectors of the best-performing model as the common classification vectors, because these vectors maximize the differences between categories, ensuring that the centroids of different categories are as far from each other as possible. We replace the classification layers of the other models with this new classification layer and introduce a transformation layer between each of the other models and the new classification layer. We train each transformation layer separately to move the output vectors of the other models toward the vicinity of the common classification centroids, so that the representations output by different models satisfy a linearly additive relationship. This approach rests on the following assumption: the outputs ŷ_i = f(g_i(x)) of different models can be effectively linearly fused only when they share the same linear projection f, that is, when they are mapped onto the same classification centers, where f is the linear projection function and g_i is the backbone network.
Finally, we obtain the final representation by taking a weighted average of the transformed results. We test our method for ensembling pre-trained heterogeneous models and demonstrate its superiority over direct averaging in both CV and NLP tasks, showcasing the generality of our method in representation learning.
Our contributions in this paper can be summarized as follows. First, we address the challenge of simply averaging the outputs of deep learning models and propose an interpretable linear combination method that is easy to implement. Our experiments validate the assumption underlying the linear combination and provide an explanation for our motivation. Second, we conduct experiments in both the CV and NLP domains, employing supervised and unsupervised training, respectively. We illustrate how our ensemble approach effectively fuses heterogeneous models in the "explicit" ensemble scenario and outperforms the average combination as well as any single model. Furthermore, we extend our investigation to the "implicit" ensemble scenario and achieve superior performance compared to the average combination and any single model by integrating homogeneous models with different initialization parameters. Moreover, our method can be seamlessly combined with other implicit ensemble techniques such as dropout to enhance performance.

Machine learning ensemble
In the field of traditional machine learning, the most frequently employed ensemble methods are bagging,3 random forest,2 AdaBoost,14 gradient boosting,15 and so forth. The bagging technique trains the base models in parallel on sample subsets generated by random sampling from the training dataset. Similarly, random forest trains several decision tree models in parallel, sampling over both the samples and the feature dimensions; by merging the voting results of many decision trees, it alleviates the tendency of individual decision trees to overfit. AdaBoost focuses on incorrectly classified samples and iteratively modifies sample weights to improve the performance of the base models at the integration stage. Gradient boosting draws sample subsets from random samples, and each learner is built and trained to reduce the residuals left by the previous one; it can therefore drive the prediction close to the actual value by reducing the sum of the final residuals of the integrated models. Both AdaBoost and gradient boosting are trained sequentially.
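For reference, a minimal scikit-learn sketch of these classical ensembles on a synthetic dataset (illustrative only; the dataset and hyperparameters are not those used in this paper):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy of each ensemble
    print(name, cross_val_score(model, X, y, cv=5).mean())
```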

Implicit deep learning ensemble
The ensemble methods in deep learning are divided into two types: "implicit" and "explicit." Typical "implicit" ensemble methods include Swapout,10 Dropout,7 DropConnect,8 and Stochastic Depth.9 These techniques train many homogeneous models with shared weights and then implicitly ensemble them at test time. The Dropout approach builds an ensemble out of a single model by removing random groups of hidden nodes after each mini-batch; each node is scaled according to its survival probability during training, building an exponentially large number of networks with shared weights that are implicitly ensembled during testing, when no nodes are dropped. DropConnect builds its ensembles in the same spirit by eliminating connections rather than nodes during training. In the more recently proposed stochastic depth, layers are dropped at random during training to produce an implicit ensemble of networks with different depths at test time. Thus, Dropout averages over architectures with "missing" units, while stochastic depth averages over structures with "missing" layers. Swapout is a stochastic training technique that generalizes dropout and stochastic depth: it skips layers randomly at the unit level while benefiting from both methods, producing diverse network architectures for model averaging.
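A minimal PyTorch sketch of the implicit-ensemble behavior of dropout (illustrative layer sizes): units are dropped, and the survivors rescaled, during training, while at test time all units are kept, which approximates averaging the exponentially many thinned subnetworks.

```python
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(4, 16)

layer.train()           # training: each forward pass samples a different thinned subnetwork
out_train_1 = layer(x)
out_train_2 = layer(x)  # differs from out_train_1 because of the random dropout masks

layer.eval()            # testing: no units are dropped, implicitly averaging the subnetworks
out_test = layer(x)     # deterministic output
```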

Explicit deep learning ensemble
Besides taking an average over multiple instances, Shen et al.11 propose an explicit ensemble method based on knowledge distillation. A tiny network called the student model is trained with adversarial learning, in which a discriminator assesses the quality of the student model's learning, and the student is fitted to the outputs of several pre-trained teacher models. The fundamental idea is to let the student model absorb different skills from different teacher models, creating an ensemble of the teachers. Similar studies16 integrate the "knowledge" of complex ensembles into a single model. Related to distillation, the approach of combining multiple learning stages is referred to as meta-learning: the outputs of the individual inducers act as inputs to a meta-learner, which ultimately generates the final output.17 A classic representative of meta-learning is the stacking algorithm,18,19 whose basic idea is as follows: first, K base learners are trained on the original dataset D; then, based on the predictions of the base learners, a new dataset D′ is generated; finally, D′ is used to train a secondary learner. In D′, the outputs of the K base learners for a sample x_i are combined to create a new sample x′_i, and the label y_i of x_i is also used as the label for x′_i.
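A minimal sketch of this stacking procedure (scikit-learn, synthetic data; the choice of base learners and meta-learner is illustrative): the out-of-fold predictions of the base learners form the new dataset D′, on which the meta-learner is trained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# K base learners (primary learners)
base_learners = [DecisionTreeClassifier(max_depth=5, random_state=0),
                 KNeighborsClassifier(),
                 LogisticRegression(max_iter=1000)]

# Build D': out-of-fold predictions of each base learner become the new features x'_i,
# and the original label y_i is reused as the label of x'_i.
D_prime = np.column_stack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in base_learners
])

# Secondary learner (meta-learner) trained on D'
meta_learner = LogisticRegression().fit(D_prime, y)
```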

Supervised representation learning in CV
In the CV field, we train pre-trained models on a classification task over a specific dataset to obtain representations of images. The quality of these representations plays a crucial role in downstream tasks such as segmentation, recognition, and querying. Given the availability of classification labels, we adopt supervised training for this purpose. Technically, we pre-train all models on the same batch of image training sets with the arcface loss function20 to obtain the models f_k(g_k(x)), where k ∈ [1, 2, …, K], together with an optimal model f_optimal(g_optimal(x)), where f is the linear layer of arcface and g is the backbone network. The arcface technique constructs the following cross-entropy-based loss from the output representation vectors of the images and the corresponding one-hot ground-truth labels:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}} \qquad (1)$$

where N is the batch size, n is the number of categories, s is the normalization (scale) parameter, cos θ_j = W_j^T ŷ_i with ‖W_j‖ fixed to 1 by l2 normalization, and the embedding feature ŷ_i likewise fixed by l2 normalization. Here W ∈ R^{d×n} is the parameter matrix of f, and ŷ_i ∈ R^d is the output of the backbone network g. W_j, the jth column of W, is the vector representation of a classification center; W_{y_i} is the ground-truth classification center of ŷ_i, and m is an additive angular margin penalty applied to the ground-truth class. By normalizing W_j and the output vector ŷ_i, the model learns purely angular characteristics, so the sample representations are distributed on a hypersphere. cos θ_j is the cosine of the angle between the sample vector and the center vector of class j, and s is the radius of the hypersphere. This loss function makes the learned sample representations cluster tightly around their classification centers while maintaining a large inter-class distance.
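A minimal PyTorch sketch of an arcface-style objective as described above (our reading of Equation (1), not the authors' released code; the embedding dimension, number of classes, and the hyperparameters s and m are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLayer(nn.Module):
    """Linear layer f whose rows W_j act as class-center vectors on a hypersphere."""
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, y_hat, labels):
        # l2-normalize the embeddings and the class centers -> cos(theta_j)
        cos = F.linear(F.normalize(y_hat), F.normalize(self.W))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only to the ground-truth class
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)   # cross-entropy over re-scaled cosines

# hypothetical usage: loss = ArcFaceLayer(512, 1000)(backbone(images), labels)
```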

Unsupervised representation learning in NLP
Our method can also be used in unsupervised tasks; we take representation learning in NLP as an example. We generate balanced negative samples from the positive samples following the contrastive learning paradigm, and use a pre-trained NLP model to learn the similarity of positive and negative samples, so that the model understands the semantics: sentences with similar semantics are clustered together, while sentences with different semantics maintain a large gap.
For each pre-trained NLP model g_k, we add a linear layer f as a classification linear layer, similar to the arcface layer in the supervised CV task above, and use a cross-entropy-based contrastive loss:

$$\ell_i = -\log\frac{e^{\,\mathrm{sim}(r_i,\, r_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\,\mathrm{sim}(r_i,\, r_j^{+})/\tau}}$$

where the augmented sample of x_i is denoted x_i^+; x_i and x_i^+ are the inputs of the backbone network, and y_i, y_i^+ are the corresponding outputs. sim(x, y) denotes the similarity between vectors x and y, such as the cosine similarity x^T y / (‖x‖‖y‖); W ∈ R^{d×n} is the parameter matrix of the linear layer f; r_i = W^T y_i and r_i^+ = W^T y_i^+ are the backbone outputs y_i and y_i^+ projected by f; and τ is a temperature hyperparameter. Unlike images, which have specific classification centers, sentences can only be clustered according to similar semantics, and the jth column of W, W_j, can be viewed as a vector representing a classification center with a particular semantics.
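A minimal PyTorch sketch of this in-batch contrastive objective (function name and the temperature value are illustrative): each sentence and its augmented copy are projected by f, and the loss pulls r_i toward r_i^+ while pushing it away from the other in-batch positives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(r, r_pos, tau=0.05):
    """r, r_pos: (N, d) projected representations of x_i and its augmentation x_i^+."""
    r = F.normalize(r, dim=-1)
    r_pos = F.normalize(r_pos, dim=-1)
    # sim(r_i, r_j^+) / tau for every pair in the batch
    logits = r @ r_pos.T / tau                       # (N, N)
    labels = torch.arange(r.size(0), device=r.device)
    # -log( exp(sim(r_i, r_i^+)/tau) / sum_j exp(sim(r_i, r_j^+)/tau) )
    return F.cross_entropy(logits, labels)
```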

Linearization and combination
Traditional combination directly performs a simple weighted average of the representation results of different models.
Since different models may capture different key features, and these features are captured by different core neurons within each model, the outputs of these models do not satisfy a linear relationship; forcibly combining them linearly leads to chaotic representations and a loss of performance. Because the relationship between different models is nonlinear, converting it into a linear one requires unifying all models into the same vector space of classification centers. Our framework is shown in Figure 2. After retraining the pre-trained models, we obtain a classification layer W for each model, in which each column is a classification-center vector. Our method therefore takes the linear layer of the optimal model, f_optimal, as the common linear layer and uses it to replace the linear layers f_k of the other models. At the same time, to keep the knowledge learned by each backbone network g_k unchanged, we fix the parameters of g_k and add a new learnable linear layer l_k between g_k and the optimal linear layer f_optimal for each of the other models. Fixing g_k and f_optimal, we optimize the loss function L and retrain the linear layer l_k to obtain a new model ŷ_k = f_optimal(l_k(g_k(x))). The training method is stochastic gradient descent (SGD).21 Thus, the outputs satisfy the linearity assumption, since all ŷ_k are aligned to the same classification centers. In particular, backbone networks g_k sharing the same network structure are a special case of the above: since all models have a similar structure, it suffices to fine-tune each model on the common arcface linear layer to obtain a model group satisfying the assumption, ŷ_k = f_optimal(g_k(x)). Finally, we average the model group ŷ_k = f_optimal(l_k(g_k(x))), k ∈ [1, 2, …, K], to obtain the final combined representation:

$$\hat{y} = \frac{1}{K}\sum_{k=1}^{K}\hat{y}_k \qquad (2)$$

Our method is summarized in Algorithm 1.

Algorithm 1
Input: training dataset D_t and validation dataset D_v
1. For k ∈ [1, 2, …, K]: train f_k(g_k) on D_t with SGD, using Equation (1) as the objective function.
2. Select the best-performing model f_optimal(g_optimal) based on D_v.
3. For k ∈ [1, 2, …, K] and k ≠ optimal: replace f_k with f_optimal, and add a new learnable linear layer l_k between g_k and f_optimal, so that f_optimal(g_k) becomes f_optimal(l_k(g_k)).
4. Fixing g_k and f_optimal, fine-tune l_k on D_t with SGD, using Equation (1) as the objective function.
5. Return the averaged output based on Equation (2).
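A minimal PyTorch-style sketch of Algorithm 1 (the names `build_ensemble`, `loss_fn`, and the training loop are placeholders, and the common head is treated as a plain linear classification layer for simplicity; this is an illustration of the procedure, not the authors' implementation):

```python
import torch
import torch.nn as nn

class TransformedModel(nn.Module):
    """y_hat_k = f_optimal(l_k(g_k(x))) with g_k and f_optimal frozen."""
    def __init__(self, g_k, f_optimal, dim):
        super().__init__()
        self.g, self.f = g_k, f_optimal
        self.l = nn.Linear(dim, dim)               # learnable transform layer l_k
        for p in list(self.g.parameters()) + list(self.f.parameters()):
            p.requires_grad = False                # keep the learned knowledge unchanged

    def forward(self, x):
        return self.f(self.l(self.g(x)))

def build_ensemble(backbones, heads, optimal_idx, dim, train_loader, loss_fn, epochs=1):
    f_opt = heads[optimal_idx]                     # common classification layer f_optimal
    members = []
    for k, g_k in enumerate(backbones):
        if k == optimal_idx:
            members.append(lambda x, g=g_k, f=f_opt: f(g(x)))
            continue
        model = TransformedModel(g_k, f_opt, dim)
        opt = torch.optim.SGD(model.l.parameters(), lr=1e-2)
        for _ in range(epochs):                    # fine-tune only l_k with Equation (1)
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        members.append(model)
    return members

def ensemble_representation(members, x):
    # Equation (2): simple average of the aligned outputs
    return torch.stack([m(x) for m in members]).mean(dim=0)
```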

EXPERIMENT AND RESULTS

Metrics
We use mP@5 as the metric to measure the quality of image representations:

$$\mathrm{mP@}k = \frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{\min(n_q, k)}\sum_{j=1}^{\min(n_q, k)}\mathrm{rel}_q(j)$$

where Q is the number of query images, n_q is the number of index images containing an object in common with the query image q (note that n_q ≥ 0), and rel_q(j) denotes the relevance of prediction j for the qth query: it is 1 if the jth prediction is correct and 0 otherwise. Depending on the value of k in mP@k, several variants of the metric can be used; however, given the specific circumstances of the H&M dataset, larger values of k would effectively be truncated to n_q because of the min(n_q, k) term, so k = 5 is the most appropriate choice. To maintain consistency, the same metric is also employed for the models using Swin-transformer as the backbone.
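A short sketch of how mP@k can be computed from ranked retrieval results (the inputs `relevance`, a per-query list of 0/1 values rel_q(j), and `n_q`, the per-query ground-truth counts, are hypothetical; queries with n_q = 0 are assumed to score zero):

```python
def mean_precision_at_k(relevance, n_q, k=5):
    """relevance[q][j] = rel_q(j) (1 if prediction j for query q is correct); n_q[q] >= 0."""
    scores = []
    for rel, n in zip(relevance, n_q):
        m = min(n, k)
        if m == 0:
            scores.append(0.0)       # assumption: queries with no relevant images score 0
            continue
        scores.append(sum(rel[:m]) / m)
    return sum(scores) / len(scores)
```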
Prior work22 argues that Spearman correlation, which measures rankings instead of actual scores, better suits the need of evaluating sentence embeddings. For all of our NLP experiments, we therefore report Spearman's rank correlation:

$$\rho = 1 - \frac{6\sum_{i=1}^{n}(x_i - y_i)^2}{n(n^2 - 1)}$$

where x_i is the rank of x′_i within x′, y_i is the rank of y′_i within y′, x′ and y′ are the two sets of data whose correlation is to be calculated, and n is the number of samples.
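For reference, Spearman's rank correlation can be computed directly from the definition above or with scipy (the scores below are toy values, not results from the paper):

```python
from scipy.stats import spearmanr

predicted = [0.81, 0.12, 0.55, 0.92, 0.33]   # model cosine similarities (toy values)
gold      = [4.2, 0.8, 2.9, 4.8, 1.5]        # human similarity ratings (toy values)

rho, _ = spearmanr(predicted, gold)          # correlation of the ranks, not the raw scores
print(rho)
```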

Pre-trained model
In order to test the effectiveness of the proposed method, we conduct experiments in the fields of CV and NLP, respectively. The Vision Transformer (ViT)23 is currently one of the best models for image classification, surpassing the best convolutional neural networks (CNNs): ViT outperforms the best ResNet on all public datasets provided that it is pre-trained on a sufficiently large dataset, and its advantage becomes more obvious as the pre-training dataset grows. Therefore, we choose ViT as the backbone of our downstream classification task; the specific implementation uses the pre-trained model provided in 24. In ViT, the input is directly downsampled by a factor of 16 from the beginning, and the subsequent feature maps keep the same downsampling rate. In contrast, the improved Swin transformer25 uses a hierarchical construction (hierarchical feature maps) similar to a CNN; for example, the image is downsampled by factors of 4, 8, and 16 in successive feature maps. Such a backbone is helpful for tasks such as object detection and instance segmentation. In addition, the Swin transformer uses Windowed Multi-Head Self-Attention (W-MSA): at, for example, 4× and 8× downsampling, the feature map is divided into multiple disjoint regions (windows), and multi-head self-attention is performed only within each window. Compared with performing multi-head self-attention on the entire (global) feature map as in ViT, this reduces the amount of computation, especially when the shallow feature maps are large. Therefore, the Swin transformer is a worthwhile comparison model: it may capture features from different perspectives than ViT and serve as an effective supplement.
We add an arcface layer to both models as the target space. To obtain different encoding models, we train ViT at three different resolutions: 224 × 224, 280 × 280, and 290 × 290. For the Swin-transformer model, we choose three models of different scales at a uniform resolution: Swin-tiny, Swin-small, and Swin-base. The classification accuracy and mP@5 illustrate the effectiveness of our proposed method. Sentence semantic representation is a critical task in the field of NLP. We use the SimCSE26 model, which currently performs relatively well in unsupervised text semantic representation, to conduct experiments verifying the effectiveness of our proposed method.
The key to contrastive learning is constructing positive and negative samples, or directly obtaining the final positive and negative representations. SimCSE feeds a sentence into the neural network twice and uses the dropout layers in the deep neural network to directly obtain the corresponding positive-sample representation, while the representations of other sentences in the same batch serve as negative samples. Besides this direct construction of positive and negative representations, there are many other ways to construct positive and negative samples, such as synonym replacement,27 deletion of words, adjustment of word order,28 and so forth. The following uses BERT29 as an example to illustrate the model architecture: a pre-trained Bert-base-uncased model is used as the backbone, followed by a linear layer that serves as the semantic representation space. Before the linear layer, we take the encoding at the [CLS] position to represent the entire sentence.
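A minimal sketch of this [CLS]-pooling step using the Hugging Face transformers API (the projection dimension and the example sentences are illustrative; the bert-base-uncased checkpoint is the one named in the text):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModel.from_pretrained("bert-base-uncased")
projection = nn.Linear(768, 768)              # linear layer used as the semantic space

sentences = ["a cat sits on the mat", "a kitten is on the rug"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = backbone(**batch).last_hidden_state   # (batch, seq_len, 768)
cls = hidden[:, 0]                            # encoding at the [CLS] position
sentence_repr = projection(cls)               # representation fed to the contrastive loss
```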
The prediction results of the above model on the STSB30 evaluation set under different random number seeds are used as the criteria for selecting the semantic representation space. We then retrain the models other than the one providing the selected semantic space: a linear layer is added between the backbone and the semantic representation space, and the backbone and the added linear layer are trained while the semantic representation space is kept fixed.
Roberta31 and Bert differ considerably in their training processes. The static masking used in Bert copies a sentence multiple times and randomly masks some tokens in the input sentence; the purpose is to predict the masked words from their context and finally obtain representations that fuse contextual information, but within one epoch there are still a large number of identically masked samples. To address this, Roberta adopts a dynamic masking mechanism so that the masked portion of each sample differs. In addition, Roberta simplifies the training procedure and greatly increases the training batch size. This allows Roberta to acquire a broader and different understanding even with the same model structure; the change in the pre-training process therefore has an effect similar to changing the model structure.
We also conduct a contrast experiment by replacing the backbone in the above NLP setup with Roberta and replacing the data augmentation method with randomly deleting some words28 in a sentence. It should be noted that the data augmentation methods in the NLP experiments include duplication and deletion: the former duplicates a sentence and relies on the dropout layers of the network to achieve augmentation, while the latter randomly removes certain characters or words from a sentence.
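A small sketch of the word-deletion augmentation used as the alternative to duplication (the deletion probability and function name are illustrative choices, not values from the paper):

```python
import random

def random_word_deletion(sentence, p=0.1, seed=None):
    """Randomly drop each word with probability p, keeping at least one word."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else rng.choice(words)

print(random_word_deletion("the quick brown fox jumps over the lazy dog", p=0.2, seed=0))
```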

Datasets
The model with ViT as the backbone in CV is trained on Products-10k,32 Shopee, MET,33 GPR,34 and GLDv2,35 and its performance is evaluated on H&M. (H&M refers to the dataset provided by the H&M company in the Kaggle competition; the name is that of the company itself.) The backbone models built on the Swin-transformer, however, are trained only on Products-10k. For the NLP models, we train on Wiki and demonstrate the semantic representation capabilities on the datasets of seven text semantic similarity tasks, namely STS12,36 STS13,37 STS14,38 STS15,39 STS16,40 STS Benchmark,30 and SICK-Relatedness.41 Each of these seven datasets contains pairs of sentences together with a score indicating their similarity. We use the output at the [CLS] position of Bert to encode a sentence, and the cosine value between two sentence representations is taken as their semantic similarity. After computing the similarity scores of the sentence pairs, the Spearman coefficient between the actual scores and the predicted scores is calculated as the criterion for judging model quality.

Results
The ViT model is used as the backbone for image representation in CV, and the performance of the models obtained by ensemble training at different resolutions is verified on the H&M dataset. The evaluation split of Products-10k is used to test whether our method can improve the performance of models with Swin-transformer as the backbone. Both ViT and Swin-transformer, when used as backbones, aim to obtain a semantic representation of an image; the ultimate goal is that images with similar content have closer semantic representations. Furthermore, we verify the effectiveness of the proposed method in NLP: with Bert (Roberta) as the backbone, the models obtained by different data augmentation methods are evaluated on seven text semantic similarity datasets. Using the ViT model as the backbone, the results in Table 1 are obtained by computing image representations at different resolutions. From the table, it can be observed that the semantic representations obtained under the proposed strategy improve at every resolution. Moreover, as the resolution increases and the effective receptive field in ViT widens, the effect of the training becomes more prominent. This indicates that semantic representations at higher resolutions possess more stable semantic expressions, allowing learning in a more realistic semantic target space after applying the improvement strategy. Additionally, compared to the traditional ensemble strategy, the models enhanced by the proposed method achieve a more impressive ensemble result (0.4510 → 0.4675).

TABLE 1 The results (mP@5) of the model with ViT as the backbone at different resolutions.

By utilizing Swin-transformer models of different scales as backbones and incorporating the proposed improvement strategy, the validation results in Table 2 are obtained. On the one hand, the experimental results demonstrate that, with the same backbone model, the adjusted models achieve higher classification accuracy after applying the improvement strategy (Acc@1: 85.05 → 95.82, Acc@5: 95.82 → 97.28). This indicates that our proposed method can better use the learned semantic space to guide the other sub-models and obtain the desired semantic representations. Furthermore, the improved image representation results (0.6496 → 0.6720) indicate that the learned semantic representations perform better at retrieving similar information, which is particularly advantageous for intra-class tasks. On the other hand, compared with traditional methods, the models at different scales show improved performance after the improvements, suggesting that our proposed method is robust. Horizontally, the experimental results at different scales, covering both accuracy and intra-class recognition, indicate that as the scale of the backbone model grows, the experimental performance also improves. Moreover, from an overall perspective, our proposed method achieves a better ensemble effect by optimizing the performance of each individual sub-model.

In NLP, where self-supervised learning is used as the learning approach, Bert and Roberta serve as backbones. By combining the proposed method with two data augmentation techniques, namely duplication and deletion, the performance results in Table 3 on seven text semantic similarity datasets are obtained. The experimental results in the table demonstrate that incorporating the proposed improvement strategy yields better results than traditional ensemble methods. From the perspective of data augmentation, both duplicating entire sentences and deleting certain words or characters yield varying degrees of improvement in semantic effectiveness (avg: 76.88 → 77.71, avg: 76.95 → 77.46). However, with Bert or Roberta as the backbone, duplicating sentences as the data augmentation method produces better scores than deleting words or characters, because deletion can cause semantic inconsistencies between the augmented sentences and the source sentences, so artificially constructed interference items are learned during training. From the perspective of the backbone model, using Bert as the backbone yields a higher performance improvement than using Roberta, which indicates that our method can fully leverage the advantages of different learners while providing better assistance to weaker learners. Moreover, in terms of the test datasets, our proposed method achieves a 1.3 percentage point improvement on STSB compared with traditional ensemble methods, and a gain of 3.48 percentage points on STS15 compared with the best-performing single model. The scores obtained on these text semantic similarity datasets demonstrate the strong generalization capability of the proposed method.

Analysis
This section analyzes why the proposed method does not work as well on the model with Roberta as the backbone. First, we give the average scores under several random numbers and different data augmentation methods (in the corresponding table, the first line is the best score under three random numbers, the second line is the ensemble score of the three models, and the third line is the score obtained after retraining according to our method). According to these data, the prediction results differ substantially across random numbers in the model with Bert as the backbone (variance 0.057): the models under different random numbers learn semantics from different angles. However, the variance of the predicted results in the model with Roberta as the backbone is about one-twelfth of that with Bert as the backbone. This causes the features learned by the model under each random number to tend to be consistent, losing the meaning of ensemble learning. Additionally, before model ensembling, it is important to determine the appropriate number of sub-models. We therefore conducted experiments using multiple models. Through experiments on text semantic similarity computation in NLP, we obtained Figure 3: although a 4-model ensemble performs well on the STS12 dataset, a 3-model ensemble offers good performance on many datasets when efficiency and benefit are considered together. Hence, we ultimately chose to conduct the ensemble experiments with three sub-models.

TABLE 4
To verify that our method is also effective with different model combinations, we vary the numbers of Bert and Roberta models in the ensemble while keeping their sum at three. The experimental results in Table 5 show that our method further improves on the original ensemble, which demonstrates that it is also applicable in heterogeneous settings.

FIGURE 3 The influence of the number of models on the integration results. M_k represents the result of integrating k models.

CONCLUSION
We provide an interpretable and easy-to-implement linear combination method for deep learning models. Intuitively, based on the linear-combination assumption, our method sets a standard linear classification layer and fine-tunes the different models to learn representations around the same classification centers, thus obtaining linearly combinable representation results from the various models. Finally, more robust performance is achieved by averaging these results. We have verified the method on representation learning tasks in the two different fields of CV and NLP. In the "explicit" ensemble case, our method can combine a wide variety of architectures, such as ViT in CV and Bert and Roberta in NLP. It is simpler and easier to implement than methods that distill different models into a single model, and it avoids the feature confusion caused by direct averaging while remaining interpretable. We also simulate the "implicit" ensemble case by ensembling models with identical structure but different parameter initializations, obtaining better results than the average combination or any single model. For example, the average performance on seven publicly available NLP datasets and the experimental results on image retrieval in CV (0.6496 → 0.6720 and 0.6578 → 0.6839) demonstrate that our proposed method not only improves the guidance for each sub-learner but also exhibits significant advantages over the weak learners. Furthermore, our method can be used in conjunction with any of these implicit ensembling techniques, since all models used in our experiments, including ViT, Bert, and Roberta, employ dropout during training. There are several avenues for future work. We focused on combining the output results of three independently trained models; dealing with linear combinations of more model outputs might promote ensemble diversity and improve performance even further. In addition, compared with simply averaging the outputs of different models, performance could be further enhanced by optimizing the ensemble weights, as in stacking42 or adaptive mixtures of models.43

TABLE 2 The results on classification accuracy and mP@5 of the models from the Swin-transformer family used as the backbone.

TABLE 4 Performance of the proposed model ensemble strategies on textual semantic similarity.

TABLE 3 The average score and variance of the models under different random numbers and backbones on seven text semantic similarity representations.

TABLE 5 Scores of text semantic similarity for models mixed with different amounts of Bert and Roberta.