EMA‐GAN: A Generative Adversarial Network for Infrared and Visible Image Fusion with Multiscale Attention Network and Expectation Maximization Algorithm

The purpose of infrared and visible image fusion is to generate a fused image with rich information. Although most fusion methods can achieve good performance, there are still shortcomings in extracting feature information from the source images, which makes it difficult to balance the thermal radiation region information and the texture detail information in the fused image. To address these issues, an expectation maximization (EM) learning framework based on generative adversarial networks (GAN) for infrared and visible image fusion is proposed. The EM algorithm (EMA) can obtain maximum likelihood estimates for problems with latent variables, which helps address the lack of labels in infrared and visible image fusion. An axial-corner attention mechanism is designed to capture long-range semantic information and texture information of the visible image. A multifrequency attention mechanism mines the relationships between features at different scales to highlight target information of infrared images in the fused result. Meanwhile, two discriminators are used to balance the two different kinds of features, and a new loss function is designed to maximize the likelihood estimate of the data with soft class label assignments obtained from the expectation network. Extensive experiments demonstrate the superiority of EMA-GAN over the state of the art.

Such conventional methods not only have poor generalization ability but also greatly increase computational cost, which limits the progress of image fusion technology.
The powerful feature extraction and representation abilities of deep learning have attracted the attention of researchers in the field of image fusion in recent years. To overcome the shortcomings of the above methods, researchers have proposed a large number of image fusion methods based on deep learning. According to the structure of the model, these methods can be roughly divided into three categories: [17] image fusion methods based on auto-encoders (AE), convolutional neural networks (CNN), and generative adversarial networks (GAN).
Image fusion methods based on AE generally consist of three parts: an encoder, a decoder, and fusion rules. In general, a deep learning model is used in the encoder and decoder to extract features and reconstruct images, respectively. In the fusion stage, manual fusion rules are adopted to fuse the features extracted from the different source images. Li et al. [18] proposed a supervised autoencoder network that obtains deeper features by training encoders with dense blocks, so as to make the fusion of infrared and visible images more comprehensive. Based on DenseFuse [18] and UNet++, [19] Li et al. [20] proposed a fusion strategy with spatial attention and channel attention, which enables the extracted multiscale features to retain more significant features in image reconstruction. However, for the fusion of in-depth features, manual fusion rules cannot complete the task well, which limits the performance of AE-based image fusion methods.
Image fusion methods based on convolutional neural networks are the main trend in image fusion. Convolutional layers play an enormous role, as they can extract more feature information than conventional methods. CNN-based image fusion methods can learn the various weights in the model through training, which avoids the complexity and huge resource consumption caused by manually designed fusion rules. Of course, challenges come with this as well. First, a high-performance network requires a large, high-quality labeled dataset. Second, designing models for different fusion tasks is a great challenge. Finally, as the convolutional layers deepen, feature loss becomes more serious, which deteriorates the fused result. Liu et al. [21] first proposed a multifocus image fusion method based on CNN in 2017. This method used a CNN model to extract the features of the source images and then obtained an optimal decision map through morphological processing. Finally, the decision map and the source images are weighted and summed to obtain the final multifocus fused image. Li et al. [22] proposed a CNN-based fusion method for infrared and visible images. The detail part of the decomposed source image is processed by a visual geometry group (VGG) network to extract multilayer features, and the detail part and the base part are fused to obtain the final fused image. Although these methods all use convolutional networks to extract feature information, it can be seen that as the convolutional layers deepen, the loss of feature information increases.
Since CNN-based image fusion methods require a large amount of ground truth during training, many fusion tasks lacking ground truth cannot effectively train the network. Consequently, unsupervised image fusion methods based on GAN have been proposed one after another. Ma et al. [23] proposed a GAN-based fusion method for infrared and visible images for the first time. Through adversarial training between the generator and the discriminator, the fused image generated by the generator can simultaneously retain the thermal radiation information of the infrared image and the texture information of the visible image. However, during the fusion process, this method loses a lot of feature information from the source images, making the fused result look more like an infrared image and lack the detailed information of the visible image. On the basis of FusionGAN, Ma et al. [24] designed a GAN with dual discriminators. In this model, the generator and the two discriminators are trained adversarially at the same time. The two discriminators aim to make the fused image generated by the generator have the significant features of the infrared image and the texture details of the visible image, respectively. This enables the model to better balance the feature information between infrared and visible images in the fused results. Although the above works have achieved excellent results, some problems remain. In the field of infrared and visible image fusion, there is a lack of ground truth, so simply designed loss functions and simply structured models are often not enough to support research in this field.
To address the above issues, we propose a novel generative adversarial network based on axial-corner attention (ACA) and multifrequency attention (MFA), called EMA-GAN. Unlike general GAN models, it is an EM learning framework based on GAN that maximizes the likelihood of fused results and estimates latent variables. A single-scale feature cannot fully represent the spatial information of an information-rich scene, so we use multiscale attention networks to extract the deep feature information of the source images. Images of different modalities carry different information. To highlight these differences, we design different attention mechanisms for different source images. We design an axial-corner attention block, which includes an axial-attention mechanism, a corner-attention mechanism, and a channel-attention mechanism, to highlight texture details and edge information in visible images. Different from visible images, infrared images reflect the significance of targets through pixel intensity. Therefore, we use a multifrequency attention block to extract the pixel information in infrared images. At the same time, we design a loss function suitable for model training to constrain the model to generate higher-quality fused images. The contributions of this work are summarized as follows:
1) A GAN-based EM learning framework is proposed and, for the first time, introduced into deep-learning-based infrared and visible image fusion, solving the problem of poor fusion quality caused by the lack of ground truth. Since the latent variables in infrared images differ from those in visible images, we correspondingly design different expectation networks (E-Nets) to predict these latent variables.
2) For feature extraction of visible images, an ACA mechanism is proposed, which integrates axial attention, corner attention, and channel attention to capture the long-range semantic information of visible images and highlight the texture and edge of visible images.
3) An MFA mechanism is proposed, which strengthens the relationship between different feature layers, and assigns more weight to the salient regions, so that the salient information in the fused results can be retained.
4) Compared with other advanced methods, our proposed model has achieved satisfactory performance in both subjective and objective evaluation.
The remainder of this article is structured as follows. In Section 2, we briefly review related image fusion methods and the application of the EM algorithm in image processing. In Section 3, we present our proposed EMA-GAN in detail. Section 4 presents the experimental results comparing our method with several state-of-the-art methods on the TNO dataset. The conclusions of our work are given in Section 5.

Related Works
Since our work is based on an end-to-end GAN model, this section briefly introduces the basic theory of GAN and the relevant applications of the EM algorithm in image processing.

Generative Adversarial Networks
The generative adversarial network was first proposed by Goodfellow et al. [25] in 2014. It establishes a mutual game between a generator G and a discriminator D, enabling the generator to learn a probability distribution consistent with the target distribution under the influence of the discriminator. The details of the original GAN are as follows. During the adversarial process between G and D, random noise z is input into the generator to produce fake data G(z) in an attempt to fool the discriminator. The purpose of the discriminator is to distinguish the fake data G(z) generated by the generator from the real data x. The final goal is to make the fake data distribution p_G(z) obtained by the generator closer to the real data distribution p_data(x). The adversarial process of the generator G and discriminator D is defined as

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z\sim p_{z}(z)}\left[\log\left(1-D(G(z))\right)\right]$$

In this process, the generator and the discriminator promote each other's learning, continuously improving the generator's ability to fake and the discriminator's ability to discriminate. As the distributions of the fake data G(z) and the real data become closer and closer, the discriminator is no longer able to distinguish between fake and real data. At this point, the generator has successfully estimated the distribution of the real data.
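To make the minimax game concrete, the following is a minimal PyTorch sketch of one adversarial update. The toy fully connected networks, the 784-dimensional data, the learning rates, and the non-saturating generator objective are illustrative choices, not part of the original formulation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the original GAN game (Goodfellow et al. [25]).
# G maps noise z to fake data; D outputs the probability that its input is real.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(x_real):
    b = x_real.size(0)
    z = torch.randn(b, 64)
    # D-step: maximize log D(x) + log(1 - D(G(z)))
    x_fake = G(z).detach()
    loss_d = bce(D(x_real), torch.ones(b, 1)) + bce(D(x_fake), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # G-step: fool D (non-saturating form: maximize log D(G(z)))
    loss_g = bce(D(G(z)), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

gan_step(torch.randn(16, 784))  # toy usage with random "real" data
```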
Our work is based on GAN for infrared and visible image fusion, so it is necessary to introduce FusionGAN, [23] the first work to apply GAN to infrared and visible image fusion tasks. FusionGAN is a fusion method for infrared and visible images proposed by Ma et al. in 2019. Its purpose is to make the generator produce a fused image that contains not only the salient pixel information of the infrared image but also more texture of the visible image. The loss function of its generator is

$$L_G = \frac{1}{HW}\left(\left\|I_f - I_r\right\|_F^2 + \xi\left\|\nabla I_f - \nabla I_v\right\|_F^2\right) + \lambda\,\frac{1}{N}\sum_{n=1}^{N}\left(D\!\left(I_f^n\right) - c\right)^2$$

where the first term on the right side represents the content loss and the second term represents the adversarial loss. I_f, I_r, and I_v denote the fused image, the infrared image, and the visible image, respectively. H and W represent the height and width of the source image, ∇ is the gradient operator, and ‖·‖_F denotes the matrix Frobenius norm. N denotes the total number of fused images, ξ and λ are trade-off weights, and c is the probability label with which the discriminator is expected to classify the fake data generated by the generator as real.
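As an illustration, a minimal PyTorch sketch of FusionGAN's content loss might look as follows; the finite-difference gradient operator and the value of xi are our assumptions, since the surrounding text does not fix them.

```python
import torch
import torch.nn.functional as F

def gradient(img):
    """Simple finite-difference gradient magnitude; a stand-in for the
    gradient operator in FusionGAN's content loss."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return F.pad(dx, (0, 1, 0, 0)).abs() + F.pad(dy, (0, 0, 0, 1)).abs()

def fusiongan_content_loss(i_f, i_r, i_v, xi=5.0):
    """Sketch of the content loss: pixel closeness to the infrared image
    plus gradient closeness to the visible image. xi is a trade-off weight
    (a placeholder value, not the paper's setting)."""
    h, w = i_f.shape[-2:]
    pixel_term = torch.sum((i_f - i_r) ** 2) / (h * w)
    grad_term = torch.sum((gradient(i_f) - gradient(i_v)) ** 2) / (h * w)
    return pixel_term + xi * grad_term
```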
FusionGAN's loss function contains only the adversarial loss and the content loss, which causes the loss of much useful information during training. After FusionGAN, many GAN-based methods for infrared and visible image fusion appeared. For example, Fu et al. [26] designed dense connection blocks in the generator to improve the utilization of shallow features and preserve a large amount of texture information from visible images in the fused image. As research deepened, a single discriminator became unable to meet the requirements of image fusion. Since a GAN with a single discriminator can cause an imbalance of different modal information during training, Li et al. [27] proposed a GAN with dual discriminators and added an attention mechanism to the generator. This method not only better maintains the information balance between the infrared and visible images, but also, through the attention module, encourages the network to pay more attention to the regions that matter. However, in the above image fusion methods, how to effectively integrate the information of different modalities and balance the training between the generator and the discriminator remains a big challenge.

Expectation Maximization Algorithm in Image Processing
The expectation maximization (EM) algorithm is an iterative algorithm that computes the maximum likelihood solution for a model with latent variables. It alternates between an expectation step (E-Step) and a maximization step (M-Step): the E-Step computes the expectation of the latent variables, which the M-Step then uses to compute the maximum likelihood estimate. These two steps are repeated until convergence. The EM algorithm is commonly used in conventional methods, such as the infrared and visible image fusion method based on LatLRR and FPED proposed by Li et al. [28] In this method, the EM algorithm is applied to the fusion of high-frequency details to capture small differences in grayscale, so that the fused results retain more details. Although this method has excellent fusion performance, it still cannot avoid the adverse effects caused by the manual fusion rules of conventional methods.
With the emergence of deep learning, many scholars have introduced EM algorithms into neural networks. For the well-known Gaussian mixture model (GMM), the EM algorithm is used to approximate the parameters of the GMM. In the E-Step, each data point X_i is assigned a probability of belonging to each Gaussian component Y_j based on the current parameters, and in the M-Step the parameters are re-estimated by maximum likelihood estimation (MLE). Greff et al. [29] introduced the EM algorithm into deep learning. They proposed a differentiable clustering procedure combining neural networks and the EM algorithm, called neural expectation maximization (N-EM), which implements unsupervised segmentation.
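For reference, the classic GMM case can be written in a few lines of NumPy. This is a minimal 1-D sketch of the E-step/M-step alternation described above, not code from N-EM or from this paper.

```python
import numpy as np

def gmm_em(x, k, iters=50):
    """Minimal EM for a 1-D Gaussian mixture: the E-step computes soft
    responsibilities, the M-step re-estimates parameters by MLE."""
    n = len(x)
    mu = np.random.choice(x, k)              # random initial means
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of component j for sample i
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximum likelihood update given the responsibilities
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
    return mu, var, pi

# Toy usage: two clusters around 0 and 5
samples = np.concatenate([np.random.normal(0, 1, 500), np.random.normal(5, 1, 500)])
print(gmm_em(samples, k=2))
```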
However, in the field of visual image research, a large number of studies have shown that natural images are non-Gaussian in pixel intensity. [30] Therefore, GMM cannot be directly used in natural image processing. On this basis, Zhao et al. [31] optimized the EM algorithm. They proposed applying a GAN in the M-Step and performing maximum likelihood estimation by predicting the distribution of real samples in the E-Step. This method achieves results comparable to other models in clustering, semisupervised classification, and dimensionality reduction. However, it uses the original GAN, whose generator and discriminator structures are too simple to suit the image fusion task.
Inspired by the study of Zhao et al., [31] we introduce the GAN-based EM learning framework into the field of infrared and visible image fusion. Different from their study, because infrared and visible images carry different modal information, we design different multiscale attention networks for infrared and visible images in the generator to extract important feature information. At the same time, we use two discriminators to balance the information of visible and infrared images, and design a corresponding E-Net for each discriminator. The detailed model structure is described in Section 3.

Proposed Method
This section introduces our proposed EMA-GAN in detail. First, we describe the overall architecture of EMA-GAN. Second, the proposed network structure is introduced. Finally, the design details of the loss function are given.

Overview
The EM algorithm finds the maximum likelihood solution for a model with latent variables, and a generative adversarial network readily performs MLE. Based on these two points, we propose an EM learning framework based on GAN for infrared and visible image fusion. The framework is divided into two parts, the M-Step and the E-Step, as shown in Figure 2. The goal of the M-Step is to update the parameters of the GAN based on the soft class assignment w provided by the E-Step. We design two multiscale attention mechanisms in the generator. Their purpose is to obtain more comprehensive spatial information and help the generator focus on the foreground target information of the infrared image and the background details of the visible image. In the E-Step, since our model is unsupervised, the class label of the real data is regarded as a latent variable, which is estimated by the E-Net and used to guide our generator. We use two discriminators to avoid an imbalance of different modal information. Matching the dual discriminators, two E-Nets are designed, which have the same structure but do not share data. Their specific structure is shown in Table 1.
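The alternation between the two steps can be sketched as a training iteration roughly as follows. All component names (generator, d_ir, d_vis, e_net_ir, e_net_vis, and the loss helpers) are placeholders for the networks and losses defined later in this section, and the exact update order is our reading of Algorithm 1 rather than the paper's verbatim code.

```python
import torch

def ema_gan_iteration(generator, d_ir, d_vis, e_net_ir, e_net_vis,
                      opt_g, opt_d, opt_e,
                      d_loss, g_loss, e_loss, i_ir, i_vis):
    """One hedged M-Step/E-Step iteration (see Algorithm 1)."""
    # ----- M-Step: update GAN parameters given soft labels from the E-Nets.
    i_f = generator(i_ir, i_vis)
    with torch.no_grad():
        w_ir = e_net_ir(i_ir)      # soft class assignment for infrared data
        w_vis = e_net_vis(i_vis)   # soft class assignment for visible data
    loss_d = d_loss(d_ir(i_ir), d_ir(i_f.detach()), w_ir) + \
             d_loss(d_vis(i_vis), d_vis(i_f.detach()), w_vis)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    loss_g = g_loss(d_ir, d_vis, i_f, i_ir, i_vis)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # ----- E-Step: re-estimate the latent class labels from the fused image.
    loss_e = e_loss(e_net_ir(i_f.detach())) + e_loss(e_net_vis(i_f.detach()))
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()
```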
In the M-step, the GAN consists of a generator and two discriminators. It can be observed from Figure 1 that the infrared image contains more contrast information and partial structure information, while the visible image has richer texture information. Therefore, in the generator, two multiscale attention networks (an infrared multiscale attention network and a visible multiscale attention network) and an image fusion network are designed. The two multiscale attention networks generate attention maps for the infrared and visible images, respectively, which helps the fusion network pay more attention to the foreground target information in infrared images and to the background details and edge information in visible images. During training, the two discriminators in EMA-GAN, called D_i and D_v, distinguish the fused results from the infrared and visible images, respectively. This enables the final fused result to retain both infrared and visible image information. In the E-Step, we take the output of the generator as the input to the E-Net and the corresponding soft class assignment as the output. The entire training process iterates between the M-step and E-step until the generator and discriminators converge; the final fused image is then obtained by the trained generator. The training process of EMA-GAN is summarized in Algorithm 1.

The generator of EMA-GAN is shown in Figure 3. It contains a visible multiscale attention network, an infrared multiscale attention network, and an image fusion network. For visible images, capturing more texture and edge information helps retain more visible feature information in the fused image; we therefore design an axial-corner attention (ACA) module to extract important feature information from visible images.
In the feature extraction part of the visible image, two 3 × 3 convolutional layers are used to extract the feature F_vis of the visible image I_vis. Since single-scale features are not sufficient to extract the desired spatial information, we employ a multiscale mechanism. The multiscale features F^k_vis (k = 1, 2, 3) ∈ ℝ^{C×H×W} are obtained by global average pooling, and the enhanced feature F^e_vis ∈ ℝ^{C×H×W} with texture and edge information is then obtained by ACA. Finally, Att_vis ∈ ℝ^{C×H×W} is obtained by connecting the in-depth features of each scale channel-wise:

$$Att_{vis} = \mathcal{C}\left(H_{up}\left(ACA\left(F_{vis}^{1}\right)\right), H_{up}\left(ACA\left(F_{vis}^{2}\right)\right), H_{up}\left(ACA\left(F_{vis}^{3}\right)\right)\right)$$

where $\mathcal{C}(\cdot)$ denotes channel-wise concatenation, H_up represents the up-sampling operation, and ACA(·) represents the axial-corner attention operation. Different from visible images, we introduce the multifrequency attention (MFA) mechanism into the infrared multiscale attention network to highlight the significant information of infrared images by mining the relationships between features at different scales. As in the visible branch, multiscale features are obtained using convolution and pooling. Then six multifrequency attention blocks (MFAB) are used to obtain the enhanced feature F^e_ir ∈ ℝ^{C×H×W}, and up-sampling is used to make F^e_ir consistent with the original input size. Finally, to avoid the loss of useful information, Att_ir ∈ ℝ^{C×H×W} is obtained by connecting the in-depth features of each scale channel-wise:

$$Att_{ir} = \mathcal{C}\left(H_{up}\left(MFA\left(F_{ir}^{1}\right)\right), H_{up}\left(MFA\left(F_{ir}^{2}\right)\right), H_{up}\left(MFA\left(F_{ir}^{3}\right)\right)\right)$$

where MFA(·) represents the multifrequency attention operation.
The Att_ir and Att_vis obtained above are fed into the fusion network together with the source images, so that the fusion network can generate fused images that retain the significant information of the infrared image while also capturing the texture details of the visible image. A minimal sketch of one such multiscale attention branch is given below.
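In this PyTorch sketch, the channel width, the three pooling factors, and the use of plain average pooling are assumptions (the text states that pooling forms the scales but not the exact factors), and attn_block stands for either ACA(·) or MFA(·).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleAttentionBranch(nn.Module):
    """Sketch of one multiscale attention branch (visible or infrared).

    Two 3x3 convs extract F, three pooled scales give F^k, attn_block
    (standing in for ACA(.) or MFA(.)) enhances each scale, and the
    results are upsampled and concatenated channel-wise (Att).
    """
    def __init__(self, attn_block, in_ch=1, ch=32):
        super().__init__()
        self.extract = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.attn = attn_block

    def forward(self, x):
        f = self.extract(x)
        h, w = f.shape[-2:]
        outs = []
        for scale in (1, 2, 4):              # three scales; factors are assumed
            fk = F.avg_pool2d(f, scale) if scale > 1 else f
            fe = self.attn(fk)               # enhanced feature at this scale
            outs.append(F.interpolate(fe, size=(h, w), mode="bilinear",
                                      align_corners=False))
        return torch.cat(outs, dim=1)        # channel-wise concatenation
```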

Axial-Corner Attention Module
The ACA module integrates axial attention, corner attention, and channel attention; its specific structure is shown in Figure 4a. The multiscale feature F^k_vis is used as the input of ACA, and F^1_vis is obtained after a 1 × 1 convolution and fed into the axial-attention, corner-attention, and channel-attention modules, respectively.

Axial-Attention Module
The self-attention mechanism has been widely used in the field of computer vision because of its excellent ability to extract long-range semantic information. Given the input x ∈ ℝ^{C_in×H×W}, the output z_o ∈ ℝ^{C_out} of self-attention at position o = (i, j) can be expressed as

$$z_o = \sum_{p} \mathrm{softmax}_p\left(q_o^{\top} k_p\right) v_p$$

where p indicates all possible locations, q_o = W_q x_o, k_p = W_k x_p, and v_p = W_v x_p.
Here W_q, W_k ∈ ℝ^{C_q×C_in} and W_v ∈ ℝ^{C_out×C_in} are learnable matrices. However, the global position information in q, k, and v is not taken into account in the above equation. To solve this problem, Wang et al. [32] introduced a local constraint and positional embeddings into the self-attention mechanism. The optimized z_o is represented by the following formula

$$z_o = \sum_{p \in \mathcal{N}_{m\times m}(o)} \mathrm{softmax}_p\left(q_o^{\top} k_p + q_o^{\top} r_{p-o}^{q} + k_p^{\top} r_{p-o}^{k}\right)\left(v_p + r_{p-o}^{v}\right)$$

where N_{m×m}(o) is a square region with position o = (i, j) as the center and m as the side length, and r^q_{p−o}, r^k_{p−o} ∈ ℝ^{C_q} and r^v_{p−o} ∈ ℝ^{C_out} are positional embeddings that let z_o keep the original location information. The complexity of this formula is O(HWm²). To reduce the computational overhead, we introduce the axial attention mechanism:

$$z_o = \sum_{p \in \mathcal{N}_{1\times m}(o)} \mathrm{softmax}_p\left(q_o^{\top} k_p + q_o^{\top} r_{p-o}^{q} + k_p^{\top} r_{p-o}^{k}\right)\left(v_p + r_{p-o}^{v}\right)$$

This reduces the complexity from O(HWm²) to O(2HWm) by computing attention along the height and width axes separately. [17] The axial attention along the width axis (and analogously the height axis) is shown in Figure 4b. F^1_vis is processed by the axial attention mechanism to obtain F_AXI ∈ ℝ^{C×H×W}.
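The following is a simplified PyTorch sketch of single-head axial attention along one axis; positional embeddings and the local window m are omitted for brevity, so this illustrates only the factorized-attention idea, not the full formulation of Wang et al. [32]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttention(nn.Module):
    """Simplified single-head axial attention along one spatial axis.

    Full 2-D self-attention is factorized into two 1-D attentions
    (height, width), which is what reduces the cost from O(HW*m^2)
    to O(2HW*m). Positional embeddings are omitted in this sketch.
    """
    def __init__(self, in_ch, out_ch, axis="width"):
        super().__init__()
        self.axis = axis
        self.to_q = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.to_k = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.to_v = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):                      # x: (B, C_in, H, W)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        if self.axis == "width":               # attend along W within each row
            attn = torch.einsum("bchw,bchv->bhwv", q, k)   # (B, H, W, W)
            attn = F.softmax(attn, dim=-1)
            out = torch.einsum("bhwv,bchv->bchw", attn, v)
        else:                                  # attend along H within each column
            attn = torch.einsum("bchw,bcgw->bwhg", q, k)   # (B, W, H, H)
            attn = F.softmax(attn, dim=-1)
            out = torch.einsum("bwhg,bcgw->bchw", attn, v)
        return out
```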

Corner-Attention Module
Due to the gradients in different directions in different regions of an image, there are smooth regions, edge regions, and corner regions. Corner regions contain key information that controls edges and textures. [33] Inspired by this idea, we introduce a corner-attention module into the multiscale visible attention network. The feature F^1_vis is processed by the corner-attention module to obtain the corner attention map

$$F_{COA} = \beta\left(F_{vis}^{m} \odot F_{vis}^{2}\right) + F_{vis}^{1}$$

where F^2_vis ∈ ℝ^{C×H×W} is obtained from F^1_vis through a 1 × 1 convolutional layer, F^m_vis ∈ ℝ^{C×H×W} is obtained by processing F^1_vis with the Harris algorithm, and ⊙ denotes the Hadamard product. β is initialized to 0 and becomes larger as the network is trained.
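A possible PyTorch realization of the corner-attention map is sketched below; the Sobel-plus-box-filter Harris response and the 3 × 3 smoothing window are our assumptions, as the text only names the Harris algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def harris_response(x, k=0.04):
    """Per-channel Harris corner response det(M) - k * trace(M)^2.

    A hedged stand-in for the Harris step: gradients come from Sobel
    filters, and the structure tensor M is smoothed with a 3x3 box
    filter (average pooling).
    """
    sobel_x = x.new_tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(-1, -2)
    b, c, h, w = x.shape
    xr = x.reshape(b * c, 1, h, w)
    ix = F.conv2d(xr, sobel_x, padding=1)
    iy = F.conv2d(xr, sobel_y, padding=1)
    # Structure tensor entries, locally averaged
    ixx = F.avg_pool2d(ix * ix, 3, stride=1, padding=1)
    iyy = F.avg_pool2d(iy * iy, 3, stride=1, padding=1)
    ixy = F.avg_pool2d(ix * iy, 3, stride=1, padding=1)
    r = ixx * iyy - ixy ** 2 - k * (ixx + iyy) ** 2
    return r.reshape(b, c, h, w)

class CornerAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, 1)
        self.beta = nn.Parameter(torch.zeros(1))   # initialized to 0, learned

    def forward(self, f1):                         # f1: F^1_vis
        f2 = self.conv1x1(f1)                      # F^2_vis
        fm = harris_response(f1)                   # F^m_vis (corner map)
        return self.beta * (fm * f2) + f1          # F_COA
```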

Channel-Attention Module
The feature mapping of each channel in visible images is usually different. [34] By associating the feature information of each channel and assigning greater weights to channels with larger feature-mapping values, a feature map with stronger representation ability is obtained. Based on this idea, we use channel attention to capture the feature associations between different channels, as detailed in Figure 4a. First, F^1_vis is reshaped into F_1 ∈ ℝ^{C×(H×W)} and F_2 ∈ ℝ^{(H×W)×C}. Then matrix multiplication is applied to F_1 and F_2, and a softmax function is used to reflect the influence of channel i (i = 1, 2, ..., C) on channel j (j = 1, 2, ..., C):

$$\phi_{i,j} = \frac{\exp\left(\left(F_1 F_2\right)_{i,j}\right)}{\sum_{j=1}^{C} \exp\left(\left(F_1 F_2\right)_{i,j}\right)}$$

Finally, F_CHA ∈ ℝ^{C×H×W} can be expressed as

$$F_{CHA} = \mathrm{Reshape}\left(\Phi F_1\right)$$

where Φ ∈ ℝ^{C×C} is the matrix composed of the ϕ_{i,j}. The enhanced feature F^e_vis is obtained by element-wise addition of F_AXI, F_COA, and F_CHA over the channels.
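The channel-affinity computation reduces to two batched matrix products and a softmax; a minimal sketch, assuming no residual connection beyond what the equation above shows:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention via a C x C channel-affinity matrix (sketch)."""
    def forward(self, x):                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        f1 = x.reshape(b, c, h * w)           # F_1: (B, C, HW)
        f2 = f1.transpose(1, 2)               # F_2: (B, HW, C)
        phi = torch.softmax(f1 @ f2, dim=-1)  # (B, C, C) channel affinities
        out = (phi @ f1).reshape(b, c, h, w)  # reweighted features F_CHA
        return out
```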

Multifrequency Attention Module
Research shows that as network layers deepen, the frequency content of image features changes, which reduces the feature information of the source images. [35] Combining frequency features at different scales enhances the feature information of the image. In the MFA module, we use the Hadamard product to fuse the features of different network layers and then use a softmax function to obtain a weight map. The weight map is applied to the feature map of the infrared image, so that areas with higher pixel intensity occupy a larger proportion. The detailed structure of MFA is shown in Figure 5; it is composed of six MFABs. The multiscale feature F^k_ir is fed into three convolutional layers to obtain F^i_ir (i = 1, 2, 3) ∈ ℝ^{C×H×W}. A softmax function is applied to the Hadamard product of F^1_ir and F^2_ir to obtain the weight map. Finally, the Hadamard product of the weight map and F^3_ir yields the enhanced feature map

$$F_{ir}^{e} = \rho\left(\mathrm{Softmax}\left(F_{ir}^{1} \odot F_{ir}^{2}\right) \odot F_{ir}^{3}\right) + F_{ir}^{k}$$

where ρ is a weight parameter that is learned autonomously during training.
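A minimal sketch of one MFAB under these equations; the 3 × 3 kernel sizes and the spatial softmax axis are assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn

class MFAB(nn.Module):
    """Multifrequency attention block (sketch of one of the six MFABs)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.rho = nn.Parameter(torch.ones(1))   # learnable weight rho

    def forward(self, f_k):                      # f_k: multiscale feature F^k_ir
        f1, f2, f3 = self.conv1(f_k), self.conv2(f_k), self.conv3(f_k)
        b, c, h, w = f1.shape
        # Hadamard product of F1 and F2, softmax over spatial positions
        weight = torch.softmax((f1 * f2).reshape(b, c, -1), dim=-1).reshape(b, c, h, w)
        return self.rho * (weight * f3) + f_k    # enhanced feature F^e_ir
```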

Discriminator and E-Net
Our method includes two discriminators (D_i and D_v), as shown in Figure 2. The infrared discriminator D_i takes the fused image generated by the generator and the infrared source image I_i as input.
The first six layers of D_i are convolutional layers with a 3 × 3 kernel, and the seventh and eighth layers are linear layers. Batch normalization is used after the first through seventh layers, and LeakyReLU is used as the activation function of the first six convolutional layers. Finally, a sigmoid function is used as the output activation. The structure of the E-Net is the same as that of the discriminator, but they do not share data; the specific structure is shown in Table 1.
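A sketch of the discriminator/E-Net layout described above; the channel widths, strides, and linear-layer sizes are placeholders (the exact values are given in the paper's Table 1).

```python
import torch.nn as nn

def make_discriminator(in_ch=1, widths=(16, 32, 64, 128, 256, 256), feat_dim=256):
    """Six 3x3 conv layers with BN + LeakyReLU, then two linear layers
    and a sigmoid output, as described for D_i / D_v (hedged sketch)."""
    layers, prev = [], in_ch
    for w in widths:
        layers += [nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                   nn.BatchNorm2d(w),
                   nn.LeakyReLU(0.2, inplace=True)]
        prev = w
    layers += [nn.Flatten(),
               nn.Linear(prev * 2 * 2, feat_dim),  # assumes 112x112 inputs -> 2x2 maps
               nn.LeakyReLU(0.2, inplace=True),
               nn.Linear(feat_dim, 1),
               nn.Sigmoid()]
    return nn.Sequential(*layers)
```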

Loss Function of Generator
In deep-learning-based methods, updating the model parameters by backpropagation is a key part of training, and the loss function is an essential condition for parameter updating. Therefore, it is necessary to design a corresponding loss function to train our model. The loss function of the generator consists of three parts: the adversarial loss L_adv, the intensity loss L_intensity, and the structure loss L_ssim:

$$L_G = L_{adv} + \lambda L_{intensity} + \varphi L_{ssim}$$

where the parameters λ and φ are balance factors that control the trade-off between the different losses.

Adversarial Loss
The adversarial loss is composed of two parts. One part is the adversarial loss between G and D_i, which distinguishes the fused image from the infrared image. The other part is the adversarial loss between G and D_v, which distinguishes the fused image from the visible image. L_adv is defined as

$$L_{adv} = \mathbb{E}_{I_f \sim p_{I_f}}\left[\log\left(1 - D_i\left(I_f\right)\right)\right] + \mathbb{E}_{I_f \sim p_{I_f}}\left[\log\left(1 - D_v\left(I_f\right)\right)\right]$$

where I_f represents the fused image and p_{I_f} represents the data distribution of the fused image. Thermal radiation information and detailed texture information are reflected by pixel intensity and gradient changes, respectively. Therefore, by constraining the fused image with the source images, the pixel intensity of the infrared image can be well preserved and the loss of texture information of the visible image can be prevented.

Intensity Loss
The intensity loss pushes the generator to generate images whose data distribution is similar to that of the infrared image. L_intensity is designed as

$$L_{intensity} = \frac{1}{HW}\left\|I_f - I_{ir}\right\|_F^2$$

where H and W represent the height and width of the image, respectively, and I_f and I_ir denote the fused and infrared images, respectively.

Structure Loss
Structure loss is used to compensate for the structural information neglected by the intensity loss. L_ssim is defined as

$$L_{ssim} = 1 - SSIM\left(I_f, I_{vis}\right)$$

where I_vis denotes the visible image and SSIM(·) is the structural similarity measure, defined as

$$SSIM(x, y) = \frac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}$$

where μ represents the mean value, σ represents the standard deviation (σ_xy the covariance), and C_1 and C_2 are small constants that stabilize the division. A larger SSIM value means higher structural similarity between two images. Since we expect a large SSIM value, 1 − SSIM is adopted as our final structural similarity loss. [26]

Loss Functions of Discriminator and E-Net

As mentioned before, to ensure that unsupervised learning can better learn sample features, we use dual discriminators so that the intensity information of infrared images and the texture information of visible images can be learned at the same time. The loss functions L_{D_ir} and L_{D_vis} are defined as

$$L_{D_{ir}} = \frac{1}{N}\sum_{n=1}^{N}\left(D_i\left(I_{ir}^n\right) - E\left(I_{ir}^n\right)\right)^2 + \frac{1}{N}\sum_{n=1}^{N}\left(D_i\left(I_f^n\right) - a\right)^2$$

$$L_{D_{vis}} = \frac{1}{N}\sum_{n=1}^{N}\left(D_v\left(I_{vis}^n\right) - E\left(I_{vis}^n\right)\right)^2 + \frac{1}{N}\sum_{n=1}^{N}\left(D_v\left(I_f^n\right) - a\right)^2$$

where E(I) and the parameter a denote the labels of the infrared (visible) image I_ir (I_vis) and of the fused image I_f, respectively, and N denotes the number of input images. The discriminators are expected to identify the source images as real data and the fused image as fake data. In an unsupervised setting, the class labels of the real data cannot be observed and are treated as latent variables, so the goal of the E-Step is to estimate them. Specifically, as shown in Figure 2, we take the output of the generator, namely I_f, as the input to the E-Net. The loss function of the E-Net is

$$L_E = CE\left(E\left(I_f\right), h\right)$$

where CE(·) represents the cross-entropy function and h is the one-hot vector encoding of the source image.
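Putting the generator-side losses together, a hedged PyTorch sketch might read as follows. The SSIM implementation is taken from the pytorch_msssim package for illustration (any differentiable SSIM works), and the values of lam and phi are placeholders, not the paper's tuned settings.

```python
import torch
from pytorch_msssim import ssim  # one available differentiable SSIM

def generator_loss(d_i, d_v, i_f, i_ir, i_vis, lam=1.0, phi=1.0):
    """Sketch of L_G = L_adv + lambda * L_intensity + phi * L_ssim."""
    eps = 1e-8
    # Adversarial loss against both discriminators (log form, as above)
    l_adv = torch.log(1 - d_i(i_f) + eps).mean() + torch.log(1 - d_v(i_f) + eps).mean()
    # Intensity loss: squared Frobenius distance to the infrared image / (HW)
    l_intensity = torch.mean((i_f - i_ir) ** 2)
    # Structure loss: 1 - SSIM against the visible image
    l_ssim = 1 - ssim(i_f, i_vis, data_range=1.0)
    return l_adv + lam * l_intensity + phi * l_ssim
```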

Experimental Section
In this section, we evaluate our proposed method on public datasets and show the experimental results. First, the implementation details are given. Second, we introduce the comparative methods and evaluation metrics. We then compare our proposed method with state-of-the-art methods both subjectively and objectively. Finally, we provide an ablation experiment, which proves that the proposed modules are useful.

Implementation Details
We chose the RoadScene [36] and TNO [37] datasets as the training and testing sets for our model. The RoadScene dataset has 221 precisely aligned image pairs. The main scenes include streets, houses, vehicles, pedestrians, and traffic signs, among others. This is not enough to train a good model, so we cropped 211 of the pairs to expand our training set. We set the step size to 20 and the subimage size to 112 × 112, expanding the dataset to 53 069 image pairs. For testing, we used seven image pairs from the TNO dataset. The whole training process is summarized in Algorithm 1, and a cropping sketch is given below. All training and testing of EMA-GAN were carried out on an NVIDIA TITAN RTX GPU and an Intel i7-10700K CPU.
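For reproducibility, the sliding-window cropping described above can be written in a few lines of NumPy; discarding incomplete boundary windows is our assumption.

```python
import numpy as np

def crop_patches(img, size=112, stride=20):
    """Sliding-window cropping used to expand the training set (sketch).

    Assumes the image is at least size x size; windows that would run
    past the border are discarded.
    """
    h, w = img.shape[:2]
    patches = [img[i:i + size, j:j + size]
               for i in range(0, h - size + 1, stride)
               for j in range(0, w - size + 1, stride)]
    return np.stack(patches)
```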
In the quantitative analysis, we used six representative and commonly used evaluation metrics to evaluate the fused images of each method: feature mutual information (FMI), [47] which measures the feature information transferred from the source images to the fused image; the noise-based metric (N_abf), [48] for which a lower value indicates less noise in the fused image; structural similarity (SSIM), [49] which reflects the ability to retain the structural information of the source images, with a larger value indicating that the two images are more similar; the tone-mapped image quality index (TMQI), [50] which evaluates the pixel intensity and structure information retained in the fused image; the Chen-Varshney metric (Q_cv), [51] which measures the edge information and information similarity of the image; and the peak signal-to-noise ratio (PSNR), which reflects the distortion of image fusion by computing the ratio of peak power to noise power in the image.
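Of these metrics, PSNR is the simplest to state exactly; a minimal sketch, assuming 8-bit images with peak value 255:

```python
import numpy as np

def psnr(fused, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB: ratio of peak power to MSE."""
    mse = np.mean((fused.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```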

Ablation Study
Our proposed fusion model includes two different feature extraction modules.To verify the effectiveness of each module, we performed the following ablation experiments, as shown in Table 2.
We compared EMA-GAN with degraded versions of the generator without the ACA and MFA modules. The generator with only the multiscale networks is called baseline1, and the model that adds the ACA module is called baseline2.
A subjective evaluation of the ablation study of the feature extraction modules is shown in Figure 6. It can be observed that, under the influence of the ACA module, the texture and edge information in the fused results of Figure 6d is more prominent. Compared with Figure 6c, the tree texture in the first row of images is clearer and contains more detailed information, and the toilets and traffic signs in the second row have richer edge information. The fused results of Figure 6e and baseline2 do not show much visual difference. However, it can be seen from Table 3 that most of the metrics of Figure 6e are higher than those of baseline2, which indicates that our model is further improved after adding the MFA module.
The results of the objective evaluation are presented in Table 3, from which it can be observed that the metrics FMI, N_abf, SSIM, TMQI, and PSNR are all improved, indicating that our proposed modules can extract more information.

Qualitative Evaluation on TNO
We selected seven pairs of infrared and visible images from the TNO dataset for visual comparison. In terms of visual effect, our proposed method performs relatively well. It can be observed from Figure 7 that the GFCF and HMSD methods produce high brightness but do not consider the long-range semantic information of the source images, which makes the brightness of the houses in the fused images vary greatly and lack detail. The fused images of conventional methods such as FPDE, MGIVF, and TSSD are relatively blurry and lack details. The fused results of the IFCNN and NestFuse methods have low contrast and lose significant information from the source infrared images. In contrast, our proposed method makes the infrared target more prominent and complete because of the multifrequency attention mechanism. As can be seen from the fence in Figure 7, the fence in our result is more prominent and contains more detailed information. The fused results of the U2Fusion and RFN methods are good; the outline details of trees are well preserved, but the texture information of houses and the ground is insufficient. In fused images, it is important to balance the feature information of the source images. For example, although FPDE and NestFuse make the infrared information prominent in their fused results, the visible texture details are inadequate. This imbalance of feature information leads to disharmony in the visual effect of the fused images. In our method, however, the fused image not only has significant infrared targets, but also loses none of the detailed texture information of the visible image. The two kinds of feature information are relatively balanced, which makes our fused result less abrupt and more consistent with human visual perception.

Quantitative Evaluation on TNO
We use the six metrics FMI, N_abf, SSIM, TMQI, Q_cv, and PSNR to evaluate each method from different aspects.
Figure 8 shows the line-chart comparison between the state-of-the-art image fusion methods and our proposed method on seven pairs of images from the TNO dataset. The number after each method is the mean of the metric, with the red value indicating the best result and the blue value the second-best. From the statistical results, our proposed method achieves the best average value on the four metrics FMI, N_abf, TMQI, and PSNR. On the metric Q_cv, EMA-GAN lags behind only GFCF and achieves the second-best value. On the metric SSIM, our method also outperforms U2Fusion, RFN, and GFCF. These metrics show that EMA-GAN is able to preserve the features of the source images to a greater extent, and that fewer artifacts and less distortion are introduced during fusion. This gives the fused results of our method higher visual information fidelity.

Qualitative Evaluation on MSRS
We selected nine pairs of infrared and visible images from the MSRS dataset for visual comparison. As shown in Figure 9, our proposed method is visually superior to FusionGAN and GANMcC, both of which are GAN-based methods. It can be observed in Figure 9 that the fused results of FusionGAN and GANMcC are biased toward the infrared images: a lot of infrared feature information is retained in the fused result, but the texture information of the visible image is missing. This is a difficult problem for GAN-based methods, because GAN training is hard and the generator and discriminator often fail to reach a good balance. In comparison, our method not only retains sufficient infrared feature information but also preserves the texture information of the visible images well. In the third column of Figure 9, it can be observed that the fused results of EMA-GAN show significantly more bicycle details than those of FusionGAN and GANMcC. The fused results of the GFCF and HMSD methods have higher overall brightness, which makes the salient information of the infrared image less obvious. Most of the above methods perform well in only one aspect. For example, NestFuse retains the infrared feature information well but still misses part of the visible feature information, while GFCF preserves the texture information well but does not make the infrared feature information obvious. In summary, the fused results of EMA-GAN balance the information of these two different modalities well.

Quantitative Evaluation on MSRS
To demonstrate the excellent performance of EMA-GAN, Figure 10 shows the objective analysis results on nine infrared and visible image pairs from the MSRS dataset. EMA-GAN achieves the best average values on the FMI and N_abf metrics. This means that our model transfers more features from the source images to the fused image and introduces fewer artifacts in the process. Meanwhile, our method is second only to GANMcC and NestFuse on the TMQI and PSNR metrics, respectively. In addition, on the SSIM and Q_cv metrics, EMA-GAN also scores higher than most comparison methods.

Qualitative Evaluation on M3FD

We selected seven pairs of infrared and visible images from the M3FD dataset for visual comparison. As shown in Figure 11, the fused results of our method are better suited to human visual perception than those of the other methods. This is attributable not only to our infrared and visible feature extraction networks, but also to our training method based on the EM learning framework, which helps the network balance the information of the two modalities after extracting the infrared and visible features. In the first column of Figure 11, our result not only shows the house details and the people behind the smoke, but also enhances the texture information of the surrounding trees and grass. In the third column of Figure 11, the texture information in the fused results of EMA-GAN is significantly stronger than that of FusionGAN, for example in the trees. It can be seen that our fused results are superior to most methods in terms of visual effect.

Quantitative Evaluation on M3FD
We selected seven pairs of infrared and visible images from the M3FD dataset and analyzed them quantitatively; the results are shown in Figure 12. Our method achieves the best average values on the N_abf and TMQI metrics. On FMI and PSNR, our method also performs well, second only to HMSD and IFCNN, respectively. Quantitatively, our method preserves the texture information of the visible image and the salient features of the infrared image, achieving satisfactory results.

Conclusion
In this work, we propose an unsupervised infrared and visible image fusion model based on a generative adversarial network with the EM algorithm and multiscale attention mechanisms.
Considering the lack of labels in the infrared and visible image fusion task, an EM learning framework based on GAN is proposed. The GAN can better train the generator given the soft labels provided by the E-Net, so that the generator produces higher-quality fused images. For feature extraction, the axial-corner attention mechanism and the multifrequency attention mechanism are introduced, and the detailed textures of the visible image and the intensity information of the infrared image are fully extracted in the spatial, channel, and frequency domains. Moreover, the whole network is end-to-end, which greatly reduces the complexity of the model, and the use of dual discriminators makes the fused results more consistent with human visual standards. In the experiments, our method is compared with nine state-of-the-art methods, and the proposed EMA-GAN achieves excellent performance in both subjective and objective evaluation. The ablation experiments prove that our proposed ACA and MFA modules enhance the feature extraction ability over baseline1, and the model achieves the best performance when combining the ACA and MFA modules.

Figure 1 .
Figure 1. A demonstration of infrared and visible image fusion. a) Infrared image. b) Visible image. c) Fused image generated by EMA-GAN.

Figure 2 .
Figure 2. The framework of the EMA-GAN.


Figure 3 .
Figure 3. Architecture of the generator. ACA and MFA represent the axial-corner attention module and the multifrequency attention module, respectively. CNN: convolutional layer; ↑: up-sample operation; Ⓒ: concatenation operation.

Figure 4 .
Figure 4. a) Architecture of the ACA. b) The axial attention applied along the width axis. ⊕ means element-wise addition, ⊙ denotes dot product, ⊗ denotes matrix multiplication, and Ⓢ means softmax function.

Figure 8 .
Figure 8. Quantitative comparison of EMA-GAN with 11 state-of-the-art methods on the TNO dataset.

Figure 10 .
Figure 10. Quantitative comparison of EMA-GAN with 11 state-of-the-art methods on the MSRS dataset.

Figure 12 .
Figure 12. Quantitative comparison of EMA-GAN with 11 state-of-the-art methods on the M3FD dataset.

Table 1 .
The architecture of dual discriminator, E-Net, and fusion network.I, O, K, S, and P denote input channel, output channel, kernel size, stride size, and padding size, respectively.

Table 2 .
Fusion model structure for ablation study.

Table 3 .
Quantitative evaluation of the ablation study (average of seven images; ↑ means that a larger value is better, ↓ means that a smaller value is better; the data in bold indicates the best value).