Cross‐domain bearing fault diagnosis with refined composite multiscale fuzzy entropy and the self organizing fuzzy classifier

In this article, the use of refined composite multiscale fuzzy entropy (RCMFE) for cross-domain diagnosis of bearings is introduced and verified with two publicly available datasets of varying operating conditions, a factor that challenges the diagnostic ability of trained models. For classification, the self-organizing fuzzy (SOF) classifier is used. The diagnostic framework, which primarily involves only extracting RCMFE features and training the SOF classifier, is able to detect and isolate faults with over 97% accuracy when each class comprises a single fault type and size. Compared to related works, the proposed approach requires neither deep learning for feature extraction nor any domain adaptation technique, as the RCMFE feature is robust against changing operating conditions. Furthermore, the method does not need target domain data during training. With regard to fault isolation, when the classes in the training data contain all the available fault sizes instead of a single size, the classifier can distinguish inner race faults from outer race and ball faults with an average accuracy of 96%. However, the accuracy for differentiating ball and outer race faults falls slightly to an average of 86%. Thus, even for the latter arrangement, which poses a tougher transfer learning problem, the proposed approach still performs very well.

Key to the use of sensor data in condition monitoring has been the successful development of different types of diagnostic models using a variety of features extracted from condition monitoring data. 1,[3][4][5] Unfortunately, most data-driven classification algorithms are based on the assumption that the training and testing sets come from the same domain, where domain here connotes data and its associated distribution. This assumption has been invalidated in several applications such as computer vision and natural language processing, which have a long-standing use of data for classification and object recognition. 6 Specifically in bearing diagnostics, the change in distribution may be brought about by the method of fault inducement (for example, artificial fault versus real fault), a change in operating conditions, or a change in fault size and severity. Changes in operating conditions are easily understood from a practical perspective of the operations in a machine shop. As to the distribution change brought about by training on artificial data, models are usually trained on artificial data because sufficient real-damage data (ie, damage attained during actual operation of machinery) is rare. One reason for this scarcity is safety guidelines: in applications such as energy and electric transport, components are replaced after a predetermined number of operation cycles and not due to failure. 7 Artificial fault data is then left as the best recourse when developing fault detection algorithms. The artificial faults are seeded by drills, engravers, and electro-discharge machines. Unfortunately, as noted by Lessmeier et al, 8 actual bearing damage is complex because the development of fatigue damage or damage caused by solid particles is randomly influenced. Thus, true fault evolution remains largely unknown, and the production of artificial faults, even at different severities, may not wholly reflect real fault damage.
It is then feasible that the marginal distribution of artificial-fault data may differ from that of real-fault data. Indeed, the same authors 8 show that a classifier trained with artificial fault data does not generalize well to real faults. With regard to fault size or severity, it is highly probable that models trained with data of a particular fault size may not recognize data with differently sized faults, even though the latter is more likely to be the case for incoming test samples in a practical setting.
The problem of transferring the knowledge gained from models trained in one domain to similar tasks in another domain is known as the domain shift problem. 9 The branch of transfer learning that deals with transfer of knowledge from the source (training) domain to the target (testing) domain is known as domain adaptation (DA). 10 DA approaches can be roughly categorized into either feature-based methods or sample-based methods. 11 The former attempts to reduce distribution divergence by learning a new joint feature representation that is typically hidden and may be of reduced dimensions (subspace). 6,12 The latter methods use techniques that utilize source samples that are closer to the target distribution. This may involve re-weighting the source samples to give more importance to those that bridge the distribution gap (landmarks) or projecting the rest of the source samples to the landmarks and then learning a classifier on that subspace. 13,14 DA algorithms usually require complex objective functions combined from several loss functions and consequently, computationally demanding iteration is required to search for the best network architectures, for example, the number of layers and units in neural networks, the ideal weights, and the best hyper-parameters.
Regardless, several published works have reported excellent results in cross-domain fault diagnosis using DA techniques. 9,[15][16][17][18][19] On the other hand, entropy-based features, which are nonlinear parameter estimation methods, have been successfully employed in bearing fault diagnosis. This is because condition monitoring data such as vibration signals present nonlinear characteristics due to inherent nonlinearities such as friction, tolerance, and stiffness. 20,21 As applied in bearing fault diagnosis, entropy-based features indicate the degree of randomness or complexity of a signal, where the complexity noted here is in terms of meaningful structural richness. 22 A faulty bearing's signal is punctuated by periodic patterns every time the fault is contacted. Thus, there is more self-similarity in a faulty signal (ie, less structural complexity) than in a normal bearing's signal, which means that theoretically, the entropy of the latter signal should be higher. However, despite evidence that entropy-based features, especially in a multiscale approach, are unaffected by operating conditions, 23 there has not been much attempt to utilize entropy features for cross-domain diagnosis. Indeed, many of the published works utilizing entropy-based features for diagnosis assume the same distribution for the training and testing samples. 20,21,[23][24][25][26][27] In this work, we extend the use of entropy features to cross-domain diagnosis of bearing faults. We choose the refined composite multiscale fuzzy entropy (RCMFE) as the nonlinear feature due to its advantages over single-scale fuzzy entropy as well as sample entropy. 22,28 For classification, we use the recently proposed self-organizing fuzzy (SOF) classifier, whose antecedent rules check for the similarity of a test sample to class prototypes instead of using membership functions.
The prototypes of a class represent its distribution's multimodal nature 29 and are learnt in a self-evolving manner that is able to handle drifts and shifts in the data pattern. 30 The effectiveness of RCMFE for cross-domain fault diagnosis is proven using two publicly available datasets by first detecting a fault and then isolating its type. We show that this diagnostic method is a viable tool for cross-domain classification as it only requires extraction of RCMFE features and training of the SOF classifier, where only the distance metric for determining similarity requires selection. Further, it holds that for any particular machine, this diagnostic model, which uses training data from any one single operating condition, will generalize well to any other operating condition.
The remainder of this article is organized as follows. In Section 2, a theoretical review of RCMFE and the SOF classifier is given. In Section 3, the behavior of RCMFE for the two datasets is explored, while in Section 4, the methodology to be followed is outlined. In Section 5, the results are presented and discussed, while in Section 6, concluding remarks and future directions are given.

Refined composite multiscale fuzzy entropy
Given a one-dimensional time series x = {x_1, x_2, x_3, …, x_N} with N data points, template vectors u_i^m, whose lengths are equal to the embedding dimension m, are created by

$$u_i^m = \{x_i, x_{i+1}, \ldots, x_{i+m-1}\} - u_i^0, \qquad i = 1, 2, \ldots, N - m,$$

where u_i^0 is the average of the m elements in vector u_i^m. The distance between any two vectors u_i^m and u_j^m is computed using the Chebyshev distance

$$d_{ij}^m = \max_{k \in \{1, \ldots, m\}} \left| u_i^m(k) - u_j^m(k) \right|.$$

The degree of similarity D_ij^m of u_i^m and u_j^m is then computed using a fuzzy function μ(d_ij^m, n, r), where μ(d_ij^m, n, r) is the exponential function with parameters n and r determining its shape/boundary:

$$D_{ij}^m = \mu(d_{ij}^m, n, r) = \exp\!\left(-\frac{(d_{ij}^m)^n}{r}\right).$$

A function φ^m is then defined as

$$\varphi^m = \frac{1}{N-m} \sum_{i=1}^{N-m} \frac{1}{N-m-1} \sum_{\substack{j=1 \\ j \neq i}}^{N-m} D_{ij}^m.$$

The entire procedure is repeated for template vectors of length s = m + 1 such that φ^s is similarly given. The fuzzy entropy of the series is finally obtained as the negative logarithm of the ratio of φ^s to φ^m:

$$\mathrm{FuzzyEn}(x, m, n, r) = -\ln\frac{\varphi^s}{\varphi^m}.$$

The selection criteria of m, n, and r are given in References 23 and 28; in this work m = 2, n = 2, and r = 0.15 times the standard deviation of the time series, in accordance with the prevalent guidelines.
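The fuzzy entropy procedure above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the per-template mean removal, Chebyshev distance, exponential similarity, and the ratio of averaged similarities follow the definitions of u_i^m, d_ij^m, D_ij^m, and φ^m given in the text.

```python
import numpy as np

def fuzzy_entropy(x, m=2, n=2, r_factor=0.15):
    """Sketch of FuzzyEn(x, m, n, r); r = r_factor * std(x) per the cited guidelines."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = r_factor * np.std(x)

    def phi(k):
        # Mean-removed template vectors u_i^k, i = 1 .. N - k
        # (a simplification: N - k templates are used for both lengths)
        u = np.array([x[i:i + k] - np.mean(x[i:i + k]) for i in range(N - k)])
        # Chebyshev distance between every pair of templates
        d = np.max(np.abs(u[:, None, :] - u[None, :, :]), axis=2)
        # Exponential fuzzy similarity D_ij = exp(-d^n / r)
        D = np.exp(-(d ** n) / r)
        # Average similarity over distinct pairs (exclude i == j)
        np.fill_diagonal(D, 0.0)
        return D.sum() / ((N - k) * (N - k - 1))

    return -np.log(phi(m + 1) / phi(m))
```

A highly regular signal (eg, a pure sine) yields a ratio φ^s/φ^m near 1 and hence a small entropy, while white noise yields a markedly higher value, matching the interpretation of entropy as structural complexity.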
Machines are composed of many interacting components such as shafts, gears, and bearings. Consequently, condition monitoring data are bound to contain several oscillatory modes due to the interaction and coupling between the components, suggesting that analysis at a single scale is insufficient for exploring the data. 26 Thus, in machine fault detection, multiscale entropy features are usually computed. 20,21,[23][24][25][26][27] As first introduced by Costa et al, 31 a coarse-grained time series y^{(τ)} is obtained for any scale τ from the original time series by calculating the arithmetic mean of τ neighboring values without overlapping:

$$y_j^{(\tau)} = \frac{1}{\tau} \sum_{i=(j-1)\tau + 1}^{j\tau} x_i, \qquad 1 \le j \le N/\tau.$$
This process is depicted in Figure 1. The fuzzy entropy at a scale τ is FuzzyEn(y^{(τ)}, m, n, r). There are two main drawbacks of the aforementioned method of coarse graining. 22,32 The first is that the entropy thus computed lacks symmetry in its dependence on the original series. For instance, although the measure should behave similarly for x3 and x4 as compared to x1 and x2 (see Figure 1), for scale 3, x1, x2, and x3 are separated from x4, x5, and x6, thus breaking the symmetry. Second, because for an N-point time series the length of the coarse-grained time series at scale factor τ is equal to N/τ, this length may not be sufficient for accurate calculation of fuzzy entropy when τ is large. In addition, there may be no matching templates, meaning Equation (7) is undefined. Inaccurate and undefined fuzzy entropy compromises the reliability of the method. Improved methods such as composite and refined composite coarse graining have subsequently been put forward.
Refined composite coarse graining for calculation of multiscale sample entropy was introduced in Reference 32 and extended for fuzzy entropy in Reference 22. The composite coarse graining procedure is depicted in Figure 2 where it can be seen that coarse graining at any scale results in more than one new time series. For instance at scale 2, two coarse grained series are obtained while at scale 3, three coarse grained series are created.
To compute the fuzzy entropy at a scale τ, the functions φ^m and φ^s are computed for all the coarse-grained time series at that scale; that is, at scale τ = 2, φ^m_{2,1}, φ^m_{2,2}, φ^s_{2,1}, and φ^s_{2,2} are computed and their averages, φ̄^m_τ and φ̄^s_τ, also calculated. The fuzzy entropy computed at scale τ is now referred to as the RCMFE:

$$\mathrm{RCMFE}(x, \tau, m, n, r) = -\ln\frac{\bar{\varphi}^s_\tau}{\bar{\varphi}^m_\tau}.$$

Equation (9) is always defined unless all φ^m_{τ,j} and all φ^s_{τ,j} are zero. The length of the original data series N is chosen as 4000, which is sufficient for RCMFE calculation. 23
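The refined composite procedure can be sketched as follows. This illustration (not the authors' code) forms the τ shifted coarse-grained series at each scale, averages φ^m and φ^{m+1} over them, and takes the negative log of the ratio of the averages; fixing r from the standard deviation of the original series is an assumption of this sketch.

```python
import numpy as np

def rcmfe(x, scales=range(1, 21), m=2, n=2, r_factor=0.15):
    """Sketch of refined composite multiscale fuzzy entropy over the given scales."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)  # r fixed from the original series (assumption)

    def phi(series, k):
        N = len(series)
        u = np.array([series[i:i + k] - series[i:i + k].mean()
                      for i in range(N - k)])
        d = np.max(np.abs(u[:, None, :] - u[None, :, :]), axis=2)
        D = np.exp(-(d ** n) / r)
        np.fill_diagonal(D, 0.0)
        return D.sum() / ((N - k) * (N - k - 1))

    entropies = []
    for tau in scales:
        phis_m, phis_s = [], []
        for j in range(tau):  # the tau shifted coarse-grained series
            cg = x[j:]
            L = len(cg) // tau
            cg = cg[:L * tau].reshape(L, tau).mean(axis=1)
            phis_m.append(phi(cg, m))
            phis_s.append(phi(cg, m + 1))
        # Average phi values first, then take the negative log of the ratio
        entropies.append(-np.log(np.mean(phis_s) / np.mean(phis_m)))
    return np.array(entropies)
```

Averaging the φ values before taking the logarithm, rather than averaging the entropies of each shifted series, is what keeps the estimate defined even when some individual coarse-grained series have no matching templates.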

Self organizing fuzzy classifier
The SOF classifier based on the AnYa antecedent was used for cross-domain classification. 29,33 This antecedent differs from the traditional Zadeh-Mamdani and Takagi-Sugeno ones in that it is not constructed from membership functions. Instead, a sample is checked for its similarity with the prototypes p_i^c that represent the multimodal nature of the training data in each class c.
For each class c with a total of N_c prototypes, N_c fuzzy rules, one per prototype, are formulated as

$$\text{IF } \left(x \sim p_i^c\right) \text{ THEN } \left(\text{class } c\right), \qquad i = 1, 2, \ldots, N_c,$$

where ∼ is the similarity operator and is analogous to the degree of membership in alternative antecedents.
To classify a sample x as belonging to one of C classes, the firing strength of the cth class (c = 1, 2, 3, …, C) is computed from the closest prototype as

$$\lambda_c(x) = \max_{i = 1, \ldots, N_c} \exp\!\left(-d^2\!\left(x, p_i^c\right)\right),$$

where d(x, p) is the distance or similarity between x and p and may be any one of the common distance functions such as Euclidean, Cosine, Mahalanobis, and so on. The final label is assigned using a "winner takes all" strategy according to

$$\hat{c} = \arg\max_{c = 1, \ldots, C} \lambda_c(x).$$

The following section outlines how the training data are divided into the most representative local modes, each characterized by a prototype.
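The inference step can be sketched in a few lines. This is an illustrative stand-in, not the reference SOF implementation: the Euclidean distance and the exp(−d²) firing strength are assumed choices, and `prototypes` is a hypothetical mapping from class label to an array of learnt prototype vectors.

```python
import numpy as np

def sof_predict(x, prototypes):
    """Winner-takes-all classification of x against per-class prototypes.

    prototypes: dict mapping class label -> (N_c x dim) array of prototypes.
    """
    strengths = {}
    for c, P in prototypes.items():
        d2 = np.sum((P - x) ** 2, axis=1)   # squared distance to each prototype
        strengths[c] = np.exp(-d2.min())    # firing strength of class c
    return max(strengths, key=strengths.get)  # winner takes all
```

For example, with two prototypes for the healthy class and one for the faulty class, a sample near a healthy prototype is labelled healthy regardless of how many prototypes each class holds.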

Empirical data characterization
In their work, 30 Angelov and Gu propose a data-driven method of characterizing empirical data without imposing any restrictive assumptions, for example, presuming the data are Gaussian. Since distribution of real data is often complex and multimodal, the authors define several measures to uncover these local ensembles purely from data.
1. Cumulative proximity, the sum of squared distances of a sample from all other samples,

$$q(x_i) = \sum_{j=1}^{K} d^2(x_i, x_j),$$

where d(x_i, x_j) is a function computing the distance between x_i and x_j; the shorter this distance, the more similar two samples are. x can be of any dimension, that is, a sample can have any number of features, and K is the number of samples in the dataset. This metric provides information about the centrality of a sample.

2. Eccentricity, the normalized cumulative proximity, which is useful in capturing the properties of samples at the tail ends of a distribution, that is, samples far away from the peak,

$$\xi(x_i) = \frac{q(x_i)}{\frac{1}{K}\sum_{j=1}^{K} q(x_j)}.$$

The 1/K in the denominator is used to standardize eccentricity to prevent its tendency to zero as K grows large.

3. Unimodal data density, a measure describing the mutual proximity of samples, and thus the inverse of eccentricity,

$$D(x_i) = \frac{1}{\xi(x_i)} = \frac{\sum_{j=1}^{K} q(x_j)}{K\, q(x_i)}.$$

From the denominator, it is seen that this measure is inversely proportional to the sum of distances between a sample and all others. Obviously, the closer a sample is to the global mean, the higher its data density, that is, the more samples surround it.

4. Multimodal density. In the data, a sample may be observed repeatedly, and in this case the set of unique samples is denoted by {U}_T = {u_1, u_2, u_3, …, u_T}, where each distinct value is only recorded once and T is the number of unique samples. The corresponding frequencies of the unique samples in the data are denoted by f_i. Multimodal density is then defined for each unique sample u_i in {U}_T as the product of the unimodal density of the unique sample and its frequency f_i:

$$D_{MM}(u_i) = f_i\, D(u_i).$$
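The four measures can be computed directly from a sample matrix. This sketch uses the squared Euclidean distance and the formulas as reconstructed above (an assumption where the extracted text was garbled); the function name and dictionary return layout are illustrative.

```python
import numpy as np

def empirical_measures(X):
    """Data-centric measures of a (K x dim) sample matrix X."""
    X = np.asarray(X, dtype=float)
    K = len(X)
    # Pairwise squared Euclidean distances
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    q = d2.sum(axis=1)            # 1. cumulative proximity
    xi = q / (q.sum() / K)        # 2. eccentricity (1/K standardized)
    D = 1.0 / xi                  # 3. unimodal data density
    # 4. multimodal density: frequency-weighted density of unique samples
    uniq, idx, f = np.unique(X, axis=0, return_index=True, return_counts=True)
    D_mm = f * D[idx]
    return {"q": q, "xi": xi, "D": D, "unique": uniq, "f": f, "D_mm": D_mm}
```

As a sanity check, the sample nearest the bulk of the data attains the highest unimodal density, and the eccentricities of a dataset sum to K under this standardization.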
The described data-centric measures are then used to find the local modes in a given dataset. In the case of classification, the modes are learned individually for each class c. Let the K_c training samples of the cth class (c = 1, 2, 3, …, C) be denoted by {x}^c and their unique values by {U}^c_T. 1. Compute the multimodal density of each unique sample in {U}^c_T. 2. Select the unique sample with the highest multimodal density as the first element r_1 of the ranking vector r. 3. From the remaining samples in {U}^c_T, remove the sample with the shortest distance to r_1 and let it be the second element r_2 in the ranking vector r. Similarly, r_3 is the element with the shortest distance to r_2. This process is repeated until {U}^c_T is empty. 4. Initial prototypes {p}^c_0 are found from the local maxima of the multimodal densities ranked in r according to Condition 1.
5. Once the initial prototypes are found, they are used to attract the data samples surrounding them, forming data clouds which represent the local modes present in the data. The clouds are analogous to clusters except for the fact that they are nonparametric and do not conform to any particular shape. A point x_i is assigned to the closest prototype according to

$$p^{*} = \arg\min_{p \in \{p\}^c_0} d(x_i, p).$$

6. The initial data clouds must then be filtered to retain only the larger and more representative clouds in order to improve the generalization ability of the model. This corresponds to increasing the level of granularity. The higher the granularity, the more information about the local modes the model is able to capture. The very first prototypes computed are at the zeroth level of granularity, hence the subscript 0 in {p}^c_0. The filtering process proceeds as follows. a. Calculate the centers of all the data clouds z_i ∈ {z}^c_0. b. Calculate the multimodal density at each center z_i as the product of its unimodal density and the support (number of elements) S_i of the particular cloud:

$$D_{MM}(z_i) = S_i\, D(z_i).$$
c. Calculate the average radius of local influence G_{c,L} of each prototype at granularity level L, computed recursively from d̄_c, the average distance between any two data samples in {x}^c, and Q_{c,L}, the number of pairs of data samples in {x}^c between which the distance is smaller than G_{c,L-1}. d. For each data cloud center, for example the ith one, find the set {z}^{N+}_i composed of the centers of its neighboring clouds through Condition 2.
e. The prototypes at the Lth granular level for class c are finally chosen as per Condition 3.
In this work, the granularization level is set at 12.
With the prototypes for each of the C classes, the antecedents in Equation (10) can be created.

EXPERIMENTAL DATA ANALYSIS
In this section, RCMFE behavior of normal and faulty data from two publicly available datasets is explored.

Experiment 1
The first dataset used is courtesy of the Case Western Reserve University (CWRU) bearing data center. 34 It consists of vibration data from normal/healthy bearings as well as from faulty bearings with artificially seeded single-point faults. Although faulty bearing data are sampled at both 12 and 48 kHz, only the former set from the drive end was used, as the baseline healthy data were only sampled at 12 kHz. 27 Of particular interest in this work is that the CWRU dataset was recorded for four operating conditions related to motor load and speed, thus constituting four different data domains. Further, the faults are seeded at four different diameters, which also introduces a shift in data distribution. There are three types of fault in the data for all the operating conditions, that is, outer race (OR), inner race (IR), and ball, all of which have data corresponding to the four fault sizes with the exception of the OR data, which lacks the 0.711 mm diameter. The organization of the data is listed in Table 1. Only OR measurements at the 6 o'clock position were used since there was no mechanism to convert the dynamometer torque into radial load borne by the bearings, and thus the only effective radial load was gravity. 35 Figure 3 shows some of the raw vibration signals of normal and faulty bearings from the CWRU dataset. Although the waveforms show some general differences, the raw data are not sufficiently distinguishable.
The RCMFE of healthy and faulty bearings is shown in Figure 4. Especially when viewed at higher scales, the RCMFE of faulty bearings is lower than that of healthy ones, which fluctuates around a constant value. This observation is expected as a fault produces a periodic impact in the vibration signal every time it is encountered, thus increasing the regularity of the data or, inversely, reducing its structural complexity, hence the diminished RCMFE.
On the other hand, a healthy bearing's RCMFE fluctuates around a constant value across the higher scales, indicating both randomness in the data and richness in terms of the information contained about the machine. RCMFE behaves consistently across operating conditions, indicating its robustness as a feature when conditions are changing.
RCMFE values are more volatile when compared for different fault sizes, but the behavior is still consistent scale-wise, as seen in Figure 5. Because in practice a test sample may have any diameter of fault within some range, the classifier's performance will be boosted by having data from all the available fault sizes present in the training set. Figure 5 also seems to indicate that although the IR fault is readily distinguished from the other two faults, OR and ball fault data are less obviously differentiable. We therefore propose performing the diagnosis in two stages. The first stage

F I G U R E 5 RCMFE values for different fault sizes for CWRU data
will be concerned with fault detection, and thus a binary classifier will be trained on healthy and faulty data since, from Figure 4, the two groups are readily differentiable. The training data will be extracted from a single operating condition and tested against features from all the other conditions, as the RCMFE seems to be unaffected by operating conditions. In the second stage, where fault isolation is performed, several classifiers will be built: the first will be a multiclass classifier trained on data from the three faults. From Figure 5, it is expected that this classifier will have high recall for the IR fault, that is, this fault will rarely be confused for the other two classes and vice versa. The implication is that if a fault is identified as IR, the verdict has a high probability of being correct. However, since OR and ball faults are likely to be confused, once a sample is classified as either of the two, it will be passed through a binary classifier trained exclusively on ball and OR data in an attempt to raise accuracy.
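The two-stage cascade just described can be sketched as a small routine. The classifier objects here are stand-ins with a scikit-learn style `predict` method (an assumption for illustration; in the article the classifiers are trained SOF models), and the function and argument names are hypothetical.

```python
import numpy as np

def two_stage_diagnosis(rcmfe_feat, detector, isolator, or_ball_clf):
    """Two-stage diagnosis of a single RCMFE feature vector.

    detector:    binary healthy/faulty classifier (stage 1)
    isolator:    multiclass IR/OR/ball classifier (stage 2)
    or_ball_clf: binary OR/ball classifier used to refine confusable labels
    """
    x = np.asarray(rcmfe_feat).reshape(1, -1)
    if detector.predict(x)[0] == "healthy":   # stage 1: fault detection
        return "healthy"
    label = isolator.predict(x)[0]            # stage 2: fault isolation
    if label in ("OR", "ball"):               # refine the confusable pair
        label = or_ball_clf.predict(x)[0]
    return label
```

An IR verdict from the multiclass isolator is accepted directly (high recall expected), while an OR or ball verdict is re-examined by the dedicated binary classifier.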

Experiment 2
The second dataset consists of bearing diagnosis vibration data collected under time-varying speed conditions, courtesy of the University of Ottawa. 36 The operating conditions in this dataset are time-varying throughout the recording of the vibration signal. The variations are as follows: increasing speed (Cond 1), decreasing speed (Cond 2), increasing then decreasing speed (Cond 3), and decreasing then increasing speed (Cond 4). The speed ranges spanned are given in the dataset's documentation. The data are categorized into healthy, OR, and IR classes. For each operating setting and class, three trials were conducted to increase authenticity, resulting in a total of 36 files. Figure 6 shows the first few milliseconds of some of the raw waveforms of healthy data as well as IR and OR fault data for the four operating conditions. RCMFE values are shown in the plots of Figure 7, with healthy bearing values fluctuating around a generally constant value at the higher scales while those of faulty data are monotonically decreasing.
Although Figure 7 implies good separability of the classes, we still choose the two-stage approach where fault detection is first performed, followed by fault isolation.

METHODOLOGY
From the insight gathered from Figures 4, 5, and 7, the diagnostic process was proposed as follows. 8. For each of the four source domains, SOF classifiers were trained and used to categorize the test samples.
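The cross-domain evaluation protocol (train on one source condition, test on all others) can be sketched as follows. `NearestPrototype` is a deliberately simplified stand-in for the trained SOF classifier (a single mean prototype per class rather than a self-organized set), and the dictionary layout of `features` is an assumption for illustration.

```python
import numpy as np

class NearestPrototype:
    """Minimal stand-in for a trained SOF classifier: one mean prototype per class."""
    def fit(self, X, y):
        self.labels = np.unique(y)
        self.protos = np.array([X[y == c].mean(axis=0) for c in self.labels])
        return self

    def predict(self, X):
        # Assign each sample to the class of its nearest prototype
        d = np.linalg.norm(X[:, None, :] - self.protos[None, :, :], axis=2)
        return self.labels[d.argmin(axis=1)]

def cross_domain_accuracy(features):
    """features: dict domain -> (X, y). Train on each domain, test on the rest."""
    acc = {}
    for src in features:
        Xs, ys = features[src]
        clf = NearestPrototype().fit(np.asarray(Xs), np.asarray(ys))
        Xt = np.vstack([features[d][0] for d in features if d != src])
        yt = np.concatenate([features[d][1] for d in features if d != src])
        acc[src] = float(np.mean(clf.predict(Xt) == yt))
    return acc
```

Each column of the result tables in the article corresponds to one pass of this loop: the header names the source/training domain, and the score is computed on the pooled remaining domains.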
The work was carried out in the MATLAB environment on a computer with 16 GB of RAM and an Intel Core i7 processor.

RESULTS AND DISCUSSION
The CWRU and Ottawa datasets were used to verify the proposed use of RCMFE for cross-domain classification, and the results closely follow the insights drawn from Figures 4, 5, and 7. Table 2 shows that in the CWRU dataset, the healthy and faulty data are perfectly separable using RCMFE, with 100% accuracy on all the test data regardless of which source domain was used in training. The column headers indicate the source domain of the training data, while the test data include data from all other conditions not used for training.
For fault isolation, a preliminary classifier was trained on all three fault classes, that is, IR, OR, and ball (Table 3). Just as indicated by Figure 5, the IR fault is perfectly separable from the other two faults. Furthermore, IR is never confused for OR and vice versa. However, since some ball faults are categorized as IR faults, as seen in the last rows of the confusion matrices in Table 3, it is worthwhile to train a binary classifier on IR and ball fault data. The accuracy of the IR/ball classifier is shown in Table 4, where it is seen that the two faults exhibit good separability.
The entries of the matrices of Table 3 concerning OR and ball faults confirm that the two are frequently confused for each other as intimated by Figure 5. Thus, samples categorized as either fault should be passed through a binary classifier trained on ball and OR data only. Table 5 shows the accuracy of such classifiers.
In the end, the accuracy of label assignment to a sample is then an average of the accuracies of all the classifiers the sample passes through. From Figure 5, it was expected that differentiating ball and OR faults would be more challenging than either IR versus ball or IR versus OR. In the figure, RCMFE values for OR and ball data from different fault sizes divide themselves into a high or a low range. Further, the fault size that ends up in the high range differs for the two faults; for example, in CWRU 1, the 0.355 mm diameter fault is in the higher range for the OR fault but in the low range for the ball fault. On the other hand, the 0.177 mm and 0.533 mm diameter sizes, which are in the higher range for the ball fault, are in the lower range for the OR fault. For the rest of the operating conditions, the 0.355 mm diameter is always in the high range for the OR fault while the other two sizes remain in the lower range. For the ball fault, in CWRU 2, the 0.177 mm and 0.711 mm diameter sizes are in the higher range and the rest in the lower range. For CWRU 3 and CWRU 4 ball faults, the 0.177 mm and 0.533 mm diameter sizes are in the higher range but the 0.355 mm diameter size steadily climbs from the low range to the middle. The 0.711 mm diameter remains in the low range. By observing this behavior across the operating conditions, it is probable that the segmentation of fault sizes in the ball and OR faults is a peculiarity of the CWRU setup, and it is thus possible that the phenomenon may lessen in another machine.
The RCMFE features result in high accuracy in the two stages for the Ottawa dataset, as seen in the confusion matrices of Tables 6 and 7. As observed in the CWRU dataset, healthy and faulty data are easily distinguishable, as are IR and OR data.
Thus, RCMFE features are consistent in detecting faults and in at least differentiating IR and OR faults, as verified with the CWRU and Ottawa datasets. The features also perform well in distinguishing IR and ball faults and only suffer a drop in performance when differentiating ball and OR faults.

Comparison with related works
In published works that have tackled cross-domain diagnosis with the CWRU dataset, the domains are constituted by operating conditions, as is the case in this article. However, despite the differences in the methods applied to feature extraction and DA, a key similarity in the literature is that each fault size is considered a separate class, with the exception of the 0.711 mm diameter fault, which is mostly omitted. Thus, there are usually a total of 10 classes, with the remaining three fault sizes and three fault types contributing nine classes and the healthy class bringing the number of categories to 10. As a result of preliminary investigations on the dataset, shown in Figures 4 and 5, it is seen that the RCMFE features are very robust against operating conditions and that the challenge in diagnosis would be presented by the varying fault sizes even in the same operating condition/domain. Therefore, the results presented in Section 5 focused on the case where each class was constituted by a single fault type but with at least three of the fault sizes available for that particular fault type. This is because in the OR fault data the 0.711 mm diameter fault is missing from the data repository, and appropriate adjustments had to be made to avoid class imbalance. In order to compare the method proposed in this article with other works, data were similarly prepared such that a class consisted of a single fault type and single fault size. Table 8 shows some of the results for various fault sizes for the latter arrangement. Once again, the condition shown in the column header is the training domain, while the test data were drawn from the other three domains/operating conditions. The results are very promising, as noted by the high accuracy of classification. The good separability of the classes when only a single fault size is considered was expected, as depicted in Figure 4.
In Table 9, a comparison with other works that have used CWRU data is given. The comparison is made in terms of the method of feature generation, the DA technique, the total number of classes considered, the fault size per class, average accuracy, and whether the target domain data are required during training. The subscript * in the proposed approach indicates the arrangement of one fault size per class. Total direct comparability is not possible due to some minor differences. For instance, Li et al 19 only presented results considering CWRU 1 and CWRU 4 as the source and target domains, while in the rest of the entries, each condition was considered as the source in its own turn. Also, in the reported literature, the 0.711 mm diameter fault was entirely omitted from analysis. In the approach reported here, diagnosis was divided into two stages, that is, fault detection followed by isolation, while in the rest of the entries in Table 9, the diagnosis process was performed in one stage. Therefore, the accuracy of the proposed method is an average of the accuracies from the two stages. The number of classes is listed as 3 for this article because separate classifiers were built for each of the fault sizes; thus the classes only consisted of the three fault types for a particular size.
As is seen from Table 9, in order to handle cross-domain diagnosis, most of the researchers have adopted deep learning for feature extraction, as indicated by the use of CNN. Also, MMD, which is an optimization problem solved by iteration, is commonly used to reduce distribution discrepancy. With the exception of Zhang et al, 16 the other works require that all or part (Li et al 9 ) of the target domain data be present during training. This means the learning process must be repeated for every new target domain. In practical implementation, categorizing each fault size in an individual class does not benefit generalization. This is because, rather than an unseen test sample having a fault size exactly equal to one of those used for training, it is more probable that the fault will be in the range of the fault sizes used for training. Thus, in this work, we opted for the harder transfer learning problem where classes were based on the type of fault only and the training data included several fault sizes. The more challenging nature of the latter arrangement is clear from a comparison of Figures 4 and 5.

CONCLUSION
In this article, the use of RCMFE features for cross-domain classification has been introduced and its effectiveness verified using the CWRU dataset, where data were collected for different operating conditions, and the Ottawa dataset, where data were collected under time-varying operating conditions. Because it is impractical to know and have data from all possible operating conditions, the proposed diagnostic procedure only uses training data from a single condition and has been proven to perform well against test data from unseen operating conditions. The approach is straightforward and computationally desirable since, primarily, it only involves extracting RCMFE features followed by classification. It takes 2.4 seconds to compute fuzzy entropy at 20 scales for a 4000-point time series, and classification using the SOF classifier takes between 0.2 and 0.5 seconds. This method compares favorably with deep learning DA approaches, which require nontrivial manipulation of data before training the classifiers, for example, reducing the difference in distributions of the source and target or learning a subspace wherein features are transferrable across domains. In contrast, our approach only requires extracting RCMFE from data. The classification task is divided into two stages, with fault detection being followed by fault isolation. Performance is excellent in the former stage but drops slightly in fault isolation due to the fact that ball and OR faults are not very distinguishable when data from all available fault sizes are used for training. By considering several fault sizes simultaneously for training, our work is unlike other published works, which consider each fault type and size as a separate class.
We argue that although the latter arrangement results in high classification accuracy using deep learning approaches as well as the proposed approach, it is not practical, as it is unlikely that a test sample will have a fault size exactly equal to one of those used in training. Rather, it is more probable that the unseen size will be in the range of the sizes used for training. In the case of the CWRU dataset, where ball and OR faults are less distinguishable when several fault sizes are combined in training, an avenue of further investigation is opened. Another future direction is to consider using the data from all recording channels available for diagnosis. For instance, CWRU data are collected from the fan end and drive end, and in some cases an accelerometer was placed on the base of the machine, making at least two channels available for all the files in the repository. RCMFE has already been used in a multichannel approach in other fields, 37 and its suitability for cross-domain bearing diagnosis should be explored.

AUTHOR CONTRIBUTIONS

Jackson G. Githu: Conceptualization-Supporting, Project administration-Equal, Supervision-Supporting, Writing-review and editing-Supporting.

DATA AVAILABILITY STATEMENT
The two datasets used in this work are publicly available online at https://csegroups.case.edu/bearingdatacenter/pages/ download-data-file for the CWRU data, and https://data.mendeley.com/datasets/v43hmbwxpm/1 for the Ottawa dataset.