A critical analysis of the bat algorithm

This article presents an analysis of the bat algorithm (BA) based on elementary mathematical analysis and statistical comparisons of the first hitting time performance metric distributions obtained on a test set comprising five carefully selected objective functions. The findings show that the BA is not an original contribution to the metaheuristics literature and that it is not generally superior to the Particle Swarm Optimization algorithm when fair comparisons are made. It is also shown that some components of the BA can be either replaced by simpler alternatives or be removed entirely to increase performance. Finally, the results suggest that the best version of the BA is in fact a simple hybrid between Particle Swarm Optimization and Simulated Annealing. To encourage more transparency in metaheuristics research, the entirety of the MATLAB code used in this article is available in a GitHub repository for suggestions and/or corrections.


INTRODUCTION
Metaheuristics are universal stochastic optimization algorithms used to solve problems where objective functions are unknown or infeasible to solve with limited computational resources. Such problems include the Quadratic Assignment and Traveling Salesman problems which are well known in the Operations Research community. More recent examples include the training of convolutional neural networks 1 and on-line load sharing management for web servers 2 among others. This article compares and contrasts two specific metaheuristics, namely the bat algorithm (BA) 3 and Particle Swarm Optimization (PSO). 4 The bat algorithm was proposed in Reference 3 as a novel metaheuristic inspired by the echolocation mechanism of bats. The original article makes two main claims. The first is that BA is a novel metaheuristic and the second is that BA is generally much superior to Particle Swarm Optimization and Genetic Algorithms (GA) based on experiments with 10 benchmark functions. There is no a priori reason to question the bat metaphor given that other metaheuristics like Ant Colony Optimization (ACO) 5 really are inspired by biology. 6 The scientific literature on bat echolocation does not discredit the metaphor, but it shows that it is shallow. BA parameterizes true bat echolocation based on frequency (f), pulse rate (r) and loudness (A) while real bats use the former plus duration, directionality and a complex motor control system to home in on prey. 7 To be fair, the fact that BA is not a perfect model of true bat behaviour is not a problem per se, but if the research community accepts metaphorical descriptions, it should insist that other possible analogies be differentiated. Case in point, bats are not the only species that use echolocation for navigation and hunting. Some bird species employ a similar strategy and so do certain species of dolphins and whales.
Any one of these animals could have been chosen to create the Whale Algorithm for echolocation-based optimization instead of BA. This creates unnecessary confusion since Whale Optimization Algorithm (WOA) 8 does exist and it is not based on echolocation. To repeat one of Kenneth Sörensen's points in Reference 9 the language of metaphors obscures what is truly novel about an algorithm other than its name and says very little if anything about why it works. The reliance on metaphors instead of mathematical descriptions is but one of the methodological flaws commonly observed in the metaheuristics literature.
The purpose of this article is to show that inadequate research methodology has led the research community to believe that BA is an original contribution to the metaheuristics literature. The research hypothesis is that BA is an underperforming variant of PSO instead of the other way around. The reason for singling out BA over other metaheuristics is that BA has obtained more than 10 000 citations since it was published in 2010, as reported in the Web of Science as of November 2019. This makes BA a good candidate for demonstrating that the growing concern that bad methodology is widespread in the metaheuristics research literature [9][10][11][12][13][14] is in fact supported by evidence obtained by following good methodological practices.
The rest of this article is organized as follows: Sections 2 and 3 describe PSO and BA respectively, Section 4 contains a detailed mathematical analysis of BA, Section 5 contains a detailed description of the methodology used to generate and analyse the results of benchmark function minimization discussed in Section 6 and finally Section 7 contains a summary of the findings and proposes future research avenues.

THE PARTICLE SWARM OPTIMIZATION METAHEURISTIC
In historical terms, PSO is the third swarm intelligence metaheuristic after Stochastic Diffusion Search (SDS) 15 in 1989 and ACO 5 in 1992. It is relevant to our discussion to point out that it precedes BA by a decade and a half. PSO was designed by social psychologist James Kennedy and electrical engineer Russell Eberhart. 4 The initial purpose was to simulate the social behavior of humans through the imagery of the movements of bird flocks and fish schools. It was quickly realized that the resulting paradigm was well-suited for mathematical optimization, and it was thereafter used to optimize highly non-convex functions and to train neural networks. Kennedy and Eberhart did not use names like Bird Flock Optimization and Fish School Optimization precisely because they systematically simplified PSO by removing some of the original nature-inspired components of the algorithm and found that it worked just as well. As such, they decided to use the neutral imagery of particles as defined in References 16 and 17. Others, like Reference 18 with Migrating Birds Optimization (MBO) and Reference 19 with Fish School Search (FSS), were apparently not as concerned with the impact of using metaphors.
The characteristic equations of PSO are given in Algorithm 1. The position (x_i) at time step t of the i-th particle is its coordinates in the search space and the velocity (v_i) is a weighted sum of the relative position vectors between the particles, their personal historical best positions (p_i) and the current global best position (x*). The velocity term is weighted by what is called the inertia factor (ω). The effect of inertia is to accumulate velocity in a "good" direction over time. The primary difference with BA is the use of the i-th particle's personal best-known location. The parameters c_1 and c_2 are constants that weigh the relative importance of the global best and the personal best positions, respectively. The parameters r_1 and r_2 are random variables drawn from a uniform distribution, generally with support [0, 1], represented by U(0, 1). The randomness introduced by these parameters causes the particles to overshoot the global best and personal best positions with proportions 1/c_1 and 1/c_2 respectively when t is large. The net effect is to prevent premature convergence by providing what Kennedy and Eberhart called a stochastic "kick" in Reference 4. The key idea behind PSO is that at every time step, the particle is pushed (accelerated, in PSO terminology) in the direction of a weighted average of the particle's personal best-known location and the entire swarm's best-known position. The implied assumption is that regions of high fitness are close to other regions of high fitness. 20 There are hundreds of variants of PSO, some of which were proposed by the original authors. The original PSO given in Reference 4 is the one used in this article despite there being a standard reference PSO algorithm proposed in Reference 21. The reason for this choice is that the authors believe that BA is only marginally different from the original PSO, therefore the original PSO is a more appropriate basis of comparison.
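The velocity and position updates described above can be sketched as follows. This is a minimal NumPy transcription of the original PSO update, not the article's MATLAB code; the function name and vectorized layout are our own.

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w=0.7, c1=0.7, c2=0.7, rng=None):
    """One iteration of the original PSO update (Algorithm 1).

    x, v, p_best: (n_particles, n_dims) arrays; g_best: (n_dims,) array.
    """
    rng = rng or np.random.default_rng()
    r1 = rng.uniform(size=x.shape)  # stochastic "kick" toward the global best
    r2 = rng.uniform(size=x.shape)  # stochastic "kick" toward personal bests
    v = w * v + c1 * r1 * (g_best - x) + c2 * r2 * (p_best - x)
    return x + v, v
```

Tracking p_best (each particle's best-known position) and g_best (the best of those) between calls completes the algorithm.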
The frequency of the sonar f is bounded by f_min and f_max, which are generally set to 0 and 1 respectively. The parameter β is a random number drawn from a uniform distribution with support [0, 1]. The position is updated at each iteration by moving it by an amount v_i^{t+1} called velocity. Thus far, there is no practical difference with PSO except that the personal best position is omitted. This is not insignificant, because the authors of PSO explicitly reported in Reference 4 that this component increased performance. It is therefore implicitly stated that the new mechanisms proposed in Algorithm 2 more than make up for this exclusion. Three other parameters are introduced in BA: pulse rate r, loudness A and step size ε. This last parameter controls the size of the random step taken around the global best position when the conditional statement U(0, 1) > r_i is true.
Pulse rate and loudness are used as parameters for the two probabilistic acceptance mechanisms of lines 8 and 11. This idea is borrowed from Simulated Annealing (SA) 22 which introduced the notion that non-improving moves should be accepted following some probabilistic mechanism so that the algorithm can escape local minima and go on to search for other promising regions of the fitness landscape. While SA uses what is called a cooling schedule to gradually decrease the probability of accepting non-improving moves, BA uses pulse rate and loudness. Pulse rate (Equation (4), r_i^{t+1} = r_i^0 [1 − exp(−γt)]) is a function of the initial value r_0 and of γ, which controls the speed of convergence as shown in Figure 1. A value of r_0 = 1 was used for convenience since r_i^{t+1} → r_0 as t → ∞.
Loudness is calculated at each time step using the following equation: A_i^{t+1} = α A_i^t (Equation (5)). Equation (5) introduces the parameter α, which controls the rate of exponential decrease as shown in Figure 2. Possible values of A are bounded by [0, 1] and α is bounded by [0, 1). Since t ∈ Z+, the continuous curves in Figures 1 and 2 are shown for illustrative purposes.
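Putting Equations (1) to (5) together, one BA iteration can be sketched as follows. This is our own NumPy transcription of Algorithm 2's update rules under the standard formulation of Reference 3; the variable names are assumptions, and the greedy loudness-based acceptance step (line 11) is omitted for brevity.

```python
import numpy as np

def ba_step(x, v, x_best, t, r0=0.7, gamma=0.1, alpha=0.9, A0=1.0,
            eps=0.1, f_min=0.0, f_max=1.0, rng=None):
    """One BA update: frequency, velocity and position (Equations 1-3),
    plus the pulse-rate-gated random walk around the global best."""
    rng = rng or np.random.default_rng()
    n, d = x.shape
    beta = rng.uniform(size=(n, 1))
    f = f_min + (f_max - f_min) * beta           # Equation (1): frequency
    v = v + (x - x_best) * f                     # Equation (2): velocity
    x_new = x + v                                # Equation (3): position
    r = r0 * (1.0 - np.exp(-gamma * t))          # Equation (4): pulse rate
    A = A0 * alpha**t                            # Equation (5): loudness
    # local random walk around the global best when U(0,1) > r (line 8)
    walk = rng.uniform(size=n) > r
    x_new[walk] = x_best + eps * A * rng.standard_normal((walk.sum(), d))
    return x_new, v, r, A
```

Note how, early on (small t, so r near 0), nearly every agent takes the random walk around x_best, which is the behaviour analyzed in the next section.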

COMPARATIVE ANALYSIS
In Reference 3, PSO is said to be a particular case of BA when the effects of pulse rate and loudness are removed. This is incorrect since the original BA does not use PSO's so-called autobiographical memory 4 (ie, the c_2 term) to track each particle's personal best position and it also does not use inertia. To better understand the differences between BA and PSO, it is helpful to do a few elementary algebraic manipulations. The first component is represented by lines 4 to 6 of Algorithm 2 as shown in Algorithm 3.
Equation (4) can be rewritten as Equation (6). Combining Equations (1) and (2) gives Equation (7). Combining Equations (3) and (7) gives Equation (8). The same logic applied to lines 4 and 5 of PSO in Algorithm 1 gives Equation (9). From Equations (8) and (9), the first component of BA is shown to be a special case of its counterpart in PSO when ω = 1, c_2 = 0 and c_1 is drawn from the probability density function (PDF) of Equation (10), derived from the ratio of two uniform distributions. The resulting PDF is shown in Figure 3 for f_min = 0 and f_max = 1, with the area in blue covering half of the probability mass. This shows that BA uses a random number generated from the same uniform distribution as PSO about half of the time and an exponentially decreasing function for the rest. The previous conclusion holds for f_min = 0 and f_max = 1, but it can be generalized. Given f_min = 0 and f_max ≥ 1, BA will, on average, take larger steps than PSO in the direction of the global best position a fraction of the time equal to [1 − 1/(4 f_max)]. This also means that BA overshoots the global best position by larger amounts than PSO. More precisely, the resulting long-tailed distribution causes the overshoot to sometimes be a large multiple of the distance between an agent's current position and the global best position. For example, if the distance ‖x* − p_i‖ is 1 unit, then the resulting push in the direction of the global best will be larger than 5 units 10% of the time. Equation (4) and Figure 1 show that the pulse rate is an exponentially increasing function whose rate is governed by the parameter γ. Technically, since t ∈ Z+, pulse rate goes from 0 to r_0 geometrically fast instead of exponentially, but the distinction is abandoned hereafter. The main effect of the second component is that it separates BA into two phases. The first phase is when the algorithm starts. When t is small, the second component is almost surely always applied since r begins at zero and U(0, 1) > r_i^t is likely to be true for most agents.
This means that the second component makes the swarm converge on the initial global best value x*_init and makes every agent take random steps around it in the beginning. It was pointed out during the peer-review process that this behaviour was observed in Reference 23 with BA-controlled robots. This makes the value of almost every agent's personal best p_i equal to the global best x* in the first few iterations. This is similar to stochastic hill climbing (see Appendix B) because the swarm is centred around a single agent and is effectively sampling its immediate neighbourhood for improving positions. As such, the next position has to be in its immediate neighbourhood. The direct consequence of this finding is that BA is expected to perform relatively well on convex objective functions and less well on multimodal ones because "bad" initial positions will cause BA to converge to local extrema from the beginning.
The second phase of the BA optimization process is when t becomes larger. If r is considered to be equal to r_0 when it reaches a fraction δ of its final value, then at t = ⌈−ln(1 − δ)/γ⌉, the probability that the second component is applied is simply 1 − r_0. For example, if δ = 99% and γ = 0.5, then it would be true starting at t = 10. Typical values of γ range from 0.5 to 0.9, 24 so the preceding result could be as low as t = 6. This is a small value considering that the number of epochs is commonly observed in the 10^3 to 10^4 range. Larger values of r_0 cause the second component to be applied less often.
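The saturation epoch above can be computed directly. A minimal sketch, where δ is the saturation fraction and the function name is our own:

```python
import math

def pulse_rate_threshold(gamma, delta=0.99):
    """Epoch at which the pulse rate reaches a fraction delta of r0,
    from r^t = r0 * (1 - exp(-gamma * t)) >= delta * r0."""
    return math.ceil(-math.log(1.0 - delta) / gamma)
```

This reproduces the worked examples: t = 10 for γ = 0.5 and t = 6 for γ = 0.9.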
The third component covers lines 11 to 16 of Algorithm 2 as shown in Algorithm 5.
Loudness is governed by the exponentially decreasing function given in Equation (5). The initial loudness for every agent is the same and is represented by A_0 without the i subscript. The probability that the loudness condition is true is given by Equation (11).
The effect is that when an agent finds a better solution than the current global best, there is a 100% probability that it will be accepted until t reaches a value t*, after which the acceptance probability decreases exponentially until it is practically zero. As pointed out in Reference 3, α plays a similar role as the cooling factor in Simulated Annealing (SA). 22 The resemblance becomes clear when it is realized that α^t is equivalent to b e^{−ct} for values of α = e^{[ln(b) − ct]/t}, which is a form of SA cooling schedule. 25 To summarize, it is clear from Equation (9) and the preceding analysis that BA does not meet the traditional criteria for being considered an original contribution. The so-called traditional criteria refer to the ones given by the Canadian Intellectual Property Office (CIPO), the European Patent Convention (EPC) and the United States Patent and Trademark Office (USPTO) for patentability. BA uses mechanisms that (a) have been discovered before, (b) can be simplified and (c) can be considered obvious by most individuals with an average understanding of stochastic optimization algorithms. It is believed that this is the fruit of inadequate methodological standards and that the use of metaphors obscures these findings, which would have been obvious otherwise. To support these conclusions, the following items will be investigated experimentally: 1. It is hypothesized that BA is particularly sensitive to the initial positions of the swarm. It appears that performance would decrease significantly if the swarm were initialized in a region that did not contain the global minimum. 2. It appears likely that the pulse rate mechanism (line 8, Algorithm 4) is no better than roulette wheel selection since it is asymptotically equivalent. 3. It appears likely that component 3 of BA (Algorithm 5) is no better than the probabilistic acceptance mechanism of simulated annealing.
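The equivalence between the loudness decay and an SA-style cooling schedule can be verified numerically. A small sketch, where the values of b, c and t are arbitrary illustrations:

```python
import math

def alpha_for_cooling(b, c, t):
    """Alpha that makes the loudness decay alpha**t match the SA-style
    cooling term b * exp(-c * t) at epoch t, ie alpha = e^[(ln b - ct)/t]."""
    return math.exp((math.log(b) - c * t) / t)

# alpha**t reproduces b * e^(-c*t) exactly (up to floating-point error)
b, c, t = 1.0, 0.05, 100
alpha = alpha_for_cooling(b, c, t)
assert abs(alpha**t - b * math.exp(-c * t)) < 1e-12
```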

METHODOLOGY
This section describes the methodological approach used in this research. The goal is to provide all of the essential information needed to make it reproducible and easy to understand. This is accomplished by clearly stating the research assumptions. The following elements will be discussed briefly: 1. The first hitting time performance metric. 2. The notion of ε-neighbourhood.
First, the performance metric used in this research is the first hitting time (FHT), which is defined in statistics as the first time a process reaches a specific subset of the state space. In this research, it is defined as the number of objective function evaluations (ie, a measure of time) required to reach an ε-neighbourhood (ie, a subset of the state space) of the known best solution of a benchmark function. Convergence is an interesting property, but whether or not the swarm itself converges to the global minimum is often of no practical importance provided the best-found solution is kept in memory throughout the optimization process.
Second, the notion of "hit" is clarified by defining what an ε-neighbourhood is. An algorithm is considered to have hit the target when it finds a position whose fitness is within ε = 10^−2 or less of the global minimum for a given benchmark. It is assumed that there is no need to make a distinction between 10^−2 and 10^−3 or 10^−30 because the basin of attraction of the global minimum has almost certainly been found with the former. Some judgement is required.
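The FHT of a single run can be extracted from a trace of objective function values as follows. A minimal sketch; the function and variable names are our own:

```python
def first_hitting_time(fitness_trace, f_star, eps=1e-2):
    """Number of objective-function evaluations before the best-so-far
    fitness is within eps of the known global minimum f_star.
    Returns None when the target is never hit (a censored run)."""
    best = float("inf")
    for n_evals, f in enumerate(fitness_trace, start=1):
        best = min(best, f)
        if best - f_star <= eps:
            return n_evals
    return None
```

Returning None for runs that never hit matters later, because the hit rate is used to adjust the median FHT of each configuration.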
Finally, Section 5.1 contains details about the test set selection process and supports the use of a small number of test functions by emphasizing that selection should be based on problem characteristics rather than the size of the test set. The results obtained on the test set will be compared to the results on two more benchmarks to demonstrate the strength of this approach for future reference. Section 5.2 contains information about the parameter selection process by pointing to the appropriate literature and by describing the grid search procedure employed to find reasonable parameters for fair comparison. Section 5.3 explains how the research hypotheses stated in Section 4 will be explored and describes the nonparametric statistical tests used to compare FHT distributions. The data that support the findings of this study are openly available in https://github.com/iangagn/ENG-2019-11-084.

Test set
The standard practice for the investigation and comparison of metaheuristics is to compare their performances on a series of benchmark problems. This is partly due to the No Free Lunch Theorem (NFLT) 26 which says that the ideal universal optimizer does not exist and that good performance on a set of problems comes at the detriment of others. The result is that researchers often use large sets of test functions in order to find which problems the studied algorithm is good for. This approach is incomplete and redundant for the following reasons: 1. It does not attempt to explain why the algorithms perform the way they do on specific problems or problem attributes, so it largely fails to provide generalizable knowledge. 2. Choosing benchmarks with no reason other than to reach a predetermined number almost guarantees that some functions will provide redundant information already provided by other functions of the set.
To avoid these problems, this research aims to characterize the performance of the studied algorithms when facing the common difficulties encountered in optimization problems and to define a performance profile for each. This is achieved by making use of the well-known De Jong test functions. 27 This choice is motivated by the following items: 1. The small size of the test set (ie, 5 functions) makes the results more tractable. This decision is also supported by the fact that the average empirical linear correlation coefficient (ρ) as defined in Reference 28 for the five test functions is only .258 (Figure 4). An instance of two variables that have a correlation coefficient of 0.25 is shown in Figure 5 for reference. Functions f_1 and f_4 are highly correlated (ρ = 0.91), but this is explained by the fact that they both have bowl-like shapes (note that Figure 9 is inverted) and that their global minima coordinates are the same (ie, the origin). The empirical linear correlation coefficient matrix is an imperfect measure of test set diversity, but it confirms that the functions "look" different without having to manually inspect every pair of graphs. Figure 4 can be generated using ENG2019110845_Empirical_Correlation_Matrix.m from the referenced GitHub repository. The first benchmark is the unimodal convex function called Sphere (Figure 6). It is the least challenging function an optimizer can face and simply serves to measure its general efficiency. Trajectory methods (ie, algorithms that only use one solution per time step) always perform well on this kind of function, because the gradient always points in the direction of the global extremum (ie, there are no local extrema).

F I G U R E 6 Sphere function
The second function is a generalization of the Rosenbrock function (Figure 7). The minimum is located inside a banana-shaped valley where optimizers can get stuck due to the scarcity of improving positions (ie, low gradient).

F I G U R E 7 Rosenbrock function
min(f_2) = f_2(1, …, 1) = 0. The third function is called Step (Figure 8) and the main difficulty it poses is that it is made up of a large number of plateaus (ie, where ∇f = 0). Optimizers fail when their neighbourhood topologies are fixed and/or are too small to search outside the plateaus where they are trapped.

F I G U R E 8 Step function

The fourth function is a quartic function with Gaussian noise (Figure 9) represented by N(0, 1). Noisy functions have the property of being rugged. Optimizers can fail when local minima appear to exist where they do not. Note that Figure 9 shows an inverted (ie, multiplied by −1) version of the noisy quartic function for visualization purposes.

F I G U R E 9 Noisy quartic function
The fifth function is called Shekel's Foxholes ( Figure 10). It is a plateau filled with steep basins where optimizers often cannot escape. Note that Figure 10 is inverted for visualization purposes.
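For reference, the first four benchmarks can be sketched in their common textbook forms. These are assumptions on our part: De Jong's originals fix the dimensionality and search bounds, which are omitted here, and Shekel's Foxholes is left out because its coefficient matrix is lengthy.

```python
import numpy as np

def sphere(x):                    # f1: unimodal, convex
    return np.sum(x**2)

def rosenbrock(x):                # f2: banana-shaped valley, min at (1, ..., 1)
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def step(x):                      # f3: plateaus where the gradient is zero
    return np.sum(np.floor(x))

def noisy_quartic(x, rng=None):   # f4: rugged due to additive Gaussian noise
    rng = rng or np.random.default_rng()
    i = np.arange(1, x.size + 1)
    return np.sum(i * x**4) + rng.standard_normal()
```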

Parameter selection
In the current metaheuristics literature, the parameter selection process is often left unmentioned, which can leave the impression that the values were either chosen at random or carefully selected to produce a desired outcome. The purpose of this section is to explain how the parameters were chosen for BA and PSO. Since this research is mainly concerned with exploring the claim that BA is generally superior to PSO, it is believed that it is not necessary to compare both algorithms on the elusive best possible parameter settings. This is because if general superiority really existed, it should not require extensive parameter tuning to observe. In other words, the dominance pattern should be apparent. Despite this, efforts were made to find good parameter settings to make the comparisons as fair as possible. The parameters for BA were chosen by measuring the median FHT for 100 runs on a multidimensional grid of parameter combinations and selecting the combination whose average rank is the best on the test set. The ranges for the BA parameters are based on a meta-analysis 24 of historical parameter settings. The studied parameter settings are given in Table 1. This results in 2 × 4^3 × 3 = 384 combinations and 384 × 100 = 38 400 runs in total. Each run is limited to 2500 epochs for a total maximum of 96 × 10^6 epochs. Since the number of epochs is limited, sometimes BA never hits the target. This is accounted for by adjusting the median FHT values of each configuration by dividing by the hit rate. For example, if the median FHT is 500 function evaluations and the hit rate is 90%, then the actual FHT used for comparisons is 500∕0.9 ≈ 555.5 function evaluations. This is to prevent a bad configuration from winning by having a small number of good hits with a low hit rate. The resulting best parameter set for BA is f ∈ [0, 2], r_0 = 0.7, A_0 = 1, α = 1.9, γ = 0.1.
These results can be reproduced or improved upon using the following MATLAB file in the referenced GitHub repository: ENG2019110845_Parameter_Selection_BA.m. A similar exercise with PSO yielded ω = c_1 = c_2 = 0.7 and corresponds to ENG2019110845_Parameter_Selection_PSO.m. The results of the experiments are reported in Appendix A. The final parameter settings are given in Table 2.
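The hit-rate adjustment described above can be sketched as follows, reproducing the worked example of a 500-evaluation median with a 90% hit rate. The function name is our own:

```python
import statistics

def adjusted_median_fht(fhts):
    """Median FHT over successful runs divided by the hit rate.

    `fhts` has one entry per run; None marks a run that never hit the
    target within the epoch budget. Dividing by the hit rate penalizes
    configurations that win rarely but fast."""
    hits = [f for f in fhts if f is not None]
    hit_rate = len(hits) / len(fhts)
    return statistics.median(hits) / hit_rate
```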

It is interesting to note that the best α value is 1.9 since this causes U(0, 1) < A_i^t to always be true (ie, because A_0 = 1 implies that A_i^t ≥ 1 for all values of t and i). This suggests that line 11 of Algorithm 2 can be simplified. This finding will be discussed in the results section in light of the hypothesis stated in Section 4 that the loudness mechanism is inferior or equal to the SA probabilistic acceptance mechanism.

Experiments and nonparametric statistical tests
This section describes the experimental procedures used to investigate the hypotheses formulated in Section 4. Section 5.3.1 describes how the research hypotheses will be explored, Section 5.3.2 describes the two-sample Kolmogorov-Smirnov test used to determine if two samples come from the same parent distribution and finally Section 5.3.3 describes the Wilcoxon-Mann-Whitney test used to compare medians.

Investigating the research hypotheses
This section describes the experimental procedures used to investigate the research hypotheses formulated in Section 4. The hypotheses are repeated here for convenience: 1. BA is suspected of being highly sensitive to initial positions. 2. The pulse rate mechanism is suspected of being no better than roulette wheel selection.
3. The loudness mechanism is suspected of being inferior or equal to the probabilistic acceptance mechanism of simulated annealing.
To measure the effect of different initial positions on BA's performance (hypothesis 1), the two following schemes will be used: random uniform initialization on the full domain and random uniform initialization on a restricted domain that does not contain the global minimum. To compare the pulse rate mechanism (Algorithm 4) with roulette wheel selection (hypothesis 2), a variant of BA (ba_roulette_wheel.m) will be used to generate FHT distributions. A uniform probability roulette wheel with P(true) = .5 and P(false) = .5 will be used as shown in Algorithm 6. The line numbers on the left-hand side of Algorithm 6 correspond to the line numbers of Algorithm 2.

Algorithm 6. BA with roulette wheel selection mechanism

To measure the effects of the loudness mechanism (hypothesis 3), a variant of BA with no loudness mechanism (ba_no_loudness.m) as shown in Algorithm 7 will be used to generate FHT distributions.

Algorithm 7. BA without the loudness mechanism
The resulting FHT distributions will be analyzed and compared using the nonparametric tests described in the next two sections. Nonparametric tests are used because the normality assumption does not hold for FHT distributions, since FHT is bounded below (it is naturally greater than or equal to 1).
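The difference between the pulse rate condition and its roulette wheel replacement is small enough to sketch in a few lines. This is our own transcription, not the MATLAB code from ba_roulette_wheel.m; the function name is an assumption:

```python
import numpy as np

def local_search_mask(n, t, r0=0.7, gamma=0.1, use_roulette=False, rng=None):
    """Which of the n agents take a random step around the global best
    at epoch t. Original BA: U(0,1) > r^t with r^t = r0*(1 - exp(-gamma*t)).
    Roulette-wheel variant: a fixed fair coin flip, P(true) = P(false) = .5."""
    rng = rng or np.random.default_rng()
    if use_roulette:
        return rng.uniform(size=n) > 0.5
    r = r0 * (1.0 - np.exp(-gamma * t))
    return rng.uniform(size=n) > r
```

The variant removes the time dependence: instead of nearly every agent walking around the global best in the early epochs, only about half do at every epoch.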

The two-sample Kolmogorov-Smirnov test
The FHT distributions generated from the experiments described in Section 5.3.1 will be compared using the Kolmogorov-Smirnov 2-sample test (KS2 hereafter). KS2 is a nonparametric statistical hypothesis test used to determine if two samples have the same parent distribution (see Reference 29 for more details). The procedure poses the following null (H_0) and alternative (H_a) hypotheses: H_0: the two samples come from the same distribution; H_a: the two samples do not come from the same distribution. The corresponding test statistic is the maximum absolute difference between the empirical cumulative distribution functions (ECDF_i) of both samples: D_{n,m} = max_x |ECDF_1(x) − ECDF_2(x)|. The null hypothesis is rejected for a given significance level α when D_{n,m} > c(α) √((n + m)/(nm)), where n and m represent the sample sizes of ECDF_1 and ECDF_2 respectively. The corresponding p-values are given in Reference 30. This research uses the MATLAB implementation kstest2 (see https://www.mathworks.com/help/stats/kstest2.html).
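The statistic and rejection rule can be sketched without the toolbox. A minimal NumPy version, where c(0.05) ≈ 1.36 is the standard asymptotic critical coefficient:

```python
import numpy as np

def ks2_statistic(x1, x2):
    """Two-sample KS statistic: max absolute difference between ECDFs,
    evaluated over the pooled sample points."""
    data = np.sort(np.concatenate([x1, x2]))
    ecdf1 = np.searchsorted(np.sort(x1), data, side="right") / len(x1)
    ecdf2 = np.searchsorted(np.sort(x2), data, side="right") / len(x2)
    return np.max(np.abs(ecdf1 - ecdf2))

def ks2_reject(x1, x2, c_alpha=1.36):
    """Reject H0 (same parent distribution) at the 0.05 level when
    D exceeds c(alpha) * sqrt((n + m) / (n * m))."""
    n, m = len(x1), len(x2)
    return ks2_statistic(x1, x2) > c_alpha * np.sqrt((n + m) / (n * m))
```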

The Wilcoxon-Mann-Whitney test
To compare performance between BA and PSO, FHT distributions are generated for both algorithms and the one with the smallest median FHT is the winner for a given benchmark. The differences in medians are confirmed at a 0.05 significance level using the Wilcoxon-Mann-Whitney test with MATLAB's built-in ranksum function in the Statistics and Machine Learning Toolbox (https://www.mathworks.com/help/stats/ranksum.html). The technical details of the Wilcoxon-Mann-Whitney nonparametric test (WMW hereafter) are given in Reference 29.
In short, WMW takes two FHT populations, say X_1 and X_2 of sizes n_1 and n_2 respectively, and tests the following hypotheses: H_0: the medians of X_1 and X_2 are equal; H_a: the medians are not equal. If H_0 is true and S_X1 is the sum of the ranks of the n_1 elements of X_1 within X_1 ∪ X_2, then it can be shown that S_X1 approximately follows a normal distribution with mean μ = n_1(n_1 + n_2 + 1)/2 and standard deviation σ = √(n_1 n_2 (n_1 + n_2 + 1)/12), provided that n_1 > 20 and n_2 > 20. From this, the inverse normal distribution can be used to reject the null hypothesis if the test statistic z = |S_X1 − μ|/σ is greater than 1.96 for a 0.05 significance level or 2.58 for a 0.01 significance level.
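The normal-approximation statistic above can be sketched directly. The function name is our own:

```python
import math

def wmw_z(rank_sum_x1, n1, n2):
    """Normal-approximation z statistic for the Wilcoxon-Mann-Whitney
    rank sum of sample X1 within the pooled sample (valid for n1, n2 > 20).
    Reject H0 when z > 1.96 (0.05 level) or z > 2.58 (0.01 level)."""
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return abs(rank_sum_x1 - mu) / sigma
```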

RESULTS
This section contains the results obtained by following the protocol given in Section 5.3.1. Section 6.1 contains the results of the effect of two different initialization schemes, Section 6.2 contains the results of the effect of replacing the pulse rate mechanism with roulette wheel selection, Section 6.3 contains the results of the effect of removing the loudness mechanism and Section 6.4 contains additional observations.

Sensitivity to initial conditions
The FHT distributions were evaluated for BA and PSO with random uniform initialization on the full domain (Adjusted Median FHT A) and random uniform initialization on a restricted domain (Adjusted Median FHT B) as described in Section 5.3.1. The adjusted medians are calculated as in Section 5.2 and statistical significance is assessed using KS2 (Section 5.3.2). The results are presented in Tables 3 and 4. The results for BA on f_3 and f_5 are consistent with those obtained during the parameter selection process in Appendix A. BA severely underperforms PSO on these functions and initialization does not appear to play a statistically significant role. It is believed that BA mostly finds the global minima of f_3 and f_5 by chance. The mean difference in adjusted median FHT values for f_1, f_2 and f_4 is 77.5% for BA and 25% for PSO. Both algorithms are similarly affected by the restricted initialization scheme on f_1 and f_4, so it could be argued that BA's adjusted median FHT difference of 173.1% on f_2 is an outlier, but it is believed that this is not the case. It is proposed instead that the structural resemblance between f_1 and f_4 explains their similar sensitivities to initialization (ie, both are bowl-shaped convex or almost convex functions). This hypothesis is supported by the fact that the same differences for PSO in Table 4 are almost identical (ie, 28.6% and 28.7%). The conclusion is that BA is in fact very sensitive to initialization and that the observed differences are statistically significant at a .05 significance level based on the KS2 test with a sample of 1000 observations.

BA with roulette wheel selection instead of pulse rate
The FHT distributions (n = 1000) were evaluated for BA with the pulse rate mechanism (Adjusted Median FHT A) and with roulette wheel selection (Adjusted Median FHT B) as shown in Algorithm 6 of Section 5.3.1. The adjusted medians are calculated as in Section 5.2 and statistical significance is assessed using KS2 (Section 5.3.2). The results are presented in Table 5. Using a uniform probability roulette wheel instead of the pulse rate mechanism increased performance by 56.1% on average. The adjusted median FHT decreased on all 5 test functions and the differences are statistically significant based on the KS2 test with a sample of 1000 observations. The differences are particularly pronounced for the functions where BA did poorly, namely f_3 and f_5. With this simple change, BA suddenly becomes competitive with PSO (see Tables 4 and A2 for comparison). It is conjectured that this is due to the lessened impact of component 2 during the early phases of the optimization process. If this is true, then the initialization scheme is expected to have less of a negative impact. This was confirmed during a later experiment. Table 6 shows the adjusted median FHT with random uniform initialization (Adjusted Median FHT A) and initialization on a restricted domain (Adjusted Median FHT B) as described in Section 5.3.1. It can be seen that the differences are smaller, with an average increase in FHT of 8.7% against 42.7% (Table 3).

BA without the loudness mechanism
The FHT distributions (n = 1000) were evaluated for BA with the loudness mechanism (Adjusted Median FHT A) and without it (Adjusted Median FHT B), as shown in the algorithm of Section 5.3.1. The adjusted medians are calculated as in Section 5.2 and statistical significance is assessed using KS2 (Section 5.3.2). The results are presented in Table 7. The adjusted median FHT decreased by an average of 46.1% for f1, f2 and f4 and increased by an average of 45.5% for f3 and f5. The difference is not statistically significant for f3, despite being large, because the samples that did find the target are small (nA = 134, nB = 119). It is believed that this difference is caused by fewer improving positions being rejected, therefore making BA greedier (ie, less explorative). This is supported by the fact that performance decreased for those functions where more exploration is beneficial, namely f3 and f5, and improved by a substantial amount otherwise.

The centroid movements are shown in Figures 13 and 14, with each generation rendered at a different level of transparency: the first generation is the most transparent and the most recent generations are the most opaque. These figures show more or less exactly how one would expect a hill climber to navigate the landscape. The stochastic hill climbing behaviour of BA is clearly beneficial for convex functions and ridges with non-zero gradients, but it does appear to pose problems for functions with discontinuities and/or noise, as one would expect. Figure 13 shows BA stagnating on the plateaus of f3 and Figure 14 shows BA centroids stagnating for over 1000 epochs on f4 in the two regions circled in green.
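The two acceptance rules compared above can be sketched as follows. This is an illustrative Python fragment, not the repository code: `accept_with_loudness` follows the rule of Reference 3 (accept an improving candidate only if a uniform draw also falls below the bat's loudness, then decay the loudness), while `accept_greedy` is the variant with the mechanism removed; the decay constant and fitness values are assumptions made for the demonstration.

```python
import random

ALPHA = 0.9  # assumed loudness decay constant

def accept_with_loudness(f_new, f_old, loudness, rng=random):
    """BA acceptance: an improving candidate is kept only if a uniform
    draw falls below the loudness A_i, which then decays (A_i <- alpha * A_i)."""
    if f_new < f_old and rng.random() < loudness:
        return True, ALPHA * loudness
    return False, loudness

def accept_greedy(f_new, f_old):
    """Acceptance with the loudness mechanism removed: every improving
    candidate is kept, making the search greedier (less explorative)."""
    return f_new < f_old

random.seed(3)
loudness = 0.5
accepted = rejected = 0
for _ in range(1000):
    f_new, f_old = random.random(), 0.5
    ok, loudness = accept_with_loudness(f_new, f_old, loudness)
    accepted += ok
    rejected += (accept_greedy(f_new, f_old) and not ok)
print(accepted, rejected)  # loudness decay rejects many improving moves
```

Because the loudness only ever decays, improving positions are rejected more and more often as the run progresses, which is consistent with the exploration/exploitation trade-off observed on f3 and f5 versus f1, f2 and f4.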

Additional observations
The Foxholes function (f5) is absent simply because BA consistently fails early on it (ie, the swarm "falls" into one of the local minima and stays there the entire time), so the centroid movements are uninteresting. This observation suggests that BA could be used to navigate objective functions that possess deep ridges, like the HappyCat function described in Reference 31 and shown in Figure 15 along with the level curves that inspired its name. This is left as a future research project.
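For reference, the HappyCat function of Reference 31 can be sketched as follows; the exponent α = 1/8 is a commonly used value, and readers should consult Reference 31 for the exact parameterization used there.

```python
import numpy as np

def happycat(x, alpha=1.0 / 8.0):
    """HappyCat function (Reference 31). Its global minimum f = 0 is
    attained at x = (-1, ..., -1), on the deep ridge that makes it an
    interesting test case for the tight-swarm behaviour of BA."""
    x = np.asarray(x, dtype=float)
    n = x.size
    norm_sq = np.dot(x, x)
    return ((norm_sq - n) ** 2) ** alpha + (0.5 * norm_sq + x.sum()) / n + 0.5

print(happycat([-1.0, -1.0]))  # 0.0 at the global minimum
```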

CONCLUSION
The results presented in the previous sections disprove the claim made in Reference 3 that BA is generally superior to PSO.
In fact, the results support the opposite. This research also exposed two weaknesses of BA: plateaus and noisy objective functions. This might be attributable to parameter selection, but the results of the parameter selection process described in Section 5.2 do not support this hypothesis. Instead, it is assumed that BA is simply not well suited to these objective function characteristics. This being said, the tight swarm configuration of BA observed in Section 6.4 does appear to be good at finding promising directions in regions where they are rare (eg, ridges) and the gradients are low (eg, the banana-shaped valley of f2), and at exploiting them. This finding suggests that BA might be beneficial to the study of neural network landscapes, since it was reported in Reference 32 that low-energy regions are connected to one another by relatively simple curves. In other words, the optimal regions share similarities with the banana-shaped valley of f2. The results also support the research hypotheses made in Section 4. The first hypothesis is that BA is highly sensitive to initial conditions. This was confirmed by comparing the effects of two initialization schemes on BA and PSO. The second hypothesis posited that the pulse rate mechanism is no better than roulette wheel selection. Not only was this confirmed in Section 6.2, but the experiments also showed that BA's sensitivity to initial conditions almost entirely disappeared when uniform probability roulette wheel selection was used instead of the pulse rate mechanism. The third and last hypothesis was that the loudness mechanism is inferior or equal to SA's probabilistic acceptance mechanism. Section 6.3 showed that, in our study, the loudness mechanism could be abandoned entirely for those functions where BA already performed well relative to PSO. All of this suggests that the best-performing configuration of BA is actually a hybrid between PSO and SA.
Further studies could confirm this by comparing this hybrid configuration against BA, both with optimal parameters determined through design of experiments and response surface methodology. To conclude, it is important to mention that the findings presented in this article should, in the opinion of the authors, have been made well before presenting BA as an original contribution that outperforms PSO, when it is easy to show that both of these claims are incorrect.

PEER REVIEW INFORMATION
Engineering Reports thanks Ke-Lin Du and other anonymous reviewers for their contribution to the peer review of this work.

CONFLICT OF INTEREST
The authors declare no potential conflict of interest.

The inference made in Section 4 that BA behaves like a stochastic hill climber was shown graphically. In Figure B1, it can be seen that the swarm is very close to the global best position by t = 100, but it only hit the target (10⁻²) at t = 1910. BA does appear to behave like a stochastic hill climber. Perhaps this could be improved by using an adaptive random step size or some other adaptive intensification procedure when the swarm appears to stagnate (ie, the centroid does not move by much from iteration to iteration). This could be the subject of future research.
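The adaptive intensification idea could take the following shape. This is a hypothetical sketch, not part of the study: all names, thresholds and the shrink factor are assumptions. The swarm centroid is tracked across iterations and the random step size is shrunk whenever the centroid has barely moved over a window of iterations.

```python
import numpy as np

def adapt_step_size(centroid_history, step, shrink=0.5, tol=1e-3, window=50):
    """Hypothetical adaptive rule: if the swarm centroid has moved less
    than `tol` over the last `window` iterations, shrink the random
    step size to intensify the search around the current region."""
    if len(centroid_history) < window:
        return step
    drift = np.linalg.norm(centroid_history[-1] - centroid_history[-window])
    return step * shrink if drift < tol else step

# Illustration: a stagnating centroid (tiny jitter around the origin)
# triggers a shrink of the step size.
rng = np.random.default_rng(0)
history = [1e-5 * rng.standard_normal(2) for _ in range(60)]
print(adapt_step_size(history, 1.0))  # 0.5: drift below tol, step halved
```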