Machine learning models identify gene predictors of waggle dance behaviour in honeybees

The molecular characterization of complex behaviours is a challenging task as a range of different factors are often involved to produce the observed phenotype. An established approach is to look at the overall levels of expression of brain genes—or ‘neurogenomics’—to select the best candidates that associate with patterns of interest. However, traditional neurogenomic analyses have some well‐known limitations: above all, the usually limited number of biological replicates compared to the number of genes tested—known as the “curse of dimensionality.” In this study we implemented a machine learning (ML) approach that can be used as a complement to more established methods of transcriptomic analyses. We tested three supervised learning algorithms (Random Forests, Lasso and Elastic net Regularized Generalized Linear Model, and Support Vector Machine) for their performance in the characterization of transcriptomic patterns and identification of genes associated with honeybee waggle dance. We then matched the results of these analyses with traditional outputs of differential gene expression analyses and identified two promising candidates for the neural regulation of the waggle dance: boss and hnRNP A1. Overall, our study demonstrates the application of ML to analyse transcriptomics data and identify candidate genes underlying social behaviour. This approach has great potential for application to a wide range of different scenarios in evolutionary ecology, when investigating the genomic basis for complex phenotypic traits, and can present some clear advantages compared to the established tools of gene expression analysis, making it a valuable complement for future studies.


| INTRODUC TI ON
The complex relationship between genes and behaviour has fuelled a large body of recent research (Robinson, 2004;Weitekamp et al., 2017) and we now know that gene activity can influence brain function, which in turn may affect behaviour (Robinson et al., 2008). Several studies have shown that behavioural states (distinct and well-characterized behaviours such as foraging or defensive behaviour) can be associated with distinct gene expression profiles in neural tissue, representing the basis for the neurogenomic approach: for example, large gene networks have been associated with foraging and defence behaviour in honeybees (Hunt et al., 2007), and numerous candidate neurological genes have been linked to aggression in a variety of organisms, including honeybees (Liu et al., 2016) and zebrafish (Filby et al., 2010). Nonetheless, most studies have focused on behavioural states that are long lasting or inherent to a species (Zayed & Robinson, 2012), whereas more plastic and transient social interactions among members of the same species (or colony) have been less characterized at the neurogenomic level (Taylor et al., 2021). This is probably due to the challenges associated with combining accurate behavioural observations with complex experimental designs to obtain and analyse large sets of gene expression data (Robinson et al., 2008).
The Western honeybee Apis mellifera has become a model organism for neurogenomics due to its fascinating sociobiology, the ecosystem services it provides as a pollinator and the availability of a fully annotated genome (Weinstock et al., 2006). Honeybees display perhaps one of the most iconic social behaviours in the animal world-the "waggle dance"-where foragers communicate the location of suitable food sources and possible nest locations to nestmates via stereotyped movements (Couvillon et al., 2012). This complex behaviour was described for the first time in the last century (von Frisch, 1967(von Frisch, , 1974 and since then many details of its ecological, evolutionary and physiological underpinnings have been characterized (reviewed in Barron & Plath, 2017;Dyer, 2002;Price & Grüter, 2015). Despite this, we still do not have a complete picture of how the waggle dance is regulated at the brain level. Pioneering studies have started to reveal some of the key players at the levels of molecules (Barron et al., 2007;Kennedy et al., 2021;Linn et al., 2020), cell types (Kiya et al., 2007) and genetic pathways (Sen Sarma et al., 2009) associated with dance communication, but it is unclear what genes in the honeybee brain trigger the performance of dance behaviour once activated.
Traditionally, the neurogenomic approach has consisted of using statistical methods to calculate differential gene expression (Fang et al., 2012), which requires robust data analysis techniques due to the large volumes of sequence reads generated per sample (Kukurba & Montgomery, 2015). An interesting development in the field to address the increased computational needs of these approaches has been the application of machine learning (ML) to genomics studies (Libbrecht & Noble, 2015). ML is a branch of computer science which focuses on the study of algorithms that can improve automatically through experience or by the use of data. These algorithms were proposed to address complex problems which could not be solved through an explicit list of computational steps. Thus, ML methodologies have proved to be powerful resources and have been the focus of extensive research recently to identify the possibilities of new applications to a wide range of fields in biology and medicine (Saeys et al., 2007;Wang et al., 2016;Zhou & Tuck, 2007). Despite the abundance of studies applying ML frameworks to transcriptomic data, its use to characterize the molecular regulation of highly plastic and transient behaviours has not yet been properly explored.
In this study, we sought to identify the genes associated with the performance of dance behaviour in honeybee foragers using an ML approach. We obtained a transcriptomic data set of brain tissues (mushroom bodies) from honeybee foragers that were sampled for another study designed to underpin the molecular basis for learning distance and direction through the waggle dance (Manfredini et al in prep.): mushroom bodies were targeted for this study as they are the best suited brain tissue to explore high cognitive functions in insects (Menzel, 2012;Peng & Chittka, 2017), including spatial tasks (Buehlmann et al., 2020;Kamhi et al., 2020).
We trained three classification algorithms on the expression levels of 15,314 transcripts that equal the total number of currently known genes of the honeybee genome, with the direct goal of classifying honeybees according to whether or not they performed a waggle dance upon their return from a foraging trip (i.e., dancers vs. nondancers). Thereafter, we unified the information obtained from the different ML approaches to identify the genes associated with these complex behavioural states, and we compared these results with more traditional analyses of gene expression based on the quantification of transcript abundance across groups (namely, a Likelihood Ratio Test [LRT] and a Generalized Linear Model [GLM]).
Together, our study provides deeper insight into the molecular regulations of the waggle dance, a plastic and transient behavioural state, and promotes incorporating ML in the analysis of transcriptomic data.

| Experimental setup and initial data set
The transcriptomic data used for analysis were part of an experiment prepared to study the molecular basis for social learning of distance in honeybees through the waggle dance (Manfredini, 2021). In this experiment honeybees from four different colonies were trained to visit a feeder containing a sucrose solution (concentration = 2 m) positioned at the end of a 6-m-long tunnel (Srinivasan et al., 2000), which was used to alter the bee's perception of distance as follows: vertical stripes (with respect to the direction of flight) on the tunnel walls were used to increase the estimated flight distance, while horizontal stripes were used to decrease it ( Figure 1). Honeybees were then marked at the feeder according to perceived distance (similarly to Sen Sarma et al., 2010), yielding two groups: "honeybees perceiving long distance" and "honeybees perceiving short distance." Honeybee colonies were housed in an observation hive, which allowed direct monitoring of the comb where honeybees normally performed the waggle dance after returning from a foraging trip, known as the "dance floor" (Tautz, 1996). Bees were trained during the morning to visit the feeder at the end of the tunnel-it usually took ~5 h to complete this part-and then in the afternoon (between 2 and 4 pm) foragers that regularly visited the feeder were monitored by an observer while foragers were flying from the dance floor to the feeder and vice versa repeatedly. During this 2-h time window, the dance floor was also recorded with a video camera, producing a recording of all waggle dance events that occurred in the focal colony. At 4 pm on each day, marked bees were sampled at the feeder, by gently catching them with tweezers and quickly freezing them in liquid nitrogen. These samples were immediately housed in a −80°C freezer and stored there until we proceeded with the molecular work necessary to isolate RNA samples.
The dance floor footage was carefully analysed to identify marked honeybees that performed waggle dances upon their return from the feeder (hereafter "dancers") and separate them from those that instead were never seen performing any dance ("nondancers") for the whole duration of the 2-h videorecording, despite being visible on the dance floor upon their return from a foraging trip. We allocated to the dancers group bees that were seen performing at least one dance in the 2-h time window, but there was a certain level of variation in the group, with most bees performing multiple dances up to a maximum of 12 dances recorded for one individual. An analysis of the recorded dances confirmed that the manipulation was successful: bees exposed to vertical stripes advertised longer distances on average in their dances compared to bees exposed to horizontal stripes (Manfredini et al. in prep).
This resulted in the following four groups of honeybees: Dancer perceiving Long distance (DL), Dancer perceiving Short distance (DS), Nondancer perceiving Long distance (NL), and Nondancer perceiving Short distance (NS), with eight replicate samples in each of the four groups (N = 32). Brain tissues from all these samples were processed individually for RNA sequencing (RNAseq) analysis (see Appendix S1).
Sequencing files were mapped to the most recent version of the honeybee genome (Amel_4.5) using the intron-aware Star aligner, version 2.6.1a (Dobin et al., 2013), with default settings and using annotation information available for Apis mellifera on NCBI. Read counts were extracted using the featureCounts function from the Bioconductor R package Subread version 1.8.0 (Liao et al., 2019) and following recommended parameters. The final data set, which represents the starting material for this study, included the read counts for 15,314 genes, corresponding to the whole honeybee genome across 32 bees. As we noticed some variation in library size for some of the bee samples (Dataset S3), we normalized read counts by the total library sizes to correct for the effect of possible outliers. To test for the existence of a colony effect, we used the duplicateCorrelation function in the R Bioconductor package limma (version 3.50.0).

F I G U R E 1
Schematic representation of the experimental design. (a) Honeybees visited a feeder through a tunnel, used to alter their distance perception (vertical stripes with respect to the direction of flight increased, while horizontal stripes decreased the distance perceived), and were then marked accordingly. (b) At the observation hive, honeybees were recorded while performing or not performing dancing behaviour and were finally prepared for RNAseq analysis (c). (d) Three machine learning algorithms (SVM, GLMNET and RFE) were trained on the preprocessed sequence reads (Training Classifiers). (e) Key features from each model were compared to identify common elements (genes or predictors)

| Model hyperparameters and data preprocessing
We used the caret package version 6.0-90 (Kuhn, 2008) in the programming language r version 4. While training, we assessed the performance of each classifier on a validation set using repeated k-fold cross-validation (cv) on the 26 samples with 100 repeats per model (Beleites & Salzer, 2008). We chose the number of folds to be 10, a standard practice in ML (Kuhn & Johnson, 2013), which meant subdividing the training set (26 samples) into 10 bins and assessing the model performance on each one of the bins, after being retrained on the remaining nine. For each of these 10 runs, the area under the receiver operating characteristic curve (AUROC) (Marzban, 2004) was accessed using twoClassSummary as the summary function in the train control, and their performance was averaged to give one value for that run. Since we only had a handful of samples, the performance of the models could be highly dependent on the cv splits. Performing k-fold cv repeatedly (100 times) eliminated this possibility and ensured that each model was trained and validated on most (if not all) of the 26 samples.
As cv is performed to find the best parameters for the model, these 100 repeats were executed for each set of hyperparameters.
The optimal hyperparameters were found by caret implicitly, by performing a grid search through the 10 most likely values for each parameter, which were then reported.

| Selected machine learning algorithms
We used principal component analysis (PCA) (Jolliffe & Cadima, 2016) to explore the underlying structure of our data set. As a result of this set of preliminary analyses, we carefully selected the classification algorithms shown in Table 1. For a brief description of these algorithms see the Appendix S1. We also made use of "Feature Selection" techniques (FS) (Saeys et al., 2007;Wang et al., 2016) to identify the most suitable features (genes) at predicting the correlation between gene expression data and dance behaviour.
We explored three fundamentally different approaches with implicit feature ranking procedures based on previous studies (see Table 1 and also the Appendix S1): Random Forests (RF), Lasso and Elastic net Regularized Generalized Linear Model (GLMNET), and Support Vector Machine (SVM). Due to the complexity of the data, we decided to use a radial kernel for SVM, as supported by previous research (Kasnavi et al., 2018). These methods, also known as "em- search (Saeys et al., 2007). More specifically, RFE uses backwards selection to assess the importance of each feature to the model, and discards or keeps them accordingly at each iteration. The best performing subset of features is then reported and the model is refitted on them.
The ranking of the features is done by the underlying algorithm, which can be RF, SVM, GLMNET or others (Granitto et al., 2006;Li et al., 2015;Zhou & Tuck, 2007). Considering the promising properties of RF for genomic studies (Statnikov et al., 2008), we decided to use RF as the underlying model for recursive feature elimination. Even though RF had been explored as an FS algorithm in the early stages of this study, we decided to only include results of RF being run as part of the RFE procedure (RFE-RF), to avoid any overrepresentation of predictors selected by RF in the final set of focal genes.

| Characterization of focal genes
The results of the described approaches were used comparatively to characterize the relevance of the ML models and obtain a final set of predictors. First, we reported the subset of features identified by RFE-RF and compared them with the top ranked features of SVM and GLMNET, in order to contrast the three approaches, and distil an initial list of common features.
As our goal was to identify the most promising set of candidate genes and discuss them in detail, we then focused on a restricted subset of the main output. Namely, we obtained the top 20 most important features according to the individual ranking of each approach, which were then compiled into a single list of focal genes.
To test the statistical significance of the overlaps, we calculated the Jaccard Index and Odds Ratio with the GeneOverlap r package (Shen, 2021). The annotations of overlapping geneS were obtained using NCBI (https://www.ncbi.nlm.nih.gov/). Where NCBI could not provide any information on putative gene function, we used blaSt (https://blast.ncbi.nlm.nih.gov/Blast.cgi) using default settings and the nucleotide-to-nucleotide function. We initially compared the sequence of the transcript of interest against the honeybee genome (A. mellifera) and then, if this did not provide any meaningful results, we compared it against the whole repository, to see whether any significant sequence similarity was detected against orthologues in other organisms (i.e., matches with high score and low E-value).
We then performed overlap analyses to detect candidate genes that were in common among the three algorithms. For comparison with standard analytical methods, we also analysed the same data set of We followed recommended settings for both analyses (normalization performed with the variance stabilizing transformation or vst function in deSeq2, and with Trimmed Mean of M-values or TMM in edger) and we adopted a false-discovery rate (FDR) equal to 0.05 to invoke a statistically significant difference in gene expression. Last, we compared the outputs of these analyses with the list of candidate genes from the ML approaches to identify common genes. The LRT was also used to check for the possibility of any effects due to colony of origin or lane of the sequencer used that could be responsible for driving the observed patterns of gene expression: none of these factors was associated with a significant effect (FDR > 0.05).

| Exploratory analysis
The consensus correlation for the colony effect yielded a negative value (−0.3327), disproving that some colonies were more correlated with the dance behaviour than others, and suggesting therefore the absence of a colony effect in the data. Moreover, PCA was unable to clearly separate the four groups of bees according to the combination of dance behaviour (dancer (D)/nondancer (N)) and distance perceived (long (L)/short (S)) ( Figure 2), or according to the colony of origin.
However, when considering the dance factor alone, we obtained a low-dimensional representation/projection of the data using only few principal components (PCs), which produced easily distinguishable clusters of samples, that is dancers and nondancers (Figure 3). The representation was dominated by PC1, accounting for 51% of the variance in the data, while PC2 only accounted for 14.4% (Figure 3; Figure S1).
PCA clearly showed that dancers were clustered together to-  Figure 2). Overall, the analysis showed a clear underlying structure in the data set with respect to the dance component (dancers vs. nondancers), while no evident structure appeared to be associated with the perceived distance (long vs. short). Based on these findings, we proceeded in our ML analyses focusing on the "dance" factor alone.

| Recursive feature elimination
The model achieved high accuracy even when using only a portion of the available features (genes in this case). The algorithm found the optimum using 5,000 of the original features ( Figure S3, Dataset S2-Sheet 5), which is around 34% of the available data.  (2019) Recursive Feature Elimination (RFE) Wrapper Granitto et al. (2006), Saeys et al. (2007) Zhou and Tuck (2007) Note: The first three algorithms use embedded feature selection (FS) to obtain key predictors from the trained model (Embedded), while RFE requires an underlying embedded approach for the ranking (Wrapper). We report the studies that featured or reviewed these algorithms.
the model was able to represent the data using only limited information, and to generalize from the training data, obtaining 100% accuracy (ACC) on the test set. We then compared and examined the first 5,000 features ( Figure S4), and then further analysed the top 20 of these genes (Figure 4), finding significant overlap with the other methods.

| Embedded methods
We trained two classifiers with the underlying algorithms SVM and GLMNET (see Table 1) using the hyperparameters as described in the Appendix S1 (Dataset S2-Sheet 1 and Sheet 3 was 0.5, as we started from a balanced data set, and the p-value for ACC > NIR was 0.01563. We concluded that both algorithms generalized successfully, as high performance was achieved on both the training and test data sets ( Figure S2).

| Overlap between selected features
The selected 5,000 features from RFE-RF (Dataset S2-Sheet 6) were compared with the top ranked genes from SVM ( included in the optimal subset selected by RFE-RF, with 83 of them selected by SVM as well (Jaccard Index 0.016, odds ratio 28.995, p < .001). The overlap between SVM and RFE-RF was also significant with 3,804 genes in common (see Figure S4), giving a Jaccard Index of 0.613, and an odds ratio of 24.246, with p < .001.

| Genes identified as key predictors
There were 18 genes (predictors) that were shared between at least two approaches (see Figure 4). The largest overlap was observed between RFE and SVM (16 genes) while the overlap between SVM and GLMNET was smaller (four genes). No overlap was detected between RFE and GLMNET. The Jaccard Index between SVM and RFE was the most significant (0.739), while between SVM-GLMNET and GLMNET-RFE it was 0.111 and 0.052, respectively. Similarly, the odds ratio indicated strong association between SVM and RFE (10,265.036, p < .001). Overall, elements in common corresponded mainly to protein coding genes, except for "GB40714," indicating noncoding RNA. We were able to retrieve functional information for most of the genes from annotations of the honeybee genome (see Table 2), with the exception of "GB50940" and" GB45448," where annotations were available only for closely related insects (Apis dorsata and Apis cerana, respectively), and "GB54617" that we could not find any information for.

| Comparison with standard gene expression analyses
We characterized gene expression patterns in the same groups of honeybees as above with standard statistical approaches to identify possible elements in common with the ML approaches that we tested. The LRT approach identified 243 genes that were statistically different between any two groups of bees, while the GLM approach identified 373 genes that were specifically different between dancers vs. nondancers (see Appendix S1 for the lists of these genes). We performed overlap analyses between these two gene sets and the list of 18 genes selected by the ML approaches. This resulted in five genes in common for the LRT approach, and nine genes in common for the GLM approach: both overlaps were statistically significant (representation factors: 17.5 and 20.5; p < .001 in F I G U R E 3 Two-dimensional projection of the data with the first two principal components. The difference between the two classes "Dancer" (red circles) and "Nondancer" (turquoise triangles) became more evident when excluding the distance factor. Ellipses show 95% confidence of the variance for the two class groups both comparisons) indicating more genes in common than expected by chance. Five genes were shared across all analyses that we performed (see Table 2). Interestingly, all focal genes identified by ML approaches (as well as elements in common with gene expression analyses) were expressed at higher levels in dancers, indicating strong consistency in their expression patterns in association with the regulation of dancing behaviour (see Figure 5).
Moreover, we tested the robustness of LRT and GLM by retraining SVM and GLMNET using only the 243 and 373 identified genes.
We found that the models achieved the same performance (100% ACC) on the test set given the genes identified by LRT but were inferior on the 373 genes found by GLM, as SVM achieved only 66.67% ACC, by misclassifying two dancer samples, while GLMNET achieved 100% ACC. We also compared these subsets of genes to the maximally loaded genes for PC1 (top 5,000), and we found 15.23% and 5.9% genes in common, respectively. As this overlap was absent in the case of the genes identified by the embedded methods, this led us to believe that ML methods were more robust to the separation caused by the outliers.

| DISCUSS ION
In the present study, we implemented an ML approach to investigate the transcriptomic signatures arising from a complex plastic phenotype. We explored the unique gene expression profiles of Apis mellifera associated with dance behaviour in order to determine the set of focal genes that could play a key role in the regulation of this complex behaviour. Training one wrapper algorithm (RFE-RF) and two embedded models (SVM and GLMNET), we were able to achieve perfect accuracy in assigning honeybees to the major behavioural response that we tested ("dancer" vs. "nondancer") according to gene expression data. The RFE-RF approach highlighted how the genomic signature associated with the waggle dance is rather heterogeneric and can be traced across a wide portion of the honeybee genome (one third). At the same time, using FS and comparative analyses, we were able to obtain a restricted set of key predictors for each classifier, which were then distilled into a list of genes. Our results show that ML models can be used in addition to standard methods of gene expression analysis and as a complementary approach to characterize the transcriptomic profile associated with the honeybee waggle dance and to identify sets of genes that are promising candidates for the regulation of dance behaviour.
In our initial preliminary analyses (PCA) we were able to clearly separate dancers from nondancers, except for those four nondancer bees that we identified as forming a separate cluster. We exclude the possibility that these samples might represent a set of "outliers" significantly driving the outcome of our analyses. First, there is nothing visibly different associated with them: they came from different colonies (actually from all four colonies tested in our assays), were sequenced in different lanes and produced libraries of different size (which we controlled for in our normalization step). Furthermore, the lack of overlap between the best candidate genes from our ML approach and the maximum loadings for PC1 fully supports the fact that the molecular separation between dancers and nondancers was not driven by these four samples. On the other hand, the PCA approach was unable to detect any major effects of distance perception, which was one of the research questions that we had initially pursued. Although we found that the impact of distance perception on gene expression was too subtle to be detected by our approaches, other studies have succeeded in identifying the effect of distance perception alone on honeybee brain gene expression, using more traditional statistical tools of transcriptomic analyses (Sen Sarma et al., 2010). It is possible that with an increased sample size, we would have been able to investigate this behaviour further.
Alternatively, the transcriptomic signature associated with distance perception might be more significant in honeybees experiencing real distance as opposed to the perceived distance that honeybees experienced through our tunnel manipulation setup. In fact, a larger set of genes was found to differ between foragers experiencing real long distance vs. short distance (Manfredini et al., in prep.) but we cannot exclude that a proportion of these genes might have changed their patterns of expression from one group to the other according, for example, to different metabolic costs of flight.
It is also worth noting that GLMNET has a fundamentally different strategy towards overfitting than the other approaches, as the regularization parameter controls the cost of nonzero coefficients in the model, whereas in SVM it controls the penalty of misclassification, and RF is known not to overfit as the number of trees increases (Breiman, 2001) (see Appendix S1 for further details). However, we did not expect GLMNET to achieve perfect accuracy while discarding most of its variables (~ 99.44%), and the fact that these genes were a subset of the ones selected by other approaches is an exciting result. This also shows how a combination of different tools is the best approach for identifying candidate genes.
The extensive overlap between the three approaches, and the fact that many of the identified genes were also in common with traditional methods of transcriptomic analyses, shows great promise. We hypothesize that these genes are the best predictors for the dance behaviour, as they all appear to be expressed at higher levels in dancers vs. nondancers. In particular, the two genes that are in common to all three ML approaches and were also identified by at least one approach to gene expression analysis deserve special attention. Boss (bride of sevenless) belongs to the group of G-proteincoupled receptors, an important family of genes often associated with expression of behaviour in insects. In particular, boss has been linked to a set of different functions in Drosophila, including sight and eye development, energy homeostasis and response to glucose (Kohyama-Koganeya et al., 2015). Boss might have been co-opted in honeybees to regulate dance behaviour, an energetically expensive F I G U R E 5 Focal genes heatmap. The heatmap shows the expression patterns for the 18 focal genes across all bee samples (Dancers grouped to the left, Nondancers to the right). Expression patterns are shown as the logarithmic transformation (log 10 ) of the number of read counts for the focal genes per million counts total (or logCPM). All focal genes showed higher levels of expression in Dancers (indicated by darker colours). The gene IDs marked with a single asterisk (*) were identified by gene expression analysis (GLM approach only), while IDs marked with two asterisks (**) were identified by both GLM and LRT activity that is highly related to feeding behaviour (and therefore to sugar response) and relies on visual input for orientation purposes during flight to a foraging site. In support of our findings, two previous studies addressing the regulation of the honeybee waggle dance at the molecular level have also identified genes linked to metabolism and energy production-even though boss is not among them (Sen Sarma et al., 2009Sarma et al., , 2010. Interestingly, in one of these studies, genes associated with metabolism were more highly expressed in A. mellifera compared to a different honeybee species (Apis florea) that performs a simplified version of the waggle dance, further supporting the evidence that energy costs of the waggle dance can be quantified at the level of gene expression in the mushroom bodies. As for heterogeneous nuclear ribonucleoprotein A1, studies on the Drosophila orthologue HRP59 have revealed a role for this gene in alternative splicing (Hase et al., 2006), a molecular process that allows the translation of a single mRNA molecule into multiple protein variants , significantly increasing the repertoire of responses to a stimulus. Even though the role of alternative splicing in the regulation of behaviour is largely unknown, this process has started to be characterized in multiple organisms, including honeybees (Foret et al., 2012), hinting at the possibility that the honeybee orthologue of HRP59 might contribute to the high plasticity that is necessary to regulate a complex behaviour such as the waggle dance.
More functional approaches are needed to move beyond correlation and investigate whether a causal link exists between the expression levels of the genes that we identified and the performance of dance behaviour. For example, a recent study has revealed that gene expression associated with sensory perception rather than high cognitive functions is more important for bees following a dance when deciding whether to use personal information vs. social cues (the waggle dance) when engaging in the next foraging trip (Kennedy et al., 2021): it could be tested whether sensory perception has a role in the regulation of dance behaviour as well, by analysing gene expression in other brain parts such as the antennal or optic lobes that are clearly linked to the processing of sensory inputs. We are aware that other internal or external factors could contribute to define the patterns of gene expression in the honeybees that we analysed, as the brain is a complex organ that responds to a wide range of factors and stimuli that we could not control totally, such as bee age or number of dances per- were to support our findings, these results could then be used to test the recruitment potential in a specific colony. By designing a diagnostic tool to directly measure the levels of expression of the focal genes and compare them against a reference, it would be possible to assess the overall ability of a colony at recruiting to a foraging site through dancing.
In conclusion, with this study we provide support for using ML models as a complementary approach to standard gene expression analyses to understand the molecular regulation of a behavioural phenotype. We show the potential of ML models to represent complex patterns in a high-dimensional data set with limited information can be added to a study in future analyses. In the context of the honeybee waggle dance, this feature could be exploited, for example, to understand what other factors influence the performance of the dance behaviour (e.g., age of the bee, colony of origin or foraging patterns) and without the need to carefully monitor every single dance event occurring in the hive, which can seriously limit any experimental design.

ACK N OWLED G EM ENTS
We thank Dr Georgios Leontidis (The School of Natural and Computing Science, University of Aberdeen) for his valuable support during the selection and implementation of the ML models, and the two anonymous reviewers for providing useful feedback that helped improve the clarity and soundness of the manuscript. We are also grateful to NERC (Natural Environment Research Council) for funding this project and supporting MV's salary over 10 weeks This funding also supported FM during the execution of the field and molecular work.

CO N FLI C T O F I NTE R E S T
The author declare no conflict of interests.

O PE N R E S E A RCH BA D G E S
This article has earned an Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://github.com/Vejni/ Waggl eDance_Machi neLea rning.

DATA AVA I L A B I L I T Y S TAT E M E N T
All codes used in the analyses here reported are visible in a GitHub repository associated with this project: https://github.com/Vejni/ Waggl eDance_Machi neLea rning. The raw sequencing data that represent the starting material for the analyses here described have been deposited on NCBI SRA (Bioproject PRJNA756776).