rphenoscate: An R package for semantics‐aware evolutionary analyses of anatomical traits

Organismal anatomy is a hierarchical system of anatomical entities often imposing dependencies among multiple morphological characters. Ontologies provide a formal and computable framework for incorporating prior biological knowledge about anatomical dependencies in models of trait evolution. They also offer new opportunities for working with semantic representations of morphological data. In this work, we present a new R package—rphenoscate—that enables incorporating ontological knowledge in evolutionary analyses and exploring semantic patterns of morphological data. In conjunction with rphenoscape, it allows for assembling synthetic phylogenetic character matrices from semantic phenotypes of morphological data. We showcase the package functionality with data sets from bees and fishes. We demonstrate that ontologies can be employed to automatically set up evolutionary models accounting for trait dependencies in stochastic character mapping. We also demonstrate how ontology annotations can be explored to interrogate patterns of morphological evolution. Finally, we demonstrate that synthetic character matrices assembled from semantic phenotypes retain most of the phylogenetic information from their original data sets. Ontologies will become important tools for integrating anatomical knowledge into phylogenetic methods and making morphological data FAIR compliant—a critical step of the ongoing ‘phenomics’ revolution. Our new package offers key advancements towards this goal.


| INTRODUCTION
Biological realism in models of trait evolution, that is, accurate modelling of the biological processes underlying trait changes, is an important but often overlooked feature of phylogenetic modelling (Boyko & Beaulieu, 2021). For example, it is common in statistical phylogenetics to treat each character as an independent realization of the evolutionary process. While this assumption may be questionable for molecular data, it is certainly dubious for morphological data.
Nonindependence among anatomical traits can result from multiple causes (biological, semantic, and ontological dependencies; Vogt, 2018a; see also glossary and definitions in Appendix S1: Table S1), and alternative models have been proposed to deal with them (Tarasov, 2019, 2023). While researchers often attempt to deal with such challenges via expert character construction, there is a pressing need for such knowledge to be repeatable and computable. What if we could reliably inform phylogenetic models with knowledge of anatomical trait relationships, that is, ontological (anatomical) dependencies, in a repeatable and computable framework? In this paper, we present a new R package for addressing this challenge: rphenoscate.
The 'dependency problem' (how to code and model dependent traits) is a longstanding issue in phylogenetics with morphological data and has received considerable attention in recent years (Hopkins & St. John, 2021; Simões et al., 2023; Tarasov, 2019, 2023). Addressing this issue is relevant to improving the realism of evolutionary models for morphological traits, since organismal anatomy is highly structured and phylogenetic characters often exhibit complex hierarchical relations. Advances in model-based phylogenetics offer different models and coding strategies to deal with dependencies (Tarasov, 2019, 2023), and ontologies can facilitate their detection.
Ontologies are formal representations of domain knowledge using structured vocabularies (Balhoff et al., 2010; Dahdul et al., 2010). In particular, anatomy ontologies allow the expression of knowledge about anatomical concepts. For example, ontologies can formalize that the fish anatomical concept 'dorsal fin ray' is part of 'dorsal fin'. Therefore, any character of a 'dorsal fin ray' (e.g. shape) depends on the presence of a 'dorsal fin'. If trait dependencies are not accounted for, ancestral state reconstruction with dependent traits can yield biologically misleading inferences at internal nodes (Tarasov, 2019). By automating model specification for dependent traits using ontological knowledge, rphenoscate enables researchers to set biologically plausible models of character evolution for phenomic-scale matrices.
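The parthood reasoning above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the rphenoscate implementation: the `PART_OF` dictionary is a hand-written toy ontology (only the relation 'dorsal fin ray' part of 'dorsal fin' comes from the text; 'pelvic fin' is added as an unrelated entity), and the function names `wholes` and `find_dependencies` are hypothetical.

```python
# Toy part_of relations: term -> set of direct wholes it is part of.
# Only 'dorsal fin ray' part_of 'dorsal fin' is taken from the text;
# 'pelvic fin' is an illustrative, unrelated entity.
PART_OF = {
    "dorsal fin ray": {"dorsal fin"},
    "dorsal fin": set(),
    "pelvic fin": set(),
}

def wholes(term, part_of=PART_OF):
    """Return every entity that `term` is (transitively) part of."""
    found, stack = set(), list(part_of.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(part_of.get(parent, ()))
    return found

def find_dependencies(entities, part_of=PART_OF):
    """List (dependent, controlling) pairs among the annotated entities."""
    annotated = set(entities)
    return sorted(
        (dep, ctrl)
        for dep in annotated
        for ctrl in wholes(dep, part_of) & annotated
    )

characters = ["dorsal fin", "dorsal fin ray", "pelvic fin"]
print(find_dependencies(characters))
```

Each returned pair identifies a character whose states are only meaningful when the controlling entity is present, which is the situation that calls for a dependency-aware model rather than independent characters.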
Furthermore, ontologies offer a framework for integrating morphology within and across domains of knowledge, prompting alternative schemas for representing anatomical data (Vogt, 2018a, 2018b). By using ontologies, it is possible to represent morphological characters in a more standardized and holistic way, consisting of hierarchically structured data annotated with ontology terms and/or hyperlinks to metadata ('semantics-aware'), such as semantically enriched character matrices (Ramírez et al., 2007), semantic instance anatomies (Vogt, 2018a, 2018b) or semantic phenotypes (Balhoff et al., 2010; Deans et al., 2012). Representing organismal anatomy in a semantics-aware format could revolutionize morphological phylogenetics by allowing for methods that 'align' morphological data across organisms in a computable way (Ramírez & Michalik, 2014; Vogt, 2018a, 2018b) or build synthetic character matrices from multiple sources (Dececchi et al., 2015; Jackson et al., 2018). The Phenoscape project (https://phenoscape.org) has developed key demonstrations of the use of ontologies in phylogenetics and evolutionary developmental biology (Edmunds et al., 2016; Mabee et al., 2020; Manda et al., 2015). The project has also developed an expert-curated database of semantic phenotypes for more than 6500 species of vertebrates (Phenoscape Knowledgebase). Our package capitalizes on this knowledgebase to explore phylogenetic applications of semantically enriched anatomical data.
In this study, we implemented tools for performing semantics-aware evolutionary analyses and exploring morphological data in a new R package, rphenoscate. These tools include automatically setting up evolutionary models for dependent traits based on ontology-encoded domain knowledge. We also provide tools for investigating relationships among anatomy ontology term annotations and for assembling synthetic character matrices from semantic phenotypes in the Phenoscape Knowledgebase (Phenoscape KB). Finally, we showcase the package functionality with taxonomically diverse data sets of bees and fishes.

| Implementation
rphenoscate is one of the two main R packages resulting from the SCATE project (https://scate.phenoscape.org). It is tailored to facilitate comparative analyses of trait data by incorporating knowledge from anatomy ontologies. Its sister package, rphenoscape, is tailored to work with semantic phenotypes from the Phenoscape KB, including tools for quantifying the semantic similarity of phenotype descriptions and algorithms for synthesizing annotated morphological data from published studies. rphenoscate works in synergy with rphenoscape, particularly for accessing semantic phenotypes from the Phenoscape KB, querying absence/presence data with OntoTrace (Dececchi et al., 2015), and calculating semantic similarity.

| Overview
The package includes functions for (a) assessing the dependency structure of anatomical entities based on ontology term annotations; (b) setting up evolutionary models accounting for trait dependencies; (c) assessing the relationships among term annotations using semantic similarity metrics calculated with rphenoscape; (d) visualizing the semantic and phylogenetic structure of the data; and (e) constructing phylogenetic characters and assembling synthetic character matrices using rphenoscape. A scheme of the main components of rphenoscate is presented in Figure 1. Tutorials with examples of applications are available on GitHub (https://github.com/diegosasso/rphenoscate_tutorials).

| Package showcase
To showcase the package, we consider three study cases. In the first case (hereafter BEES), a researcher wants to reconstruct the evolutionary history of multiple traits of bees and understand how they relate to each other in bee anatomy. For example, do traits from different anatomical regions evolve similarly? We employed a modified data set of corbiculate bees (e.g. honeybees and bumblebees) from Porto and Almeida (2021) (Appendix S1: Data Set 1).
In the second case (hereafter CHARA), a researcher has access to the Phenoscape KB and wants to retrieve information on the absence/presence of a set of bones in Characidae. Then, the researcher wants to reconstruct the evolutionary history of these traits to answer a specific question: do bones from particular body regions get lost more frequently than others? We employed a data set of absence/presence characters inferred with OntoTrace (Dececchi et al., 2015) for species of Characidae (e.g. characids and tetras) retrieved from the Phenoscape KB (Appendix S1: Data Set 2).
In the third case (hereafter ANOST), a researcher also has access to the Phenoscape KB but wants to retrieve all semantic phenotypes of anostomoid fishes to infer a phylogenetic tree. The researcher needs tools for obtaining the semantic phenotypes (Task 1), converting them to phylogenetic characters (Task 2) and assembling them in a synthetic character matrix (Task 3). However, how can the researcher be assured that a character matrix obtained in this way actually contains phylogenetic information? To answer this question, a benchmark is necessary; a data set modified from Dillman

Methods in Ecology and Evolution, Porto et al.

| BEES and CHARA
Stochastic character mapping was performed to reconstruct trait evolution using corHMM (Beaulieu et al., 2013). For BEES, reconstructions used a tree modified from Porto and Almeida (2021). For CHARA, reconstructions used a dated phylogeny obtained using fishtree (Chang et al., 2019). In both cases, to investigate the semantic patterns of the data, clustering dendrograms of ontology terms were constructed using the Jaccard semantic similarity calculated with rphenoscape.
character obtained from semantic phenotypes of the same study using rphenoscape (Tasks 1 and 2) and rphenoscate (Task 3).

| ANOST
Comparisons were made for character matrices and posterior distributions of trees.
Character matrices were compared by calculating the cladistic information content using the package TreeTools (Smith, 2019). Posterior distributions were compared by calculating the generalized Robinson-Foulds distances in reference to the majority-rule consensus of each analysis using the package TreeDist (Smith, 2020). For detailed information on analyses and MCMC settings, see Appendix S1.
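To make the notion of cladistic information content concrete, the following Python sketch computes it for the simplest case only: a binary character with unambiguous states. It assumes one common definition, the negative base-2 logarithm of the proportion of unrooted binary trees consistent with the split implied by the character. It is not the TreeTools code, and the function names are hypothetical; multistate or ambiguous characters would need a more general treatment.

```python
from math import log2

def double_factorial(n):
    """n!! for odd n; defined as 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def n_rooted(n_tips):
    """Number of rooted binary trees on n_tips leaves: (2n - 3)!!."""
    return double_factorial(2 * n_tips - 3)

def n_unrooted(n_tips):
    """Number of unrooted binary trees on n_tips leaves: (2n - 5)!!."""
    return double_factorial(2 * n_tips - 5)

def binary_character_information(states):
    """Information (bits) of a binary character; None or '?' marks missing data.

    Assumes at most two distinct observed states. Taxa with missing data
    are simply dropped, which is a simplification.
    """
    observed = [s for s in states if s not in (None, "?")]
    a = sum(1 for s in observed if s == observed[0]) if observed else 0
    b = len(observed) - a
    if a < 2 or b < 2:
        return 0.0  # invariable or singleton characters carry no grouping information
    consistent = n_rooted(a) * n_rooted(b)  # trees that display the split a | b
    return -log2(consistent / n_unrooted(a + b))

print(binary_character_information([0, 0, 1, 1, 1]))
```

Under this definition, a character that separates two taxa from three others is consistent with 3 of the 15 unrooted binary trees on five taxa, so it carries log2(5) bits, whereas a character with a single deviating taxon carries none.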

| Studying complex traits
One of the main challenges of studying morphological evolution is modelling complex traits, that is, sets of traits often exhibiting multiple levels of dependencies and/or correlations. We demonstrate below that morphological knowledge expressed in anatomy ontologies can be employed for automatically setting up models for ontologically dependent traits. Biologically realistic models for morphology can be used for studying complex traits, for example, in the context of understanding adaptations to particular environments (Tribble et al., 2022), among other applications.

| What can be learned from the study cases?

| BEES
The sample of phylogenetic characters contained 16 anatomical entities (Appendix S1: Table S2). In cases where multiple characters referred to the same anatomical entity, rphenoscate automatically detected them and set up appropriate evolutionary models: a standard structured Markov model (SMM-ind) if no ontological dependencies were found; an embedded dependency Markov model of the quality type (ED-ql) if dependencies based on property instantiation were found; or an embedded dependency Markov model of the absence-presence type (ED-ap) if dependencies based on parthood relations were found (for details on types of dependencies, see Vogt, 2018a; on models, see Tarasov, 2019, 2023). Otherwise, models were automatically assigned to nondependent characters based on the number of observed states (Figure 2). By investigating the semantic patterns of the ontology annotations of phylogenetic characters, the researcher can learn that some traits with congruent character-state reconstructions (Figure 3a; black triangles) represent related anatomical entities in the bee anatomy ('anterior tentorial arm' and 'posterior tentorial arm' are both part of 'tentorium'; Figure 3b; TEN, purple dashed box), whereas others do not (Figure 3a; red stars; 'hypopharyngeal lobe' and 'furcula').
This indicates that at least some traits from a given anatomical region might be evolving similarly. However, in the context of phylogenetic inference, it has been demonstrated that the evolution of morphological characters does not necessarily follow anatomical partitions (Casali et al., 2022) or is often incongruent across them (Porto et al., 2021, 2022), thus prompting alternative causal explanations.
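The semantic clustering used in this exploration rests on pairwise Jaccard similarity between ontology terms. A minimal Python sketch follows, assuming the index is computed over each term's set of subsumers (the term itself plus its ancestors). The `SUBSUMERS` sets are invented for illustration: only the relation of the two tentorial arms to 'tentorium' comes from the text, 'anatomical structure' stands in for a generic root term, and in practice the similarities are obtained with rphenoscape rather than from hand-written sets.

```python
# Hypothetical subsumer sets: each term together with its ancestors.
SUBSUMERS = {
    "anterior tentorial arm": {"anterior tentorial arm", "tentorium", "anatomical structure"},
    "posterior tentorial arm": {"posterior tentorial arm", "tentorium", "anatomical structure"},
    "furcula": {"furcula", "anatomical structure"},
}

def jaccard(term_a, term_b, subsumers=SUBSUMERS):
    """Shared subsumers divided by all subsumers of either term."""
    a, b = subsumers[term_a], subsumers[term_b]
    return len(a & b) / len(a | b)

def similarity_matrix(terms, subsumers=SUBSUMERS):
    """Symmetric matrix of pairwise Jaccard similarities."""
    return [[jaccard(s, t, subsumers) for t in terms] for s in terms]

terms = ["anterior tentorial arm", "posterior tentorial arm", "furcula"]
for term, row in zip(terms, similarity_matrix(terms)):
    print(term, row)
```

In this toy example the two tentorial arms share half of their combined subsumers, whereas each shares only the root term with 'furcula'. Converting similarities to distances (1 minus similarity) gives the kind of input a hierarchical clustering routine needs to draw a dendrogram.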

| CHARA
The data retrieved from the Phenoscape KB contained 420 species with absence/presence data for the sampled anatomical entities (Appendix S1: Table S3). Of these, 146 species were available in the phylogeny obtained from fishtree. Ontological dependencies were detected between the pairs 'scapula' and 'scapular process', and 'coracoid bone' and 'coracoid foramen'; thus, structured Markov models were automatically set up by rphenoscate.
In this case, the model used to account for dependencies was the SMM-sw, but alternative models such as the ED-ap can also be used (Tarasov, 2023).
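The structure of such models can be illustrated with matrices of allowed transitions, in the spirit of Figure 2 (where '1' marks an allowed transition). The Python sketch below is an assumption-laden illustration, not the rphenoscate or corHMM code: it assumes that the independent model (SMM-ind) allows exactly one character to change per transition, and that the switch-on model (SMM-sw) is the same structure with changes in the dependent character disallowed while the controlling entity is absent. The function names are hypothetical and the parameterization used by the package may differ.

```python
from itertools import product

def smm_ind_mask(n_states_1=2, n_states_2=2):
    """Allowed transitions (1/0) for two independently evolving characters.

    Composite states are pairs (c1, c2); a transition is allowed only if
    exactly one of the two characters changes.
    """
    states = list(product(range(n_states_1), range(n_states_2)))
    mask = [[int((a1 != b1) != (a2 != b2)) for (b1, b2) in states]
            for (a1, a2) in states]
    return states, mask

def smm_sw_mask():
    """Switch-on mask for a binary controlling character C1 (0 = absent,
    1 = present) and a binary dependent character C2: C2 may change only
    while C1 is present."""
    states, mask = smm_ind_mask(2, 2)
    for i, (a1, a2) in enumerate(states):
        for j, (b1, b2) in enumerate(states):
            if a2 != b2 and a1 == 0:
                mask[i][j] = 0  # dependent trait cannot change while its bearer is absent
    return states, mask

states, mask = smm_sw_mask()
print(states)
for row in mask:
    print(row)
```

The two composite states in which C1 is absent remain in the model as distinct hidden states, in line with the note in the Figure 2 caption that the SMM-sw treats absence as two combinations of hidden states.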
By inspecting the reconstructed character histories (Figure 4), the researcher can learn that some bones representing structurally related anatomical entities might be evolving independently (e.g. uroneural bones), whereas others might not (e.g. infraorbital bones). Some bones were recovered as absent (supraneural bones) or present ('uroneural 1') for all species. Others were lost multiple times ('uroneural 2'). For the dependent characters, in 'scapula + scapular process' both entities are present in all species, whereas in 'coracoid bone + coracoid foramen' the 'coracoid bone' is present in all species, but the 'coracoid foramen' can be absent or present (Figure 4, arrowheads).
Information-poor anatomical entities are not randomly distributed; rather they are predominantly semantically related entities: bones from the supraneural series (Figure 5; red star).This might prompt the researcher to investigate if this lack of information is simply due to a poorly studied anatomical structure or if there are underlying biological causes.

| SYNTH
The matrix inferred from semantic phenotypes of Characidae contained 524 species and 739 phylogenetic characters. Overall data coverage (the proportion of cells with character-state information available) is around 20% for the entire matrix (Figure 7). Of all phylogenetic characters, around 20% are phylogenetically noninformative (invariable for the taxa considered). This result demonstrates that it is possible to construct synthetic character matrices from semantic phenotypes of different studies, opening up opportunities for 'phenomic-scale' studies exploring all the data available in the Phenoscape KB.
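The two summary statistics reported here are straightforward to compute for any character matrix. The Python sketch below shows one way to do so on a small invented matrix (rows are taxa, columns are characters, `None` marks missing data). The function names and the toy data are hypothetical, and characters with no observed states are counted as invariable, which is an assumption.

```python
def coverage(matrix, missing=None):
    """Proportion of cells that hold an observed character state."""
    cells = [cell for row in matrix for cell in row]
    return sum(cell != missing for cell in cells) / len(cells)

def invariable_fraction(matrix, missing=None):
    """Proportion of characters with at most one observed state."""
    n_chars = len(matrix[0])
    invariable = 0
    for j in range(n_chars):
        observed = {row[j] for row in matrix if row[j] != missing}
        if len(observed) <= 1:
            invariable += 1
    return invariable / n_chars

# Toy matrix: 3 taxa x 4 characters (invented values).
toy = [
    [0, None, 1, None],
    [0, 1, None, None],
    [0, None, 0, None],
]
print(coverage(toy), invariable_fraction(toy))
```

Only the third character in the toy matrix varies among taxa, so three of four characters count as invariable, while half of all cells are filled.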

| CONCLUSIONS
Although rphenoscate offers tools for working with external ontologies, some tools work only with semantic phenotypes from the Phenoscape KB and will require additional software development to apply to nonvertebrate taxa. Furthermore, areas for future development include increasing the number of models implemented to account for dependencies and improving the automatic model set-up, which is currently restricted to linear chains of dependencies with a few hierarchical levels.
Ontologies provide a new framework for studying organismal anatomy and constitute a fundamental step towards fully exploiting morphological data in the era of 'Phenomics'. They allow data from different sources and domains of knowledge to be easily integrated and summarized, making them findable, accessible, interoperable and reusable by humans and machines, and thus compliant with the FAIR principles in data science (Wilkinson et al., 2016). We hope that our new package will offer useful tools in this direction, encouraging researchers in the fields of comparative morphology, phylogenetics and ontologies.
et al. (2016) was employed (Appendix S1: Data Set 3). Finally, to demonstrate the use of the synthetic matrix assembling tool, an additional search was performed retrieving all semantic phenotypes from the Phenoscape KB for Characidae (hereafter SYNTH). This family was selected since it has a manageable size (around 600 spp.) to reduce computational effort.

FIGURE 1 Scheme of main components in rphenoscate. (a) Organismal anatomy can be conceptualized using anatomical entities, some of which are valuable for phylogenetic inference (e.g. 'maxilla'). (b) A systematist may propose a phylogenetic character formalizing the putative phylogenetic evidence; multiple characters are usually organized in a character matrix (d). (c) Phylogenetic characters can be represented as semantic phenotypes by linking the anatomical entities and qualities to concepts in ontologies (e). (f) The Phenoscape Knowledgebase contains expert-curated annotations of semantic phenotypes of vertebrates integrating multiple ontologies. (g) The rphenoscate package allows accessing semantic phenotypes from the Phenoscape KB and incorporating ontological knowledge to automate model specification for dependent traits, perform semantic explorations of data, and assemble synthetic character matrices using rphenoscape.

FIGURE 2 Types of models available in rphenoscate. (a) Standard Markov models for individual characters (Mk), in this case, a binary character. (b) Structured Markov models for groups of independent characters (SMM-ind), in this case, a pair of binary characters. (c) Models that account for dependencies: Structured Markov models of the switch-on type (SMM-sw) and embedded dependency Markov models of the quality type (ED-ql), in both cases, a pair of binary characters. Note that these models treat absences differently (state 0): as two combinations of hidden states in the former or one state in the latter. Alternatively, some transitions in the SMM-sw can be turned off (bold numbers in blue) based on ontological knowledge and thus the model collapses to an embedded dependency Markov model of the absence-presence type (ED-ap, not shown). C1 and C2 indicate characters and '1' within matrices indicates allowed transitions. Colours are used to facilitate visualization.

FIGURE 3 Exploration of the BEES data set. (a) Sample of stochastic character maps from four anatomical entities. Branch colours indicate character states. The black triangles and red stars indicate some clades with congruent patterns of reconstructed states. (b) Clustering dendrogram showing the relationships among ontology terms based on the Jaccard semantic similarity. Dashed boxes indicate clusters based on parthood relations known for the hymenopteran anatomy. GEN, genitalia; MD, mandible; MX, maxilla; TEN, tentorium.
Assessment of phylogenetic information was performed by comparing the original data set from Dillman et al. (2016) to the synthetic

FIGURE 4 Sample of stochastic character maps from 10 anatomical entities of the CHARA data set. Branches in orange indicate the presence and those in grey indicate the absence of anatomical entities; for pairs of entities, red colour indicates the presence of both (arrowheads) and purple colour indicates the presence of the first but the absence of the second entity.
After accounting for the ontological dependencies in the evolutionary models, the researcher can observe that reconstructed trait histories show some patterns co-occurring in the phylogeny (Figure 3a; stars and triangles). Although these patterns are congruent with a scenario of biological dependency between traits, the limited size of the data set (only one instance of co-occurring states) precludes an assertive interpretation.

FIGURE 5 Phylogenetic and semantic patterns of the CHARA data set. The tree to the left is the fishtree phylogeny. The clustering dendrogram at the top shows the relationships among ontology terms based on the Jaccard semantic similarity. The heatmap shows absences (state 0) or presences (state 1) of anatomical entities; empty cells indicate the lack of information. The dashed box indicates a clade supported by the absence of the bones 'infraorbital 5' and 'infraorbital 6'. The red star indicates a cluster of related anatomical entities with a lack of information.
FIGURE 6 Assessments of the phylogenetic information of the original and inferred synthetic ANOST data sets. (a) Boxplots of cladistic information content for phylogenetic characters in both data sets. (b) Majority-rule consensus trees inferred from Bayesian analyses. (c) Distribution of generalized Robinson-Foulds distances for trees in the posterior obtained from the original and inferred synthetic data sets compared with the majority-rule consensus tree of the original data set. (d) Same as (c) but compared with the consensus of the inferred data set.
The original data set from Dillman et al. (2016) contained 463 phylogenetic characters and 173 taxa. With rphenoscate and rphenoscape, it was possible to recover and cluster semantic phenotypes, resulting in a synthetic matrix with 422 characters. When assessing the phylogenetic information of both data sets, the cladistic information content for characters in the original and inferred synthetic matrices is almost identical (Figure 6a), indicating the conservation of potential phylogenetic information. When comparing the consensus trees (Figure 6b) and posterior distributions from both matrices (Figure 6c,d), trees are almost identical and distributions broadly overlap, demonstrating that the phylogenetic information of the original data set was retained in the inferred synthetic matrix. This result is crucial since the ability to synthesize data from different studies and character types presents a major challenge to data reuse, expansion, and synthesis.

FIGURE 7 Heatmap representing the synthetic character matrix obtained from semantic phenotypes of Characidae available at Phenoscape KB. Filled cells indicate information available for a given taxon (state 1, orange colour), irrespective of the actual character state.