Assessing the quality of comparative genomics data and results with the cogeqc R/Bioconductor package

1. Comparative genomics has become an indispensable part of modern biology due to the advancements in high-throughput sequencing technologies and the accumulation of genomic data in public databases. However, the quality of genomic data and the choice of parameters used in software tools used for comparative genomics can greatly impact the accuracy of results. 2. Here, we present cogeqc , an R/Bioconductor package that provides researchers with a toolkit to assess genome assembly and annotation quality, orthogroup inference, and synteny detection. The package offers context-guided assessments of assembly and annotation statistics by comparing observed statistics


| INTRODUC TI ON
The availability of genomic data in public databases has increased exponentially due to advancements in sequencing technologies.
However, a significant portion of this data is of poor quality, leading to incorrect or unreliable results in common analyses such as genome-wide synteny analysis and gene orthology detection (Feron & Waterhouse, 2022;Liu et al., 2018;Marks et al., 2021;Wang & Wang, 2022).Likewise, the choice of parameters in comparative genomics software can significantly impact the results obtained, as default parameters are typically optimised for particular (usually gold standard) datasets (Buchfink et al., 2021;Emms & Kelly, 2019).
Therefore, careful data validation and quality control is crucial to ensure the accuracy and reliability of comparative genomics results.
Here, we present cogeqc, an R/Bioconductor package that can be used as a toolkit for assessing genome assembly and annotation statistics, orthogroup inference and synteny detection.The package offers context-guided assessments of assembly and annotation statistics by comparing observed values to those of closely related species on the National Center for Biotechnology Information (NCBI), while gene space completeness can be assessed with best universal single-copy orthologs (BUSCOs).The orthogroup inference assessment uses a protein domain-aware orthogroup score to maximise the number of shared protein domains within the same orthogroup.
Finally, the assessment of synteny detection relies on representing anchor pairs as a synteny network and analysing its graph properties.The application of cogeqc to real datasets allowed for an evaluation of multiple parameter combinations for orthogroup inference and synteny detection, providing researchers with guidelines to aid in the selection of the most appropriate parameters for their specific data.

| IMPLEMENTATION
cogeqc is part of the Bioconductor ecosystem and, as such, can be easily integrated with other Bioconductor packages.Input data types are either base R or core Bioconductor classes (e.g.DNAStringSet and AAStringSet objects for DNA and protein sequences, respectively).For integration with external software tools (i.e.BUSCO (Simão et al., 2015) and OrthoFinder (Emms & Kelly, 2019)), we provide users with functions to read and parse their output for downstream analyses in cogeqc.

| Assessing genome assembly and annotation statistics
We propose a context-guided assessment of assembly and annotation statistics that consists in comparing observed values for common metrics (e.g.genome size, contiguity measures, number of genes, etc.) with those of closely related species on the National Center for Biotechnology Information (NCBI).For a particular taxon, the function get_genome_stats() extracts summary assembly and annotation statistics for all genomes on NCBI via the Datasets REST API (https:// www.ncbi.nlm.nih.gov/ datas ets/ ) and returns a data frame with information on 35 variables, such as assembly level, scaffold and contig contiguity measures, number of coding and non-coding genes, and submitter data.In addition to the NCBI-provided statistics, the output data frame includes a variable 'CC ratio' representing the ratio of the number of contigs to the number of chromosome pairs, which has recently been proposed by (Wang & Wang, 2022) as a contiguity measure that compensates for the flaws of N50/L50 and allows cross-species comparisons.
Additionally, users can create a data frame containing assembly and annotation statistics for their own genome projects and pass it to the compare_genome_stats() function along with the output of get_genome_stats().This function will add the user-provided statistics to a distribution of reference statistics from NCBI quality-checked genomes and report the percentile and rank of observed values in the distribution.Of note, statistics for NCBI genomes can also be obtained with the getAssemblyStats() function of the biomartr package (Drost & Paszkowski, 2017), but which is dramatically slower and limited to a single query species.Observed statistics can also be visually compared with reference statistics in publication-ready plots created by the function plot_genome_stats() (Figure 1a).Such context-guided assessments are particularly useful in cases when assembly statistics seem problematic (e.g.genome size is too large, number of genes is too small), but which are in fact due to genomic features of particular taxa, such as smaller genomes for parasitic (Xu et al., 2021), carnivorous (Palfalvi et al., 2020) and aquatic plants (An et al., 2019), and larger genomes for species with higher transposable element contents (Michael, 2014).

| Assessing gene space completeness
To assess gene space completeness, cogeqc relies on the identification of best universal single-copy orthologues (BUSCOs) (Simão et al., 2015).The function run_busco() is a wrapper that takes sequences as input (as FASTA files or as AA/DNA/RNAStringSet objects), runs BUSCO from the R session, and returns a data frame with the frequency of complete (duplicate and single copy), fragmented and missing BUSCOs.Users can also run BUSCO through the command line and read its output as a data frame using the function read_busco().Finally, the function plot_busco() can be used to create publication-ready summary plots for both single-genome and batch modes (Figure 1b,c).

| Assessing orthogroup inference
We developed a protein domain-aware orthogroup assessment score that aims to maximise the number of shared protein domains within the same orthogroup while minimising the number of different orthogroups containing the same protein domain.The rationale for such approach is that genes that share the same protein domain are expected to have evolved from a common ancestor, so they should be assigned to the same orthogroup.Formally, orthogroup scores are calculated as: The homogeneity term is the mean Sorensen-Dice index (Dice, 1945;Sorenson, 1948) for all pairwise combinations of genes in an orthogroup.The Sorensen-Dice index measures how similar two genes are in terms of the protein domains they have, and it ranges from 0 to 1, with 0 meaning that a gene pair does not share any protein domain, and 1 meaning that it shares all protein domains.Formally, where A and B are the set of protein domains associated with genes A and B. Hence, an orthogroup with score 1 would have all genes with the exact same protein domains, while an orthogroup with score 0 would have a different protein domain for each gene.As individual genes in a gene family can lose domains and gain new ones, orthogroup scores can take any value from 0 to 1.
The dispersal term aims to correct for overclustering (i.e.orthogroup assignments that break 'true' gene families into an artificially large number of smaller subfamilies), and it describes the relative frequency of dispersed domains (i.e. the same domain in two or more orthogroups).This term penalises orthogroup assignments with the same protein domains in multiple orthogroups.
We acknowledge that the presence of the same protein domain in multiple orthogroups can occur due to convergent evolution, but since convergent evolution of protein domains is rare (Gough, 2005), we assume that such patterns indicate overclustering of gene families.
Orthogroups inferred with OrthoFinder (Emms & Kelly, 2019) can be read with the function read_orthogroups(), and mean and median scores can be calculated with the function assess_orthogroups().To ensure a higher accuracy in orthogroup assignments, we recommend running OrthoFinder with different Finally, comparative genomics statistics obtained with OrthoFinder can be read as a list of data frames with the function read_orthofinder_stats() and visualised with graphical functions that create publication-ready plots summarising statistics, such as plot_orthofinder_stats(), plot_og_overlap() and plot_og_sizes()

| Assessing synteny detection
We propose a network-based assessment of synteny (or collinearity, used here as synonyms) detection that consists in representing synteny relationships as a graph (i.e. a synteny network) and analysing topological properties of the graph to assess its quality.To infer synteny networks for mammalian and angiosperm genomes, However, there is often a trade-off between the number of nodes and the clustering coefficient, with larger networks being more sparse, and smaller networks being more densely connected.
To account for this trade-off, we use the product of the clustering coefficient and the number of nodes to assess networks.Additionally, as synteny networks and biological networks in general tend to be scale-free (i.e. the degree distribution follows a power-law distribution) (Barabási, 2009;Barabasi & Oltvai, 2004;Ravasz et al., 2002;Venancio et al., 2009;Zhao & Schranz, 2019), we added a term to the network score formula that considers how well the network fits a scale-free topology.Formally, the score of a synteny network is calculated as where C is the network's clustering coefficient, N is the number of nodes, and F is the coefficient of determination (R 2 ) for the scale-free topology fit.

| Orthogroup assignments in different public databases perform equally well
We and HOGENOM (Penel et al., 2009).The rationale for this approach is that domain homogeneity for reference-quality orthogroups should be as high as possible, as indicated by our benchmark with the OrthoBench dataset (Text S2).As the species composition varies across databases, we used Arabidopsis thaliana as a representative species to assess orthogroups.For each database, orthogroups were filtered to include only Arabidopsis genes, and scores were calculated with the function calculate_H() using InterPro domain annotation obtained from PLAZA Dicots 5.0 (Text S3).
We observed that eggNOG orthogroups have lower scores than orthogroups from all other databases (Mann-Whitney U test, p < 0.01).HOGENOM orthogroup scores are higher than OrthoDB scores, but lower than PLAZA.Finally, PLAZA orthogroup scores are higher than all other databases (Text S3, section 3).However, although differences were significant (Mann-Whitney U test, p < 0.01), Wilcoxon effect sizes (r) were small, with the difference between egg-NOG and HOGENOM being the only one with r > 0.1 (Figure 3a).The small effect sizes suggest that the observed differences could be due to large sample sizes, as small p-values can be obtained if the sample size is large enough, even when differences are negligible (Sullivan & Feinn, 2012).Thus, despite some small differences, we conclude that all databases perform equally well in their orthogroup assignments.

| Assessing orthogroup inference under multiple combinations of OrthoFinder parameters
To To test for a possible bias resulting from the species choice, we inspected the distributions of orthogroup scores by OrthoFinder mode for each species separately.We observed that the species choice does not affect scores, as revealed by similar distribution shapes for all species (Figure 3c).
mode (Buchfink et al., 2021).To validate that orthogroup scores are corrected for overclustering, we ran OrthoFinder with mcl = 5 and confirmed that scores are penalised if the granularity is too high (Text S5).
However, large variances in gene copy numbers can result in uninformative parameter estimates, and a common rule of thumb consists in removing orthogroups with ≥100 genes (Mendes et al., 2020).
When comparing OrthoFinder runs with mcl values of 3 and 1.5, we observed a 2.7-fold increase in the percentage of orthogroups with ≥100 genes for the latter (Figure 3f).Hence, to reduce the number of discarded orthogroups, we recommend using OrthoFinder with mcl values of 3.However, since our dataset comprises a single plant family, we advise users to test different parameter combinations for more diverse (e.g.different families and orders) or more restricted (e.g.genus-or population-level) datasets.

| (Re)assessing the effectiveness of OrthoFinder's bit score normalisation
Bit scores are heavily influenced by gene length, which makes them a biased measure of sequence similarity (Emms & Kelly, 2015).To address this issue, OrthoFinder has introduced a normalisation technique that accounts for gene length, removing the correlation between gene length and bit scores (Emms & Kelly, 2015).To verify the effectiveness of OrthoFinder's normalisation, we performed a Spearman correlation test between orthogroups scores and median gene length and observed a significant but weak negative correlation (ρ = −0.19,p < 0.001; Figure 4a).Furthermore, we observed that number of domains in a protein is moderately correlated with its length (ρ = 0.416, p < 0.001; Text S4), indicating that the number of domains can be a confounder.To address that, we calculated partial Spearman's correlation coefficients between orthogroup scores and median gene length while controlling for the number of domains, and we found a spurious correlation between these variables (Figure 4a; Text S4).Collectively, our findings shows that OrthoFinder's bit score normalisation is effective in reducing biases resulting from gene length.

| Functional analyses unveil biological processes associated with rapidly and slowly evolving gene families
We observed that all distributions of orthogroup scores produced  of orthogroup scores into three clusters based on peaks (Figure 4b).
Cluster 1 includes orthogroups with low scores, indicating faster evolutionary rates due to gain and loss of protein domains.Cluster 3 includes orthogroups with high scores, suggesting slower evolutionary rates due to shared protein domains by most or all members.
Cluster 2 includes orthogroups with intermediate scores, indicating neither fast nor slow evolutionary rates.To investigate the functional profiles of each cluster, we conducted enrichment analyses of GO terms, MapMan bins, and InterPro domains using all genes in orthogroups as background.
For cluster 1, we found an enrichment of genes associated with ATP production, water and K + transport, seed oil body biogenesis, and response to nitrate and ethylene (Figure 4c).

| Graph-based assessment of synteny detection with different combinations of parameters
To detect synteny between genomic regions, one must explicitly define the minimum number of genes required to call a syntenic block or segment, and the maximum number of allowed gaps between genes.Here, we used the R package syntenet (Almeida-Silva et al., 2023) to infer synteny networks among Fabaceae genomes on PLAZA 5.0 (Van Bel et al., 2022) with 5 combinations of parameters: a3m25, a5m15, a5m25, a5m35, and a7m25, where 'a' stands for the minimum number of anchors to call synteny, and 'm' stands for the maximum number of gaps between genes (Text S6).We inferred species-specific (i.e.intragenomic) networks and a full network (i.e. all pairwise species comparisons).
For the full network, scaled network scores were very similar, but the parameter combination a3m25 resulted in the best synteny network, with the largest number of nodes, scale-free topology fit, and overall score (Figure 5a).Interestingly, the network obtained with the default parameter combination of the MCScanX algorithm, a5m25, had the lowest score.When analysing each species' network separately, we observed that the best parameter combination depends on the species (Figure 5b), highlighting the importance of performing assessments for each dataset.The combinations a7m25 and a5m15 are typically the worst, leading to zero scores in some species due to clustering coefficients of zero, while a3m25, a5m25, and a5m35 lead to the best scores in 45%, 33%, and 22% of the species, respectively (Figure 5b).Finally, we observed that the parameter combination that leads to the highest score in most species-specific networks (a3m25) is also the best combination for the full network (with all species) (Figure 5b).
Although it suggests that this combination tends to be the best in most cases, we advise users to test multiple combinations whenever possible.

| CON CLUS ION
cogeqc is an R package that can be used to assess the quality of genome assembly and annotation in a phylogenetic context, and to assess the quality of orthogroup inference and synteny detection.

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
To ensure full reproducibility, all code and data used in this manuscript are available in a GitHub repository at https:// github.com/ almei dasil vaf/ cogeqc_ paper , and the code used in benchmarks are available in Supplementary Texts.A Zenodo archive of all data and code is also available (Almeida-Silva, 2023).

1
Summary of publication-ready plots that can be created with graphical functions in cogeqc.(a) Summary assembly and annotation statistics for all Zea mays genomes on the NCBI obtained with the function plot_genome_stats().Red data points in the 'Sequence length' and 'Gene count total' panels represent simulated observed values passed by the user.(b) BUSCO summary statistics for a single genome obtained with the function plot_busco().The data used in this figure are a BUSCO output for the Ostreococcus tauri genome.(c) BUSCO summary statistics for multiple genomes obtained with the function plot_busco().The data used in this figure are a BUSCO output in batch mode for the bacteria Herbaspirillum seropedicae and Herbaspirillum rubrisubalbicans.(d) Summary comparative genomics statistics from OrthoFinder obtained with the function plot_orthofinder_stats().(e) Heatmap of orthogroup overlap for pairwise species comparisons as obtained with the function plot_og_overlap().All data used to create the figures are distributed with the package as example datasets.and comparing the distributions of orthogroup scores in each run to select the best.Alternatively, if reference and reliable orthogroup assignments exist, users can compare their predicted orthogroups with reference orthogroups by using the function compare_orthogroups(), which can show the percentage of reference orthogroups that are preserved in predicted orthogroups.

(
Zhao & Schranz, 2019) have run a synteny detection algorithm with multiple combinations of parameters and selected the best combination based on the clustering coefficient and number of nodes of each network.Ideally, a synteny network should have a large number of nodes (i.e.anchor pairs, duplicated genes retained from a large-scale duplication event) and a high clustering coefficient.

(
Almeida-Silva et al., 2023), and their scores can be calculated with the function assess_synnet(), which returns a data frame with the network's score and the observed values for each term of the formula above (C, N, and F).If users have multiple networks stored in a list, the function assess_synnet_list() can calculate scores for multiple networks at once.

3
| RE SULTS AND D ISCUSS I ON 3.1 | Assessing the completeness of Chlorophyta genomes To demonstrate the usage of the functions to assess gene space completeness, we obtained genome sequences for all Chlorophyta genomes on Pico-PLAZA 3.0 (N = 16) (Van Bel et al., 2022) and calculated their BUSCO scores (Text S1).All genomes were stored in the same directory and the function run_busco() was used to run BUSCO in batch mode using the lineage dataset chlorophyta_odb10.BUSCO scores were visualised used the protein domain-aware orthogroup assessment implemented in cogeqc to assess orthogroup assignments in public databases, namely PLAZA Dicots 5.0(Van Bel et al., 2022), OrthoDB(Kuznetsov et al., 2023), eggNOG (Hernández-Plaza et al., 2023), infer orthogroups, OrthoFinder relies on similarity searches with DIAMOND(Buchfink et al., 2021), followed by normalisation of bit Score = CNF, scores by gene length, and graph-based clustering using Markov clustering (MCL).By default, OrthoFinder runs DIAMOND in default mode, which is faster, but less accurate than its ultra-sensitive mode.To cluster genes into orthogroups, an MCL inflation parameter of 1.5 is used by default, with lower values resulting in a smaller number of larger clusters (i.e.low granularity), and higher values resulting in a greater number of smaller clusters (i.e.high granularity).To investigate the effect of different parameter combinations on orthogroup homogeneity, we downloaded the proteomes of 25 Brassicaceae species from PLAZA 5.0, Phytozome v13, BRAD and CoGe(Cheng et al., 2011;Goodstein et al., 2012;Lyons et al., 2008;Van Bel et al., 2022), and ran OrthoFinder with eight combinations of parameters by changing the DIAMOND mode (standard vs. ultra-sensitive mode) and the MCL inflation (mcl = 1, 1.5, 2, and 3).Orthogroup scores for each OrthoFinder run were obtained with the functionassess_orthogroups() (Text S4).A global comparison of the distributions of orthogroup scores shows that using an mcl = 1 dramatically reduces homogeneity scores as compared to every other mcl value (Mann-Whitney U test, p < 0.01; Figure3b).Orthogroup scores for the default OrthoFinder mode (standard DIAMOND, mcl = 1.5) are much larger than runs with mcl = 1, both with standard and ultra-sensitive DIAMOND mode (Mann-Whitney U test, p < 0.001, effect size >0.3;Figure3b).

Furthermore
Figure 3d).Wilcoxon effect sizes for the comparisons between mcl values of 1.5 and above are small (r < 0.1; tables 3 and 4 in Text S4), suggesting that significant differences could be due to large sample sizes (N > 16,000).Likewise, comparisons of orthogroup scores for OrthoFinder runs with different DIAMOND modes revealed significant but negligible differences (p < 0.05, r < 0.04; Figure3e; table5in Text S4), indicating that changing DIAMOND modes has little impact in orthogroup scores and, hence, should not be a concern in orthogroup detection.This has been suggested by DIAMOND developers in their original manuscript(Buchfink et al., 2021), but we now confirm it with supporting data.Thus, we recommend running OrthoFinder with the standard DIAMOND mode, because it offers a 100-fold increase in speed compared to the ultra-sensitive by different OrthoFinder parameters have a similar shape.Using the default OrthoFinder mode, we were able to divide the distribution F I G U R E 4 Validation of OrthoFinder's bit score normalisation and functional analyses of orthogroups clusters.(a) Relationship between orthogroup scores and orthogroup median length.Spearman's correlation coefficient (ρ), partial Spearman's correlation coefficient (ρ partial ), and p-value are indicated in the bottom left part of the plotting area.(b) Distribution of orthogroup scores for an OrthoFinder run with default parameters.Peaks were used to split the distribution in three clusters, with boundaries indicated by dashed red lines.(c) Tree plot of enriched functional terms for each orthogroup cluster.Enrichment analyses were performed with clusterProfiler (Wu et al., 2021) and visualised with enrichplot.Terms are grouped in a tree-like structure based on semantic similarities calculated with the function pairwise_ termsim() from the enrichplot package.F I G U R E 5 Assessment of synteny detection in Fabaceae species with different combinations of parameters.(a) Network scores for a synteny network containing all Fabaceae species.(b) Network scores for species-specific synteny networks.Networks with zero scores are due to clustering coefficients of 0. Colours represent different combinations of parameters.
The package was designed to be user-friendly, easily integrate with other packages and software tools, and provide summary results as publication-ready plots.Applications to real datasets demonstrated how the package can be used to select optimal parameters in orthogroup inference and synteny analyses, providing users with general guidelines.AUTH O R CO NTR I B UTI O N SFabricio Almeida-Silva and Yves Van de Peer conceived the ideas and designed methodology.Fabricio Almeida-Silva collected the data.Fabricio Almeida-Silva and Yves Van de Peer analysed the data.Fabricio Almeida-Silva and Yves Van de Peer led the writing of the manuscript.All authors contributed critically to the drafts and gave final approval for publication.ACK N O WLE D G E M ENTSYvesVan de Peer acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (no.833522).Yves Van de Peer and Fabricio Almeida-Silva acknowledge funding from Ghent University (Methusalem funding, BOF.MET.2021.0005.01).