BUSCO: Assessing Genomic Data Quality and Beyond

Evaluation of the quality of genomic “data products” such as genome assemblies or gene sets is of critical importance in order to recognize possible issues and correct them during the generation of new data. It is equally essential to guide subsequent or comparative analyses with existing data, as the correct interpretation of the results necessarily requires knowledge about the quality level and reliability of the inputs. Using datasets of near universal single‐copy orthologs derived from OrthoDB, BUSCO can estimate the completeness and redundancy of genomic data by providing biologically meaningful metrics based on expected gene content. These can complement technical metrics such as contiguity measures (e.g., number of contigs/scaffolds, and N50 values). Here, we describe the use of the BUSCO tool suite to assess different data types, ranging from genome assemblies of single isolates, assembled transcriptomes, and annotated gene sets to metagenome‐assembled genomes where the taxonomic origin of the species is unknown. BUSCO is the only tool capable of assessing all these types of sequences from both eukaryotic and prokaryotic species. The protocols detail the various BUSCO running modes and the novel workflows introduced in versions 4 and 5, including the batch analysis on multiple inputs, the auto‐lineage workflow to run assessments without specifying a dataset, and a workflow for the evaluation of (large) eukaryotic genomes. The protocols further cover the BUSCO setup, guidelines to interpret the results, and BUSCO “plugin” workflows for performing common operations in genomics using BUSCO results, such as building phylogenomic trees and visualizing syntenies. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
With the advances of high-throughput sequencing technologies and their decreasing costs, the production of genomic data is increasing at an exponential rate. Large-scale genome projects such as the Earth BioGenome Project (Lewin et al., 2018), the Darwin Tree of Life Project (Threlfall & Blaxter, 2021), the 5,000 Arthropod Genomes Project (i5K Consortium, 2013), the 10,000 Plant Genomes Project (10KP; Cheng et al., 2018), the GEM (Genomes from Earth's Microbiomes) catalog (Nayfach et al., 2021), and the European Reference Genome Atlas (ERGA) aim at sequencing hundreds of thousands of new genomes in the coming years from all kingdoms of life. Additionally, many other individual studies are generating genomic resources at an accelerated pace, including hundreds of thousands of microbial metagenome-assembled genomes (MAGs). Despite the advances in computational methods, the generation of genomic resources can still be challenging (Richards, 2018; Sohn & Nam, 2018). Contamination and technical issues during sample preparation or assembly procedures can occur and complicate the generation of high-quality genomic resources, for example in the case of highly complex and heterogeneous metagenomes, species with large genome sizes, or highly heterozygous samples. Intuitive metrics, along with easy-to-use tools, are needed to perform quality assessments on such a deluge of genomic resources. Quality assessments of draft data are required while generating these resources, to identify possible issues in time to solve them before finalizing the data, for example to identify redundancies in a draft genome assembly due to technical issues that can occur during de novo assembly procedures. Knowledge of the quality state of genomic resources is also required before attempting to perform subsequent or comparative analyses, to ensure that the quality of the input data is sufficient for the analysis of interest.
This ensures an unbiased interpretation of the results and allows possible caveats and limitations to be considered if sub-optimal resources are included in the analyses. Quality control is therefore of paramount importance. For genome assemblies, metrics such as contig/scaffold counts and contig/scaffold N50 values (meaning that half of the total assembly span is made up of contigs/scaffolds of length N50 or longer) only offer a summary technical view of genome assembly contiguity. Therefore, there is a necessity for tools that can accurately and quickly assess the readiness of genomic resources being generated for downstream analyses. In an effort to provide researchers with the ability to evaluate their genomic resources, the Benchmarking Universal Single-Copy Orthologs (BUSCO) tool provides evolutionarily sound measures of completeness and redundancy in terms of expected gene content (Manni, Berkeley, Seppey, Simão, & Zdobnov, 2021; Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015; Waterhouse et al., 2018). This is achieved by considering the presence of a predefined and expected set of single-copy marker genes as a proxy for genome-wide completeness. BUSCO implements such quantifications not only for assembled genomes but also for transcriptomes and annotated protein-coding gene sets, and they can be applied to both prokaryotic and eukaryotic data. BUSCO identifies matches to sets of genes that are present as single-copy orthologs in a given taxonomic group, i.e., genes that are evolving under "single-copy control" and are thus expected to be present as single-copy genes in the majority of any newly sequenced species from the same taxonomic group. This expectation is defined empirically by identifying near-universal single-copy orthologs in major species clades with numerous sequenced and annotated genomes using the OrthoDB catalog of orthologs (http://www.orthodb.org).
Exceptions are expected to occur due to gene duplications or losses that can take place along the evolution of a single species within a specific clade. When generating BUSCO datasets, the thresholds used to filter for genes that are almost always present as single-copy genes across a given clade are set to account for these exceptions, and to further consider the incompleteness of most of the currently available assembled genomes and predicted gene sets, such as genes that are missing exclusively due to technical reasons. The latest versions of the BUSCO datasets (*_odb10; Manni et al., 2021) include 67 eukaryotic, 83 bacterial, 16 archaeal, and 27 viral datasets (see Supplementary Table 1 in Supporting Information). The datasets are bound to the software version, so that the odb10 version datasets are intended to be used with BUSCO v4/v5. The protocols in this manuscript describe the step-by-step commands to use BUSCO on various data types from different taxonomic origins and describe the use of the novel workflows introduced in BUSCO v4/v5. Based on the input sequence type (genome assembly, annotated gene set, or transcriptome assembly) and the taxonomic domain of origin (Bacteria, Archaea, or Eukaryota), different workflows built upon several third-party tools, each performing one step of the global analysis, are executed. In this article, we have detailed the protocols for running all the currently implemented BUSCO workflows. Basic Protocol 1 describes the steps for the assessment of a single input file (either a genome assembly, annotated gene set, or transcriptome assembly) with a known taxonomic origin. Basic Protocol 2 describes the analysis of an input sequence without specifying a dataset for the assessment, which enables the evaluation of sequences with unknown taxonomic origin.
Basic Protocol 3 describes the extension of Basic Protocols 1 and 2 to analyze multiple inputs, and can also be applied to metagenomic bins or MAGs from both prokaryotic and eukaryotic species. The Alternate Protocol describes alternative steps to increase the speed of the analysis described in Basic Protocol 3 by using the Snakemake workflow management system (Mölder et al., 2021). Support Protocol 1 describes the various ways to set up a BUSCO installation. Support Protocol 2 is intended for the visualization of BUSCO results, e.g., for plotting the position of the identified single-copy markers on genome assemblies, or visualizing syntenies between multiple assemblies. Support Protocol 3 describes the use of single-copy genes identified with BUSCO as markers for building phylogenomic trees. The input files and scripts required to follow these protocols are available at https://gitlab.com/ezlab/busco_protocol and can be obtained with:

$ git clone https://gitlab.com/ezlab/busco_protocol.git

In general, a command to run a BUSCO assessment with a manually specified dataset takes four main mandatory arguments:

-i (or --in): the path to your FASTA file, which is either a nucleotide FASTA file or a protein FASTA file, depending on the BUSCO mode of assessment. From v5.1.0, the input argument can also be a directory containing FASTA files to run the analysis on multiple inputs (see Basic Protocol 3).

-l (or --lineage): the manually picked dataset for the assessment. It can be a dataset name, i.e., bacteria_odb10, or a relative (e.g., ./bacteria_odb10) or full (e.g., /home/user/bacteria_odb10) path. It is recommended to use the dataset name alone, as in this way BUSCO will automatically download and check the version of the dataset. In the case of a path, the dataset found in the given path will be used. Note that this argument is ignored when running with the auto-lineage workflow (see Basic Protocol 2).
-m (or --mode): which analysis mode to run, either genome (or geno), proteins (or prot), or transcriptome (or tran).

-o (or --out): the name of the output directory that identifies the analysis run and that contains all results, logs, and intermediate data.

In the following examples, we are going to analyze the genome assembly and annotated gene set of the yeast Torulaspora globosa (assembly accession: GCF_014133895.1). We first need to select an appropriate dataset according to the taxonomy of this species. The lineage datasets used for BUSCO assessments are not packaged with the software; BUSCO will automatically download the required dataset when running the assessment. On the first BUSCO run, a busco_downloads/ folder will be created, containing the necessary lineage datasets within the subdirectory lineages/ (see Critical Parameters for more details on BUSCO datasets).
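Putting the four mandatory arguments together, a generic invocation has the following shape; the input file, dataset, and output names below are placeholders, not values taken from the text:

```shell
# Generic BUSCO invocation (placeholder names; assumes a working BUSCO installation)
busco -i assembly.fna -l bacteria_odb10 -m genome -o busco_output
```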
2. Manually select a dataset for the analysis. The most specific BUSCO dataset available for the yeast Torulaspora globosa is saccharomycetes_odb10; you can explore all available odb10 datasets with the --list-datasets option. Here we are using the most specific dataset to perform the assessment with the highest possible resolution. You could also assess this genome with any other dataset matching the lineage of the organism, which in this case, according to NCBI taxonomy (Schoch et al., 2020), would include the broader ascomycota_odb10, fungi_odb10, and eukaryota_odb10 datasets.
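The dataset listing mentioned above can be obtained with BUSCO's --list-datasets option (assuming a working BUSCO installation):

```shell
# Print all available odb10 lineage datasets
busco --list-datasets
```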

Running BUSCO on a single genome assembly with known taxonomy
In this example, we are going to assess the genome assembly of T. globosa using the saccharomycetes_odb10 dataset. In the downloaded repository, you can find this assembly at busco_protocol/protocol1/Tglobosa_GCF_014133895.1_genome.fna.
3. Enter the busco_protocol testing folder, where {path_to_busco_protocol_folder} is the path to the busco_protocol/ testing folder you downloaded.
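A sketch of this step; the placeholder must be replaced with your actual download location:

```shell
# Move into the downloaded protocol folder (placeholder path)
cd {path_to_busco_protocol_folder}
```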
4. To launch an assessment of a genome assembly (which can be in the form of contigs, scaffolds, or chromosomes) the genome mode needs to be specified. Run the following command:
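A sketch of the step-4 command; the input path follows the repository layout described above, the output name matches the busco_out_Tglob_genome/ folder referenced later in this protocol, and the -c (CPU count) flag is an assumption consistent with the 12-CPU runtime estimate:

```shell
# Assess a genome assembly with a manually specified lineage dataset
busco -i ./protocol1/Tglobosa_GCF_014133895.1_genome.fna \
      -l saccharomycetes_odb10 \
      -m genome \
      -o busco_out_Tglob_genome \
      -c 12
```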

By default, the --download_path argument, with which you can specify the location on your machine where you wish to store the downloaded datasets and additional files required to run the software, and the --out_path argument, with which you can specify the location on your machine where you wish to store the output folder, are set to the current working directory. Feel free to change these paths to your preferred location on your machine. For directions on using additional parameters, see the Critical Parameters and Advanced Parameters sections in Commentary. If you download a dataset manually, you then need to unpack and decompress it before running BUSCO, e.g., with:

$ tar -xvf arthropoda_odb10.2020-09-10.tar.gz

Running this assessment on 12 CPUs with otherwise default options should take approximately 1 min. After BUSCO has completed the run, the summary scores are printed to the standard output and reported in the short_summary*.txt file that you can find in the main output folder, in this example busco_out_Tglob_genome/. The summary file is named after the dataset used for the assessment and the output name. In this case, it will be named short_summary.specific.saccharomycetes_odb10.busco_out_Tglob_genome.txt. This text file contains the classification of the identified BUSCO markers into categories of Complete (C), Complete and single-copy (S), Complete and duplicated (D), Fragmented (F), and Missing (M) BUSCOs as percentages and counts, and additional information such as the dataset used and the versions of the dependencies. In our example, BUSCO evaluates this genome assembly as of high quality, i.e., containing almost all the expected single-copy genes with a low duplication score. In the output folder, you can also find the logs/ subfolder containing the logs of BUSCO and its dependencies, and the run_<odb_dataset_name>/ folder (in this case run_saccharomycetes_odb10) containing all the other outputs and intermediate files.
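The one-line score near the top of the summary file follows a fixed format, sketched below with placeholder percentages (these are not the actual results of this run); n is the total number of markers in saccharomycetes_odb10:

```text
C:xx.x%[S:xx.x%,D:x.x%],F:x.x%,M:x.x%,n:2137
```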
See Guidelines for Understanding Results for details on all BUSCO outputs and their interpretation. When assessing a prokaryotic species (i.e., by using a prokaryotic dataset) or a virus, BUSCO uses the gene predictor Prodigal (Hyatt et al., 2010).

An assessment on the 10-Gbp genome assembly of the wheat Triticum dicoccoides with the poales_odb10 dataset (4896 markers) using 50 CPUs takes approximately 9 hr with the BUSCO_Metaeuk workflow, whereas it takes several days with the BUSCO_Augustus alternative. The two workflows yield comparable but not identical results, as they are based on different methodologies; for details see Manni et al. (2021). Depending on the input size and your specific needs, you may opt for the BUSCO_Augustus workflow, e.g., if you need to obtain AUGUSTUS retraining parameters (also see Suggestions for Further Analyses).

Running BUSCO on a single annotated protein-coding gene set with known taxonomy
The command for analyzing an annotated gene set is similar to that used for genome assemblies, except for the --mode argument, which is set to proteins (or prot). The input file is a FASTA file of the amino acid translations of the protein-coding genes. To properly evaluate the amount of BUSCO gene duplications (which can be due to technical artifacts or true duplications), isoforms must be filtered from the gene set before running the assessment. Otherwise, each BUSCO gene with isoforms will score as a duplicated BUSCO. There are different criteria to select isoforms; for our purposes we selected the longest isoform per gene (steps in Support Protocol 3 also describe how to select the longest isoform per gene using a GFF file and the corresponding genome).
5. For analyzing the gene set of the yeast T. globosa using the manually specified dataset, run the BUSCO command in proteins mode. The result folder looks similar to the one obtained for assessing genome assemblies, but without the intermediate files related to the gene prediction steps. The gene set analysis is faster than the analysis of the corresponding genome assembly.
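A sketch of this proteins-mode command; the protein FASTA file name and output name below are assumptions about the repository layout, not values taken from the text:

```shell
# Assess an annotated gene set (protein FASTA); file and output names are placeholders
busco -i ./protocol1/Tglobosa_GCF_014133895.1_protein.faa \
      -l saccharomycetes_odb10 \
      -m proteins \
      -o busco_out_Tglob_proteins
```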

Running BUSCO on a single transcriptome assembly with known taxonomy
Transcriptome assessments are launched with the same four mandatory options used for assessing genome assemblies and gene sets. You just need to change the value of the --mode (or -m) argument to transcriptome (or tran) and provide a transcriptome assembly as input file. The transcriptome mode requires MetaEuk when assessing eukaryotes and tBLASTn when assessing prokaryotes (see Support Protocol 1 for more details on BUSCO dependencies). As with gene sets, to obtain true estimates of the number of duplicated BUSCOs, transcriptomes should be pre-processed to select just one representative transcript per gene. Here we analyze an assembled transcriptome from the female reproductive tract of the Asian tiger mosquito Aedes albopictus (NCBI BioProject: PRJNA223166; Boes et al., 2014) using the diptera_odb10 dataset. For this analysis, the transcriptome was not pre-processed to remove isoforms, and thus we may obtain a high number of duplicated BUSCOs that simply reflects the presence of different isoforms.
6. To launch the transcriptome assessment, run the BUSCO command in transcriptome mode (-m tran). The main content of the result folder is similar to the one obtained for assessing genome assemblies, with some differences (see the BUSCO output subsection in the Guidelines for Understanding Results). If the transcriptome comes from a specific organ/tissue or time point, it may contain only a fraction of the full BUSCO sets. Our test transcriptome harbors approximately 38% of the 3285 diptera_odb10 markers, which is not surprising given the specialized function of the tissue under consideration. See Guidelines for Understanding Results for details on the output files and their interpretation.
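A sketch of this transcriptome-mode command; the input file name and output name below are placeholders (assumptions, not taken from the text):

```shell
# Assess a transcriptome assembly; input and output names are placeholders
busco -i ./protocol1/Aalbopictus_transcriptome.fna \
      -l diptera_odb10 \
      -m transcriptome \
      -o busco_out_Aalb_transcriptome
```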
In v5, BUSCO also takes advantage of the MetaEuk program to find BUSCO markers in transcriptomes.

Plotting BUSCO scores
It is common to plot BUSCO scores as a bar chart from a single run, or from different runs side-by-side to visualize like-for-like comparisons, e.g., of different species/strains or assembly versions. To encourage the use of a standard and distinctive color scheme in publications, BUSCO includes a dedicated R (R Core Team, 2020) script that produces a figure along with its source code, which can be further edited to customize the resulting bar chart (labels, fonts, axes, etc.). To run the script, Python is required. To produce the image, R and the ggplot2 library need to be available on the system. To plot the scores of one or multiple runs, it is sufficient to provide in a single folder the short_summary.txt files of each BUSCO run.
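A sketch of this plotting step, assuming BUSCO's bundled generate_plot.py script is available on your path; the BUSCO_summaries folder name and the copied run folder are arbitrary examples:

```shell
# Gather short_summary files from one or more runs into a single folder, then plot
mkdir -p BUSCO_summaries
cp busco_out_Tglob_genome/short_summary.*.txt BUSCO_summaries/
python3 generate_plot.py -wd BUSCO_summaries
```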

ASSESSING AN INPUT SEQUENCE WITH A DATASET AUTOMATICALLY SELECTED BY BUSCO
This protocol describes the case in which you do not specify a BUSCO dataset for the assessment. Instead, BUSCO attempts to automatically select the most appropriate dataset on the basis of a phylogenetic placement onto precomputed trees of marker genes extracted from the input file. This workflow is activated by using the --auto-lineage flag, and can be applied to all data types (genomes, gene sets, and transcriptomes). With this option, no assumptions are made on the taxonomic origin of the input sequences, and it can be useful for inputs with unknown taxonomic origin, as in the case of metagenome-assembled genomes (MAGs), or on large sets of inputs where specifying each dataset manually can be tedious (see Basic Protocol 3 for running this workflow on multiple inputs). If the user knows the domain of origin of the input, the --auto-lineage-prok or --auto-lineage-euk flags can be specified for prokaryotic and eukaryotic species, respectively. BUSCO will run faster when one of these two options is selected instead of the full --auto-lineage. In general, if the taxonomy of the species is known, it is preferable to specify the dataset manually, as the time and resources needed increase substantially with the phylogenetic placement. Also, the placement procedure may resort to a broader, less-specific dataset for the assessment if not enough markers are placed on the tree to select the most specific one. In this case, the selection may roll back to one of the general domain datasets (i.e., archaea_odb10, bacteria_odb10, or eukaryota_odb10). See the Guidelines for Understanding Results for more details on the automatic selection of the datasets.

Necessary Resources Hardware
As described in Basic Protocol 1. Note that, usually, the RAM required for using the --auto-lineage option will not exceed 13 GB when assessing prokaryotic species; thus, the assessment can also run on a laptop. If the placement involves eukaryotic species, you might need more memory to run the assessment.

Software
As described in Basic Protocol 1. Note that for running the auto-lineage workflow you need a working installation of SEPP and pplacer.

Files
As described in Basic Protocol 1. The auto-lineage workflow can be applied to genome assemblies, gene sets, and transcriptomes. To perform the phylogenetic placement procedure, additional files are required, e.g., the precomputed super-alignments and trees. These are automatically downloaded by BUSCO during the analysis if a connection to the Internet is available. If not, you will have to manually download these files and place them in the busco_downloads/ folder as described in the annotation to step 4 of Basic Protocol 1.

Running BUSCO on one input without specifying a lineage dataset
In this case, you do not need to specify a lineage dataset. BUSCO will first run the three root datasets (bacteria_odb10, archaea_odb10, and eukaryota_odb10) to figure out the domain of the input file, and then attempt to phylogenetically place the markers extracted in this first run on a precomputed superalignment and phylogenetic tree.
1. In this example, we are going to assess the same genome assembly of T. globosa assessed in Basic Protocol 1, but this time without specifying the saccharomycetes_odb10 dataset. Enter the busco_protocol testing folder downloaded in Basic Protocol 1.

2. Run the BUSCO command with the --auto-lineage flag instead of the -l argument. Running this assessment on 12 CPUs with otherwise default options should take approximately 4 min, which is longer than the time needed to run the assessment described in Basic Protocol 1 when the dataset was manually specified (∼1 min), as additional steps are required here for the phylogenetic placement procedure. You can enter the output directory busco_out_Tglob_genome_auto/ to check the results and inspect which dataset was automatically selected for the assessment. If the run was successful, you should find the summary file short_summary.specific.saccharomycetes_odb10.busco_out_Tglob_genome_auto.txt, named by the same rule described in Basic Protocol 1. As you can see, BUSCO was able to select the correct and most specific dataset for the assessment (saccharomycetes_odb10), corresponding to the manual choice we made in Basic Protocol 1. Additionally, you will find the file short_summary.generic.eukaryota_odb10.busco_out_Tglob_genome_auto.txt, which corresponds to the short_summary file obtained by running the parent "root" domain dataset (in this case eukaryota_odb10) as the first step of the auto-lineage workflow. See Guidelines for Understanding Results for more information on the outputs produced by the auto-lineage workflow and how to interpret the results. In this example, you could specify the --auto-lineage-euk option instead of --auto-lineage, as we already know that our input file belongs to a eukaryotic species. You can apply the auto-lineage workflow to all the input types by selecting the different modes covered in Basic Protocol 1.
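A sketch of these two steps; the placeholder path, the input file location, and the -c flag are assumptions consistent with Basic Protocol 1, while the output name matches the busco_out_Tglob_genome_auto/ folder referenced in the text:

```shell
# Enter the protocol folder (placeholder path), then run with automatic lineage selection
cd {path_to_busco_protocol_folder}
busco -i ./protocol1/Tglobosa_GCF_014133895.1_genome.fna \
      --auto-lineage \
      -m genome \
      -o busco_out_Tglob_genome_auto \
      -c 12
```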

ASSESSING MULTIPLE INPUTS
BUSCO v5 and higher can be run in "batch" mode on a collection of input files. These must be present in a single folder and must be of the same type, i.e., either all genome assemblies, all transcriptomes, or all gene sets. As for analyzing a single input file, there are two ways to run BUSCO on multiple files: a) with a BUSCO dataset manually specified (described in Basic Protocol 1), in which case the same dataset will be applied to all the input files, e.g., a set of insect genomes analyzed with the insecta_odb10 dataset; or b) in combination with the auto-lineage workflow (also see Basic Protocol 2), in which case each input file is analyzed independently with a specific dataset automatically selected by BUSCO, and which can thus be applied to a set of taxonomically heterogeneous inputs, such as metagenomic data that include MAGs of both eukaryotic and prokaryotic origin.

Necessary Resources Hardware
As described in Basic Protocols 1 and 2.

Software
As described in Basic Protocols 1 and 2.

Files
As described in Basic Protocols 1 and 2.

Running BUSCO on multiple inputs using the same dataset specified manually
Collect all the input files you want to analyze in a folder, and simply specify this folder as input. Here we analyze a set of bacterial genomes with the mycoplasmatales_odb10 dataset. You can find the folder containing the input genomes in the subfolder ./protocol3/bact_genomes/.
1. Open the terminal and run the BUSCO command, giving the folder as input. BUSCO will figure out automatically that it must run on multiple files. For each run, a standard BUSCO result folder named after the input file will be created in the main output folder. In the main output folder, an additional text file summarizing the scores of all the runs is provided. This batch mode can be applied to the other two data types by changing the value of the -m option.
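A sketch of this batch command; the input folder matches the path given above, while the output name is a placeholder:

```shell
# Batch mode: the input is a folder of FASTA files, all assessed with the same dataset
busco -i ./protocol3/bact_genomes/ \
      -l mycoplasmatales_odb10 \
      -m genome \
      -o busco_out_bact_genomes
```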

Running BUSCO on multiple inputs without specifying a lineage dataset
As with the previous command, collect all the input files you want to analyze in a folder. This protocol is suitable for estimating the quality of metagenomic bins or "metagenome-assembled genomes" (MAGs) of unknown taxonomic origin, e.g., obtained from binner programs such as MetaBat (Kang et al., 2019), MaxBin2 (Wu & Singer, 2021), and MEGAN (Bagcı, Patz, & Huson, 2021; Huson et al., 2018). In such cases, the input is a folder containing one FASTA file per bin. Here we analyze a mix of prokaryotic and eukaryotic genomes with known taxonomy, so we can compare the dataset selected by BUSCO with the ground truth represented by the taxonomic lineage of the inputs. As each input sequence can come from either a prokaryotic or a eukaryotic species, we use the more general --auto-lineage flag to cover both cases.
2. Uncompress the fasta files:

$ cd ./protocol3/genomes_mix/ && gunzip *.gz && cd ../../

3. Run the auto-lineage workflow on the genomes_mix/ folder. As with the previous example, the result folders for each run are written to the specified output directory (here busco_out_mix) and named after the corresponding input file name. An additional text file summarizing the scores of all the runs is also written in the main output directory. Table 1 shows an example of the summary file listing the BUSCO results for all the inputs analyzed. The first column reports the input file name; the second reports the most specific dataset that was selected by BUSCO on the basis of the placement procedure; the third to eighth columns report the standard BUSCO scores as percentages; and the ninth column reports the total number of markers for the corresponding dataset. In our example, the inputs are a mix of bacterial genomes, one green algal genome, and two fungal genomes. The datasets selected by BUSCO are in agreement with the taxonomic lineages of the inputs. The table additionally reports the scores obtained by running the three main domain datasets used to establish the most likely domain of the input file during the first step of the workflow. These scores might be useful for spotting heavy cross-domain contamination issues. However, note that these "by-product" scores need to be carefully interpreted (see Guidelines for Understanding Results for how to interpret these scores). This workflow can be run on the other data types (gene sets and transcriptomes) by changing the value of the -m argument (see Basic Protocol 1). This analysis should take approximately 30 min to complete when using 12 CPUs. On small to medium-sized genomes, increasing the number of CPUs will not speed up the analysis substantially, as the current implementation of parallelization in BUSCO is not optimized for small genomes.
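A sketch of the step-3 command; the input folder and the busco_out_mix output name come from the text, while the -c flag is an assumption matching the 12-CPU runtime estimate:

```shell
# Batch auto-lineage run on a taxonomically mixed folder of genomes
busco -i ./protocol3/genomes_mix/ \
      --auto-lineage \
      -m genome \
      -o busco_out_mix \
      -c 12
```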
To speed up the analysis when analyzing a large number of prokaryotic or small eukaryotic genomes, you can take advantage of a workflow management system as described in Alternate Protocol.

DECREASING ANALYSIS RUNTIME WHEN ASSESSING A LARGE NUMBER OF SMALL GENOMES WITH BUSCO AUTO-LINEAGE WORKFLOW AND SNAKEMAKE
This protocol describes the use of a workflow management system, specifically Snakemake, to increase the speed of a BUSCO auto-lineage analysis on multiple genomes, and it is an alternative to Basic Protocol 3. Currently, BUSCO parallelization is not optimized for small genomes; therefore, using a workflow management system such as Snakemake with multiple CPUs can considerably reduce the runtime for analyses involving a large number of inputs, e.g., when assessing metagenomic bins or MAGs. Note that this Alternate Protocol is intended for analyzing prokaryotic genomes, small eukaryotic genomes, or a combination of both. It is not intended to be used on medium/large genomes, as for these the parallelization is already optimized in the standard BUSCO run.

Necessary Resources Hardware
As described in Basic Protocol 3.

Files
As described in Basic Protocol 1.

Running multiple BUSCO assessments with the auto-lineage workflow and Snakemake
To prevent collisions of simultaneous BUSCO runs attempting to write to the same location when using this alternate protocol, BUSCO will be run with the --offline flag, which does not allow BUSCO to connect to the Internet. Therefore, before running this workflow, you need to make sure to have all the required BUSCO datasets and files downloaded and updated on your machine.
In this example, we assess a mix of prokaryotic genomes, so we can just download the prokaryotic datasets.
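Step 1 below relies on BUSCO's bulk --download option; a sketch, where the storage folder name is an arbitrary assumption:

```shell
# Download all prokaryotic lineage datasets into a chosen folder (folder name arbitrary)
mkdir -p BUSCO_downloads && cd BUSCO_downloads
busco --download prokaryota
```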
1. To download the datasets in bulk, enter the location on your machine where you want to store them and run BUSCO's --download option.

If you have exactly followed the previous commands, the relative paths in the config.yaml file will be as follows:

wdir_path: "."
data_path: "../../alternate_protocol1/genomes_mix" # specify the path to the folder
file_suff: ".fna" # specify the suffix of your input files
out_path: "busco_out_mix_alter_protocol" # specify the output folder
# configuration
BUSCO_threads: 5
autolineage: "--auto-lineage-prok" # or "--auto-lineage" or "--auto-lineage-euk"
download_path: "{path/where/you/stored/BUSCO_downloads}" # specify the path of the busco_downloads folder where the datasets and files are stored

6. In this example, we are going to use a job submission management system to submit our Snakemake workflow to a cluster. In the following example, we use SLURM, but a similar command can be launched with other submission management systems such as PBS. From the current working directory (i.e., BUSCO_batch_analysis_with_snakemake/), you may run:

$ snakemake --cluster "sbatch --ntasks-per-node=1 --cpus-per-task=5 --job-name=busco_runs" --jobs 6

Here the --jobs argument specifies the total number of jobs to launch simultaneously. If in the config.yaml file we specify 5 CPUs for each BUSCO analysis, the overall command will use 30 CPUs in total. We suggest restricting the number of CPUs to 5-8 for each BUSCO analysis, as this is the optimal setting for small genomes (e.g., <10 Mbp) in the majority of cases.
If you specify the --use-conda flag in the Snakemake command, Snakemake will download (if needed) and use the BUSCO conda version specified in the envs/busco.yaml config file. Otherwise, the BUSCO version available on your system will be used.
Alternatively, you can run the workflow without using a job submission management system (here we also specify the --use-conda flag), e.g., with:

$ snakemake --cores 30 --use-conda

Here --cores defines the maximum number of CPUs that can be used by the workflow. With 5 CPUs per BUSCO analysis, as defined in the config.yaml file, the workflow should run six BUSCO analyses at a time.
The outputs of the analysis will be written to the busco_out_mix_alter_ protocol/ folder. For the 25 prokaryotic genomes of this example, the analysis submitted through SLURM on a cluster using 30 CPUs completes in approximately 4 min.

BUSCO SETUP
This protocol describes the different ways to install BUSCO and its dependencies. BUSCO was implemented using Python 3 and tested on Linux operating systems. It is therefore recommended to use a Linux machine for running BUSCO. There are currently three options to obtain BUSCO: using the Docker container, through Bioconda, or by installing BUSCO and its dependencies manually.

Necessary Resources Hardware
As described in Basic Protocol 1.

Software
Python 3.3 or higher, the BUSCO tool, and its dependencies (refer to step 4, below). Optionally, a working installation of Docker or Singularity if using the Docker container option. Optionally, a working installation of conda if using the Bioconda option. R and the ggplot2 library are also required for plotting BUSCO results using the commands described in Basic Protocol 1.

Installing and running BUSCO with Docker

The container images are tagged as vx.x.x_cv1, where x.x.x refers to the BUSCO version and _cv1 to the container version for a given BUSCO version. There should be only one container per version, unless an issue with the container was fixed without changing the BUSCO code.

3. To run the BUSCO container, you will need to control the user that runs within the container. You can pass your own user ID with the -u argument, e.g., -u $(id -u). Note that if you do not specify a user, the container will create files under a different user that you do not control. With the mount option -v, you specify the location on your filesystem to which you can write and through which you can transfer files between your filesystem and the container filesystem. In our example, we will use the current working directory from where you are launching the assessment. You could use another folder, but be careful not to specify a host folder that does not exist, as Docker will create it using the root account without asking for confirmation. It is safer to use the current directory, as in our example. Run the container as follows:

VISUALIZING BUSCO RESULTS
This protocol describes how to use some of the BUSCO results to create charts for displaying the location of markers on the input genome and visualizing syntenies between genomes.

Necessary Resources Hardware
Any machine supporting an installation of Python and R.

Files
One or multiple full_table.tsv files for plotting markers and visualizing syntenies.

Visualizing BUSCO markers on genomes
Sometimes it can be useful to visualize the location of BUSCO markers on the input genome, for example to show the distribution of markers or to highlight problematic regions, e.g., those with a high density of duplicated markers. This can be done by extracting the BUSCO coordinates from the full_table.tsv file and using a program for plotting genomic coordinates. Several options are available; here we use the R package RIdeogram (Hao et al., 2020). The workflow takes a karyotype.txt file describing the sequences to plot; alternatively, you can provide the path to the genome assembly used in the assessment, and the workflow will generate this file automatically from the assembly. Here we are using a full_table.tsv obtained by running the diptera_odb10 dataset on the D. melanogaster genome. You can find this file at support_protocol2/plot_markers/data/full_table/Dmel_full_table.tsv in the BUSCO protocol repository. Note that when using the workflow on your own inputs, you need to name the full_table.tsv using the convention {sample_name}_full_table.tsv and keep the same {sample_name} keyword across all files related to the same sample. For example, if the full_table file is named Dmel_full_table.tsv, the karyotype file (if provided by the user) must be named Dmel_karyotype.txt, and the genome (if provided), Dmel.fna. You will need to edit the config.yaml file, which you can find in the plot_markers/ folder, to specify the paths to your own inputs. In this config file, you can also provide the path to a text file with the IDs of a subset of sequences you wish to plot, named {sample_name}_selected_sequences.txt. If not provided, the script will attempt to plot all the sequences present in the input genome. Bear in mind that, for a clean visualization, it is not possible to plot too many sequences; e.g., this approach will not work on a fragmented genome with hundreds or thousands of scaffolds. In that case, select some key sequences of interest to plot.
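The coordinate-extraction step can be approximated in a few lines of Python. This sketch assumes the tab-separated layout of a genome-mode full_table.tsv (comment lines starting with "#", then columns BUSCO ID, status, sequence, gene start, gene end, ...); the column content can vary between BUSCO versions and modes, so inspect your file first:

```python
import csv
from io import StringIO

def parse_marker_coordinates(handle):
    """Extract (busco_id, status, sequence, start, end) records from a
    genome-mode full_table.tsv, skipping comments and Missing entries."""
    markers = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        busco_id, status = row[0], row[1]
        if status == "Missing":  # no coordinates are reported for missing BUSCOs
            continue
        seq, start, end = row[2], int(row[3]), int(row[4])
        markers.append((busco_id, status, seq, start, end))
    return markers

# Tiny synthetic example mimicking one classified and one missing marker
example = ("# BUSCO version is: ...\n"
           "100at7147\tComplete\t2L\t1000\t5000\t+\t900.1\t1200\n"
           "200at7147\tMissing\n")
coords = parse_marker_coordinates(StringIO(example))
```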
In the following Snakemake command, by specifying the --use-conda flag, Snakemake will install the dependencies through conda, which can be a bit slow. A faster way of using Snakemake's conda integration is through the mamba package manager (https://github.com/mamba-org/mamba), an extremely fast and popular conda replacement. We therefore recommend first installing mamba with conda install -n base -c conda-forge mamba. If you prefer to use conda, you can enforce that by adding --conda-frontend conda to the Snakemake command.
3a. To run the workflow on the test data, just enter:

$ snakemake --cores 2 --use-conda

This workflow will: (a) extract the status, sequence IDs, and coordinates of each marker from the full_table.tsv file; (b) extract the sequence lengths from the genome file to build a karyotype.txt file, if this file is not provided by the user.
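Step (b), deriving a karyotype from sequence lengths, amounts to measuring each sequence in the FASTA file. A minimal, generic sketch (not the workflow's actual script):

```python
from io import StringIO

def fasta_lengths(handle):
    """Return {sequence_id: length} from a FASTA stream."""
    lengths, current = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]  # the ID is the first word of the header
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths

# Hypothetical two-sequence assembly
fasta = ">2L some description\nACGTACGT\nACGT\n>2R\nACGTA\n"
karyotype = fasta_lengths(StringIO(fasta))
# karyotype == {"2L": 12, "2R": 5}
```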
4a. The plotting step is performed in the RStudio environment, where you can easily customize the appearance according to your needs and readily visualize the changes (you can also generate the plot from the command line using the Rscript command). Open RStudio and set the working directory to the support_protocol2/plot_markers/ folder of the BUSCO plugins repository. From the RStudio console, run:

> setwd("/path/to/support_protocol2/plot_markers/")

5a. Open the plot_markers/scripts/plot_markers2.R script within RStudio. The paths to source the necessary files are already set for the test data. In this script, you may customize the R code to change colors and labels. Run the commands of this script within RStudio to produce the plot. You can do this by selecting all the commands of the script and clicking on the "Run" button.

Visualizing syntenies between genomes using BUSCO markers
The genomic position of each single-copy ortholog identified by BUSCO on two or more genomes can be used to infer syntenies between genomes. For visualizing syntenies between two or more genomes, you will need the full_table.tsv files obtained by running BUSCO on the input genomes. Note that you need to use the same dataset for all the assessments. In this example, we infer syntenic relationships between Drosophila melanogaster and D. pseudoobscura chromosomes using the diptera_odb10 single-copy orthologs. There are various programs to plot syntenies; here we use the R package RIdeogram as in the previous example.
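Conceptually, the pre-processing joins the two tables on the BUSCO ID and keeps markers found as complete and single-copy in both species; each shared marker becomes one synteny link. A minimal illustration with hypothetical coordinates (the real workflow reads full_table.tsv files):

```python
def synteny_links(table_a, table_b):
    """Pair markers complete and single-copy in both species.

    Each table maps busco_id -> (sequence, start, end) for
    'Complete' markers only; the key intersection gives the links."""
    shared = sorted(set(table_a) & set(table_b))
    return [(bid, table_a[bid], table_b[bid]) for bid in shared]

# Hypothetical single-copy markers from two BUSCO runs with the same dataset
dmel = {"100at7147": ("2L", 1000, 5000), "300at7147": ("3R", 200, 900)}
dpse = {"100at7147": ("chr4", 7000, 11000), "400at7147": ("chr2", 50, 600)}
links = synteny_links(dmel, dpse)
# Only 100at7147 is shared, so a single synteny link is produced
```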
1b. If not done already, download the busco_protocol repository from GitLab:

$ git clone https://gitlab.com/ezlab/busco_protocol

2b. Enter the support_protocol2/plot_syntenies subfolder. You can find the two full_table.tsv files obtained by running BUSCO with the diptera_odb10 dataset on the D. melanogaster and D. pseudoobscura assemblies in the plot_syntenies/data/full_table/ folder. To plot your own data, remove these files and add your own full_table.tsv files.

3b. Then, run the script for creating the files required for generating the plot:

$ snakemake --cores 2 --use-conda

The script will pre-process the full_table.tsv files to create the file required for plotting the syntenies with RIdeogram. A karyotype.txt file can be provided to the command in case you want to plot only a subset of sequences from the input files. You need to create this file with the following format:

If not provided, the script will extract this information directly from the genomes for all the sequences in the inputs. You may edit this file to keep only a subset of sequences of interest and rerun the workflow by specifying the edited file.
4b. Open RStudio, and set the working directory to the support_protocol2/plot_syntenies/ folder inside the busco_protocol directory. From the RStudio console, run:

> setwd("/path/to/support_protocol2/plot_syntenies/")

5b. Open the plot_syntenies/scripts/plot_syntenies2.R script within RStudio. The paths to source the necessary files are already set. Here you can customize the R code to change colors and labels. Now run the commands of this script within RStudio to produce the plot. You can do this by selecting all the commands of the script and clicking on the "Run" button.

BUILDING PHYLOGENOMIC TREES
BUSCOs, being near-universal single-copy genes, represent reliable markers for phylogenomic studies. BUSCO assessments of multiple organisms can be used to quickly and easily identify single-copy markers from which to generate multiple sequence alignments (MSAs) and infer the species phylogeny. BUSCO datasets can identify large sets of genes from genomic data of variable quality, avoiding the tedious and possibly biased manual selection of orthologs. In this protocol, we illustrate a workflow to build a phylogenomic tree from annotated gene sets (with a GFF file and genome assembly available for each species). In this case, we concatenate the MSAs into a superalignment and use a maximum likelihood (ML) approach to build the phylogenomic tree. The analyses presented here are by no means exhaustive, and you should choose the most suitable methods and parameters (e.g., using partitions, different models, etc.) according to your specific goals and inputs.

Necessary Resources Hardware
As described in Basic Protocol 1.

Software
As described in Basic Protocol 1, and additionally: conda (and optionally mamba), Snakemake, AGAT, MAFFT, trimAl, IQ-TREE 2, and AMAS. Note that these tools will be automatically installed by running the Snakemake workflow (otherwise, you can install them manually if you prefer).

Files
A set of genome assemblies and their corresponding GFF files.

Building phylogenomic tree from a set of genome assemblies and GFF files
1. If not done already, download the busco_protocol repository from GitLab:

$ git clone https://gitlab.com/ezlab/busco_protocol

2. Enter the support_protocol3/ folder:

$ cd support_protocol3/

3. Collect the GFF and the genome assembly files you want to analyze in the GFFs/ and assemblies/ folders, respectively. In this example, we are going to build a small species tree using six insect species selected from various published sources (e.g., Terrapon et al., 2014). Using the assembly accessions, you can download these with the NCBI "datasets" command-line tool (https://www.ncbi.nlm.nih.gov/datasets/), e.g., datasets download genome accession GCF_014905235.1, or manually from the NCBI portal or the FTP site using the wget command.

4. Edit the config/config.yaml file, in which you need to specify the paths to the GFFs and genomes folders, and additional parameters such as which BUSCO dataset you want to use to identify the single-copy orthologs that encompass all species. If you exactly follow the steps in this protocol, all parameters in the config.yaml file are already set for this example. Analyses with higher-resolution datasets identify more markers for inferring the species phylogeny. In our example, we are using the insecta_odb10 dataset.
Figure 4 Example of a phylogenomic tree obtained by using single-copy orthologs identified by BUSCO on test data of six insect species. The phylogeny was inferred from a set of 1154 concatenated BUSCOs under the Q.insect+R5 substitution model using IQ-TREE 2.
5. Run the Snakemake workflow, which comprises all the steps required for obtaining the species phylogeny. Briefly, the workflow consists of the following steps:

i. For each species, isoforms are first filtered to keep only the longest protein-coding sequence for each gene according to the corresponding GFF file. This is achieved using the AGAT toolkit (Dainat J.), which also produces a useful report that can be used to verify there are no issues with the GFF entries.
ii. BUSCO is then run in protein mode on the resulting filtered gene sets, using the dataset specified in the config.yaml file.
iii. The single-copy genes identified in each run are extracted and selected based on user-defined parameters specified in the config.yaml file, e.g., to select only genes that are shared across 100% of the species and have no duplicates in any species.
iv. For each orthologous group, proteins are aligned using MAFFT (Katoh & Standley, 2013) and trimmed with trimAl (Capella-Gutiérrez, Silla-Martínez, & Gabaldón, 2009).
v. Alignments are concatenated with AMAS (Borowiec, 2016).
vi. The resulting superalignment is used to infer a maximum likelihood phylogeny with IQ-TREE 2 (Minh et al., 2020).

To run the workflow, enter:

$ snakemake --cores 16 --use-conda

Several intermediate files and the final phylogenetic.nwk tree file are written to the output folder. Assessments of the six insect species with the insecta_odb10 dataset identified 1154 complete and single-copy BUSCOs found in all species. Note that in the config.yaml file you can also choose to use markers found as duplicated (the best-scoring duplicate will be selected) and markers missing in a certain number of species, e.g., missing in 10% of the inputs. To visualize the Newick file, you can use any program for viewing phylogenetic trees, e.g., Dendroscope (Huson & Scornavacca, 2012). Figure 4 shows the resulting tree obtained in this example.
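Step v, the concatenation of per-gene alignments into a superalignment, is performed by AMAS in the workflow, but the core operation can be illustrated with a toy example (hypothetical species names and sequences):

```python
def concatenate_alignments(alignments, species):
    """Concatenate per-gene alignments into a supermatrix.

    `alignments` is a list of dicts mapping species -> aligned sequence.
    Species missing from a gene are padded with gaps so that every row
    of the supermatrix has the same length."""
    matrix = {sp: [] for sp in species}
    for aln in alignments:
        width = len(next(iter(aln.values())))  # alignment width for this gene
        for sp in species:
            matrix[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in matrix.items()}

genes = [
    {"spA": "MK-L", "spB": "MKAL"},  # gene 1 alignment (width 4)
    {"spA": "VV",   "spB": "VI"},    # gene 2 alignment (width 2)
]
supermatrix = concatenate_alignments(genes, ["spA", "spB"])
# supermatrix["spA"] == "MK-LVV"
```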

GUIDELINES FOR UNDERSTANDING RESULTS
BUSCO attempts to provide a quantitative assessment of completeness in terms of expected gene content and can be applied to genome assemblies, assembled transcriptomes, and annotated protein-coding gene sets. The main results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs. BUSCO completeness results make sense only in the context of the biology of your organism and your data types. For example, you need to understand whether missing or duplicated genes are due to biological or technical reasons. For instance, a high level of duplication may be explained by a recent whole-genome duplication event (biological) or by redundancies due to technical issues, e.g., when the assembly procedure fails to correctly collapse (or separate) haplotypes. Transcriptomes and protein sets that are not filtered for isoforms may also lead to a high proportion of duplicates. These and other aspects to consider when interpreting BUSCO scores are described in this section, along with a detailed list of the outputs produced by each assessment workflow.

BUSCO outputs
Every BUSCO assessment produces an output folder named according to the value specified with the --out (or -o) argument. This folder contains all the results, organized in several subfolders that include many files with intermediate results. These are kept for your benefit, so that you can examine the results in more detail and, if needed, use the data for downstream analyses. The content of the result folder can change according to the workflow that has been run, but a successful assessment will always contain the following files and folders:

• short_summary*.txt file: a simple summary result file that reports the percentages and counts of "complete" ("single-copy" and "duplicated"), "fragmented", and "missing" BUSCOs. This file is named after the dataset used for the assessment and the name given to the output folder; e.g., if the dataset used is diptera_odb10 and the output name out_fly, the file will be named short_summary.specific.diptera_odb10.out_fly.txt. The keyword "specific" or "generic" is added to distinguish the most specific assessment from the generic assessment with the domain-level dataset (the latter is created only when using the auto-lineage workflow). The file also contains the versions of the dependencies used to run the assessment.
• logs/ folder: contains the BUSCO log file and the logs of the dependencies. Take some time to check the log files; they are there to highlight potential problems that may have occurred during your BUSCO run. This is the first place to look when a run crashes. If you need help, submit a request on the GitLab issue board, attaching the BUSCO log file or part of it.
• run_<dataset_odb10>/ folder: contains all results and intermediate files relative to the assessment performed with the corresponding dataset <dataset_odb10>; e.g., for an assessment performed with the diptera_odb10 dataset, this folder will be named run_diptera_odb10.
• full_table.tsv file: the detailed list of all BUSCO genes and their predicted status in the genome, gene set, or transcriptome. It can be found in the run_<dataset_odb10>/ folder. This file is slightly different for the three modes. For the genome mode, it displays the BUSCO orthologous group ID, followed by the status of the gene identified in the input sequence, along with its position (sequence name, start/end coordinates, and strand), score, and length. The table additionally reports a keyword description and a link to the OrthoDB webpage of the corresponding orthologous group. For the protein mode, the identifier of the sequence is displayed and there are no start/end coordinates. For the transcriptome mode, it contains the identifier of the transcript, and the start and end coordinates refer to the start and end of the predicted match on the transcript.
• busco_sequences/ folder: contains the amino acid sequences in FASTA format of all BUSCO genes identified by the assessment procedure, organized in three subfolders (single_copy_busco_sequences/, fragmented_busco_sequences/, and multi_copy_busco_sequences/) according to the gene status.
• hmmer_output/ folder: contains a tabular version of each HMMER output, one for each candidate protein evaluated, named after the BUSCO profile HMM used, e.g., 25101at4891.out.1. Note that these are all initial candidates, including those that are not retained as positive matches in the final classification.
• missing_busco_list.txt file: lists the identifiers of missing BUSCOs.
If BUSCO is run on multiple inputs (i.e., a directory is provided as the input argument), the top-level output directory contains a number of subdirectories, one per input file, each named after the corresponding input file. These directories contain the same output files as described above. An additional file listing the resulting scores for each input file is written to the top-level directory. A single busco.log file for the entire batch analysis is stored in the logs/ folder in the top-level directory. The result folder also contains the outputs produced in each intermediate step, depending on the workflow that has been run. Specifically, for genome assessments using the BUSCO_Metaeuk workflow or for transcriptome assessments of eukaryotic species:

• metaeuk_output/ folder: results of the MetaEuk gene predictor.
In v5, BUSCO takes advantage of MetaEuk also to find BUSCO markers in eukaryotic transcriptomes. When assessing prokaryotic transcriptomes, however, BUSCO uses the previous transcriptome mode, which relies on tBLASTn. The pipeline consists of a tBLASTn run taking the BUSCO amino acid consensus sequences as queries and the input transcripts as a database to obtain a subset of sequences harboring potential matches to each BUSCO gene. In this case, a six-frame translation is blindly performed over the transcript length (which might include untranslated regions), and HMMER is run to assign a score to the candidate amino acid sequences. The output folder contains the hmmer_output/ directory and the short_summary.txt, missing_buscos_list.tsv, and full_table.tsv files, similar to those of the eukaryotic transcriptome mode. Additionally, there will be blast_output/ and translated_proteins/ folders containing the six-frame translations of every transcript having a match to a BUSCO marker during the tBLASTn analysis, including discarded candidates not present in the final classification. These sequences should not be considered genuine protein sequences without thorough validation using a dedicated method.
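The six-frame translation mentioned above simply translates the transcript and its reverse complement in all three reading frames. A self-contained sketch using the standard genetic code (illustrative only; BUSCO's own implementation may differ in details such as the handling of ambiguous bases):

```python
# Build the standard codon table from the canonical TCAG ordering
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AAS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate a nucleotide sequence codon by codon ('*' marks a stop)."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return the six conceptual translations of a transcript:
    three frames of the sequence, three of its reverse complement."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]
    return [translate(s[offset:]) for s in (seq, rc) for offset in range(3)]

frames = six_frame_translation("ATGAAATGA")
# Frame +1 reads ATG AAA TGA -> "MK*"
```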
For the genome assessment using the BUSCO_Augustus workflow, the output contains the following specific folders:

• blast_db/ folder: contains the database created from the input file for the tBLASTn search.
• blast_output/ folder: contains the raw output of the two tBLASTn runs and the corresponding coordinates as defined by BUSCO to represent candidate regions.
• augustus_output/ folder: contains the results of the AUGUSTUS gene predictor. It includes the list of single-copy genes that were used to retrain AUGUSTUS and, in the predicted_genes/ subfolder, one gene model for each candidate region evaluated, named after the BUSCO block profile used, e.g., 241586at4751.out.1.
In the extracted_proteins/ subfolder, there is one nucleotide and one amino acid sequence for each gene model, e.g., 241586at4751.fna.1 and 241586at4751.faa.1. The sequences in these subfolders represent all the initial candidates, including those that were not included in the final classification. The retraining parameters produced by BUSCO to be used by the second AUGUSTUS run are stored in the retraining_parameters/ subfolder. Other intermediate files that are produced during the analysis can be found in the remaining folders, e.g., gb/ and gff/ folders containing the files in GenBank and General Feature Format (GFF), respectively.
For the prokaryote/virus genome assessment: • prodigal_output/ folder: containing the results of the Prodigal gene predictor.
Results obtained by using the auto-lineage workflow additionally include the files related to the assessments using the three generic "root" datasets and the outputs from the phylogenetic placement in the following subfolder:

• auto_lineage/ folder: contains the BUSCO results of the assessments using the three generic "root" datasets. These represent either incorrect or less specific choices of lineage. A symbolic link in the main output folder points to the generic "root" dataset that was selected as most likely, for example, eukaryota_odb10/. This folder also contains the placement_files/ subfolder with the files of the phylogenetic placement of the extracted eukaryotic markers onto the precomputed eukaryota tree.
In general, these intermediate outputs may be useful for other subsequent analyses or for in-depth investigation of the results.

How to report BUSCO results
You can report results in simple BUSCO notation, e.g.: C:89.0%[S:85.8%, D:3.2%],F:6.9%,M:4.1%,n:194, along with the name of the dataset. In addition, always report the BUSCO command and versions of the dependencies used for the run as supplementary information. If you use the BUSCO container, specifying the container version is sufficient to identify the versions of the dependencies and to safely reproduce a run. You can use the generate_plot.py script to produce a simple bar chart summarizing the results of one or multiple BUSCO runs (see Basic Protocol 1). Remember to report the versions of the genome assemblies, annotated gene sets, or transcriptomes assessed in the analyses.
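When aggregating results across many runs, the one-line notation can be parsed programmatically. A small helper based on the notation shown above (a regex sketch; adjust it if your BUSCO version formats the line differently):

```python
import re

NOTATION = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,\s*D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)")

def parse_busco_notation(line):
    """Return the BUSCO summary scores as a dict (percentages as floats,
    the total marker count n as an int)."""
    m = NOTATION.search(line)
    if m is None:
        raise ValueError(f"not a BUSCO notation string: {line!r}")
    scores = {k: float(v) for k, v in m.groupdict().items()}
    scores["n"] = int(m.group("n"))
    return scores

scores = parse_busco_notation("C:89.0%[S:85.8%, D:3.2%],F:6.9%,M:4.1%,n:194")
```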

Considerations on BUSCO score interpretation
To judge whether a score is satisfactory, the user will have to consider the type of sequence assessed and the biological context of the species of interest. A very good genome assembly should contain the majority of the BUSCO genes that were not lost during evolution. The number of gene losses for each specific species inside a clade cannot be precisely defined a priori, and it is possible that a small fraction of the expected single-copy genes is missing in certain species because of true gene loss, or present as multi-copy genes due to true gene duplications.
When comparing BUSCO scores of a genome assembly and its corresponding gene set, differences in the resulting scores may be due to several factors. When results from genome assembly assessments appear better than those for their gene sets, it likely indicates that the approach taken by BUSCO to identify the subset of single-copy genes on the genome has produced better gene predictions than the one used in the annotation pipeline (at least for the subset of genes that make up the BUSCO lineage dataset). In this case, it may be possible to improve the annotation pipeline using different approaches or additional evidence. Conversely, if a gene set appears more complete than its corresponding genome, this may suggest that the custom annotation pipeline, which is usually tailored to that specific organism (e.g., by using multiple sources of evidence such as RNA-seq data), has produced generally better annotations than the general approach implemented in BUSCO. The user should assess both the assembly and the annotation results to judge whether the gene prediction strategy can be improved in light of the expected gene content in the genome. Importantly, the user should never complete the annotated gene set with the BUSCO genes recovered in genome mode prior to assessing it, as this would bias the evaluation of the true annotation efficiency.
For the evaluation of a transcriptome assembly, you need to consider the source material of the transcriptome. Not all BUSCO genes are necessarily expressed in the tissue/organ under consideration or at specific time points; therefore, transcriptomes are expected to show varying degrees of completeness depending on the sample. For example, when assessing transcriptomes derived from specific tissues or life stages where the repertoire of transcripts is expected to be highly specialized, a low completeness score can be expected, and would in fact offer support that such targeted sampling was achieved.
Generally, a good genome assembly and its corresponding gene set should show low levels of duplication. Duplication of a few BUSCO genes in a genome is compatible with biological reality, as their evolution under single-copy control may be relaxed in some sublineages or specific species. However, a high duplication score could denote technical issues during the assembly procedure, e.g., indicating that the procedure has failed to correctly collapse (or separate) haplotypes, resulting in numerous pairs of highly similar duplicated genes. High duplication scores can also signal the presence of a contaminant species, e.g., sequences from a species related to the target one; in such cases, the single-copy markers from both species will be scored and appear as duplicated markers. High duplication scores could also reflect biological reality, as with whole-genome duplication events. As mentioned above, it is important to know the evolutionary context of the species under analysis to correctly interpret the results.
Users should be aware that the classification procedure that labels the expected single-copy genes as "complete", "fragmented", or "missing" is based, by necessity, on approximations that reflect the most likely state of the gene. For example, the label "missing" is applied to BUSCOs with no matches (probably truly absent from the dataset), but also to matches that do not meet the HMM score cutoffs. These could in fact be present as fragments that fail to align or that do not reach the cutoffs, thus not providing enough matching sequence to classify the partial match as "fragmented". Classification of a BUSCO gene as "complete" likewise relies on a simplified criterion: the length of the candidate sequence is compared to length thresholds determined from the distribution of gene lengths in the species used to build the datasets. A gene that is an outlier in terms of length or score may therefore be classified as fragmented when it is in fact complete, or vice versa. Nonetheless, a high rate of fragmented BUSCO markers likely indicates issues in the sequencing or assembly procedures, or the inability of the annotation pipeline to fully capture the complexity of some gene models. Turning fragmented BUSCO matches into complete matches is a good indicator of a significant improvement in the quality of an assembly, especially when supported by changes in other metrics such as N50 values.
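The decision logic described above can be sketched as a toy classifier. This is an intentionally simplified illustration of score- and length-based classification, not BUSCO's actual implementation; the real cutoffs are profile-specific values shipped with each dataset, and the values used below are hypothetical:

```python
def classify(score, length, score_cutoff, length_range):
    """Toy classifier mirroring the description in the text.

    A candidate passing the HMM score cutoff is 'Complete' if its length
    falls within the expected range and 'Fragmented' otherwise; candidates
    below the score cutoff (or absent) are 'Missing'. Duplicated status
    would additionally require counting complete copies per marker."""
    if score is None or score < score_cutoff:
        return "Missing"
    lo, hi = length_range
    return "Complete" if lo <= length <= hi else "Fragmented"

# Hypothetical cutoffs for one marker: score >= 100, expected 250-350 aa
status = classify(250.0, 300, 100.0, (250, 350))
```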
Other incongruences can arise when close homologs of BUSCO genes are present in the sequences under analysis. In such cases, the BUSCO classifier will give a better score to the true copies and will therefore be able to discard the other sequences. However, if the actual BUSCO gene is missing, close homologs may sometimes reach a sufficient score to be considered positive matches to a BUSCO gene that is in fact not present. An experienced user, manually investigating the outputs, may be able to spot such situations. Nevertheless, the overall quality of the input data will be clear from the overall score. Generally, inputs with low completeness scores and high duplication scores warrant further investigation to determine whether these are due to technical reasons that can be resolved by employing different assembly or annotation strategies.

Considerations on the automated selection of the dataset
The BUSCO auto-lineage workflow introduced in v4 first runs BUSCO on the three most generic datasets (archaea_odb10, bacteria_odb10, and eukaryota_odb10) to determine the most likely domain of origin of the input sequence. Once the most likely domain is selected, BUSCO automatically attempts to find the most appropriate BUSCO dataset to be used for the final assessment by performing a phylogenetic placement of the markers identified and extracted from the initial run. BUSCO evaluations are valid when an appropriate dataset is used, i.e., when the dataset belongs to the lineage of the species to test. It is possible that the automated placement procedure selects suboptimal datasets for the assessment, e.g., when not enough markers are placed to confidently select a higher-resolution dataset. In that case, the BUSCO assessment is still valid, just at a lower resolution, e.g., when a bacterial species from the Actinobacteria clade is assessed with the most generic bacteria_odb10 dataset rather than the most specific actinobacteria_odb10 dataset. Though an infrequent scenario, it is also possible that BUSCO automatically selects, on the basis of the placement procedure, an incorrect dataset for the assessment. In this case, the BUSCO scores might not be reliable. In our tests and benchmarks, this event occurs rarely, and in the majority of such cases it is due to existing incongruences between the official taxonomy of the species and its actual phylogenetic signal.
When running the auto-lineage workflow on multiple files, the scores related to the assessments with the three general datasets (run to determine the domain of origin of the input file during the first step of the placement procedure) are also shown along with the final scores in the summary table listing the results of individual runs. These scores are there for your benefit, as they may help identify heavy contamination from other domains. However, you need to interpret these values with extra care. BUSCO matches in another domain do not necessarily mean that your genome/proteome contains sequences from that domain. These can be due to a certain fraction of overlapping markers or spurious matches among markers from different domains. However, high completeness scores in multiple domains might help you spot possible contamination issues.

Background Information
The BUSCO assessment process consists of running a computational pipeline to identify and then classify sets of genes expected to be found as single-copy in the majority of species within a certain taxonomic group. BUSCO matches can be identified from genome assemblies, annotated gene sets, or transcriptomes, and are classified based on custom sets of HMM (hidden Markov model) profiles using HMMER. For genome assessments, gene models are first built using the gene predictor MetaEuk with a collection of single-copy protein-coding genes as evidence (default BUSCO_Metaeuk workflow) or using ab initio gene prediction with AUGUSTUS on the potential matches identified by tBLASTn searches (BUSCO_Augustus workflow, available by specifying the --augustus flag). Matches that meet the BUSCO HMM score cutoffs are classified as "complete" if their lengths fall within the BUSCO profile length expectations; if found more than once, they are classified as "duplicated". Those that do not meet the length requirements are considered partial matches and are classified as "fragmented". BUSCOs without matches passing the thresholds are classified as "missing". The assessments provide an intuitive quantification of the completeness of different data types in terms of expected gene content. BUSCO v4/v5 features novel workflows that