GSCtool: A Novel Descriptor that Characterizes the Genome for Applying Machine Learning in Genomics

Machine learning (ML) is one of the core driving forces for the next breeding stage, and Breeding 4.0. Genotype matrix based on single‐nucleotide polymorphisms (SNPs) is often used in ML for genome‐to‐phenotype prediction. Genotype matrix has an inherent defect, as the feature spaces it generates across different individuals or groups are inconsistent, and this hinders the application of ML. To overcome the challenge, a genome descriptor, Genic SNPs Composition Tool (GSCtool) is developed, which counts the number of SNPs in each gene of the genome so the dimension of the feature vectors equals the number of annotated genes in a species. Compared to using the genotype matrix, using GSCtool significantly decreases the model training time and has a higher accuracy of phenotype prediction. GSCtool also achieves good performance in variety identification, which is useful in crop variety protection. In general, GSCtool will help facilitate the application and study of genomic ML. The source code and test data of GSCtool are freely available at https://github.com/SZJhacker/GSCtool and https://gitee.com/shenzijie/GSCtool.

populations with consistent dimensions or vectors.A new feature extraction tool could describe the genome in gene-level precision unlike the base-level precision of genotype matrices.Based on these concepts, we developed the Genic SNP Composition Tool (GSCtool), which records the number of SNPs present on each gene in the genome.The application of the tool in genomic ML was demonstrated by G2P and accurate identification of rice varieties.

For Phenotype Prediction
Previous research concluded that grain length (GL) and grain width (GW) are relatively stable in different environments, indicating that these traits are primarily determined by genetics. [6]For G2P, we collected GL and GW data of 3057 rice samples from three published papers. [6,7]7b] These data were divided into training sets and test sets with a ratio of 4:1.

For Identification of Crop Variety
We collected rice cultivars, which had been resequenced multiple times, from online databases, such as the National Center for Biotechnology Information (NCBI). [8]We searched "Oryza sativa" on NCBI, set filters such as "Genome", "paired", "Illumina", and so on, and obtained 40 067 records.Records with T-DNA, gene edit, ems mutant, etc. with genome modification were deleted.After manually checking and removing the potentially wrongly labeled samples, i.e., those with the same ID or variety name but that were unclustered together on a phylogenetic tree, a total of 41 cultivars (at least with three replicates, i.e., genomic dataset ≥3) were retained to demonstrate the application of GCStool in variety identification (Figure 1, Section 3, Table S4 and Figure S1, Supporting Information).

GSCtool
Conserved sequences are both prevalent and highly similar across individuals or genes, [9] and these sequences play a crucial role in performing essential biological functions. [10]As it is commonly known, genic regions tend to be more conserved than nongenic regions.The number of genic SNPs reflects the conservativeness of genes to a certain extent, and captures most of the biologically important variants.Additionally, the number of SNPs located in the gene region is closely related to gene function.Keeping these biological properties in mind, we have developed the GSCtool, which records the number of SNPs present on each gene in the genome.GSCtool serves as a novel genomic descriptor for ML applications in genomics.
As shown in Figure 1, Section 1, GSCtool counts the number of SNPs located in a gene region based on GFF3 and VCF.The GSCtool is defined as follows: (1) where G i represents the SNP number (Count ðSNP i 0 , SNP i 1 : : : SNP i N Þ) of Gene i and F includes the SNP number of all genes (G 1 , G 2 : : : G n ) in the genome.

ML Models
Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks that have extensively improved a wide range of ML problems. [11]BiGRU is a sequence processing model that consists of two GRUs.The neural network architectures of this study included two GRU or BiGRU layers with 64 units, one GRU or BiGRU layer with 32 units, one dense layer with 16 rectified linear units (ReLUs), and the output layer is a dense layer with 1 unit (Figure 1A, Section 2).EarlyStopping callbacks monitored validation loss to control the model training progress.The optimizer was "RMSProp," and the loss function was "mean of squares of errors" (MSE).
LightGBM is a gradient-boosting framework for ML originally developed by Microsoft. [12]We adopted GRU, BiGRU, and LightGBM models to train the data.Optuna is an automatic hyperparameter optimization framework software. [13]In this study, the parameters of the three models used in data training and phenotype prediction were tuned by Optuna.
All models were trained on a Tesla V100-SXM2 32 GB graphics processing unit.

Evaluation Metrics
MSE and coefficient of determination (R 2 ) are commonly used evaluation metrics for regression models, where lower MSE and higher R 2 values indicate better model performance. [14]14a] The formula is as follows: where n is the number of data points.A lower MSE value indicates better model performance, with zero representing a perfect match between predictions and actual values.14b] The formula for calculating R 2 is as follows: where y i represents the actual values of the dependent variable, ŷi represents the predicted values from the regression model, and y represents the mean of the observed values of the dependent variable.
Area under the curve (AUC) is a popular evaluation metric for classification models. [15]It quantifies the model's ability to distinguish between positive and negative classes by measuring the area under the receiver operating characteristic (ROC) curve.
where n pos represents the number of positive samples, n neg represents the number of negative samples, and R i represents the rank of the ith positive sample among all the negative samples.
Similarly, R j represents the rank of the jth negative sample among all the positive samples.A higher AUC value (closer to 1) indicates the better discriminative ability of the model.

Whole Genome Sequencing Analysis
All raw FASTQ files were cleaned by fastp (v0.12.4). [16]Cleaned FASTQ files were aligned against the rice reference genome (IRGSP-1.0)by BWA-MEN (v0.7.17-r1188). [17]GATK (v4.2.5.0) was used to complete the sorting of the aligned files and mark duplicate reads. [18]GATK was also used to call SNPs from the bam file, which had been sorted and marked.To obtain a set of high-quality variants, the SNP calls were filtered according to the following parameters with customized scripts: QD 3) assembled the above tools into a whole genome sequencing (WGS) analysis workflow. [19]VCFtools (v 0.1.15)filtered the SNPs of the population based on minor allele frequency ≥ 0.01 and max missing > 0.5. [20]All WGS analyses were conducted at the Yazhou Bay Science and Technology City Advanced Computing Center.

Results
The sustainable growth of the breeding industry relies on both protecting and creating new varieties.The former involves variety classification, which can be achieved through ML classification.
The latter relies on G2P, an application of ML regression.We applied GSCtool in G2P and variety identification in rice to demonstrate the power of the tool and its application in plant genomic ML.

Comparison about G2P
We first compared genotype matrices with GSCtool in the training of G2P models for the prediction of rice GW and GL (Figure 1, Section 2).The filtered SNPs of examined datasets A, B, and C (see Experimental Section) were converted into genotype matrices, with dimensions of 4 980 226, 5 277 056, and 3 779 105, respectively.GSCtool generates the features with a dimension of 37 865 (i.e., the number of annotated genes) for all three datasets.We adopted the GRUs network to fit these data.
For the GW prediction, the MSE values of the trained models, when using genotype matrices, for datasets A, B, and C were 0.111, 0.086, and 0.079, respectively (Figure 2A-C).In comparison, the MSE values of the trained models, when using GSCtool, for datasets A, B, and C were 0.070, 0.037, and 0.025, respectively (Figure 2D-F).By using genotype matrices, the R 2 values of the models for datasets A, B, and C were 0.2, À0.52, and À1.07, respectively (Figure 2A-C), whereas by using GSCtool, the corresponding R 2 values of the models for datasets A, B, and C were 0.35, À0.11, and 0.2, respectively (Figure 2D-F).
From these two indicators, we concluded that the trained models performed better for GW when using GSCtool features.Similar conclusions could be made for the performance of the trained models based on genotype matrices and GSCtool in GL prediction (Figure S2, Supporting Information).These results demonstrate that the models based on GSCtool had a better performance than the models based on genotype matrix.

Model Training Time Comparison
Training time directly impacts the efficiency and productivity of the model development process.ML models often require  S5, Supporting Information).Therefore, GSCtool is more suitable for finding hyperparameters.

Better Model Prediction
Considering the better performance of GSCtool, we used it to find the better ML model for prediction of GW and GL (Figure 1, Section 2).We selected the GRU, BiGRU neural network, and LightGBM models to train the data and searched for the optimal hyperparameters of each model by fine-tuning hyperparameters using Optuna (Table S6 and S7, Supporting Information).Among the three models, GRU achieved the best results for prediction of GW and GL in two of the three populations.For GW, GRU was also the overall best model, and for GL the overall best model was LightGBM (Table 1).This result implies that choosing the best ML model for phenotype prediction depends on both traits and populations.

A New Strategy for Variety Protection
Molecular assay DNA fingerprinting is the most common technology for variety discrimination. [4,21]It uses molecular markers that can detect differences in specific regions of different species, which are ultimately achieved through electrophoretic spectra. [21]hylogenetic trees are commonly used for studying evolutionary relationships and affinities between different species; [22] however, differences within varieties are more subtle and complex, rendering traditional phylogenetic trees less conducive to precise variety classification.There is currently a scarcity of computational tools specifically designed for crop variety classification, and to date, there has been a lack of relevant research on ML Table 1.Performance of the fine-tuned models for phenotype prediction in different populations.Bold font indicates the results of the best performing model for GL or GW prediction in each of the three populations; b) The results of the overall best performing model for GL and GW prediction are shown in underlined bold font.
applications in identification of crop variety, an area related to the classification issue of supervised ML.GSCtool provides the possibility for ML classification to be applied in variety protection.We used publicly available genome sequence data of the rice varieties, which had been resequenced multiple times, to explore this possibility.
To have a consistent number of datasets for each rice variety (Figure S1, Supporting Information), we used the Under Random Sampling algorithm to select the varieties, with each having been resequenced 3 times.Then, we employed LightGBM, fine-tuned by Optuna, to fit the data (Table S7, Supporting Information).The accuracy of threefold crossvalidation is 100.0%,100.0%, and 97.56%, respectively.The area under the receiver operating characteristic curve (ROC AUC) score of the best model was 1.0 (Figure 4).This result showed that GSCtool achieved great classification performance and can be applied in the protection of crop variety by authentic identification of different varieties.

Conclusions
Existing methods for phenotypic prediction or varietal identification utilize genomic variation information in a very crude way, resulting in these methods having inherent disadvantages.For G2P, the genotype matrix is simply a numerical representation of the different variant types.The specificity of the location and number of variants on each individual produces a different genotype matrix.This inconsistency hinders the application of ML in genomics.3a,3b,3d] For varietal identification, molecular markers are used to distinguish varieties by detecting differences between individuals.However, such differences are not universal among individuals, and the location of molecular markers (primer binding sites) may also vary, rendering some markers applicable only to a limited number of varieties.Therefore, new molecular markers are still being developed. [21]o conclude, the development of descriptors offers a solution to the significant limitations of existing methods due to various biological properties.We developed a genome descriptor GSCtool for the application of ML in genomics.The use of features generated by GSCtool sped up the model training time and improved prediction accuracy compared to using the genotype matrix as input features.Compared to using genotype matrix as the input feature, using features generated by GSCtool decreased the model training time and increased prediction accuracy.Furthermore, GSCtool generates consistent feature dimensions across different populations of a given species, which is crucial for the application of ML in G2P.Molecular marker methods have limitations in identifying large populations of varieties.Furthermore, the development of molecular markers and the identification of varieties require significant human and financial resources.There are no computational tools available for variety classification.GSCtool also achieved great performance on variety identification, and provides a fundamental tool for ML in variety identification, and it is an inevitable trend to replace molecular markers with ML in variety identification in the future.
Feature engineering is the process of transforming and combining data to create new and useful features that improve model performance.It is an essential part of ML research.GSCtool is the first attempt at feature engineering for ML in the genomic domain.It creates new features based on SNPs in the genome and facilitates ML applications such as G2P and variety classification in the genomic field.

Figure 1 .
Figure1.Schematic diagram of the design and application of GSCtool.Section 1 shows the variants of populations A and B (the left panel), and the variant features used by the genotype matrix (the upper panel) and GSCtool (the lower panel).Section 2 schematically describes the process used in G2P predictions of rice GW and GL.Section 3 describes a pipeline for the classification of rice varieties.

Figure 2 .
Figure 2. Residual scatterplots comparing the model performance for GW prediction with genotype matrix and GSCtool.A-C) Model performance for GW prediction using genotype matrices as the input: (A) dataset A, 525 samples; (B) dataset B, 1138 samples; and (C) dataset C, 1494 samples.D-F) Performance for GW prediction using SNP time series as input: (D) dataset A, (E) dataset B, and (F) dataset C. R 2 , coefficient of determination; MSE, mean squared error; r, Pearson correlation coefficient.The black solid line is the best linear model fit and the gray solid line represents the linear model fit.

Figure 3 .
Figure 3.Comparison of the run time required for GRU model training with genotype matrix and GSCtool for A) GW and B) GL on dataset A, dataset B, and dataset C.

Figure 4 .
Figure 4. AUC of the classification model.The numbers in the brackets are the AUC values, which indicate model precision.