Statistical Genomics Analysis of Simple Sequence Repeats from the Paphiopedilum Malipoense Transcriptome Reveals Control Knob Motifs Modulating Gene Expression

Abstract Simple sequence repeats (SSRs) are found in nonrandom distributions in genomes and are thought to impact gene expression. The distribution patterns of 48 295 SSRs of Paphiopedilum malipoense are mined and characterized based on the first full‐length transcriptome and comprehensive transcriptome dataset from 12 organs. Statistical genomics analyses are used to investigate how SSRs in transcripts affect gene expression. The results demonstrate the correlations between SSR distributions, characteristics, and expression level. Nine expression‐modulating motifs (expMotifs) are identified and a model is proposed to explain the effect of their key features, potency, and gene function on an intra‐transcribed region scale. The expMotif‐transcribed region combination is the most predominant contributor to the expression‐modulating effect of SSRs, and some intra‐transcribed regions are critical for this effect. Genes containing the same type of expMotif‐SSR elements in the same transcribed region are likely linked in function, regulation, or evolution aspects. This study offers novel evidence to understand how SSRs regulate gene expression and provides potential regulatory elements for plant genetic engineering.


Additional file 1:
Figure S1. Length of six SSR repeat types in the P. malipoense transcriptome.

Figure S2. Length of SSRs in the three transcribed regions.

Figure S3. SSR length of different motif sizes in the three transcribed regions. * indicates a significant difference at p < 0.05 and *** indicates a significant difference at p < 0.001 by the Kruskal-Wallis test and Dunn's pairwise test.
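As a rough illustration of the Kruskal-Wallis test named in the Figure S3 caption, the sketch below computes the tie-corrected H statistic for three groups, such as SSR lengths in the 5'-UTR, CDS, and 3'-UTR. The function names and the toy lengths are hypothetical and are not data from the study, and Dunn's pairwise test is not implemented here. With exactly three groups the chi-square approximation has 2 degrees of freedom, so the p value reduces to exp(-H/2).

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, degrees of freedom) for a list of samples.

    Assumes the pooled values are not all identical (tie correction > 0).
    """
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return h / correction, len(groups) - 1


def p_value_two_df(h):
    """Chi-square survival function for 2 degrees of freedom (three groups)."""
    return math.exp(-h / 2)


# Toy SSR lengths (hypothetical) for 5'-UTR, CDS, and 3'-UTR.
toy_lengths = [[10, 12, 14], [15, 18, 21], [22, 24, 30]]
h_stat, df = kruskal_wallis(toy_lengths)
p_value = p_value_two_df(h_stat)
```

For more than three groups the closed-form p value above no longer applies, and a general chi-square survival function would be needed.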

Figure S4. Adjusted standardized residuals from the chi-square test on the frequencies of standardized mononucleotide motifs in the P. malipoense transcriptome. This figure shows the deviation from the counts expected if motifs were distributed without bias among transcribed regions in the transcriptome.

Figure S5. Adjusted standardized residuals from the chi-square test on the frequencies of standardized dinucleotide motifs in the P. malipoense transcriptome. This figure shows the deviation from the counts expected if motifs were distributed without bias among transcribed regions in the transcriptome.

Figure S6. Adjusted standardized residuals from the chi-square test on the frequencies of standardized trinucleotide motifs in the P. malipoense transcriptome. This figure shows the deviation from the counts expected if motifs were distributed without bias among transcribed regions in the transcriptome.
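The adjusted standardized residuals plotted in Figures S4 to S6 can be sketched as follows, assuming a contingency table of motif counts (rows) by transcribed region (columns). Each residual is (observed - expected) divided by sqrt(expected x (1 - row total/N) x (1 - column total/N)). The function name and the toy counts are hypothetical and are not taken from the study.

```python
import math


def adjusted_residuals(table):
    """Adjusted standardized residuals for a contingency table of counts.

    Assumes every row and column total is positive and smaller than the
    grand total, so that no denominator is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of motif and region.
            expected = row_totals[i] * col_totals[j] / n
            scale = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out_row.append((observed - expected) / scale)
        residuals.append(out_row)
    return residuals


# Toy counts (hypothetical): two motifs in two transcribed regions.
toy_counts = [[10, 20], [20, 10]]
toy_residuals = adjusted_residuals(toy_counts)
```

Residuals with an absolute value above roughly 2 are commonly read as cells that depart from the unbiased expectation, positive values indicating enrichment and negative values indicating depletion.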

Figure S7. Comparisons of SSR characteristics among SSR-containing transcripts with different expression signatures. (a) The average SSR density among TPMmax (left) and TPMCV (right) levels. (b) The average GC content of SSRs (within SSR-containing transcripts) among TPMmax (left) and TPMCV (right) levels. (c) The

Figure S9. Distribution patterns of SSRs in lncRNAs at various expression levels. (a) Trends of the proportion of SSR-containing sequences and SSR abundance of lncRNAs as TPMmax (left) and TPMCV (right) decrease. (b) Trends of the GC contents of SSRs in lncRNAs and their sequence contexts as TPMmax (left) and TPMCV (right) decrease. (c) The abundance of mono-, di-, tri-, tetra-, penta-, and

Figure S10. Comparisons of SSR characteristics among SSR-containing lncRNAs with different expression signatures. (a) The average SSR density among TPMmax (left) and TPMCV (right) levels. (b) The average GC content of SSRs (within SSR-

Figure S11. Regression models based on lnTPMmax and SSR density of fully standardized expMotifs within the 5'-UTR. (a) Regression model for the correlation between the density of A/T repeats in 5'-UTR and lnTPMmax. (b) Regression model for the correlation between the density of C/G repeats in 5'-UTR and lnTPMmax. (c) Regression model for the correlation between the density of AG/CT repeats in 5'-

Figure S12. Regression models based on lnTPMmax and SSR density of fully standardized expMotifs within CDS. (a) Regression model for the correlation between the density of AAG/CTT repeats in CDS and lnTPMmax. (b) Regression model for the correlation between the density of AGG/CCT repeats in CDS and lnTPMmax. (c) Regression model for the correlation between the density of ATC/ATG repeats in CDS and lnTPMmax. The asterisks indicate the optimal model (p < 0.05). Overfitted models and expMotif-region combinations with a sample size of less than 15 were excluded.

Figure S13. Regression models based on lnTPMmax and SSR density of fully standardized expMotifs within the 3'-UTR. (a) Regression model for the correlation between the density of AG/CT repeats in 3'-UTR and lnTPMmax. (b) Regression model for the correlation between the density of AT/AT repeats in 3'-UTR and lnTPMmax. (c) Regression model for the correlation between the density of ATC/ATG repeats in 3'-UTR and lnTPMmax. (d) Regression models for the correlation between the density of AAAAT/ATTTT repeats in 3'-UTR and lnTPMmax. The asterisks indicate the optimal model (p < 0.05). Overfitted models and expMotif-region combinations with a sample size of less than 15 were excluded.

Figure S14. Regression models based on lnTPMCV and SSR density of fully standardized expMotifs within three transcribed regions. (a) Regression model for the correlation between the density of AT/AT repeats in 5'-UTR and lnTPMCV. (b) Regression model for the correlation between the length of ATC/ATG repeats in 5'-UTR and lnTPMCV. (c) Regression model for the correlation between the length of A/T repeats in CDS and lnTPMCV. (d) Regression models for the correlation between the density of A/T repeats in 3'-UTR and lnTPMCV. The asterisks indicate the optimal model (p < 0.05). Overfitted models and expMotif-region combinations with a sample size of less than 15 were excluded.
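To make the regressions in Figures S11 to S14 more concrete, the sketch below fits a simple least-squares line of ln(TPMmax) on SSR density and reports the slope, intercept, and R². This is only one candidate model form, chosen for illustration. The captions refer to an "optimal model" selected among several, and the candidate set and selection procedure are not given in this section. The function name and the toy values are hypothetical.

```python
import math


def fit_log_linear(density, tpm_max):
    """Least-squares fit of ln(TPMmax) = intercept + slope * density.

    Assumes all TPMmax values are positive, and that neither the densities
    nor the ln(TPMmax) values are all identical.
    """
    y = [math.log(t) for t in tpm_max]
    n = len(density)
    mean_x = sum(density) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in density)
    sxy = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(density, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination from residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * x)) ** 2 for x, yi in zip(density, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Toy values (hypothetical): SSR density and TPMmax for five transcripts.
toy_density = [0.5, 1.0, 1.5, 2.0, 2.5]
toy_tpm_max = [3.1, 4.8, 9.0, 14.5, 30.2]
slope, intercept, r_squared = fit_log_linear(toy_density, toy_tpm_max)
```

A positive slope would mean that expression rises with SSR density for that expMotif-region combination, and a negative slope the reverse.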

Figure S15. Regression models based on lnTPMmax and SSR density of actual expMotifs within the 5'-UTR and CDS. (a) Regression model for the correlation between the density of T repeats in 5'-UTR and lnTPMmax. (b) Regression model for the correlation between the density of T repeats in 5'-UTR and lnTPMmax. (c) Regression model for the correlation between the density of CT repeats in 5'-UTR

Figure S16. Regression models based on lnTPMmax and SSR density of actual expMotifs within the 3'-UTR. (a) Regression model for the correlation between the abundance of A repeats in 3'-UTR and lnTPMmax. (b) Regression model for the

Figure S17. Regression models based on lnTPMCV and SSR characteristics of actual expMotifs within transcribed regions. (a) Regression model for the correlation between the density of A repeats in 5'-UTR and lnTPMCV. (b) Regression model for the correlation between the abundance of CT repeats in 5'-UTR and lnTPMCV. (c)

Table S12. The optimal models of actual motif characteristics and lnTPMmax...

Table S13. The optimal models of actual motif characteristics and lnTPMCV....

Table S14. Grouping criteria for unigenes based on TPMmax and TPMCV values.

Table S3. Kruskal-Wallis test results of TPMmax among transcribed regions with different motif sizes of SSRs.
Different superscript letters represent significant differences, and the same superscript letters represent no significant difference (one-sided test, BH-adjusted p value < 0.05). Asterisks indicate that the group was excluded from the test because its sample size was less than ten within CDS or less than 20 within UTRs.

Table S4. Kruskal-Wallis test results of TPMCV among transcribed regions with different motif sizes of SSRs.
Different superscript letters represent significant differences, and the same superscript letters represent no significant difference (one-sided test, BH-adjusted p value < 0.05). Asterisks indicate that the group was excluded from the test because its sample size was less than ten within CDS or less than 20 within UTRs.

Table S5. Fully standardized motifs distributed in different transcribed regions had significantly different TPMmax values.
Different superscript letters represent significant differences, and the same superscript letters represent no significant difference (one-sided test, BH-adjusted p value < 0.05). Asterisks indicate that the group was excluded from the test because its sample size was less than ten within CDS or less than 20 within UTRs.

Table S6. Fully standardized motifs distributed in different transcribed regions had significantly different TPMCV values.
Different superscript letters represent significant differences, and the same superscript letters represent no significant difference (one-sided test, BH-adjusted p value < 0.05). Asterisks indicate that the group was excluded from the test because its sample size was less than ten within CDS or less than 20 within UTRs.

Table S9. Summary of the results of statistical tests and regression analyses of 29 candidate expMotifs.
The statistical test was the Kruskal-Wallis test (with Dunn's post hoc tests) or the Mann-Whitney test. Asterisks (*) indicate statistical significance (p value < 0.05). The TPMmax- and TPMCV-associated expMotifs are highlighted in yellow and blue, respectively.

Table S10. Actual motifs distributed in different transcribed regions had significantly different TPMmax values.
Different superscript letters represent significant differences, and the same superscript letters represent no significant difference (one-sided test, BH-adjusted p value < 0.05). Asterisks indicate that the group was excluded from the test because its sample size was less than ten.

Table S11. Actual motifs distributed in different transcribed regions had significantly different TPMCV values.
Different superscript letters represent significant differences, and the same superscript letters represent no significant difference (one-sided test, BH-adjusted p value < 0.05). Asterisks indicate that the group was excluded from the test because its sample size was less than ten.
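
The table notes report BH-adjusted p values. As a minimal sketch of the Benjamini-Hochberg adjustment, each raw p value is multiplied by m/rank (m tests, rank in ascending order), and the results are then made monotone from the largest p value downward and capped at 1. The function name and the toy p values are hypothetical and are not results from the study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy raw p values (hypothetical) from four pairwise comparisons.
toy_pvalues = [0.01, 0.04, 0.03, 0.20]
toy_adjusted = bh_adjust(toy_pvalues)
significant = [p < 0.05 for p in toy_adjusted]
```

In this toy case only the first comparison stays below 0.05 after adjustment, although three of the four raw p values were below that threshold.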