RNA‐Seq Data Analysis: A Practical Guide for Model and Non‐Model Organisms

RNA sequencing (RNA‐seq) has emerged as a powerful tool for assessing genome‐wide gene expression, revolutionizing various fields of biology. However, analyzing large RNA‐seq datasets can be challenging, especially for students or researchers lacking bioinformatics experience. To address these challenges, we present a comprehensive guide to provide step‐by‐step workflows for analyzing RNA‐seq data, from raw reads to functional enrichment analysis, starting with considerations for experimental design. This is designed to aid students and researchers working with any organism, irrespective of whether an assembled genome is available. Within this guide, we employ various recognized bioinformatics tools to navigate the landscape of RNA‐seq analysis and discuss the advantages and disadvantages of different tools for the same task. Our protocol focuses on clarity, reproducibility, and practicality to enable users to navigate the complexities of RNA‐seq data analysis easily and gain valuable biological insights from the datasets. Additionally, all scripts and a sample dataset are available in a GitHub repository to facilitate the implementation of the analysis pipeline. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
The development of high-throughput sequencing technologies has made it possible to uncover many organisms' transcriptional profiles at the whole-genome level. The technology of RNA-seq, or messenger RNA (transcriptome) sequencing, has permitted researchers to explore gene expression patterns with great precision in hundreds of model and non-model organisms. Even without a reference genome, a wealth of understanding related to processes such as cellular development, gene function, and responses to environmental stimuli, among others, has been uncovered (Stark et al., 2019). The RNA-seq methodology has often set the basis for developing molecular genetic analysis in non-model organisms and has become an essential tool. Researchers focused mainly on wet lab and field work sometimes struggle to exploit the data available from "next-generation sequencing" because they lack experience in bioinformatics, which is perceived to require in-depth computational and programming skills. This guide is intended to help bridge this gap.
Conventionally, an RNA-seq experiment involves subjecting organisms or cells to at least two experimental conditions and subsequently freezing the tissue or cell group of interest to halt transcriptional activity and initiate the RNA extraction process. Once the RNA is purified, cDNA libraries are constructed for all transcripts present at a given moment in the cell. These transcripts commonly include different classes of RNAs, such as tRNAs, ribosomal RNA, noncoding RNA, and messenger RNAs (mRNAs; Stark et al., 2019). However, given their importance, researchers are mainly interested in mRNAs that give rise to proteins. In recent years, there has also been a growing interest in determining the differential expression of other types of RNAs, such as small interfering RNAs (siRNAs) and microRNAs (Kasschau et al., 2007; Liu et al., 2019; Mehdi et al., 2021), which also play important roles in the regulation of different cellular processes. Once the cDNA libraries are built, they are sequenced using any of various different possible sequencing technologies; some of the technologies most widely used currently are those available commercially from Illumina.
In general, these protocols present the quality analysis of the FastQ files, the trimming and mapping process, and the differential gene expression analysis. Furthermore, we have included global analyses to facilitate the interpretation of the analyzed data, such as functional category enrichment analysis and data visualization through easily interpretable graphics. We also present a processing approach for non-model organisms for which no assembled and annotated genome is available. We encourage users of these protocols to share their scripts and output data on freely accessible platforms such as GitHub, providing the scientific community with the steps to obtain the results presented in scientific reports such as theses or indexed papers. All scripts and flowcharts provided here are documented in a GitHub repository, making it easier for users to implement these protocols and ensure repeatability when analyzing this type of data.


STRATEGIC PLANNING
When planning a differential gene expression experiment using RNA-seq, it is essential to consider several factors. First, the experimental design must clearly define the conditions that will be contrasted to correspond to the biological question of interest: the more precisely defined these are, the better the results. Second, having at least three biological replicates per condition is advisable. It is crucial to strive for as much consistency as possible between these replicates during the experiments. Given that RNA-seq is highly susceptible to small fluctuations in experimental conditions, experimental noise can often be reflected in the results.
Additionally, it is essential to consider the number of reads obtained after sequencing. Typically, having around 10 million reads per library is sufficient for most organisms. However, in the case of plants with large genomes, such as maize, which is ∼2.4 Gb, it is advisable to aim for at least 20-30 million reads for a meaningful analysis of differential gene expression.
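As a quick sanity check on sequencing depth, the read count of a library can be computed directly from the FASTQ file, since each read record occupies exactly four lines. A minimal sketch (the helper name and the toy file are ours, not part of the protocol):

```shell
# Count reads in a FASTQ file: each record spans exactly four lines,
# so the read count is the line count divided by four.
count_reads() {
    case "$1" in
        *.gz) zcat "$1" ;;   # decompress gzipped libraries on the fly
        *)    cat "$1" ;;
    esac | wc -l | awk '{ print $1 / 4 }'
}

# Toy two-read file standing in for a real library:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' > toy.fastq
count_reads toy.fastq   # → 2
```

Running this on a real library (e.g., `count_reads SRR10207204_1.fastq.gz`) lets you confirm that the depth targets above were met.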
The sequencing files for each sample are recorded in FastQ format. Each read record typically contains the read identifier, the sequence expressed in the four bases (CGAT, and in some cases "N" to indicate an undefined base), a separator line beginning with "+", and the quality with which each base was read. The quality is encoded in ASCII characters representing Phred values (Ewing & Green, 1998). These values are used to assess sequencing quality, and conventionally, RNA-seq analysis begins by examining read quality using the FastQC program. After this analysis, a trimming process to eliminate low-quality sequences and remove adapters is highly recommended. For this process, Trimmomatic has been one of the most versatile and valuable programs developed so far (Bolger et al., 2014). After data are trimmed and high-quality FastQ files are obtained, the next stage is mapping the RNA-seq sequences to an assembled and annotated genome if this option is available. When these resources are not available, it is necessary to perform a de novo assembly.
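The ASCII encoding can be made concrete: in the Phred+33 scheme used by current Illumina FastQ files, each quality character's ASCII code minus 33 is the Phred score of the corresponding base. A small sketch (the helper name and the example quality string are illustrative):

```shell
# Decode a Phred+33 quality string: ASCII code minus 33 per character.
# 'I' is ASCII 73, i.e., Q40 (a 1-in-10,000 base-call error probability).
phred_decode() {
    printf '%s' "$1" | od -An -tu1 | tr -s ' ' '\n' \
        | awk 'NF { printf "%s%d", sep, $1 - 33; sep = " " } END { print "" }'
}

phred_decode 'II#5'   # → 40 40 2 20
```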
Once the read counts per gene are obtained, the user can analyze differential expression levels to identify those genes whose expression changes (or differs) significantly between the two conditions being compared. These protocols present a differential expression analysis based on the DESeq2 program, which runs in the R environment (Love et al., 2014). As an alternative, we also recommend edgeR (Robinson et al., 2010), which provides various examples for different cases in its manual and essentially performs the same processing as DESeq2. For practical purposes, we have not covered edgeR extensively, but the variations between the two are minor. Figure 1 proposes a general workflow for model and non-model organisms.

ANALYSIS OF DATA FROM A MODEL PLANT WITH AN AVAILABLE REFERENCE GENOME
This protocol meticulously outlines RNA-seq data analysis using the Arabidopsis thaliana reference genome. A typical RNA-seq analysis involves quality control and filtering of reads based on quality measures. Additionally, it includes alignment with the reference genome and, ultimately, the generation of the count matrix. These steps are executed using various programs, and their key characteristics and essential parameters are detailed below. Following this protocol as described will enable the analysis of differential expression in the dataset and the visualization of results through graphs such as principal component analysis (PCA) plots, M (log ratio)-A (mean average) (MA) plots, volcano plots, and heatmaps, through which differentially expressed genes can be identified.

Efficient progress in these tasks can be achieved by utilizing the command line within a Unix/Linux environment. We conducted our work on a 64-bit Ubuntu 23.04 operating system. For Windows users, the Windows Subsystem for Linux (WSL; https://learn.microsoft.com/en-us/windows/wsl/install) is a viable option. For macOS users, it is feasible to execute identical commands in the terminal because of the shared Unix foundation of both operating systems.

Hardware
At least 4 GB of RAM and 50 GB of storage space

Scripts and Docker containers (detailed in the steps) available at the authors' GitHub repository, https://github.com/jmvillalobos/RNA-seq-protocol

Dataset(s) to be analyzed: the datasets used in the example here (detailed in the steps) are available in a public repository on Zenodo: https://zenodo.org/records/10537097

Part 1: Installing packages and creating directories

1. To install the bioinformatics tools that we will use in this protocol, go to the terminal and create a general folder in which the analysis will be carried out:

2. Move into the folder project_2023 using the cd command, and create subdirectories using the mkdir command:
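The mkdir commands themselves were not preserved in this copy; a minimal sketch of the directory scaffold, using the folder names that later steps refer to:

```shell
# Create the project folder and the subdirectories referenced in later
# steps (names taken from this protocol).
mkdir -p project_2023
for d in bioinformatics_tools raw_data trimming_data genome_arabidopsis \
         index_hisat2 alignment_hisat2 quantification_featureCounts; do
    mkdir -p "project_2023/$d"
done
```

Keeping one subdirectory per processing stage, as here, makes it easy to locate intermediate files later in the protocol.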

8. To install the complete R system, use:

user:∼/project_2023/bioinformatics_tools$ sudo apt-get install r-base

You have now finished installing all the tools needed to carry out Basic Protocol 1.

Part 2: Data download
The data we will use as our example here pertain to a study published in The Plant Journal (Villalobos-Escobedo et al., 2020) characterizing the roles of the Nox genes of the fungus Trichoderma atroviride during its interaction with A. thaliana. With this command, we instruct FastQC to perform quality analysis for all files with a .gz extension.
12. If you need to determine the quality of a particular file, simply run fastqc followed by the file name:

user:∼/project_2023/raw_data$ fastqc SRR10207204_1.fastq.gz

The quality analysis carried out by FastQC will yield two files, a .zip and a .html file. The .html file contains the quality analysis performed by FastQC, which can be viewed using a web browser.
13. Maintaining order in your working directories is essential, so we suggest creating a folder for the raw-data quality output, into which the files generated from the quality analysis can be moved:
14. Trimmomatic: With Trimmomatic, perform read trimming, writing the trimmed reads into the trimming_data folder. The command is the following: The results of read trimming with Trimmomatic are shown in Figure 2.
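The Trimmomatic command was not reproduced in this copy. The sketch below assembles a typical paired-end invocation as a dry run; the sample name, adapter file, and thresholds are illustrative assumptions, not values from the original protocol. Inspect the printed command and run it once it matches your data.

```shell
# Build a typical paired-end Trimmomatic command as a string and print
# it (dry run). Sample name, adapter file, and trimming thresholds are
# illustrative; adjust before executing.
SAMPLE=SRR10207204
CMD="trimmomatic PE -threads 4 \
raw_data/${SAMPLE}_1.fastq.gz raw_data/${SAMPLE}_2.fastq.gz \
trimming_data/${SAMPLE}_1.trim.fastq.gz trimming_data/${SAMPLE}_1.unpaired.fastq.gz \
trimming_data/${SAMPLE}_2.trim.fastq.gz trimming_data/${SAMPLE}_2.unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36"
echo "$CMD"
```

In PE mode Trimmomatic takes the two input files, then the paired/unpaired output pair for each mate, followed by the trimming steps; SLIDINGWINDOW:4:20 drops bases once the 4-base window average quality falls below 20, and MINLEN discards reads shorter than 36 bases after trimming.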


Part 5: Alignment of reads
Once the read-trimming process is completed, the next stage in RNA-seq analysis is to align reads to a reference. When RNA-seq data come from an organism for which a reference genome is available, it is possible to infer the identity of the expressed transcripts by mapping the reads to the genome or transcriptome (Conesa et al., 2016). To align reads to a reference genome, tools such as HISAT2, STAR, TopHat2, Bowtie 2, and Rsubread can be used (Dobin et al., 2013; Kim et al., 2013; Kim et al., 2019; Liao et al., 2019; Langmead & Salzberg, 2012). Conversely, Kallisto or Salmon can be employed for pseudoalignment/quasi-mapping toward a reference transcriptome (Bray et al., 2016; Patro et al., 2017). In this protocol, we will only use HISAT2 (version 2.2.1), but we have provided an alternative protocol for the use of Kallisto and Salmon on GitHub (https://github.com/jmvillalobos/RNA-seq-protocol).

Part 5.1: Alignment to a reference genome using HISAT2
The aligner we will use is HISAT2, which employs a graph-based data structure and alignment algorithm for rapid and sensitive alignment of sequencing reads to a genome and a comprehensive collection of minor variants (Kim et al., 2019). Before commencing read mapping, it is essential to have an index of the reference genome. For this purpose, the specific reference genome is necessary; for the example data, we use the TAIR10 version of the A. thaliana genome, which can be obtained from EnsemblPlants (https://plants.ensembl.org/info/data/ftp/index.html). Subsequently, proceed to decompress the genome using the gunzip command and create the index in the genome_arabidopsis folder.

17. Download genome data: If you want to keep the original ".gz" file, add "-k" between gunzip and the name of the file to be decompressed.
19. Create the index in the index_hisat2 folder with HISAT2:

Pola-Sánchez et al.

user:∼/project_2023/index_hisat2$ hisat2-build -p 4 ../genome_arabidopsis/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa genome

Where: hisat2-build is the HISAT2 command used to build the index; -p specifies the number of computer cores to be used (in this case, four); ../genome_arabidopsis indicates the genome path, including the specific file (Arabidopsis_thaliana.TAIR10.dna.toplevel.fa); and genome is the name assigned to the files resulting from the index.

20. The process should generate a directory containing eight files.

Where: -x is the index path and base name of the reference genome index; -1 and -2 are the paired-end files to be aligned, with their respective paths; and -S is the output SAM alignment file.
22. Generate a loop to speed up the analysis: The alignment result produces SAM (sequence alignment map) files, which are text files containing sequence alignment information relative to the reference genome (Li et al., 2009). BAM (binary alignment map) files are compressed versions of SAM files containing the same information but in a non-human-readable format. BAM files are more efficient and smaller, making them easier to process. After generating a BAM file, it is essential to sort and index it, because reads appear in arbitrary order in the original file. The choice to sort by sequence identifier or genomic coordinates depends on the application, with sorting by coordinates being common for genomic data. Samtools is a commonly used tool to perform these tasks (Danecek et al., 2021).
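The per-sample loop can be sketched as follows. Because the SAM files are produced earlier in the protocol, the helper below (the function name is ours) only prints the Samtools commands it would run, as a dry run; pipe its output to sh to actually execute it.

```shell
# Print the Samtools commands that would convert every SAM file in a
# directory into a coordinate-sorted, indexed BAM (dry run).
sam_to_sorted_bam() {
    for sam in "$1"/*.sam; do
        [ -e "$sam" ] || continue       # skip when the glob matches nothing
        base=${sam%.sam}
        echo "samtools sort -@ 4 -o ${base}.sorted.bam ${sam}"
        echo "samtools index ${base}.sorted.bam"
    done
}

mkdir -p alignment_hisat2 && : > alignment_hisat2/SRR10207204.sam  # toy placeholder
sam_to_sorted_bam alignment_hisat2
```

`samtools sort` with `-o` writes the coordinate-sorted BAM directly, and `samtools index` then makes the file ready for featureCounts and genome browsers.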
24. To convert SAM to BAM format and sort it, execute the following command in the folder containing the SAM alignment files (per sample):

29. We can use a "for" loop to automate the quantification of all samples, following a similar approach to that used with Trimmomatic, HISAT2, and Samtools. To simplify the process further, we propose creating a shell script that dynamically operates on the sorted .bam files in the alignment_hisat2 directory. This script can be inserted into a text editor such as Nano or Vim and saved with the name featureCounts.sh. Subsequently, the necessary permissions are granted using the command chmod +x featureCounts.sh, and finally, the script is executed in the quantification_featureCounts directory.

32. To achieve this, run a series of commands that generate a count matrix for all samples with their respective identifiers. For this purpose, install two data manipulation packages:

user:∼/project_2023/quantification_featureCounts$ sudo apt install moreutils
user:∼/project_2023/quantification_featureCounts$ sudo apt install parallel

As with other installations, we recommend using the Conda repository for macOS systems.

Now, it is necessary to edit the headers of the count matrix; we can do this manually, matching the accession number with the treatment (Table 1). The result should look as shown in Table 2.
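The header edit can also be scripted rather than done by hand. The sketch below renames accessions in the first line of the matrix with sed; the toy matrix and the accession-to-label pairs are illustrative, so take the real mapping from Table 1.

```shell
# Toy count matrix standing in for the featureCounts output.
printf 'Geneid\tSRR10207204\tSRR10207207\nAT1G01010\t52\t10\n' > counts_matrix.txt

# Rewrite only the header line (the '1s' address), mapping each
# accession to its condition label, and keep the original file intact.
sed -e '1s/SRR10207204/Treatment_1/' \
    -e '1s/SRR10207207/Control_1/' \
    counts_matrix.txt > counts_matrix_renamed.txt

head -n 1 counts_matrix_renamed.txt
```

Restricting the substitutions to line 1 guarantees that gene identifiers in the body of the matrix are never touched.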

Part 8: Differential gene expression analysis
Differential expression analysis is essential in interpreting RNA-seq data, allowing the determination of quantitative changes in gene expression levels between experimental groups (Fig. 3). Specialized tools such as edgeR and DESeq2, based on negative binomial distributions, are prominent in differential expression analysis (Love et al., 2014; Robinson et al., 2010). In this specific protocol, we will focus on implementing DESeq2 to explore changes in gene expression of A. thaliana plants inoculated with T. atroviride (Treatment) versus non-inoculated plants (Control).
To perform the differential expression analysis, we will use R and the packages indicated in the script described here, provided on GitHub, as a supplementary script (https:// github.com/jmvillalobos/ RNA-seq-protocol).

> library("ggplot2") # Loading the plotting library
> library("EnhancedVolcano") # Loading the library for the volcano plot
> library("pheatmap") # Loading the library for the heatmap

38. Load the count table from featureCounts into the R session:

PCA reveals sample variations and flags potential inconsistencies among replicates (see the Understanding Results section for details).
# Generating a PCA plot

41. Filter the genes to be considered in the analysis by the number of reads.
It is not necessary to perform filtering before applying DESeq2 functions; however, the developers of DESeq2 recommend pre-filtering low-count genes before applying some of its functions. This practice offers two significant advantages: first, by removing rows with a very low number of reads, it reduces the memory size of the dds data object, thus improving computational efficiency. Second, this pre-filtering can enhance the clarity of visualizations by eliminating features that do not provide relevant information. Therefore, in our protocol, we have chosen to pre-filter to retain only those rows that have a count >10.
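In the protocol this pre-filtering happens inside R on the dds object; the same idea can be illustrated on the count matrix itself with awk, keeping the header plus every gene whose summed count exceeds 10 (the toy matrix below is ours):

```shell
# Toy matrix: one gene passes the >10 total-count filter, one does not.
printf 'Geneid\tT1\tC1\nAT1G01010\t52\t10\nAT1G01020\t3\t2\n' > counts_demo.txt

# Keep the header (NR == 1) and any row whose counts sum to more than 10.
awk 'NR == 1 { print; next }
     { s = 0; for (i = 2; i <= NF; i++) s += $i; if (s > 10) print }' \
    counts_demo.txt > counts_filtered.txt
```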

> head(res)

42. Set the comparison of interest.
In this section, we define the comparison of interest as "Treatment vs. Control." If working with additional conditions, the user would need to adjust the corresponding values in the "contrast" argument; for example, contrast = c("condition", "Light", "Control") or contrast = c("condition", "Infection", "Control"). It is also possible to adjust cutoff parameters such as padj, which is a transformation of the p-value after conducting multiple tests. The "padj" values are adjusted using methods such as the Benjamini-Hochberg correction, which controls the false discovery rate (FDR) when multiple hypothesis tests are conducted simultaneously. Additionally, it is possible to adjust the fold change (FC) threshold, which does not necessarily have to be treated as a filter but rather is a widely used reference in the literature, for example, log2FC = 1. However, it is not necessary to impose a specific FC value if you decide against it. This allows the analysis to be adapted to different experimental designs without altering the code structure.


GENE ONTOLOGY ENRICHMENT ANALYSIS
In Basic Protocol 2, we will perform a functional enrichment analysis to comprehensively understand the underlying biological processes (Gene Ontology or GO terms; Fig. 4). Various tools are available for this type of analysis, such as CAMERA, GOseq, topGO, clusterProfiler, and gprofiler2 (Alexa & Rahnenfuhrer, 2023; Robertson, H., & Robertson, N., 2024; Kolberg et al., 2020; Wu & Smyth, 2012; Wu et al., 2021; Young et al., 2010). In this protocol, we focus on using clusterProfiler because of its user-friendly interface and efficient integration with the differential expression output generated by DESeq2. clusterProfiler performs over-representation analysis (ORA; Boyle et al., 2004), a widely used approach to determine whether known biological functions or processes are over-represented (enriched) in an experimentally derived list of genes, such as a list of differentially expressed genes. To perform this analysis, follow the R script described here and on GitHub (https://github.com/jmvillalobos/RNA-seq-protocol).
In terms of systems resources, efficient progress in this task can be achieved by utilizing the command line within a Unix/Linux environment. We conducted this work on a 64-bit Ubuntu 23.04 operating system. For Windows users, the Windows Subsystem for Linux (WSL; https://learn.microsoft.com/en-us/windows/wsl/install) is a viable option. For macOS users, it is feasible to execute identical commands in the terminal because of the shared Unix foundation of both operating systems. This protocol is compatible with the R desktop application, contingent on your operating system. We annotated the differentially expressed genes, obtaining functional categories, using the "athaliana_eg_gene" dataset.

Part 2: Over-representation analysis (ORA) with clusterProfiler

5. Change the working directory:

> setwd("C:/project_2023/quantification_featureCounts")

6. Read all Arabidopsis genes from a file (universe): The universe_arabidopsis file is used as the set of genes considered the "universe" in the functional over-representation analysis. The term "universe" refers to the complete set of genes used as a basis for comparison with a subset of differentially expressed genes (DEGs). This comparison is designed to determine whether specific biological terms or functions are over-represented in the DEGs compared to the entire set.
7. For compatibility with the enrichGO function, annot_universe genes must be characters, not integers, so it is necessary to convert them:

> annot_universe$entrezgene_id <- as.character(annot_universe$entrezgene_id)

8. Perform ORA for the Gene Ontology Biological Process class:

BASIC PROTOCOL 3 DE NOVO ASSEMBLY OF DATA FROM NON-MODEL PLANTS
In Basic Protocol 1, we analyzed the transcriptome using a reference genome; in the absence of a reference genome, it is necessary to construct the transcriptome de novo. For this purpose, in Basic Protocol 3, we use a dataset from Sarwar et al. (2019), who generated it while exploring drought tolerance in Agave sisalana. Employing Trinity (Grabherr et al., 2011), we perform de novo assembly on a curated dataset of 10,000 reads for efficient execution on personal computers. Part of the workflow used in this protocol is shown in Figure 1. The dataset is available on GitHub (https://github.com/jmvillalobos/RNA-Seq-protocol).
In terms of systems, efficient progress in this task can be achieved by utilizing the command line within a Unix/Linux environment. We conducted our work on a 64-bit Ubuntu 23.04 operating system. For Windows users, the Windows Subsystem for Linux (WSL; https://learn.microsoft.com/en-us/windows/wsl/install) is a viable option. For macOS users, it is feasible to execute identical commands in the terminal because of the shared Unix foundation of both operating systems. It is important to note that accurate data analysis demands substantial computational resources to execute effectively (see below). Before commencing de novo assembly, create the specified directories and deposit the clean data in the trimming_data_agave folder.
3. Organize data in preparation for de novo assembly using Trinity: Before initiating the de novo assembly, we consolidate sequencing data from the treatment and

control conditions into a unified input file for Trinity (--samples_file). To accomplish this, furnish Trinity with a list of FASTQ files organized by treatment/control and replicate name, adhering to the structure outlined in a file named samples.txt, as explained below. The CPU and max_memory parameters should be set according to the resources available on the user's computer. It is important to highlight that it is essential to carry out the assembly using a high-performance computing server for real data analysis.
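The samples.txt layout that Trinity expects for --samples_file is tab-separated: condition, replicate name, left reads, right reads, one line per replicate. A sketch that writes such a file (all file names are illustrative):

```shell
# Write a samples.txt for Trinity's --samples_file option:
# condition <TAB> replicate <TAB> left reads <TAB> right reads.
{
    printf 'control\tcontrol_rep1\tC1_1.trim.fq\tC1_2.trim.fq\n'
    printf 'control\tcontrol_rep2\tC2_1.trim.fq\tC2_2.trim.fq\n'
    printf 'drought\tdrought_rep1\tD1_1.trim.fq\tD1_2.trim.fq\n'
    printf 'drought\tdrought_rep2\tD2_1.trim.fq\tD2_2.trim.fq\n'
} > samples.txt

# Sanity check: every line must have exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad }' samples.txt && echo "samples.txt OK"
```

Using printf with explicit \t characters avoids the classic pitfall of spaces sneaking in where Trinity requires tabs.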
5. After the assembly, you will obtain two files and a directory; the file trinity_out_dir.Trinity.fasta represents the assembled transcripts: The "gene" identifier corresponds to the prefix of the transcript identifier (e.g., TRINITY_DN992_c0_g1), and distinct isoforms of that "gene" will have varying isoform numbers in the identifier suffix (e.g., TRINITY_DN992_c0_g1_i1 and

TRINITY_DN992_c0_g1_i2, representing two different isoform sequences reconstructed for the single gene TRINITY_DN992_c0_g1).
Part 2: Evaluating the assembly

After completing the de novo assembly, it is crucial to evaluate its quality, meticulously considering potential errors. Before proceeding to further analyses, it is vital to perform a comprehensive assessment of the de novo transcriptome assembly, as a subpar assembly may lead to misinterpretation of gene identification and differential expression analysis (Raghavan et al., 2022). Key quality metrics include the read alignment rate, the ExN50 statistic, and the count of universal genes with matches according to the BUSCO values (Simão et al., 2015). Notably, in this protocol, we exclusively visualize the basic statistics of the assembly. In a high-quality assembly, one would expect the reads used to construct the assembly to exhibit a high alignment percentage; an alignment rate of 70% could indicate satisfactory transcriptome quality. For BUSCO, however, there is no consensus about the optimal value, as it is a relative metric that can vary between different assemblers or assembly configurations. Nevertheless, the general expectation is to find the most complete genes. In this context, the N50 value is another important metric, indicating that at least half of the assembled bases are contained within contigs equal to or longer than that value. However, this metric is not suitable for transcriptome assemblies, because there the goal is the recovery of many assembled sequences and not just the construction of a few long contigs. A more appropriate statistic for evaluating transcriptome assemblies is the ExN50 statistic, which excludes contigs with low expression that tend to be very short because of insufficient read coverage. This statistic often yields higher values than the conventional N50 statistic and is more representative in this context. The ExN50 value in an RNA-seq assembly conducted with Trinity is crucial and can vary significantly depending on several factors, such as the species under study, the quality of the sequencing data, and the transcriptome complexity. In general terms, a high ExN50 value suggests a more complete and contiguous assembly, which is highly desirable. However, it is important to note that there is no universally "good" threshold for ExN50, as its assessment depends heavily on the aforementioned factors. Consequently, it is advisable to interpret the ExN50 value in the specific context of each project and to consider other assembly quality parameters for a more comprehensive and accurate evaluation. In general, all of these metrics depend on several factors, such as the species studied, the quality of the sequencing data, and the assembly tools used.
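The N50 definition can be made concrete: sort the contig lengths in descending order and walk down the list until the running total reaches half of all assembled bases; the length at that point is the N50. A sketch on illustrative lengths:

```shell
# Illustrative contig lengths: total = 2600 bases, so half = 1300.
printf '%s\n' 1000 800 500 200 100 > lengths.txt

# N50: descending sort, then report the length at which the running
# sum first reaches half of the total assembled bases.
n50=$(sort -rn lengths.txt | awk '
    { len[NR] = $1; total += $1 }
    END {
        half = total / 2
        for (i = 1; i <= NR; i++) {
            run += len[i]
            if (run >= half) { print len[i]; exit }
        }
    }')
echo "N50 = $n50"   # 1000 + 800 = 1800 >= 1300, so N50 = 800
```

ExN50 applies the same calculation after restricting the contig set to the most highly expressed transcripts, which is why it is the more meaningful variant for transcriptomes.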
8. To determine the count of assembled transcripts, employ the following command:
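The command itself did not survive in this copy of the protocol; counting FASTA header lines is the usual approach, since each assembled transcript record starts with ">" (the toy file below stands in for trinity_out_dir.Trinity.fasta):

```shell
# Count assembled transcripts by counting FASTA header lines.
printf '>TRINITY_DN1_c0_g1_i1\nACGT\n>TRINITY_DN2_c0_g1_i1\nTTGG\n' > toy.fasta
grep -c '^>' toy.fasta   # → 2
```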
12. After this clustering process, the user can revisit the statistics, as mentioned previously, and observe a reduction in the number of contigs.

Total trinity transcripts: 1904
Percent GC: 49.19

Part 4: Transcript expression quantification with Kallisto
After eliminating redundancy from the assembly, the subsequent step entails estimating the expression values of the transcripts. This task will be performed using Kallisto, executing a script seamlessly integrated with Trinity.
13. The quantification process can be carried out according to the following details for each replicate within a particular condition: You have now produced count matrices at both the gene and isoform levels. You can proceed with the differential expression analysis using the scripts integrated into Trinity, or opt to perform Basic Protocol 1 manually with minimal adjustments to the conditions.

COMMENTARY Background Information
This article offers a comprehensive guide for performing differential gene expression analysis using two distinct approaches. In the first approach (presented in Basic Protocols 1 and 2), HISAT2 is employed to align reads to the reference genome; in contrast, in the second approach (detailed in an alternative protocol on GitHub at https://github.com/jmvillalobos/RNA-seq-protocol), Kallisto is utilized, relying on pseudoalignment to predict transcripts of the organism. Given the nature of the analysis, we strongly recommend using HISAT2, as it tends to yield more precise results. However, our comparison of results at the level of differentially expressed genes highlights the remarkable similarity between the results obtained with the two strategies (Fig. 5). Therefore, for users with limited computational resources, opting for Kallisto

can be an excellent strategy to reduce time and computational load significantly.
Additionally, Basic Protocol 3 addresses the application of the Trinity software in a scenario without a reference genome, conducting de novo assembly for a dataset from a non-model organism. Although the process we describe here was expedited by limiting the reads to only 10,000, we must emphasize the substantial demand for computational resources when working with more realistic datasets. For this reason, in real-world scenarios, we recommend performing the assembly on servers with higher RAM resources. We encourage the use of services such as Amazon Web Services.
This protocol is a practical tool for experienced researchers and a valuable resource for students seeking to delve into RNA-seq analysis.Discussing differences between mappers, models for differential expression analysis, and considerations when working with or without a reference genome provides a deeper and more contextual understanding for new users.This inclusive approach facilitates the entry and participation of students in this research field and contributes to the continuous development of skills in genomic data analysis.

Critical Parameters
A crucial step in executing this series of protocols is carefully selecting a suitable version of a Linux-based system. When opting for Ubuntu, using at least version 20.04 is strongly recommended. Another vital consideration is to run command lines with appropriate computational resources, adjusting them to the user's computer capabilities and avoiding exceeding the established limits for RAM and threads.
Although many of the commands in these protocols are executed with default parameters, it is highly advisable to thoroughly review the list of parameters to be used. These may vary in some cases depending on the organism or project objectives. Nevertheless, these protocols are entirely functional and readily interpretable for various organisms.

Troubleshooting Table
For a list of possible problems, causes, and solutions, see Table 4.

Basic Protocol 1
In this protocol, we tested the differential gene expression analysis of A. thaliana inoculated with spores of the fungus T. atroviride (Treatment) as compared to non-inoculated plants (Control condition). The principal component analysis (PCA) plot reveals consistent clustering between biological replicates for each condition (Fig. 3A). This type of analysis is of fundamental importance in RNA-seq experiments because it provides valuable insights from the initial stages of the investigation. PCA visualizes sample variability and can flag potential inconsistencies among replicates.

Next, an MA plot showing the log base 2 (log2) value of fold changes is obtained (Fig. 3B). In this graph, we set a significance value cutoff (alpha) of 0.05, so points below this threshold were colored blue. On the other hand, the volcano plot in Figure 3C represents differentially expressed genes in the Treatment versus Control contrast, including those induced and those repressed. This graph is informative because it highlights the genes with the most significant changes in expression levels based on the log2(fold change) and adjusted p-value criteria. The visual representation of this information provides a precise and rapid perspective on the magnitude of changes in gene expression, allowing the efficient identification of the most relevant genes in the Treatment versus Control contrast. Finally, Figure 3D shows the top ten differentially expressed genes of this contrast, in which a well-defined group of negatively regulated genes is observed in the control condition, whereas in the treatment condition, they are overexpressed. These differential expression results are consistent with those reported in the article from which these data are taken (Villalobos-Escobedo et al., 2020).

Basic Protocol 2
After identifying the differentially expressed genes, a quick way to grasp the overall landscape of the biological phenomenon in experimental comparisons is to conduct a functional category enrichment analysis. This analysis relies heavily on the annotation level of the genome used. In the case of Arabidopsis, the genomic annotation is quite comprehensive, making the use of GO terms highly convenient. Our protocol employs the clusterProfiler tool because of its versatility in enrichment analysis and in generating graphs that aid result comprehension. We present three useful types of graphs (Fig. 4). First, a dot plot allows visualization of the proportion of differentially expressed genes annotated to each category (GeneRatio), the significance of enrichment (adjusted p-value, padj), and the gene count. A similar graph is the bar plot, although the GeneRatio is omitted there to simplify the figure and enhance comprehension. Lastly, we introduce the emap plot, which displays the relationships between GO terms and indicates how related these terms are, potentially revealing linked and enriched processes. These graphs are valuable tools for demonstrating result consistency and comprehensively highlighting significant processes.
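Under the hood, the overrepresentation analysis performed by clusterProfiler tests each GO term against a hypergeometric distribution. The self-contained Python sketch below (hypothetical gene numbers, standard library only) illustrates both the test and the GeneRatio value shown in the dot plot:

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric upper-tail p-value P(X >= k): the probability of
    drawing at least k term-annotated genes when sampling n DEGs from a
    genome of N genes, K of which carry the GO term (classic ORA test)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Hypothetical numbers: 20,000 genes, 200 annotated to a term,
# 500 upregulated DEGs, 20 of them carrying the term.
p = enrichment_pvalue(N=20000, K=200, n=500, k=20)
gene_ratio = 20 / 500   # the "GeneRatio" shown in clusterProfiler dot plots
print(f"p = {p:.3g}, GeneRatio = {gene_ratio}")
```

In practice, clusterProfiler additionally adjusts these p-values for multiple testing (e.g., Benjamini-Hochberg), which is the padj value plotted in Figure 4.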

Basic Protocol 3
This protocol involves the analysis of data derived from an organism that lacks a reference genome (a non-model organism). It shares with Basic Protocol 1 the quality analysis phase with FastQC and the trimming of reads with Trimmomatic. Subsequently, a de novo assembly is carried out using Trinity with a curated data set. Examining assembly statistics such as alignment rate, ExN50, and BUSCO scores is essential. However, this protocol does not address these statistics because of the low quality of the assembly obtained, which we attribute to the use of a minimal data set so that the process could run on a personal computer in this demonstration. Eliminating redundancy in the assembly is highly recommended; here, CD-HIT was used successfully to reduce redundant transcripts. Subsequently, abundance quantification was performed using Kallisto, generating count matrices at the isoform level (Kallisto.isoform.counts.matrix) and at the gene level (Kallisto.gene.counts.matrix). Because of space limitations, Basic Protocol 3 does not include details of the differential expression analysis and transcriptome annotation steps. In general, however, these are very similar to those discussed in Basic Protocols 1 and 2.
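To give intuition for the gene-level aggregation step, the Python sketch below (hypothetical Trinity identifiers and counts) collapses isoform-level counts into gene-level totals by summing over an isoform-to-gene map, which is conceptually what Trinity's matrix-building script does with its gene_trans_map:

```python
from collections import defaultdict

def gene_level_counts(isoform_counts, iso2gene):
    """Collapse an isoform-level count vector to gene level by summing
    the counts of all isoforms mapped to the same Trinity 'gene'."""
    totals = defaultdict(float)
    for iso, count in isoform_counts.items():
        totals[iso2gene[iso]] += count
    return dict(totals)

# Hypothetical Trinity identifiers: TRINITY_DN0_c0_g1_i1/_i2 are two
# isoforms of the same gene TRINITY_DN0_c0_g1.
iso2gene = {
    "TRINITY_DN0_c0_g1_i1": "TRINITY_DN0_c0_g1",
    "TRINITY_DN0_c0_g1_i2": "TRINITY_DN0_c0_g1",
    "TRINITY_DN1_c0_g1_i1": "TRINITY_DN1_c0_g1",
}
counts = {"TRINITY_DN0_c0_g1_i1": 120.0,
          "TRINITY_DN0_c0_g1_i2": 30.0,
          "TRINITY_DN1_c0_g1_i1": 55.0}
print(gene_level_counts(counts, iso2gene))
```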

Time Considerations
The time required for the analyses presented varies depending on the protocol used and the amount of data to be analyzed. Basic Protocols 1 and 2 will take ∼1 day of work, depending mainly on the computational power of the equipment used for the analysis. Execution of Basic Protocol 3 is estimated to take ∼4 hr.

Figure 1
Figure 1 Proposed workflows for model (A) and non-model organisms (B).

Figure 2
Figure 2 Quality analysis performed by FastQC. (A and B) Per-base sequence quality of the SRR10207218_1.fastq.gz library before (A) and after (B) read trimming with Trimmomatic.

Figure 3
Figure 3 Graphics generated from the analysis of differential gene expression. (A) PCA plot illustrating the variation between treatment and control replicates. (B) MA plot displaying the log2(fold change) of the treatment over the mean of normalized counts. (C) Volcano plot highlighting in red the genes differentially expressed in the treatment versus control condition. (D) Heatmap showcasing the 10 most prominent differentially expressed genes in the treatment-versus-control comparison.

Figure 4
Figure 4 GO term enrichment results obtained using clusterProfiler, displaying enriched categories for the genes in the treatment versus control contrast (upregulated genes). (A) Dot plot, (B) bar plot, and (C) enrichment map (emap) plot showcase the top enriched GO terms based on overrepresentation analysis.

Figure 5
Figure 5 Differentially expressed genes (DEGs) obtained with HISAT2 and Kallisto. (A) Bar graph illustrating the number of induced and repressed genes for each bioinformatics tool. (B and C) Venn diagrams illustrating the upregulated (B) and downregulated (C) genes shared by and unique to HISAT2 and Kallisto. Only genes meeting the criteria padj < 0.05 and |log2(fold change)| > 1, according to the DESeq2 analysis, were considered. We generated these Venn diagrams using the online tool available at https://bioinformatics.psb.ugent.be/webtools/Venn/.

Pola-Sánchez et al.
The raw RNA-seq data are publicly available in the NCBI SRA database at www.ncbi.nlm.nih.gov/sra/ with the accession number PRJNA575031.To simplify the process, we will use the European

Part 4: Read trimming with Trimmomatic
Now that we have assessed the quality of the raw data, a common next task is read trimming. For this purpose, various bioinformatics tools have been developed,

Table 1
Accession and Treatment Numbers of the Samples Used for Analysis

Table 3
Differential expression tables can be downloaded from https://github.com/jmvillalobos/RNA-seq-protocol. The output of a differential expression table generated with DESeq2 should look like Table 3.

Counting and gene expression matrices
Using the quant.sf files, a matrix with estimated counts and another with TPM expression values can be generated and normalized across samples using the TMM method.

17. Initially, compile a list of quant.sf files for each replicate:
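For intuition about what the TPM column represents, the values in a quant.sf file can be recomputed from its NumReads and EffectiveLength columns. The Python sketch below (toy numbers) shows the two steps: length normalization followed by scaling so the column sums to one million:

```python
def tpm_from_quant(num_reads, eff_lengths):
    """Recompute TPM from NumReads and EffectiveLength: per-transcript
    reads-per-base rates, rescaled to sum to one million."""
    rates = [n / l for n, l in zip(num_reads, eff_lengths)]
    scale = 1e6 / sum(rates)
    return [r * scale for r in rates]

# Toy example: three transcripts whose counts are proportional to
# their effective lengths, so all three get the same TPM.
vals = tpm_from_quant([100.0, 200.0, 700.0], [1000.0, 2000.0, 7000.0])
print(vals)
```

Note that TPM corrects only for transcript length and sequencing depth within a sample; the cross-sample TMM normalization mentioned above is a separate step applied to the assembled matrix.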

Table 4
Troubleshooting Guide