Resource: A multi‐species multi‐timepoint transcriptome database and webpage for the pineal gland and retina

Abstract The website and database https://snengs.nichd.nih.gov provides RNA sequencing data from multi‐species analysis of the pineal glands from zebrafish (Danio rerio), chicken (White Leghorn), rat (Rattus nove gicus), mouse (Mus musculus), rhesus macaque (Macaca mulatta), and human (Homo sapiens); in most cases, retinal data are also included along with results of the analysis of a mixture of RNA from tissues. Studies cover day and night conditions; in addition, a time series over multiple hours, a developmental time series and pharmacological experiments on rats are included. The data have been uniformly re‐processed using the latest methods and assemblies to allow for comparisons between experiments and to reduce processing differences. The website presents search functionality, graphical representations, Excel tables, and track hubs of all data for detailed visualization in the UCSC Genome Browser. As more data are collected from investigators and improved genomes become available in the future, the website will be updated. This database is in the public domain and elements can be reproduced by citing the URL and this report. This effort makes the results of 21st century transcriptome profiling widely available in a user‐friendly format that is expected to broadly influence pineal research.


| INTRODUCTION
The pineal transcriptome has been studied for over 30 years, starting with Northern blot detection of single transcripts encoding proteins involved in melatonin synthesis, including those encoding Tph1 and Asmt (Hiomt). 1-4 Since then, pineal transcriptomics has spanned the development of transcriptomic assays including cDNA-based hybridization technology, qRT-PCR, and RNA-Seq. [5][6][7][8][9][10][11][12][13][14][15] High-throughput sequencing offers many advantages for assaying the transcriptome, but the field of bioinformatics moves quickly with tools, algorithms, and genome assemblies changing from year to year. As a result, data from earlier studies cannot be meaningfully compared with data from later studies that used different methods without completely re-processing all data uniformly using updated methods, assemblies, and annotations. This has been recognized for example in the recount2 project, 16 which has re-processed tens of thousands of human RNA-Seq samples from public repositories uniformly.
Furthermore, for detailed study of particular loci it is critical to visualize expression alongside genomic data from other studies. Genome browsers such as the UCSC Genome Browser 17 allow just this. In particular, this browser supports track hubs that allow for the configuration, coloration, and organization of collections of many tracks using a web interface. 18 This allows researchers to generate highly customized views tailored to their research interest, viewing pineal gland data from this study directly alongside a wealth of publicly available data prepared and made available by the UCSC team.
Cross-species transcriptome reports appear in the literature, focused on two-species comparisons, for example, mouse versus human 19 and zebrafish versus human. 20 Here, we introduce a website that aggregates multiple RNA-Seq studies of the pineal gland spanning years of transcriptomics research on six species across three vertebrate classes, processed uniformly and presented in a user-friendly site allowing inspection of individual genes as well as UCSC Genome Browser track hubs of each experiment. The site can be found at https://snengs.nichd.nih.gov. The results of this effort facilitate the comparative and evolutionary analysis of the pineal gland and retina, reflecting an interest in the evolutionary history that links these tissues as derivatives of a common ancestral photodetector. 21,22 Most of the tens of thousands of transcripts profiled are otherwise absent from the pineal and retinal literature and in many cases have not been well studied in any tissue. Accordingly, the web page opens new avenues of research.
This report alerts investigators to the availability of this resource, which will be of special value where user-friendly compiled pineal and retinal RNA-Seq data are otherwise not available in any format. The graphs and other information extracted from the web page are in the public domain. The web page and its underlying infrastructure is designed to be easily updated as data from new experiments become available or as reanalysis of existing datasets using improved software and updated genomes is completed.

| Animals
Samples were collected to identify differential day/night ratios; and, in the case of the rat, expression was studied as a function of development, denervation, and adrenergic-cyclic AMP stimulation (Table 1). 9,12,[23][24][25] In many cases, retinal tissue was profiled in parallel. Mixed tissue RNA samples were used in conjunction with the pineal gland and retina to estimate the enrichment of a transcript.

| Sequencing
Illumina sequencing was used in all cases. Specific experimental details are available on each experiment's page on the website (see https://snengs.nichd.nih.gov/exper iments) that recapitulates original experimental methods in the respective original manuscript. All data across all species were re-analyzed in the assemblies and annotations described (https:// snengs.nichd.nih.gov/methods). Quality of FASTQ files was analyzed with FastQC v0.11.8 and MultiQC v1.6 and all samples demonstrated high-quality sequencing. Adapters were removed and light quality trimming was performed with cutadapt v1.18 using additional arguments -q-minimum length 25.
These trimmed reads were provided to Salmon v0.12.0 26 for transcript quantification using an index built for transcriptomes as described (https://snengs.nichd.nih.gov/methods), and run using the additional arguments -gcBias -se-qBias --libTypeA. For each gene, the per-transcript values reported by Salmon v0.12.0 were summed to provide a gene-level expression estimate in units of transcripts per million reads (TPM). These are the values reported in the tables and plots of each gene page.

| Genomic visualization
For genomic visualization in the UCSC Genome Browser, trimmed reads were aligned using HISAT2 v2.1.0 to the respective genome indicated below. From these aligned reads, normalized bigWig files were created using the deep-Tools v3.1.3 bamCoverage tool using additional options --minMappingQuality 20 --smoothLength 10 --normalizeUsing BPM --binSize 1 such that multi-mappers were ignored. For stranded libraries, the tool was run twice: once with --filterRNAstrand forward and once with --filterRNAstrand reverse to get separate tracks for each strand. The resulting bigWig files were combined into a UCSC track hub using the trackhub Python package.
UCSC (typically used for extensive visualization capabilities) and Ensembl (typically used for its comprehensive annotations) are not consistent in their chromosome nomenclature. To facilitate linking from gene-level transcription estimates on this website to genomic signal at UCSC, we converted chromosome names from Ensembl to the UCSC equivalents by matching the md5sums of each chromosome; see GitHub repository (https://github.com/NICHD -BSPC/ chrom -name-mappings) for details and code.

| Genome and transcriptome assemblies
For each species, the genomic assembly indicated (https:// snengs.nichd.nih.gov/methods) was used for visualization in the UCSC Genome Browser, while the transcriptome was used to calculate TPM expression estimates to display in plots on individual gene pages.

| Implementation details
The website is written in the Python programming language using the 'Flask' framework. Configuration of the website is driven by a YAML format file that points to Salmon v0.12.0 output along with details like methods descriptions, UCSC track hub colors, bar plot colors, and any other experimentspecific configuration. This greatly streamlines the process of adding new studies and new species. Data were processed using lcdb-wf (https://github.com/lcdb/lcdb-wf), which is itself driven by YAML configuration in a species-agnostic manner, allowing for uniform processing across all studies.

| RESULTS
The Home Page (https://snengs.nichd.nih.gov) introduces the user to the main sections of the database ( Figure 1A). Selecting the Search section displays the Search subpage ( Figure 1B). Entering a gene symbol (ie, Aanat) in the query box opens the Results subpage, which contains a listing of species and experiments ( Figure 2). The Search function will accept alternative symbols; however, when difficulty is encountered obtaining a result, the user is encouraged to refer to gene databases for assistance. This page contains information on the samples, including species, a brief description of the experiment and a Link to the gene page. Clicking on that Link (Figure 2), opens a Gene subpage ( Figure 3) with links to the Ensembl data (gene id) and the UCSC Genome Browser for the gene (Open UCSC track hub for this gene), in addition to presenting experimental results in a bar graph. These results are normalized count data (in TPM). In cases where multiple experiments for a species exist, all experiments are displayed and can be viewed by scrolling vertically.
Selecting Experiments from the Home page displays the experiments with four links (Figure 4). The first is the Search subpage, described above. The second (Download) retrieves the data in an Excel file. The third retrieves the Details ( Figure 5) of the experiment, including sample preparation and data analysis; scrolling horizontally is necessary to open the table. The last is a Link to the UCSC Genome Browser, which documents the location of reads mapping for each gene.
The Methods and Help pages are not presented as figures. The Methods page contains general information on the Bioinformatics methods and identifies the genomic assemblies used; the Help subpage has links to tutorials on the use of the UCSC browser and contact information for further assistance.
As an example of the utility of comparing data across multiple species in a uniform format, we searched for differences in the day/night levels of transcripts among species. As shown in Table 2, the large night/day rhythms in the transcript abundance of several genes in the rat are not seen in the rhesus monkey or to a similar degree in other species (https:// snengs.nichd.nih.gov/search). This emphasizes the importance of post-translational modifications that occur. 27,28 It also is a caution against making generalities based on studies of one species. The data also focus on the similarity of the genomic profiles of pineal glands from the species studied (Table 3). Selective expression of each gene was calculated as the ratio of expression of a specific gene in the pineal gland to that in a mixture of RNA from a group of tissues. As expected, three genes responsible for melatonin synthesis (Tph1, Aanat and Asmt) were selectively expressed in the glands studied. Another group of genes selectively expressed in the pineal gland includes those established as markers of the retina. The high expression of these genes only in the pineal gland and retina is known. 21,22 However, the specific functions of these retina-related genes and other selectively expressed genes in the pineal gland have not received significant attention and deserve further analysis.
An analysis of the conserved highly expressed and tissue specific transcripts in the pineal gland, retina and in both tissues (Table 4) was done by identifying the highly tissue specific transcripts. They were then binned according to their expression ratio (pineal gland: retina). The results reveal a relatively smaller sets of pineal-specific and retina-specific genes, and a larger group of genes expressed in both tissues. Noting that these genes are selectively expressed only in these two tissues and not in others, it is highly likely that these genes represent evolutionarily conserved elements that can be considered to be related to the common origin of both tissues. In some cases, their roles have been identified, but in many cases, a functional role has not been established.

| DISCUSSION
This database will serve as a foundation for future molecular biological research on the pineal gland and retina, making F I G U R E 2 Results subpage. The results of querying a gene symbol generates a listing of the experiments and species in which the gene was found (https://snengs. nichd.nih.gov/search). Depending on the size of the gene family, multiple gene symbols may be displayed. In this case, one has to use the data cautiously. From this page, highlighted links (View) direct the user to the Gene subpage, which lists results of a single species and experiment available the data to scientists with a computer and an internet connection. The uniform processing of raw data makes the comparison of results more meaningful and takes advantage of advances in tools, algorithms, assemblies, and annotations since original publication. Whereas the human and mouse genomes are the most highly annotated, and the chicken and zebrafish less so, the maturity of all annotations allows for in-depth analysis of nearly all genes. A potential problem is that symbols used for identification of a gene in one species may not be used in other species or may be used for different genes. Hence, in cases where identification is questionable, confirmation may require analysis of sequence homology.

| Utility and accuracy of RNA-Seq data
In judging the utility and accuracy of the RNA-Seq data, it should be noted that there is good agreement with data from other methods for the analysis of pineal gland and retina material, including microarray, Northern blot and qRT-PCR as regards day/night differences. [5][6][7][8][9][10][11][12][13] Accordingly, the RNA-Seq data can be viewed as highly useful and reliable.
The method also has advantages over other methods, perhaps the most important is that it sequences all transcripts, including those without a history in any literature. This opens new avenues for study for the pineal gland and retina. One of the most fertile areas is the identification of noncoding RNAs, both micro RNAs and long noncoding RNAs. 9,23 Some of these are known to have daily rhythms in both tissues. Noteworthy is the discovery of a unique micro RNA-183-96-182 cluster in the pineal gland and retina, 9,29 which represents the major component of pineal miRNAs. Accordingly, it can be considered to be an additional marker of the common ancestral photodetector which gave rise to the pineal gland and retina. Although the function of this cluster remains unknown in the pineal gland, it has been reported to play a role in phototransduction and development in the eye. 30,31 Study of pineal miRNAs also led to the discovery of very high levels of pY RNA1-s2 in the retina, relative to other F I G U R E 3 Gene subpage. An example of the gene page (https://snengs.nichd. nih.gov/speci es/Rat/gene/ENSRN OG000 00011 182#Time_Serie s-anchor) that displays information about a specific gene including relevant experimental details, the UCSC track hub (Open UCSC track hub for this gene) and a help page for use of the UCSC Genome Browser. General information about each gene is available by clicking on the gene id, for example, ENSRNOG00000011182. Experimental results are displayed below in a bar graph. In addition, accessing the results of a single experiment will open other experiments dependent on availability | 9 of 13 CHANG et Al. tissues, 25 including the pineal gland. Moreover, it was found that pY RNA1-s2 selectively binds the nuclear matrix protein Matrin 3 and to a lesser degree to heterogeneous nuclear ribonucleoprotein U-like protein. The distribution of pY RNA1-s2 in all retinae and retinal cell lines suggests a role in vision. Both these discoveries could not have been made using methods other than RNA-Seq.
Likewise, the finding of robust daily rhythms in the abundance of several long noncoding RNAs 23 in the pineal gland under neural control, and the discovery of expression of lncSN134 in both the retina and pineal gland was dependent on the use of RNA-Seq. The long noncoding RNAs range in size significantly and like pineal miRNAs, remain largely unstudied and unknown.
Whereas RNA-Seq is a powerful technique, the results must be viewed with healthy skepticism, especially with transcripts that are weakly expressed and when evaluating small night/day differences in transcript abundance. Confirmation by an independent method should be considered. In addition, in the case of weakly expressed transcripts, the mapping of reads on the UCSC browser should be evaluated to confirm that the read assignment pattern is consistent with the intron/ exon features of the transcript.

| Experimental design
A problem that is considered in any study designed to measure day/night differences is the number of time points per day. Often this is limited by factors including the housing of animals and the number of animals per point necessary to obtain sound data. RNA-Seq introduces another factor, the cost of analysis. Accordingly, the design of the studies included in the database (Table 1) is also a reflection of the cost of sequencing and bioinformatics. The studies included sampling that ranged from two to six time points per day. When sampling is done at only two time points, noon, and midnight, the potential for overlooking a dawn/ dusk rhythm exists. Accordingly, it is best not to limit experiments to two time point studies and to use four or more to detect daily rhythms. However, in the case of study of daily rhythms in the pineal gland, a two time point study will capture most large changes. Moreover, this approach is highly instructive, in that it provides valuable data on the levels of tens of thousands of transcripts. Accordingly, one can see merit in such studies.
The number of replicates to use is also another important issue. RNA-Seq data are typically highly reproducible for most transcripts when normalized. This reflects a feature of the method, in that there is redundancy in the detection of a transcript, as a result of fragmentation and amplification. In the final analysis, each calculated transcript level is not simply a single measurement, but reflects multiple detection events, depending on the size of the transcript and abundance. Accordingly, in N = 1 situations, it is possible to obtain an indication of statistical variance of all transcripts, and use this to determine whether, for example, a day/night difference is statistically significant.

| Transcriptomics versus proteomics
Whereas RNA-Seq does provide a highly useful tool for the discovery and characterization of transcripts, it is not a substitute for proteomics. The study of an mRNA and its encoded protein often are in agreement as regards the presence and dynamic changes in both. However, this is clearly not the case in all situations.
An excellent example is Aanat. In the rat, Aanat mRNA, protein, and activity increase at night, reflecting phosphorylation of the protein at two sites. When lights are turned on in the middle of the night, a rapid decrease in enzyme activity occurs, with little change in mRNA levels. The changes in enzyme activity are due to dephosphorylation of the protein, which is rapidly destroyed by proteasomal proteolysis, as reviewed. 32 Another example of mRNA levels and protein levels not exhibiting similar dynamics is found in studies of the rhesus pineal gland. There is little daily change in mRNA encoding Aanat, although the changes in enzyme activity are robust. 33 These observations are evidence that it is necessary to determine whether changes in an mRNA are associated with changes in an encoded protein to determine the relationship. Unfortunately, the science of proteomics has not advanced to the all-inclusive nature of mRNA analysis, in part because it is difficult to uniformly detect the possible post-translational modifications.
Use of the database will allow investigators to initiate efforts to identify transcripts that are highly expressed in the pineal gland relative to the retina and or other tissues, transcripts that are highly expressed in the pineal gland of one species but not another, transcripts that exhibit marked night/ day differences, transcripts that are under neural/adrenergic cyclic AMP control, and transcripts that exhibit changes in expression during development. In doing so, the web page should promote and enhance future studies of pineal cell biology.
F I G U R E 4 Experiments subpage. The Experiments subpage (https://snengs.nichd.nih.gov/exper iments) is accessed from the Home page. It is an index for all experiments, leading to several resources. The highlighted link retrieves the Search subpage, described above. The Data link (Download) returns the data for an entire experiment in an Excel file, which also contains the average expression values and variance. Selecting Details opens the page with experimental details (see Figure 5). The highlighted link to the UCSC Track Hub (Link) gives access to the mapped data on the UCSC Genome Browser | 11 of 13 CHANG et Al.

| Referencing the web page
The data on the web page are in the public domain and the use of the figures and data does not require authorization of the authors. The web page should be referenced by citing this publication. F I G U R E 5 Details subpages. The Details subpages are accessed from the Experiments subpage (see Figure 4; https://snengs.nichd.nih. gov/exper iments) by clicking on Details for a specific experiment. This yields information on sample preparation, RNA preparation, and data processing; and, the location of archived data. The search box is used to interrogate the table with identifiers (eg, SRX3229487) or fragments of identifiers (eg, _04h) in the table. The Samples section in this Figure