Volume 12, Issue 3
Research Article
Open Access

De novo assembly of maritime pine transcriptome: implications for forest breeding and biotechnology

Javier Canales

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

These authors contributed equally to work.Search for more papers by this author
Rocio Bautista

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

These authors contributed equally to work.Search for more papers by this author
Philippe Label

INRA, Université Blaise Pascal, Aubière Cedex, France

These authors contributed equally to work.Search for more papers by this author
Josefa Gómez‐Maldonado

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Search for more papers by this author
Isabelle Lesur

INRA, Cestas, France

Université de Bordeaux, Talence, France

HelixVenture, Mérignac, France

Search for more papers by this author
Noe Fernández‐Pozo

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

Search for more papers by this author
Marina Rueda‐López

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Search for more papers by this author
Dario Guerrero‐Fernández

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

Search for more papers by this author
Vanessa Castro‐Rodríguez

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Search for more papers by this author
Hicham Benzekri

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

Search for more papers by this author
Rafael A. Cañas

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Search for more papers by this author
María‐Angeles Guevara

Departamento de Ecología y Genética Forestal, INIA‐CIFOR, Madrid, Spain

Search for more papers by this author
Andreia Rodrigues

Forest Biotech Lab, IBET/ITQB, Oeiras, Portugal

Search for more papers by this author
Pedro Seoane

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

Search for more papers by this author
Caroline Teyssier

INRA, Unité Amélioration, Génétique et Physiologie Forestières, Orléans Cedex 2, France

Search for more papers by this author
Alexandre Morel

INRA, Unité Amélioration, Génétique et Physiologie Forestières, Orléans Cedex 2, France

Search for more papers by this author
François Ehrenmann

INRA, Cestas, France

Université de Bordeaux, Talence, France

Search for more papers by this author
Grégoire Le Provost

INRA, Cestas, France

Université de Bordeaux, Talence, France

Search for more papers by this author
Céline Lalanne

INRA, Cestas, France

Université de Bordeaux, Talence, France

Search for more papers by this author
Céline Noirot

INRA de Toulouse Midi‐Pyrénées, Auzeville, Castanet Tolosan cedex, France

Search for more papers by this author
Christophe Klopp

INRA de Toulouse Midi‐Pyrénées, Auzeville, Castanet Tolosan cedex, France

Search for more papers by this author
Isabelle Reymond

FCBA, Pôle Biotechnologie et Sylviculture, Cestas, France

Search for more papers by this author
Angel García‐Gutiérrez

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Search for more papers by this author
Jean‐François Trontin

FCBA, Pôle Biotechnologie et Sylviculture, Cestas, France

Search for more papers by this author
Marie‐Anne Lelu‐Walter

INRA, Unité Amélioration, Génétique et Physiologie Forestières, Orléans Cedex 2, France

Search for more papers by this author
Celia Miguel

Forest Biotech Lab, IBET/ITQB, Oeiras, Portugal

Search for more papers by this author
María Teresa Cervera

Departamento de Ecología y Genética Forestal, INIA‐CIFOR, Madrid, Spain

Search for more papers by this author
Francisco R. Cantón

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Search for more papers by this author
Christophe Plomion

INRA, Cestas, France

Université de Bordeaux, Talence, France

Search for more papers by this author
Luc Harvengt

FCBA, Pôle Biotechnologie et Sylviculture, Cestas, France

Search for more papers by this author
Concepción Avila

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

Search for more papers by this author
M. Gonzalo Claros

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

Search for more papers by this author
Francisco M. Cánovas

Corresponding Author

Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Málaga, Málaga, Spain

Plataforma Andaluza de Bioinformática, Edificio de Bioinnovación, Parque Tecnológico de Andalucía, Málaga, Spain

Correspondence (Tel: +34 952131942; fax: +34 952132376; email: canovas@uma.es)Search for more papers by this author
First published: 21 November 2013
Citations: 70

Summary

Maritime pine (P inus pinaster A it.) is a widely distributed conifer species in S outhwestern E urope and one of the most advanced models for conifer research. In the current work, comprehensive characterization of the maritime pine transcriptome was performed using a combination of two different next‐generation sequencing platforms, 454 and I llumina. De novo assembly of the transcriptome provided a catalogue of 26 020 unique transcripts in maritime pine trees and a collection of 9641 full‐length cDNA s. Quality of the transcriptome assembly was validated by RT ‐PCR amplification of selected transcripts for structural and regulatory genes. Transcription factors and enzyme‐encoding transcripts were annotated. Furthermore, the available sequencing data permitted the identification of polymorphisms and the establishment of robust single nucleotide polymorphism (SNP ) and simple‐sequence repeat (SSR ) databases for genotyping applications and integration of translational genomics in maritime pine breeding programmes. All our data are freely available at S ustainpineDB , the P . pinaster expressional database. Results reported here on the maritime pine transcriptome represent a valuable resource for future basic and applied studies on this ecological and economically important pine species.

Introduction

Forests are essential components of the ecosystems covering approximately one‐third of the Earth's land area and playing a fundamental role in the regulation of terrestrial carbon sinks. Trees represent nearly 80% of the plant biomass (Olson et al ., 1983) and 50%–60% of annual net primary production in terrestrial ecosystems (Field et al ., 1998).

Conifers are the most important group of gymnosperms. Having diverged from a common ancestor more than 300 million years ago (Bowe et al ., 2000), gymnosperms and angiosperms have evolved very efficient and distinct physiological adaptations (Leitch and Leitch, 2012). Coniferous forests dominate large ecosystems in the Northern Hemisphere and include a broad variety of woody plant species, some of which are the largest, tallest and longest living organisms on Earth (Farjon, 2010). Conifer trees are also of great economic importance, as they are the primary source for timber and paper production worldwide. Total timber production in the European Union in 2011 was 427 million m3 (UNECE, 2013). Approximately 22% was used to produce energy, while the rest was used to supply industrial demands. A study of United Nations Economic Commission for Europe/Food and Agriculture Organization (UNECE/FAO) point out that the future needs in forest biomass to meet the demands of industrial wood as an energy source will exceed production by 2020. The development of a more productive and sustainable forest plantation is essential to meet the increasing demand of wood worldwide together with minimizing environmental impacts (e.g. decreasing pressure on natural forests).

The extant conifers comprise 615 species classified into eight families within the division Pinophyta (Farjon, 2010). Some of the most important conifer trees such as pines, spruces and firs are included in the family Pinaceae . The genus Pinus comprises the largest number of diversified species (113). Maritime pine (Pinus pinaster Aiton) is a broadly planted species (4.2 million hectares) in the southwestern part of the Mediterranean Basin, especially along the Atlantic coast in France, Spain and Portugal where it is the dominant species on more than 2.3 million hectares (Sanz et al ., 2006). The maritime pine is particularly tolerant to abiotic stresses showing relatively high levels of intraspecific variability (Aranda et al ., 2010). The maritime pine is also one of the most genetically studied conifer species for genomic research in Europe (Mackay et al ., 2012; Neale and Kremer, 2011), and a large number of genomic resources and phenotypic data have been generated in the last few years and are available for the conifer research community (http://www.scbi.uma.es/sustainpine; https://w3.pierroton.inra.fr/PinusPortal). Furthermore, knowledge gained from studying this conifer species will potentially help to better understand gene function and diversity in closely related, economically significant species but also in other noneconomic but environmentally important gymnosperm species (Neale and Kremer, 2011).

Until recently, advances in the genomics of conifers were hampered by the large size of their genomes ranging from 20 to 40 Gb, which is more than 200‐fold the Arabidopsis genome and roughly sevenfold the human genome (Mackay et al ., 2012; Ritland, 2012). As the conifer genome is extremely large, major research efforts were concentrated on transcriptomic analysis. The first large‐scale EST (Expressed Sequence Tag) projects based on Sanger sequencing revealed that transcriptomes of pines and spruces are highly diverse and complex (Allona et al ., 1998; Cairney et al ., 2006; Li et al ., 2009; Pavy et al ., 2005; Ralph et al ., 2008).

The emergence of the next‐generation sequencing (NGS) has profoundly transformed the landscape of genome analysis. Research efforts in conifers have provided growing catalogues of ESTs for several species of economic and ecological importance, including Picea sitchensis (Ralph et al ., 2008), Picea glauca (Rigault et al ., 2011), P. pinaster (Fernández‐Pozo et al ., 2011) and Picea abies (Chen et al ., 2012). Recent landmark studies in conifer genomics reported the draft assembly of the first genome for a gymnosperm species. Specifically, massive parallel sequencing data were used for assembling the 20 Gb genomes of white (Birol et al ., 2013) and Norway (Nystedt et al ., 2013) spruces.

A full catalogue of genes and protein coding sequences in maritime pine is necessary to improve our understanding of biological functions and evolutionary relationships with other conifer trees and angiosperms. Comparative genomics provide a powerful mean to study gene structure and the evolution of gene function and regulation. Analysis of key genes and pathways allows scientists to better understand how complex biological processes are regulated and evolve (Koonin et al ., 2004; Soltis and Soltis, 2003).

In this study, we present results from deep sequencing of the P. pinaster transcriptome. The data were used to create a reference transcriptome in this pine species, containing an extensive set of genes expressed in a variety of actively growing tissues/organs (http://www.scbi.uma.es/sustainpinedb/sessions/new). Comparison to available public databases indicates that the transcriptome of P. pinaster is similar in size to that of previously characterized P. glauca (Rigault et al ., 2011) and P. abies (Nystedt et al ., 2013).

Results

Sequencing strategy and de novo assembly

Massive scale cDNA sequencing was performed to define a reference transcriptome for maritime pine. Long reads obtained by the 454 sequencing platform and short reads obtained by the Illumina platform were used for assembly of 210 513 putative unigenes from 18 cDNA libraries constructed from a variety of tree sources, tissues and experimental conditions (Table 1). Figure 1 shows a schematic representation of the workflow followed for preprocessing and assembling, as well as the combination of software algorithms utilized to obtain a final catalogue of unigenes. A total of 6 381 011 long reads provided 4 098 257 useful long reads (64.2%, mean length: 282 bp), and 591 174 069 short reads provided 292 596 546 useful short reads (49.5%, mean length: 86 bp). As shown in Figure 2, the recovery of useful 454 long reads was variable among the different libraries (37%–92%) with high recoveries (88%–92%) in EuroPineDB and UAGPF. Overall, recovery was around 70% of the initial reads. Regarding short reads, only 32% of single‐end short reads were kept after data cleaning. The primary reason of out‐filtering these single‐end sequences was redundancy.

Table 1. Description of samples used for DNA sequencing
Gene library Sequencing platform Sampled plant material Experimental conditions SRA code
EuroPineDB Sanger/454 Bud, xylem, phloem, stem, needles, roots, stem, embryos, callus, cone, male and female strobili ESTs and SSH libraries from different tissues and conditions as described by Fernández‐Pozo et al., 2011 SRS479769
Biogeco1 454 Xylem, bud and needle ESTs from differentiating xylem, swelling bud and young needles SRX032960, SRX032961, SRX032962, SRX032963
Biogeco2 454 Bud EST from quiescent buds harvested on 2‐year‐old maritime pine (low growing family) in well‐watered or drought‐stress conditions SRX031546
Biogeco3 454 Bud EST from quiescent buds harvested on 2‐year‐old maritime pine (fast growing family) in well‐watered or drought‐stress conditions SRX031589
UAGPF1 454 Embryome ESTs from developing, immature embryos (1‐week maturation) SRX022618
INIA_PPIN 454 Bud ESTs from buds PRJNA221139
U_root 454 Root ESTs from roots (1‐month‐old seedlings) SRS480239
U_tip 454 Root tips ESTs from root tips (1‐month‐old seedlings) SRS480265
U_H 454 Hypocotyl ESTS from hypocotyl (1‐month‐old seedlings) SRS480236
U_N 454 Needle ESTs from needles (1‐month‐old seedlings) SRS480237
U_Cot_Os 454 Cotyledon ESTs from cotyledons grown under dark conditions SRS479771
U_H_Os 454 Hypocotyl ESTs from hypocotyl grown under dark conditions SRS480236
U_R_6 454 Roots ESTs from roots (6‐month‐old seedlings) SRS480238
U_S_8 454 Stem ESTs from stem (8‐month‐old seedlings) SRS480261
UAGPF2 Illumina Somatic embryo Paired‐end ESTs from developing, immature embryos (1 week maturation) SRR609713
BIOGECO4 Illumina Bud ESTs from young and aged buds SRX031587
BIOGECO5 Illumina Root ESTs from drought‐stressed and control roots in hydropony SRX031592, SRX031590
BIOGECO6 Illumina Bud ESTs from young and aged buds SRX031594
IBET Illumina Zygotic embryo Paired‐end ESTs from embryos SRS481044
image
Schematic representation of 454 and I llumina sequencing, trimming and assembling into unigenes of P inus pinaster transcriptome short read and long read.
image
Flow chart showing preprocessing into useful reads, assembly into contigs and overlap‐based reconciliation into final unigenes of sequenced data from 5 (591 174 069 short reads, I llumina) or 14 (6 381 011 long reads, 454) cDNA libraries in maritime pine.

The assembly strategy used a combination of two completely different algorithms with the hypothesis that it may provide superior results than each algorithm by separate. MIRA3 and Euler‐SR were selected for long‐read assembly because the former is based on overlap‐layout‐consensus algorithm, while Euler‐SR is based on de Bruijn graphs resolved by means of an Eulerian path. Furthermore, each algorithm produced the better results with simulated data (H Benzekri and MG Claros, unpublished results). Regarding short reads, we have used de Bruijn‐based algorithms. ABySS was selected for two main reasons: (i) it can efficiently assemble using different k‐mers (Lin et al ., 2011) and (ii) it is able to assemble vertebrate‐sized genomes and transcriptomes (Li et al ., 2010). Finally, contigs separately obtained from short read and long read (355 483 in total) were reconciled using CAP3, resulting in 210 513 contigs to describe a P. pinaster transcriptome. Notably, 1241 unigenes were longer than 3000 nucleotides. Unigene length ranged from 40 to 7876 bp, with an average of 495 bp and a median of 361 bp.

The complete set of unigenes was compared using blast with the previous P. pinaster transcriptome available at EuroPineDB (Fernández‐Pozo et al ., 2011), and 128 294 unigenes (61%) were not described in this earlier database, while 935 EuroPineDB unigenes were not present in the current unigene collection. When the P. pinaster transcriptome was compared with the P. glauca and P. abies draft genomes, 87.5% and 99.3% of homology was found, respectively, confirming that most assembled unigenes were pine transcripts.

Annotation of unigenes

Unigene annotation was achieved by combining the results of several annotation processes. Each annotation is associated with an E ‐value to enable the empirical assessment of annotation quality. A preliminary analysis of the collection of unigenes using Full‐LengtherNext (FLN) revealed that, from the 181 100 unigenes annotated (46.6%), 26 020 were nonredundant transcripts based on orthologue ID. It is also remarkable that 18 667 full‐length (FL) unigenes were reconstructed with a mean length of 1495 nucleotides, representing 19% of the total annotated unigenes (Table 2). Of these, 9641 FL unigenes were different, unique genes (9.8% of the total annotated unigenes, and 37.0% of unique unigenes). The frequency distribution of FL unigenes (Figure 3) indicated a high proportion of unigenes ranging from 500 to 1500 nucleotides with the longest transcript being 7876 nucleotides.

Table 2. Summary of final data for the P. pinaster transcriptome
Absolute number %
Unigenes 210 513 100.0
Artefacts 2010 0.95
Unigenes after resolving artefacts 209 928 99.7
Unigenes > 200 nt 181 100 86.0
Unigenes > 500 nt 52 550 25.0
Unigenes > 3000 nt 1241 0.59
Unigenes with orthologueaa Percentages for this classification are calculated using this file as 100.0%.
98 175 46.6
Different orthologue ID 26  020 26.5
Complete transcripts (full length) 18 667 19.0
Different full‐length transcripts 9641 9.8
Putative ncRNAs 176 0.08
Unigenes without orthologueaa Percentages for this classification are calculated using this file as 100.0%.
111 577 53.0
Coding 29 736 26.6
Putative coding 23 545 21.1
Nonredundant coding 9799 8.8
  • Numbers in bold can be considered the representative amount of unigenes of the category.
  • a Percentages for this classification are calculated using this file as 100.0%.
image
Frequency distribution of total unigenes and full‐length unigenes.

Preliminary analysis using FLN also revealed that 111 577 unigenes (53.0%) did not possess significant homology to any other plant gene. This number includes new conifer genes as well as artefactual assemblies. To distinguish between both possibilities, FLN includes a TestCode analysis (Fickett, 1982) and a comparison with the noncoding RNA database (http://www.mirbase.org). As a result, at least 9799 nonredundant coding unigenes can worth a consideration of putative new conifer gene. In fact, 4608 unigenes had a homologue EST in the Pine Gene Index 9.0 database (http://compbio.dfci.harvard.edu/cgi-bin/tgi/gimain.pl?gudb=pine). However, only 176 unigenes from this study were determined to be candidate noncoding RNAs. Therefore, the minimal P. pinaster transcriptome can be calculated as 26 020 unigenes having unique ID. In addition, it could also be considered the 9799 unigenes without homology but having coding characteristics, plus the 176 noncoding RNAs, that is 35 995 unigenes.

Because this transcriptome was deemed satisfactory, genes annotated as described in Experimental procedures (GO term, a definition or a KEGG code) were subjected to statistical analyses. Of total unigenes, 62.2% (130 845) were annotated, which indicated that the level of annotation was similar to previously published results (Fernández‐Pozo et al ., 2011). Furthermore, the distribution of GO terms at level 2 of biological process and at level 3 of molecular function (Figure S1) shows that the putative transcriptome covers most important cell functions. A total of 58 296 unigenes possessed unknown sequences that could not be found in existing databases. The annotated transcriptome can be browsed, downloaded and queried at http://www.scbi.uma.es/sustainpinedb/.

Validation of full‐length cDNA sequences in the transcriptome database

Full‐length cDNA (FLcDNAs) are essential for gene annotation, unambiguous determination of intron–exon boundaries and gene functional analysis. To examine the quality of the FLcDNA collection established in the maritime pine transcriptome database, the validation of a number of selected genes coding for proteins of a variety of sizes was advisable. Sequences encoding structural and regulatory genes were selected, appropriate primers based on available sequences were designed, and the corresponding FLcDNAs were amplified by RT‐PCR from RNA samples extracted from a variety of maritime pine tissues. Table 3 shows a summary of this work. All selected genes were successfully amplified from cDNA samples using the sequence information available in the database. PCR products fit the theoretical, predicted size in most studied unigenes (Canales et al ., 2012; Cánovas et al ., 2007; Rueda‐López et al ., 2013; Villalobos et al ., 2012).

Table 3. Summary of FL cDNA used for sequence assembly validation by PCR amplification
Gene name Theoretical size of ORF from assembly (bp) Experimental size of ORF (bp) Accession numberaa Accession number of unigene in Sustainpine and GenBank.
Arginase 1026 1026 sp_v3.0_unigene23824
Xyloglucan endotransglycosylase 860 860 sp_v3.0_unigene29476
Phenylalanine ammonia‐lyase 2265 2265

sp_v3.0_unigene17298

AY641535

PII‐protein 714 732

sp_v3.0_unigene23578

AJ489604

Asparagine synthetase 1 1782 1782 sp_v3.0_unigene14147 HQ625490
Asparagine synthetase 2 1770 1773

sp_v3.0_unigene18231

HM222940

Glutamate decarboxylase 1530 1530

sp_v3.0_unigene11755

DQ125946

Glutamate dehydrogenase 1236 1236

sp_v3.0_unigene15901

HM222941

Sucrose synthase 1914 1914 sp_v3.0_unigene34880
Ammonium transporter 1.1 1584 1584 KC807907
Ammonium transporter 1.2 1464 1464 KC807908
Ammonium transporter 1.3 1539 1539 KC807909
Ammonium transporter 2.1 1461 1461 KC807910
Ammonium transporter 2.3 1446 1446 KC807911
Glutamine dumper 1 345 345 sp_v3.0_unigene97635
Glutamine dumper 2 384 384 sp_v3.0_unigene20421
bb MYB family of TF.
MYB1
1023 1023

sp_v3.0_unigene29297

EU482890

bb MYB family of TF.
MYB4
946 946 sp_v3.0_unigene127348
bb MYB family of TF.
MYB 8
1605 1605 FN868598
cc Dof family of TF.
Dof2
1065 1065 KC688677
cc Dof family of TF.
Dof3
1047 1047 KC688678
cc Dof family of TF.
Dof4
1053 1053 KC688679
cc Dof family of TF.
Dof6
907 907 KC688680
cc Dof family of TF.
Dof7
1404 1404 KC688681
cc Dof family of TF.
Dof8
822 822 KC688682
cc Dof family of TF.
Dof9
897 897 KC688683
dd NAC family of TF.
NAC1
1559 1545 sp_v3.0_unigene7635
dd NAC family of TF.
NAC2
1476 1398 sp_v3.0_unigene14173
dd NAC family of TF.
NAC3
1457 1388 sp_v3.0_unigene18613
dd NAC family of TF.
NAC4
1630 1278 sp_v3.0_unigene20354
dd NAC family of TF.
NAC5
1824 1180 sp_v3.0_unigene1398
Methionine synthase 2301 2301

sp_v3.0_unigene34452

HE566045

S‐adenosylmethionine synthase 1176 1176

sp_v3.0_unigene3027

HE574556

S‐adenosylhomocysteine hydrolase 1458 1458

sp_v3.0_unigene19530

HE574555

Methylenetetrahydrofolate reductase 1785 1785

sp_v3.0_unigene15993

HE574560

Caffeate O‐methyltransferase 1161 1095

sp_v3.0_unigene17184

HE574557

Hydroxycinnamoyl‐CoA: shikimate hydroxycinnamoyl transferase 1302 1302

sp_v3.0_unigene8683

HE574565

Glycine decarboxilase complex

H‐protein a

504 504

sp_v3.0_unigene30488

HE574553

Glycine decarboxilase complex

H‐protein b

516 516

sp_v3.0_unigene126824

HE574563

Mitochondrial serine hydroxymethyltransferase 1572 1572

sp_v3.0_unigene439

HE574554

Cytosolic serine hydroxymethyltransferase 1413 1413

sp_v3.0_unigene17057

HE574564

D‐3‐Phosphoglycerate dehydrogenase 1947 1947

sp_v3.0_unigene543

HE574561

3‐Phosphoserine aminotransferase 1302 1302

sp_v3.0_unigene37851

HE574562

Pinoresinol‐lariciresinol reductase 939 939

sp_v3.0_unigene17681

HE574558

Phenylcoumaran benzylic ether reductase 927 927

sp_v3.0_unigene31659

HE574559

Phenylpropenal double‐bond reductase 1056 1056

sp_v3.0_unigene22698

HE575885

  • a Accession number of unigene in Sustainpine and GenBank.
  • b MYB family of TF.
  • c Dof family of TF.
  • d NAC family of TF.

Maritime pine regulatory genes

Transcription factors (TF) were specifically searched for in the Sustainpine database. The unique transcripts containing domains of plant TF in maritime pine using Pfam motifs were 877 distributed in 30 families (Table S1). Comparative analysis was performed with other woody plants including white spruce (P. glauca ), poplar (Populus trichocarpa ) and grapevine (Vitis vinifera ), the herbaceous models Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa ), and the model moss Physcomitrella patens . The total number of TF in maritime pine was similar to the number previously reported for white spruce (Rigault et al ., 2011), suggesting that the information quoted here could be close to the full representation of the maritime pine transcriptome. In addition, the TF gene number in Physcomitrella (802) and conifer species (877–892) is smaller than in angiosperms (Table S1), either woody (1337–2499) or herbaceous (1407–1828).

Overall, the relative representation of each TF domain among total TF domains was largely conserved in maritime pine and spruces (Figure 4 and Table S1) with only a few exceptions. For example, the SRF family is under‐represented with only four members in maritime pine relative to the 45–71 reported in spruces or the 42–109 members reported in angiosperms. The AP2 family with 106 members is over‐represented with regard to spruces (24–63), but contains a similar number of TF than in angiosperms (93–209). In contrast, the histone‐like TF (CBF/NF‐Y) family is under‐represented in maritime pine (26) and Norway spruce (20) when compared to white spruce (52), but exhibits similar numbers to those found in angiosperm species.

image
Distribution of unique transcripts corresponding to TF gene families in P inus pinaster and comparison to other plant transcriptomes. The number of different encoded transcripts with the conserved DNA ‐binding domain of each family is represented. The distribution of TF gene families in P . pinaster , P icea glauca, P icea abies, P opulus trichocarpa and A rabidopsis thaliana is compared.

When we compared this general distribution in maritime pine to six plant models, we observed conservation of TF categories that can be classified establishing a hierarchical clustering (Figure S2) where a closer conservation in members of each family is observed for Physcomitrella , spruce and pine with a second branch for Arabidopsis , rice and grapevine and a separate branch for poplar with a greater number of TF. Likely, the recent genome duplication increased the number of TF in each family of poplar (Tuskan et al ., 2006). Families not found in maritime pine and previously reported in white spruce are as follows: RNA polymerase II TF SIII, RNA polymerase III, TF IIIC subunit and alcohol dehydrogenase TF Myb/SANT‐like (Rigault et al ., 2011).

Enzyme‐encoding genes

The representation of gene family members involved in various metabolic pathways was determined, and data retrieved from the SustainpineDB were compared with those found in Norway spruce (P. abies) , P. trichocarpa and A. thaliana (Figure 5). A first set of genes studied were those participating in nitrogen acquisition and assimilation. The number of genes encoding nitrate reductase and nitrite reductase was similar in the three species (1–2). For genes involved in inorganic nitrogen transport, some important differences were observed among species. For instance, Arabidopsis contains almost twice genes encoding nitrate transporters that maritime pine, Norway spruce and poplar. For ammonium transporters, only the poplar genome presents an expanded gene family of 14 members.

image
Comparison of gene families for relevant enzymes in P inus pinaster, P icea abies, P opulus trichocarpa and A rabidopsis thaliana . The following databases were used in addition to Sustainpine DB : P . abies v1.0, P . trichocarpa v3.0, A . thaliana TAIR 10.

Fewer transcripts for genes encoding enzymes of ammonium assimilation were found in conifer species. Only 2–3 transcripts for glutamine synthetase (GS) were identified in gymnosperms (P. pinaster and P. abies ), in accordance with previous results (Cánovas et al ., 2007). In contrast, genomes of angiosperm species are endowed with GS families with a higher number of members, eight in Populus and six in Arabidopsis . A single expressed gene was found for ferredoxin‐glutamate synthase (Fd‐GOGAT) and NADH‐GOGAT in maritime pine and Norway spruce. In contrast, two genes encode Fd‐GOGAT and NADH‐GOGAT in poplar.

A second group of genes where those encoding enzymes involved in synthesis of methionine and S‐adenosylmethionine (SAM), the activated form of methionine, which participate in a number of essential metabolic pathways in plants. In particular, we focused on three genes involved in the synthesis and recycling of SAM, a methyl donor in multiple cellular transmethylation reactions (Figure 5). The number of genes encoding cobalamin‐independent methionine synthase and SAM synthase was identical in conifers and Arabidopsis but slightly higher in P. trichocarpa , whereas an additional S‐adenosyl‐L‐homocysteine hydrolase gene was found in pine and Norway spruce in regard to angiosperms.

Cellulose, hemicellulose and lignin are major components of the secondary cell wall, a well‐developed structure in woody perennials compared with Arabidopsis, particularly in trees such as maritime pine and poplar. A similar number of genes encoding the cellulose synthase catalytic subunit (CesA) were found in conifers compared with Arabidopsis , whereas the number increased to 18 in poplar. The number of genes encoding sucrose synthase and UDP‐glucose 6‐dehydrogenase in the maritime pine transcriptome was slightly lower compared with the other three plant species.

We also compared the number of genes encoding three key enzymes of the phenylpropanoid pathway that direct the carbon flow from the shikimate pathway to monolignol biosynthesis: phenylalanine ammonia‐lyase, cinnamoyl‐CoA reductase and cinnamyl alcohol dehydrogenase (Figure 5). The number of genes encoding the three enzymes was similar in maritime pine, Norway spruce and Arabidopsis and also in poplar in the case of phenylalanine ammonia‐lyase. In contrast, the members of the cinnamoyl‐CoA reductase and cinnamyl alcohol dehydrogenase families are considerably higher in poplar (40 and 21, respectively).

Single nucleotide polymorphism and simple‐sequence repeat identification

Using CLCBio with stringent criteria both in terms of minimum allele frequency (MAF = 20%) and contig depth (minimum of 10 reads), we identified 55 607 in silico Single nucleotide polymorphism (SNPs) in 13 892 unigenes (Table S2), corresponding to an average of 1 SNPs per 2 kb surveyed in the assembled transcriptome. Setting the contig depth parameter to 4 or 10, yielded 542 830 or 20 208 SNPs, respectively, with GigaBayes. Suprisingly, only 10 829 and 2106 SNPs were simultaneously found between CLCBio and GigaBayes with relaxed (minimum contig depth, CRL = 4) and more stringent (CRL=10) detection criteria, respectively; corresponding to 2% and 10.4% of common SNPs with CLCBio. Despite a fivefold increase in commonly detected SNPs using the more stringent detection criteria in GigaBayes, both software detected highly divergent sets of SNPs. Given this inconsistency, we used a set of 5138 SNPs validated in natural populations and mapping pedigrees of maritime pine (Lepoittevin et al ., 2010; Chancerel et al ., 2011; Chancerel et al., 2013) to evaluate the ability of both software to identify true SNPs. Prior to this analysis, we checked that the flanking sequences associated with the true SNPs were found in the present assembly using BLASTn analysis. Using CLCBio, 36.26% (1863 SNPs) of true SNPs were detected, while this rate dropped to 9.17% (471 SNPs) and 1.36% (70 SNPs) using relaxed and stringent GigaBayes criteria. By normalizing Gygabayses detection rates to the 55 607 putative SNPs identified by CLCBio, the relaxed and stringent GigaBayes detection procedures led to a success rate of 0.94% (48 true SNPs detected) and 3.75% (193 SNPs), respectively. In conclusion, CLCBio was found to perform nearly 10 times better than the most stringent GigaBayes SNP detection procedure with our data set.

A total of 5974 putative simple‐sequence repeat (SSRs) were found, with trinucleotide repeats (3309) being the most common, and dinucleotide repeats (479) the less abundant. This is in agreement to previously published P. pinaster SSR abundance (Fernández‐Pozo et al ., 2011).

Discussion

Maritime pine transcriptome assembly

Long‐read sequence data sets are required for transcriptome assembly in nonmodel species for which a reference genome is not available. In conifers, 454 sequencing has been recently used to generate well‐defined transcriptomes in several species of ecological and economic interest, that is, Pinus contorta (Parchman et al ., 2010), P. glauca (Rigault et al ., 2011), P. pinaster (Fernández‐Pozo et al ., 2011), Pinus taeda and 11 other conifers (Lorenz et al ., 2012). In the present work, we used a combination of 454 and Illumina sequencing to define a minimal reference transcriptome for maritime pine (P. pinaster ). A similar approach was recently used to characterize, for example, the globe artichoke transcriptome (Scaglione et al ., 2012). The nonredundant transcriptome resulting from the assembly contains 26 020 unique transcripts with orthologue ID in public databases, a number very close to the 27 720 unique cDNA clusters reported for the P. glauca transcriptome (Rigault et al ., 2011) and higher than the 17 000 unique coding genes obtained in the assembly of P. contorta transcriptome (Parchman et al ., 2010). The number of unique transcripts in maritime pine is also close to the number of genes (28 354) resulting from the draft assembly of the 20‐gigabase genome of P. abies (Nystedt et al ., 2013). Considering all the available data, an elevated coverage of the maritime pine transcriptome is estimated.

FLcDNAs catalogues as genomic resources

The availability of large collections of FLcDNAs in several conifers, such as Sitka (Ralph et al ., 2008) and white spruces (Rigault et al ., 2011), as well as Cryptomeria (Futamura et al ., 2008), has greatly facilitated the assembly and annotation of FLcDNAs in maritime pine. FLcDNAs are crucial for accurate annotation, comparative analysis with other conifer species and also for functional analysis of relevant genes associated to maritime pine growth, development and response to environmental changes. Furthermore, this genomic resource will greatly facilitate protein identification as well as protein–protein interaction studies through proteomics approaches (Cánovas et al ., 2004). For all these reasons, it was of paramount importance to validate the assembly of the FLcDNA collection (9641 different transcripts).

Over the last few years, refined protocols have been developed for Agrobacterium ‐mediated genetic transformation of maritime pine embryogenic tissue, cryopreservation of transgenic lines and efficient transgenic plant regeneration through somatic embryogenesis (reviewed in Trontin et al ., 2013). These new developments and the availability of a large collection of FLcDNAs have paved the way for the application of reverse genetics towards functional dissection of traits of economic and ecological interest in maritime pine trees. Thus, the availability of both standardized transformation methods and FLcDNA catalogues is expected to significantly increase throughput in candidate gene analysis together with facilitating comparison across laboratories interested in maritime pine genomics. The functional analysis of key genes is crucial for future applications in tree improvement, new variety design and sustainable forest management (e.g. development of marker‐assisted selection).

Maritime pine gene families and genome size

It has been suggested that the increased size and complexity of conifer genomes relative to angiosperms may be explained by the existence of large gene families (Kinlaw and Neale, 1997). However, this assumption is not fully supported by available data as most TF (Figure 4 and Table S1) or other gene families (Figure 5) present in maritime pine (this work) or spruce genomes (Birol et al ., 2013; Nystedt et al ., 2013; Rigault et al ., 2011) were of similar or even lower size compared with angiosperm species (P. trichocarpa , A. thaliana and V. vinifera ). Meanwhile, the existence of large gene families in conifers coding for enzymes of secondary metabolism has been reported (Martin et al ., 2004), there are other families in primary metabolism that contain similar, or even shorter, number of functional members in conifers than in angiosperms (Cánovas et al ., 2007). High genome size and complexity in conifers may be more readily explained by divergence and accumulation of retrotransposons and pseudogenes (Morse et al ., 2009; Nystedt et al ., 2013). Retrotransposons and pseudogenes can be expressed and contribute in some extent to the collection of unigenes without orthologue in the maritime pine transcriptome. Accumulation of pseudogenes may have functional advantages in the regulation of gene expression if they are expressed. Recently, Poliseno et al . (2010) reported that expressed pseudogenes compete with authentic target transcripts for miRNA binding and, as such, modulate expression levels of their cognate genes.

Transcription factors

The identification of transcription factors (TF) and subsequent analysis of the composition and organization of TF families are necessary steps to understand the regulatory networks associated with key processes in conifer trees. The number of TF in P. pinaster appears to be similar to that of P. glauca (Rigault et al ., 2011), but considerably lower compared with those found in the genomes of several angiosperms. This fact is confirmed by studies carried out in specific families. For example, the Dof family has only ten members in maritime and loblolly pines (Rueda‐López et al ., 2013); this is twofold to eightfold lower than gene number in angiosperms, including 36 in A. thaliana , 30 in rice (Lijavetzky et al ., 2003), 22 in grape or 46 in poplar (Pérez‐Rodríguez et al ., 2010). Moreover, it has also been argued that some subfamilies such as Knox I involved in meristem cell identity have pursued a distinct path of evolution in conifers. Three of the four groups delineated by phylogenetic analyses in angiosperms have no conifer sister groups (Guillet‐Claude et al ., 2004), suggesting that conifers evolved in a nonlinear fashion compared with angiosperms. Recently, it has been proposed that expansion of the VASCULAR NAC DOMAIN (VND) gene family might be related to xylem vessel complexity in angiosperms (Nystedt et al ., 2013). Other example of this divergent pattern of evolution for TF is found in the HD‐Zip III gene family involved in regulation of cambium, and primary and secondary vascular differentiation whose structure also has diverged considerably between angiosperms and gymnosperms (Côté et al ., 2010). Furthermore, distinct evolutionary trends in angiosperms and gymnosperms are evident by differential family gene expansion of subgroup 4 R2R3‐Myb with more recent duplications in P. glauca (Bedon et al ., 2010). In this particular case, the higher number of members in conifers appears to be related with isoprenoid‐ and flavonoid‐oriented stress responses.

Most of the existing knowledge of plant TF genes was obtained from studies in Arabidopsis . While Arabidopsis is a useful model for many developmental and environmentally regulated processes in higher plants, it lacks certain traits that are of immense value to agriculture or forestry. To better understand the evolutionary relationship and dependence of transcriptional regulation and morphological complexity in Viridiplantae is important to analyse how particular families of transcription factors did expand in correlation with the general increase in morphological complexity. In this sense, the availability of full‐length sequences from gymnosperms come to close a huge gap and will help to understand how the land plants managed, in terms of transcriptional regulation, to become multicellular.

Gene families of metabolic pathways

A major contrast was observed for genes involved in nitrogen metabolism likely related with adaptations to nitrogen availability and use. For example, poplar, Norway spruce and pine possess almost half the number (7) of nitrate transporters encoded in the Arabidopsis genome (12). This difference may be due to differences among species on soil nitrate uptake and transport during active growth, compared with trees in which active growth is primarily supported by internal nitrogen storage and mobilization (Cantón et al ., 2005). In contrast, the family of ammonium transporters is expanded in P. trichocarpa (14) compared with pine, spruce and Arabidopsis (6). It has been suggested that this specific feature may be related to the peculiar physiology of a perennial and mycorrhizal tree (Couturier et al ., 2007). However, the observation of similar number of ammonium transporters in conifer trees and Arabidopsis does not support the above assumption anymore. Another major difference is the reduced number of the GS gene family in pine (2) and spruce (3) compared with Populus (8) and Arabidopsis (6). Deep transcriptome analysis in the current study supports previous reports describing only two genes for cytosolic GS pine and the lack of a plastidic isoform (Cánovas et al ., 2007).

In trees, consumption of methyl units during lignification implies the existence of an important carbon sink, and the S‐adenosylmethionine availability may affect wood quality through alterations in lignin content and composition (Villalobos et al ., 2012). Nevertheless, no major differences were found in the number of genes in poplar and conifers with respect to Arabidopsis .

The number of genes related to carbon partitioning to cellulose and hemicellulose seems to be similar in the four species, with the exception of CesA that is expanded in P. trichocarpa . The CesA gene family in poplar includes 18 members that form nine phylogenetic groups, eight of which contain a pair of CesA genes with nearly identical sequence as a likely result of recent genome duplication. Some of them showed redundant expression xylem‐specific, and it was suggested that more CesA genes could be required for the massive synthesis of cellulose in trees (Suzuki et al ., 2006). However, the number of genes encoding CesA subunits in P. pinaster, P. abies and A. thaliana seems to be similar.

Monolignol biosynthesis is a complex branch of the general phenylpropanoid pathway. The number of genes encoding phenylalanine ammonia‐lyase was similar in all species (4–6). However, an increased number of genes encoding cinnamoyl‐CoA reductase were found in P. trichocarpa , P. pinaster and P. abies compared with Arabidopsis . Previous work in angiosperms showed that cinnamoyl‐CoA reductase family represents the largest lignin biosynthesis gene family in several species (Xu et al ., 2009). In conifers, transcriptomic data reported here suggest that the cinnamoyl‐CoA reductase family is also expanded. In contrast, it has been previously reported that conifers may have a single copy of the cinnamyl alcohol dehydrogenase gene (Mackay et al ., 1997). However, different transcripts were found in maritime pine and Norway spruce.

Overall, gene families involved in metabolic pathways were of similar size in P. pinaster and P. abies . The number of genes in P. pinaster was either similar or lower to those existing in Populus or Arabidopsis . These findings should be confirmed when a reference genome for pine is available.

SNP identification

Based on previous unigene sets constructed with significantly fewer sequences obtained essentially from one ecotype (that of the Aquitaine provenance), highly multiplexed SNP arrays were constructed in maritime pine for linkage applications (Chancerel et al ., 2011, 2013) and association mapping (Lepoittevin et al ., 2012). The present study provides the most comprehensive SNP catalogue for maritime pine bringing together sequences from a diverse range of ecotypes (France, Spain, Portugal) and therefore allowing the use of high throughput genotyping technologies for applications in: i) genetic diversity and population structure analysis, with less ascertainment bias than was observed from the analysis of previous SNP data sets (Chancerel et al., 2011) thank to a better coverage of the species diversity; and, ii) genomic selection, because about 14k different gene‐loci are now covered by at least one SNP. Our study reveals a strong influence of the SNP calling algorithm on the number of detected SNPs. Such variability in the number of called SNPs between SNP detection software has been already reported in another conifer species (Muller et al ., 2012) and indicates that results should be interpreted with caution, especially if based on a single detection approach. In our case, using a set of already validated SNPs, we show that CLCBio was able to detect true SNPs at a rate that was ten times higher than the most stringent criteria implemented in GigaBayes. Trees have long generation times, and breeding (especially for pine) is a slow process (Mullin et al ., 2011). Genomic selection offers the possibility to increase genetic gain per time unit in these long‐lived organisms as illustrated recently for P. taeda (Resende et al ., 2012) and accelerate their domestication (Harfouche et al ., 2012). Genomic selection will also permit to control more precisely that sufficient level of diversity is maintained in future varieties to allow forest trees to cope with major biotic and abiotic constraints in rapidly evolving environment.

Concluding remarks

In this work, comprehensive characterization of the P. pinaster transcriptome was performed using a combination of two different next‐generation sequencing platforms, 454 and Illumina. The de novo assembly of the maritime pine transcriptome provides a large catalogue of expressed genes (26 020 unigenes with orthologue, 9799 unigenes with coding characteristics but without orthologue, 176 ncRNA) and a relevant collection of FLcDNAs (9641). In addition, the sequencing data permitted the identification and establishment of robust SSR‐ and SNP‐based databases for genotyping applications and translational integration in maritime pine breeding programmes. These genomic resources will facilitate genome sequencing, functional genomics and applied studies in maritime pine trees.

Experimental procedures

Tree source, tissues and experimental conditions

Pinus pinaster samples of developing xylem were collected from different genotypes of a Corsican clonal population planted in 1986 at the forestry station of INRA‐Pierroton (Aquitaine, France), as described (Villalobos et al ., 2012). Miscellanea of maritime pine tissues (cones, male and female strobili, buds, xylem and phloem) were collected from adult trees located at Sierra Bermeja (Málaga). Roots, stems and needles of 2‐week‐old pine seedlings and zygotic embryos were also sampled as well as maritime pine embryonal masses. Somatic embryos were produced from embryonal masses line (AAY06006) of P. pinaster originating from a Landes × Morocco polycross. Embryonal masses were induced from immature zygotic embryo (Park et al ., 2006). Embryonal masses and somatic embryos were cultured according to Lelu‐Walter et al . (2006). Zygotic embryos at several developmental stages isolated from immature seeds were also used. All samples were frozen in liquid nitrogen and stored at −80 °C untill further use.

RNA isolation, cDNA synthesis and construction of libraries

Total RNA was isolated as described by Canales et al . (2012). Double‐stranded cDNA libraries were constructed using the MINT kit (Evrogen) with incorporated modifications proposed by Babik et al . (2010). The quality of cDNA libraries was checked on 1.5% agarose gels and Agilent 2100 Bioanalyzer. cDNA libraries were also constructed from poly(A) enriched RNA. Briefly, equal amounts of total RNA (50 μg) from different tissues were combined prior to mRNA purification. Poly(A)+ mRNA was double purified using the Oligotex mRNA Mini Kit (Qiagen), according to the manufacturer's instructions. First and second strand cDNA synthesis was conducted from 10 μg of mRNA following the protocol described by Durban et al . (2011). Five cDNA synthesis reactions were performed to obtain sufficient cDNA quantity for 454 pyrosequencing. RNA isolation, purification and library construction from somatic embryos were carried out by GATC company (Constanz, Germany) using Smart cDNA construction kit and standard procedures for Illumina GA IIx sequencing and using standard kit for construction of normalized cDNA library dedicated to Roche/454 GS FLX sequencing.

DNA sequencing and preprocessing analysis

Libraries were sequenced using different sequencing platforms, including Roche/454 Titanium, Roche/454 GS FLX, Illumina GAIIx and Illumina HiSeq 2000. Chloroplast, mitochondrial and ribosomal sequences in short reads were filtered out by reference mapping using BWA (Li and Durbin, 2009). All reads were preprocessed using the SeqTrimNext pipeline (http://www.scbi.uma.es/seqtrimnext; Falgueras et al., 2010) available at the Genotoul's computer facilities (INRA Toulouse, France) and at the Plataforma Andaluza de Bioinformática (University of Málaga, Spain). Low quality, ambiguous and low complexity stretches, linkers, adaptors, vector fragments, organelle DNA, polyA/polyT tails, and contaminated sequences were removed while keeping the longest informative part of the read, discarding sequences below 20 (short reads) or 40 bp (long reads). The command line for long reads was seqtrimnext ‐t transcriptomics_plants.txt ‐Q input_reads.fastq > output.txt , where the configuration file transcriptomics_plants.txt is provided by default with SeqTrimNext and contains the configuration parameters. The command line for short reads was the same than above, but configuration file also indicates to filter out sequences shorter than 20 pb, skip preliminary statistic calculations and skip the read clonality analysis.

Transcriptome assembly

The bioinformatics process for obtaining the putative unigenes using the SeqTrimNext preprocessed reads is detailed in Figure 1. Short reads and long reads were assembled using well‐established protocols (see Methods S1). The complexity and redundancy of short read and long read assemblies was reduced using CAP3 assembler (Liang et al ., 2000) to obtain the unigenes. A 95% cut‐off for overlap per cent identity was applied to cope with both sequence variation (high heterozygosity of pine genes) and genome heterogeneity between samples as previously described (Fernández‐Pozo et al ., 2011).

Annotation analysis

Unigenes were analysed using Sma3s (Pérez‐Pulido, A., Muñoz‐Mérida, A., Claros, M.G., Trelles, O., submitted; http://www.bioinfocabd.upo.es/node/11) with the command line sma3s_v2.pl ‐a 123 ‐d uniprot_plants.dat ‐i Unigenes.fasta ‐b Unigenes.blast ‐p F > output using as input the unigenes (.fasta file), the blast result of unigenes against the plant UniProtKB database (.blast file) and the metadata of the plant UniProtKB database (.dat file) to provide a gene description, GO terms, EC keys, KEGG maps and InterPro codes for every sequence. They were also analysed with FLN to provide gene description, identify, which unigenes corresponded to FLcDNAs, detect putative start and stop codons as well as the putative protein sequence, extract, which unigenes could be sRNAs, and obtain a quick preview of the pine unigene content. AutoFact (Koski et al ., 2005) was also used to provide a third gene description.

Identification of polymorphisms

Because nucleotide variation is considered frequent in plant genes, we screened the unigenes for SNPs and SSRs variation. SSRs were screened using MREPS (http://bioinfo.lifl.fr/mreps/; Kolpakov et al ., 2003) with default parameters. SNP detection was performed based on 4 098 953 Roche/454 and Sanger reads using either CLC‐Bio Workbench, v6.0 (CLC Bio, Aarhus, Denmark) or Gygabayes (http://bioinformatics.bc.edu/marthlab/wiki/index.php/Software), an implementation and expansion of the PolyBayes SNP detection algorithm from Marth et al . (1999). With respect to the former, we first used the reference mapping function to map the reads onto the 210 513 unigenes generated in the present study. The following parameters were used: similarity score = 80%, minimum length fraction = 90% and maximum number of hits for reads = 10. Then, we used the SNP detection function, based on the neighbourhood quality standard (NQS) algorithm, using the following parameters: minimum coverage = 10, minimum central base quality = 20, minimum neighbourhood quality over a window length of 11 nucleotides = 15, maximum gap and mismatch count = 2 and minimum allele frequency = 20%. With respect to the latter, we used the following parameters: (ploidy = diploid; CRL = 4 or 10; CAL2 = 2; PSL=0.9; O = 1; D = 0.001). In both software, no insertion/deletion variants (InDels) were considered.

Acknowledgements

We are indebted to the anonymous reviewers for their thorough evaluation and constructive recommendations that helped to improve this manuscript. We would like to thank Aranzazu Flores Monterroso, Remedios Crespillo, José Vega Bartol and Marta Simôes for help in the preparation of samples. This work was supported by the SUSTAINPINE Project funded by the Plant KBBE programme, Scientific and Technological Cooperation in Plant Genome Research (PLE2009‐0016). Zygotic embryo data production was supported by FCT (Portugal) through projects PTDC/AGRGPL/102877/2008 and P‐KBBE/AGR‐GPL/0001/2009, and grants PEst‐OE/EQB/LA0004/2011 and SFRH/BD/79779/2011 (AR). Alexandre Aguiar and Isabel Carrasquinho from Instituto Nacional de Investigação Agrária e Veterinária (INIAV) are acknowledged for making zygotic embryos available. Somatic embryo data production was supported by the EMBRYOME project, contract number 33639, funded by the french ‘Conseil Régional de la Région Centre’.

      Number of times cited according to CrossRef: 70

      • The transcriptome of Pinus pinaster under Fusarium circinatum challenge, BMC Genomics, 10.1186/s12864-019-6444-0, 21, 1, (2020).
      • The Analysis of Pinus pinaster SnRKs Reveals Clues of the Evolution of This Family and a New Set of Abiotic Stress Resistance Biomarkers, Agronomy, 10.3390/agronomy10020295, 10, 2, (295), (2020).
      • Structural and Functional Characteristics of Two Molecular Variants of the Nitrogen Sensor PII in Maritime Pine, Frontiers in Plant Science, 10.3389/fpls.2020.00823, 11, (2020).
      • Phylogenomics indicates the “living fossil” Isoetes diversified in the Cenozoic, PLOS ONE, 10.1371/journal.pone.0227525, 15, 6, (e0227525), (2020).
      • Connections Between Amino Acid Metabolisms in Plants: Lysine as an Example, Frontiers in Plant Science, 10.3389/fpls.2020.00928, 11, (2020).
      • Gene and Genome Sequencing in Conifers: Modern Era, The Conifers: Genomes, Variation and Evolution, 10.1007/978-3-319-46807-5, (43-60), (2019).
      • Small RNA profiling in Pinus pinaster reveals the transcriptome of developing seeds and highlights differences between zygotic and somatic embryos, Scientific Reports, 10.1038/s41598-019-47789-y, 9, 1, (2019).
      • Using transcriptome sequencing and pooled exome capture to study local adaptation in the giga‐genome of Pinus cembra, Molecular Ecology Resources, 10.1111/1755-0998.12986, 19, 2, (536-551), (2019).
      • Completion of the cytosolic post-chorismate phenylalanine biosynthetic pathway in plants, Nature Communications, 10.1038/s41467-018-07969-2, 10, 1, (2019).
      • Utilization of Tissue Ploidy Level Variation in de Novo Transcriptome Assembly of Pinus sylvestris , G3: Genes|Genomes|Genetics, 10.1534/g3.119.400357, 9, 10, (3409-3421), (2019).
      • A high-density exome capture genotype-by-sequencing panel for forestry breeding in Pinus radiata, PLOS ONE, 10.1371/journal.pone.0222640, 14, 9, (e0222640), (2019).
      • Characterization and Application of EST-SSR Markers Developed From the Transcriptome of Amentotaxus argotaenia (Taxaceae), a Relict Vulnerable Conifer, Frontiers in Genetics, 10.3389/fgene.2019.01014, 10, (2019).
      • Mass spectrometry‐based forest tree metabolomics, Mass Spectrometry Reviews, 10.1002/mas.21603, 0, 0, (2019).
      • SNP discovery in radiata pine using a de novo transcriptome assembly, Trees, 10.1007/s00468-019-01875-w, (2019).
      • Glutamate synthases from conifers: gene structure and phylogenetic studies, BMC Genomics, 10.1186/s12864-018-4454-y, 19, 1, (2018).
      • Analysis of the WUSCHEL-RELATED HOMEOBOX gene family in Pinus pinaster : New insights into the gene family evolution, Plant Physiology and Biochemistry, 10.1016/j.plaphy.2017.12.031, 123, (304-318), (2018).
      • Development of EST-SSR markers in Larix principis-rupprechtii Mayr and evaluation of their polymorphism and cross-species amplification, Trees, 10.1007/s00468-018-1733-9, 32, 6, (1559-1571), (2018).
      • Transcriptome Comparison Reveals Distinct Selection Patterns in Domesticated and Wild Agave Species, the Important CAM Plants, International Journal of Genomics, 10.1155/2018/5716518, 2018, (1-12), (2018).
      • Resources for conifer functional genomics at the omics era, , 10.1016/bs.abr.2018.11.002, (2018).
      • Branching out: a new era of investigating physiological processes in forest trees using genomic tools, Tree Physiology, 10.1093/treephys/tpy026, 38, 3, (303-310), (2018).
      • Maritime Pine Pinus Pinaster Aiton, Step Wise Protocols for Somatic Embryogenesis of Important Woody Plants, 10.1007/978-3-319-89483-6_13, (167-179), (2018).
      • A comparative analysis between SNPs and SSRs to investigate genetic variation in a juniper species (Juniperus phoenicea ssp. turbinata), Tree Genetics & Genomes, 10.1007/s11295-018-1301-x, 14, 6, (2018).
      • NAC Transcription Factors in Woody Plants, , 10.1007/124_2018_19, (2018).
      • Identification and expression profiling of genes governing lignin biosynthesis in Casuarina equisetifolia L., Gene, 10.1016/j.gene.2018.07.012, 676, (37-46), (2018).
      • De novo transcriptome sequencing and SSR markers development for Cedrela balansae C.DC., a native tree species of northwest Argentina, PLOS ONE, 10.1371/journal.pone.0203768, 13, 12, (e0203768), (2018).
      • Application of EST-SSR markers developed from the transcriptome of Torreya grandis (Taxaceae), a threatened nut-yielding conifer tree , PeerJ, 10.7717/peerj.5606, 6, (e5606), (2018).
      • Abies pinsapo Boiss. Transcriptome Sequencing and Molecular Marker Detection: A Novel Genetic Resources for a Relict Mediterranean Fir, Forest Science, 10.1093/forsci/fxy022, (2018).
      • Nitrogen utilization during germination of somatic embryos of Norway spruce: revealing the importance of supplied glutamine for nitrogen metabolism, Trees, 10.1007/s00468-018-1784-y, (2018).
      • Transcriptome analysis of Pinus halepensis under drought stress and during recovery, Tree Physiology, 10.1093/treephys/tpx137, 38, 3, (423-441), (2017).
      • Root growth of somatic plants of hybrid Pinus strobus (L.) and P. wallichiana (A. B. Jacks.) is affected by the nitrogen composition of the somatic embryo germination medium, Trees, 10.1007/s00468-017-1635-2, 32, 2, (371-381), (2017).
      • The role of arginine metabolic pathway during embryogenesis and germination in maritime pine (Pinus pinaster Ait.), Tree Physiology, 10.1093/treephys/tpx133, 38, 3, (471-484), (2017).
      • Pp1, a main regulator of phenylalanine biosynthesis and utilization in maritime pine, Plant Biotechnology Journal, 10.1111/pbi.12854, 16, 5, (1094-1104), (2017).
      • Molecular fundamentals of nitrogen uptake and transport in trees, Journal of Experimental Botany, 10.1093/jxb/erx037, 68, 10, (2489-2500), (2017).
      • Integrated analysis of 454 and Illumina transcriptomic sequencing characterizes carbon flux and energy source for fatty acid synthesis in developing Lindera glauca fruits for woody biodiesel, Biotechnology for Biofuels, 10.1186/s13068-017-0820-2, 10, 1, (2017).
      • Expression profiling in Pinus radiata infected with Fusarium circinatum, Tree Genetics & Genomes, 10.1007/s11295-017-1125-0, 13, 2, (2017).
      • The gene expression landscape of pine seedling tissues, The Plant Journal, 10.1111/tpj.13617, 91, 6, (1064-1087), (2017).
      • Identification of Norway Spruce MYB-bHLH-WDR Transcription Factor Complex Members Linked to Regulation of the Flavonoid Pathway, Frontiers in Plant Science, 10.3389/fpls.2017.00305, 8, (2017).
      • Characterization of Three L-Asparaginases from Maritime Pine (Pinus pinaster Ait.), Frontiers in Plant Science, 10.3389/fpls.2017.01075, 8, (2017).
      • Identification of a small protein domain present in all plant lineages that confers high prephenate dehydratase activity, The Plant Journal, 10.1111/tpj.13195, 87, 2, (215-229), (2016).
      • RNA-seq analysis in forest tree species: bioinformatic problems and solutions, Tree Genetics & Genomes, 10.1007/s11295-016-0995-x, 12, 2, (2016).
      • Establishing gene models from the Pinus pinaster genome using gene capture and BAC sequencing, BMC Genomics, 10.1186/s12864-016-2490-z, 17, 1, (2016).
      • The variations in the nuclear proteome reveal new transcription factors and mechanisms involved in UV stress response in Pinus radiata, Journal of Proteomics, 10.1016/j.jprot.2016.03.003, 143, (390-400), (2016).
      • Selection and testing of reference genes for accurate RT-qPCR in adult needles and seedlings of maritime pine, Tree Genetics & Genomes, 10.1007/s11295-016-1018-7, 12, 3, (2016).
      • Genetic mapping of Pinus flexilis major gene (Cr4) for resistance to white pine blister rust using transcriptome-based SNP genotyping, BMC Genomics, 10.1186/s12864-016-3079-2, 17, 1, (2016).
      • Performance of genomic prediction within and across generations in maritime pine, BMC Genomics, 10.1186/s12864-016-2879-8, 17, 1, (2016).
      • Molecular Aspects of Conifer Zygotic and Somatic Embryo Development: A Review of Genome-Wide Approaches and Recent Insights, In Vitro Embryogenesis in Higher Plants, 10.1007/978-1-4939-3061-6_8, (167-207), (2016).
      • Deciphering the molecular basis of ammonium uptake and transport in maritime pine, Plant, Cell & Environment, 10.1111/pce.12692, 39, 8, (1669-1682), (2016).
      • Dataset of UV induced changes in nuclear proteome obtained by GeLC-Orbitrap/MS in Pinus radiata needles, Data in Brief, 10.1016/j.dib.2016.03.074, 7, (1477-1482), (2016).
      • Linkage and Association Mapping for Two Major Traits Used in the Maritime Pine Breeding Program: Height Growth and Stem Straightness, PLOS ONE, 10.1371/journal.pone.0165323, 11, 11, (e0165323), (2016).
      • Nitrogen Economy and Nitrogen Environmental Interactions in Conifers, Agronomy, 10.3390/agronomy6020026, 6, 2, (26), (2016).
      • An RNA‐Seq‐based reference transcriptome for Citrus, Plant Biotechnology Journal, 10.1111/pbi.12447, 14, 3, (938-950), (2015).
      • Group II late embryogenesis abundant (LEA) proteins: structural and functional aspects in plant abiotic stress, Plant Growth Regulation, 10.1007/s10725-015-0113-3, 79, 1, (1-17), (2015).
      • Forest tree genomics: 10 achievements from the past 10 years and future prospects, Annals of Forest Science, 10.1007/s13595-015-0488-3, 73, 1, (77-103), (2015).
      • High‐density SNP assay development for genetic analysis in maritime pine (inus pinaster), Molecular Ecology Resources, 10.1111/1755-0998.12464, 16, 2, (574-587), (2015).
      • The overexpression of the pine transcription factor PpDof5 in Arabidopsis leads to increased lignin content and affects carbon and nitrogen metabolism, Physiologia Plantarum, 10.1111/ppl.12381, 155, 4, (369-383), (2015).
      • Paternity recovery in two maritime pine polycross mating designs and consequences for breeding, Tree Genetics & Genomes, 10.1007/s11295-015-0932-4, 11, 5, (2015).
      • The oak gene expression atlas: insights into Fagaceae genome evolution and the discovery of genes regulated during bud dormancy release, BMC Genomics, 10.1186/s12864-015-1331-9, 16, 1, (112), (2015).
      • Nucleotide polymorphisms in a pine ortholog of the Arabidopsis degrading enzyme cellulase KORRIGAN are associated with early growth performance in Pinus pinaster , Tree Physiology, 10.1093/treephys/tpv050, 35, 9, (1000-1006), (2015).
      • Transcriptome sequencing and development of EST-SSR markers in Pinus dabeshanensis, an endangered conifer endemic to China, Molecular Breeding, 10.1007/s11032-015-0351-0, 35, 8, (2015).
      • The NAC transcription factor family in maritime pine (Pinus Pinaster): molecular regulation of two genes involved in stress responses, BMC Plant Biology, 10.1186/s12870-015-0640-0, 15, 1, (2015).
      • Combined de novo and genome guided assembly and annotation of the Pinus patula juvenile shoot transcriptome, BMC Genomics, 10.1186/s12864-015-2277-7, 16, 1, (2015).
      • Transcriptome-wide analysis supports environmental adaptations of two Pinus pinaster populations from contrasting habitats, BMC Genomics, 10.1186/s12864-015-2177-x, 16, 1, (2015).
      • Understanding developmental and adaptive cues in pine through metabolite profiling and co-expression network analysis, Journal of Experimental Botany, 10.1093/jxb/erv118, 66, 11, (3113-3127), (2015).
      • Bioinformatics Analyses to Separate Species Specific mRNAs from Unknown Sequences in de novo Assembled Transcriptomes, Bioinformatics and Biomedical Engineering, 10.1007/978-3-319-16480-9_32, (322-332), (2015).
      • Single‐nucleotide polymorphism discovery and validation in high‐density SNP array for genetic analysis in European white oaks, Molecular Ecology Resources, 10.1111/1755-0998.12407, 15, 6, (1446-1459), (2015).
      • ReprOlive: a database with linked data for the olive tree (Olea europaea L.) reproductive transcriptome, Frontiers in Plant Science, 10.3389/fpls.2015.00625, 6, (2015).
      • De novo assembly, characterization and functional annotation of Senegalese sole (Solea senegalensis) and common sole (Solea solea) transcriptomes: integration in a database and design of a microarray, BMC Genomics, 10.1186/1471-2164-15-952, 15, 1, (952), (2014).
      • The GRAS gene family in pine: transcript expression patterns associated with the maturation-related decline of competence to form adventitious roots, BMC Plant Biology, 10.1186/s12870-014-0354-8, 14, 1, (2014).
      • Novel dehydrins lacking complete K-segments in Pinaceae. The exception rather than the rule, Frontiers in Plant Science, 10.3389/fpls.2014.00682, 5, (2014).
      • From soil to plant, the journey of P through trophic relationships and ectomycorrhizal association, Frontiers in Plant Science, 10.3389/fpls.2014.00548, 5, (2014).

      The full text of this article hosted at iucr.org is unavailable due to technical difficulties.