Next‐generation sequencing fails to identify human virus sequences in cutaneous squamous cell carcinoma

Nonmelanoma skin cancer (NMSC) shows a strongly increased incidence in solid organ transplant recipients (OTRs) and AIDS patients, suggesting an infectious etiology. The role of certain viruses, i.e., cutaneous human papillomaviruses (HPVs), in NMSC in immunosuppressed patients remains controversial. Merkel cell polyomavirus (MCPyV), which was recently identified using high‐throughput sequencing, has been linked to cutaneous proliferations. Here, we aimed to identify novel or known viral sequences at the transcript level in cutaneous squamous cell carcinomas (SCCs) from OTR by using 454 high‐throughput pyrosequencing, which can produce long reads (∼400 bp) and thus is better suited for the analysis of unknown sequences than other sequencing platforms. cDNA libraries from three OTR SCC biopsies were generated and submitted to next‐generation sequencing using a 454 platform. Bioinformatic analysis included digital transcriptome subtraction and—in parallel—reference mapping as an alternative way for depleting human sequences. All control sequences introduced for bioinformatics analysis were recovered correctly. Among 717,029 454‐sequenced transcripts, nearly all identified viral reads were derived from phages. Bacterial sequences originated from the skin flora or environmental sources. Our study did not reveal any transcripts of known oncogenic or related unknown human viruses. These findings suggest that there is no abundant expression of known human viruses, or viruses with a high degree of homology to known human viruses, in cutaneous SCCs of OTR. Further studies are required to exclude the presence of viruses in NMSC, which cannot easily be identified on the basis of sequence homology to known viruses.

Nonmelanoma skin cancer (NMSC) including cutaneous squamous cell carcinoma (SCC) is the most common skin cancer among solid organ transplant recipients (OTRs) and a significant cause of morbidity and mortality in these patients. 1,2 The NMSC rates are markedly increased in OTR and AIDS patients compared to the general population. 3,4 Furthermore, OTRs with NMSC tend to have multiple tumor lesions, a higher recurrence rate and the tumors are more likely to metastasize. [5][6][7] Established risk factors for NMSC include exposure to ultraviolet radiation, fair color of the skin and immunosuppressive treatment. 1,2 The question whether the increased NMSC rate in immunocompromised individuals-in particular OTR-could reflect a viral involvement in skin carcinogenesis has been investigated intensively. Although there is evidence that human papillomaviruses (HPVs), especially from species beta and gamma, could play a role in the development of NMSC, their etiologic role is discussed controversially. 8,9 Consistent with this, a recent study investigating the cancer transcriptome of 31 cutaneous SCCs by high-throughput sequencing (Illumina) failed to detect active HPV expression in any of the tumor samples. 10 This suggests that HPV may not play role in the continuous proliferation of SCC cells.
Massive parallel sequencing has opened the possibility of comprehensively cataloguing microbial genomes and of analyzing transcriptomes in individual patient samples. Some high-throughput sequencing platforms (i.e., Illumina or SOLiD) produce short reads but reach a high-sequence coverage and are well suited, e.g., for targeted resequencing of cancer genomes. In contrast, the 454 technology yields fewer reads per run, but the average read length of $400 bases (with a maximum of up to 1 kb) facilitates similarity searches (e.g., via Basic Local Alignment Search Tool, BLAST). Therefore, this technology is a powerful tool to identify novel viral or bacterial sequences with a sufficient degree of homology to known pathogen genomes. Examples for the discovery of new viruses by this approach include an arenavirus transmitted through solid organ transplantation 11 and the Merkel cell polyomavirus (MCPyV), which was identified in the rare skin tumor Merkel cell carcinoma (MCC) by BLAST-based digital transcriptome subtraction (DTS) out of more than 300,000 pyrosequenced cDNA sequences. 12 MCPyV as well as two novel human polyomaviruses (HPyVs) identified by degenerate PCR are shed from the skin as assembled virions and therefore seem to belong to the human skin microbiome. 13 The recently discovered trichodysplasia spinulosa-associated polyomavirus appears to be another example for the involvement of novel viruses in skin proliferations of immunocompromised patients. 14 However, the role of MCPyV or other polyomaviruses in NMSC other than MCC is currently unclear: MCPyV DNA has been detected more frequently in NMSCs of immunosuppressed patients, whereas other studies found no evidence for MCPyV infection of NMSCs. 8,[15][16][17] In our study, we aimed to identify novel viral sequences in cutaneous SCCs from OTR by using 454 high-throughput pyrosequencing. We sequenced cDNA of SCC samples to identify actively transcribed viral sequences and to avoid detection of viral DNA shed from the skin surface. To fully exploit the advantage of the long-read lengths, we used the scoring matrix-based local aligner (BLAST) for DTS.

Patients and samples
Tissue biopsies from cutaneous SCCs were obtained from three patients (male 74, female 69 and male 69 years old) who had undergone solid organ transplantation (kidney, double lung or heart-lung transplantation), respectively, 10, 12 or 19 years ago. All patients gave their written informed consent to storage and analysis of the samples, and the study was approved by the local ethics committee. All tumor samples were confirmed as cutaneous SCC by an experienced dermatopathologist, who received the tumor tissue of the resection borders to confirm complete resection of the tumors, while we analyzed the tumor ''core.'' The HPV-18-positive HeLa cervical cancer cell line served as positive control.
cDNA library construction and high-throughput sequencing From each SCC sample, 50 mg of tumor tissue was lysed in a TissueLyser II (Qiagen, Germany) and total RNA was extracted using Trizol (Invitrogen, Germany). PolyA(þ)-RNA was extracted using oligo-d(T)-covered magnetic beads (Dynabeads Direct, Invitrogen, Germany) followed by 20-min DNAseI treatment (Promega, Germany) to avoid DNA carryover. After randomly primed reverse transcription (Expand Reverse Transcriptase, Roche, Germany), the second strand was synthesized using Klenow polymerase (New England Biolabs, Germany), and the resulting cDNA was used as the template for Roche 454 Rapid Library procedure (Roche, Germany). Two separate libraries were prepared from sample SSC-1. Integrity and fragment size of the library were evaluated using a 6000 RNA Pico chip (Agilent, Germany) in an Agilent Bioanalyzer.
After amplification by emulsion PCR (GS FLX Titanium SV emPCR Kit, Roche), high-throughput sequencing for each tumor sample was performed on one ''Large Region'' of a Titanium PicoTiterPlate (for HeLa cells on two 1/8th regions) in a GS FLX Sequencer (Roche, Germany) according to the manufacturer's instructions.

Bioinformatics
To remove artificial and low-quality sequence reads of the obtained 454 data, several filtering steps were applied to the raw data sets. FASTA-corresponding quality files were checked by LUCY software, 18 which excludes low-quality reads ( Fig. 1). Low-complexity sequences, defined by the BLAST ''dust'' filtering option, were removed using in-house Python scripts (available upon request).
To deplete the obtained high-quality (HiFi) data set of human transcripts, we used two methods, DTS 19 with our modifications and the ''reference mapping approach.'' Both methods used the same human reference databases [RefSeq Human Genomic, Human Transcript, Human assembled chromosomes (GRCh37, HuRef, Hs_Celera) and Human Mitochondrion; see Supporting Information Table S2]. In the DTS approach, data were BLASTed using BLASTþ 20 software package (see sources and parameters in Supporting Information Tables S1 and S3) in two steps: first, XML files obtained using MegaBLAST option were parsed and subtracted for matches using in-house Biopython-based scripts. Second, a BLASTn option was used to identify the remaining reads with matches in human databases. In the alternative way of depleting human transcripts, the ''reference mapping approach,'' the same HiFi data sets were mapped using the Roche 454 Reference Mapper v2.5.3 to the mentioned human reference databases.
The remaining data of both the DTS and the mapping approach, i.e., reads with no match or not mapped to human sequences, were screened against viral databases by BLASTing at the nucleotide level, initially with default parameters. Nonmatching queries were reanalyzed using ''low-stringency'' parameters. Subsequently, sequences with no match at the nucleotide level were BLASTed at the protein level with BLASTx using default parameters. Finally, sequences without significant matches at both nucleotide and protein levels were checked with the tBLASTx algorithm (Supporting Information Table S3).
All hits to viral database entries were subsequently checked for possible matches in the NR (nonredundant) database using NCBI-BLAST online or EMBL-based WU-BLAST. 21 In addition, for these putative viral hits, the resulting expectancy value, normalized score, alignment length and alignment percent identity were compared with matches obtained from a BLAST search against the NR database. Higher values for the indicated parameters and lower for the expectancy value, respectively, indicated the correct match.

Control sequences
To control the ability of our bioinformatic algorithm to identify reads related to published viral sequences, a part of the large T (LT) antigen of MCPyV coding sequence (positions 1-175; GeneBank NC_010277.1) was manually modified to yield artificial sequences with 93 and 83% homology to the MCPyV LT wt sequence. These sequences were added to the data file as controls. The percent identity between control reads and corresponding database records served as surrogate value for the sensitivity of the analysis.

Results
Results of high-throughput sequencing 454 high-throughput sequencing yielded a total of 717,029 reads for the three SCC samples. On an average, 7.3% of all sequences failed to pass the quality check, leaving 664,822 ''HiFi'' sequences (a few more reads were filtered out by the reference mapper software previous to mapping; therefore, slightly fewer reads (664,436) were subsequently analyzed for this approach; see Fig. 1). After passing the complete bioinformatic work flow, a total of 662,757 and 662,267 reads matched human sequences using the ''DTS'' or ''reference mapping'' approach, respectively. Detailed information for each data set from the three SCC samples is presented in Table 1.
To estimate the coverage of individual cellular genes, we analyzed the number of reads belonging to transcripts expected to have a high abundance in our samples: 6,576 (0.92%) reads covered keratin 14 transcripts (NM_000526.4) expected to be highly abundant in basal keratinocytes. As low-abundant transcripts that were expected to occur only in Figure 1. Flowchart of the bioinformatic analysis. The numbers given in the flowchart indicate the number of sequence reads left at each step of the analysis pipeline. 1 Quality control performed by LUCY software. 2 Low complexities were identified with BLAST ''dust'' filtering option and removed using in-house Python script. 3 Mapping was performed with Roche Reference Mapper v 2.5. 4 Similarity search in human databases on the nucleotide level. 5 Similarity search on the nucleotide level, but with lower stringency parameters. 6 BLAST search for translated nucleotide query against viral and nonredundant protein databases. 7 Checking for possible similarities of translated nucleotide entries in translated nucleotide databases. 8 Thirty-seven unique sequences remained unassigned in the DTS and 42 in the mapping approach only, whereas 425 sequences remained unassigned by both methods. This is represented by numbers in ovals.
subpopulations of cells in our tumor biopsies, we identified nine hits for CD101, three hits for langerin (CD207) and one hit for each CD14 and CD1a, which are markers for Langerhans cells.

Recovery of control sequences
All control sequences (the original and two artificially modified MCPyV LT sequences), which were ''spiked'' into the data sets before bioinformatic analysis, could be recovered by both ways of data analysis, DTS and reference mapping. At the nucleotide level, the artificially modified control reads resulted in polyomavirus matches only: exact matches to MCPyV and-as the match with the lowest homology-Pan troglodytes verus polyomavirus (HQ385748.1), which was identified with both the 93 and 83% homology control sequences. At the protein level, the hit with the weakest homology recovered was Gorilla gorilla gorilla polyomavirus (ADQ54206.1, 41% identity to the original MPyV control sequence). No species other than polyomaviruses matched to any of the control reads. This confirms the capacity of the bioinformatic approach to identify potentially unknown viruses related to known viruses, and its sufficient level of specificity, even with the lowest stringency of analysis.

Recovery of HPV-18 reads from the HeLa data set
454 sequencing of HeLa cells yielded 82,364 reads. A total of 3.6% were filtered out as low-quality reads and $96% matched to human databases. Using the DTS and mapping approach, 187 and 192 HPV-18 transcripts, respectively, were identified. Analysis of partially mapped sequences revealed additional 79 sequences, representing fused transcripts of cellular and HPV sequences. For the details, see Supporting Information Table S4.

Characterization of nonhuman hits in SCC samples
A total of 2,171 (DTS approach) and 3,301 (mapping approach) residual sequences were screened for matches with viral and other non-human sequences (Fig. 1). We obtained 11, 7 and 12 viral matches using both DTS and the ''mapping'' approach for SCC-1, -2 and -3, respectively (Table 1). With one exception (Paramecium bursaria chlorella virus), all hits to viral databases were phages, e.g., Pseudomonas phage or Enterobacteria phage ( Table 2; detailed alignments are shown in Supporting Information Fig. S1).
Among the remaining other sequences (i.e., nonviral and non-human), bacterial or fungal hits (e.g., Propionibacteria, Pseudomonas or Streptomyces species as well as Malassezia species; see Supporting Information Table S5) were identified. Moreover, a number of them had nonexact matches to primates and rodents. Because of the low likelihood of mammalian mRNA other than human in our samples, these nonexact matches to primates and rodents were regarded as human. Few reads matching to sequences not expected in the analyzed material, e.g., to plant species or model organisms, such as V. vinifera (grape) or D. rerio (zebrafish) were not Reads matching to sequences not expected in the analyzed material, e.g., to plant species or model organisms. 2 All left, unassigned sequences have been manually checked.
considered as plausible and assigned to the group of ''other'' matches (Table 1). Following this analysis, a total of 504 sequences could not be assigned to any entry in any of the databases used (425 of these unassigned sequences could not be assigned by both methods; 37 sequences remained unassigned in the DTS and 42 in the mapping approach only). Each of these sequences was BLASTed without limitation for the expectancy value. The resulting file was analyzed manually, and all putative viral matches or other matches of interest were checked online in EMBL-based WU-BLAST using default parameters (with ''dust'' filter option off). Although we obtained hits to viral sequences, these were short, had high expectancy values (i.e., a high chance, that the match was accidental) and they were found among other matches of the same quality. Thus, we concluded that these matches were most likely spurious.

Discussion
High-throughput sequencing of three cutaneous SCC samples did not reveal known human viral transcripts or any transcripts with sufficient homology to any known human or animal viruses. All control sequences could be recovered indicating that our analysis would have been capable of identifying viral sequences with a low homology to known human or animal viruses (we observed bona fide matches with as little as 41% homology at the protein level). In addition, several viral phage sequences as well as bacterial and fungal hits were identified, which might originate from an environmental source or the skin flora and thus further confirm the validity of the method. Both bioinformatic approaches (DTS or reference mapping) to deplete human sequences from the data set yielded a comparable number of non-human matches for each tumor sample and can therefore be used equally for this purpose. Our analysis also did not reveal any transcripts of known HPVs or polyomaviruses, which is consistent with the results of others 10 who did not find papillomavirus mRNA expression in SCCs by transcriptome sequencing using the Illumina technique. Taken together, this might indicate that known viruses such as HPV or HPyV are not transcribed in the majority of tumor cells and may therefore at least not be required for the maintenance of NMSC tumor cells.
Although Feng et al. were able to identify a novel polyomavirus, MCPyV, in an analysis of four Merkel cell carcinoma samples 12 yielding a total of 395,734 sequenced reads (two reads out of these were MCPyV reads), in our study no expressed viral sequences besides phages were found by 454 sequencing of three SCC samples resulting in a total of 717,029 reads. Nevertheless, sequencing of HeLa cells as positive control successfully identified HPV-18 transcripts (0.3% of all reads) with a similar percentage as obtained by others using short-read approaches (i.e., Illumina). 10,22 The reported levels of tumor virus transcripts in other tumors range between 0.02 and 0.06% for HPV-16-containing cervical and periungual SCCs 10 and 0.24% for the KSHV-infected tumor cell line BCBL-1. 19 Given our number of obtained sequence reads from the three SCCs, one viral transcript would have reflected a relative abundance of $1-2 Â 10 À6 (1 of 717,029), i.e., much lower than observed for HeLa cells or reported for other tumors.
We conclude that the fact that the 454 technique does not allow sequencing to the same depth as, e.g., the Illumina protocol is balanced by its advantage of a longer read length, thereby providing a better basis for the detection of novel sequences related to known microorganisms. However, the possibility of having missed a viral transcript of very low abundance and/or very low sequence homology to known viruses cannot be completely excluded.
Mapping data can be downloaded using the following links (contact the corresponding author for the password):