Efficient algorithms for the discovery of DNA oligonucleotide barcodes from sequence databases



    1. School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A 1S6,
    • Present address: PO Box 3674, Garibaldi Highlands, BC, Canada V0N 1T0.

  • V. DAHL,

    1. School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A 1S6,
  • W. CHEN,

    1. Agriculture & Agri-Food Canada, Ottawa, ON, Canada K1A 0C6,
    2. Department of Biology, Carleton University, Ottawa, Ontario, Canada, K1S 5B6

    1. Agriculture & Agri-Food Canada, Ottawa, ON, Canada K1A 0C6,
    2. Department of Biology, Carleton University, Ottawa, Ontario, Canada, K1S 5B6

C. A. Lévesque, Fax: 1-613-759-1701; E-mail: levesqueca@agr.gc.ca


Efficient design of barcode oligonucleotides can lead to significant cost reductions in the manufacturing of DNA arrays. Previous methods are based on a preliminary alignment, which reduces their efficiency for intron-rich regions; on a brute force approach, which is not feasible for large-scale problems; or on data structures with very poor performance in the worst case. One of the algorithms we propose uses ‘oligonucleotide sorting’ for the discovery of oligonucleotide barcodes of given sizes, with good asymptotic performance. Specific barcode oligonucleotides with at least one base difference from other sequences in a database are found for each individual sequence. With another algorithm, specific oligonucleotides can also be found for groups or clades in the database; these have 100% homology across all sequences within the group or clade while differing from the rest of the data. By re-organizing the sequences/groups in the database, oligonucleotides for different hierarchical levels can be found. The oligonucleotides or polymorphism locations identified as species or clade specific by the new algorithm are refined and screened further for hybridization thermodynamic properties with third party software.


DNA arrays are important tools for functional genomics, diagnostics and molecular detection of micro-organisms (Heller 2002; Lievens & Thomma 2005; Summerbell et al. 2005; Sessitsch et al. 2006; Wu et al. 2006; Agindotan & Perry 2007; Boonham et al. 2007; Lévesque 2007). A new generation of arrays was introduced by using light mask activated in-situ oligonucleotide synthesis on silicon chips (Lipshutz et al. 1995). Compared to cDNA-based ones, oligonucleotide-based arrays using photolithography or microspotting are popular in functional genomic research. For biodiversity research or the detection of micro-organisms in complex environmental samples, oligonucleotide-based arrays have been more commonly used than arrays with large DNA fragments (Lévesque et al. 1998; Uehara et al. 1999; Ye et al. 2001; Fessehaie et al. 2003; Lievens et al. 2003; Peplies et al. 2003). Since these arrays are often based on a single DNA region and minor sequence differences must be detected without false positives, an oligonucleotide approach is a more practical choice than arrays with large DNA fragments. Longer polymerase chain reaction (PCR) products spotted on membranes give very poor species specificity (Lévesque et al. 1998).

By carefully choosing oligonucleotides with homology to various groups of sequences, it is possible to select oligonucleotides with high specificity that are less likely to cross-react with unintended targets. In functional genomics, this could mean oligonucleotide hybridization to all alleles of a gene, all genes in a gene family or hybridization to a single allele differentiated by a single nucleotide polymorphism. In biodiversity, the oligonucleotides can be designed for specificity at the phylogenetic cluster, family, genus or species level.

Oligonucleotide barcodes are chosen based on the objective of hybridizing with a complementary strand from the target DNA sequence, into a dimer that is stable at the experimental temperature. Promising candidates for providing matches in a region of a given target sequence are perfect complements of subsequences in the target area which do not match perfectly with any other sequences to be discriminated against. The problem of oligonucleotide barcodes for array design can then be reformulated as the discovery of unique subsequences of a given size within a single target sequence or a set of target sequences. The method presented here provides a reasonably large pool of possible oligonucleotide barcode sequences or locations that can be further tested with other software for thermodynamic properties with their complements, thus filtering out the inadequate ones and adjusting the length of good candidates.

Ambiguous nucleotides further complicate the problem, as they may match a group of nucleotides at the complementary site of the sequence. For example, for a sequence containing -ARA-, complementary sequences containing -TTT- or -TCT- will match (the nucleotide R represents either A or G, complementary to either T or C). The method provided here can deal with large databases containing some ambiguities. Introns are a rich source for specific oligonucleotides. However, it is very difficult to design specific oligonucleotides from the poor alignments generally obtained from intron-rich sequences. An oligonucleotide search algorithm that would not rely on alignments would be very useful if not essential when dealing with highly variable sequences such as the ones often given by intron-rich sequences.
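The -ARA- example above can be reproduced with a short sketch. The code below is illustrative only and is not taken from SigOli; the table `IUPAC` and the function name `complementary_matches` are hypothetical names introduced here. It expands each ambiguity code into the unambiguous bases it represents and complements them position by position, as in the example (no strand reversal is applied).

```python
from itertools import product

# IUPAC nucleotide codes mapped to the unambiguous bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complementary_matches(site):
    """List every unambiguous complement of a possibly ambiguous site.

    Complementation is done position by position, mirroring the
    -ARA- example in the text (hypothetical helper, not SigOli code).
    """
    choices = [[COMPLEMENT[base] for base in IUPAC[code]] for code in site]
    return sorted("".join(combo) for combo in product(*choices))


print(complementary_matches("ARA"))
```

For the site ARA, the two complements listed in the text (TTT and TCT) are the only ones produced, since R expands to A or G.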

We establish the computational complexity extremes, ‘brute force’ and ‘optimal’, of algorithms that can identify oligonucleotide barcodes by finding unique (per sequence) subsequences within a group of sequences. Let us consider the size of a problem given by a number m of sequences of size n and a target size d for the individual barcode oligonucleotides. Usually barcode oligonucleotide sizes are chosen in the 10–70 range. Our discussion is provided for DNA oligonucleotide barcode sequences, but it can also be applied to RNA for functional genomics studies. A ‘brute force’ algorithm could be designed that will: (i) enumerate every subsequence of size d from every sequence (the first subsequence); (ii) enumerate every subsequence of size d from every sequence (the second subsequence); and (iii) match every nucleotide of the first subsequence with the nucleotide in the same position of the second subsequence. The result of the algorithm is the set of first subsequences that only match subsequences from the same sequence.

Step i will be executed m(n – d) times, or O(mn), step ii will be executed (m – 1)(n – d) times, or O(mn), and step iii would consist in the worst case of d comparisons, and on average of d/2 comparisons, or O(d). The computational complexity of this ‘brute force’ algorithm is the product of the complexities of steps i, ii and iii, or O(m²n²d). Even for a reasonably small test set, where n ≈ 10⁴, m ≈ 10⁴ and d ≈ 10², the algorithm already involves a multiple of 10¹⁸ operations, posing very serious challenges to current state-of-the-art hardware configurations. It can be further noted that during its execution, assuming a hypothetical ‘optimal’ algorithm, every nucleotide of every sequence in the test set will have to be seen at least once, i.e. the complexity of the optimal oligonucleotide barcode discovery algorithm is at least O(mn).
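The three steps of the brute force procedure can be sketched as nested loops. This is a minimal illustration under stated assumptions, not the implementation used by any of the systems discussed: the function name `brute_force_barcodes` is hypothetical, sequences are plain strings, and the sketch enumerates n – d + 1 windows per sequence (every full window), which does not change the asymptotic estimate. The last lines repeat the operation-count arithmetic for the test-set sizes quoted above.

```python
def brute_force_barcodes(sequences, d):
    """Return {sequence index: set of d-mers that occur in no other sequence}."""
    barcodes = {}
    for i, first_seq in enumerate(sequences):
        unique = set()
        # Step i: enumerate every subsequence of size d (the "first" subsequence).
        for p in range(len(first_seq) - d + 1):
            first = first_seq[p:p + d]
            found_elsewhere = False
            # Step ii: enumerate every subsequence of size d from the other sequences.
            for j, second_seq in enumerate(sequences):
                if j == i:
                    continue
                for q in range(len(second_seq) - d + 1):
                    # Step iii: position-by-position comparison, at most d checks.
                    if all(first[k] == second_seq[q + k] for k in range(d)):
                        found_elsewhere = True
                        break
                if found_elsewhere:
                    break
            if not found_elsewhere:
                unique.add(first)
        barcodes[i] = unique
    return barcodes


# Order of magnitude of m^2 * n^2 * d for the sizes quoted in the text.
m, n, d = 10**4, 10**4, 10**2
print(m**2 * n**2 * d == 10**18)
```

On realistic inputs the quadruple loop dominates the running time, which is the scalability problem the remainder of the paper addresses.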

We have reviewed a number of other algorithms that deal with the problem of identifying oligonucleotide barcodes with high specificity for sequences or groups of sequences. Zhang et al. (2002) used a phylogenetic tree, and identified unique signatures for each node in the tree (representing a particular evolutionary bacterial grouping). One of their systems (Subsystem II) is based on the brute force algorithm described above and consequently will scale poorly to larger databases. The system used hash tables as intermediary data structures, removing sequences that contain ambiguities or that are considered to be ‘insufficiently complete’. It was evaluated on a set of 7322 16S rRNA sequences, and signature oligonucleotides were sought for a representative phylogenetic tree containing 929 sequences. Signatures were found for 80% of the nodes in the representative tree, indicating that highly specific nucleotide sequences exist in large numbers in bacterial 16S rRNA.

The PRIMROSE system (Ashelford et al. 2002) used three algorithms that identify unique oligonucleotides in 16S rRNA. Their Algorithm 1 was similar to manual methods of signature oligonucleotide discovery, and started from a consensus alignment for the group of target sequences. As there are no guarantees that subsequence uniqueness will follow alignment, especially given the subjective ‘tuning’ parameters of the alignment, the algorithm is likely to have reasonably high specificity at the expense of coverage; that is, an unknown number of unique oligonucleotides may not be identified. Their Algorithm 2 and Algorithm 3 were based on the brute force approach, and consequently suffered from poor scalability. The latter added regular-expression matching for subsequences with ambiguities, which did not affect its performance asymptotically.

Wesselink et al. (2002) used hash tables, and presented a solution to a more general problem — that of identifying the shortest contiguous subsequence that uniquely identifies a given target DNA sequence (this problem may simply be reformulated as an iterative identification of oligonucleotide signatures of size d, where d is varied over the whole range of target sizes, e.g. 10–70) but did not address ambiguities in the target sequences. Their system used hashing with open addressing, and with an additive rehashing strategy, resulting in good computational complexity on average. However, hashing in general is based on the assumption of randomness of the keys, which cannot be guaranteed in general for DNA or RNA sequences, potentially resulting in collisions that degrade the performance of the hash table data structure. Consequently, in the worst case, the performance of a system based on hashing will be comparable to the brute force approach.

Emrich et al. (2003) used a k-mismatch variant (Tarhio & Ukkonen 1993) of the Boyer-Moore substring matching algorithm for finding approximate (to k differences) occurrences of oligonucleotides of size d within a target set T of sequences and a nontarget set NT of sequences. The substring matching algorithm is sub-linear in the average case, leading to an average performance of O(m²n²), better by a factor of d than the brute force approach but still poorly scalable to large data sets.

A number of systems, such as those of Li & Stormo (2001) and Rahmann (2002), used suffix arrays (Manber & Myers 1993), which can be configured for asymptotically optimal performance. Suffix trees (Kurtz 1999) can be similarly used. Kaderali & Schliep (2002) further combined suffix arrays with dynamic programming for eliminating oligonucleotides that are not stable at the experimental temperature.

From a computational point of view, given the sizes of data sets, good asymptotic complexity for oligonucleotide barcode discovery is essential. We propose an algorithm that identifies unique oligonucleotides from large databases taking into consideration the issues described above. Our algorithm does not assess the thermodynamic properties of the oligonucleotides; other software programs are needed for this subsequent step.

Materials and methods

Considering the set S^d of all sequences of size d containing only unambiguous nucleotides from the nucleotide set S = {A,C,G,T}, we define over S^d a trivial (identity) equivalence ‘matching’ relation (=_d) and an arbitrarily chosen ‘lexicographic’ order relation (≤_d):

  • 1. A ≤_1 C ≤_1 G ≤_1 T
  • 2. ∀ x_d, y_d ∈ S^d, where x_d = X x_{d−1}, y_d = Y y_{d−1}, X, Y ∈ S and x_{d−1}, y_{d−1} ∈ S^{d−1}:
    • a. X <_1 Y (that is, X ≤_1 Y and X ≠ Y) ⇒ x_d ≤_d y_d
    • b. X =_1 Y and x_{d−1} ≤_{d−1} y_{d−1} ⇒ x_d ≤_d y_d
    • c. otherwise (if neither case a nor case b applies): x_d ≤_d y_d is false

Observation 1

The identity relation (=_d) is an equivalence relation: it is symmetric, transitive and reflexive.

Observation 2

≤_d is a total order for S^d: it is reflexive, transitive and antisymmetric, and it applies to every two elements of S^d.

For the extended nucleotide set E = S ∪ {R,Y,W,S,M,K,H,B,V,D,N}, ‘ambiguous matching’ (denoted as ≈ or ≈_1) is based on the definition of ambiguous nucleotides, e.g. R is either A or G (A ≈_1 R and G ≈_1 R). We call E^d the set of nucleotide strings of size d containing characters (bases) of E.

Observation 3

The ≈_d relation is not an equivalence relation, since it does not satisfy the transitivity requirement.

Counterexample: A ≈_1 N, C ≈_1 N, yet A ≈_1 C is false.
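The counterexample can be checked mechanically. In the sketch below, the names `IUPAC`, `matches_1` and `matches_d` are hypothetical, and the rule used for two codes (they match when they share at least one unambiguous base) is an assumption that extends the A ≈ R example of the text to pairs of ambiguous codes.

```python
# IUPAC nucleotide codes mapped to the unambiguous bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def matches_1(x, y):
    """Ambiguous matching of two single codes: assumed to hold when they share a base."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))


def matches_d(s, t):
    """Ambiguous matching of two strings of equal size, position by position."""
    return len(s) == len(t) and all(matches_1(a, b) for a, b in zip(s, t))


# A matches N and C matches N, yet A does not match C: transitivity fails.
print(matches_1("A", "N"), matches_1("C", "N"), matches_1("A", "C"))
```

The relation stays reflexive and symmetric under this rule; only transitivity is lost, which is what rules out sorting on it.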

Since S^d supports a total order relation (Observation 2) and an equivalence relation (Observation 1), any set of sequences of size d of unambiguous nucleotides can be sorted.

Since E^d does not support an equivalence relation (Observation 3), sets of sequences of d nucleotides where at least one sequence contains at least one occurrence of an ambiguous nucleotide cannot be sorted (only sets that have both an equivalence relation and a total order relation can be sorted).

Since S^d is a subset of E^d, let us characterize the ambiguity of a given sequence s ∈ E^d as the cardinality of the set {t ∈ S^d | t ≈_d s}, expressed in terms of a unary function z. The ambiguity of an element of E^1 can be calculated as the cardinality of the set of unambiguous nucleotides that the element represents. For example: z(A) = 1 and z(R) = 2. The ambiguity of an element t of E^d can be calculated as z(t) = ∏_{i=1}^{d} z(t_i), where t_i is the i-th element of t.
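The ambiguity function z can be computed directly from its definition. The following sketch is illustrative; `IUPAC`, `z` and `expand` are hypothetical names. `z` multiplies the per-position ambiguities, and `expand` enumerates the unambiguous sequences that a given ambiguous sequence stands for, so the two can be cross-checked against each other.

```python
from itertools import product
from math import prod

# IUPAC nucleotide codes mapped to the unambiguous bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def z(t):
    """Ambiguity of a sequence: product of the ambiguities of its positions."""
    return prod(len(IUPAC[code]) for code in t)


def expand(t):
    """Enumerate every unambiguous sequence represented by t."""
    return ["".join(bases) for bases in product(*(IUPAC[code] for code in t))]


print(z("A"), z("R"), z("ARN"), len(expand("ARN")))
```

Because z grows multiplicatively, a single N-rich window can represent more sequences than the whole candidate set contains, which is why the ambiguity is worth computing before deciding how to process a window.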

A sorting approach

We approach the discovery of unique oligonucleotides as a sorting problem. We propose a method by which we enumerate all subsequences of size d (the target oligonucleotide size) from a set of m sequences of size n, and build a list of all oligonucleotide barcode candidates, at each step eliminating all the ones that match new subsequences. Since S^d supports both a total order relation (≤_d) and an equivalence relation (=_d or =) that is reflective of the annealing support, we can sort any subset G of S^d. Let L = |G| be the cardinality of the subset G. When G holds the subsequences of size d enumerated from m sequences of size n, L = (n – d)m holds; for d << n, L ≈ mn. Also, the complexity of comparing two elements of S^d is O(d).

Efficient sorting methods, such as ‘heap sort’, can be used for sorting G, with computational complexity O(dL · log L) = O(dnm · log(nm)). After sorting, duplicate records can be removed in O(L) = O(nm). Alternatively, all elements of G could be copied into a sort tree data structure, such as an AVL tree (Adel'son-Vel'skii & Landis 1962). AVL trees are balanced binary trees, where the difference between the heights of the subtrees of every node is at most 1. Insertions or deletions can create imbalances, which fall into a finite number of cases. The identification of an imbalance and the subsequent rebalancing are executed efficiently through rotations associated with each imbalance case.

AVL trees have consistent complexity of search O(d log(nm)) and consistent complexity of insertions O(d log(nm)). Alternatives to AVL trees, such as 2–3 trees or red-black trees, can be used with comparable performance; see Knuth (1998).

A list of nm subsequences of length d can be sorted through progressively building an AVL tree from elements of the list. The complexity of the algorithm will be O(dnm log(nm)).

We propose the following ‘oligonucleotide sorting’ algorithm for the discovery of unique oligonucleotide subsequences of size d from m sequences of length n:

Algorithm 1 — Oligonucleotide sorting

  • 1. Enumerate every subsequence of size d from every sequence.
  • 2. Enter subsequences from step 1 above in an AVL tree.
  • 3. During each insertion, perform a linear comparison between every nucleotide of the current subsequence (enumerated in step 1 above) and the nucleotide in the same position of the subsequence stored at each AVL tree node visited in step 2.
  • 4. Mark every subsequence already on the tree that is identical with the current subsequence.
  • 5. List all unmarked elements of the AVL tree. This list represents the list of oligonucleotide barcodes for the set.

The complexity of steps 1 through 4 is O(log(nm)) · O(nm) · O(d) = O(dnm log(nm)), a much better result than the ‘brute force’ algorithm. Also, because of the small asymptotic contribution of the logarithm, this result is asymptotically close to the optimum O(nm). While the latter is achievable using suffix arrays, it is unclear without a direct comparison (not performed here) which method would perform faster on actual data sets.
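A compact sketch of Algorithm 1 is given below for illustration. It is not the SigOli source: the names `Node`, `insert`, `unmarked` and `oligo_sort` are hypothetical, Python string comparison stands in for the ≤_d order (it agrees with A ≤ C ≤ G ≤ T), and the sketch marks a subsequence whenever it is seen again, without tracking which sequence it came from, exactly as steps 1 to 5 are worded.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1
        self.marked = False  # set once an identical subsequence is seen again


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert key (step 2); mark the existing node if key is already stored (step 4)."""
    if node is None:
        return Node(key)
    if key == node.key:  # linear, O(d) comparison of step 3
        node.marked = True
        return node
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:  # left-heavy: single or double rotation
        if height(node.left.left) < height(node.left.right):
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:  # right-heavy: single or double rotation
        if height(node.right.right) < height(node.right.left):
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def unmarked(node):
    """In-order list of the keys that were never marked (step 5)."""
    if node is None:
        return []
    own = [] if node.marked else [node.key]
    return unmarked(node.left) + own + unmarked(node.right)


def oligo_sort(sequences, d):
    root = None
    for seq in sequences:
        for p in range(len(seq) - d + 1):  # step 1
            root = insert(root, seq[p:p + d])
    return unmarked(root)


print(oligo_sort(["ACGT", "CGTA"], 3))
```

Each insertion visits at most a logarithmic number of nodes and performs one O(d) comparison per node visited, which matches the O(dnm log(nm)) bound stated above.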

Unfortunately, from Observation 3, we can determine that an arbitrary set of ambiguous sequences of size n containing elements of E^d \ S^d cannot be sorted, which makes Algorithm 1 inapplicable to sequences containing ambiguous nucleotides. We propose the following algorithm for such situations:

Algorithm 2 — Ambiguous oligonucleotide sorting

  • 1. Enumerate every unambiguous subsequence of size d from every sequence.
  • 2. Enter subsequences from step 1 above in an AVL tree; for each node, identify the sequence that the subsequence originates from.
  • 3. During each insertion, perform a linear comparison between every nucleotide of the current subsequence (enumerated in step 1 above) and the nucleotide in the same position of the subsequence stored at each AVL tree node visited in step 2.
  • 4. Remove every subsequence already on the tree that is identical with the current subsequence.
  • 5. For every ambiguous subsequence sa of size d from every sequence:
    • a. Calculate the degree of ambiguity z of the ambiguous subsequence.
    • b. If z(sa) is less than the number of nodes in the tree, then enumerate all unambiguous subsequences matching sa; remove each such subsequence from the tree.
    • c. Otherwise, traverse the whole tree and remove every oligonucleotide barcode candidate from the tree which matches the ambiguous subsequence.
  • 6. List all unmarked elements of the AVL tree. This list represents the list of oligonucleotide barcodes for the set.
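The control flow of Algorithm 2, in particular the choice between steps 5.b and 5.c, can be sketched as follows. This is an illustration under several assumptions, not the SigOli implementation: a Python dict stands in for the AVL tree purely for brevity (it does not carry the tree's worst-case guarantee), the names `ambiguous_oligo_sort`, `matches` and `rejected` are hypothetical, repeats inside the same sequence are tolerated on the basis of the origin recorded in step 2, and the `rejected` set is an added safeguard so that a removed subsequence is not re-entered by a later occurrence.

```python
from itertools import product
from math import prod

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
UNAMBIGUOUS = set("ACGT")


def matches(candidate, pattern):
    """True when every base of the unambiguous candidate is allowed by the pattern."""
    return all(base in IUPAC[code] for base, code in zip(candidate, pattern))


def ambiguous_oligo_sort(sequences, d):
    tree = {}         # candidate d-mer -> index of its originating sequence (step 2)
    rejected = set()  # d-mers already removed in step 4 (assumed safeguard)
    ambiguous = []    # ambiguous windows, processed in step 5
    for i, seq in enumerate(sequences):
        for p in range(len(seq) - d + 1):
            sub = seq[p:p + d]
            if not set(sub) <= UNAMBIGUOUS:
                ambiguous.append(sub)
            elif sub in rejected:
                continue
            elif sub in tree and tree[sub] != i:
                del tree[sub]          # step 4: seen in another sequence
                rejected.add(sub)
            else:
                tree[sub] = i          # steps 1-2
    for sa in ambiguous:
        degree = prod(len(IUPAC[code]) for code in sa)        # step 5.a
        if degree < len(tree):                                # step 5.b
            hits = ["".join(t) for t in product(*(IUPAC[c] for c in sa))]
        else:                                                 # step 5.c
            hits = [cand for cand in tree if matches(cand, sa)]
        for hit in hits:
            tree.pop(hit, None)
    # Step 6: report the surviving candidates per originating sequence.
    return {i: sorted(k for k, origin in tree.items() if origin == i)
            for i in range(len(sequences))}


print(ambiguous_oligo_sort(["ACGTT", "GGCCA", "ACR"], 3))
```

In the example call, the window ACR has ambiguity 2, which is below the six candidates stored, so step 5.b enumerates ACA and ACG and removes ACG from the candidates of the first sequence.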

The algorithm has three extreme scenarios:

  • The ambiguity degree z(t) of every ambiguous subsequence t is low: z(t) << nm. The complexity will be comparable with the complexity of Algorithm 1 (Oligonucleotide sorting), and step 5.b will be executed in all situations.
  • The number a of ambiguous subsequences is low: a << nm. The complexity will be comparable with the complexity of Algorithm 1 (Oligonucleotide sorting), and step 5.c will be executed a times.
  • The worst case, where neither of the above is true. The complexity of the algorithm is similar to that of the brute force approach.

A detailed analysis of the computational complexity for the average case of the algorithm is not provided here. However, for a given data set, z and a can be calculated in time linear in the size of the data set, providing very good predictability of the computational effort required for running the algorithm.

Algorithm 3 — Ambiguous oligonucleotide sorting to find group or clade barcode oligonucleotides

For large sets of related genomic sequences, it is possible that oligonucleotide barcodes cannot be found for many of the individual sequences in the set. Even when barcodes are found at the highest resolution level, it would still be desirable to have barcodes for higher hierarchical levels, for example at each node of a tree. The association of related sequences into evolutionarily significant groups or clades can be attempted and the ‘ambiguous oligonucleotide sorting’ algorithm can then be modified to accommodate such groups.

We define a group oligonucleotide barcode of size d as a contiguous sequence of size d that occurs in every member of a group and does not occur anywhere else in the sequence set. For each sequence in the group, oligonucleotide barcodes can be found using the ‘ambiguous oligonucleotide sorting’ algorithm for sequences in the group. The computational complexity of the algorithm will be unaffected, asymptotically.

  • 1. Enumerate every unambiguous subsequence of size d from the group.
  • 2. Add the subsequence to an AVL tree; include a ‘group counter’ for each node.
  • 3. For every sequence from the group:
    • a. Enumerate every subsequence from the sequence.
    • b. If the subsequence occurs on the AVL tree, increment the group counter, then continue with the next sequence.
  • 4. Enumerate every unambiguous subsequence from every sequence not in the group:
    • a. If the subsequence occurs in the AVL tree, remove it.
  • 5. For every ambiguous subsequence sa of size d from every sequence:
    • a. Calculate the degree of ambiguity z of the ambiguous subsequence.
    • b. If z(sa) is less than the number of nodes in the AVL tree, then enumerate all unambiguous subsequences matching sa; remove each such subsequence from the tree.
    • c. Otherwise, traverse the whole tree and remove every oligonucleotide barcode candidate from the tree which matches the ambiguous subsequence.
  • 6. The subsequences remaining in the AVL tree are the group oligonucleotide barcodes.
  • 7. Calculate oligonucleotide barcodes for every sequence in the group using Algorithm 2, that is, perform ambiguous oligonucleotide sorting using only sequences from the group. This list represents the list of within-group oligonucleotide barcodes for the group.
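Steps 1 to 6 of Algorithm 3 can be sketched along the same lines. The code is an illustration only: `group_barcodes` and `kmers` are hypothetical names, a dict again stands in for the AVL tree, and the final filter (keeping candidates whose group counter equals the number of group members) is our reading of how the counter enforces the definition of a group barcode. Step 7 is not shown, since it amounts to running Algorithm 2 on the group alone.

```python
from itertools import product
from math import prod

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
UNAMBIGUOUS = set("ACGT")


def kmers(seq, d):
    """Every window of size d in seq."""
    return [seq[p:p + d] for p in range(len(seq) - d + 1)]


def group_barcodes(group, others, d):
    counter = {}  # candidate d-mer -> number of group members containing it
    # Steps 1-3: collect unambiguous d-mers of the group, counting each member once.
    for seq in group:
        for sub in set(kmers(seq, d)):
            if set(sub) <= UNAMBIGUOUS:
                counter[sub] = counter.get(sub, 0) + 1
    # Step 4: drop candidates that also occur outside the group.
    for seq in others:
        for sub in kmers(seq, d):
            if set(sub) <= UNAMBIGUOUS:
                counter.pop(sub, None)
    # Step 5: drop candidates matched by any ambiguous window of any sequence.
    for seq in list(group) + list(others):
        for sa in kmers(seq, d):
            if set(sa) <= UNAMBIGUOUS:
                continue
            degree = prod(len(IUPAC[code]) for code in sa)          # step 5.a
            if degree < len(counter):                               # step 5.b
                hits = ["".join(t) for t in product(*(IUPAC[c] for c in sa))]
            else:                                                   # step 5.c
                hits = [cand for cand in counter
                        if all(b in IUPAC[c] for b, c in zip(cand, sa))]
            for hit in hits:
                counter.pop(hit, None)
    # Step 6: keep candidates present in every member of the group (assumed filter).
    return sorted(sub for sub, count in counter.items() if count == len(group))


print(group_barcodes(["AACGT", "TACGA"], ["CGTTT", "GGGAC"], 3))
```

In the example call, ACG is the only 3-mer shared by both group members and absent from the other sequences.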

All presented algorithms can be run for only one or for a number of target oligonucleotide sizes, increasing their complexity by that number (a constant factor). The algorithms can also be repeated for each group in the set of sequences. Sequences in every group will be uniquely identified using two oligonucleotide barcodes: the group oligonucleotide barcode, and the within-group oligonucleotide barcode. We use the oligonucleotide locations found by SigOli and run Array Designer 1.1 to find oligonucleotides at these locations with desired thermodynamic properties (Chen et al. 2009). A detailed explanation of how to design oligonucleotides with SigOli and Array Designer 1.1 was also provided in Seifert & Lévesque (2004). There are other third party software systems that could evaluate the oligonucleotides selected by SigOli with respect to such properties.

Results and discussion

An open source command-line utility (SigOli for Signature Oligonucleotides) which provides implementations in C++ of algorithms 2, 3 and 4 is available at the web site http://www.lifeintel.org and can be distributed and used under the GNU public license. SigOli has been compiled and tested on a variety of platforms (Windows, Linux, Solaris). The full data set and analyses from the study by Chen et al. (2009) are also available at this website.

In order to evaluate whether these new algorithms can find clade and species specific oligonucleotides in a sequence data set, we used sequences of Pythium species to see if the algorithm could find the positions of oligonucleotides already published and validated (Tambong et al. 2006). These oligonucleotides were found manually by looking at a large multiple alignment. For each oligonucleotide in this study, individual folders were created and each contained one folder with the internal transcribed spacer (ITS) sequences (c. 1000 bp) from Lévesque & de Cock (2004) that matched the oligonucleotide and another one with the other nonmatching sequences. It took 7 min for a Dell laptop D610 (1.86 GHz) running Windows XP to process the 170 folders that had the matching and nonmatching nested folders. Figure 1 shows the distribution of the differences between the middle position of each published and validated oligonucleotide and the middle position selected by SigOli after finding 20-mers. SigOli predicted the oligonucleotide locations very accurately. The shifts in the positions are due to the fact that the final oligonucleotides often have to be moved left or right from the centre in order to minimize hairpins and dimers. When we use Array Designer to finalize the oligonucleotides, such shifts also occur for optimal design. SigOli found an average of 13 positions for each query. In nine cases, only one was found, which corresponded to the oligonucleotide position selected manually. In a few cases, the oligonucleotide location selected manually was not found by SigOli. When we changed the oligonucleotide length for the search, the right position was ultimately found. It took several weeks of manual analyses to find the oligonucleotides in Tambong et al. (2006), whereas the analysis with SigOli found the same sites and many more potential ones within minutes. We are now designing arrays using SigOli combined with Array Designer for final selection.

Figure 1.

Frequency distribution of pairwise differences in oligonucleotide barcode position between the ones published and validated in Tambong et al. (2006) and the matching ones found by SigOli. The pairwise positions were determined by the central base pair (bp) location of each oligonucleotide against the same target strain.

In a more recent project, Chen et al. (2009) used SigOli and Array Designer 1.1 to design species and clade specific barcode oligonucleotides based upon a previous barcode study database of 358 sequences at the 5′ end of COX1, representing 58 species of Penicillium subgenus Penicillium and 12 allied species (Seifert et al. 2007). SigOli was used to identify the oligonucleotides at various specificity levels. All COX1 sequences were first regrouped into 16 folders (each containing several subfolders) according to their resolution within the NJ tree (Seifert et al. 2007). For a given resolution folder, each subfolder contained all the sequences of a given node. Under each folder, SigOli searched for unique oligonucleotide positions within each subfolder, i.e. positions where the selected oligonucleotides occurred in no other subfolder while maintaining a perfect match within all sequences of the subfolder. For the subgenus Penicillium COX1 data set, SigOli generated an Excel output file containing 422 potential positions, which were then used as an input file for Array Designer 1.1. This step was necessary to adjust the oligonucleotide length because of variation in A/T and G/C content and to reject unsuitable oligonucleotides because of strong hairpins. By viewing the BLAST results given by Array Designer, we selected approximately 180 perfectly matched oligos (20–41 bp in length) which were all unique within each target clade. These oligonucleotides were synthesized, spotted onto an array, and are being tested by hybridization of labelled PCR products from pure culture and environmental samples to the array. See Chen et al. (2009) in this special issue for more details.

Some searches for oligonucleotides for selected mycotoxin-producing species of Penicillium and Fusarium were designed the same way with other genes, and these results were published (Seifert & Lévesque 2004).

As another test for SigOli, close to 3000 ITS sequence Fasta files of fungi from GenBank were organized into 679 folders to test the speed of the algorithm further. It took 17 s to search this database for specific 20 bp oligonucleotide positions and 19 s for oligonucleotide strings on a hardware configuration similar to the one described above.

A number of algorithms for the discovery of barcode oligonucleotides applicable to a variety of conditions were presented, namely unstructured sequence sets, groups and ambiguous nucleotides. These algorithms have very good asymptotic performance or, in the case of sequences containing ambiguous nucleotides, predictable running time. The SigOli utility has been used successfully as part of a barcode oligonucleotide microarray design process.

GenBank is growing exponentially and it is now possible to microfabricate microarrays at very high density. A key bottleneck in designing microarrays with a large number of species or SNPs to be detected is the selection of the appropriate oligonucleotides. Multiple alignments no longer provide a suitable tool to find specific oligonucleotides within increasingly large data sets, especially if intron-rich regions are being analysed. By using the grouping capability of this new algorithm, it is possible to find conserved oligonucleotides that differentiate all sequences belonging to a group from any other sequences within the database. If there is significant variation within a group of sequences, it becomes almost impossible to manually identify group-specific oligonucleotides that differentiate them from a large pool of other sequences.

Alignment algorithms also illustrate a common difficulty with much sequence analysis software: the processing time scales poorly with sequence length or number of sequences. We have provided an algorithm for which processing time increases linearly (bound by a constant for naturally occurring problems) with the amount of data to be analysed.


The development of the algorithms presented here was supported by a grant from the Canadian Biotechnology Strategy program held by C.A. Lévesque. The current work by C.A. Lévesque and W. Chen on the use of DNA barcodes for developing DNA arrays is supported through funding to the Canadian Barcode of Life Network from Genome Canada through the Ontario Genomics Institute, NSERC, and other sponsors listed at http://www.BOLNET.ca. Veronica Dahl gratefully acknowledges support from NSERC's Discovery grant 31611024 and from the European Commission's Marie Curie Chair of Excellence. We are also thankful to two anonymous reviewers for their very insightful comments to improve our manuscript.

Conflict of interest statement

The authors have no conflict of interest to declare and note that the funders of this research had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.