The multiple alignments of very short sequences

Abstract Multiple sequence alignment (MSA) is an increasingly important task in bioinformatics, as gene and protein sequence databases keep growing. MSA is applied in phylogenetic analysis, in discovering conserved protein domains, in assigning secondary and tertiary structural features of proteins, and in metagenomic sample analysis and gene discovery. The focus is usually on the MSA of long sequences, since these tasks arise most frequently in practice. However, the rigorous analysis of the optimal MSA of short sequences is a neglected area, and findings there may contribute to better and faster algorithms for the multiple alignment of long sequences. In the present contribution, we examine length-1 sequences using an arbitrary metric and length-2 sequences using the unit metric, and we show that the optimum of the MSA problem is attained by the trivial alignment in both cases.


Definitions and notations
Definition 1 Let $\Sigma = \{a_1, \ldots, a_n\}$ be a finite alphabet; a string over $\Sigma$ is called a sequence. The pair of sequences $(s'_1, s'_2)$ is an alignment of the sequences $s_1$ and $s_2$ if, for $i = 1, 2$, $s'_i$ is obtained from $s_i$ by inserting gaps (spaces, denoted by -) into or at either end of $s_i$, and after that $s'_1$ and $s'_2$ have the same length. It is assumed that "-" is not an element of the alphabet $\Sigma$.
The alignment of Definition 1 consists of two sequences of the same length. Consequently, every character of $s'_1$ corresponds uniquely to a character of $s'_2$, namely the one at the same position.
Let $\ell$ be the common length of $s'_1$ and $s'_2$. The cost of this alignment is

$$\sum_{i=1}^{\ell} d\big(s'_1(i), s'_2(i)\big),$$

where $d$ is a score scheme over $\Sigma \cup \{-\}$, and $s'_j(i)$ is the $i$th character of $s'_j$. The score scheme is usually required to be a metric on the set $\Sigma \cup \{-\}$, that is, it needs to satisfy $\forall u$: $d(u, u) = 0$; $\forall u, v$: $d(u, v) = d(v, u)$; and the triangle inequality: $\forall u, v, w$: $d(u, v) + d(v, w) \geq d(u, w)$. A frequently used score scheme is the unit metric, where $d(u, v) = 0$ if $u = v$ and 1 otherwise. We call an alignment of two sequences optimal if its cost is minimal among all possible alignments.
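As an illustration, the pairwise cost defined above can be sketched in Python. The names `unit_metric` and `pair_cost`, as well as the sample rows, are illustrative assumptions and do not come from the original text; this is a minimal sketch, not a definitive implementation.

```python
GAP = "-"  # gap symbol, assumed not to be a letter of the alphabet


def unit_metric(u: str, v: str) -> int:
    """Unit metric on the alphabet plus the gap symbol: 0 if equal, 1 otherwise."""
    return 0 if u == v else 1


def pair_cost(row1: str, row2: str, d=unit_metric) -> int:
    """Cost of a pairwise alignment: the sum of d over the aligned positions."""
    if len(row1) != len(row2):
        raise ValueError("aligned sequences must have the same length")
    return sum(d(a, b) for a, b in zip(row1, row2))


# Two gap-padded rows of common length 4 (hypothetical example rows).
print(pair_cost("CCG-", "-CGC"))  # 2
```

Any other score scheme can be passed as `d`, provided it is defined on every pair of symbols, including the gap.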
The definition of aligning two sequences can easily be generalized to more strings: let $k \geq 2$ be an integer, and suppose that we want to align the sequences $s_1, s_2, \ldots, s_k$. Let us insert gaps into or at either end of the strings $s_1, s_2, \ldots, s_k$ so that they have the same length $\ell$, and write the $k$ resulting sequences $s'_1, s'_2, \ldots, s'_k$, each of length $\ell$, under one another in the proper order. This table can be considered a matrix of size $k \times \ell$, and it is called a multiple alignment of the sequences $s_1, s_2, \ldots, s_k$. Different scoring methods can be applied to multiple alignments; perhaps the most often used one is the sum-of-pairs method, where the cost is the sum of the costs of the alignments of the $\binom{k}{2}$ pairs of aligned sequences. More exactly, if $s_1, \ldots, s_k$ are the sequences to be aligned and $\mathcal{A}$ is a multiple alignment of them with rows $s'_1, \ldots, s'_k$, then their sum-of-pairs cost [27] is

$$\mathrm{cost}(\mathcal{A}) = \sum_{1 \leq i < j \leq k} \; \sum_{m=1}^{\ell} d\big(s'_i(m), s'_j(m)\big).$$
Examples. (i) Let $S := \{\mathrm{CCG}, \mathrm{GCG}, \mathrm{CGC}\}$. The following set of sequences is a multiple alignment $\mathcal{A}$ of $S$:

CCG-
GCG-
-CGC

Using the unit metric and computing the costs of the columns, $\mathrm{cost}(\mathcal{A}) = 3 + 0 + 0 + 2 = 5$.
(ii) Let $\Sigma$ now contain only two characters (C and G) with the following metric: [metric table missing in the source]. Using the given metric, the cost of the alignment is equal to $2 + 2 + 2 = 6$.
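The sum-of-pairs cost of Example (i) can be checked column by column with a short Python sketch. The rows `CCG-`, `GCG-`, `-CGC` are an assumption here: they are the placement of the three sequences on four columns that reproduces the column costs 3, 0, 0, 2 quoted in the example. The function names are illustrative only.

```python
from itertools import combinations


def unit_metric(u, v):
    """Unit metric: 0 for equal symbols (gap included), 1 otherwise."""
    return 0 if u == v else 1


def column_costs(rows, d=unit_metric):
    """Sum-of-pairs cost of every column of a multiple alignment."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    return [sum(d(a, b) for a, b in combinations(col, 2)) for col in zip(*rows)]


def sp_cost(rows, d=unit_metric):
    """Sum-of-pairs cost of the whole alignment (sum over the columns)."""
    return sum(column_costs(rows, d))


alignment = ["CCG-", "GCG-", "-CGC"]      # assumed rows for Example (i)
print(column_costs(alignment))            # [3, 0, 0, 2]
print(sp_cost(alignment))                 # 5
print(sp_cost(["CCG", "GCG", "CGC"]))     # 6 (gap-free alignment, unit metric)
```

Summing over columns gives the same value as summing the pairwise costs over the $\binom{k}{2}$ pairs of rows, since both count every aligned pair of symbols exactly once.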

2 | ALIGNMENT FOR LENGTH-1 SEQUENCES
In this section, we focus on aligning length-1 sequences (equivalently, characters of $\Sigma$). An important earlier result needs to be quoted here [28]:
Theorem 2 (Lemma 3) Let $U$ be a subset of a set $S$ of sequences over $\Sigma$ such that $U$ contains only identical sequences, and let $\mathcal{A}$ be an optimal alignment of $S$. Let $\mathcal{A}_U$ denote the restriction of $\mathcal{A}$ to the rows of $U$. Then $\mathrm{cost}(\mathcal{A}_U) = 0$.
An important corollary of this theorem is the following: it is enough to examine sets of pairwise different sequences, because in each optimal alignment every instance of a given sequence is aligned identically.
The next definition will be used frequently throughout this work:
Definition 3 Let $S$ be a set of sequences that have the same length. $\mathcal{T}$ is called the trivial alignment of $S$ if $\mathcal{T}$ is constructed by writing the sequences one under another, without using any gaps.

TAKÁCS And GROLMUSZ
2.1 | Multiple sequence alignment for length-1 sequences using unit metric
The main result of this subsection is the next theorem:
Theorem 4 Using the unit metric, there cannot be a multiple sequence alignment of length-1 sequences that has cost less than the cost of their trivial alignment. Additionally, if we align $k$ pairwise different length-1 sequences, then the cost of an optimal alignment is $\binom{k}{2}$.
Proof By Theorem 2, we may assume that the characters to be aligned are pairwise different. It is easy to see that the trivial alignment of $k$ different characters has a cost of $\binom{k}{2}$. Let us suppose that this alignment is not optimal; then the common length $\ell$ of the aligned sequences must be at least 2 in an optimal alignment. If $\ell \geq 2$, then the general structure of the $k \times \ell$ matrix of this multiple alignment is as follows: $\forall i$: column $i$ contains $k_i$ characters, where $k_1 + \cdots + k_\ell = k$, and they are placed so that in each row there is only one character and $\ell - 1$ gaps (see Table 1). Obviously, the cost of the first column is

$$\binom{k_1}{2} + k_1 (k - k_1),$$

since there are $k_1$ different characters with cost $\binom{k_1}{2}$, and besides that, each of the $k - k_1$ gaps increases the cost by one with every alphabetical character. A similar statement is true for every column, so the cost of this alignment is

$$\sum_{i=1}^{\ell} \left[ \binom{k_i}{2} + k_i (k - k_i) \right] = k^2 - \frac{k}{2} - \frac{1}{2} \sum_{i=1}^{\ell} k_i^2.$$

Consequently, the cost above is minimized when $\sum_i k_i^2$ is maximal. Since $\sum_i k_i^2 \leq \left(\sum_i k_i\right)^2 = k^2$, the cost of this alignment cannot be less than $k^2 - k/2 - k^2/2 = (k^2 - k)/2 = \binom{k}{2}$, that is, the cost of the trivial alignment.
Note: From the proof, it is also clear (by minimizing $\sum_i k_i^2$) how large the cost can be if the length of the aligned sequences is $\ell$. Since $\ell \leq k$, the cost can be at most $k^2 - k$, and this limit is reached if there is only one character in every column and in every row; then the cost is $k(k - 1) = k^2 - k$.
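The counting argument of Theorem 4 and of the Note can be checked by brute force for small $k$. The sketch below is only an illustration under stated assumptions: it enumerates every placement of $k$ distinct characters on at most $k$ columns, skips alignments with an all-gap column, and compares the sum-of-pairs cost with the closed form $k^2 - k/2 - \frac{1}{2}\sum_i k_i^2$. All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb

GAP = "-"


def unit_metric(u, v):
    return 0 if u == v else 1


def sp_cost(rows, d=unit_metric):
    """Sum-of-pairs cost, accumulated column by column."""
    return sum(d(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


def length1_alignments(chars):
    """Yield (rows, column_counts) for every alignment of the given characters."""
    k = len(chars)
    for width in range(1, k + 1):
        for placement in product(range(width), repeat=k):
            if len(set(placement)) < width:
                continue  # skip alignments that contain an all-gap column
            rows = [GAP * pos + ch + GAP * (width - pos - 1)
                    for ch, pos in zip(chars, placement)]
            counts = [placement.count(col) for col in range(width)]
            yield rows, counts


def cost_range(k):
    """Return (min, max) cost over all alignments of k distinct characters."""
    chars = [chr(ord("a") + i) for i in range(k)]
    costs = []
    for rows, counts in length1_alignments(chars):
        cost = sp_cost(rows)
        # Closed form from the proof, doubled so that it stays in integers.
        assert 2 * cost == 2 * k * k - k - sum(c * c for c in counts)
        costs.append(cost)
    return min(costs), max(costs)


print(cost_range(4), comb(4, 2), 4 * 4 - 4)  # (6, 12) 6 12
```

The enumeration grows quickly with $k$, so the check is only practical for very small instances; it complements the proof and does not replace it.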

2.2 | Alignment for length-1 sequences using arbitrary metric
In this subsection, it will be shown that for length-1 sequences, we can use any metric as a score scheme, and the MSA problem still remains as easy as in the case of the unit metric.
Theorem 5 Using an arbitrary metric, the minimum cost of the multiple sequence alignment of length-1 sequences is attained by the trivial alignment, and if $k$ different sequences $a_1, \ldots, a_k$ are aligned, then the optimal cost is equal to $C := \sum_{1 \leq i < j \leq k} d(a_i, a_j)$.
Proof Because of Theorem 2, it can be assumed again that every sequence has exactly one instance in the set $S$ of sequences to be aligned. If we consider the trivial alignment of $S$, it is easy to see that its cost is equal to $C$. Induction on the number of columns in an MSA will be used to show that no alignment can have lower cost than $C$.
Assume that the trivial alignment is not optimal, and let $\mathcal{A}$ denote an optimal alignment. Since $\mathcal{A}$ is not the trivial alignment, it has $\ell$ columns, where $\ell \geq 2$. First, it is shown that $\mathcal{A}$ cannot have exactly two columns, because in this case the cost of the trivial alignment would not exceed the cost of $\mathcal{A}$.
Let us assume to the contrary that $\mathcal{A}$ has exactly two columns; so there are $k_1$ sequences in the first column and $k_2$ in the second column, where $k_1 + k_2 = k$, and there is exactly one character in each row (since the sequences to be aligned have length 1; see Table 2).
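We assume, without loss of generality, that the sequences in the first column are $a_1, a_2, \ldots, a_{k_1}$ and all other sequences are placed in the second column. If the cost of the first column of $\mathcal{A}$ is denoted by $\mathrm{cost}(1)$, then

$$\mathrm{cost}(1) = \sum_{1 \leq i < j \leq k_1} d(a_i, a_j) + k_2 \sum_{i=1}^{k_1} d(a_i, -).$$

Similarly, the cost of the second column is

$$\mathrm{cost}(2) = \sum_{k_1 < i < j \leq k} d(a_i, a_j) + k_1 \sum_{j=k_1+1}^{k} d(a_j, -),$$

and $\mathrm{cost}(\mathcal{A}) = \mathrm{cost}(1) + \mathrm{cost}(2)$.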
A lower bound for $\mathrm{cost}(\mathcal{A})$ can be determined by pairing the $d(a_i, -)$ summands in $\mathrm{cost}(1)$ with the summands of the same form in $\mathrm{cost}(2)$ and using the triangle inequality. For example, for a fixed $i$ ($1 \leq i \leq k_1$) and $\forall j$ ($k_1 < j \leq k$):

$$d(a_i, -) + d(-, a_j) \geq d(a_i, a_j).$$

It is useful to notice that the summands on the right side of this inequality are exactly those that are not included in $\mathrm{cost}(1)$ when we consider summands of the form $d(a_i, a_j)$ for this fixed $i$.
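By considering this inequality for every $1 \leq i \leq k_1$, the following lower bound can be given:

$$k_2 \sum_{i=1}^{k_1} d(a_i, -) + k_1 \sum_{j=k_1+1}^{k} d(a_j, -) \geq \sum_{i=1}^{k_1} \sum_{j=k_1+1}^{k} d(a_i, a_j).$$

This implies that

$$\mathrm{cost}(\mathcal{A}) = \mathrm{cost}(1) + \mathrm{cost}(2) \geq \sum_{1 \leq i < j \leq k} d(a_i, a_j) = C.$$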
It was assumed that the trivial alignment with cost $C$ is not optimal; therefore, $\mathcal{A}$ cannot be an optimal alignment of $S$. By this contradiction, it is proved that an optimal alignment of $S$ cannot have exactly 2 columns.
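The two-column bound can be tried numerically. The following sketch is an illustration under assumptions that are not in the original text: the score scheme is the Manhattan distance between randomly chosen integer points (one point per character, one for the gap), which satisfies the triangle inequality, and the names `make_l1_metric` and `two_column_costs` are hypothetical.

```python
from itertools import combinations
import random


def make_l1_metric(points):
    """Return d(u, v) = Manhattan distance between the points assigned to u and v."""
    def d(u, v):
        (x1, y1), (x2, y2) = points[u], points[v]
        return abs(x1 - x2) + abs(y1 - y2)
    return d


def two_column_costs(chars, k1, d, gap="-"):
    """Return cost(1), cost(2) and C when chars[:k1] sit in column 1, the rest in column 2."""
    first, second = chars[:k1], chars[k1:]
    # Character-character pairs inside the column plus character-gap pairs.
    cost1 = (sum(d(a, b) for a, b in combinations(first, 2))
             + len(second) * sum(d(a, gap) for a in first))
    cost2 = (sum(d(a, b) for a, b in combinations(second, 2))
             + len(first) * sum(d(a, gap) for a in second))
    trivial = sum(d(a, b) for a, b in combinations(chars, 2))
    return cost1, cost2, trivial


rng = random.Random(0)  # fixed seed so that the run is reproducible
chars = list("abcdef")
points = {s: (rng.randint(0, 9), rng.randint(0, 9)) for s in chars + ["-"]}
d = make_l1_metric(points)
results = [two_column_costs(chars, k1, d) for k1 in range(1, len(chars))]
for cost1, cost2, trivial in results:
    print(cost1 + cost2, ">=", trivial)
```

If two symbols happen to receive the same point, `d` is only a pseudometric, which does not affect the bound, since the argument uses nothing beyond the triangle inequality.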
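Using induction, we assume that it has been shown $\forall i$: $2 \leq i < \ell$ that an optimal alignment cannot have exactly $i$ columns, and let $\mathcal{A}$ be an optimal alignment with $\ell$ columns. Consider the first two columns of $\mathcal{A}$: there are $k_1$ sequences in the first column and $k_2$ sequences in the second one; without loss of generality, these are $a_1, \ldots, a_{k_1}$ and $a_{k_1+1}, \ldots, a_{k_1+k_2}$, respectively. It is enough to prove that by merging these two columns, the cost of the new alignment does not exceed the cost of $\mathcal{A}$. The cost of these columns (see Table 3) in $\mathcal{A}$ is equal to

$$\sum_{1 \leq i < j \leq k_1} d(a_i, a_j) + (k - k_1) \sum_{i=1}^{k_1} d(a_i, -) + \sum_{k_1 < i < j \leq k_1 + k_2} d(a_i, a_j) + (k - k_2) \sum_{j=k_1+1}^{k_1+k_2} d(a_j, -).$$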

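Let us focus on the first $k' = k_1 + k_2$ rows of these columns. They form an alignment of $a_1, a_2, \ldots, a_{k'}$ on two columns, and it was shown that if these sequences are aligned trivially instead of using two columns, then the cost of the alignment cannot be higher. It means the following:

$$\sum_{1 \leq i < j \leq k_1} d(a_i, a_j) + (k - k_1) \sum_{i=1}^{k_1} d(a_i, -) + \sum_{k_1 < i < j \leq k'} d(a_i, a_j) + (k - k_2) \sum_{j=k_1+1}^{k'} d(a_j, -) \geq \sum_{1 \leq i < j \leq k'} d(a_i, a_j) + (k - k') \sum_{i=1}^{k'} d(a_i, -).$$

On the left side of this inequality, there is the cost of the first two columns of $\mathcal{A}$, while on the right side, there is the cost of the column that is constructed by merging the first two columns of $\mathcal{A}$. Therefore, a lower bound for $\mathrm{cost}(\mathcal{A})$ is given by the cost of an alignment that has $\ell - 1$ columns, which is not optimal by the induction hypothesis, implying that $\mathcal{A}$ is not optimal either. ∎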
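The merging step of the induction can be illustrated with a small Python sketch. The helper `merge_first_two_columns` and the sample metric (absolute difference of positions on a line, with a hypothetical position for the gap) are assumptions introduced for this illustration only.

```python
from itertools import combinations


def sp_cost(rows, d):
    """Sum-of-pairs cost, accumulated column by column."""
    return sum(d(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


def merge_first_two_columns(rows, gap="-"):
    """Merge columns 1 and 2 of an alignment of length-1 sequences.

    Every row holds a single character, so at most one of its first two
    cells is a non-gap symbol and the merge is always well defined.
    """
    merged = []
    for row in rows:
        head = row[0] if row[0] != gap else row[1]
        merged.append(head + row[2:])
    return merged


# Hypothetical metric: symbols are points on a line, d is their distance.
pos = {"a": 0, "b": 2, "c": 5, "d": 9, "-": 4}


def d(u, v):
    return abs(pos[u] - pos[v])


rows = ["a--", "-b-", "--c", "--d"]
costs = [sp_cost(rows, d)]
while len(rows[0]) > 1:
    rows = merge_first_two_columns(rows)
    costs.append(sp_cost(rows, d))
print(costs)  # [34, 30, 30]
```

In this run the cost never increases from one merge to the next, and the last alignment is the trivial one, in line with the inequality used in the proof.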

3 | MULTIPLE SEQUENCE ALIGNMENT FOR LENGTH-2 SEQUENCES
In this section, it will be shown that, using the unit metric, a set of length-2 sequences cannot be aligned with lower cost than that of their trivial alignment; however, this statement does not hold for an arbitrary metric.

Theorem 6 Using the unit metric, no multiple sequence alignment of length-2 sequences has lower cost than their trivial alignment. If we align $k$ different sequences $s_1 = a_{i_1} a_{i_{k+1}}$, $s_2 = a_{i_2} a_{i_{k+2}}$, …, $s_k = a_{i_k} a_{i_{2k}}$, then the cost of the optimal alignment is

$$\sum_{1 \leq j < m \leq k} \left[ d\big(a_{i_j}, a_{i_m}\big) + d\big(a_{i_{k+j}}, a_{i_{k+m}}\big) \right].$$
Proof Let $S$ denote the set of sequences to be aligned. It is clear that the trivial alignment of $S$ has the cost written above, so this bound is attained. In other words, it is enough to prove that for any $S$, a nontrivial alignment cannot have lower cost than the trivial one.

Let $\mathcal{A}$ be an alignment of $S$ on $\ell$ columns, where $\ell \geq 3$. Let the rows of $\mathcal{A}$ be permuted so that those aligned sequences in which the column indices of the two non-gap characters are the same are placed under each other, forming a block of sequences. This operation does not change the cost of $\mathcal{A}$. In every row of $\mathcal{A}$, there are exactly two characters and $\ell - 2$ gaps, so there can be $\binom{\ell}{2}$ types of aligned sequences in $\mathcal{A}$, considering only the positions of the non-gap characters in a row. This implies that there will be $\binom{\ell}{2}$ (not necessarily nonempty) blocks after permuting the rows of $\mathcal{A}$ (e.g., if $\ell = 4$, then there are $\binom{4}{2} = 6$ blocks after the permutation of the rows; see Table 4).
After making this block setting, it is clear that there are six types of aligned character pairs in $\mathcal{A}$:
1. first characters of some sequences aligned with other sequences' first characters;
2. first characters of some sequences aligned with other sequences' second characters;
3. first characters of some sequences aligned with gaps;
4. second characters of some sequences aligned with other sequences' second characters;
5. second characters of some sequences aligned with gaps;
6. gaps aligned with gaps.
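The block setting described above can be sketched in Python: each aligned row is keyed by the column indices of its two non-gap characters, and rows with the same key form one block. The function names are hypothetical and the sample rows are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_key(row, gap="-"):
    """Column indices of the non-gap characters of an aligned row."""
    return tuple(i for i, ch in enumerate(row) if ch != gap)


def group_into_blocks(rows, gap="-"):
    """Group the aligned length-2 sequences by the positions of their characters."""
    blocks = defaultdict(list)
    for row in rows:
        key = block_key(row, gap)
        if len(key) != 2:
            raise ValueError("every row must contain exactly two characters")
        blocks[key].append(row)
    return dict(blocks)


def possible_block_types(width):
    """All position pairs a length-2 sequence can occupy in `width` columns."""
    return list(combinations(range(width), 2))


print(len(possible_block_types(4)))                 # 6
print(group_into_blocks(["AB--", "C-D-", "EF--"]))
# {(0, 1): ['AB--', 'EF--'], (0, 2): ['C-D-']}
```

Listing the rows block by block yields a row permutation of the kind used in the proof; since the sum-of-pairs cost is computed column by column, reordering rows leaves it unchanged.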

TABLE 4 The structure of $\mathcal{A}$ after permuting its rows and making its block setting with $\ell = 4$. Number 1 denotes the first characters, and number 2 the second characters. During the proof, an upper bound is given for the cost of aligning letters of the same order that are not aligned in $\mathcal{A}$, using character-gap alignment costs that are included in $\mathrm{cost}(\mathcal{A})$.

In the trivial alignment $\mathcal{T}$, there are only pairs of types 1 and 4; moreover, the first characters of all sequences are aligned with each other in $\mathcal{T}$ (and the same holds for the second characters of the sequences of $S$). In a nontrivial alignment $\mathcal{A}$, however, there are aligned sequences whose first or second characters are not aligned with each other. It is therefore enough to give an upper bound for the cost of those character pairs that are aligned with each other in $\mathcal{T}$ but not in $\mathcal{A}$, using parts of $\mathrm{cost}(\mathcal{A})$ for this bound. (Because every part of $\mathrm{cost}(\mathcal{A})$ is non-negative, if a bijection can be given between the letter-letter alignments of $\mathcal{T}$ that are not present in $\mathcal{A}$ and some other alignments of characters of $\mathcal{A}$ (character-gap alignments included), so that the latter always cost at least as much as the former, then $\mathrm{cost}(\mathcal{A}) \geq \mathrm{cost}(\mathcal{T})$.)

If $d$ denotes the unit metric, then the following inequality holds for every pair of sets $P$, $R$ over an arbitrary alphabet (where $P$ and $R$ can contain a letter more than once):

$$\sum_{p \in P} \sum_{r \in R} d(p, r) \leq |R| \sum_{p \in P} d(p, -) = |P| \cdot |R|.$$

Using this inequality, the bijection mentioned above can be given. First, consider two sequences whose first characters ($a_i$ and $a_j$) are not aligned in $\mathcal{A}$ (it can be assumed that $a_j$ has the larger column index). This implies that the element in the intersection of the row of $a_j$ and the column of $a_i$ must be a gap. Since $d(a_i, a_j) \leq d(a_i, -)$, the cost of the alignment of $a_i$ and $a_j$ in $\mathcal{T}$ can be estimated by the cost of the alignment of a character and a gap in $\mathcal{A}$.
Similarly, if two sequences are considered whose second characters ($a_i$ and $a_j$) are not aligned in $\mathcal{A}$, then (assuming that $a_j$ has the larger column index) the element in the intersection of the row of $a_i$ and the column of $a_j$ must be a gap. The same estimation can be given as before, meaning that the cost of the alignment of $a_i$ and $a_j$ in $\mathcal{T}$ is less than or equal to the cost of a character-gap alignment in $\mathcal{A}$.
Considering the block setting (Table 4) of $\mathcal{A}$, let $B_i$ and $B_j$ be two blocks whose sequences' first characters are not aligned in $\mathcal{A}$. Assuming that the first characters of the sequences in $B_j$ have the larger column index, there must be $|B_j|$ gaps in the intersection of the column of the first characters of the sequences in $B_i$ and the rows of $B_j$. If we denote the first letters of the sequences of $B_i$ and $B_j$ by $a_{b_i}$ and $a_{b_j}$, respectively, then (because of the statements of the previous two paragraphs) the following holds:

$$\sum_{b_i \in B_i} \sum_{b_j \in B_j} d\big(a_{b_i}, a_{b_j}\big) \leq |B_j| \sum_{b_i \in B_i} d\big(a_{b_i}, -\big).$$

A similar result can be established for two blocks whose sequences' second characters are not aligned, using the gaps of the block that has the column with the smaller column index (see Table 5). By these estimations, the assignment between the character-character alignments in $\mathcal{T}$ that are not present in $\mathcal{A}$ and character-gap alignments in $\mathcal{A}$ guarantees that the latter costs in $\mathcal{A}$ cannot be less than the corresponding costs in $\mathcal{T}$. We also need to show that this assignment is a bijection, that is, that no character-gap alignment is used more than once.
A set of gaps in the block setting is considered in an estimation if and only if some characters in the block containing these gaps and some characters from another block, lying in the same column as the gaps, are aligned in $\mathcal{T}$ but not in $\mathcal{A}$. This implies that these gaps are not used in estimations like the above more times than this gap set is aligned with the rest of the given column. Therefore, the former assignment is a bijection, implying that $\mathrm{cost}(\mathcal{A}) \geq \mathrm{cost}(\mathcal{T})$. ∎
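For small instances, the statement of Theorem 6 can be checked by exhaustive search. The sketch below is an illustration only: it enumerates every placement of the given length-2 sequences on at most `max_width` columns (a restriction introduced here to keep the search finite and small) and returns the cheapest alignment found under the unit metric. The function names and the sample sequence sets are hypothetical.

```python
from itertools import combinations, product

GAP = "-"


def unit_metric(u, v):
    return 0 if u == v else 1


def sp_cost(rows, d=unit_metric):
    """Sum-of-pairs cost, accumulated column by column."""
    return sum(d(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


def placements(seq, width):
    """All ways to write a length-2 sequence into `width` columns."""
    for p, q in combinations(range(width), 2):
        row = [GAP] * width
        row[p], row[q] = seq[0], seq[1]
        yield "".join(row)


def min_cost_up_to_width(seqs, max_width=4, d=unit_metric):
    """Return (cost, rows) of a cheapest alignment with at most max_width columns."""
    best = None
    for width in range(2, max_width + 1):
        options = [list(placements(s, width)) for s in seqs]
        for rows in product(*options):
            cost = sp_cost(rows, d)
            if best is None or cost < best[0]:
                best = (cost, rows)
    return best


for seqs in (["AB", "BA", "AA"], ["AB", "BC", "CA"], ["CG", "GC", "CC", "GG"]):
    best_cost, best_rows = min_cost_up_to_width(seqs)
    print(seqs, "trivial:", sp_cost(seqs), "best found:", best_cost)
```

All-gap columns are not excluded from the search; under a metric they contribute zero cost, so they cannot lower the minimum.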

Remark
In the proof, only the following property of the unit metric has been used: $\forall a_i, a_j \in \Sigma$: $d(a_i, a_j) \leq d(a_i, -)$. It follows that Theorem 6 remains valid for any metric satisfying this property.
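The property named in the Remark is easy to test for a concrete score scheme. The sketch below is a hedged illustration: `gap_dominates` checks $d(a_i, a_j) \leq d(a_i, -)$ over a given alphabet, and `cheap_gap_metric` is a hypothetical metric on {C, G, -}, not taken from the original text, in which gaps are cheaper than mismatches.

```python
from itertools import combinations

GAP = "-"


def gap_dominates(alphabet, d, gap=GAP):
    """True if d(a, b) <= d(a, gap) for all letters a, b of the alphabet."""
    return all(d(a, b) <= d(a, gap) for a in alphabet for b in alphabet)


def unit_metric(u, v):
    return 0 if u == v else 1


def cheap_gap_metric(u, v):
    """Hypothetical metric on {C, G, -}: d(C, G) = 2, d(C, -) = d(G, -) = 1."""
    if u == v:
        return 0
    return 1 if GAP in (u, v) else 2


def sp_cost(rows, d):
    """Sum-of-pairs cost, accumulated column by column."""
    return sum(d(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


print(gap_dominates("CG", unit_metric))        # True
print(gap_dominates("CG", cheap_gap_metric))   # False
print(sp_cost(["CG", "GC"], cheap_gap_metric))     # 4 (trivial alignment)
print(sp_cost(["CG-", "-GC"], cheap_gap_metric))   # 2 (shifted alignment)
```

With this hypothetical metric the property fails and a shifted alignment of CG and GC is cheaper than the trivial one, which matches the earlier statement that the result for length-2 sequences does not extend to arbitrary metrics.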

S = {CCG, GCG, CGC}.
TABLE 6 The trivial and an optimal alignment of S

TABLE 7 The trivial and an optimal alignment of S