A PTAS For The k-Consensus Structures Problem Under Squared Euclidean Distance

Li, Shuai Cheng; Ng, Yen Kaow; Zhang, Louxin

doi:10.3390/a1020043

by Shuai Cheng Li ¹,*, Yen Kaow Ng ² and Louxin Zhang ³

¹ David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada N2L 3G1

² Department of Computer Science and Communication Engineering, Kyushu University, Fukuoka 819-0395, Japan

³ Department of Mathematics, National University of Singapore, Singapore 117543

* Author to whom correspondence should be addressed.

Algorithms 2008, 1(2), 43-51; https://doi.org/10.3390/a1020043

Submission received: 5 September 2008 / Revised: 1 September 2008 / Accepted: 9 October 2008 / Published: 9 October 2008


Abstract

In this paper we consider a basic clustering problem that has uses in bioinformatics. A structural fragment is a sequence of ℓ points in 3D space, where ℓ is a fixed natural number. Two structural fragments $f_{1}$ and $f_{2}$ are equivalent if and only if $f_{1} = f_{2} \cdot R + τ$ under some rotation R and translation τ. We consider the distance between two structural fragments to be the sum of the squared Euclidean distances between all corresponding points of the structural fragments. Given a set of n structural fragments, we consider the problem of finding k (or fewer) structural fragments $g_{1}, g_{2}, \dots, g_{k}$, so as to minimize the sum of the distances between each of $f_{1}, f_{2}, \dots, f_{n}$ and its nearest structural fragment in $g_{1}, \dots, g_{k}$. In this paper we show a polynomial-time approximation scheme (PTAS) for the problem through a simple sampling strategy.

Keywords:

Clustering 3D point sequences; squared Euclidean distance; algorithm; polynomial-time approximation scheme.

1. Introduction

In this paper we consider the problem of clustering similar sequences of 3D points. Two such sequences of points are considered the same if they are equivalent under rotation and translation. The scenario which we consider is as follows. Suppose there is an original sequence of points that gave rise to a few variations of itself, through slight changes in some or all of its points. Given these variations of the sequence, we are to reconstruct the original sequence. A likely candidate for such an original sequence would be a sequence which is “nearest”, in terms of some distance measure, to the variations.

A more complicated scenario involves k original sequences of the same length. Formally, we formulate the problem as follows. Given n sequences of points $f_{1}, f_{2}, \dots, f_{n}$, we are to find a set of k sequences $g_{1}, \dots, g_{k}$, such that the sum of distances

\begin{matrix} \sum_{1 \leq i \leq n} \min_{1 \leq j \leq k} dist (f_{i}, g_{j}) \end{matrix}

(1)

is minimized. In this paper we consider the case where $dist$ is the minimum sum of squared Euclidean distances between corresponding points in the two sequences $f_{i}$ and $g_{j}$, under all possible rigid transformations on the sequences of points. A cost function in the form of the squared Euclidean distance is used in many techniques for clustering 3D points [1]. Since our clustering problem is quite different from those previously studied, it calls for a new technique. (The “square” in the distance measure is to fulfill a condition needed by the method in this paper. The method does not work, for example, in the case of the root mean squared Euclidean distance. On the other hand, the method easily adapts to other distance measures that fulfill the required condition.)

Such a problem has potential use in clustering protein structures. A protein structure is typically given as a sequence of points in 3D space, and for various reasons, there are typically minor variations in their measured structures. The problem can be considered a model of the situation where we have a set of measurements of a few protein structures, and are to reconstruct the original structures.

In this paper, we show that there is a polynomial-time approximation scheme (PTAS) for the problem, through a sampling strategy. More precisely, we show that an optimal solution obtained by sampling smaller subsets of the input suffices to give us an approximate solution, and the approximation ratio improves as we increase the size of the subsets we sample.

2. Preliminaries

Throughout this paper we let ℓ be a fixed non-zero natural number. A structural fragment is a sequence of ℓ 3D points. The mean square distance ($MS$) between two structural fragments $f = (f[1], \dots, f[ℓ])$ and $g = (g[1], \dots, g[ℓ])$ is defined to be

\begin{matrix} MS (f, g) = \min_{R \in \mathcal{R}, τ \in \mathcal{T}} \sum_{i = 1}^{ℓ} {∥ f [i] - (R \cdot g [i] + τ) ∥}^{2} \end{matrix}

(2)

where $\mathcal{R}$ is the set of all rotation matrices, $\mathcal{T}$ the set of all translation vectors, and $∥ x - y ∥$ is the Euclidean distance between $x, y \in \mathbb{R}^{3}$.

The root of the $MS$ measure, $RMS (f, g) = \sqrt{MS (f, g)}$, is a measure that has been extensively studied. Note that the $R \in \mathcal{R}$ and $τ \in \mathcal{T}$ that minimize $\sum_{i = 1}^{ℓ} {∥ f [i] - (R \cdot g [i] + τ) ∥}^{2}$ to give us $MS (f, g)$ will also give us $RMS (f, g)$, and vice versa. Since, given any f and g, there are closed form equations [2,3] for finding the R and τ that give $RMS (f, g)$, $MS (f, g)$ can be computed efficiently for any f and g.

Furthermore, it is known that to minimize $\sum_{i = 1}^{ℓ} {∥ f [i] - (R \cdot g [i] + τ) ∥}^{2}$, the centroids of f and g must coincide [2]. Due to this, without loss of generality we assume that all structural fragments have centroids at the origin. Such transformations can be done in $O (n ℓ)$ time. After such transformations, in computing $MS (f, g)$ only the parameter $R \in \mathcal{R}$ needs to be considered, that is,

\begin{matrix} MS (f, g) = \min_{R \in \mathcal{R}} \sum_{i = 1}^{ℓ} ∥ f [i] - R \cdot g [i] ∥^{2} \end{matrix}

(3)
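The centring step and the sum inside Equation 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the names `center`, `rotate` and `squared_cost` are hypothetical, fragments are assumed to be lists of `(x, y, z)` tuples, and the rotation is passed in as a 3×3 matrix rather than optimised, so the function evaluates the sum for one candidate R and not the minimum itself.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Fragment = List[Point]
Matrix = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major (assumed layout)


def center(fragment: Fragment) -> Fragment:
    """Translate a fragment so that its centroid lies at the origin."""
    n = len(fragment)
    cx = sum(p[0] for p in fragment) / n
    cy = sum(p[1] for p in fragment) / n
    cz = sum(p[2] for p in fragment) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in fragment]


def rotate(rot: Matrix, p: Point) -> Point:
    """Apply a 3x3 matrix to a point (matrix times column vector)."""
    return tuple(sum(rot[r][c] * p[c] for c in range(3)) for r in range(3))


def squared_cost(f: Fragment, g: Fragment, rot: Matrix) -> float:
    """Sum over i of ||f[i] - rot * g[i]||^2 for one fixed rotation."""
    total = 0.0
    for fp, gp in zip(f, g):
        rp = rotate(rot, gp)
        total += sum((a - b) ** 2 for a, b in zip(fp, rp))
    return total
```

Centring all n fragments this way touches each of the nℓ points once, which matches the $O(nℓ)$ bound quoted in the text.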

Suppose that given a set of n structural fragments $f_{1}, f_{2}, \dots, f_{n}$, we are to find k structural fragments $g_{1}, \dots, g_{k}$, such that each structural fragment $f_{i}$ is “near”, in terms of the $MS$, to at least one of the structural fragments in $g_{1}, \dots, g_{k}$. We formulate such a problem as follows:

k-Consensus Structural Fragments Problem Under $MS$
Input:	n structural fragments $f_{1}, \dots, f_{n}$, and a non-zero natural
	number $k < n$.
Output:	k structural fragments $g_{1}, \dots, g_{k}$, minimizing the cost
	$\sum_{i = 1}^{n} \min_{1 \leq j \leq k} MS (f_{i}, g_{j})$.

In this paper we will demonstrate that there is a PTAS for the problem.

We use the following notations. The cardinality of a set A is written $| A |$. For a set A and non-zero natural number n, $A^{n}$ denotes the set of all length-n sequences of elements of A. Let the elements of a set A be indexed, say $A = \{f_{1}, f_{2}, \dots, f_{n}\}$; then $A^{m!}$ denotes the set of all length-m sequences $f_{i_{1}}, f_{i_{2}}, \dots, f_{i_{m}}$, where $1 \leq i_{1} \leq i_{2} \leq \dots \leq i_{m} \leq n$. For a sequence S, $S (i)$ denotes the i-th element in S, and $| S |$ denotes its length.
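To make the two sequence sets concrete, the following Python sketch enumerates them with the standard `itertools` module. The function names are hypothetical; the only assumption is that the input list is given in index order, so that the non-decreasing index condition for $A^{m!}$ corresponds to `combinations_with_replacement`.

```python
from itertools import combinations_with_replacement, product
from math import comb


def all_sequences(items, m):
    """A^m: every length-m sequence over items (n^m of them)."""
    return list(product(items, repeat=m))


def nondecreasing_sequences(items, m):
    """A^{m!}: length-m sequences whose indices are non-decreasing."""
    return list(combinations_with_replacement(items, m))


def count_nondecreasing(n, m):
    """Number of length-m non-decreasing index sequences over n items."""
    return comb(n + m - 1, m)
```

Restricting to non-decreasing index sequences loses nothing when only the average of the sampled elements matters, since reordering a sample does not change its average.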

3. PTAS for the k-Consensus Structural Fragments

The following lemma, from [4], is central to the method.

Lemma 1 ([4]) Let $a_{1}, a_{2}, \dots, a_{n}$ be a sequence of real numbers and let $r \in \mathbb{N}$, $1 \leq r \leq n$. Then the following equation holds:

\begin{matrix} \frac{1}{n^{r}} \sum_{1 \leq i_{1}, i_{2}, \dots, i_{r} \leq n} \sum_{i = 1}^{n} {(\frac{a_{i_{1}} + a_{i_{2}} + \dots + a_{i_{r}}}{r} - a_{i})}^{2} = \frac{r + 1}{r} \sum_{i = 1}^{n} {(\frac{a_{1} + a_{2} + \dots + a_{n}}{n} - a_{i})}^{2} \end{matrix}

(4)
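Equation 4 can be checked by brute force on small inputs. The sketch below is an illustration only (the function name `lemma1_sides` is hypothetical); it uses exact rational arithmetic so that the two sides can be compared with `==`, and it enumerates all $n^{r}$ index tuples, so it is practical only for small n and r.

```python
from fractions import Fraction
from itertools import product


def lemma1_sides(a, r):
    """Return (lhs, rhs) of Equation 4 for the numbers a and sample size r."""
    n = len(a)
    vals = [Fraction(x) for x in a]
    lhs = Fraction(0)
    # Average, over all n^r index tuples, of the cost of the sample mean.
    for idx in product(range(n), repeat=r):
        sample_mean = sum(vals[i] for i in idx) / r
        lhs += sum((sample_mean - x) ** 2 for x in vals)
    lhs /= n ** r
    # Cost of the true mean, scaled by (r + 1) / r.
    mean = sum(vals) / n
    rhs = Fraction(r + 1, r) * sum((mean - x) ** 2 for x in vals)
    return lhs, rhs
```

For example, with `a = [0, 2]` and `r = 1` both sides equal 4.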

Let $P_{1} = (x_{1}, y_{1}, z_{1}), P_{2} = (x_{2}, y_{2}, z_{2}), \dots, P_{n} = (x_{n}, y_{n}, z_{n})$ be a sequence of 3D points. Applying Lemma 1 to each coordinate separately gives

\begin{matrix} \frac{1}{n^{r}} \sum_{1 \leq i_{1}, i_{2}, \dots, i_{r} \leq n} \sum_{i = 1}^{n} {∥ \frac{P_{i_{1}} + P_{i_{2}} + \dots + P_{i_{r}}}{r} - P_{i} ∥}^{2} \end{matrix}

\begin{matrix} = & \frac{1}{n^{r}} \sum_{1 \leq i_{1}, \dots, i_{r} \leq n} \sum_{i = 1}^{n} [ {(\frac{x_{i_{1}} + \dots + x_{i_{r}}}{r} - x_{i})}^{2} + {(\frac{y_{i_{1}} + \dots + y_{i_{r}}}{r} - y_{i})}^{2} + {(\frac{z_{i_{1}} + \dots + z_{i_{r}}}{r} - z_{i})}^{2} ] \end{matrix}

\begin{matrix} = & \frac{r + 1}{r} \sum_{i = 1}^{n} [ {(\frac{x_{1} + \dots + x_{n}}{n} - x_{i})}^{2} + {(\frac{y_{1} + \dots + y_{n}}{n} - y_{i})}^{2} + {(\frac{z_{1} + \dots + z_{n}}{n} - z_{i})}^{2} ] \end{matrix}

\begin{matrix} = & \frac{r + 1}{r} \sum_{i = 1}^{n} {∥ \frac{P_{1} + P_{2} + \dots + P_{n}}{n} - P_{i} ∥}^{2} \end{matrix}

(5)

One can similarly extend the equation to structural fragments. Let $f_{1}, \dots, f_{n}$ be n structural fragments, with sums and scalar multiples of fragments taken point by point and $∥ f - g ∥^{2} = \sum_{i = 1}^{ℓ} ∥ f [i] - g [i] ∥^{2}$. The equation becomes:

\begin{matrix} \frac{1}{n^{r}} \sum_{1 \leq i_{1}, \dots, i_{r} \leq n} \sum_{i = 1}^{n} {∥ \frac{f_{i_{1}} + \dots + f_{i_{r}}}{r} - f_{i} ∥}^{2} = \frac{r + 1}{r} \sum_{i = 1}^{n} ∥ \frac{f_{1} + \dots + f_{n}}{n} - f_{i} ∥^{2} \end{matrix}

(6)

The left-hand side of Equation 6 is an average over all $n^{r}$ index sequences, so at least one sequence attains a value no greater than that average. Hence there exists a sequence of r structural fragments $f_{i_{1}}, f_{i_{2}}, \dots, f_{i_{r}}$ such that

\begin{matrix} \sum_{i = 1}^{n} {∥ \frac{f_{i_{1}} + \dots + f_{i_{r}}}{r} - f_{i} ∥}^{2} & \leq & \frac{r + 1}{r} \sum_{i = 1}^{n} ∥ \frac{f_{1} + \dots + f_{n}}{n} - f_{i} ∥^{2} \end{matrix}

(7)

Our strategy uses this fact, in essentially the same way as in [4], to approximate the optimal solution for the k-consensus structural fragments problem. That is, we exhaustively sample every combination of k sequences, each of r elements, from the space $\mathcal{R}^{'} \times \{f_{1}, \dots, f_{n}\}$, where $f_{1}, \dots, f_{n}$ is the input and $\mathcal{R}^{'}$ is a fixed, selected set of rotations, which we discuss next.

3.1. Discretized Rotation Space

Any rotation can be represented by a normalized vector u and a rotation angle θ, where u is the axis about which an object is rotated by θ. If we apply $(u, θ)$ to a vector v, we obtain the vector $\hat{v}$, which is:

\begin{matrix} \hat{v} = u (v \cdot u) + (v - u (v \cdot u)) \cos θ + (v \times u) \sin θ \end{matrix}

(8)

where · represents the dot product, and × represents the cross product.
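Equation 8 translates directly into code. The sketch below is a minimal illustration with hypothetical function names; it assumes that the axis appearing in every term of the equation is the unit vector u, and it keeps the cross product in the order v × u as printed. With that order a positive θ turns v clockwise when viewed from the tip of u; using u × v instead would reverse the sense of rotation.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    """Cross product a x b of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def rotate_about_axis(v, u, theta):
    """Apply (u, theta) to v following Equation 8; u must have unit length."""
    along = tuple(dot(v, u) * c for c in u)            # u (v . u)
    perp = tuple(vc - ac for vc, ac in zip(v, along))  # v - u (v . u)
    v_cross_u = cross(v, u)                            # v x u, as printed
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return tuple(
        a + p * cos_t + c * sin_t
        for a, p, c in zip(along, perp, v_cross_u)
    )
```

The component of v along u is left untouched, and only the perpendicular component is turned, which is why the length of v is preserved.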

By the equation, one can verify that a change of ϵ in u changes $\hat{v}$ by at most $α_{1} ϵ | v |$ for some computable $α_{1} \in \mathbb{R}$, and a change of ϵ in θ changes $\hat{v}$ by at most $α_{2} ϵ | v |$ for some computable $α_{2} \in \mathbb{R}$. Now any rotation about an axis through the origin can be written in the form $(θ_{1}, θ_{2}, θ_{3})$, where $θ_{1}, θ_{2}, θ_{3} \in [0, 2 π)$ are respectively rotations about the x, y and z axes. Similarly, changes of ϵ in $θ_{1}$, $θ_{2}$ and $θ_{3}$ will result in a change of at most $α ϵ | v |$, for some computable $α \in \mathbb{R}$.

We discretize the values that each $θ_{i}$, $1 \leq i \leq 3$, may take within the range $[0, 2 π)$ into a series of angles of angular difference ϑ. There are hence at most $O (1 / ϑ)$ such values for each $θ_{i}$, $1 \leq i \leq 3$. Let $\mathcal{R}^{'}$ denote the set of all possible discretized rotations $(θ_{1}, θ_{2}, θ_{3})$. Note that $| \mathcal{R}^{'} |$ is of order $O (1 / ϑ^{3})$.
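The discretized rotation set can be enumerated as a grid of angle triples. The following is a minimal sketch with hypothetical names; it assumes the grid starts at 0 and uses a uniform step, which is one natural reading of "a series of angles of angular difference ϑ".

```python
import math
from itertools import product


def angle_grid(step):
    """Angles 0, step, 2*step, ... that lie in [0, 2*pi)."""
    count = math.ceil(2 * math.pi / step)
    return [i * step for i in range(count)]


def discretized_rotations(step):
    """All triples (theta1, theta2, theta3) drawn from the angle grid."""
    return list(product(angle_grid(step), repeat=3))
```

The grid has about 2π/ϑ angles per axis, so the triple product has about $(2π/ϑ)^{3}$ elements, in line with the $O(1/ϑ^{3})$ order stated in the text.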

Let d be the diameter of a ball that is able to encapsulate each of $f_{1}, f_{2}, \dots, f_{n}$. Hence any distance between two points among $f_{1}, \dots, f_{n}$ is at most d. In this paper we assume d to be constant with respect to the input size. Note that for a protein structure, d is of order $O (ℓ)$ [5]. For any $b \in \mathbb{R}$ with $b > 0$, we can choose ϑ so small that for any rotation R and any point $p \in \mathbb{R}^{3}$ with $∥ p ∥ \leq d$, there exists $R^{'} \in \mathcal{R}^{'}$ such that $∥ R \cdot p - R^{'} \cdot p ∥ \leq α ϑ d \leq b$.

3.2. A Polynomial-time Algorithm With Cost $((1 + ϵ) D_{o p t} + c)$

Our algorithm for the k-consensus structural fragments problem is summarized in Table 1.

This is what the algorithm does. In (2), we explore m distinct subsets $A_{1}, \dots, A_{m}$ of $\{f_{1}, \dots, f_{n}\}$, in the hope that each subset is from a distinct cluster in the optimal clustering. Since we explore all possible such subsets, this is bound to happen. From (2.1) onwards, we evaluate the score of each subset $A_{j}$ by sampling up to r structural fragments (allowing repeats) from it. Such an evaluation is possible due to Equation 7. The evaluation also requires us to exhaustively try out all possible transformations in $\mathcal{R}^{'}$, which is done in (2.2). Each of these samplings of $A_{j}$ produces a consensus structural fragment $u_{j}$ for $A_{j}$ in (2.3), the score of which is evaluated in (2.4). Finally, in (3), we output the consensus structural fragments $u_{1}, \dots, u_{m}$ that give the best score.

We now analyze the runtime complexity of the algorithm. Consider the number of possible $F_{1}, F_{2}, \dots, F_{m}$ in (2.1). Let each $F_{j}$ be represented by a length-r string over $n + 1$ symbols, n of which represent $f_{1}, \dots, f_{n}$, while the remaining symbol represents “nothing”. It is clear that for any $A_{j}$, any $F_{j} \in A_{j}^{r!}$, or $F_{j} \in A_{j}^{| A_{j} |!}$ (where $| A_{j} | \leq r$), can be represented by one such string. Furthermore, any $F_{1}, F_{2}, \dots, F_{m}$ can be completely represented by k such strings; to represent the case where $m < k$, $k - m$ strings can be set to “nothing” completely. From this, we can see that there are at most ${(n + 1)}^{r k} = O (n^{r k})$ possible combinations of $F_{1}, F_{2}, \dots, F_{m}$.

For each of these combinations, there are $| \mathcal{R}^{'} |^{r k}$ possible combinations of $Θ_{1}, Θ_{2}, \dots, Θ_{m}$ at (2.2), hence resulting in $O ({(n | \mathcal{R}^{'} |)}^{r k})$ iterations of (2.3) to (2.5). Since (2.3) can be done in $O (r k ℓ)$, (2.4) in $O (n k | \mathcal{R}^{'} | ℓ)$, and (2.5) in $O (n)$ time, the algorithm completes in $O (k ℓ (r + n | \mathcal{R}^{'} |) {(n | \mathcal{R}^{'} |)}^{r k})$ time.
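To get a feel for how quickly the search space in this analysis grows, the counts can be evaluated directly. The helper names below are hypothetical; the formulas are the ones stated above, namely $(n + 1)^{rk}$ fragment samplings and $|R'|^{rk}$ rotation assignments per sampling.

```python
def fragment_choices(n, r, k):
    """Upper bound (n + 1)^(r k) on the number of sampled sequences."""
    return (n + 1) ** (r * k)


def rotation_choices(num_rotations, r, k):
    """Number |R'|^(r k) of rotation assignments per sampled sequence."""
    return num_rotations ** (r * k)


def total_iterations(n, num_rotations, r, k):
    """Upper bound on the number of iterations of steps (2.3) to (2.5)."""
    return fragment_choices(n, r, k) * rotation_choices(num_rotations, r, k)
```

Even for n = 3 fragments, 8 discretized rotations, r = 2 and k = 1, this already gives 16 × 64 = 1024 iterations; the bound is polynomial in n only because r and k are treated as constants.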

We argue that $D_{min}$ is eventually at most $(r + 1) / r$ times the optimal cost plus an additive term. Suppose the optimal solution results in the $m \leq k$ disjoint clusters $A_{1}, A_{2}, \dots, A_{m} \subseteq \{f_{1}, \dots, f_{n}\}$.

For each $A_{j}$, $1 \leq j \leq m$, let $u_{j}$ be a structural fragment which minimizes $\sum_{f \in A_{j}} MS (u_{j}, f)$. Furthermore, for each $f \in A_{j}$, let $R_{f}$ be a rotation where

\begin{matrix} R_{f} \in \arg\min_{R \in \mathcal{R}} {∥ u_{j} - R \cdot f ∥}^{2} \end{matrix}

(9)

and let

\begin{matrix} D_{j} = \sum_{f \in A_{j}} ∥ u_{j} - R_{f} \cdot f ∥^{2} \end{matrix}

(10)

Hence the optimal cost is $D = \sum_{j = 1}^{m} D_{j}$.

Table 1. Polynomial-time algorithm for the problem.


By the property of the $MS$ measure, it can be shown that $u_{j}$ is the average of $\{R_{f} \cdot f \mid f \in A_{j}\}$. For each $A_{j}$ where $| A_{j} | > r$, by Equation 6,

\begin{matrix} \frac{1}{| A_{j} |^{r}} \sum_{F_{j} \in A_{j}^{r}} \sum_{f \in A_{j}} {∥ \frac{R_{F_{j} (1)} \cdot F_{j} (1) + \dots + R_{F_{j} (r)} \cdot F_{j} (r)}{r} - R_{f} \cdot f ∥}^{2} & = & \frac{r + 1}{r} D_{j} \end{matrix}

(11)

For each such $A_{j}$, let $F_{j} \in A_{j}^{r}$ be such that

\begin{matrix} \sum_{f \in A_{j}} {∥ \frac{R_{F_{j} (1)} \cdot F_{j} (1) + \dots + R_{F_{j} (r)} \cdot F_{j} (r)}{r} - R_{f} \cdot f ∥}^{2} & \leq & \frac{r + 1}{r} D_{j} \end{matrix}

(12)

Without loss of generality assume that each $F_{j} \in A_{j}^{r!}$. Let

\begin{matrix} μ_{j} = \begin{cases} \frac{R_{F_{j} (1)} \cdot F_{j} (1) + \dots + R_{F_{j} (r)} \cdot F_{j} (r)}{r} & \text{if } | A_{j} | > r \\ \frac{R_{F_{j} (1)} \cdot F_{j} (1) + \dots + R_{F_{j} (| A_{j} |)} \cdot F_{j} (| A_{j} |)}{| A_{j} |} & \text{otherwise} \end{cases} \end{matrix}

(13)

Then we may write,

\begin{matrix} \sum_{j = 1}^{m} \sum_{f \in A_{j}} {∥ μ_{j} - R_{f} \cdot f ∥}^{2} & \leq & \frac{r + 1}{r} D \end{matrix}

(14)

For each rotation $R_{f}$, let $R^{'}_{f}$ be a closest rotation to $R_{f}$ within $\mathcal{R}^{'}$. Also, let

\begin{matrix} μ^{'}_{j} = \begin{cases} \frac{R^{'}_{F_{j} (1)} \cdot F_{j} (1) + \dots + R^{'}_{F_{j} (r)} \cdot F_{j} (r)}{r} & \text{if } | A_{j} | > r \\ \frac{R^{'}_{F_{j} (1)} \cdot F_{j} (1) + \dots + R^{'}_{F_{j} (| A_{j} |)} \cdot F_{j} (| A_{j} |)}{| A_{j} |} & \text{otherwise} \end{cases} \end{matrix}

(15)

Since we exhaustively sample all possible $F_{j} \in A_{j}^{r!}$ for all possible $A_{j}$ and for all $R \in \mathcal{R}^{'}$, it is clear that:

\begin{matrix} D_{min} & \leq & \sum_{j = 1}^{m} \sum_{f \in A_{j}} {∥ μ^{'}_{j} - R^{'}_{f} \cdot f ∥}^{2} \end{matrix}

(16)

We will now relate the LHS of Equation 14 with the RHS of Equation 16. The RHS of Equation 16 is

\begin{matrix} \sum_{j = 1}^{m} \sum_{f \in A_{j}} {∥ μ^{'}_{j} - R^{'}_{f} \cdot f ∥}^{2} \end{matrix}

\begin{matrix} = & \sum_{j = 1}^{m} \sum_{f \in A_{j}} {∥ μ_{j} + (μ^{'}_{j} - μ_{j}) + (R_{f} \cdot f - R^{'}_{f} \cdot f) - R_{f} \cdot f ∥}^{2} \end{matrix}

\begin{matrix} \leq & \sum_{j = 1}^{m} \sum_{f \in A_{j}} {(∥ μ_{j} - R_{f} \cdot f ∥ + (∥ μ^{'}_{j} - μ_{j} ∥ + ∥ R_{f} \cdot f - R^{'}_{f} \cdot f ∥))}^{2} \end{matrix}

\begin{matrix} = & \sum_{j = 1}^{m} \sum_{f \in A_{j}} [ ∥ μ_{j} - R_{f} \cdot f ∥^{2} + {(∥ μ^{'}_{j} - μ_{j} ∥ + ∥ R_{f} \cdot f - R^{'}_{f} \cdot f ∥)}^{2} \end{matrix}

\begin{matrix} + 2 ∥ μ_{j} - R_{f} \cdot f ∥ (∥ μ^{'}_{j} - μ_{j} ∥ + ∥ R_{f} \cdot f - R^{'}_{f} \cdot f ∥) ] \end{matrix}

\begin{matrix} \leq & \sum_{j = 1}^{m} \sum_{f \in A_{j}} ∥ μ_{j} - R_{f} \cdot f ∥^{2} + 8 n ℓ b \end{matrix}

(17)

Hence by Equation 14, $D_{min}$ is at most $(r + 1) / r = 1 + 1 / r$ times the optimal cost plus an additive term $c = 8 n ℓ b$. Letting $ϵ = 1 / r$, we obtain the following.

Theorem 2 For any $c, ϵ \in \mathbb{R}$ with $c, ϵ > 0$, a $((1 + ϵ) D_{opt} + c)$-approximation solution for the k-consensus structural fragments problem can be computed in $O (k ℓ (\frac{1}{ϵ} + n | \mathcal{R}^{'} |) {(n | \mathcal{R}^{'} |)}^{\frac{k}{ϵ}})$ time.

The term c in Theorem 2 is due to the error introduced by the discretization of rotations. If we are able to estimate a lower bound of $D_{opt}$, we can scale this error by refining the discretization such that c is an arbitrarily small fraction of $D_{opt}$. To do so, in the next section we show a lower bound on $D_{opt}$.

3.3. A Polynomial-time 4-approximation Algorithm

We now show a 4-approximation algorithm for the k-consensus structural fragments problem. We first show the case for $k = 1$, and then generalize the result to all $k \geq 2$.

Let the n input structural fragments be $f_{1}, f_{2}, \dots, f_{n}$. Let $f_{a}$, $1 \leq a \leq n$, be the structural fragment for which $\sum_{1 \leq j \leq n \land j \neq a} MS (f_{a}, f_{j})$ is minimized. Note that $f_{a}$ can be found in time $O (n^{2} ℓ)$, since for any $1 \leq i, j \leq n$, $MS (f_{i}, f_{j})$ (more precisely, $RMS (f_{i}, f_{j})$) can be computed in time $O (ℓ)$ using closed form equations from [3].

We argue that $f_{a}$ is a 4-approximation. Let the optimal structural fragment be $f_{opt}$, the corresponding cost $D_{opt}$, and let $f_{b}$ ($1 \leq b \leq n$) be the fragment for which $MS (f_{b}, f_{opt})$ is minimized.

We first note that the cost of using $f_{a}$ as the solution satisfies $\sum_{i \neq a} MS (f_{a}, f_{i}) \leq \sum_{i \neq b} MS (f_{b}, f_{i})$. To continue we first establish the following claim.

Claim 1 $MS (f, f^{'}) \leq 2 (MS (f, f^{''}) + MS (f^{''}, f^{'}))$.

PROOF. In [6], it is shown that

\begin{matrix} RMS (f, f^{'}) \leq RMS (f, f^{''}) + RMS (f^{''}, f^{'}) \end{matrix}

(18)

Squaring both sides gives

\begin{matrix} MS (f, f^{'}) \leq MS (f, f^{''}) + MS (f^{''}, f^{'}) + 2 RMS (f, f^{''}) RMS (f^{''}, f^{'}) \end{matrix}

(19)

Since

\begin{matrix} 2 RMS (f, f^{''}) RMS (f^{''}, f^{'}) \leq MS (f, f^{''}) + MS (f^{''}, f^{'}) \end{matrix}

(20)

we have $MS (f, f^{'}) \leq 2 (MS (f, f^{''}) + MS (f^{''}, f^{'}))$. ▮
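Inequality 20 is used in the proof without justification. A short derivation, added here for completeness, writes $a = RMS(f, f'')$ and $b = RMS(f'', f')$ (shorthand introduced only for this note) and uses the fact that a square is non-negative:

```latex
\begin{aligned}
0 \;\leq\; (a - b)^{2} \;=\; a^{2} - 2ab + b^{2}
  \quad &\Longrightarrow \quad 2ab \;\leq\; a^{2} + b^{2}, \\
a^{2} = MS(f, f''), \quad b^{2} = MS(f'', f')
  \quad &\Longrightarrow \quad
  2\, RMS(f, f'')\, RMS(f'', f') \;\leq\; MS(f, f'') + MS(f'', f').
\end{aligned}
```

Substituting this bound into Inequality 19 gives the factor of 2 in Claim 1.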

By the above claim,

\begin{matrix} \sum_{i \neq b} M S (f_{b}, f_{i}) & \leq & 2 \sum_{i \neq b} (M S (f_{b}, f_{o p t}) + M S (f_{o p t}, f_{i})) \end{matrix}

(21)

\begin{matrix} = & 2 \sum_{i \neq b} M S (f_{b}, f_{o p t}) + 2 \sum_{i \neq b} M S (f_{i}, f_{o p t}) \end{matrix}

(22)

\begin{matrix} \leq & 2 \sum_{i \neq b} M S (f_{b}, f_{o p t}) + 2 D_{o p t} \end{matrix}

(23)

\begin{matrix} \leq & 2 \sum_{j \neq b} M S (f_{j}, f_{o p t}) + 2 D_{o p t} \end{matrix}

(24)

\begin{matrix} \leq & 2 D_{o p t} + 2 D_{o p t} = 4 D_{o p t} \end{matrix}

(25)

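Hence $\sum_{i \neq a} MS (f_{a}, f_{i}) \leq 4 D_{opt}$. We now extend this to k structural fragments. Writing S for the input set $\{f_{1}, \dots, f_{n}\}$, the algorithm (1) enumerates every set $A \subseteq S$ of at most k fragments, and (2) computes for each A the cost $\sum_{f \in S} \min_{g \in A} MS (f, g)$, returning an A of minimum cost.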

We first pre-compute $MS (f, f^{'})$ for every pair of $f, f^{'} \in S$, which takes time $O (n^{2} ℓ)$. Then, at step (1), there are at most $O (n^{k})$ combinations of A, each of which takes $O (n k)$ time to evaluate at step (2). Hence in total we can perform the computation in $O (n^{2} ℓ + k n^{k + 1})$ time. To see that the solution is a 4-approximation, let $S_{1}, S_{2}, \dots, S_{m}$, where $m \leq k$, be an optimal clustering. Then, by our earlier argument, there exist $f_{i_{1}} \in S_{1}$, $f_{i_{2}} \in S_{2}$, …, $f_{i_{m}} \in S_{m}$ such that each $f_{i_{x}}$ is a 4-approximation for $S_{x}$, and hence $f_{i_{1}}, f_{i_{2}}, \dots, f_{i_{m}}$ is a 4-approximation for the k-consensus structural fragments problem. Since the algorithm exhaustively searches every combination of up to k fragments, it gives a solution at least as good as $f_{i_{1}}, f_{i_{2}}, \dots, f_{i_{m}}$, and hence is a 4-approximation algorithm.
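The exhaustive search described above can be sketched as follows. This is a reconstruction from the surrounding description, not the authors' listing: the function names are hypothetical, and the pairwise $MS$ values are assumed to be supplied as a precomputed n × n matrix `dist`, with `dist[i][j]` standing for $MS(f_i, f_j)$.

```python
from itertools import combinations


def clustering_cost(dist, centers):
    """Sum, over all fragments, of the distance to the nearest chosen center."""
    return sum(min(row[c] for c in centers) for row in dist)


def best_centers(dist, k):
    """Try every set of at most k input fragments as centers; keep the cheapest."""
    n = len(dist)
    best_cost, best = None, None
    for size in range(1, min(k, n) + 1):
        for centers in combinations(range(n), size):
            cost = clustering_cost(dist, centers)
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, centers
    return best, best_cost
```

Each candidate set costs O(nk) to evaluate against the precomputed matrix, and there are O(n^k) candidate sets, which is where the $k n^{k+1}$ term in the running time comes from.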

Theorem 3 A 4-approximation solution for the k-consensus structural fragments problem can be computed in $O (n^{2} ℓ + k n^{k + 1})$ time.

3.4. A $(1 + ϵ)$ Polynomial-time Approximation Scheme

Recall that the algorithm in Section 3.2 has cost $D \leq (1 + ϵ) D_{opt} + 8 n ℓ b$ where $b = α ϑ d$. From Section 3.3 we have a lower bound $\underline{D}_{opt}$ of $D_{opt}$, namely one quarter of the cost of the 4-approximate solution. We want $8 n ℓ b \leq ϵ \underline{D}_{opt} \leq ϵ D_{opt}$. To do so, it suffices that we set $ϑ \leq ϵ \underline{D}_{opt} / (8 n ℓ α d)$. This results in an $| \mathcal{R}^{'} |$ of order $O (1 / ϑ^{3}) = O ({(n ℓ d)}^{3})$. Substituting this in Theorem 2, and combining with Theorem 3, we get the following.

Theorem 4 For any $ϵ \in \mathbb{R}$ with $ϵ > 0$, a $((1 + ϵ) D_{opt})$-approximation solution for the k-consensus structural fragments problem can be computed in $O (n^{2} ℓ + k n^{k + 1} + k ℓ (\frac{2}{ϵ} + n λ) {(n λ)}^{\frac{2 k}{ϵ}})$ time, where $λ = {(n ℓ d)}^{3}$.
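The choice of ϑ in this section is a one-line computation once a lower bound on the optimal cost is known. The sketch below uses hypothetical helper names; `alpha` stands for the constant α of Section 3.1, whose value the text does not give and which is therefore an input here, and the lower bound is taken as one quarter of the cost returned by a 4-approximation.

```python
def lower_bound_from_4approx(cost_4approx):
    """A cost within 4x of optimal implies D_opt >= cost / 4."""
    return cost_4approx / 4.0


def angular_step(eps, lower_bound, n, ell, alpha, d):
    """Largest theta with 8 * n * ell * alpha * theta * d <= eps * lower_bound."""
    return eps * lower_bound / (8 * n * ell * alpha * d)


def discretization_error(theta_step, n, ell, alpha, d):
    """The additive term c = 8 n ell b, with b = alpha * theta * d."""
    return 8 * n * ell * alpha * theta_step * d
```

With this step size the additive error is at most ϵ times the lower bound, so the total cost is at most $(1 + 2ϵ) D_{opt}$; running the scheme with ϵ/2 in place of ϵ explains the $2/ϵ$ and $2k/ϵ$ that appear in Theorem 4.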

4. Discussions

The method in this paper depends on Lemma 1. For this reason, the technique does not extend to the problem under distance measures where Lemma 1 cannot be applied, for example, the $RMS$ measure. However, should Lemma 1 apply to a distance measure, it should be easy to adapt the method here to solve the problem for that distance measure.

One can also formulate variations of the k-consensus structural fragments problem, for example, the k-closest structural fragments problem.

While the cost function of the k-consensus structural fragments problem resembles that of the k-means problem, the cost function of the k-closest structural fragments problem resembles that of the (absolute) k-center problem. One interesting problem for future study is whether this problem has a PTAS or not. It is not clear how to generalize the technique employed in this paper to the k-closest structural fragments problem under $MS$.

References

1. Jain, A. K.; Murty, M. N.; Flynn, P. J. Data clustering: a review. ACM Computing Surveys 1999, 31(3), 264–323.
2. Arun, K. S.; Huang, T. S.; Blostein, S. D. Least-squares fitting of two 3-d point sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 9(5), 698–700.
3. Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13(4), 376–380.
4. Qian, J.; Li, S. C.; Bu, D.; Li, M.; Xu, J. Finding compact structural motifs. In Combinatorial Pattern Matching, 18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007, Proceedings; Ma, B., Zhang, K.Z., Eds.; Springer, 2007; Vol. 4580 of Lecture Notes in Computer Science, pp. 142–149.
5. Hao, M.; Rackovsky, S.; Liwo, A.; Pincus, M.; Scheraga, H. Effects of compact volume and chain stiffness on the conformations of native proteins. Proc. Natl. Acad. Sci. 1992, 89, 6614–6618.
6. Steipe, B. A revised proof of the metric properties of optimally superimposed vector sets. Acta Crystallographica Section A 2002, 58(5), 506.

© 2008 by the authors; licensee MDPI, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
