1. Introduction
Let $A \in \mathbb{R}^{m \times n}$ be a given matrix in which the number of rows, $m$, is considerably larger than the number of columns, $n$. Let the rows of $A$ be denoted as $\mathbf{a}_i^T$, $i = 1, \dots, m$. That is, $A = [\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_m]^T$. Let $A_i \in \mathbb{R}^{i \times n}$ be a submatrix of $A$, which is composed from the first $i$ rows of $A$. Let $\sigma_{\max}(A_i)$ denote the largest singular value of $A_i$, let $\sigma_{\min}(A_i)$ denote the smallest singular value of $A_i$, and let
$$\kappa(A_i) = \sigma_{\max}(A_i)/\sigma_{\min}(A_i) \tag{1}$$
denote the condition number of this matrix. In this paper, we investigate the behavior of the sequences $\sigma_{\max}(A_i)$, $\sigma_{\min}(A_i)$, and $\kappa(A_i)$, $i = 1, \dots, m$. We start by showing that adding rows causes the largest singular value to increase,
$$\sigma_{\max}(A_1) \le \sigma_{\max}(A_2) \le \dots \le \sigma_{\max}(A_m), \tag{2}$$
and study the reasons for a large, or small, increase. Next, we consider the behavior of the smallest singular values, which is somewhat surprising: at first, adding rows causes the smallest singular value to decrease,
$$\sigma_{\min}(A_1) \ge \sigma_{\min}(A_2) \ge \dots \ge \sigma_{\min}(A_n). \tag{3}$$
Then, as $i$ passes $n$, adding rows increases the smallest singular value. That is,
$$\sigma_{\min}(A_n) \le \sigma_{\min}(A_{n+1}) \le \dots \le \sigma_{\min}(A_m). \tag{4}$$
This behavior is called “the smallest singular value anomaly”. The study of this phenomenon explains the reasons for a large, or small, difference between $\sigma_{\min}(A_i)$ and $\sigma_{\min}(A_{i+1})$.
The last observation implies that $\sigma_{\min}(A_n)$ is the smallest number in the sequence $\sigma_{\min}(A_i)$, $i = 1, \dots, m$. Assume for simplicity that $\sigma_{\min}(A_n) > 0$. In this case, $\sigma_{\min}(A_i) > 0$ for $i = n, \dots, m$, and the ratio $\kappa(A_i) = \sigma_{\max}(A_i)/\sigma_{\min}(A_i)$ is the condition number of $A_i$. This number affects the results of certain computations, such as the solution of linear equations, e.g., [1,2,3]. It is interesting, therefore, to examine the behavior of the sequence $\kappa(A_i)$, $i = n, \dots, m$. The inequalities (2) and (3) show that as $i$ moves from 1 to $n$, the value of $\kappa(A_i)$ increases. That is,
$$\kappa(A_1) \le \kappa(A_2) \le \dots \le \kappa(A_n). \tag{5}$$
However, as $i$ passes $n$, both $\sigma_{\max}(A_i)$ and $\sigma_{\min}(A_i)$ are increasing, and the behavior of $\kappa(A_i)$ is not straightforward. The fact that the sequence $\sigma_{\min}(A_i)$, $i = n, \dots, m$, is increasing tempts one to expect that the sequence $\kappa(A_i)$ will decrease. That is,
$$\kappa(A_n) \ge \kappa(A_{n+1}) \ge \dots \ge \kappa(A_m). \tag{6}$$
The situation in which (6) holds is called the condition number anomaly. In this case, the sequence $\kappa(A_i)$, $i = 1, \dots, n$, increases toward $\kappa(A_n)$, which can be quite large, while the sequence $\kappa(A_i)$, $i = n, \dots, m$, decreases toward a value of $\kappa(A_m) = \kappa(A)$, which is considerably smaller than $\kappa(A_n)$. The inequalities that assess the increase in the sequences (2) and (4) enable us to derive a useful bound on the ratio $\kappa(A_{i+1})/\kappa(A_i)$. The bound explains the reasons behind the condition number anomaly and characterizes situations that invite (or exclude) such behavior.
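The three sequences are easy to compute directly. The following minimal NumPy sketch (with arbitrary illustrative sizes and data of our own choosing, not taken from the paper's tables) computes them for the leading submatrices of a random matrix and locates the minimum of $\sigma_{\min}(A_i)$ near $i = n$:

```python
import numpy as np

# A minimal sketch with arbitrary sizes: compute sigma_max(A_i),
# sigma_min(A_i), and kappa(A_i) for the leading submatrices A_i of A.
rng = np.random.default_rng(0)
m, n = 200, 20
A = rng.standard_normal((m, n))

sig_max, sig_min, kappa = [], [], []
for i in range(1, m + 1):
    s = np.linalg.svd(A[:i, :], compute_uv=False)  # singular values of A_i
    sig_max.append(s[0])          # largest singular value
    sig_min.append(s[-1])         # smallest singular value
    kappa.append(s[0] / s[-1])    # condition number (finite once s[-1] > 0)

# The smallest value of sigma_min(A_i) is attained near i = n.
print(1 + int(np.argmin(sig_min)), n)
```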
One type of matrices that exhibits the condition number anomaly is that of dense random matrices in which each element of the matrix is independently sampled from the same probability distribution. In particular, if each element of $A$ comes from an independent standard normal distribution, then $A$ is a Gaussian random matrix, and $A^TA$ is a Wishart matrix. The problem of estimating the largest and the smallest singular values of large Gaussian matrices has been studied by several authors. See [4,5,6,7,8,9,10,11,12,13,14] and the references therein. In this case, when $n$ is very large and $i \ge n$, we have the estimates
$$\sigma_{\max}(A_i) \approx \sqrt{i} + \sqrt{n} \tag{7}$$
and
$$\sigma_{\min}(A_i) \approx \sqrt{i} - \sqrt{n}, \tag{8}$$
which means that very large Gaussian matrices possess the condition number anomaly (for very large $n$ and $i > n$, we have
$$\kappa(A_i) \approx \frac{\sqrt{i} + \sqrt{n}}{\sqrt{i} - \sqrt{n}}; \tag{9}$$
see [7,12]).
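These estimates are easy to probe numerically. The sketch below is our own check, with arbitrary sizes; it compares the extreme singular values of Gaussian matrices with $\sqrt{i} \pm \sqrt{n}$:

```python
import numpy as np

# Compare the extreme singular values of Gaussian random matrices with
# the estimates sqrt(i) + sqrt(n) and sqrt(i) - sqrt(n).  The sizes are
# arbitrary illustrative choices.
rng = np.random.default_rng(1)
n = 500
for i in (1000, 2000, 4000):
    s = np.linalg.svd(rng.standard_normal((i, n)), compute_uv=False)
    print(f"i={i}: sigma_max={s[0]:8.2f} (~{np.sqrt(i) + np.sqrt(n):8.2f}), "
          f"sigma_min={s[-1]:7.2f} (~{np.sqrt(i) - np.sqrt(n):7.2f})")
```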
Our analysis shows that the condition number anomaly is not restricted to large Gaussian matrices. It is shared by a wide range of matrices, from small random matrices to large sparse matrices. The bounds that we derive have a simple geometric interpretation that helps to see what makes $\kappa(A_n)$ large and what forces the sequence $\kappa(A_i)$, $i = n, \dots, m$, to decrease. Roughly speaking, the condition number anomaly is expected whenever all the rows of the matrix have about the same size and the directions of the rows are randomly scattered. The paper presents several numerical examples that illustrate this feature.
The practical interest in the condition number anomaly comes from the use of iterative methods for solving large sparse linear systems, e.g., [
15,
16,
17,
18]. Some of these methods have the property that the asymptotic rate of convergence depends on the condition number of the related matrix. That is, a large condition number results in slow convergence, while a small condition number yields fast convergence. Assume now that such a method is used to solve a linear system whose matrix has the condition number anomaly. Then the last property implies a similar anomaly in the number of iterations. This phenomenon is called “
iterations anomaly”. The discussion in
Section 5 demonstrates this property in the methods of Richardson, Cimmino, and Jacobi. See Table 12.
2. The Ascending Behavior of the Largest Singular Values
In this section, we investigate the behavior of the sequence $\sigma_{\max}(A_i)$, $i = 1, \dots, m$. The first assertion establishes the ascending property of this sequence.
Theorem 1. The sequence $\sigma_{\max}(A_i)$, $i = 1, \dots, m$, satisfies
$$\sigma_{\max}(A_1) \le \sigma_{\max}(A_2) \le \dots \le \sigma_{\max}(A_m). \tag{10}$$
Proof. Observe that $\sigma_{\max}^2(A_i)$ is the largest eigenvalue of the matrix $A_iA_i^T$, which is a principal submatrix of $A_{i+1}A_{i+1}^T$. Hence, (10) is a direct consequence of the Cauchy interlace theorem. For statements and proofs of this theorem, see, for example, Refs. [2] (p. 441), [19] (p. 185), [20] (p. 149), [21] and [22] (p. 186). A second way to prove (10) is given below. This approach enables a closer inspection of the ascending process.
Here, we use the fact that $\sigma_{\max}^2(A_i)$ is the largest eigenvalue of the cross-product matrix $B_i = A_i^TA_i$. Let the unit vector $\mathbf{v}_i \in \mathbb{R}^n$ denote the corresponding dominant eigenvector of $B_i$. Then
$$B_i\mathbf{v}_i = \sigma_{\max}^2(A_i)\,\mathbf{v}_i \tag{11}$$
and
$$\sigma_{\max}^2(A_i) = \mathbf{v}_i^TB_i\mathbf{v}_i = \|A_i\mathbf{v}_i\|_2^2, \tag{12}$$
where $\|\cdot\|_2$ denotes the Euclidean vector norm. Note also that
$$A_{i+1} = \begin{bmatrix} A_i \\ \mathbf{a}_{i+1}^T \end{bmatrix} \tag{13}$$
and
$$B_{i+1} = A_{i+1}^TA_{i+1} = B_i + \mathbf{a}_{i+1}\mathbf{a}_{i+1}^T. \tag{14}$$
Consequently,
$$\sigma_{\max}^2(A_{i+1}) \ge \mathbf{v}_i^TB_{i+1}\mathbf{v}_i = \sigma_{\max}^2(A_i) + (\mathbf{a}_{i+1}^T\mathbf{v}_i)^2 \tag{15}$$
and, therefore,
$$\sigma_{\max}^2(A_{i+1}) \ge \sigma_{\max}^2(A_i), \tag{16}$$
which proves (10). □
Next, we provide an upper bound on the increase in $\sigma_{\max}^2(A_i)$.
Theorem 2. The inequality
$$\sigma_{\max}^2(A_{i+1}) \le \sigma_{\max}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2 \tag{17}$$
holds for $i = 1, \dots, m-1$. Proof. Let the unit vector $\mathbf{v}_{i+1}$ denote a dominant eigenvector of $B_{i+1}$, and observe that
$$\mathbf{v}_{i+1}^TB_i\mathbf{v}_{i+1} \le \max\{\mathbf{v}^TB_i\mathbf{v} \,:\, \|\mathbf{v}\|_2 = 1\} = \sigma_{\max}^2(A_i). \tag{18}$$
Hence, a further use of (14) gives
$$\sigma_{\max}^2(A_{i+1}) = \mathbf{v}_{i+1}^TB_{i+1}\mathbf{v}_{i+1} = \mathbf{v}_{i+1}^TB_i\mathbf{v}_{i+1} + (\mathbf{a}_{i+1}^T\mathbf{v}_{i+1})^2 \le \sigma_{\max}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2, \tag{19}$$
which proves (17). The last inequality in (19) is due to the Cauchy–Schwarz inequality
$$|\mathbf{a}_{i+1}^T\mathbf{v}_{i+1}| \le \|\mathbf{a}_{i+1}\|_2\|\mathbf{v}_{i+1}\|_2 \tag{20}$$
and the fact that $\|\mathbf{v}_{i+1}\|_2 = 1$. □
Combining (10) and (17) shows that
$$\sigma_{\max}^2(A_i) \le \sigma_{\max}^2(A_{i+1}) \le \sigma_{\max}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2. \tag{21}$$
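Relations (14) and (21) are easy to verify numerically. The following sketch uses arbitrary data of our own choosing:

```python
import numpy as np

# Verify the update rule (14), B_{i+1} = B_i + a_{i+1} a_{i+1}^T, and the
# two-sided bound (21) on sigma_max^2(A_{i+1}).  Arbitrary illustrative data.
rng = np.random.default_rng(2)
n, i = 8, 12
A_i = rng.standard_normal((i, n))
a = rng.standard_normal(n)              # the new row a_{i+1}
A_next = np.vstack([A_i, a])

B_i, B_next = A_i.T @ A_i, A_next.T @ A_next
assert np.allclose(B_next, B_i + np.outer(a, a))        # relation (14)

smax2_i = np.linalg.svd(A_i, compute_uv=False)[0] ** 2
smax2_next = np.linalg.svd(A_next, compute_uv=False)[0] ** 2
assert smax2_i <= smax2_next + 1e-9                     # lower bound in (21)
assert smax2_next <= smax2_i + a @ a + 1e-9             # upper bound in (21)
```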
This raises the question of for which directions of $\mathbf{a}_{i+1}$ the value of $\sigma_{\max}^2(A_{i+1})$ attains its bounds. The key to answering this question lies in the following observation.
Lemma 1. Assume for a moment that $\mathbf{a}_{i+1}$ is an eigenvector of the matrix $B_i$. That is,
$$B_i\mathbf{a}_{i+1} = \lambda\,\mathbf{a}_{i+1}, \tag{22}$$
where $\lambda$ is a nonnegative scalar. In this case, the matrix $B_{i+1}$ has the same set of eigenvectors as $B_i$. The eigenvector $\mathbf{a}_{i+1}$ satisfies
$$B_{i+1}\mathbf{a}_{i+1} = \big(\lambda + \|\mathbf{a}_{i+1}\|_2^2\big)\,\mathbf{a}_{i+1}, \tag{23}$$
while all the other eigenpairs remain unchanged. Proof. Since $\mathbf{a}_{i+1}$ is an eigenvector of $B_i$, substituting the spectral decomposition of $B_i$ into (14) yields the spectral decomposition of $B_{i+1}$. □
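A short numerical illustration of Lemma 1 (with arbitrary data of our own choosing) follows: appending a row that points along an eigenvector of $B_i$ shifts that eigenvalue by $\|\mathbf{a}_{i+1}\|_2^2$ and leaves the other eigenvalues unchanged.

```python
import numpy as np

# Illustrate Lemma 1: if a_{i+1} points along an eigenvector of B_i, then
# B_{i+1} = B_i + a_{i+1} a_{i+1}^T has the same eigenvectors, and only the
# corresponding eigenvalue moves, by ||a_{i+1}||^2.  Arbitrary data.
rng = np.random.default_rng(3)
n, i = 6, 10
A_i = rng.standard_normal((i, n))
B_i = A_i.T @ A_i

evals, evecs = np.linalg.eigh(B_i)       # eigenvalues in ascending order
a = 2.0 * evecs[:, 0]                    # new row along the smallest eigenvector
B_next = B_i + np.outer(a, a)            # relation (14)

new_evals = np.linalg.eigh(B_next)[0]
expected = np.sort(np.append(evals[1:], evals[0] + a @ a))
assert np.allclose(new_evals, expected)  # one eigenvalue shifted, rest unchanged
```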
The possibility that $\sigma_{\max}^2(A_{i+1})$ achieves its upper bound is characterized by the following assertion.
Theorem 3. Assume for a moment that $\mathbf{a}_{i+1}$ is a dominant eigenvector of $B_i$. In this case,
$$\sigma_{\max}^2(A_{i+1}) = \sigma_{\max}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2. \tag{24}$$
Otherwise, when $\mathbf{a}_{i+1}$ is not pointing toward a dominant eigenvector,
$$\sigma_{\max}^2(A_{i+1}) < \sigma_{\max}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2. \tag{25}$$
Proof. The first claim is a direct consequence of Lemma 1. To prove the second claim, we consider two cases. The first one occurs when $\mathbf{v}_{i+1}^TB_i\mathbf{v}_{i+1} = \sigma_{\max}^2(A_i)$. Since $\mathbf{a}_{i+1}$ is not at the direction of $\mathbf{v}_{i+1}$, in this case, there is a strict inequality in (20), which yields a strict inequality in (19). In the second case, $\mathbf{v}_{i+1}^TB_i\mathbf{v}_{i+1} < \sigma_{\max}^2(A_i)$, so now we have a strict inequality in (18), which leads to a strict inequality in (19). □
Finally, we consider the possibility that $\sigma_{\max}(A_{i+1}) = \sigma_{\max}(A_i)$.
Theorem 4. Assume that $i \ge n$ and that $\mathbf{a}_{i+1}$ is an eigenvector of $B_i$, which corresponds to the smallest eigenvalue of this matrix. That is,
$$B_i\mathbf{a}_{i+1} = \sigma_{\min}^2(A_i)\,\mathbf{a}_{i+1}. \tag{26}$$
If
$$\sigma_{\min}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2 \le \sigma_{\max}^2(A_i), \tag{27}$$
then $\sigma_{\max}(A_{i+1}) = \sigma_{\max}(A_i)$. Otherwise, when
$$\sigma_{\min}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2 > \sigma_{\max}^2(A_i), \tag{28}$$
the value of $\sigma_{\max}(A_{i+1})$ satisfies
$$\sigma_{\max}^2(A_{i+1}) = \sigma_{\min}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2. \tag{29}$$
Proof. From Lemma 1, we obtain that $\mathbf{a}_{i+1}$ is an eigenvector of $B_{i+1}$ whose eigenvalue equals $\sigma_{\min}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2$. Therefore, if (27) holds, then $\sigma_{\max}^2(A_i)$ remains the largest eigenvalue. Otherwise, when (28) holds, $\sigma_{\min}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2$ is the largest eigenvalue of $B_{i+1}$. □
The restriction $i \ge n$ is due to the fact that if $i < n$, then the smallest eigenvalue of $B_i$ is always zero. The extension of Theorem 4 to cover this case is achieved by setting zero instead of $\sigma_{\min}^2(A_i)$. Similar results are obtained when $\mathbf{a}_{i+1}$ points to other eigenvectors of $B_i$.
3. The Smallest Singular Value Anomaly
In this section, we explore the behavior of the smallest singular values. We shall start by proving that the sequence $\sigma_{\min}(A_i)$, $i = 1, \dots, n$, is descending. The proof uses the fact that for $i \le n$, the smallest eigenvalue of $A_iA_i^T$ is $\sigma_{\min}^2(A_i)$.
Theorem 5. For $i = 1, \dots, n-1$, we have the inequality
$$\sigma_{\min}(A_{i+1}) \le \sigma_{\min}(A_i). \tag{30}$$
Proof. The matrix $A_iA_i^T$ is a principal submatrix of $A_{i+1}A_{i+1}^T$. Hence, (30) is a direct corollary of the Cauchy interlace theorem. □
Next, we show that the sequence $\sigma_{\min}(A_i)$, $i = n, \dots, m$, is ascending.
Theorem 6. For $i = n, \dots, m-1$, we have the inequality
$$\sigma_{\min}(A_i) \le \sigma_{\min}(A_{i+1}). \tag{31}$$
Proof. One way to conclude (31) is by using the fact that $A_iA_i^T$ is a principal submatrix of $A_{i+1}A_{i+1}^T$. Let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_i$ and $\mu_1 \ge \mu_2 \ge \dots \ge \mu_{i+1}$ denote the eigenvalues of these matrices. Then, since $i \ge n$, $\sigma_{\min}^2(A_i) = \lambda_n$, $\sigma_{\min}^2(A_{i+1}) = \mu_n$, and (31) is a direct consequence of the Cauchy interlace theorem.
As before, a second proof is obtained by comparing the matrices $B_i$ and $B_{i+1}$, and this approach provides us with useful inequalities. Let the unit vector $\mathbf{u}_{i+1} \in \mathbb{R}^n$ denote an eigenvector of $B_{i+1}$ that corresponds to $\sigma_{\min}^2(A_{i+1})$. Then
$$B_{i+1}\mathbf{u}_{i+1} = \sigma_{\min}^2(A_{i+1})\,\mathbf{u}_{i+1}, \tag{32}$$
and $\sigma_{\min}^2(A_i)$ has the minimum property
$$\sigma_{\min}^2(A_i) = \min\{\mathbf{v}^TB_i\mathbf{v} \,:\, \|\mathbf{v}\|_2 = 1\}. \tag{33}$$
The last property implies the inequality
$$\mathbf{u}_{i+1}^TB_i\mathbf{u}_{i+1} \ge \sigma_{\min}^2(A_i), \tag{34}$$
while a further use of (14) gives
$$\sigma_{\min}^2(A_{i+1}) = \mathbf{u}_{i+1}^TB_{i+1}\mathbf{u}_{i+1} = \mathbf{u}_{i+1}^TB_i\mathbf{u}_{i+1} + (\mathbf{a}_{i+1}^T\mathbf{u}_{i+1})^2 \ge \sigma_{\min}^2(A_i) + (\mathbf{a}_{i+1}^T\mathbf{u}_{i+1})^2. \tag{35}$$
□
The inequality (35) implies that the growth of $\sigma_{\min}^2(A_{i+1})$ depends on the size of the scalar product $\mathbf{a}_{i+1}^T\mathbf{u}_{i+1}$. Basically, it is difficult to estimate this product, but Lemma 1 and Theorem 4 give some insight. For example, if $\mathbf{a}_{i+1}$ is an eigenvector of $B_i$ whose eigenvalue differs from $\sigma_{\min}^2(A_i)$, then $\sigma_{\min}(A_{i+1}) = \sigma_{\min}(A_i)$. If $\mathbf{a}_{i+1}$ is an eigenvector that corresponds to $\sigma_{\min}^2(A_i)$, there are two possibilities to consider. If $\sigma_{\min}^2(A_i)$ is a multiple eigenvalue, then, again, $\sigma_{\min}(A_{i+1}) = \sigma_{\min}(A_i)$. Otherwise, when $\sigma_{\min}^2(A_i)$ is a simple eigenvalue,
$$\sigma_{\min}^2(A_{i+1}) = \sigma_{\min}^2(A_i) + \min\big\{\|\mathbf{a}_{i+1}\|_2^2,\ \delta\big\}, \tag{36}$$
where $\delta$ is the difference between the two smallest eigenvalues of $B_i$.
We have seen that the sequence $\sigma_{\min}(A_i)$, $i = 1, \dots, n$, is descending, while the sequence $\sigma_{\min}(A_i)$, $i = n, \dots, m$, is ascending. This behavior is called the smallest singular value anomaly. The fact that $\sigma_{\min}(A_n)$ is the smallest singular value in the whole sequence raises the question of what makes $\sigma_{\min}(A_n)$ small. Clearly, $\sigma_{\min}(A_n)$ is always smaller than
$$\sigma_{\min}(A_1) = \|\mathbf{a}_1\|_2. \tag{37}$$
Thus, to obtain a meaningful answer, we make the simplifying assumption
$$\|\mathbf{a}_i\|_2 = 1 \quad \text{for} \quad i = 1, \dots, m, \tag{38}$$
which enables the following bounds.
Lemma 2. Assume that (38) holds and define
$$c = \max\big\{|\mathbf{a}_j^T\mathbf{a}_k| \,:\, 1 \le j < k \le n\big\}. \tag{39}$$
Then,
$$\sigma_{\max}(A_n) \ge \sqrt{1 + c} \tag{40}$$
and
$$\sigma_{\min}(A_n) \le \sqrt{1 - c}. \tag{41}$$
Proof. It is possible to assume that the above maximum is attained for the first two rows and that $\mathbf{a}_1^T\mathbf{a}_2 = c \ge 0$. In this case,
$$A_2A_2^T = \begin{bmatrix} 1 & c \\ c & 1 \end{bmatrix},$$
and the eigenvalues of this matrix are $1 + c$ and $1 - c$. Therefore, since $A_2A_2^T$ is a principal submatrix of $A_nA_n^T$, the Cauchy interlace theorem implies (40) and (41). □
Usually, the bound (41) is a crude estimate of $\sigma_{\min}(A_n)$. Yet, in some cases, it is the reason for a small value of $\sigma_{\min}(A_n)$.
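The following sketch (with arbitrary unit-row data of our own choosing) illustrates both the smallest singular value anomaly and the bound (41):

```python
import numpy as np

# Illustrate the smallest singular value anomaly and the bound (41) for a
# random matrix with unit rows (assumption (38)).  Sizes are arbitrary.
rng = np.random.default_rng(4)
m, n = 120, 15
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)          # unit rows

sig_min = [np.linalg.svd(A[:i, :], compute_uv=False)[-1]
           for i in range(1, m + 1)]
print(1 + int(np.argmin(sig_min)), n)                  # minimum attained at i = n

G = A[:n, :] @ A[:n, :].T                              # Gram matrix of first n rows
c = np.max(np.abs(G - np.eye(n)))                      # the constant c of (39)
print(sig_min[n - 1], np.sqrt(1.0 - c))                # bound (41): left <= right
```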
4. The Condition Number Anomaly
In this section, we investigate the behavior of the sequence $\kappa(A_i)$, $i = n, \dots, m$. The discussion is carried out under the assumption that $\sigma_{\min}(A_n) > 0$, which ensures that $\sigma_{\min}(A_i) > 0$ for $i = n, \dots, m$. We have seen that the sequence $\sigma_{\max}(A_i)$, $i = 1, \dots, n$, is ascending, while the sequence $\sigma_{\min}(A_i)$, $i = 1, \dots, n$, is descending. This proves that the sequence $\kappa(A_i)$, $i = 1, \dots, n$, is ascending. That is,
$$\kappa(A_1) \le \kappa(A_2) \le \dots \le \kappa(A_n). \tag{42}$$
It is also known that the sequences $\sigma_{\max}(A_i)$, $i = n, \dots, m$, and $\sigma_{\min}(A_i)$, $i = n, \dots, m$, are ascending, but this does not provide decisive information about the behavior of the sequence $\kappa(A_i)$, $i = n, \dots, m$. We shall start with examples that illustrate this point.
Example 1. This example shows that $\kappa(A_{i+1})$ can be larger than $\kappa(A_i)$. For this purpose, consider the case when $\mathbf{a}_{i+1}$ is a dominant eigenvector of $B_i$. Then from Lemma 1 we see that $\sigma_{\max}^2(A_{i+1}) = \sigma_{\max}^2(A_i) + \|\mathbf{a}_{i+1}\|_2^2$ but $\sigma_{\min}(A_{i+1}) = \sigma_{\min}(A_i)$, which means that $\kappa(A_{i+1}) > \kappa(A_i)$.
Example 2. A similar situation arises when A has the following property. Assume that as i grows, the sequence of row directions $\mathbf{a}_i/\|\mathbf{a}_i\|_2$, $i = 1, 2, 3, \dots$, converges rapidly toward some vector. In this case, the sequence of dominant eigenvectors $\mathbf{v}_i$ converges to the same vector, which brings us close to the situation of Example 1 (Tables 3 and 10 illustrate this possibility).
Example 3. The third example shows that $\kappa(A_{i+1})$ can be smaller than $\kappa(A_i)$. Consider the case described in Theorem 4, when (27) holds. Here $\sigma_{\max}(A_{i+1}) = \sigma_{\max}(A_i)$, while $\sigma_{\min}(A_{i+1}) \ge \sigma_{\min}(A_i)$, so $\kappa(A_{i+1}) \le \kappa(A_i)$. More reasons that force a decrease are given in Corollary 1 below.

Example 4. The fourth example describes a situation in which the condition number behaves in a cyclic manner. Let $B \in \mathbb{R}^{p \times n}$ be a given matrix with $\sigma_{\min}(B) > 0$. Let the matrix A be obtained by duplicating B k times. That is, $m = kp$ and
$$A = [B^T, B^T, \dots, B^T]^T. \tag{43}$$
Then, when i takes the values $i = jp$, $j = 1, \dots, k$, the matrix $A_i$ has the form
$$A_i = [B^T, \dots, B^T]^T \quad (j \text{ copies of } B). \tag{44}$$
Hence, for these values of i, we have $\sigma_{\max}(A_i) = \sqrt{j}\,\sigma_{\max}(B)$ and $\sigma_{\min}(A_i) = \sqrt{j}\,\sigma_{\min}(B)$, but $\kappa(A_i) = \kappa(B)$.

The situation in which the sequence $\kappa(A_i)$, $i = n, \dots, m$, is descending, as in (6), is called the condition number anomaly. The reasons behind this behavior are explained below; a short sketch illustrating Example 4 is given first.
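The following sketch (with an arbitrary block $B$ of our own choosing) reproduces the cyclic behavior of Example 4:

```python
import numpy as np

# Reproduce the cyclic behavior of Example 4 with an arbitrary block B:
# stacking j copies of B scales both extreme singular values by sqrt(j),
# so kappa(A_{jp}) returns to kappa(B) at every multiple of p.
rng = np.random.default_rng(5)
p, n, k = 30, 10, 4
B = rng.standard_normal((p, n))
A = np.vstack([B] * k)                      # m = k * p rows

s_B = np.linalg.svd(B, compute_uv=False)
for j in range(1, k + 1):
    s = np.linalg.svd(A[: j * p, :], compute_uv=False)
    assert np.allclose(s[0], np.sqrt(j) * s_B[0])      # sigma_max scales
    assert np.allclose(s[-1], np.sqrt(j) * s_B[-1])    # sigma_min scales
    print(j * p, s[0] / s[-1])                         # kappa(A_{jp}) = kappa(B)
```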
Theorem 7. Let the positive parameters $\alpha_{i+1}$ and $\beta_{i+1}$ be defined by the equalities
$$(\mathbf{a}_{i+1}^T\mathbf{v}_{i+1})^2 = \alpha_{i+1}\,\sigma_{\max}^2(A_i) \tag{45}$$
and
$$(\mathbf{a}_{i+1}^T\mathbf{u}_{i+1})^2 = \beta_{i+1}\,\sigma_{\min}^2(A_i). \tag{46}$$
Then, for $i = n, \dots, m-1$,
$$\frac{\kappa(A_{i+1})}{\kappa(A_i)} \le \left(\frac{1 + \alpha_{i+1}}{1 + \beta_{i+1}}\right)^{1/2}. \tag{47}$$
Proof. From (19), we see that
$$\sigma_{\max}^2(A_{i+1}) \le \sigma_{\max}^2(A_i) + (\mathbf{a}_{i+1}^T\mathbf{v}_{i+1})^2 = (1 + \alpha_{i+1})\,\sigma_{\max}^2(A_i). \tag{48}$$
Similarly, from (35), we obtain
$$\sigma_{\min}^2(A_{i+1}) \ge \sigma_{\min}^2(A_i) + (\mathbf{a}_{i+1}^T\mathbf{u}_{i+1})^2 = (1 + \beta_{i+1})\,\sigma_{\min}^2(A_i). \tag{49}$$
Hence, combining these inequalities gives (47). □

Corollary 1. If $\beta_{i+1} \ge \alpha_{i+1}$, then
$$\kappa(A_{i+1}) \le \kappa(A_i). \tag{50}$$
□
The last corollary is a key observation that indicates in which situations the condition number anomaly is likely to occur. Assume for a moment that the direction of $\mathbf{a}_{i+1}$ is chosen in some random way. Then, the scalar product terms $(\mathbf{a}_{i+1}^T\mathbf{v}_{i+1})^2$ and $(\mathbf{a}_{i+1}^T\mathbf{u}_{i+1})^2$ are likely to be about the same size. However, since $\sigma_{\min}^2(A_i)$ is (considerably) smaller than $\sigma_{\max}^2(A_i)$, the term $\beta_{i+1}$ is expected to be larger than $\alpha_{i+1}$, which implies (50).
Summarizing the above discussion, we see that the condition number anomaly is likely to occur whenever the rows of the matrix satisfy two conditions: all the rows have about the same size, and the directions of the rows are scattered in some random way. This conclusion means that the phenomenon is shared by a wide range of matrices. The examples in
Section 6 illustrate this point.
5. Iterations Anomaly
Let $A$ and $A_i$, $i = n, \dots, m$, be as in the previous sections. Let $\mathbf{x}^* \in \mathbb{R}^n$ be an arbitrary given vector, which is used to define the vectors
$$\mathbf{b}_i = A_i\mathbf{x}^*, \quad i = n, \dots, m. \tag{51}$$
In this section, we examine how the condition number anomaly affects the convergence of certain iterative methods for solving a linear system of the form
$$A_i\mathbf{x} = \mathbf{b}_i. \tag{52}$$
We shall start by considering the Richardson method for solving the normal equations
$$A_i^TA_i\mathbf{x} = A_i^T\mathbf{b}_i; \tag{53}$$
e.g., [16,17,18]. Given a starting point $\mathbf{x}_0$, the $k$-th iteration, $k = 0, 1, 2, \dots$, of the Richardson method has the form
$$\mathbf{x}_{k+1} = \mathbf{x}_k - w\,A_i^T(A_i\mathbf{x}_k - \mathbf{b}_i), \tag{54}$$
where $w$ is a pre-assigned relaxation parameter. Recall that $A_i^T(A_i\mathbf{x}_k - \mathbf{b}_i)$ is the gradient vector of the least-squares objective function
$$F(\mathbf{x}) = \tfrac{1}{2}\,\|A_i\mathbf{x} - \mathbf{b}_i\|_2^2 \tag{55}$$
at the point $\mathbf{x}_k$. Hence, iteration (54) can be viewed as a steepest descent method for minimizing $F$ that uses a fixed step length. An equivalent way to write (54) is
$$\mathbf{x}_{k+1} - \mathbf{x}^* = (I - wA_i^TA_i)(\mathbf{x}_k - \mathbf{x}^*), \tag{56}$$
which shows that the rate of convergence of the method depends on the spectral radius of the iteration matrix
$$M_i = I - wA_i^TA_i. \tag{57}$$
Let $\rho(M_i)$ denote the spectral radius of $M_i$. Then the theory of iterative methods tells us that the method converges whenever
$$\rho(M_i) < 1, \tag{58}$$
and the smaller $\rho(M_i)$ is, the faster the convergence; see, for example, Refs. [16,17,18]. Observe that the eigenvalues of $M_i$ lie in the interval $[\,1 - w\sigma_{\max}^2(A_i),\ 1 - w\sigma_{\min}^2(A_i)\,]$. This shows that (58) holds for values of $w$ that satisfy
$$0 < w < 2/\sigma_{\max}^2(A_i). \tag{59}$$
Furthermore, let $w^*$ denote the optimal value of $w$, for which $\rho(M_i)$ attains its smallest value. Then
$$w^* = \frac{2}{\sigma_{\max}^2(A_i) + \sigma_{\min}^2(A_i)} \tag{60}$$
and
$$\rho^* = \frac{\sigma_{\max}^2(A_i) - \sigma_{\min}^2(A_i)}{\sigma_{\max}^2(A_i) + \sigma_{\min}^2(A_i)} = \frac{\kappa^2(A_i) - 1}{\kappa^2(A_i) + 1}. \tag{61}$$
See [17] (pp. 22–23) and [18] (pp. 114–115) for a detailed discussion of these results. Consequently, as $\kappa(A_i)$ increases, the spectral radius of the iteration matrix approaches 1, and the rate of convergence slows down. That is, the condition number anomaly results in a similar anomaly in the number of iterations. See Table 12.
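A compact sketch of iteration (54) with the optimal parameter (60) follows. It is our own illustration with an arbitrary test matrix, not the experiment of Table 12, and it shows how the iteration count falls as rows are added:

```python
import numpy as np

# Richardson iteration (54) for the normal equations, with the optimal
# relaxation parameter w* of (60).  The test problem is arbitrary; the
# iteration count drops as i grows past n and kappa(A_i) decreases.
def richardson(A, b, tol=1e-8, max_iter=10**7):
    s = np.linalg.svd(A, compute_uv=False)
    w = 2.0 / (s[0] ** 2 + s[-1] ** 2)       # optimal w, Equation (60)
    x = np.zeros(A.shape[1])
    for k in range(max_iter):
        g = A.T @ (A @ x - b)                # gradient of 0.5 * ||Ax - b||^2
        if np.linalg.norm(g) <= tol:
            break
        x -= w * g                           # steepest descent, fixed step
    return x, k

rng = np.random.default_rng(7)
m, n = 400, 20
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # unit rows, as in Section 6
x_star = rng.standard_normal(n)
for i in (n, 2 * n, m):
    _, its = richardson(A[:i, :], A[:i, :] @ x_star)
    print(i, its)                            # fewer iterations as i grows
```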
Another useful iterative method for solving large sparse linear systems is the Cimmino method; e.g., [15,16,18,23]. Let the unit vectors
$$\hat{\mathbf{a}}_j = \mathbf{a}_j/\|\mathbf{a}_j\|_2, \quad j = 1, \dots, m, \tag{62}$$
be obtained by normalizing the rows of $A$. Let $\hat{A}$ be an $m \times n$ matrix whose rows are $\hat{\mathbf{a}}_j^T$, $j = 1, \dots, m$, and let $D$ denote the $m \times m$ diagonal matrix
$$D = \mathrm{diag}\{\|\mathbf{a}_1\|_2, \|\mathbf{a}_2\|_2, \dots, \|\mathbf{a}_m\|_2\}. \tag{63}$$
Then $\hat{A} = D^{-1}A$. Similarly, we define
$$\hat{A}_i = D_i^{-1}A_i \quad \text{and} \quad \hat{\mathbf{b}}_i = D_i^{-1}\mathbf{b}_i \quad \text{for} \quad i = n, \dots, m, \tag{64}$$
where $D_i = \mathrm{diag}\{\|\mathbf{a}_1\|_2, \dots, \|\mathbf{a}_i\|_2\}$. Then the Cimmino method is aimed at solving the linear system
$$\hat{A}_i\mathbf{x} = \hat{\mathbf{b}}_i, \tag{65}$$
or the related normal equations
$$\hat{A}_i^T\hat{A}_i\mathbf{x} = \hat{A}_i^T\hat{\mathbf{b}}_i. \tag{66}$$
The $k$th iteration of the Cimmino method has the form
$$\mathbf{x}_{k+1} = \mathbf{x}_k + w\sum_{j=1}^{i}\nu_j\big(\hat{b}_j - \hat{\mathbf{a}}_j^T\mathbf{x}_k\big)\hat{\mathbf{a}}_j, \tag{67}$$
where $w$ is a pre-assigned relaxation parameter, $\hat{b}_j$ denotes the $j$th entry of $\hat{\mathbf{b}}_i$, and $\nu_1, \dots, \nu_i$ are weighting parameters that satisfy
$$\nu_j > 0 \quad \text{for} \quad j = 1, \dots, i, \quad \text{and} \quad \sum_{j=1}^{i}\nu_j = 1. \tag{68}$$
Observe that the point
$$\mathbf{p}_j = \mathbf{x}_k + \big(\hat{b}_j - \hat{\mathbf{a}}_j^T\mathbf{x}_k\big)\hat{\mathbf{a}}_j \tag{69}$$
is the projection of $\mathbf{x}_k$ on the hyperplane $\hat{\mathbf{a}}_j^T\mathbf{x} = \hat{b}_j$, and the point $\sum_{j=1}^{i}\nu_j\mathbf{p}_j$ is a weighted average of these projections. The usual way to apply the Cimmino method is with equal weights. That is,
$$\nu_j = 1/i \quad \text{for} \quad j = 1, \dots, i. \tag{70}$$
This enables us to rewrite the Cimmino iteration in the form
$$\mathbf{x}_{k+1} = \mathbf{x}_k - (w/i)\,\hat{A}_i^T\big(\hat{A}_i\mathbf{x}_k - \hat{\mathbf{b}}_i\big), \tag{71}$$
which is the Richardson iteration for solving the normal Equations (66). Therefore, from (61), we conclude that the optimal rate of convergence of the Cimmino method depends on the ratio
$$\kappa(\hat{A}_i) = \sigma_{\max}(\hat{A}_i)/\sigma_{\min}(\hat{A}_i), \tag{72}$$
where $\kappa(\hat{A}_i)$ is the condition number of $\hat{A}_i$.
Another example is the Jacobi iteration for solving the equations
$$A_iA_i^T\mathbf{y} = \mathbf{b}_i. \tag{73}$$
The basic iteration of this method has the form
$$\mathbf{y}_{k+1} = \mathbf{y}_k + w\,D_i^{-2}(\mathbf{b}_i - A_iA_i^T\mathbf{y}_k), \tag{74}$$
where $D_i$ is the diagonal matrix defined via (63) (note that $D_i^2$ is the diagonal part of $A_iA_i^T$) and $w$ is a pre-assigned relaxation parameter. Now the equalities
$$D_i\big(I - wD_i^{-2}A_iA_i^T\big)D_i^{-1} = I - wD_i^{-1}A_iA_i^TD_i^{-1} = I - w\hat{A}_i\hat{A}_i^T \tag{75}$$
indicate that the iteration matrix of the Jacobi method is similar to the matrix $I - w\hat{A}_i\hat{A}_i^T$. Hence, as before, the optimal rate of convergence depends on the ratio $\kappa(\hat{A}_i)$. Thus, again, a condition number anomaly invites a similar anomaly in the number of iterations.
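The sketch below (arbitrary data of our own choosing, with unit rows so that $D_i = I$) runs the Jacobi step and checks the substitution $\mathbf{x}_k = A_i^T\mathbf{y}_k$ that links it to Richardson iteration (54):

```python
import numpy as np

# Jacobi iteration for A_i A_i^T y = b_i with unit rows, so D_i = I and the
# step reduces to y <- y + w (b_i - A_i A_i^T y).  The substitution
# x_k = A_i^T y_k reproduces Richardson iteration (54) for A_i x = b_i.
rng = np.random.default_rng(8)
i, n = 60, 12
A = rng.standard_normal((i, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # unit rows: diag(A A^T) = I
x_star = rng.standard_normal(n)
b = A @ x_star

s = np.linalg.svd(A, compute_uv=False)
w = 2.0 / (s[0] ** 2 + s[-1] ** 2)               # optimal relaxation parameter
y = np.zeros(i)
for _ in range(20000):
    y += w * (b - A @ (A.T @ y))                 # Jacobi step with D_i = I
print(np.linalg.norm(A.T @ y - x_star))          # x_k = A_i^T y_k -> x*
```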
We shall finish this section by mentioning two further methods that share this behavior. The first one is the conjugate gradient algorithm for solving the normal Equations (53), whose rate of convergence slows down as the condition number of $A_i$ increases. See, for example, Refs. [1] (pp. 312–314), [3] (pp. 299–300) and [18] (pp. 203–205). The second is Kaczmarz's method, which is a popular “row-action” method; see Refs. [15,16,23,24,25]. The use of this method to solve (52) is equivalent to the SOR method for solving the system (73), and both methods have the property that a small condition number results in fast convergence, while a large condition number slows it down [24,25].
6. Numerical Examples
In this section, we present several examples that illustrate the actual behavior of the anomaly phenomena. The first examples consider small matrices.
Table 1 describes the anomaly in a “two-ones” matrix. This matrix has $n(n-1)/2$ different rows. Each row has only two nonzero entries, and each nonzero entry has the value 1 (a matrix with $n$ columns has at most $\binom{n}{2} = n(n-1)/2$ different rows of this type). This matrix exhibits a moderate anomaly, due to the fact that $A_n$ is well conditioned.
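A possible construction of this matrix is sketched below; the row ordering is our own arbitrary choice:

```python
import numpy as np
from itertools import combinations

# Build a "two-ones" matrix: every row has exactly two entries equal to 1,
# one row for each of the n*(n-1)/2 index pairs.  The row ordering here is
# an arbitrary choice.
n = 8
A = np.zeros((n * (n - 1) // 2, n))
for row, (p, q) in enumerate(combinations(range(n), 2)):
    A[row, p] = A[row, q] = 1.0

kappa = []
for i in range(n, A.shape[0] + 1):
    s = np.linalg.svd(A[:i, :], compute_uv=False)
    kappa.append(s[0] / s[-1])
print(kappa[0], kappa[-1])    # kappa(A_n) versus kappa(A_m): a moderate anomaly
```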
Table 2 describes the anomaly in a small $m \times n$ segment of the Hilbert matrix. Here, the $(i, j)$ entry equals $1/(i + j - 1)$. Consequently, the sequence of row directions $\mathbf{a}_i/\|\mathbf{a}_i\|_2$, $i = 1, 2, 3, \dots$, converges slowly toward the vector $\mathbf{e}/\sqrt{n}$, where $\mathbf{e} = (1, 1, \dots, 1)^T$. Hence, the decrease in the sequence $\kappa(A_i)$, $i = n, \dots, m$, is quite moderate.
In Table 3, we consider a small $m \times n$ segment of the Pascal matrix. Recall that the entries of this matrix are built in the following way: $a_{1j} = 1$ for $j = 1, \dots, n$, and $a_{i1} = 1$ for $i = 1, \dots, m$. The other entries are obtained from the rule
$$a_{ij} = a_{i-1,j} + a_{i,j-1}, \quad i, j \ge 2. \tag{76}$$
In this matrix, the norm of the rows grows very fast, while the sequence of row directions $\mathbf{a}_i/\|\mathbf{a}_i\|_2$ converges rapidly toward the vector $\mathbf{e}_n = (0, \dots, 0, 1)^T$. Thus, as $i$ becomes considerably larger than $n$, both $\mathbf{a}_{i+1}/\|\mathbf{a}_{i+1}\|_2$ and the dominant eigenvector $\mathbf{v}_i$ approach $\mathbf{e}_n$, which causes $\kappa(A_{i+1})$ to be larger than $\kappa(A_i)$.
The random matrices that are tested in Table 4 and Table 5 provide nice examples of the anomaly phenomenon. In these matrices, each entry is a random number from the interval $[-1, 1]$. To generate these matrices, and the other random matrices, we used MATLAB's command “rand”, whose random number generator draws from a uniform distribution. Similar results are obtained when “rand” is replaced with “randn”, which uses a normal distribution.
The nonnegative random matrix that is tested in Table 6 is obtained by MATLAB's command rand(m, n). That is, here, each entry of $A$ is a random number from the interval $[0, 1]$. This yields a more ill-conditioned matrix and a sharper anomaly.
Table 7 and Table 8 consider a different type of random matrices. As its name says, the entries of the “$-1$ or 1” matrix are either $-1$ or 1, with equal probability. In practice, the $(i, j)$ entry, $a_{ij}$, is defined in the following way. First, sample a random number, $r$, from the interval $[0, 1]$. If $r < 1/2$, then $a_{ij} = -1$; otherwise, $a_{ij} = 1$. The entries of the “0 or 1” matrix are defined in a similar manner: if $r < 1/2$, then $a_{ij} = 0$; otherwise, $a_{ij} = 1$. Both matrices display a strong anomaly. The “0 or 1” matrix is slightly more ill conditioned and, therefore, has a sharper anomaly.
The results of Table 9 and Table 10 are quite instructive. Both matrices are highly ill conditioned but display different behaviors. The “narrow range” matrix is a random matrix whose entries are sampled from the small interval [0.99, 1.01]. However, the directions of the rows do not converge, and the matrix displays a nice anomaly. The “converging rows” matrix is defined in a slightly different way. Here, the entries in the $i$th row, $i = 1, \dots, m$, are random numbers from an interval of the form $[1 - \epsilon_i, 1 + \epsilon_i]$, where $\epsilon_i$ decreases toward zero as $i$ grows. Hence, the related sequence of row directions, $\mathbf{a}_i/\|\mathbf{a}_i\|_2$, converges toward the vector $\mathbf{e}/\sqrt{n}$, which is the situation described in Example 2. Consequently, when $i$ becomes much larger than $n$, we see a moderate increase in the value of $\kappa(A_i)$.
Other matrices that possess the anomaly phenomena are large sparse matrices. The matrix in Table 11 is created by using MATLAB's command sprand(m, n, density) with $m = 100{,}000$, $n = 10{,}000$, and density $= 0.01$. This way, each row of $A$ has nearly 100 nonzero entries that have random values and random locations. Although not illustrated in this paper, our experience shows that the smaller the density, the sharper the anomaly.
Table 12 illustrates the iterations anomaly phenomenon when using the methods of Richardson, Cimmino, and Jacobi. The first two methods were used to solve linear systems of the form
$$A_i\mathbf{x} = \mathbf{b}_i, \quad i = n, \dots, m. \tag{77}$$
As before, each $A_i$ is an $i \times n$ submatrix that is composed from the first $i$ rows of a given $m \times n$ matrix $A$. The construction of $A$ is done in two steps. First, we generate a random matrix as in Table 4 and Table 5. Then the rows of the matrix are normalized to be unit vectors. The vector $\mathbf{b}_i$ is defined by the product
$$\mathbf{b}_i = A_i\mathbf{x}^*, \tag{78}$$
which ensures that $\mathbf{x}^*$ solves the linear system. Since $A_i$ has unit rows, Cimmino iteration (71) coincides with Richardson iteration (54). The value of $w$ that we use is the optimal one,
$$w^* = \frac{2}{\sigma_{\max}^2(A_i) + \sigma_{\min}^2(A_i)}, \tag{79}$$
and the iterations start from a fixed starting point $\mathbf{x}_0$. The iterative process is terminated as soon as the residual vector $\mathbf{r}_k = \mathbf{b}_i - A_i\mathbf{x}_k$ satisfies a stopping condition of the form
$$\|\mathbf{r}_k\|_2 \le \varepsilon, \tag{80}$$
where $\varepsilon$ is a preset small tolerance. The number of iterations which are required to satisfy this condition is displayed in the last column of Table 12.
The Jacobi method was used to solve the linear systems (73), where $A_i$ and $\mathbf{b}_i$ are defined as above. Since $A_i$ has unit rows, $D_i$ is a unit matrix, and Jacobi iteration (74) is reduced to
$$\mathbf{y}_{k+1} = \mathbf{y}_k + w\,(\mathbf{b}_i - A_iA_i^T\mathbf{y}_k). \tag{81}$$
The last iteration uses the optimal value of $w$, given in (79). It starts from a fixed starting point $\mathbf{y}_0$ and terminates as soon as the residual vector $\mathbf{b}_i - A_iA_i^T\mathbf{y}_k$ satisfies a stopping condition analogous to (80). The number of required iterations is nearly identical to that of the Richardson method. This is not surprising, since multiplying (81) by $A_i^T$ shows that the sequence $\mathbf{x}_k = A_i^T\mathbf{y}_k$ is generated by Richardson iteration (54). (There were only two minor exceptions: in one case, the Jacobi method required 36,959 iterations instead of 36,960, while in another, it required 6,772,151 iterations instead of 6,760,589. In all the other cases, the two methods required exactly the same number of iterations.)
The figures in Table 12 demonstrate the close link between the condition number and the rate of convergence. As anticipated from (61), for large values of $\kappa(A_i)$ the spectral radius approaches 1 and the rate of convergence slows down. Thus, a large condition number results in a large number of iterations. Conversely, a small value of $\kappa(A_i)$ implies a small spectral radius and a small number of iterations. In other words, a condition number anomaly invites a similar anomaly in the number of iterations.
Usually, it is reasonable to assume that the computational effort in solving a linear system is proportional to the number of rows. That is, the more rows we have, the more computation time is needed. From this point of view, the iterations anomaly phenomenon is somewhat surprising, as solving a linear system with $m$ rows may need considerably less time than solving a linear system with only $n$ rows.
7. Concluding Remarks
As an old adage says, the whole is sometimes much more than the sum of its parts. The basic ascending (descending) properties of singular values are easily concluded from the Cauchy interlace theorem, while the inequalities that we derive enable us to see what causes a large, or small, increase. Combining these results gives a better overview of the whole situation. One consequence regards the anomalous behavior of the smallest singular values sequence $\sigma_{\min}(A_i)$, $i = 1, \dots, m$, and the fact that $\sigma_{\min}(A_n)$ is the smallest number in this sequence. The second observation is about the condition number anomaly. It is easy to conclude that the sequence of condition numbers $\kappa(A_i)$, $i = 1, \dots, n$, is increasing, but the Cauchy interlace theorem does not tell us how the rest of this sequence behaves. The answer is obtained by considering the bound (47) on the ratio $\kappa(A_{i+1})/\kappa(A_i)$. This expression explains the reasons behind the condition number anomaly and characterizes situations that invite (or exclude) such behavior. We see that the anomaly phenomenon is likely to occur in “random-like” matrices whose rows satisfy two conditions: all the rows have about the same size, and the directions of the rows scatter in some random way. This suggests that the condition number anomaly phenomenon is common in several types of matrices, and the numerical examples illustrate this point.
The practical importance of the condition number anomaly lies in the use of iterative methods for solving large linear systems. As we have seen, several iterative solvers have the property that the rate of convergence depends on the condition number. Therefore, when solving “random-like” systems, a fast rate of convergence is expected in under-determined or over-determined systems, while a slower rate is expected in (nearly) square systems.