Abstract
In this paper, we consider the application of brute force computational techniques (BFCTs) for solving computational problems in mathematical analysis and matrix algebra in a floating-point computing environment. These techniques include, among others, simple matrix computations and the analysis of graphs of functions. Since BFCTs are based on matrix calculations, the program system MATLAB® is suitable for their computer realization. The computations in this paper are performed in double precision floating-point arithmetic, obeying the 2019 IEEE Standard for binary floating-point calculations. One of the aims of this paper is to analyze cases where popular algorithms and software fail to produce correct answers without alerting the user. In real-time control applications, this may have catastrophic consequences with heavy material damage and human casualties. It is known, or suspected, that a number of man-made catastrophes such as the Dhahran accident (1991), Ariane 5 launch failure (1996), Boeing 737 Max tragedies (2018, 2019) and others are due to errors in the computer software and hardware. Another application of BFCTs is finding good initial guesses for known computational algorithms. Sometimes, simple and relatively fast BFCTs are useful tools in solving computational problems correctly and in real time. Among particular problems considered are the genuine addition of machine numbers, numerically stable computations, finding minimums of arrays, the minimization of functions, solving finite equations, integration and differentiation, computing condensed and canonical forms of matrices and clarifying the concepts of the least squares method in the light of the conflict of remainders vs. errors. Usually, BFCTs are applied under the user’s supervision, which is not possible in the automatic implementation of computational methods. To implement BFCTs automatically is a challenging problem in the area of artificial intelligence and of mathematical artificial intelligence in particular. BFCTs make it possible to reveal the underlying arithmetic in the performance of computational algorithms. Last but not least, this paper has tutorial value, as computational algorithms and mathematical software are often taught without considering the properties of computational algorithms and machine arithmetic.
1. Introduction and Notation
1.1. Preliminaries
In the last 300 years, sophisticated computational algorithms have been developed to solve a wide variety of problems in the mathematical and engineering sciences [1,2,3,4,5]. With the advent of digital computers in the mid-20th century, it became evident that some historical algorithms are unsuitable for direct computer implementation without significant modifications. Meanwhile, other algorithms, despite their cleverness, became obsolete. Consequently, paradigms in scientific computation have evolved, resulting in the creation and deployment of novel computational algorithms as computer codes.
When implemented as computer codes in a floating-point computing environment, such as the one realized in the program systems MATLAB® [6,7] and Octave [8], some of these computational algorithms may produce wrong results without warning. In such cases, BFCTs may give acceptable results. BFCTs may also reveal flaws in symbolic codes such as sym(x,'e') from [6].
Other computer systems, such as Maple [9] and Mathematica [10,11], use variable precision arithmetic to potentially avoid these pitfalls. However, this may lead to an increase in the time required for algorithm performance. In such cases, BFCTs and a detailed analysis of computational problems with reference solution may be useful.
To evaluate and improve BFCTs’ scalability, optimization or parallelization strategies can be employed on larger datasets and higher-dimensional problems. Techniques such as parallel computing, CPU acceleration [12,13], and cloud computing [14] are particularly advantageous. This seems to be a new and challenging research area.
BFCTs may also be used to obtain initial guesses for computational algorithms intended to solve finite equations and minimize functions of one or several scalar arguments. We stress finally that Monte Carlo algorithms [15] are also a form of BFCT. As a perspective, a combination of artificial intelligence [16,17] with BFCTs is yet to be developed.
The computations in this paper are performed using double precision binary floating-point arithmetic obeying the IEEE Standard [18]. Some of the matrix notations used are inspired by the language of MATLAB®. Some of the codes for numerical experiments are given so the results can be independently checked. We finally stress that sometimes, BFCTs reveal the underlying machine arithmetic of the used computing environment.
1.2. General Notations
Next, we use the following general notations.
- Z—set of integers;
- p:q—set of integers p, p + 1, …, q, where p ≤ q;
- 1:n—set of integers 1, 2, …, n;
- R (C)—set of real (complex) numbers;
- i—imaginary unit, where i^2 = −1;
- x_1 + x_2 + ⋯ + x_n—sum of the scalar quantities x_1, x_2, …, x_n;
- ∫_a^b f(x) dx—definite integral of the function f on the interval [a, b];
- ≺—lexicographical order on the set of pairs of integers, defined as (p_1, q_1) ≺ (p_2, q_2) if p_1 < p_2, or p_1 = p_2 and q_1 < q_2; we write (p_1, q_1) ⪯ (p_2, q_2) if (p_1, q_1) ≺ (p_2, q_2), or (p_1, q_1) = (p_2, q_2);
- F^(m×n)—set of m × n matrices A with elements a_pq from F, where F is one of the sets Z, R or C;
- F^n := F^(n×1);
- A(p,:)—pth row of A;
- A(:,q)—qth column of A;
- ||x||—spectral (Euclidean) norm in F^n;
- ones(m,n)—m × n matrix with unit elements;
- rank(A)—rank of the matrix A;
- trace(A)—trace of the matrix A;
- det(A)—determinant of the matrix A;
- I_n = [δ_pq]—identity matrix with δ_pp = 1 and δ_pq = 0 for p ≠ q;
- E_pq—matrix with unique nonzero element, equal to 1, in position (p, q);
- N_n—nilpotent matrix with elements 1 in positions (p, p + 1) and 0 elsewhere;
- λ_k(A)—eigenvalues of the matrix A ∈ C^(n×n), where k ∈ 1:n; we usually assume that |λ_1(A)| ≥ |λ_2(A)| ≥ ⋯ ≥ |λ_n(A)|;
- σ_k(A)—singular values of the matrix A ∈ C^(m×n), where k ∈ 1:min{m, n}; the singular values are ordered as σ_1(A) ≥ σ_2(A) ≥ ⋯;
- ||A||_2 = σ_1(A)—spectral norm of the matrix A;
- cond(A) = ||A||_2 ||A^(−1)||_2—condition number of the non-singular matrix A.
1.3. Machine Arithmetic Notations
We use the following notations from binary floating-point arithmetic, where codes and results in the MATLAB® environment are written in typewriter font.
- M+ (M−)—set of finite positive (negative) machine numbers;
- M—set of finite machine numbers;
- fl(x)—rounded value of x ∈ R; fl(x) is computed in the form ±m·2^e by the command sym(x,'f'), where m and e are integers, and m is odd if x ≠ 0;
- —set of positive (negative) machine numbers;
- X+ (X−)—machine number right (left) nearest to the machine number X; X+ is obtained as X + eps(X);
- eps = 2^(−52) ≈ 2.2204 × 10^(−16)—machine epsilon; it is obtained as eps(1), or simply as eps;
- ρ = eps/2 = 2^(−53)—rounding unit;
- 2^53—maximal integer such that all positive integers up to it are machine numbers; we have 2^53 = 9007199254740992 ≈ 9.007 × 10^15;
- 2^1023 ≈ 8.99 × 10^307—maximal machine integer power of 2;
- (2 − 2^(−52))·2^1023 ≈ 1.7977 × 10^308—maximal machine number; it is also denoted as realmax and we have eps(realmax) = 2^971;
- 2^(−1022) ≈ 2.2251 × 10^(−308)—minimal positive normal machine number; it is also denoted as realmin;
- 2^(−1074) ≈ 4.94 × 10^(−324)—minimal positive subnormal machine number; it is found as eps(0);
- fl(E)—rounded value of an expression E;
- [realmin, realmax]—normal range; numbers with modulus in this interval are rounded with relative error less than ρ = 2^(−53);
- Inf (-Inf)—positive (negative) machine infinity;
- NaN—result of a mathematically undefined operation, representing Not a Number;
- M ∪ {Inf, −Inf, NaN}—extended machine set.
1.4. Software and Hardware
The computations were conducted with the program system MATLAB®, Version 9.9 (R2020b), on a Lenovo X1 Carbon 7th Generation ultrabook with 16 GB DDR4 2133 MHz memory (2 × 8 GB DIMMs) and Intel HD 620 graphics.
1.5. Problems with Reference Solution
An effective strategy for employing BFCTs involves solving CPRSs. CPRSs are specially designed computational problems with a priori known, or reference, solutions. It is supposed that the numerical algorithm which is to be checked does not “know” this fact and solves the computational problem in the standard way. Consider, for example, the finite equation , where and x are scalars or vectors. Then, a CPRS is obtained choosing , where is the reference solution. We usually choose to be a vector with unit elements, e.g., x_ref = ones(n,1) and .
Let be a solution of the equation and be an approximate solution computed in FPA. The relative error of the computed solution depends on the rounding of initial data and on errors made during the computational procedure [19,20]. To eliminate the effects of rounding initial data, we may consider CPRSs with relatively small integer data, or more generally, with data consisting of machine numbers.
Example 1.
Computational algorithms and software for solving linear algebraic equations , where the matrix is invertible, may be checked choosing and , where x_ref = ones(n,1). The computed solution is x_comp = A\b and we may compare the relative error
with the a priori error estimate eps∗cond(A).
Example 2.
Let be a differentiable highly oscillating function with derivative . The numerical integration of may be a problem. For such integrands, we may formulate a numerically difficult computational problem with reference solution . Systems with mathematical intelligence such as MATLAB® [6], Maple [9] and Mathematica [10,11] may find the integral with high accuracy by the NLF , thus avoiding numerical integration.
Example 3.
To check the codes for eigenstructure analysis of a matrix , one can select , where are integer matrices with . The eigenvalues of A are equal to 1 (the reference solution). Computational algorithms such as the QR algorithm and software for spectral matrix analysis may produce wrong results even for matrices of moderate norm.
Example 4.
When solving the differential equation with initial condition , we may choose a reference solution as follows. The function satisfies the differential equation , where
with initial condition .
An important observation is that for some computational problems such as integration and finding the extrema of functions, BFCTs may produce better results than certain sophisticated algorithms used in modern computer systems for calculating mathematics; see, e.g., Section 7.
2. Machine Arithmetic
2.1. Preliminaries
In this section, we consider computations in FPA obeying the IEEE Standard [18]. Numerical computations in MATLAB® are calculated according to this standard. The FPA consists of a set of machine numbers and rules for performing operations, e.g., addition/subtraction, multiplication/division, exponentiation, etc. Rounding and arithmetic operations in FPA are performed with an RRE of order . The sum and the product are computed as and .
A major source of errors in computations is the (catastrophic) cancellation when relatively exact close numbers are subtracted so that the information coded in their left-most digits is lost [21]. Thus, catastrophic cancellation is the phenomenon when subtracting good approximations to close numbers may cause a bad approximation to their difference. Less known is that genuine addition in FPA may also cause large and even unlimited errors.
A tutorial example of cancellation is the solution of quadratic equations. For the last 4000 years, students have been taught to use the standard root formula without warning that this may be a bad way to solve quadratic equations numerically. Similar considerations are valid for the use of explicit expressions for solving cubic and quartic algebraic equations. Genuine subtraction is numerically dangerous, and genuine addition in FPA is no less dangerous, although this fact is not very well known.
2.2. Violation of Arithmetic Laws
In FPA, the commutative law for addition and multiplication is valid, but the associative law for these operations is violated, i.e., it may happen that
The distributive law is also violated. This may lead to unexpectedly large (in fact, arbitrarily large) errors.
For example, we have the incorrect machine result (2^53 + 1) + 1 = 2^53. The reason for this machine calculation is that 2^53 + 1 is not a machine number (this is the smallest positive integer with this property). It lies in the middle of the successive machine numbers R = 2^53 and R + 2 and is rounded to R. At the same time, we have 2^53 + (1 + 1) = 2^53 + 2, which is the correct answer. Thus, we obtained the famous wrong equality arising in mathematical jokes.
This is not the end of the story. Consider the expression S = R + 1 + 1 + ⋯ + 1 consisting of R + 1 members, where there are R members equal to 1 and R = 2^53. We have S = 2R. Computing S from left to right, we obtain the wrong answer S = R. Computing S starting with the ones, i.e., S = (1 + 1 + ⋯ + 1) + R, we obtain the correct answer S = 2R. Now, we have obtained the more impressive wrong equality R = 2R. Summing (at least theoretically) a sufficient number of 1s, we may even compute the sum as Inf (note that the summation of an arbitrarily large number of summands is not a problem for Turing–Post machines).
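These effects are easy to reproduce. A minimal MATLAB sketch (the particular summands are chosen only for illustration) is:
R = 2^53;                  % the smallest positive integer R such that R + 1 is not a machine number
(R + 1) + 1 == R           % returns logical 1: R + 1 is rounded back to R
R + (1 + 1) == R + 2       % returns logical 1: grouping the ones first gives the correct sum
s = R; for k = 1:10, s = s + 1; end; s == R   % adding 1 ten times, one at a time, leaves s equal to R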
The associative law for multiplication is also violated in FPA. For example, the value of the expression is and it is computed correctly as
At the same time, the machine computation of P from left to right gives
where is the symbol for infinity in FPA. Thus, we obtained .
Similar false equalities are obtained by the addition and subtraction of a small number of machine numbers. Set . In standard arithmetic, we have . In FPA, it is fulfilled . This corresponds to the wrong equality although E is a machine number at a good distance to .
We also have , although the exact result is and should be rounded to -Inf. Thus, we obtained . The reason is that the maximum positive machine number in FPA is and the number is set to . Moreover, the maximum positive integer that is still rounded to is while the number is set to Inf.
These entertaining exercises may produce undetermined results as well. For example, we have but the computed value is U = Inf/Inf − 1 = NaN. This may be interpreted as the exotic equality .
Working with small positive numbers also leads to violation of the distributive law. In particular, we usually think that for x > 0, it holds that (x + x)/2 = x/2 + x/2 = x. But setting x = 2^(−1074) = eps(0), we obtain (x + x)/2 = 2*x/2 = x and x/2 + x/2 = 0 + 0 = 0. We stress that the positive quantities that are still rounded to 2^(−1074) instead of to 0 are those exceeding 2^(−1075).
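A minimal sketch of this effect is:
x = eps(0);                % minimal positive subnormal machine number, 2^(-1074)
(x + x)/2                  % equals x
x/2 + x/2                  % equals 0, since x/2 underflows to 0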
The distributive law in the form (1/m + 1/n)/p = 1/m/p + 1/n/p, where at least one of the operands is not an integer power of 2, is also violated. In this case, the rounded values of both sides of the equality are usually neighboring machine numbers.
Example 5.
Let , , . Then, and .
Example 5 shows that the machine subtraction of close numbers is usually performed with a small relative error. Moreover, if the operands a and b are machine numbers satisfying a/2 ≤ b ≤ 2a, then the machine subtraction a − b is exact [21].
The rounded value of a number X ∈ R is computed by the command sym(X,'f') in the form ±m·2^e, where m and e are integers and m is odd when X ≠ 0. In particular, the command sym(realmax,'f') will give the 309 decimal digits of the integer (2^53 − 1)·2^971 = realmax.
2.3. Numerical Convergence of Series
We shall conclude our brief excursion into the machine summation of numbers with the consideration of numerical series with positive elements a_1, a_2, …. For n = 1, 2, …, set S_n = a_1 + a_2 + ⋯ + a_n.
Definition 1.
The symbol a_1 + a_2 + a_3 + ⋯ is called a numerical series. The quantity S_n = a_1 + a_2 + ⋯ + a_n is the n-th partial sum of this series.
The partial sums are computed in FPA recursively, rounding the result of each addition. Theoretically, a series with positive elements is divergent when its partial sums are unbounded. In FPA, there are three mutually disjoint possibilities according to the next definition.
Definition 2.
The series adheres to the following:
- 1.
- Numerically convergent if there is a positive and such that for ; the number S is called the numerical sum of the series ;
- 2.
- Numerically divergent if there is such that for ; in such cases, for ;
- 3.
- Numerically undefined if there is such that for and .
Divergent numerical series may be numerically convergent. For example, the divergent harmonic series with is numerically convergent in FPA to a sum . The divergent series with is numerically convergent to . Also, it is easy to prove the following assertion.
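The stagnation of the partial sums is easy to observe directly. The sketch below uses single precision only so that the experiment finishes after a few million terms; this is an illustrative adaptation and not the double precision setting considered in the paper.
s = single(0); k = 0;
while true
    k = k + 1;
    t = s + single(1)/single(k);   % next partial sum of the harmonic series in single precision
    if t == s, break; end          % the partial sums have stagnated: numerical convergence
    s = t;
end
[k, s]                             % index at which stagnation occurs and the numerical sum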
Proposition 1.
For any there is a series which is numerically convergent to S.
Proposition 1 for FPA is an analogue to the Riemann series theorem [22], which states that the members of a convergent real series which is not absolutely convergent may be reordered so that the sum of the new series is equal to any prescribed real number.
2.4. Average Rounding Errors
We recall that the normal range of FPA is the interval [realmin, realmax] and that numbers with modulus in this interval are rounded with an RRE less than ρ = 2^(−53).
Let and be the rounded value of x. It is well known [18] that the RRE satisfies the estimate
The code sym(x,’f’) computes in the form , where is odd if . Let . The expression between two successive machine numbers
in the interval is as follows. Let
Then,
The maximum value of in the interval is achieved for and is equal to
A part of the graph of the function is shown in Figure 1.
Figure 1.
Scaled rounding errors.
For near to , the maximum RRE is about , while for near to , the maximum RRE is close to . In particular, we have
That the actual behavior of the RRE is not governed by (1) but is considerably smaller has been known to specialists in computer calculations for a long time. To clarify this problem, we performed intensive numerical experiments.
BFCTs showed that the actual RRE is considerably lower than and this led to the definition and use of average RREs, which are more realistic in comparison to the relative error . Based on this observation [23], we have defined and calculated an integral ARRE as the definite integral . This integral is of order
At the same time, the average of the RRE maximums is
Intensive numerical BFCT experiments [23] with the rounding of large sequences of numbers have shown that the actual ARRE is about 0.40ρ. This is 2.5 times less than the widely used bound ρ in the literature [18]. In these experiments, numbers that are themselves machine numbers were excluded, because their rounding errors are zero. This explains why the multiplier of ρ was computed as 0.40 instead of 0.35.
In another experiment, we have computed and averaged the RRE for fractions , and for and m up to 1000, and we have computed the ratios . The results are shown in Table 1. The actual RRE in rounding of unit fraction (second column) is very close to the integral ARRE (3).
Table 1.
Ratios for certain fractions.
For fractions close to 1, the computed ARRE is larger than for fractions close to 2. This confirms the observation that the rounding of , , is made with a lower (although not twice as low) RRE for x close to the left limit compared with close to the right limit of the interval; see (2). For example, the result in cell (4, 2) of Table 1 is obtained by the code
for m = 1:1000
R(m) = double(abs((sym(1/m,'e') - 1/m))*m*2/eps);
end
mean(R)
We stress that in such experiments, we cannot use pseudo-random number generators such as rand, randn and randi from MATLAB® since they produce machine numbers for which the rounding errors are zero.
We summarize the above considerations in the next important proposition.
Proposition 2.
The known rounding error bounds in all computational algorithms realized in FPA may be reduced three times (!) if the integral ARREs are used instead of the maximum RRE.
It must be pointed out that some widely used computer codes for evaluation of the rounding process have serious flaws. The MATLAB® code sym(x,'e') should compute the rounded value of x in the form of the exact expression plus a signed absolute rounding error expressed as a multiple of eps. However, for some arguments, e.g., x = 1/1161, this rounding error is computed with a relative error of almost 100%.
3. Numerically Stable Computations
3.1. Lipschitz Continuous Problems
The numerical stability of computational procedures is a major issue in numerical analysis [19]. A large variety of computational problems (actually, almost all computational problems) may be formulated as follows; see [19,20]. Let and be given sets and let be a continuous function. We shall use the following informal definition.
Definition 3.
The computational problem is described by the function f, while the particular way of computing the result for a given data is the computational algorithm.
Thus, the computational problem is identified with the pair , or with the equality . For many computational problems, the function f is Lipschitz continuous.
Definition 4.
The computational problem is said to be (locally) Lipschitz continuous if there exist constants and such that
whenever . The problem is globally Lipschitz continuous if .
If the function f has a locally bounded derivative, it is (locally) Lipschitz continuous. The concept of Lipschitz continuity is illustrated by the next example.
Example 6.
Let , , and be a given constant. Then, the following assertions take place.
- 1.
- The function is Lipschitz continuous if and is not Lipschitz continuous if .
- 2.
- The function is Lipschitz continuous if and is not Lipschitz continuous if .
- 3.
- The function , , is Lipschitz continuous if and is not Lipschitz continuous if .
- 4.
- The function for and , is differentiable on but is not Lipschitz continuous.
Suppose now that the algorithm is realized in FPA, e.g., in the MATLAB® computing environment.
Definition 5.
The computational algorithm is said to be numerically stable if the computed result is close to the result of a close problem with data .
Definition 5 includes the popular concepts of forward and backwards numerical stability. To make this definition formal, suppose that there exist non-negative constants such that
and , . Further on, we have
Dividing the last inequality by , we obtain the following estimate for the relative error in the computed solution
Definition 6.
The quantity is said to be the relative condition number of the computational problem .
The remarkable Formula (4) reveals the three main factors that determine the precision of the calculated solution [20,23].
Proposition 3.
The precision of the computed solution depends on the following factors.
- 1.
- The sensitivity of the computational problem to perturbations in the data d measured by the Lipschitz constant L of the function f.
- 2.
- The stability of the computational algorithm expressed by the constants .
- 3.
- The FPA characterized by the rounding unit ρ and the requirement that the intermediate computed results belong to the normal range .
The error estimate (4) is used in practice as follows. For a number of problems, the Lipschitz constant L may be calculated or estimated as in the solution of linear algebraic equations and the computation of the eigenstructure of a matrix with simple eigenvalues. Finally, the value of is known exactly for FPA as well as for other machine environments.
Estimation of the constants may be a problem. Often, the heuristic assumption and is made giving
It follows from (5) that C will be large if L and/or are large and/or is small. The constant L is usually outside the user’s control. The quantities and may be changed by scaling of the computational problem.
Definition 7.
The computational problem is said to adhere to the following:
- 1.
- Well conditioned if ; in this case, we may expect about true decimal digits in the computed solution;
- 2.
- Poorly conditioned if ; in this case, there may be no true digits in the computed solution.
More precise classification of the conditioning (well, medium and poor) of computational problems is also possible. Computational problems that are not Lipschitz but Hölder continuous may also be analyzed: see, e.g., [23].
The above considerations confirm a fundamental rule in numerical computations, namely that if the data d and/or some of the intermediate results in the computation of are large and/or the result is small, then large relative errors in the computed solution can be expected; see Section 3.3. For some computational problems, there are a priori error estimates for the relative error in the computed solution in the form of explicit expressions in the problem data. The importance of such estimates is that they may be obtained a priori before solving the computational problem.
Most a priori error estimates are heuristic. Among computational problems with such estimates is the solution of linear algebraic equations and the computation of the eigenvalues of a matrix A with simple spectrum. To check the accuracy and practical usefulness of such error estimates by BFCTs, a set of CPRSs is designed, and the error estimates are compared with the observed errors.
3.2. Hölder Continuous Problems
Definition 8.
The computational problem is said to be Hölder continuous with exponent if there exist constants and such that
whenever .
Hölder continuity implies uniform continuity, but the converse is not true. It is supposed that 0 < α < 1. Indeed, if α = 1, the problem is Lipschitz continuous, and if α > 1, the function f is constant.
The machine computation of the result may be performed with large errors of order ρ^α when the exponent α is small. Such cases arise in the calculation of a zero of multiplicity n of a smooth function. In this case, α = 1/n.
Let and be the computed value of r. Then, a heuristic accuracy estimate for the solution of a Hölder problem is
Experimental results confirming this estimate are presented later on. A typical example of a Hölder problem is the computation of the roots of an algebraic equation of n degree with multiple roots, where (the case is treated by special algorithms). There are three codes in MATLAB® that can be used for solving algebraic equations which may be characterized as follows.
- The code roots is fast and works on mid-range computer platforms with equations of degree n up to several thousand but gives large errors in the case of multiple roots. This code solves only algebraic equations.
- The code vpasolve works with VPA corresponding to 32 true decimal digits with equations of degree n up to several hundred but may be slow in some cases. This code also works with general equations, finding one root at a time.
- The code fzero is fast but finds one root at a time. It may not work properly in the case of roots of even multiplicity. This code works with general finite equations as well; a small comparison of the three codes on a test problem with a multiple root is sketched below.
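As a small illustration of the error behavior described above, one may compare the three codes on an assumed test problem with a triple root (the polynomial below is not taken from the paper):
f = @(x) x.^3 - 6*x.^2 + 12*x - 8;   % expanded form of (x - 2)^3, triple root at x = 2
r1 = fzero(f, 1)                      % typically accurate only to about eps^(1/3), i.e., errors near 1e-5
r2 = roots([1 -6 12 -8])              % three values scattered around 2 at a distance of order eps^(1/3)
syms x
r3 = vpasolve((x - 2)^3 == 0, x)      % the root 2 recovered with 32 true decimal digits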
3.3. Golden Rule of Computations
The results presented in this and the previous sections confirm the golden rule of computing, valid in both exact and machine arithmetic, which may be formulated as follows.
Proposition 4 (Golden Rule of Computations).
If in a chain of numerical computations the final result is small relative to the initial data and/or to the intermediate results, then large relative errors in the computed solution are to be expected.
Note that the already mentioned catastrophic cancellation in subtraction of close numbers is a particular case of Proposition 4.
An important class of computations in FPA are the so-called computations with maximum accuracy. Let the vector r with elements be the exact solution of the vector computational problem and let the vector with elements be the computed solution.
Definition 9.
The vector is computed with maximum accuracy if its elements satisfy .
Solutions with maximum accuracy are the best that we may expect from a computational algorithm implemented in FPA.
4. Extremal Elements of Arrays
Finding minimal and maximal elements of real and complex arrays is a part of many computational algorithms. This problem may look simple, but it has pitfalls in some cases.
4.1. Vectors
Let be a given real n-vector. The problem of determining an extremal (minimal or maximal) element of w together with its index k arises as part of many computational problems. The solution of this problem is obtained in MATLAB® by the codes [U,K] = min(w) and [V,L] = max(w), where is the minimal element of the vector w and K is its index. Similarly, is the maximal element of the vector w and L is its index. These codes work for complex vectors as well if the complex numbers are ordered lexicographically by the relation ⪯; see [6].
An important rule here is that if there are several elements of w equal to its minimal element U, then the computed index K is supposed to be the minimal one. Similarly, if there are several elements of w equal to its maximal element V, then the computed index L is again the minimal one. The delicate point here is the calculation of the indexes K and L. For example, for , where or , the codes min and max produce , and , as expected.
Due to rounding, the above codes may produce unexpected results. Indeed, the performance of the codes depends on the form in which the elements of the vector w are written.
Example 7.
The quantities , and are equal but their rounded values are different. The rounded value of is computed by the code sym(X,’f’) . It produces the answer in the form , where . We have and , . Hence, .
Example 8.
Let w be a 3-vector with elements at any positions, where are defined in Example 7. Then, the code [U,K] = min(w) will give and K will be the index of B as an element of w. The code [V,L] = max(w) will give and L will be the index of C as an element of w. At the same time, both codes are supposed to give and .
These details of the performance of the codes min and max may not be very popular even among experienced users of sophisticated computer systems for mathematical calculations such as MATLAB®.
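The following sketch illustrates the phenomenon with assumed quantities (not the specific numbers of Examples 7 and 8): 0.1 and 0.3 − 0.2 are equal in exact arithmetic but have different rounded values.
w = [0.1, 0.3 - 0.2, 1/10];
[U, K] = min(w)                  % U = fl(0.3 - 0.2) < fl(0.1), K = 2
[V, L] = max(w)                  % V = fl(0.1), L = 1 (the first of the two equal maximal elements)
sym(w(2),'f') - sym(w(1),'f')    % the tiny nonzero difference between the two rounded values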
4.2. Matrices
Similar problems as in Section 4.1 arise in finding extremal (minimal or maximal) elements of multidimensional arrays and in particular of real and complex matrices A.
Consider a matrix with elements . Let the minimal element of A relative to the lexicographical order be . Then, the code
[a,b] = min(A); [A_min,N] = min(a); A_min, ind = [b(N),N]
finds the minimal element of A and the pair of its indexes as
A_min, ind = i_1 i_2
where i_1 = ind(1), i_2 = ind(2). Here, a and b are n-vectors and ind is a 2-vector.
If there is more than one minimal element, then the computed pair of its indexes is minimal relative to the lexicographical order. This result, however, may depend on the way the elements of A are specified.
Finding minimal elements of matrices may be a part of improved algorithms to compute the extrema of functions of two variables; see Section 7.2. A scheme for finding minimal elements of -arrays, , may also be derived and applied to the minimization of functions of n variables.
4.3. Application to Voting Theory
The above details in using the codes min and max from MATLAB® [6], although usually neglected, may be important. For example, in certain automatic voting computations (in which one of the authors of this paper had participated back in 2013) with the Hamilton method for the bi-proportional voting system used in Bulgaria, these phenomena actually occurred and caused problems. To resolve quickly the problem, we applied BFCTs using hand calculations (!) being in an emergency situation.
The computational algorithms for realization of the Hamilton method must be improved, replacing the fractional remainders by integer remainders [23] as follows. Let n parties with votes v_1, v_2, …, v_n take part in the distribution of S parliamentary seats by the Hamilton method [24]. Set V = v_1 + v_2 + ⋯ + v_n and q_k = S·v_k/V.
Next, the rationals q_k are represented as q_k = w_k + f_k, where w_k is the integer part of q_k, f_k ∈ [0, 1) is the fractional remainder and r_k = S·v_k mod V is the integer remainder of S·v_k modulo V, so that f_k = r_k/V. Initially, the kth party obtains w_k seats. If the sum of the initial seats is less than S, then the parties with the largest integer remainders obtain one seat more, until all S seats are distributed. Thus, to avoid small errors in the computation of fractional remainders that may cheat the code max, we must work with exactly computed integer remainders r_k.
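A minimal sketch of such an implementation is given below; the vote counts and the number of seats are assumed data used only for illustration.
v = [123456; 234567; 345678; 296299];   % assumed vote counts of n = 4 parties
S = 240;                                 % assumed total number of seats
V = sum(v);                              % total number of votes
r = mod(S*v, V);                         % exact integer remainders of S*v_k modulo V
w = (S*v - r)/V;                         % integer parts of S*v_k/V, obtained exactly
[~, idx] = sort(r, 'descend');           % parties ordered by decreasing integer remainder
extra = S - sum(w);                      % seats left after the initial distribution
seats = w; seats(idx(1:extra)) = seats(idx(1:extra)) + 1;
seats', sum(seats)                       % the final distribution sums to S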
5. Problems with Reference Solutions
5.1. Evaluation of Functions
Evaluation of functions f defined by an explicit expression
is a CPRS, since may be found with high precision in a suitable computing environment. However, the relative error in computing y may be large even for well-behaved functions f and arguments x far from the limits of the normal range [23]. The reason is that instead with the argument , the corresponding computational algorithm works with the rounded value . At the same time, for , the computed quantity is usually exact to working precision.
Example 9.
We have
[cos(realmax);sin(realmax)] = −0.999987689426560
0.004961954789184
The computed result is correct to full precision, although the argument realmax is written with 309 decimal digits, which may be displayed by the command sym(realmax,’f’).
Let the function f be locally Lipschitz continuous with constant in a neighborhood of x and let the result computed in FPA be . Then, the relative error of is estimated as
where is the relative condition number of the computational problem .
Example 10.
Consider the values cos(2^k·π) and sin(2^k·π), where k is a positive integer. We should have cos(2^k·π) = 1 and sin(2^k·π) = 0, but for large k, this may not be so. Let the vector X have elements 2^k·π for k = 47, 48, …, 54, i.e.,
X = pi*[2^47,2^48,2^49,2^50,2^51,2^52,2^53,2^54]
The command Y = [cos(X);sin(X)] gives the matrix
Y = 0.9999 0.9994 0.9976 0.9905 0.9622 0.8517 0.4509 −0.5934
0.0172 −0.0345 −0.0689 −0.1374 −0.2723 −0.5240 −0.8926 −0.8049
While the first column Y(:,1) of Y still resembles the exact vector [1;0], the next columns are heavily wrong with Y(:,8) being a complete catastrophe with relative error 170%. The commands Y1 = [C1;S1] and Y2 = [C2;S2], where C1 = vpa(cos(X)), S1 = vpa(sin(X)), C2 = cos(vpa(X)) and S2 = sin(vpa(X)) produce the same wrong result Y1 = Y2 = Y.
The reader is advised to plot the graph of the eight functions , , by the command syms x; fplot(sin(x+X)), for the vector X from Example 10. A way out from such wrong calculations is to evaluate standard -periodic trigonometric functions for arguments modulo .
Example 11.
The command
YY = [cos(mod(X,2*pi));sin(mod(X,2*pi))]
produces the matrix
YY = 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0
which is exact to full precision.
We recall that for x ≥ 2^53, the rounded value of x is an even integer. For example, the quantity 23·π^32/20 is rounded to an even integer, which is demonstrated by the command sym(23*pi^32/20,'f'). Another instructive example is the rounding of integer powers of 10.
Example 12.
We have fl(10^k) = 10^k for k = 1, 2, …, 22, but
sym([10^23;10^24;10^25],'f') = 99999999999999991611392
999999999999999983222784
10000000000000000905969664
If x is a machine number, then for most elementary and special functions f, the quantity f(x) is computed with maximum accuracy.
If the function f is locally Lipschitz with constant , then large relative computational errors in the evaluation of may occur when is large, i.e., when and/or is large and/or is small. For the trigonometric functions above the quantities, and are equal to 1, but the argument is large.
A standard approach to improve the performance of computational algorithms is to scale the initial and/or intermediate data to avoid large errors. In computing the matrix exponential, the scaling is accomplished by a scaling factor which is an integer power of 2 in order to reduce rounding errors [19].
5.2. Linear Algebraic Equations
An instructive example, which illustrates the application of BFCTs to the solution of vector algebraic equations with integer data, is designed as follows. Let A be an invertible n × n matrix and b an n-vector. The command A = fix(100*randn(n)) generates a matrix with integer elements of moderate magnitude, obtained by truncating normally distributed numbers with zero mean (a similar effect is achieved by the command randi). Choosing the reference solution as X_0 = ones(n,1), we obtain b = A*X_0. The computed solution is X_comp = A\b with relative error E_n = norm(X_comp - X_0)/sqrt(n).
The a priori relative error estimate is est = eps∗cond(A). Usually, is a small multiple of , and this may be checked by BFCTs. We consider three types of computed solutions as follows.
- Standard solution X_1 = A\b obtained by Gaussian elimination with partial pivoting based on the decomposition P*A = L*R of A, where L is a lower triangular matrix with unit diagonal elements, R is an upper triangular matrix with nonzero diagonal elements, and P is a permutation matrix.
- Solution X_2 = R\(Q'*b) obtained by the QR decomposition [Q,R] = qr(A) of A, where A = Q*R, the matrix Q is orthogonal and the matrix R is upper triangular.
- Solution X_3 = inv(A)*b based on finding the inverse of A.
The solution is computed as a default in MATLAB® and is accurate and fast. The solution is very accurate although more expensive and is also used. The solution is not recommended and is included here for tutorial purposes. Indeed, computing inv(A) and multiplying the vector b by this matrix are unnecessary operations which are expensive and reduce accuracy.
The above computational techniques are illustrated by the commands
n = 1000; A = fix(100*randn(n)); X_0 = ones(n,1); b = A*X_0;
X_1 = A\b; [Q,R] = qr(A); X_2 = R\(Q'*b); X_3 = inv(A)*b;
E_1 = norm(X_0-X_1)/sqrt(n); E_2 = norm(X_0-X_2)/sqrt(n);
E_3 = norm(X_0-X_3)/sqrt(n); est = eps*cond(A);
e_1 = E_1/est; e_2 = E_2/est; e_3 = E_3/est; [e_1;e_2;e_3]
Each execution of these commands produces different results because of the code randn. The computed quantity e_1 for systems with up to unknowns is usually less than 20, the quantity e_2 is about 40 and the quantity e_3 is less than 4. More precisely, intensive computations with showed that , and .
Tests with had also been conducted, but they require more computational time. For such values of n, we have observed average values , , and .
The comparison of the errors E_1, E_2 and E_3 confirms, among others, the tutorial conclusion that solving linear vector algebraic equations by matrix inversion must definitely be avoided. Rather, the opposite is true: if the inverse matrix B = inv(A) is needed for some reason, it is found by solving the n equations A*B(:,k) = I(:,k) for the columns of B, where I(:,k) is the kth column of the identity matrix.
Our main observation, based on BFCTs, featuring a comparison of the methods of Gaussian elimination and QR decomposition for solving linear algebraic equations, is formulated in the next proposition.
Proposition 5.
For matrices of order up to 1000, the QR decomposition method gives a relative error that is between 5 and 10 times less than the relative error of the Gaussian elimination method.
In addition, the solution of the equation via QR decomposition is backward stable, which is not always the case with Gaussian elimination with partial pivoting. Thus, Proposition 5 is a message to the developers of mathematical software such as MATLAB®.
The Gaussian elimination method based on LR decomposition has traditionally been preferred to the QR decomposition method because it requires half as many floating-point operations. Since fast computational platforms are now widely available, this consideration is no longer decisive.
Although the implementation of the code x = A\b gives good results for matrices with n of order up to 1000, large errors in x may occur in some cases even for very small n.
Example 13.
Consider the algebraic equation , where the matrices are defined by the relation (7) below. For , we have . The computed results for are presented in Table 2. In cases , there is a warning that the matrix is close to singular or badly scaled, since is larger than . As shown in Table 2, for these two cases, the relative error is indeed 25% and 100%, respectively.
Table 2.
Errors and estimates for low order systems.
5.3. Eigenvalues of 2 by 2 Integer Matrices
The integer matrix
with condition number has double eigenvalue and Jordan form . The elements of A are integers of magnitude . This is far from the quantity such that integers may not be machine numbers.
Example 14.
The MATLAB® code e = eig(A) computes the column 2-vector e of eigenvalues of A as instead of . The relative error is 140%, and there is no true digit in the computed vector (even the sign of the first computed eigenvalue is wrong). The trace of A is fortunately computed as trace(A) = 2 and is the sum of the wrong eigenvalues. Next, the determinant of A (which is equal to 1) is computed wrongly in MATLAB® as det(A) = 0.2214.
Surprisingly, the vector p of the coefficients of the characteristic polynomial of A, computed as p = poly(A), is [1,−2,−0.9475]. The last element of the vector p has nothing in common with the computed determinant 0.2214, which may further confuse the user. Next, the Schur form of A is computed by the MATLAB® command [U,S] = schur(A) as
and we have . This explains the computed value of .
Finally, we may try to compute as , where represents the singular values of A. Based on the MATLAB® code svd, the computed result is , which is the best approximation to the exact value 1 of the determinant up to now. The same result for is obtained by the QR decomposition of A by the MATLAB® command [Q,R] = qr(A). The above results confirm the rule that SVD and QR decomposition are the best numerical tools to compute the determinant of a general matrix (if this determinant is ever needed). For general matrices, however, the most reliable way to compute the spectrum is a variant of the BFCT described below. Here, we recall the modern paradigm in numerical spectral analysis and root finding of polynomials: the eigenvalues of a matrix are not computed as roots of its characteristic polynomial but rather the contrary: the roots of a polynomial are computed as the eigenvalues of its companion matrix.
The Maple [9] code jordan(A), incorporated in MATLAB®, computes correctly the Jordan form of A. The vector of eigenvalue condition numbers, computed by the MATLAB® command condeig(A), is , where is relatively large. This indicates that something is wrong. In reality, things are even worse, since the 2-vector of eigenvalue condition numbers of the matrix A does not exist because A has only one linearly independent eigenvector. At this moment, the user may or may not be aware that something is wrong (with exception of the code jordan which is not very popular and is of restricted use). Now comes the time of BFCTs to find the spectrum and the Jordan form of A as follows.
The sum of the eigenvalues of A is equal to the trace, which is computed exactly. If the eigenvalues are close or equal (which is suggested by the large eigenvalue condition numbers), then we should have λ1 = λ2 = 1, since λ1 + λ2 = trace(A) = 2 and λ1·λ2 = det(A) = 1. The Jordan form of A is then either the identity matrix I_2 or the Jordan block J = [1 1; 0 1]. But the only matrix with Jordan form I_2 is I_2 itself, and hence the Jordan form of A must be J. We have obtained the correct result using BFCTs and simple logical reasoning.
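The phenomenon is easy to reproduce. The sketch below uses an assumed integer matrix of the same type (trace 2, determinant 1, double defective eigenvalue 1); it is not necessarily the matrix (6) considered above.
a = 12345; b = 67890;              % assumed integer parameters
A = [1 + a*b, -a^2; b^2, 1 - a*b]  % integer matrix with trace 2, determinant 1 and double eigenvalue 1
eig(A)                             % typically computed with errors of order 1, i.e., no correct digits
det(A)                             % typically far from the exact value 1
trace(A)                           % computed exactly as 2
% BFCT reasoning: trace(A) = 2 and det(A) = 1 force both eigenvalues to equal 1;
% since A is not the identity matrix, its Jordan form is the 2-by-2 Jordan block.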
The matrix (6) may be written as
for . Thus, we have a family of matrices parametrized by the integer m. The trace of is equal to 2, the determinant of is equal to 1, the eigenvalues of are and the Jordan form of is .
Below, we give the values of the coefficients and of the characteristic polynomial of A as found by the code poly(A), and of the eigenvalues of A, as computed by the code eig(A) in MATLAB®.
- For , we have , and , where the quantity 1.0000 has at least 5 true decimal digits and is a small imaginary tail which may be neglected. This result seems acceptable.
- For , we have , and , . With certain optimism, these results may also be declared as acceptable.
- For , a computational catastrophe occurs with (true), (completely wrong) and , (also completely wrong). Here, surprisingly, is computed as , which differs from 1 (which is to be expected) but differs also from . It is not clear what and why had happened. Using the computed coefficients , the roots of the characteristic equation now are , instead of the computed and . This is a strange wrong result.
- We give also the results for which are full trash and are served without any warning to the user. We have , and . We also obtain for completeness.
In all cases, the sum of the wrongly computed eigenvalues is almost exact, which is a well known fact in the numerical spectral analysis of matrices [19]. The conclusion from the above considerations is that BFCTs in this case, in contrast to the standard software, always give with five true decimal digits. This is a good approximation to the eigenvalues of the matrices for when the codes eig and poly fail.
The main reason for the large errors considered in this section is the bad conditioning of the integer matrices. Nevertheless, these errors are served without warning to the user. This should stimulate teaching computer methods and algorithms with an emphasis on cases where these tools produce wrong (sometimes completely wrong) results. This may also be a stimulus for producers of scientific software to warn users of the possible unreliable behavior of computer codes. Such behavior may be due to the particularities of FPA obeying the 2019 IEEE Standard but also to inappropriate algorithms and programming flaws such as in the code sym(1/1161,’e’).
6. Zeros of Functions
Computing zeros of functions and of polynomials in particular is a classical problem in numerical analysis. Although powerful computational algorithms for this purpose have been developed, the use of BFCTs in this area is still useful.
Consider first the scalar case. Calculating roots of equations , where is a continuous function, in a given interval , may be a difficult problem. We distinguish three main tasks.
- Find a single particular solution in the interval T.
- Find the general solution in the interval T.
- Find all roots of a given n degree polynomial .
The condition f(a)·f(b) < 0 guarantees that there is at least one solution in the interval (a, b). Whether f has zeros at the end points a and b is checked by direct computation.
Solving real and complex polynomial equations of degree n
is achieved by the code vpasolve(P(x) == 0) from MATLAB®. It finds all roots of the polynomial with quadruple precision of 32 true decimal digits. For n up to several hundred, the speed of the corresponding software on a medium computer platform is still acceptable. For higher degrees, the performance of the code may be slow even on fast platforms, and hence it may not be applicable to RTA (if such computations are ever needed in RTA). Note that we write this at the beginning of year 2025.
A faster solution may be obtained by the code roots(p), where p is the vector of coefficients of the polynomial. This code computes the column vector of roots as the vector of eigenvalues of the companion matrix of the polynomial using the QR algorithm, i.e., r = eig(compan(p)). The code roots(p) works relatively fast for n up to several thousand but may produce wrong results even for small values of n when the polynomial has multiple roots and the computational problem is poorly conditioned. BFCTs make it possible to reveal the behavior of this solver.
It is known that errors in computing multiple roots of equations f(x) = 0 in FPA, where f has a sufficient number of bounded derivatives around a root r, are of order c_n·ρ^(1/n), where n is the multiplicity of r and c_n is a constant. To estimate c_n, consider the equation P(x) = 0 when the degree-n polynomial P has an n-tuple root. Let
be the relative error in the vector of roots computed by the code roots. The ratio is shown in Table 3 for the expanded form of the polynomial and values of n from 3 to 20. For smaller values of n, the computed roots are exact.
Table 3.
Constants in error estimates for code roots.
The relative error in the computed solution satisfies the heuristic estimate given above, where the average of the constants is approximately the value reported in Table 3. The expanded form of the polynomial may be found by the command expand or by asking the intelligent dialogue system WolframAlpha [11] to find the coefficients of this polynomial. We stress that for all n values, the code vpasolve(P(x) == 0) gives the exact solution vector.
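For concreteness, an experiment of this kind may be organized as follows; the expanded polynomial (x − 1)^n with reference root 1 is an assumed test case in the spirit of the CPRSs discussed above.
n = 10;                             % multiplicity of the assumed reference root
p = poly(ones(1, n));               % coefficients of the expanded polynomial (x - 1)^n
r = roots(p);                       % computed roots, scattered around 1
E = norm(r - ones(n, 1))/sqrt(n)    % relative error of the computed root vector
E/eps^(1/n)                         % rough estimate of the constant in the error model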
Usually, the computed roots are unacceptable if their relative error is greater than 1%. For example, it follows from Table 3 that the root of multiplicity 20 is computed by the code roots with a relative error of 32%, which is unacceptable. It must be stressed that this is not due to program disadvantages but rather to the high sensitivity of the eigenvalues of the companion matrix . If the polynomial equation has a root of multiplicity n, the relative error is estimated as . Solving the inequality , we obtain . We summarize these slightly unexpected observations as follows.
Proposition 6.
The errors in computing roots of n-degree polynomials by the code roots may be unacceptable when .
Consider finally the problem of finding all roots of the equation in the interval , i.e., finding the set , where, e.g., , . Here, the use of BFCTs seems necessary. A direct approach is to plot the graph of the function , . Let N be a sufficiently large integer. We may use the commands X = linspace(a,b,N); Y = abs(f(X)); plot(X,Y) to plot the graph of the function , . The zeros of f are now marked as inverted peaks at the roots . In the latter case, we have , and . Next, the codes vpasolve(f,r_k) or fzero(f,r_k) may be used around each of the points to specify the roots with full precision. This approach is well known in the computational practice and is recommended in some MATLAB® guides.
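A minimal sketch of this approach (the function and the interval are illustrative) is:
f = @(x) sin(x) - 0.5;               % assumed test function
a = 0; b = 20; N = 10^5;
X = linspace(a, b, N);
plot(X, abs(f(X)))                   % the zeros of f appear as inverted peaks touching the x-axis
idx = find(f(X(1:end-1)).*f(X(2:end)) < 0);        % grid intervals on which f changes sign
r = arrayfun(@(i) fzero(f, [X(i), X(i+1)]), idx)   % each root refined to full precision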
A difficult problem is the solution of nonlinear vector equations and of matrix equations, where the unknowns are n-vectors or matrices. The MATLAB® code fsolve is intended to solve nonlinear vector equations, while the codes are, axxbc, care, dare, dlyap, dric, lyap, ric, sylv and others solve linear and nonlinear matrix equations. The performance of all these codes may be checked by BFCTs and CPRSs. This is a challenging problem which will be considered elsewhere.
7. Minimization of Functions
7.1. Functions of Scalar Argument
The minimization of a continuous function f of one variable on an interval T = [a, b] is accomplished in MATLAB® by a combination of golden-section search and parabolic interpolation. The corresponding code [x_0,y_0] = fminbnd(f,a,b) finds a local minimum in the interval T, where y_0 = f(x_0). To find the global minimum of f in the interval T, we may use the graph of the function f plotted by the commands x = linspace(a,b,n); y = f(x); plot(x,y). Then, the approximate global minimum of f is determined visually, considering also the values of f at the end points of T. The code
[x_min,y_min] = fminbnd(f,c-h,c+h)
is then used in a neighborhood of the approximate minimum . This approach requires human interference in the process and may be avoided as shown later.
The problem with the code fminbnd is not that it may find local minimums. The real problem with this and similar codes is that they can miss the global minimum even when the function f is unimodal, i.e., when it has a unique minimum. BFCTs for such functions are a must. The problem is that usually, we do not know whether the function is unimodal or not and whether the computed minimum is a minimum at all: the more so when performing complicated algorithms in RTA.
Example 15.
Consider the function
This function is unimodal with minimum y = 1 attained at x = 1. In some cases, this minimum is found as [x_min,y_min] = fminbnd(f,0,b) for some b > 1. But in other cases, it is not found correctly, and here, we should use BFCTs. The code works well for b = 16.85 but gives a completely wrong result for b = 16.86 without warning. Indeed, we have the good result
[x_min;y_min] = fminbnd(f,0,16.85) = 0.999992123678799
1.000000000124073
and the bad result
[x_min;y_min] = fminbnd(f,0,16.86) = 16.859937887512295
2
The second result has a 1586% relative error for x_min and a 100% relative error for y_min.
The reason for the disturbing events in Example 15 is complicated. We may expect wrong results for values of x such that the expression subtracted from 2 is smaller than the rounding threshold, since then f(x) is rounded to 2. The root of the corresponding equation is much less than 16.86, and hence the code fminbnd apparently works with a certain sophisticated version of VPA instead of with FPA. We shall summarize the above considerations in the next statement.
Proposition 7.
The software for the minimization of functions of one variable such as fminbnd(f,a,b) from MATLAB® may fail to produce correct results even for unimodal functions f.
In contrast, BFCTs can always be used to find the global minimum automatically. Let the function f be unimodal on [a, b]. Consider the vectors X = linspace(a,b,n+1) with elements X(k) and Y with elements Y(k) = f(X(k)). The MATLAB® command [y,m] = min(Y) computes an approximation y to the minimum of f together with its index m. Finally, the (almost) exact minimum is computed between the neighbors of X(m) as [x_min,y_min] = fminbnd(f,X(m-1),X(m+1)). This approach may be used to develop improved versions of the code fminbnd.
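A compact sketch of this grid-based technique is given below; the test function is an assumed one in the spirit of Example 15.
f = @(x) 2 - x.*exp(1/2 - x.^2/2);   % assumed unimodal test function with minimum 1 at x = 1
a = 0; b = 100; n = 10^4;
X = linspace(a, b, n + 1);
[~, m] = min(f(X));                  % index of the smallest value on the grid
m = min(max(m, 2), n);               % keep the refinement interval inside [a, b]
[x_min, y_min] = fminbnd(f, X(m-1), X(m+1))   % local refinement near the grid minimizer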
7.2. Functions of Vector Argument
For functions of several variables, the minimization of continuous functions of vector argument meets difficulties similar to those described in Section 7.1 and other specific difficulties as well. Unconstrained minimization is accomplished in MATLAB® by the Nelder–Mead algorithm [25,26] and realized by the code fminsearch [6]. We stress that the code finds a local minimizer of the function f.
The function f may not be differentiable, which is an advantage of this algorithm. The algorithm compares the function values at the vertices of a simplex in the argument space and then replaces the vertex with the highest value of f by another vertex. The simplex usually (but not always, as shown below) contracts onto a minimum of f, which may be local or global. The corresponding command is [x_min,y_min] = fminsearch(f,x_0), where x_0 is the initial guess. The computed result depends on the following five factors:
- The computational problem;
- The computational algorithm;
- The FPA;
- The computer platform;
- The starting point .
Factors 1–3 are usually out of the control of the user. Factor 4 is partially under control, e.g., the user may use another platform. Only factor 5, which is very important, is entirely under control. As mentioned above, the computed value depends on the starting point . This is denoted as .
Definition 10.
The point is said to be a numerical fixed point (NFP) of the computational procedure if .
The computed result is the NFP of the computational procedure. The NFP depends not only on the computational problem and the computational algorithm but also on the FPA and on the particular computer platform where the algorithm is realized. Unfortunately, computed values for which are far from any actual minimum are often NFPs of the computational procedures, and they are served without warning to the user.
For a class of optimization problems, the function f has the form
where , is a polynomial in x and are positive constants. Since , when , the function (8) has (at least one) global minimum .
Numerical experiments with functions f of type (8) show that not only do the computed values depend on the initial guess but that in many cases, the code fminsearch produces a wrong solution which is far from any minimizer of the function f. The reason is that the variable part of f may become very small relative to c. Then, f is computed as c, and the algorithm stops at a point which is far from any minimizer of f. Usually, no warning is issued for the user in such cases. This may be a problem when the minimization code is a part of an automatic computational procedure which is not under human supervision. In such cases, application of the techniques of AI and MAI may be useful.
Example 16.
f = @(x)2-x(1)*x(2)*exp(1-x(1)^2/2-x(2)^2/2)
The function f(x) = 2 − x(1)·x(2)·exp(1 − x(1)^2/2 − x(2)^2/2) has two global minimizers (1, 1) and (−1, −1), for which f = 1. It also has two global maximizers (1, −1) and (−1, 1), for which f = 3.
Let the initial guess for the code [x_min,y_min] = fminsearch(f,x_0) be . For , we obtain the relatively good result
x_min = −1.000041440968428 −0.999998218805417
y_min = 1.000000001720503
For the slightly different value , the computed result is a numerical fixed point which is not a minimizer. Indeed, we have
x_min = 6.430900000000000 6.430900000000000
y_min = 2.000000000000000
The second computed result in Example 16 is a numerical catastrophe with a 543% relative error in x_min and a 100% relative error in y_min. The reason is that at the computed point, the quantity x(1)*x(2)*exp(1 - x(1)^2/2 - x(2)^2/2) is smaller than eps(2)/2, and f is therefore rounded to 2. The conclusion is that the algorithm of the moving simplex uses FPA rather than VPA (compare with Example 15).
The minimization problem in Example 16 is not unimodal, since it has two solutions. Solving unimodal problems by the code fminsearch may also lead to large errors, as shown in the next example.
Example 17.
f = @(x)2-x(1)*exp(1/2-x(1)^2/2-(x(2)-1)^2)
The minimization problem has the unique solution x_min = (1, 1), y_min = 1. For suitable values of c, the code
[x_min,y_min] = fminsearch(f,[c,c])
gives a quite acceptable result, namely
x_min = 1.000039724696251 1.000017661431510
y_min = 1.000000001889957
For a slightly different value of c, the computed result is a numerical fixed point which is not a minimizer. Indeed, for c = 5.82, the computed result is wrong, namely
x_min = 5.820000000000000 5.820000000000000
y_min = 2
There is a 482% relative error in the computed argument x_min and a 100% relative error in the computed minimum y_min.
The way out of such situations is, at least for small n, to use BFCTs as follows. Let n = 2 and let f(x_1, x_2) be the function which has to be minimized for (x_1, x_2) in the rectangle [a_1, b_1] × [a_2, b_2]. We choose positive integers n_1 and n_2 and compute the grids
X_k = linspace(a_k,b_k,n_k)
and the quantities f(X_1(i), X_2(j)). Thus, we construct a matrix which is a discrete analogue of the surface z = f(x_1, x_2).
Next, the minimal element of this matrix and its indexes are found by the BFCTs described in Section 4.2. Now, the approximate global minimizer x_0 of f is the corresponding grid point, and it may be used as a starting point for the code fminsearch(f,x_0). Extensive numerical experiments show that the code now works properly.
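A sketch of this two-dimensional grid search, applied to the function of Example 16 over an assumed search box, is:
f  = @(x) 2 - x(1)*x(2)*exp(1 - x(1)^2/2 - x(2)^2/2);  % the function of Example 16
X1 = linspace(-4, 4, 201); X2 = linspace(-4, 4, 201);  % assumed search box [-4, 4] x [-4, 4]
F  = zeros(201, 201);
for i = 1:201
    for j = 1:201
        F(i, j) = f([X1(i), X2(j)]);
    end
end
[c, rows] = min(F);                 % column minima and the corresponding row indices
[~, j0] = min(c); i0 = rows(j0);    % indices of the minimal element of F
x_0 = [X1(i0), X2(j0)];             % approximate global minimizer on the grid
[x_min, y_min] = fminsearch(f, x_0) % local refinement from a reliable starting point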
The simplex method [25], utilized in fminsearch, does not use gradients and may not ensure convergence to a local minimum, as Example 17 demonstrates. To enhance robustness, alternative optimization techniques such as Differential Evolution, Simulated Annealing, and Trust-Region Methods may be considered for their potential to improve convergence and scalability. These techniques are promising due to their ability to navigate complex search spaces and avoid local and/or false minima, making them particularly beneficial in optimization problems with nonlinear constraints or multiple variables. To develop reliable codes for multivariate static optimization remains an open task in modern computer methods. These problems will be addressed in more detail elsewhere in connection with the use of BFCTs.
8. Canonical Forms of Matrices
8.1. Preliminaries
An interesting observation is that important properties of linear operators in n-dimensional vector spaces can be revealed already for dimensions as low as n = 2; see [27]. In this section, we consider some modified definitions of canonical forms of matrices and give instructive examples for the sensitivity of Jordan forms using BFCTs.
8.2. Jordan and Generalized Jordan Forms
In this section, we use some less common definitions of canonical forms of matrices, e.g., a definition of the Jordan form of a matrix which does not use eigenvalues, eigenvectors and generalized eigenvectors.
Definition 11.
Let . The upper triangular bi-diagonal matrix is said to be a Jordan matrix if
- 1.
- 2.
- implies
- 3.
- implies
for . The set of Jordan matrices is denoted as .
Example 18.
The set consists of matrices and , where .
Definition 12.
The Jordan matrix J is spectrally ordered if for . The set of spectrally ordered Jordan matrices is denoted as .
Definition 13.
The Jordan problem for a nonzero matrix is to find a pair such that . Here, is the transformation matrix and is the Jordan form of A.
We do not use the spectrum of A or the concept of Jordan blocks. Indeed, given a general matrix A (even with integer elements and of very low order as in (6)), a priori we know neither its spectrum nor its Jordan structure. Note that the Jordan form is not unique and is determined only up to the placement of the 1s on its super-diagonal, except when A is a scalar multiple of the identity, in which case the Jordan form J = A is unique and the transformation matrix is any invertible matrix. The transformation matrix is always non-unique, since any nonzero scalar multiple of it is also a transformation matrix. The Jordan form may be defined uniquely, thus becoming a canonical Jordan form, in terms of group theory [28] and integer partition theory [29], although this is rarely seen in the literature.
Example 19.
If the matrix has a single eigenvalue, then there are orderings of the 1s on the positions of the super-diagonal of A.
Definition 14.
Let . The matrix is said to be a stabilizer of J if . The set of stabilizers of A is denoted as . The matrix is called the center of .
Note that is not necessarily a group unless , and is not a center in the group-theoretical sense.
Example 20.
Let and . Then, the following three cases are possible.
- 1.
- If , then is the group .
- 2.
- If , , then is the set of matrices and , where .
- 3.
- If , then is the set of matrices , .
Definition 15.
The upper triangular bi-diagonal matrix is said to be a generalized Jordan matrix if
- 1.
- implies ;
- 2.
- implies .
The set of generalized Jordan matrices is denoted as .
Generalized Jordan matrices G have the structure of Jordan matrices, where the nonzero elements (if any) are not necessarily equal to 1. Hence, these matrices are less sensitive to perturbations in the matrix A.
Example 21.
The elements of are and , where and .
Definition 16.
The generalized Jordan problem for a nonzero matrix is to find a pair such that . Here, is the transformation matrix and is the generalized Jordan form of A.
Consider now perturbations of Jordan forms. Let A_eps = A + eps*E, where E is a given perturbation matrix and eps is a small parameter. If the matrix A has multiple eigenvalues, then the transformation matrix and the Jordan form of A_eps may be discontinuous at the point eps = 0. If the matrix A has simple eigenvalues, then both matrices depend continuously on eps. To illustrate these facts by BFCTs, consider low-order matrices A. For such matrices A with rational elements, the matrices X_A and J_A are computed by the MATLAB® command [X_A,J_A] = jordan(A) exactly.
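A minimal sketch of this discontinuity (our own illustrative 2 x 2 matrix, not the matrix of Example 22 below; the command jordan requires the Symbolic Math Toolbox):
A = sym([2 1; 0 2]);             % double eigenvalue 2, one Jordan block
[X0,J0] = jordan(A)              % J0 = [2 1; 0 2]
e = sym(1)/10^16;                % small exact (symbolic) perturbation
[Xe,Je] = jordan(A + [0 0; e 0])
% Je is diagonal with the simple eigenvalues 2 - 10^-8 and 2 + 10^-8, so the
% Jordan form jumps discontinuously at e = 0, while the transformation matrix
% Xe becomes ill-conditioned as e tends to 0.
In contrast, perturbing a matrix with simple eigenvalues changes its (diagonal) Jordan form only slightly.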
Example 22.
Let , and . Then, and . The command [X,J] = jordan(A) computes the matrices , exactly for . For , the matrices and are computed as , since ε is rounded to zero.
Example 23.
Consider the matrix and let . Then, the Jordan forms of seem to be and The form is unachievable. For the form , we have infinitely many transformation matrices , e.g., , where is arbitrary. The code [X,J] = jordan(A) computes exactly the matrices (with ) and for and . For , these matrices are computed wrongly as and .
Proposition 8.
A Jordan form J of a matrix A with multiple eigenvalues may be sensitive to perturbations in A due to three main restrictions:
- 1.
- Because for ;
- 2.
- Because for (if );
- 3.
- Because the nonzero elements (if any) are equal to 1.
A generalized Jordan form of A is usually less sensitive to perturbations in A, since restriction 3 is absent. A Schur form is even less sensitive, since both restrictions 2 and 3 are absent. A generalized Schur form is least sensitive since, in addition, any of its diagonal blocks is either upper triangular or lower triangular [30]. All these observations are confirmed in practice by BFCTs.
8.3. Condensed Schur Forms
Schur forms of matrices A satisfy only the condition of upper triangularity, i.e., the elements below the main diagonal are zero, and are thus less sensitive to perturbations in A. Nevertheless, they may be discontinuous in a neighborhood of a matrix with multiple eigenvalues. The case of simple eigenvalues is studied in more detail; see, e.g., [30].
Definition 17.
The matrix is said to be a Schur matrix if it is upper triangular. The set of Schur matrices is denoted as .
We recall that we study only cases . Now, it is easy to see that if and only if .
Each matrix A is unitarily similar to a Schur matrix T, i.e., A = U T U^H for some unitary matrix U. The matrix U is the transformation matrix and the matrix T is the condensed Schur form of A. The Schur problem for A is to describe the set of pairs (U, T) and to study their sensitivity relative to perturbations in A.
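In MATLAB, such a pair is computed by the command schur; a minimal verification sketch (the matrix A is an arbitrary example of our own):
A = [3 1 0; 0 3 2; 1 0 3];        % an arbitrary test matrix
[U,T] = schur(A,'complex');       % complex Schur form: T is upper triangular
norm(tril(T,-1))                  % strictly lower part of T is (numerically) zero
norm(U*T*U' - A)                  % unitary similarity A = U*T*U' up to rounding
norm(U'*U - eye(3))               % U is (numerically) unitary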
Although less sensitive to perturbations in A, the solution of the Schur problem may also be sensitive and even discontinuous as a function of A at points , where the matrix has multiple eigenvalues.
Let and be a perturbation in A. The Schur form of A is A itself, and we may consider the trivial (or central) solution of the Schur problem for A. For arbitrary small , the Schur forms of are with transformation matrices . Thus, the transformation matrix jumped from to for .
Example 24.
Let , and . The command [X,J] = schur(A) gives the correct answer
X =
     0    -1
     1     0
J =
   1.000000000000000  -0.000000000000001
                   0   1.000000000000000
For a different value of ε, we obtain a wrong answer due to rounding errors. Obviously, the code schur works in FPA, in contrast to jordan, which uses VPA.
9. Least Squares Revisited
9.1. Preliminaries
In least squares methods (LSMs), the aim is to minimize the sum of squared residuals. A less widely known fact is that smaller residuals may correspond to larger errors [31,32]. Moreover, this phenomenon may be observed on an infinite set of residuals and errors. This may have far-reaching consequences, so the LSM has to be revisited, taking such cases into account. Note that a least squares problem (LSP) may arise naturally in connection with a given approximation scheme or as an equivalent formulation of a zero-finding method.
9.2. Nonlinear Problems
The solution of the (overdetermined) equation f(x) = b, where x and b are vectors and f is a continuous function, may be reduced to the minimization of the quantity φ(x) = ‖f(x) − b‖. Such minimization problems also arise in a genuine optimization setting, e.g., in the approximation of data.
Let φ be a continuous function and let the minimization problem for φ be unimodal, i.e., let there be a unique (global) minimizer x_*. Denote by e = ‖x − x_*‖ the distance between the vectors x and x_*, i.e., the absolute error of the vector x when x is interpreted as the computed solution.
Let {x_k} be a sequence of computed solutions and set r_k = φ(x_k) and e_k = ‖x_k − x_*‖. Then, it is possible that r_k tends to 0 and e_k tends to ∞ when k → ∞, as the next example shows.
Example 25.
Let
The function φ is unimodal with minimum . Let . Then, and as .
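A simple function exhibiting this behavior (our own illustration, which need not coincide with the function of Example 25) is φ(x) = x^2 exp(-x), which has the unique global minimizer x_* = 0 with φ(x_*) = 0, while φ(x_k) → 0 for x_k = k → ∞. A few lines of MATLAB confirm the opposite behavior of residuals and errors:
phi = @(x) x.^2.*exp(-x);   % our own unimodal example: unique minimizer x* = 0, phi(x*) = 0
k = (1:5:51)';              % a sequence of trial "solutions" x_k = k
[k, phi(k)]                 % the error |x_k - 0| grows while the residual phi(x_k) -> 0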
This result is easily extended to the case .
Example 26.
Let
The function φ has infinitely many minimums on the unit sphere . Let for , i.e., . Then, and as .
Thus, the residual and the error may have opposite behavior. This phenomenon, namely, the larger the remainder, the smaller the error, is known as Remainders vs. Errors; see [32]. In particular, it may affect the stopping rules used in implementations of the LSM.
9.3. Linear Problems
Consider an m x n matrix A, m > n, of full column rank and let b be a given vector which is not in the range of A. The linear least squares problem (LLSP) is to find the vector x_* which minimizes ‖Ax − b‖. This problem is unimodal, and its theoretical solution is x_* = A^† b, where A^† = (A^T A)^{-1} A^T is the pseudo-inverse of the matrix A. This representation of the solution is not used in computational practice, since it is inefficient and may be connected with a substantial loss of accuracy. Instead, the solution is found by, e.g., the QR decomposition of the matrix A. Let A = QS, where Q has orthonormal columns, the matrix S is upper triangular and invertible, and let c = Q^T b. The solution is then obtained by the code x = A\b, which in fact computes x = S\c.
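A minimal sketch of this QR route (our own illustration with random data of an assumed size 100 x 5):
m = 100; n = 5;
A = randn(m,n); b = randn(m,1);   % random full-column-rank data (assumed sizes)
[Q,S] = qr(A,0);                  % economy-size QR: A = Q*S, S upper triangular
c = Q'*b;
xqr = S\c;                        % least squares solution via the triangular system
xbs = A\b;                        % the same solution computed by backslash
norm(xqr - xbs)                   % the two agree to rounding error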
Intensive experiments based on BFCTs with random data show that the error of the “bad” solution computed from the explicit formula x = (A^T A)^{-1} A^T b is between 5 and 10 times larger than the error of the “good” solution x = A\b. In addition, finding the “bad” solution requires many more computations.
Let x_1 and x_2 be two approximate solutions of the above LLSP which are different from the exact solution x_*. Set e_i = ‖x_i − x_*‖ and r_i = ‖Ax_i − b‖, i = 1, 2. The errors and the remainders are related by an estimate (9) involving σ_max and σ_min, the maximal and minimal singular values of A. Next, it is possible [32] to choose x_1 and x_2 so that the estimate (9) is fulfilled with the larger remainder corresponding to the smaller error. Hence, when the matrix A is poorly conditioned, i.e., when the ratio σ_max/σ_min is large, the behavior of remainders and errors may be opposite: the larger the remainder, the smaller the error. This fact deserves a separate formulation.
Proposition 9.
When the condition number σ_max/σ_min of A is large, larger remainders may correspond to smaller errors. In this case, checking the accuracy of the computed solutions by the remainders is misleading.
The situation with remainders and errors is ironic. If the matrix A is well conditioned with close to 1, the computed approximations are most probably good, and there is no need to check their accuracy by remainders. But if the matrix A is poorly conditioned and the accuracy test seems necessary, the accuracy check by remainders may lead to wrong conclusions.
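The phenomenon is easy to reproduce with a hand-made ill-conditioned matrix (a minimal sketch; the matrix and the trial solutions are our own illustration and differ from those in Example 27 below):
A = [1 0; 0 1e-8];            % cond(A) = 1e8
b = [1; 1];
xs = A\b;                     % exact solution [1; 1e8]
x1 = xs + [1; 0];             % error 1,   residual norm approx. 1
x2 = xs + [0; 1e6];           % error 1e6, residual norm approx. 1e-2
errors = [norm(x1-xs), norm(x2-xs)]
residuals = [norm(A*x1-b), norm(A*x2-b)]
% x2 has by far the larger error and yet by far the smaller residual.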
Example 27.
Consider the simplest case and , , where and is a small parameter. The only solution of the LSP is . Let and be two approximate solutions (the vector with error can hardly be recognized as an approximate solution, but we shall ignore this fact). We have , , , and
This is possible, since the matrix B is very badly conditioned and the estimate (9) is achieved.
Denote by x_* the exact solution and let x_t be a family of approximate solutions of the equation f(x) = b, where the function f is differentiable. Let x_t tend to x_* as t → ∞ and denote by e_t and r_t the error and the remainder of the approximate solution x_t. These relations define the remainder r as a parametric function of e and vice versa.
Proposition 10.
There exists a smooth function such that the following assertions hold true.
- 1.
- The function is smooth and decreasing.
- 2.
- The function is smooth and non-monotone in each interval , where .
- 3.
- The function (whenever defined) is smooth and non-monotonic in each arbitrarily small subinterval of the interval where .
Example 28.
Let ,
and . Then, and
Let and , . We have and
Therefore,
The differentiable function is not monotone at each interval
This non-monotonicity means that for any , there exist infinitely many points with and . Thus, the LSM may be misleading not only for linear algebraic equations but also for LLSPs for which the method had been specially designed.
10. Integrals and Derivatives
10.1. Integrals
The calculation of the definite integral is one of the oldest computational problems in numerical mathematical analysis. According to [33], a variant of the trapezoid rule was used in Babylon around 50 BCE for integrating the velocity of Jupiter along the ecliptic.
Usually, the integral I = ∫_a^b f(x) dx is computed for finite limits a and b, but the case when one or both of the limits a, b are infinite and the integral is improper is also considered. There are many sophisticated algorithms and computer codes for computing such integrals; see [1]. In MATLAB®, definite integrals of functions of one variable are computed by the command integral(f,a,b) or by a variant of the symbolic command int.
In this section, we apply BFCTs to the computation of integrals with highly oscillating integrands and with power integrands by the standard trapezoidal rule [34]. Let a = x_0 < x_1 < … < x_n = b be a partition of the interval [a, b]. Sophisticated integration schemes for computing I from the values f(x_k) had been developed in the pre-computer age (before 1950), when the computation of f(x_k) for large n was itself a problem. Now, BFCTs allow one to obtain results for n as large as 10^8, using simple quadratures such as the formula of trapezoids with equal spacing [35], described below as the MATLAB® code
n = 10^m; X = linspace(a,b,n+1); Y = f(X); h = (b-a)/n;
T_n = h*(sum(Y(2:n)) + (Y(1) + Y(n+1))/2)
If the function f is twice continuously differentiable, then the absolute error of the trapezoidal method with equal spacing is estimated as
|I − T_n| ≤ (b − a)^3 M_2 / (12 n^2),
where M_2 is the maximum of |f''(x)| over [a, b].
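As a quick check of the code above (our own test integrand, different from the one in Example 29 below), one can integrate f(x) = exp(x) over [0, 1], for which the exact value e − 1 is known, and observe the n^{-2} decrease of the error:
f = @(x) exp(x); a = 0; b = 1; I = exp(1) - 1;   % test integral with known exact value
for m = 2:6
    n = 10^m; X = linspace(a,b,n+1); Y = f(X); h = (b-a)/n;
    T_n = h*(sum(Y(2:n)) + (Y(1) + Y(n+1))/2);
    fprintf('n = %8d   error = %.3e\n', n, abs(T_n - I));
end
% Each tenfold increase of n decreases the error by a factor of about 100.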
Example 29.
Consider the integral , where , and
Then,
For
we have
and .
For , the number is relatively small, and the function f is highly oscillating close to a; see Figure 2. Next, we have
and .
Figure 2.
An oscillating function.
The error and the ratio of the error to its estimate for n = 10^m and m up to 8 are shown in Table 4.
Table 4.
Errors and estimates for the trapezoidal rule.
For n from 1000 to 1,000,000, the error (which here is both absolute and relative) decreases as n^{-2}, while the ratio of the error and the estimate is approximately constant. The n^{-2} behavior of the error is thus correctly predicted, but the bound considerably overestimates the actual error, although this bound is attained for certain integrands and hence is not improvable.
For n between 1,500,000 and 100,000,000, the accuracy of the computed solution does not increase with n and shows chaotic behavior due to the accumulation of rounding errors. Nevertheless, the results obtained for smaller values of n are satisfactory from a practical viewpoint.
For f and as in Example 29, the numerical MATLAB® code integral(f,a,b) produces the answer with error . At the same time, the NLF gives result with error .
The next example shows that a sophisticated quadrature may give completely wrong results for a polynomial integrand, while a BFCT approach still gives satisfactory results.
Example 30.
Let f_N(x) = (N + 1)x^N, where N is a positive integer, and consider the integral I_N = ∫_0^1 f_N(x) dx = 1. For moderate values of N, the MATLAB® code integral(f_N,0,1) works well, with small errors. The BFCT result is also acceptable, although it is slightly worse. However, for large N, the code integral(f_N,0,1) gives a completely wrong result with 100% error, while the BFCT formula still works relatively well.
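The comparison can be reproduced along the following lines (a sketch; the value N = 50 is our own choice and can be increased to probe the failure):
N = 50;                              % free parameter; increase it to probe the failure
f_N = @(x) (N+1)*x.^N;               % integrand with exact integral 1 over [0,1]
I_quad = integral(f_N, 0, 1);        % adaptive quadrature
n = 10^6; X = linspace(0,1,n+1); Y = f_N(X); h = 1/n;
I_trap = h*(sum(Y(2:n)) + (Y(1) + Y(n+1))/2);   % BFCT: trapezoidal rule
[abs(I_quad - 1), abs(I_trap - 1)]   % errors against the exact value 1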
Of course, in cases as in Example 30 with known primitives, we may use the MATLAB® symbolic code int based on the NLF, e.g., syms x; I_N = int((N+1)*x^N,0,1). This gives the exact answer 1 for any value of N, no matter how large.
This leads to the following proposition.
Proposition 11.
Checking the performance of sophisticated quadrature formulas against a BFCT such as the trapezoidal rule with a large number of knots is a good numerical strategy.
10.2. Derivatives
The numerical differentiation of a scalar function f of a scalar or vector argument x is used in many cases, e.g., when an analytical expression for the derivative is not available or is too complicated, or when numerical algorithms are applied that use approximate derivatives of f. In the simplest yet most widespread case, the derivative f'(x) is approximated numerically by the first-order finite difference D_h f(x) = (f(x + h) − f(x))/h, where h > 0 is small but not too small. The behavior of D_h f(x) as a function of h resembles a saw; see Example 31.
Example 31.
Consider the first difference D_h f(x) = (f(x + h) − f(x))/h for the simplest non-constant function f(x) = x at a given point x. In exact arithmetic, D_h f(x) = 1 for every h. The graph of the computed function h ↦ D_h f(x) for a range of small h is shown in Figure 3 in blue. The graph of the exact value 1 is the horizontal red line.
Figure 3.
Computed first difference of the function f(x) = x.
Consider again f(x) = x at a point x of order 1. When h is too small, e.g., smaller than half the spacing of the machine numbers around x, the computed value of x + h is x; hence, f(x + h) − f(x) = 0 and the derivative is computed as 0 with 100% relative error. This numerical catastrophe may be avoided by BFCTs revealing the real behavior of D_h f(x) for small values of h. The general principle here is that h must be chosen as a multiple C of the square root of the machine epsilon, and the problem is how to estimate the constant C, especially in RTA. Note that when no information about the constant is available, the heuristic rule is to set h equal to the square root of the machine epsilon.
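The saw-like behavior and the eventual collapse to 0 can be seen directly (a sketch for f(x) = x at the assumed point x = 1):
f = @(x) x; x = 1;              % simplest non-constant function, point x = 1 assumed
h = 10.^(-(1:20));              % decreasing steps down to 1e-20
D = (f(x + h) - f(x))./h;       % computed first differences; the exact value is 1
[h; D]'                         % for very small h, x + h rounds to x and D becomes 0
% sqrt(eps), about 1.5e-8 in double precision, is a common heuristic choice for h.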
11. VPA Versus FPA
There are many types of variable precision arithmetic (VPA) and computer languages in which these VPAs are realized. Here, we include systems using symbolic objects such as exact rational numbers and constants like π. VPAs may work with very high precision (actually, with infinite precision), but the performance of numerical algorithms with VPAs may be slow and not suitable for RTA. In contrast, binary double precision floating-point arithmetic (FPA) works with a finite set of machine numbers, has finite precision and works relatively fast. Thus, FPA is suitable for RTA. Algorithms combining FPA and VPA based on MAI have also been used recently. Here, BFCTs may be very useful as well.
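A tiny illustration of the difference (a standard check, formulated here in MATLAB syntax): in FPA, the expression 1 − 3(4/3 − 1) evaluates to the machine epsilon rather than to its exact value 0, while exact (symbolic) arithmetic gives 0.
e_fpa = 1 - 3*(4/3 - 1)         % equals eps = 2.2204e-16 in double precision
e_vpa = 1 - 3*(sym(4)/3 - 1)    % exact rational arithmetic gives 0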
12. Conclusions
Our main results are summarized in Propositions 1–6, 10 and 11. Most of the results presented are valid when BFCTs are used in FPA. Using extended precision as in VPA, some of the observed effects may no longer occur. This may be accomplished, although it increases the computation time, potentially limiting RTA.
The combination of BFCTs and CPRS provides a powerful tool for solving problems and for checking computed solutions and a priori accuracy estimates for many computational problems. In the authors' opinion, BFCTs may become a useful complement in applying artificial intelligence techniques to systems for computational mathematics such as MATLAB® [6,7] and Maple [9].
Author Contributions
Validation, P.H.P.; Formal analysis, E.B.M.; Investigation, M.M.K. The authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The datasets generated during the current study are available from the authors upon reasonable request.
Acknowledgments
The authors are thankful to anonymous reviewers for their very helpful comments and suggestions.
Conflicts of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
| AI | artificial intelligence |
| ARRE | average rounding relative error |
| BFCT | brute force computational technique |
| CPRS | computational problem with reference solution |
| CPU | central processing unit |
| FPA | double precision binary floating-point arithmetic |
| LLSP | linear least squares problem |
| LSM | least squares method |
| LSP | least squares problem |
| MAI | mathematical artificial intelligence |
| MRRE | maximal rounding relative error |
| NLF | Newton–Leibniz formula |
| NFP | numerical fixed point |
| RRE | rounding relative error |
| RTA | real-time application |
| VPA | variable precision arithmetic |
References
- Stoer, J.; Bulirsch, R. Introduction to Numerical Analysis; Springer: New York, NY, USA, 2002; ISBN 978-0387954523. [Google Scholar] [CrossRef]
- Faires, D.; Burden, A. Numerical Analysis; Cengage Learning: Boston, MA, USA, 2016; ISBN 978-1305253667. [Google Scholar]
- Chapra, S. Applied Numerical Methods with MATLAB for Engineers and Scientists, 5th ed.; McGraw Hill: New York, NY, USA, 2017; ISBN 978-12644162604. [Google Scholar]
- Novak, K. Numerical Methods for Scientific Computing; Equal Share Press: Arlington, VA, USA, 2022; ISBN 978-8985421804. [Google Scholar]
- Driscoll, T.; Braun, R. Fundamentals of Numerical Computations; SIAM: Philadelphia, PA, USA, 2022. [Google Scholar] [CrossRef]
- The MathWorks, Inc. MATLAB Version 9.9.0.1538559 (R2020b); The MathWorks, Inc.: Natick, MA, USA, 2020; Available online: www.mathworks.com (accessed on 1 December 2024).
- Higham, D.; Higham, N. MATLAB Guide, 3rd ed.; SIAM: Philadelphia, PA, USA, 2017; ISBN 978-1611974652. [Google Scholar]
- Eaton, J.; Bateman, D.; Hauberg, S.; Wehbring, R. GNU Octave Version 5.2.0 Manual: A High-Level Interactive Language for Numerical Computations. 2019. Available online: www.gnu.org/software/octave/doc/v5.2.0/ (accessed on 5 February 2025).
- Maplesoft. Maple 2017.3, Ontario. 2017. Available online: www.maplesoft.com (accessed on 1 December 2024).
- Mathematica. Available online: www.wolfram.com/mathematica/ (accessed on 1 December 2024).
- WolframAlpha. Available online: www.wolframalpha.com/ (accessed on 1 December 2024).
- Grama, A.; Gupta, A.; Karypis, G.; Kumar, V. Introduction to Parallel Computing, 2nd ed.; Addison-Wesley: Boston, MA, USA, 2003; ISBN 0201648652. [Google Scholar]
- Kirk, D.; Hwu, W. Programming Massively Parallel Processors: A Hands-on Approach, 3rd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2016; ISBN 978-0128119860. [Google Scholar]
- Kavis, M. Architecting the Cloud: Design Decisions for Cloud Computing Service Models (SaaS, PaaS, and IaaS); Wiley: Hoboken, NJ, USA, 2014; ISBN 978-1118617618. [Google Scholar]
- Barbu, A.; Zhu, S. Monte Carlo Methods; Springer: Singapore, 2020; ISBN 978-9811329708. [Google Scholar] [CrossRef]
- Davies, A.; Veličković, P.; Buesing, L.; Blackwell, S.; Zheng, D.; Tomašev, N.; Tanburn, R.; Battaglia, P.; Blundell, C.; Juhász, A.; et al. Advancing mathematics by guiding human intuition with AI. Nature 2021, 600, 70–74. [Google Scholar] [CrossRef]
- Fink, T. Why mathematics is set to be revolutionized by AI. Nature 2024, 629, 505. [Google Scholar] [CrossRef]
- 754-2019; IEEE Standard for Floating-Point Arithmetic. IEEE Computer Society: Piscataway, NJ, USA, 2019. [CrossRef]
- Higham, N. Accuracy and Stability of Numerical Algorithms; SIAM: Philadelphia, PA, USA, 2002; ISBN 0-898715210. [Google Scholar]
- Higham, N.; Konstantinov, M.; Mehrmann, V.; Petkov, P. The sensitivity of computational control problems. IEEE Control. Syst. Mag. 2004, 24, 28–43. [Google Scholar] [CrossRef]
- Goldberg, D. What every computer scientist should know about floating-point arithmetic. Acm Comput. Surv. 1991, 23, 5–48. [Google Scholar] [CrossRef]
- Pugh, C. Real Mathematical Analysis; Springer Nature: Cham, Switzerland, 2015; ISBN 978-3319177700. [Google Scholar] [CrossRef]
- Konstantinov, M.; Petkov, P. Computational errors. Int. J. Appl. Math. 2022, 35, 181–203. [Google Scholar] [CrossRef]
- Balinski, M.; Young, H. Fair Representation: Meeting the Ideal One Man One Vote, 2nd ed.; Brookings Institutional Press: New Haven, CT, USA; London, UK, 2001; ISBN 978-0815701118. [Google Scholar] [CrossRef]
- Lagarias, J.; Reeds, J.; Wright, M.; Wright, P. Convergence properties of the Nelder-Mead simplex method in low dimensions. Siam J. Optim. 1998, 9, 112–147. [Google Scholar] [CrossRef]
- Nelder, J.; Mead, R. A simplex method for function minimization. Comput. J. 1965, 7, 308–313. [Google Scholar] [CrossRef]
- Glazman, M.; Ljubich, J. Finite-Dimensional Linear Analysis: A Systematic Presentation in Problem Form; Dover Books in Mathematics; Dover Publications: New York, NY, USA, 2006; ISBN 978-04866453323. [Google Scholar]
- Jacobson, N. Basic Algebra I; W. Freeman & Co Ltd.: London, UK, 1986; ISBN 978-07167114804. [Google Scholar]
- Andrews, G. The Theory of Partitions; Revisited Edition; Cambridge University Press: Cambridge, UK, 2008; ISBN 978-0521637664. [Google Scholar]
- Konstantinov, M.; Petkov, P. On Schur forms for matrices with simple eigenvalues. Axioms 2024, 13, 839. [Google Scholar] [CrossRef]
- Booth, A. Numerical Methods, 3rd ed.; Butterworths Scientific Publications: London, UK, 1966; LCCN 57004892. [Google Scholar]
- Konstantinov, M.; Petkov, P. Remainders vs. errors. AIP Conf. Proc. 2016, 1789, 060007. [Google Scholar] [CrossRef]
- Ossendrijver, M. Ancient Babylonian astronomers calculated Jupiter’s position from the area under a time-velocity graph. Science 2016, 351, 482–484. [Google Scholar] [CrossRef] [PubMed]
- Rahman, Q.; Schmeisser, G. Characterization of the speed of convergence of the trapezoidal rule. Numer. Math. 1990, 57, 123–138. [Google Scholar] [CrossRef]
- Atkinson, K. An Introduction to Numerical Analysis, 2nd ed.; John Wiley & Sons: New York, NY, USA, 1989; ISBN 978-0471500230. [Google Scholar]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


