The five algorithms described in the previous section have been implemented both in MATLAB on a desktop computer and on a Cortex-M4F microcontroller.
For algorithms requiring a symmetric matrix, the input matrix is converted to a symmetric form as follows. First, it is converted to bidiagonal form by applying Algorithm A1; then, denoting by B the upper bidiagonal portion of the result, the SVD algorithm is applied to the symmetric tridiagonal matrix $B^T B$. In this way, the singular values of the original matrix are the square roots of the eigenvalues of the latter tridiagonal matrix.
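As a brief justification (a standard argument, sketched here for clarity rather than quoted from the original appendix): since the bidiagonalization is performed with orthogonal transformations, $A$ and $B$ share the same singular values; if $B = U_1 \Sigma V_1^T$ is the SVD of $B$, then

$$B^T B = V_1 \Sigma^2 V_1^T,$$

so the eigenvalues of the tridiagonal matrix $B^T B$ are $\sigma_i^2$, and the singular values of $A$ are recovered as $\sigma_i = \sqrt{\lambda_i(B^T B)}$.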
3.2. Cortex-M4F Implementation
The STM32F429ZI microcontroller is based on an ARM 32-bit Cortex-M4F [60] CPU clocked at 180 MHz, with 2 MB of flash memory for code and read-only data, and 256 KB of RAM. In addition, it has several hardware peripherals that are not relevant to this work.
A main feature of the Cortex-M4F core is the presence of a 32-bit hardware floating-point unit (FPU), as implied by the additional “F” in its name. An FPU is essential for any kind of heavy computational work in the floating-point domain, as is the case for the experiments on SVD performed in this article. The Cortex-M4F FPU is limited to 32 bits (https://developer.arm.com/docs/ddi0439/latest/floating-point-unit/about-the-fpu), so the algorithms have been implemented using single-precision (32-bit) values. Implementing this kind of algorithm on a CPU with no FPU, or with a precision larger than that handled by the hardware, would require software floating-point mathematical libraries, which would be prohibitive in an already resource-constrained system.
The algorithms were implemented in the C language. No particular development environment was used: the code was compiled with the GCC software suite for ARM on a GNU/Linux machine, using a custom makefile and with the aid of the STM32F4xx Standard Peripherals Drivers (STMicroelectronics, Geneva, Switzerland), a set of libraries provided by ST for their microcontrollers that covers all aspects of hardware management, from low-level initialization to the use of hardware peripherals. The firmware is of the “bare-metal” kind, so no real-time operating system (RTOS) or other middleware has been added.
The hardware system required no particular design besides what was already provided by the 32F429IDISCOVERY board (STMicroelectronics, Geneva, Switzerland). The device has been clocked at its maximum speed of 180 MHz. The board also integrates all the hardware needed for programming and debugging the microcontroller, namely the ST-LINK/V2 interface (STMicroelectronics, Geneva, Switzerland), offering USB connectivity with the computer. On the computer side, communication with this interface has been established by using OpenOCD (http://openocd.org), free software for debugging and programming ARM and other systems. OpenOCD in turn acts as a server for GDB, the standard debugger of the GCC suite, used when needed to transfer the code to the device and to examine its memory for the results of the tests.
Regarding input and output, read-only data for the program, such as the larger matrix from which smaller ones are generated, or the reference vectors of singular values used to compute the accuracy, are stored in program (flash) memory, which is more abundant than RAM. Once the program is run for a series of tests, the numerical outputs can be examined through the debugger by interrupting the program at convenient points. The timing of the individual routines is computed by the software itself, using the SysTick timer built into the Cortex-M core.
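As an illustration of this timing mechanism, the following is a minimal sketch of how a routine can be timed with SysTick through the standard CMSIS interface; the 1 ms tick and the function names are illustrative assumptions, not taken from the original firmware.

#include <stdint.h>
#include "stm32f4xx.h"              /* CMSIS device header: SysTick, SystemCoreClock */

static volatile uint32_t ms_ticks;  /* incremented once per millisecond */

void SysTick_Handler(void)          /* standard CMSIS interrupt handler name */
{
    ms_ticks++;
}

static void timing_init(void)
{
    /* Program SysTick for one interrupt every millisecond (core clock = 180 MHz). */
    SysTick_Config(SystemCoreClock / 1000u);
}

/* Hypothetical helper: measure the execution time of one SVD routine in milliseconds. */
static uint32_t time_routine_ms(void (*routine)(float *a, int m, int n),
                                float *a, int m, int n)
{
    uint32_t start = ms_ticks;
    routine(a, m, n);
    return ms_ticks - start;
}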
Besides the optimizations performed by the compiler, special care has been taken in optimizing the critical mathematical routines while keeping the usage of program memory and RAM low, in order to speed up the computation as much as possible while fitting within the constrained resources of the system.
A first optimization comes from choosing the best arrangement of data in memory. Two-dimensional matrices can be stored in RAM (which is a linear array of bytes) in two different ways: row-major or column-major order, that is, storing data sequentially by row or by column, respectively (see Figure 2). The former is the most common choice in the C language, or wherever the mathematical convention of using the row as the first index is followed. Data layout has a much more crucial impact on CPUs with cache memory, where the cache is filled with sequential blocks of data from RAM, so accessing sequential data is much faster than accessing scattered data. As said, a microcontroller has no cache memory, so this is not directly the case here; nevertheless, a non-negligible advantage exists in accessing data sequentially, thanks to load/store assembly instructions with auto-increment, that is, instructions that read or write data from memory and increment the address register in a single execution cycle.
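For reference, the two layouts correspond to the following index computations for an m × n matrix stored in a flat array (a small illustrative sketch; the macro names are not from the original code):

/* Row-major: element (i, j) of an m x n matrix; row i is contiguous in memory.  */
#define RM(a, i, j, n)  ((a)[(size_t)(i) * (n) + (j)])
/* Column-major: element (i, j) of an m x n matrix; column j is contiguous.      */
#define CM(a, i, j, m)  ((a)[(size_t)(j) * (m) + (i)])

Traversing the matrix in the same order in which it is stored lets the compiler emit sequential loads and stores with auto-increment addressing.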
As a quantitative example of the effect of row-major or column-major data storage, we can consider the one-sided Jacobi rotation algorithm, which differs from the other ones in that it accesses the input matrix exclusively by column. Table 2 shows the different timings of the algorithm for the biggest set of matrices, both as absolute times and as a percentage of speed increase. As said, the increase in speed is modest, but appreciable. The last column shows the time needed to transpose the matrix in case it is stored in the opposite order, which is approximately one order of magnitude smaller than the time improvement obtained. Moreover, the input matrix to the SVD is often generated by previous computations, and so it can be produced directly in the more convenient order.
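The transposition mentioned above can be performed with a simple out-of-place routine such as the following sketch (names are illustrative); an m × n row-major matrix becomes its n × m row-major transpose, which is equivalent to reinterpreting the original data as column-major.

/* a: m x n (row-major input), at: n x m (row-major transpose) */
void mat_transpose(const float *a, float *at, int m, int n)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            at[j * m + i] = a[i * n + j];
}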
Besides accessing data in the convenient order, which results in a modest speed increment, and lacking further hardware resources to be exploited, other optimizations must necessarily be obtained by reducing unneeded computations as much as possible. A significant example is matrix multiplication, one of the most computationally expensive operations in matrix algebra. Generally speaking, if $C = AB$, the generic element of $C$ is given by

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj} \qquad (14)$$

where $n$ is the number of columns of $A$. Computing all the elements of $C$ in this way requires a triple nested loop, which is very computationally expensive, especially for large matrices.
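A direct C implementation of (14) looks like the following sketch (row-major storage assumed; function and variable names are illustrative):

/* C = A * B, with A of size m x n, B of size n x p, C of size m x p (row-major). */
void mat_mul(const float *a, const float *b, float *c, int m, int n, int p)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * p + j];   /* B is traversed by column here */
            c[i * p + j] = sum;
        }
    }
}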
The number of operations performed for a matrix multiplication can be reduced by observing the properties of the specific matrices involved. For example, a recurrent operation in the presented algorithms is the multiplication of a square matrix by its transpose, as in $C = AA^T$. In this case, (14) becomes

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, a_{jk} \qquad (15)$$

First, we can notice that in the inner loop $A$ is always traversed by row, so we get the advantage of always reading data in the most convenient order if the matrix is stored in row-major order. Most importantly, it can easily be seen that $AA^T$ is a symmetric matrix, so we can actually compute only approximately half of its elements, considerably reducing the number of operations (Figure 3).
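A sketch of (15) exploiting both observations follows; only the upper triangle is computed and then mirrored (the routine name is illustrative):

/* C = A * A^T for a square n x n matrix A stored in row-major order.
 * C is symmetric, so only the upper triangle is computed and then mirrored. */
void mat_mul_aat(const float *a, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++) {                 /* roughly half of the elements */
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * a[j * n + k];   /* both factors are read by row */
            c[i * n + j] = sum;
            c[j * n + i] = sum;                       /* symmetric counterpart */
        }
    }
}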
A further reduction of the number of operations is possible in the case of $C = AA^T$ where $A$ is an upper bidiagonal matrix, another common case in the given algorithms. Given that $a_{ik} \neq 0$ only for $k = i$ or $k = i + 1$, from (15) it follows that $c_{ij} \neq 0$ only for $j = i - 1$, $j = i$, or $j = i + 1$ (indeed, the resulting matrix is tridiagonal) and that the only non-zero terms in the sum are those for which $a_{ik}$ and $a_{jk}$ are both non-zero. The resulting formula is:

$$c_{ii} = a_{ii}^2 + a_{i,i+1}^2, \qquad c_{i,i+1} = a_{i,i+1}\, a_{i+1,i+1} \qquad (16)$$

where the $a_{i,i+1}$ element may not exist (in the last row). The result being also symmetric, the reduction of the previous case applies as well (Figure 4).
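A corresponding sketch of (16) in C might look as follows (the zero initialization of C and the names are illustrative assumptions):

#include <string.h>

/* C = A * A^T where A is an n x n upper bidiagonal matrix (row-major).
 * Only the diagonal and the superdiagonal are computed; the subdiagonal
 * follows by symmetry, and all other elements are zero. */
void bidiag_mul_aat(const float *a, float *c, int n)
{
    memset(c, 0, (size_t)n * n * sizeof(float));          /* off-band elements are zero  */
    for (int i = 0; i < n; i++) {
        float d = a[i * n + i];                           /* a_{i,i}                     */
        float e = (i < n - 1) ? a[i * n + i + 1] : 0.0f;  /* a_{i,i+1}, absent in last row */
        c[i * n + i] = d * d + e * e;                     /* c_{i,i}                     */
        if (i < n - 1) {
            float t = e * a[(i + 1) * n + (i + 1)];       /* c_{i,i+1}                   */
            c[i * n + i + 1] = t;
            c[(i + 1) * n + i] = t;                       /* symmetric counterpart       */
        }
    }
}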
Another special example is the multiplication of a matrix by a Givens matrix, in the form of (A14), to perform a Givens rotation. Let us call $G$ the Givens matrix, to avoid confusion with indices, and let us restrict ourselves for simplicity to the case of left multiplication, as in $C = GA$. If we initially set $C = A$, it is clear from the definition of $G$ that only rows $p$ and $q$ of $C$ are affected by the multiplication. Moreover, only the elements at rows $p$ and $q$ of a given column of $A$ are used in the computation (Figure 5). Therefore, the only elements of $C$ that need to be updated are those at rows $p$ and $q$, and their values are:

$$c_{pj} = a_{pj}\cos\theta + a_{qj}\sin\theta, \qquad c_{qj} = -a_{pj}\sin\theta + a_{qj}\cos\theta$$

for every column $j$, where $\theta$ is the rotation angle. A similar formula holds for right-side multiplication.
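A sketch of such an in-place left rotation is shown below (the sign convention matches the formula above and may differ from the one in (A14); names are illustrative):

/* Apply a left Givens rotation to rows p and q of a row-major matrix A with n
 * columns, overwriting A, with c = cos(theta) and s = sin(theta). */
void givens_rotate_rows(float *a, int n, int p, int q, float c, float s)
{
    for (int j = 0; j < n; j++) {
        float apj = a[p * n + j];
        float aqj = a[q * n + j];
        a[p * n + j] =  c * apj + s * aqj;
        a[q * n + j] = -s * apj + c * aqj;
    }
}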
The complexity of the previous computations, corresponding to matrix products for different special cases of input matrices, can be compared in terms of the number of scalar multiplications as a function of the input matrix size. The results are shown in Table 3, together with the code size of the corresponding software routines.
The impact of such optimizations on computation speed can be measured, in particular for those deriving from (15) and (16) and from the symmetry property (the Givens rotation is never actually implemented as a matrix multiplication, since its matrix is expressly constructed to modify only a few elements of the result). Table 4 shows the speed increase for the set of biggest matrices when using the two algorithms for which these optimizations are most relevant.
Besides optimizing the mathematical routines, it is important to avoid the problems arising from the limited power and precision of the microcontroller’s FPU. In the case of operations that can be carried out in different ways, choosing the right procedure can make a substantial difference. For example, the assembly multiplication instruction (VMUL.F32) takes 1 clock cycle to execute, while the division (VDIV.F32) takes 14 cycles (https://developer.arm.com/docs/ddi0439/b/floating-point-unit/fpu-functional-description/fpu-instruction-set). In some cases, the compiler can automatically substitute an expensive operation with a cheaper one during the optimization phase of compilation, for example when dividing by a constant.
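As a small illustrative example (not from the original code): when the divisor is a constant whose reciprocal is exactly representable, such as a power of two, the compiler can safely replace the division with a multiplication, turning a VDIV.F32 into a VMUL.F32.

/* The two functions below are mathematically identical; with optimization enabled,
 * GCC can emit a multiplication by 0.125f for both, since 1/8 is exact in binary. */
float scale_div(float x) { return x / 8.0f; }
float scale_mul(float x) { return x * 0.125f; }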
Another problem arising from the limits of the FPU is the loss of precision in certain operations in the 32-bit domain. For example, a recurring problem in the given algorithms is computing quantities of the form $\sqrt{1 - x}$. This kind of operation causes the loss of many significant bits of the original value of $x$ when $x \ll 1$, because the subtraction $1 - x$ cannot retain them. It must be verified experimentally when this is tolerable and when it is not; in some cases the loss of precision worsens the final accuracy by an order of magnitude. A possible solution is switching temporarily to 64-bit precision, then converting back to single precision once the sensitive computation is done. Of course, this considerably increases the execution time, since software libraries are used instead of the hardware FPU. A better solution is applying the logarithm to the value to be computed, performing the intermediate computations in the logarithm domain, and finally applying exponentiation. In this case $\log\sqrt{1 - x} = \frac{1}{2}\log(1 - x)$, which can take advantage of a special function of the C mathematical library, called “log1pf”, that is optimized to compute $\log(1 + x)$ with high accuracy even if the value of $x$ is near zero. Tests show that using the logarithm and exponentiation software libraries while remaining in the 32-bit domain is faster and gives similar or better results than computing the square root in software in double precision.
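The following sketch illustrates the logarithm-domain approach under the assumption that the quantity of interest has the form $\sqrt{1 - x}$ with $x \ll 1$ (the exact expression depends on the algorithm step, and the function name is illustrative):

#include <math.h>

/* Compute sqrt(1 - x) for small x without explicitly forming 1 - x:
 * log1pf(-x) evaluates log(1 - x) accurately even when x is near zero,
 * and further intermediate computations can stay in the log domain
 * before expf() is finally applied. */
float sqrt_one_minus_x(float x)
{
    float log_result = 0.5f * log1pf(-x);   /* logarithm of the desired value */
    return expf(log_result);
}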
Figure 6 shows the timings of the full SVD algorithms implemented on the microcontroller for the three sets of matrices with different row/column ratios, with respect to the total matrix size (number of rows times number of columns). The same results are also reported in Table 5, but with the three matrix sets listed separately.
From the experimental results, it can be seen that the execution time of the algorithms depends roughly on the total matrix size, but with irregularities suggesting that other factors, such as the matrix rank, also affect the total time.
On a comparative basis, the one-sided Jacobi rotation algorithm gives the lowest execution time of all the algorithms.
Table 6 reports the accuracy of the same tests, implemented on the microcontroller, with respect to the MATLAB built-in SVD function. As in Section 3.1, the errors reported are computed as the average of the relative errors of matching pairs of singular values, as computed by the MATLAB built-in function and by the given routines. The accuracy of the Cortex-M4F implementation is significantly lower than that of the equivalent MATLAB code; this is due to the lower precision (32 bits) of the Cortex-M4F hardware floating-point unit. In this case, both the one-sided Jacobi rotation and the divide-and-conquer algorithms achieve better accuracy than the others.
Finally, Table 7 reports the energy consumption of the tests for one matrix set, measured by sampling the voltage and current with an INA226 from Texas Instruments (https://www.ti.com/product/INA226). The INA226 is a current and voltage monitor with an I²C interface, with a maximum gain error of 0.1%, a maximum input offset of 10 µV, and 16-bit resolution. A 0.1 Ω shunt resistor has been used in series with the microcontroller power supply, and the data have been acquired through an external single-board computer.
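For completeness, the energy figure can be obtained from the sampled data by numerically integrating the instantaneous power over the run; a minimal sketch of such a computation is shown below (the sample arrays and names are illustrative, not part of the original acquisition software).

/* Energy in joules from n voltage/current samples taken at a fixed period dt (s),
 * using trapezoidal integration of the instantaneous power p = v * i. */
float energy_joules(const float *v, const float *i, int n, float dt)
{
    float e = 0.0f;
    for (int k = 1; k < n; k++) {
        float p_prev = v[k - 1] * i[k - 1];
        float p_curr = v[k] * i[k];
        e += 0.5f * (p_prev + p_curr) * dt;
    }
    return e;
}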
The results are consistent with the execution times. As a matter of comparison, Figure 7 shows side by side the current consumption of the five algorithms for one of the matrices. The one-sided Jacobi rotation algorithm, besides being faster, clearly has the lowest average current consumption. It is worth noting that the other algorithms exhibit the same pattern at the beginning, corresponding to the Householder bidiagonalization step. This step therefore has a significant impact, in both time and energy, on the algorithms that need it.