Entropy
  • Article
  • Open Access

17 August 2024

Singular-Value-Decomposition-Based Matrix Surgery

School of Computing, The University of Buckingham, Buckingham MK18 1EG, UK
Author to whom correspondence should be addressed.

Abstract

This paper is motivated by the need to stabilise the impact of deep learning (DL) training for medical image analysis on the conditioning of convolution filters in relation to model overfitting and robustness. We present a simple strategy to reduce square matrix condition numbers and investigate its effect on the spatial distributions of point clouds of well- and ill-conditioned matrices. For a square matrix, the SVD surgery strategy works by: (1) computing its singular value decomposition (SVD), (2) changing a few of the smaller singular values relative to the largest one, and (3) reconstructing the matrix by reverse SVD. Applying SVD surgery on CNN convolution filters during training acts as spectral regularisation of the DL model without requiring the learning of extra parameters. The fact that the closer a matrix is to the set of non-invertible matrices, the higher its condition number suggests that the spatial distributions of square matrices and those of their inverses are correlated with their condition number distributions. We examine this assertion empirically by showing that applying various versions of SVD surgery on point clouds of matrices brings their persistence diagrams (PDs) closer to those of the point clouds of their inverses.

1. Introduction

Despite the remarkable success and advancements of deep learning (DL) models in computer vision tasks, there are serious obstacles to the deployment of AI in different domains, related to the challenge of developing deep neural networks that are both robust and generalise well beyond the training data [1]. Accurate and stable numerical algorithms play a significant role in creating robust and reliable computational models [2]. Numerical instability in DL models stems partially from the use of a large number of parameters/hyperparameters and from data and computations that suffer from floating-point errors and inaccuracies. In the case of convolutional neural networks (CNNs), an obvious contributor to the instability of their large volume of weights is the repeated action of backpropagation algorithms, which control the gradient descent updates that fit the model's performance to the different batches of training samples. This paper is concerned with the empirical estimation of training-induced fluctuations in the condition numbers of various CNN weight matrices as a potential source of instability at convolutional layers and their negative effects on overall model performance. We shall propose a spectral-based approach to reduce and control these undesirable fluctuations.
The condition number $\kappa(A)$ of a square $n \times n$ matrix $A$, considered as a linear transformation on $\mathbb{R}^n$, measures the sensitivity of computing its action to perturbations of the input data and to round-off errors; here, the norm of $A$ is defined as $\sup \|Ax\|/\|x\|$ over the set of nonzero $x$. The condition number depends on how much the calculation of the inverse suffers from underflow (i.e., how close $\det(A)$ is to 0). Stable action of $A$ means that small changes in the input data are expected to lead to small changes in the output data, and these changes are bounded in terms of the condition number. Hence, the higher the condition number of $A$ is, the more unstable the action of $A$ is in response to small data perturbations, and such matrices are said to be ill-conditioned. Indeed, the distribution of the condition numbers of a random matrix describes the loss in precision, in terms of the number of digits, as well as the speed of convergence due to ill-conditioning when solving linear systems of equations iteratively [3]. The condition number of a matrix was originally introduced by A. Turing in [4]. Afterwards, the condition numbers of matrices and numerical problems were comprehensively investigated in [5,6,7]. The most common efficient and stable way of computing $\kappa(A)$ is to compute the SVD of $A$ and calculate the ratio of $A$'s largest singular value to its smallest non-zero one [8].
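As a minimal illustration (assuming NumPy; the matrix here is a randomly generated stand-in, not one from the paper's experiments), the $L_2$ condition number can be computed directly from the SVD:

```python
import numpy as np

# Illustrative sketch: the L2 condition number of a matrix equals the
# ratio of its largest to its smallest non-zero singular value.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.01, size=(3, 3))   # a random Gaussian 3x3 matrix

sv = np.linalg.svd(A, compute_uv=False)  # singular values, descending order
kappa_svd = sv[0] / sv[-1]

# np.linalg.cond uses the same ratio for the (default) 2-norm.
print(np.isclose(kappa_svd, np.linalg.cond(A)))  # True
```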
J. W. Demmel, in [6], investigated the upper and lower bounds of the probability distribution of condition numbers of random matrices and showed that the sets of ill-posed problems including matrix inversions, eigenproblems, and polynomial zero finding all have a common algebraic and geometric structure. In particular, Demmel showed that in the case of matrix inversion, the further away a matrix is from the set of noninvertible matrices, the smaller is its condition number. Accordingly, the spatial distributions of random matrices in their domains are indicators of the distributions of their condition numbers. These results provide clear evidence of the viability of our approach to exploit the tools of topological data analysis (TDA) to investigate the condition number stability of point clouds of random matrices. In general, TDA can be used to capture information about complex topological and geometric structures of point clouds in metric spaces with or without prior knowledge about the data (see [9] for more detail). Since the early 2000s, applied topology has entered a new era exploiting the persistent homology (PH) tool to investigate the global and local shape of high-dimensional datasets. Various vectorisations of persistence diagrams (PDs) generated by the PH tool encode information about both the local geometry and global topology of the clouds of convolution filters of CNN models [10]. Here, we shall attempt to determine the impact of the SVD surgery procedure on the PDs of point clouds of CNNs’ well- and ill-conditioned convolution filters.
Contribution: We introduce a singular-value-decomposition-based matrix surgery (SVD surgery) technique to modify matrix condition numbers that is suitable for stabilising the actions of ill-conditioned convolution filters in point clouds of image datasets. The various versions of our SVD surgery preserve the norm of the input matrix while reducing the norm of its inverse, moving it away from the non-invertible matrices. PH analyses of point clouds of matrices (and those of their inverses) post SVD surgery show that the surgery brings the PDs of point clouds of convolution filters and those of their inverses closer to each other.

2. Background to the Motivating Challenge

The ultimate motivation for this paper is related to specific requirements that arose in our challenging investigations of how to “train an efficient slim convolutional neural network model capable of learning discriminating features of Ultrasound Images (US) or any radiological images for supporting clinical diagnostic decisions”. In particular, the developed model’s predictions are required to be robust against tolerable data perturbation and less prone to overfitting effects when tested on unseen data.
In machine learning and deep learning, vanishing or exploding gradients and poor convergence are generally due to an ill-conditioning problem. The most common approaches to overcome ill-conditioning are regularisation, data normalisation, re-parameterisation, standardisation, and random dropouts. When training a deep CNN with extremely large datasets of “natural” images, the convolution filter weights/entries are randomly initialised, and the entries are changed through an extensive training procedure using many image batches over a number of epochs, at the end of each of which, the back-propagation procedure updates the filter entries for improved performance. The frequent updates of filter entries result in non-negligible to significant fluctuation and instability of their condition numbers, causing sensitivity of the trained CNN models [11,12]. CNN model sensitivity is manifested by overfitting, reduced robustness against noise, and vulnerability to adversarial attacks [13].
Transfer learning is a common approach when developing CNN models for the analysis of US (or other radiological) image datasets, wherein the pretrained filters and other model weights of an existing CNN model (trained on natural images) are used as initialising parameters for retraining. However, condition number instabilities increase in the transfer learning mode when used for small datasets of non-natural images, resulting in suboptimal performance and the model suffering from overfitting.

4. Topological Data Analysis

In this section, we briefly introduce persistent homology preliminaries and describe the point cloud settings of randomly generated matrices to investigate their topological behaviours.
Persistent homology of point clouds: Persistent homology is a computational tool of TDA that encapsulates the spatial distribution of point clouds of data records sampled from metric spaces by recording the topological features of a gradually triangulated shape, built by connecting pairs of data points according to an increasing sequence of distance/similarity thresholds. For a point cloud $X$ and a list $\{\alpha_i\}_{i=0}^{m}$ of increasing thresholds, the shape $S(X)$ generated by this TDA process is a sequence $\{S(X)_i\}_{i=0}^{m}$ of simplicial complexes ordered by inclusion. The Vietoris–Rips simplicial complex (VR) is the most commonly used approach to construct $S(X)$ due to its simplicity, and Ripser [40] is used to construct the VR complex. The sequence of distance thresholds is referred to as a filtration of $S(X)$. The topological features of $S(X)$ consist of the numbers of holes or voids of different dimensions, known as the Betti numbers, in each constituent of $\{S(X)_i\}_{i=0}^{m}$. For $j \geq 0$, the $j$-th Betti number $B_j(S(X)_i)$ is obtained, respectively, by counting $B_0$ = #(connected components), $B_1$ = #(empty loops with more than three edges), $B_2$ = #(3D cavities bounded by more than four faces), etc. Note that $B_j(S_i(X))$ is the set of generators of the $j$-th singular homology of the simplicial complex $S_i(X)$. The TDA analysis of $X$ with respect to a filtration $\{\alpha_i\}_{i=0}^{m}$ is based on the persistency of each element of $B_j(S(X)_i)$ as $i \to m$. Here, the persistency of each element is defined as the difference between its birth (first appearance) and its death (disappearance). It is customary to represent $B_j(S(X)_i)$ visually as a vertically stacked set of barcodes, with each element having a horizontal straight line joining its birth to its death. For more detailed and rigorous descriptions, see [41,42,43]. For simplicity, the barcode set and the PD of $B_j(S_i(X))$ are referred to as $H_j$.
Analysis of the resulting PH barcodes of point clouds in any dimension is provided by the persistence diagram (PD), formed by a multi-set of points in the first quadrant of the plane ($x = \text{birth}$, $y = \text{death}$) above or on the line $y = x$. Each marked point in the PD corresponds to a generator of the persistent homology group of the given dimension and is represented by a pair of coordinates $(\text{birth}, \text{death})$. To illustrate these visual representations of PH information, we created a point cloud of 1500 points sampled randomly on the surface of a torus:
$T = \{(x, y, z) \in \mathbb{R}^3 : (\sqrt{x^2 + y^2} - a)^2 + z^2 = b^2\}$
Figure 1 and Figure 2 below display this point cloud together with the barcodes and PD representation of its PH in both dimensions. The two long 1-dimensional persisting barcodes represent the two empty discs whose Cartesian product generates the torus. The persistency lengths of these two holes depend on the radii $(a, b)$ of the generating circles; in this case, $a = 2b$. The persistency lengths of the set of shorter barcodes are inversely related to the point cloud size. Noisy sampling will only have an effect on the shorter barcodes.
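The torus point cloud above can be generated as follows (a sketch with assumed radii $a = 2$, $b = 1$; the two angles are sampled uniformly, which is not area-uniform on the surface but suffices for illustration):

```python
import numpy as np

# Sample 1500 points on the torus (sqrt(x^2 + y^2) - a)^2 + z^2 = b^2.
rng = np.random.default_rng(1)
a, b, n = 2.0, 1.0, 1500
theta = rng.uniform(0, 2 * np.pi, n)  # angle around the central circle
phi = rng.uniform(0, 2 * np.pi, n)    # angle around the tube

x = (a + b * np.cos(phi)) * np.cos(theta)
y = (a + b * np.cos(phi)) * np.sin(theta)
z = b * np.sin(phi)
cloud = np.column_stack([x, y, z])

# Every point satisfies the torus equation up to floating-point error.
residual = (np.sqrt(x**2 + y**2) - a)**2 + z**2 - b**2
print(np.abs(residual).max() < 1e-9)  # True
```

A library such as Ripser can then be run on `cloud` to produce the barcodes and PDs shown in the figures.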
Figure 1. An illustration of a point cloud: (a) points from a torus, (b,c) connecting nearby points up to the distances $d = 0.1$ and $0.2$, respectively.
Figure 2. The topological representation of the torus point cloud as persistence barcodes and diagram.
Demmel’s general assertion that the further away a matrix is from the set of non-invertible matrices, the smaller is its condition number [6] implies that the distribution of condition numbers of a point cloud of filters is linked to its topological profile as well as that of the point cloud of their inverses. In relation to our motivating application, the more ill-conditioned the convolutional filter is, the closer it is to being non-invertible, resulting in unstable feature learning. Accordingly, the success of condition-number-reducing matrix surgery can be indirectly inferred by its ability to reduce the differences between the topological profiles (expressed by PDs) of point clouds of filters and those of their inverses. We shall first compare the PDs of point clouds of well-conditioned matrices and ill-conditioned ones, and we do the same for the PDs of their respective inverse point clouds.
Determining the topological profiles of point clouds using visual assessments of the corresponding point clouds’ persistent barcodes/diagrams is subjective and cumbersome. A more quantitatively informative way of interpreting the visual display of PBs and PDs can be obtained by constructing histograms of barcode persistency records in terms of uniform binning of birth data. Bottleneck and Wasserstein distances provide an easy quantitative comparison approach but may not fully explain the differences between the structures of PDs of different point clouds. In recent years, several feature vectorisations of PDs have been proposed that can be used to formulate numerical measures to distinguish topological profiles of different point clouds. The easiest scheme to interpret is the statistical vectorisation of persistent barcode modules [44]. Whenever reasonable, we shall complement the visual display of PDs with an appropriate barcode binning histogram of barcodes’ persistency, alongside computing the bottleneck and Wasserstein distances using the GUDHI library [45] to compare the topological profiles of point clouds of matrices.
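A persistency binning of the kind used in the tables below can be sketched as follows (the function name and binning choices are illustrative, not the paper's exact vectorisation; finite death times are assumed):

```python
import numpy as np

def persistency_binning(diagram, n_bins=5):
    """Bin a persistence diagram's death times into uniform bins and
    count how many barcodes fall into each bin (a simple PD summary)."""
    diagram = np.asarray(diagram, dtype=float)
    deaths = diagram[:, 1]
    counts, edges = np.histogram(deaths, bins=n_bins)
    return counts, edges

# Toy diagram: (birth, death) pairs of four barcodes.
pd0 = [(0.0, 0.1), (0.0, 0.12), (0.0, 0.4), (0.0, 0.9)]
counts, edges = persistency_binning(pd0, n_bins=4)
print(counts.sum())  # 4: every barcode lands in exactly one bin
```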
To illustrate the above process, we generated a set of $10^4$ random Gaussian $3 \times 3$ filter matrices sorted in ascending order of their condition numbers, and we created two point clouds: (1) $X_1$, the 64 matrices with the lowest condition numbers, and (2) $X_2$, the 64 matrices with the highest condition numbers. $X_1$ is well-conditioned, with condition numbers in the range [1.19376, 1.67], while $X_2$ is highly ill-conditioned, with condition numbers in the range [621.3677, 10,256.2265]. Below, we display the PDs in both dimensions of $X_1$, $X_2$, and their inverse point clouds in Figure 3.
Figure 3. Persistence diagrams of point clouds representing well-conditioned and ill-conditioned matrices and their inverses.
In dimension zero, there are marginal differences between the connected component persistency of $X_1$ and that of $X_1^{-1}$. In contrast, considerable differences can be found between the persistence of the connected components of $X_2$ and that of $X_2^{-1}$. In dimension one, the differences between the hole persistency of $X_1$ and that of $X_1^{-1}$ are again marginal. However, these differences are considerably more visible between the hole persistency of $X_2$ and that of $X_2^{-1}$. One easy observation in both inverse point clouds, as opposed to the original ones, is the early appearance of a hole that dies almost immediately, lying very near to the line death = birth.
A more informative comparison between the various PDs can be discerned by examining Table 1 below, which displays the persistency-death-based binning of the various PDs. Note that in all cases, there are 64 connected components born at time 0. The pattern and timing of death (i.e., merging) of connected components in the well-conditioned point clouds $X_1$ and $X_1^{-1}$ are nearly similar; however, in the case of the ill-conditioned point clouds, most connected components of $X_2^{-1}$ merge much earlier than those of $X_2$.
Table 1. The persistency binning of well-conditioned and ill-conditioned point cloud PDs.
The above results are analogous to Demmel's result in that the well-conditioned point cloud exhibits a topological profile similar to that of its inverse point cloud, while the topological profile of the ill-conditioned point cloud differs significantly from that of its inverse. In order to estimate the proximity of the PDs of the well- and ill-conditioned point clouds to those of their inverses, we computed both the bottleneck and the Wasserstein distances. The results are included in Table 2 below, which also includes these distances between other pairs of PDs. Again, both distance functions confirm the close proximity of the PD of $X_1$ to that of $X_1^{-1}$, in comparison to the significantly bigger distances between the PDs of $X_2$ and $X_2^{-1}$.
Table 2. Comparison of bottleneck and Wasserstein distances.
Next, we introduce our matrix surgery strategy and the effects of various implementations on point clouds of matrices, with emphasis on the relations between the PDs of the output matrices and those of their inverse point clouds.

5. Matrix Surgery

In this section, we describe the research framework to perform matrix surgery that aims to reduce and control the condition numbers of matrices. Suppose the matrix $A \in \mathbb{R}^{n \times n}$ is non-singular, with entries drawn from a random Gaussian or uniform distribution. The condition number of $A$ is defined as:
$\kappa(A) = \|A\| \, \|A^{-1}\|$
where $\|\cdot\|$ is a matrix norm. In this investigation, we focus on the Euclidean norm ($L_2$-norm), for which $\kappa(A)$ can be expressed as:
$\kappa(A) = \sigma_1 / \sigma_n$
where $\sigma_1$ and $\sigma_n$ are the largest and smallest singular values of $A$, respectively. A matrix is said to be ill-conditioned if small changes in the input can result in big changes in the output, and well-conditioned if small changes in the input result in correspondingly small changes in the output. Equivalently, a matrix with a low condition number (close to one) is said to be well-conditioned, while a matrix with a high condition number is said to be ill-conditioned; the ideal condition number, attained by orthogonal matrices, is one. Next, we describe our simple approach of modifying the singular value matrix of the SVD, since the condition number is determined by the largest and smallest singular values. We recall that the singular value decomposition of a square matrix $A \in \mathbb{R}^{n \times n}$ is defined by:
$A = U \Sigma V^T$
where $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{n \times n}$ are the orthogonal (unitary) matrices of left and right singular vectors, and the diagonal matrix $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) \in \mathbb{R}^{n \times n}$ holds the singular values, with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$. SVD surgery, described below, is equally applicable to rectangular matrices.

5.1. SVD-Based Surgery

In the broad context, SVD surgery refers to the process of transforming matrices to improve their condition numbers. In particular, it targets matrices that are far from having orthogonality/orthonormality characteristics and replaces them with improved well-conditioned matrices by deploying their left and right orthogonal singular vectors along with a new singular value diagonal matrix. SVD surgery can be realised in a variety of ways according to the expected properties of the output matrices to fit the use case. Given any matrix $A$, SVD surgery on $A$ outputs a new matrix of the same size by decomposing $A$, replacing some of its singular values while preserving their descending order, and reconstructing with the original singular vectors.
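The decompose–modify–reconstruct process can be sketched generically as follows (assuming NumPy; the replacement rule shown, copying $\sigma_2$ into $\sigma_3$, is just one simple choice and not the paper's exact scheme):

```python
import numpy as np

def svd_surgery(A, new_sigmas):
    """Generic SVD surgery sketch: decompose A, replace its singular
    values with a monotone non-increasing sequence, and reconstruct."""
    U, s, Vt = np.linalg.svd(A)
    s_new = np.asarray(new_sigmas, dtype=float)
    assert np.all(np.diff(s_new) <= 0), "monotonicity must be preserved"
    return U @ np.diag(s_new) @ Vt

rng = np.random.default_rng(2)
A = rng.normal(0.0, 0.01, size=(3, 3))
s = np.linalg.svd(A, compute_uv=False)

# e.g. replace the smallest singular value with the middle one
A_new = svd_surgery(A, [s[0], s[1], s[1]])
print(np.linalg.cond(A_new) <= np.linalg.cond(A) + 1e-9)  # True
```

Note that the largest singular value, and hence the $L_2$ norm of the matrix, is left unchanged by this rule.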
Changes to the singular values amount to rescaling the effect of the matrix action along the left and right orthogonal vectors of U and V, and the monotonicity requirement ensures reasonable control of the various rescalings. The orthogonal regularisation scheme of [22] and the SVB scheme of [24] do reduce the condition numbers when applied for improved control of overfitting of DL models trained on natural images, but both make changes to all the singular values and cannot guarantee success for the application of DL training of US image datasets. Furthermore, the SVB scheme is a rather strict form of SVD-based matrix surgery for controlling the condition numbers, but no analysis is conducted on the norms of these matrices or their inverses.
Our strategy for using SVD surgery is specifically designed for the motivating application and aims to reduce extremely high condition numbers, preserve the norm of the input filters, and reduce the norm of their inverses, moving them away from non-invertible matrices. Replacing all diagonal singular value entries with the largest singular value would produce an orthogonal matrix with a condition number equal to one, but this approach ignores or reduces the effect of significant variations in the training data along some of the singular vectors, leading to less effective learning. Instead, we propose a less drastic, application-dependent strategy for altering singular values. In general, our approach involves scaling all singular values to be less than $\sigma_1$ in order to minimise $\sigma_1 / \sigma_n$ while maintaining their monotonicity property. To reduce the condition number of an ill-conditioned matrix, it may only be necessary to adjust the relatively low singular values to bring them closer to $\sigma_1$. There are numerous methods for implementing such strategies; here, we follow a less drastic scheme in which, for a chosen index $j$, each singular value below $\sigma_j$ is replaced by a convex linear combination of the larger singular values.
The index $j$ can be chosen so that $\sigma_j$ is very close to $\sigma_1$, and the linear combination parameters can be customised based on the application, possibly determined empirically. In extreme cases, this strategy allows for setting $\sigma_k = \sigma_j$ for all $k > j$. This is rather timid in comparison to the orthogonal regularisation strategies, while still preserving the monotonicity of the singular values. Regarding our motivating application, parameter choices would vary depending on the layer, but the linear combination parameters should not significantly rescale the training dataset features along the singular vectors. While SVD surgery can be applied to inverse matrices, employing the same replacement strategy and reconstruction may not necessarily result in a significant reduction in the condition number.
Example: Suppose B is a square matrix with n = 3 that is drawn from a normal distribution with mean μ = 0 and standard deviation σ = 0.01 as follows:
$B = \begin{pmatrix} 0.01960899999 & 0.02908008031 & 0.01058180258 \\ 0.00197698226 & 0.00825218894 & 0.00468615581 \\ 0.01207845485 & 0.01378971978 & 0.00272469409 \end{pmatrix}$
The singular values of $B$ are $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3)$, and it is possible to modify and reconstruct $\tilde{B}_1$, $\tilde{B}_2$, and $\tilde{B}_3$ by replacing one and/or two singular values such that $\tilde{\Sigma}_1 = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_2)$, $\tilde{\Sigma}_2 = \mathrm{diag}(\sigma_1, \tilde{\sigma}_2, \tilde{\sigma}_3)$, and $\tilde{\Sigma}_3 = \mathrm{diag}(\sigma_1, \sigma_1, \sigma_1)$, respectively. The new singular values in $\tilde{\Sigma}_2$ are convex linear combinations such that $\tilde{\sigma}_2 = 2\sigma_1/3 + \sigma_2/3$ and $\tilde{\sigma}_3 = \tilde{\sigma}_2$. After reconstruction, the condition numbers of $\tilde{B}_1$, $\tilde{B}_2$, and $\tilde{B}_3$, measured in the Euclidean norm, are significantly lower than that of the original matrix, as shown in Table 3.
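The worked example can be checked numerically (entries as printed above; any digits or signs lost in extraction would change the exact values but not the behaviour):

```python
import numpy as np

# Matrix B as printed above.
B = np.array([[0.01960899999, 0.02908008031, 0.01058180258],
              [0.00197698226, 0.00825218894, 0.00468615581],
              [0.01207845485, 0.01378971978, 0.00272469409]])

U, s, Vt = np.linalg.svd(B)
s1, s2, s3 = s

variants = {
    "B1": [s1, s2, s2],                        # replace sigma3 with sigma2
    "B2": [s1, 2*s1/3 + s2/3, 2*s1/3 + s2/3],  # convex combination
    "B3": [s1, s1, s1],                        # all equal: kappa = 1
}
for name, sv in variants.items():
    B_new = U @ np.diag(sv) @ Vt
    print(name, np.linalg.cond(B_new) <= np.linalg.cond(B) + 1e-9)
```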
Table 3. Euclidean norms and condition numbers before and after matrix surgery.

5.2. Effects of SVD Surgery on Large Datasets of Convolution Filters and Their Inverses

In training CNN models, it is customary to initialise the convolution filters of each layer using random Gaussian matrices of sizes that are layer- and CNN-architecture-dependent. Here, we shall focus on the effect of surgery on $3 \times 3$ Gaussian matrices. To illustrate the effect of SVD surgery on point clouds of convolution filters, we generated a set of $10^4$ matrices of size $3 \times 3$ drawn from the Gaussian distribution $N(0, 0.01)$. We use the norm of the original matrix, the norm of the inverse, and the condition number to illustrate the effects of SVD surgery and observe the distribution of these parameters per set. Figure 4 below shows a clear reduction in the condition numbers of modified matrices compared to the original ones. The reduction in the condition numbers is a result of reducing the norms of the inverses of the matrices (see Figure 5). The minimum and maximum condition numbers for the original set are approximately 1.2 and 10,256, respectively. After replacing only the smallest singular value $\sigma_3$ with $\sigma_2$ and reconstructing, the new minimum and maximum values are 1.006 and 17.14, respectively.
Figure 4. Distribution of 3 × 3 matrices pre- and post-surgery: (a,d) original matrix norms, (b,e) inverse matrix norms, and (c,f) matrix condition numbers.
Figure 5. Illustration of 3 × 3 random Gaussian matrices pre- and post-matrix surgery, displaying norms, inverse norms, and logarithmic condition numbers: (a) σ 3 replaced with σ 2 and (b,c) σ 2 and σ 3 replaced with a new linear combination of σ 1 and σ 2 .
Figure 4 shows a significant change in the distribution of the norms of the inverses of 3 × 3 matrices post-surgery, which is consequently reflected in their condition number distribution. The use of a linear combination formula helps keep the range of condition numbers below a certain threshold depending on the range of singular values. For instance, 3D illustrations in Figure 5 show a significant reduction in the condition number by keeping the ranges below 3 in (b) and 2 in (c), where σ 2 and σ 3 are replaced with σ 1 / 3 + 2 σ 2 / 3 and ( σ 1 + σ 2 ) / 2 , respectively. The new minimum and maximum condition number values for both sets after matrix surgery are [ 1.004 , 2.687 ] and [ 1.003 , 1.88 ] , respectively.
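The experiment above can be reproduced in outline as follows (a sketch assuming NumPy; the random seed, and hence the exact extrema, will differ from those reported in the figures):

```python
import numpy as np

# 10^4 Gaussian 3x3 matrices; surgery replaces sigma3 with sigma2.
rng = np.random.default_rng(3)
mats = rng.normal(0.0, 0.01, size=(10_000, 3, 3))

U, s, Vt = np.linalg.svd(mats)            # batched SVD
cond_before = s[:, 0] / s[:, 2]

s_new = s.copy()
s_new[:, 2] = s_new[:, 1]                 # replace smallest with middle value
mats_new = U @ (s_new[:, :, None] * Vt)   # batched reconstruction

s_after = np.linalg.svd(mats_new, compute_uv=False)
cond_after = s_after[:, 0] / s_after[:, 2]

print(cond_after.max() <= cond_before.max())  # True
```

The maximum condition number after surgery is bounded by the largest $\sigma_1/\sigma_2$ ratio in the set, which is why the post-surgery range collapses so dramatically.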

5.3. Effects of SVD Surgery on PDs of Point Clouds of Matrices

For the motivating application, we need to study the impact of SVD surgery on point clouds of matrices (e.g., layered sets of convolution filters) rather than single matrices. Controlling the condition numbers of the layered point clouds of CNN filters (in addition to the fully connected layer weight matrices) during training affects the model’s learning and performance. The implementation of SVD surgery can be integrated into customised CNN models as a filter regulariser for the analysis of natural and US image datasets. It can be applied at filter initialisation when training from scratch, on pretrained filters during transfer learning, and on filters modified during training by backpropagation after every batch/epoch.
In this section, we investigate the topological behaviour of a set of matrices represented as a point cloud using persistent homology tools, as discussed in Section 4. For $n \times n$ filters of any size, we first generate a set of random Gaussian matrices. By normalising their entries and flattening them, we obtain a point cloud residing on the unit $(n \times n - 1)$-sphere $S^{n \times n - 1}$. Subsequently, we construct a second point cloud in $S^{n \times n - 1}$ by computing the inverse matrices, normalising their entries, and flattening. Here, we only illustrate this process for a specific point cloud of $3 \times 3$ matrices for two different linear combinations of the two lower singular values. The general case of larger-size filters is discussed in the first author's PhD thesis [46].
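The point-cloud construction just described can be sketched as follows (function names hypothetical; 500 matrices are used here for brevity):

```python
import numpy as np

# Flatten each n x n matrix to a vector in R^(n*n) and normalise it onto
# the unit sphere S^(n*n - 1); repeat for the inverse matrices.
rng = np.random.default_rng(4)
n = 3
mats = rng.normal(0.0, 0.01, size=(500, n, n))

def to_sphere(ms):
    flat = ms.reshape(len(ms), -1)  # flatten entries row by row
    return flat / np.linalg.norm(flat, axis=1, keepdims=True)

cloud = to_sphere(mats)                  # point cloud of the matrices
cloud_inv = to_sphere(np.linalg.inv(mats))  # point cloud of the inverses

print(np.allclose(np.linalg.norm(cloud, axis=1), 1.0))  # True
```

Both clouds can then be fed to a PH library (e.g., Ripser) to compare their persistence diagrams.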
Figure 6 below shows the $H_0$ and $H_1$ persistence diagrams for the point clouds (originals and inverses), together with those of the post-matrix-surgery point clouds with respect to two replacements: (1) replacing $\sigma_3$ with $\sigma_2$ and (2) replacing both $\sigma_2$ and $\sigma_3$ with $\sigma_1$ (i.e., $\kappa(A) = 1$). The first row corresponds to the effect of SVD surgery on the PD of the original point cloud, while the second row corresponds to the inverse point cloud.
Figure 6. Persistence diagrams of the point clouds $A$ and $A^{-1}$ before and after SVD-based surgery.
The original point cloud $\tilde{A}$ includes matrices with extremely wide-ranging conditioning, which means their proximity to the non-invertible set of matrices is also wide-ranging. This accounts for the observable visual differences between the PDs of $\tilde{A}$ and those of $\tilde{A}^{-1}$ in both dimensions. The PDs of $\tilde{A}_1$ and $\tilde{A}_1^{-1}$ are not significantly dissimilar in dimension 0, but in dimension 1, we can notice that many holes in $\tilde{A}_1^{-1}$ have longer lifespans, while many others are born later than the time at which all holes in $\tilde{A}_1$ vanish. In fact, in dimension 0, the dissimilarities appear as a result of many connected components in $\tilde{A}_1^{-1}$ living longer than those in $\tilde{A}_1$. The PDs of $\tilde{A}_2$ and $\tilde{A}_2^{-1}$ are visually equivalent in both dimensions, reflecting the fact that this surgery produces optimally well-conditioned orthonormal matrices (i.e., the inverse matrices are simply the transposes of the original ones). This means that the strict surgery that produces the $\tilde{A}_2$ point cloud is useful for applications that require orthogonality, whereas the more relaxed surgery is beneficial for applications where condition numbers within a reasonable range of values are acceptable as long as the matrices are not ill-conditioned.
For a more informative description of these observations, we computed the death-based binning table, shown below as Table 4. The results confirm that the topological profiles (represented by their PDs) of $\tilde{A}$ and $\tilde{A}^{-1}$ are indeed different in both dimensions. There is less quantitative similarity in dimension 0 between the PDs of $\tilde{A}_1$ and $\tilde{A}_1^{-1}$ than suggested by visual examination. In dimension 1, the visual observations are to some extent supported by the number of holes in the various bins. The table also confirms the exact similarity in both dimensions of the PDs of $\tilde{A}_2$ and $\tilde{A}_2^{-1}$, as reported by visual examination.
Table 4. The persistency binning of the various PDs before and after SVD surgery.
Again, we estimated the proximities of the PDs of the various related pairs of point clouds of matrices and their inverses in terms of the bottleneck and Wasserstein distance functions. The results are shown in Table 5 below. The significantly large distances in dimension 0 explain the noted differences between the PD of $\tilde{A}$ and that of $\tilde{A}^{-1}$. In dimension 1, the surprisingly small bottleneck distance between the PD of $\tilde{A}$ and that of $\tilde{A}^{-1}$ indicates that bottleneck distances may not reflect the dissimilarities in visual representations. The distances between the PDs of $\tilde{A}_1$ and $\tilde{A}_1^{-1}$ in both dimensions are reasonably small, except that in dimension 1, the distance increased slightly post the $\tilde{A}_1$ surgery. This may be explained by the observation made earlier that "many holes have longer lifespans, while many others are born later than the time at which all holes in $\tilde{A}_1$ vanish" when visually examining the 1-dimensional PDs. Finally, these distance computations confirm the strict similarity reported above between the PDs of $\tilde{A}_2$ and $\tilde{A}_2^{-1}$.
Table 5. Comparison of bottleneck and Wasserstein distances.

SVD Surgery for the Motivating Application

The need for matrix surgery to improve the condition number arose during our previous investigation [46], which aimed to develop a CNN model for ultrasound breast tumour images that has reduced overfitting and is robust to reasonable noise. During model training, we observed that the condition numbers of a large number of the initialised convolution filters were fluctuating significantly over the different iterations [12]. Having experimented with various linear-combination-based SVD surgery techniques, the work eventually led to a modestly performing customised CNN model with reasonable robustness to tolerable data perturbations and generalisability to unseen data. This was achieved with a carefully selected constant linear combination SVD surgery applied to all convolutional layer filters at (1) initialisation from scratch, (2) pretrained filters, and (3) during training batches and/or epochs.
Our ongoing attempt to improve the previous work for better CNN model performance is based on using more convolution layers and investigating the conditioning of the large non-square weight matrices of the fully connected layers (FCLs). A major obstacle to the training aspects of this work is the selection of appropriate linear-combination-based SVD surgery for different point clouds over a larger range of filter sizes. In our motivating application, as well as in many other tasks, it is specifically desirable to control the condition numbers of filters/matrices within a specific range and with reasonable upper bounds. Such requirements significantly toughen the challenge of finding different linear-combination-based surgery schemes (suitable for various convolutional layers and FCLs) that guarantee maintaining condition numbers within specified ranges.
Many alternatives to linear-combination-based reconditioning SVD surgery may exist. The PH investigations of the last section indicate the need to avoid crude/strong reconditioning algorithms, which risk slowing down learning and/or causing underfitting. Algorithm 1 below gives pseudocode for a simple but efficient SVD surgery strategy, developed more recently, that “reconditions” each convolution filter (as well as the components of the FCL weight matrices) after each training epoch and maintains the condition numbers within a desired range.
Algorithm 1 SVD surgery and condition number threshold.
  • Input: Filter F of size k × k, thresholding constant C
1: Compute the SVD of F: F = U Σ Vᵀ, and let (σ_1, …, σ_k) be the singular values of Σ in descending order.
2: x ← σ_1/C, j ← k                         ▹ Initial threshold
3: while (σ_j < x) ∧ (j > 1) do
4:     σ_j ← x                              ▹ Threshold small singular values
5:     j ← j − 1
6:     x ← σ_j                              ▹ Update threshold for next iteration
7: for i = j to k − 1 do                    ▹ Smoothen the remaining singular values
8:     σ_{i+1} ← (σ_{i+1} + σ_i)/2
9: Reconstruct F using the modified singular values: F ← U Σ Vᵀ
  • Output: Filter F of size k × k
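A direct NumPy rendering of this strategy might look as follows. This is a sketch under a simplified reading in which the threshold stays fixed at σ_1/C (the authors’ exact implementation may differ); it raises trailing singular values to the threshold and then blends each raised value with its larger neighbour so the spectrum decays smoothly rather than plateauing.

```python
import numpy as np

def svd_surgery(F, C):
    """Recondition a square matrix so its condition number is at most C.

    Singular values below sigma_1 / C are raised to that threshold, then
    each raised value is averaged with its larger neighbour to smoothen
    the tail of the spectrum. Matrices already within range are returned
    unchanged (up to floating-point roundoff).
    """
    U, s, Vt = np.linalg.svd(F)           # s is in descending order
    x = s[0] / C                          # lower bound implied by kappa <= C
    j = len(s) - 1
    while j > 0 and s[j] < x:             # threshold small singular values
        s[j] = x
        j -= 1
    for i in range(j, len(s) - 1):        # smoothen the modified tail
        s[i + 1] = (s[i + 1] + s[i]) / 2
    return U @ np.diag(s) @ Vt            # reverse SVD with the new spectrum
```

Because every modified value stays at or above σ_1/C, the reconstructed matrix satisfies κ ≤ C, while the averaging step avoids an exactly flat run of equal singular values.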
Note that the above algorithm does not change any input matrix whose condition number is already in the specified range, and otherwise makes only minimal adjustments to the singular values. We are incorporating this efficient SVD-based reconditioning procedure into the training of specially designed SLIM CNN models for tumour diagnosis from ultrasound images. The results are encouraging, and future publications will cover the implications of such “reconditioning” matrix surgery for the performance of SLIM CNN models and for the topological profiles of the filters’ point clouds during training.
Future work includes (1) assessing the topological profiles of point clouds of matrices (and those of their inverses) in terms of their condition number distributions and (2) quantifying Demmel’s assertion that links condition numbers of matrices to their proximity to non-invertible matrices. For such investigations, the SVD surgery scheme is instrumental in generating sufficiently large point clouds of matrices for any range of condition numbers.
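Point clouds of matrices with a prescribed condition number can also be generated by fixing the spectrum directly. One simple recipe (an illustration of the idea, not necessarily the generator used in this work) takes the orthogonal factors of a random Gaussian matrix and replaces its singular values with a geometric decay from κ to 1:

```python
import numpy as np

def random_matrix_with_condition(k, kappa, rng):
    """Random k x k matrix whose 2-norm condition number equals kappa.

    The orthogonal factors come from the SVD of a random Gaussian matrix;
    the spectrum is replaced by k values decaying geometrically from
    kappa down to 1, so sigma_1 / sigma_k == kappa by construction.
    """
    U, _, Vt = np.linalg.svd(rng.standard_normal((k, k)))
    s = np.geomspace(kappa, 1.0, num=k)
    return U @ np.diag(s) @ Vt

# A point cloud of 100 moderately conditioned 3x3 matrices:
rng = np.random.default_rng(42)
cloud = [random_matrix_with_condition(3, 10.0, rng) for _ in range(100)]
```

Sampling κ from a chosen interval instead of fixing it yields point clouds covering any desired range of condition numbers.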

6. Conclusions

We introduced simple SVD-based procedures for matrix surgery to reduce and control the condition number of an n × n matrix by operating on its singular values. Persistent homology analyses of point clouds of matrices and their inverses helped formulate a possible PD version of Demmel’s assertion. Recognising the challenge of using the convex linear combination strategy to stabilise the performance of CNN models, we presented a new, simpler-to-implement matrix reconditioning surgery.

Author Contributions

Conceptualisation, J.G. and S.J.; Methodology, J.G. and S.J.; Investigation and analysis, J.G. and S.J.; Writing, J.G. and S.J.; Visualisation, J.G.; Supervision, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Colbrook, M.J.; Antun, V.; Hansen, A.C. The difficulty of computing stable and accurate neural networks: On the barriers of deep learning and Smale’s 18th problem. Proc. Natl. Acad. Sci. USA 2022, 119, e2107151119.
  2. Higham, N.J. Accuracy and Stability of Numerical Algorithms; SIAM: Philadelphia, PA, USA, 2002.
  3. Edelman, A. Eigenvalues and Condition Numbers. Ph.D. Thesis, MIT, Cambridge, MA, USA, 1989.
  4. Turing, A.M. Rounding-off errors in matrix processes. Q. J. Mech. Appl. Math. 1948, 1, 287–308.
  5. Rice, J.R. A theory of condition. SIAM J. Numer. Anal. 1966, 3, 287–310.
  6. Demmel, J.W. The geometry of ill-conditioning. J. Complex. 1987, 3, 201–229.
  7. Higham, D.J. Condition numbers and their condition numbers. Linear Algebra Its Appl. 1995, 214, 193–213.
  8. Klema, V.; Laub, A. The singular value decomposition: Its computation and some applications. IEEE Trans. Autom. Control 1980, 25, 164–176.
  9. Chazal, F.; Michel, B. An introduction to Topological Data Analysis: Fundamental and practical aspects for data scientists. Front. Artif. Intell. 2017, 4, 667963.
  10. Adams, H.; Moy, M. Topology applied to machine learning: From global to local. Front. Artif. Intell. 2021, 4, 668302.
  11. Ghafuri, J.; Du, H.; Jassim, S. Topological aspects of CNN convolution layers for medical image analysis. In Proceedings of the Mobile Multimedia/Image Processing, Security, and Applications 2020, Online, 27 April–9 May 2020; SPIE: Bellingham, WA, USA, 2020; Volume 11399, pp. 229–240.
  12. Ghafuri, J.; Du, H.; Jassim, S. Sensitivity and stability of pretrained CNN filters. In Proceedings of the Multimodal Image Exploitation and Learning 2021, Online, 12–17 April 2021; SPIE: Bellingham, WA, USA, 2021; Volume 11734, pp. 79–89.
  13. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
  14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  15. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; Volume 9, pp. 249–256.
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
  17. Zhu, C.; Ni, R.; Xu, Z.; Kong, K.; Huang, W.R.; Goldstein, T. GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training. Adv. Neural Inf. Process. Syst. 2021, 20, 16410–16422.
  18. Dauphin, Y.N.; Schoenholz, S. MetaInit: Initializing learning by learning to initialize. Adv. Neural Inf. Process. Syst. 2019, 32, 12645–12657.
  19. Xie, D.; Xiong, J.; Pu, S. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6176–6185.
  20. Mishkin, D.; Matas, J. All you need is a good init. arXiv 2015, arXiv:1511.06422.
  21. Saxe, A.M.; McClelland, J.L.; Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv 2013, arXiv:1312.6120.
  22. Wang, J.; Chen, Y.; Chakraborty, R.; Yu, S.X. Orthogonal Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
  23. Sinha, A.; Singh, M.; Krishnamurthy, B. Neural networks in an adversarial setting and ill-conditioned weight space. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Dublin, Ireland, 10–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11329, pp. 177–190.
  24. Jia, K.; Li, S.; Wen, Y.; Liu, T.; Tao, D. Orthogonal Deep Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1352–1368.
  25. Huang, L.; Liu, X.; Lang, B.; Yu, A.; Wang, Y.; Li, B. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  26. Qiao, S.; Wang, H.; Liu, C.; Shen, W.; Yuille, A. Micro-batch training with batch-channel normalization and weight standardization. arXiv 2019, arXiv:1903.10520.
  27. Salimans, T.; Kingma, D.P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Adv. Neural Inf. Process. Syst. 2016, 29, 901–909.
  28. Huang, L.; Liu, X.; Liu, Y.; Lang, B.; Tao, D. Centered weight normalization in accelerating training of deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2803–2811.
  29. Huang, L.; Liu, L.; Zhu, F.; Wan, D.; Yuan, Z.; Li, B.; Shao, L. Controllable Orthogonalization in Training DNNs. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6428–6437.
  30. Rothwell, E.; Drachman, B. A unified approach to solving ill-conditioned matrix problems. Int. J. Numer. Methods Eng. 1989, 2, 609–620.
  31. Turkeš, R.; Montúfar, G.; Otter, N. On the effectiveness of persistent homology. arXiv 2022, arXiv:2206.10551.
  32. Bruel Gabrielsson, R.; Carlsson, G. Exposition and interpretation of the topology of neural networks. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1069–1076.
  33. Magai, G.; Ayzenberg, A. Topology and geometry of data manifold in deep learning. arXiv 2022, arXiv:2204.08624.
  34. Hofer, C.; Kwitt, R.; Niethammer, M.; Uhl, A. Deep learning with topological signatures. Adv. Neural Inf. Process. Syst. 2017, 30, 1633–1643.
  35. Rieck, B.; Togninalli, M.; Bock, C.; Moor, M.; Horn, M.; Gumbsch, T.; Borgwardt, K. Neural persistence: A complexity measure for deep neural networks using algebraic topology. arXiv 2018, arXiv:1812.09764.
  36. Ebli, S.; Defferrard, M.; Spreemann, G. Simplicial neural networks. arXiv 2020, arXiv:2010.03633.
  37. Hajij, M.; Istvan, K. A topological framework for deep learning. arXiv 2020, arXiv:2008.13697.
  38. Hu, C.S.; Lawson, A.; Chen, J.S.; Chung, Y.M.; Smyth, C.; Yang, S.M. TopoResNet: A Hybrid Deep Learning Architecture and Its Application to Skin Lesion Classification. Mathematics 2021, 9, 2924.
  39. Gonzalez-Diaz, R.; Gutiérrez-Naranjo, M.A.; Paluzo-Hidalgo, E. Topology-based representative datasets to reduce neural network training resources. Neural Comput. Appl. 2022, 34, 14397–14413.
  40. Bauer, U. Ripser: Efficient computation of Vietoris–Rips persistence barcodes. J. Appl. Comput. Topol. 2021, 5, 391–423.
  41. Edelsbrunner, H.; Letscher, D.; Zomorodian, A. Topological persistence and simplification. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, CA, USA, 12–14 November 2000; IEEE: Piscataway, NJ, USA, 2000; pp. 454–463.
  42. Ghrist, R. Barcodes: The persistent topology of data. Bull. Am. Math. Soc. 2008, 45, 61–75.
  43. Otter, N.; Porter, M.A.; Tillmann, U.; Grindrod, P.; Harrington, H.A. A roadmap for the computation of persistent homology. EPJ Data Sci. 2017, 6, 17.
  44. Ali, D.; Asaad, A.; Jimenez, M.J.; Nanda, V.; Paluzo-Hidalgo, E.; Soriano-Trigueros, M. A survey of vectorization methods in topological data analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14069–14080.
  45. The GUDHI Project. GUDHI User and Reference Manual, 3.10.1 ed.; GUDHI Editorial Board, 2024. Available online: https://gudhi.inria.fr/doc/3.10.1/ (accessed on 2 June 2024).
  46. Ghafuri, J.S.Z. Algebraic, Topological, and Geometric Driven Convolutional Neural Networks for Ultrasound Imaging Cancer Diagnosis. Ph.D. Thesis, The University of Buckingham, Buckingham, UK, 2023.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
