Article

Exploring Flatter Loss Landscape Surface via Sharpness-Aware Minimization with Linear Mode Connectivity

School of Computer Science and Engineering, Macau University of Science and Technology, Macau 999078, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(8), 1259; https://doi.org/10.3390/math13081259
Submission received: 10 February 2025 / Revised: 25 March 2025 / Accepted: 9 April 2025 / Published: 11 April 2025

Abstract

The Sharpness-Aware Minimization (SAM) optimizer connects flatness and generalization, suggesting that loss basins with lower sharpness are correlated with better generalization. However, SAM requires manually tuning the open ball radius, which complicates its practical application. To address this, we propose a method inspired by linear connectivity, using two models initialized differently as endpoints to automatically determine the optimal open ball radius. Specifically, we introduce distance regularization between the two models during training, which encourages them to approach each other, thus dynamically adjusting the open ball radius. We design an optimization algorithm called 'Twin Stars Entwined' (TSE), where the stopping condition is defined by the models' linear connectivity, i.e., when they converge to a region of sufficiently low distance. As the models iteratively reduce their distance, they converge to a flatter region of the loss landscape. Our approach complements SAM by dynamically identifying flatter regions and exploring the geometric properties of multiple connected loss basins. Instead of searching for a single large-radius basin, we identify a group of connected basins as potential optimization targets. Experiments conducted across multiple models and in varied noise environments showed that our method achieved a performance on par with state-of-the-art techniques.

1. Introduction

In recent years, deep neural networks have advanced machine learning. Over-parameterization increases fitting capacity and expands the solution space, which raises the risk of overfitting [1]; it also reduces the number of saddle points and increases the number of local minima [2], each with varying generalization ability [3]. Current studies have shown a positive correlation between loss basin flatness and generalization [4,5,6,7,8,9]. Flatness can be described using the Hessian matrix, i.e., the second-order derivative of the loss [10]. However, computing a Hessian matrix during gradient descent in a large network incurs significant computational costs. To improve the efficiency of gradient descent, machine learning typically relies on first-order optimizers, such as stochastic gradient descent (SGD), which, however, lack awareness of the flatness direction [11].
Reconstructing the flatness perception of SGD is a key challenge for improving optimization. The Sharpness-Aware Minimization (SAM) strategy [9] perturbs the gradient to approximate second-order curvature information [12]. Specifically, SAM minimizes the worst-case loss within a perturbation region of radius ρ around the current solution. However, due to the diverse shapes of the loss landscapes across different networks and datasets [13], selecting an appropriate ρ is challenging. A larger ρ may result in the inability to perceive loss basins, as there may not exist a loss basin in the loss landscape that meets the required radius, leading to divergence and higher training losses.
To address this, ASAM [14] (Algorithm 1) normalizes the solution space before setting ρ, aiming to stabilize optimization. GAM [15] further normalizes gradients to mitigate uncertainty, allowing ρ to be set within a controlled range (typically between 0 and 1). However, manual tuning remains necessary. Inspired by the concept of linear mode connectivity [16], we propose an approach that models ρ dynamically using two points, A and B, as the endpoints of a linearly connected path in the solution space. This eliminates the need for manual tuning, making SAM-based optimization more adaptive and robust.
Algorithm 1 ASAM [14]
  • Input: Dataset $S$, loss function $L$, batch size $b$, step size $\eta > 0$, neighborhood size $\rho > 0$, initial model $\theta_0^{init}$
  • Output: Model trained with ASAM
  • Start at $\theta = \theta_0^{init}$
  • repeat
  •    Sample batch $B = \{(x_1, y_1), \ldots, (x_b, y_b)\}$
  •    Compute the gradient $\nabla_\theta L_B(\theta)$ of the batch's training loss
  •    Compute $\hat{\epsilon}(\theta) = \rho \, T_\theta^2 \nabla_\theta L_B(\theta) / \left\| T_\theta \nabla_\theta L_B(\theta) \right\|_2$
  •    Compute the gradient approximation for the SAM objective: $g = \nabla_\theta L_B(\theta) \big|_{\theta + \hat{\epsilon}(\theta)}$
  •    Update weights: $\theta_{t+1} = \theta_t - \eta g$
  •    $t = t + 1$
  • until converged
  • return $\theta_t$
In summary, our heuristic strategy gradually reduces the distance between two points in the solution space, guiding the search toward flatter regions that improve the generalization. By incorporating connectivity awareness into sharpness-aware minimization, we enable the optimization process to target these flatter regions, making the search more effective. Linear mode connectivity improves our understanding of the relationships between minima, allowing the algorithm to converge to connected minima along linear modes, thereby enhancing the optimization. Experimental results confirmed that our method achieved improved convergence and better generalization.
We propose a heuristic training strategy aimed at generating linearly connected mode endpoints [16]. The radius ρ of the SAM method is modeled using two points, A and B, in the solution space, to avoid manually setting the hyperparameter ρ . Specifically, we redefine the SAM open ball using A and B. The midpoint between A and B serves as the center, and the radius ρ is defined as half the distance between them (see Figure 1). In each training iteration, our heuristic algorithm optimizes the models corresponding to points A and B, while adding a distance regularization constraint that simulates the SAM optimization process. To ensure connectivity, we employ a subspace learning strategy [17]. Specifically, our algorithm initializes the optimization process at the center of points A and B in each iteration and applies subspace optimization techniques to maintain a linearly connected path [17,18]. The goal of our strategy is to gradually reduce the distance between A and B, guiding them to converge toward the same flat local minimum basin or neighboring local minimum basins with low loss barriers, as illustrated in Figure 1. This process is akin to a binary star system, where A and B behave like stars in space, mutually influencing each other’s gravitational pull, thus reducing the distance between them over time.
Our contributions are summarized as follows:
  • We explored how to achieve linear mode connectivity in sharpness-aware optimization, and we propose that connectivity can be leveraged to dynamically adjust the radius ρ in SAM optimization, removing the need for manual tuning.
  • We developed a heuristic search algorithm that combines flatness optimization with linear connectivity. This algorithm simulates the gradual convergence of two points in the solution space, ensuring that when the search stops, the two solutions exhibit both linear connectivity and flatness within a geometric range that includes their neighbors.
  • We conducted comprehensive validation experiments, including testing across multiple network architectures and visualization studies, confirming the effectiveness of our approach.

2. Related Works

2.1. Sharpness-Aware Minimization

Recent works have suggested that flat local minima are associated with better generalization ability [19,20,21,22]. However, flatness does not directly equate to generalization, and flat regions may even exist in areas of high loss. Furthermore, some studies have argued that flatness does not exhibit a strong statistical correlation with generalization performance [23]. Probably Approximately Correct (PAC) theory provides a formal definition of generalization, and SAM [9] minimizes an upper bound on the generalization error as its optimization objective. Subsequent works have focused on gradient estimation methods using this upper bound, including scale-invariant approaches [14], and zero-order and first-order flatness estimation methods [15], among others. SAM has also been extended beyond Euclidean space to statistical manifolds [24] and Riemannian manifolds [25], where the optimization process was adapted to these more complex spaces. Potential improvements to SAM optimization could involve refining the gradient vector estimation, leading to more accurate directions for gradient descent optimization.

2.2. (Linear) Mode Connectivity

Seeking linear mode connectivity during training implies partial flatness, meaning the existence of low-loss regions along the direction of a specific vector [26,27,28,29]. In sharpness-aware optimization, a local minimum is modeled as a basin with a defined center and radius, ensuring linear connectivity between any two points within the basin. This implies that the line connecting the two furthest points in the basin corresponds to the basin's diameter in the flatness model of the local minimum, as discussed by [17,29]. Methods for identifying connectivity include improvements in gradient descent (SGD) [26,30] and heuristic approaches [16,31]. Building on previous work, the process begins by selecting two points that are separated by significant loss barriers [16]. Then, the distance between them is gradually reduced while maintaining flatness, until a termination point is reached, signifying linear connectivity. This process helps identify a broad flattened basin. When combined with SAM optimization using a fixed hyperparameter ρ, this resembles an implicit pinching strategy, aimed at locating a region with a broader ρ.

3. Preliminary

3.1. Sharpness-Aware Minimization

The classification model operates within the joint probability distribution space $(\mathcal{X}, \mathcal{Y})$, where $\mathcal{X}$ represents the sample space and $\mathcal{Y}$ denotes the label space. The data distribution, denoted as $\mathcal{D}$, is defined over $(\mathcal{X}, \mathcal{Y})$; the training set consists of $n$ samples, $S = \{(x_i, y_i)\}_{i=1}^{n}$, independently sampled from $\mathcal{D}$. The parameters of the model, denoted by $\theta \in \Theta \subseteq \mathbb{R}^d$, are used to model the region around $\theta$ with an open ball $B(\theta, \rho)$, where $\rho > 0$ is the radius and $\theta$ is the center of the open ball $B$ in Euclidean space, i.e., $B(\theta, \rho) = \{\theta' : \|\theta' - \theta\|_p < \rho\}$.
The training loss, $L_S(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(y_i, f(x_i; \theta))$, is the average of the individual losses over the training set. The generalization gap, which compares the expected loss $L_D(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[l(y, f(x;\theta))]$ to the training loss $L_S(\theta)$, measures the ability of the model to generalize to unseen data.
Sharpness-Aware Minimization (SAM) aims to minimize the following PAC-Bayesian generalization error upper bound:
$L_D(\theta) \le \max_{\|\epsilon\|_p \le \rho} L_S(\theta + \epsilon) + h\!\left(\|\theta\|_2^2 / \rho^2\right).$
Here, minimizing $L_S$ corresponds to empirical risk minimization, and $h$ is a strictly increasing function that can be replaced by a norm-based term. Sharpness-Aware Minimization (SAM) [9] employs a one-step backward approach to obtain the gradient approximation:
$\hat{\epsilon}(\theta) = \rho \, \nabla_\theta L_S(\theta) / \left\| \nabla_\theta L_S(\theta) \right\|_2.$
Then, SAM computes the gradient with respect to the perturbed model θ + ϵ ^ for the second step update:
$\nabla_\theta L_S^{SAM}(\theta) \approx \nabla_\theta L_S(\theta) \big|_{\theta + \hat{\epsilon}}.$
The detailed SAM is presented in Algorithm 2.
Algorithm 2 SAM [9]
  • Input: Dataset $S$, loss function $L$, batch size $b$, step size $\eta > 0$, neighborhood size $\rho > 0$, initial model $\theta_0^{init}$
  • Output: Model trained with SAM
  • Start at $\theta = \theta_0^{init}$
  • repeat
  •    Sample batch $B = \{(x_1, y_1), \ldots, (x_b, y_b)\}$
  •    Compute the gradient $\nabla_\theta L_B(\theta)$ of the batch's training loss
  •    Compute $\hat{\epsilon}(\theta) = \rho \, \nabla_\theta L_B(\theta) / \left\| \nabla_\theta L_B(\theta) \right\|_2$
  •    Compute the gradient approximation for the SAM objective: $g = \nabla_\theta L_B(\theta) \big|_{\theta + \hat{\epsilon}(\theta)}$
  •    Update weights: $\theta_{t+1} = \theta_t - \eta g$
  •    $t = t + 1$
  • until converged
  • return $\theta_t$
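To make the two-step procedure concrete, the following is a minimal PyTorch-style sketch of a single SAM update. It is not taken from the released code; `model`, `loss_fn`, `inputs`, `targets`, `base_optimizer`, and `rho` are names assumed to be supplied by the caller.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    # First forward/backward pass: gradient at the current point theta.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    scale = rho / (grad_norm + 1e-12)

    # Perturb theta -> theta + eps_hat, with eps_hat = rho * grad / ||grad||_2.
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            e = p.grad * scale if p.grad is not None else None
            if e is not None:
                p.add_(e)
            eps.append(e)

    # Second forward/backward pass: gradient at the perturbed point.
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then update with the base optimizer (e.g., SGD).
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
    return loss.item()
```

In this sketch, `base_optimizer` is assumed to be constructed over `model.parameters()`, so the final `step()` applies the second-pass gradient, mirroring the last update line of Algorithm 2.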

3.2. Analysis of Sharpness-Aware Minimization

An optimal perturbation radius ρ is crucial for Sharpness-Aware Minimization (SAM) convergence. The ideal $\rho_{target}$ should balance local curvature against the geometry of the flat region, but an explicit search is challenging because flat regions are not hyperspherical and vary across networks [13]. In non-convex optimization, this geometric uncertainty forces a reliance on trial-and-error for selecting ρ.
To address the challenge of hyperparameter selection, we propose a principled method for selecting ρ from the perspective of linear mode connectivity, enabling more efficient screening. Through observation, we identified that flat regions in the loss landscape inherently satisfy linear connectivity. Based on this, we present the following theorem.
Theorem 1.
The optimal ρ target is the maximal diameter of linearly mode connected regions [16], defined as:
$\rho_{target} = \sup \left\{ \rho > 0 \;\middle|\; \forall\, \theta_A, \theta_B \in B(\theta, \rho),\ \text{linear mode connectivity holds} \right\}.$
Theorem 1 ensures optimization within flat regions, but non-convexity complicates direct implementation. We instead use backward reasoning, approximating ρ target via nonlinear mode connectivity.
Theorem 2.
For any pair of parameters $\theta_A, \theta_B$: if $\theta_A$ and $\theta_B$ exhibit a loss barrier (i.e., are not linear mode connected), they cannot coexist in a ball $B(\theta, \rho)$ of any radius $\rho > 0$, including $\rho_{target}$.
Although Theorem 2 cannot directly lead us to an appropriate solution, it can help us eliminate distracting solutions.
Building on both insights, we propose an implicit ρ -search algorithm that operationalizes linear connectivity as a dynamic selection heuristic. First, we reformulate the optimization of SAM by modeling the endpoints of its hyperspherical perturbation as linear mode connectivity (Section 4). Subsequently, we introduce a Twin Stars Entwined (TSE) algorithm to co-optimize these paired models, steering the gradient search toward flatter regions (Section 5).

4. Modeling the SAM with Endpoints

4.1. Definition

We redefine the SAM open ball by taking the midpoint C of two points A and B, which are the endpoints of a line passing through the center of the sphere, as the center, and the half-distance between these two points as the radius ρ . Mathematically, the center C and radius ρ are expressed as
$C = \frac{A + B}{2}, \qquad \rho = \frac{\|A - B\|_2}{2},$
where $C$ is a virtual center, playing the role of the ball center in the SAM definition. The optimization goal of SAM is to minimize the loss function on the set of points $z \in B(C, \rho)$, where $B(C, \rho)$ is the open ball centered at $C$ with radius $\rho$:
$L_S^{SAM} = \arg\min_{z \in B(C, \rho)} \max_{\rho} L(z).$
However, Equation (6) cannot be directly optimized. The reason for this is that individually optimizing A and B cannot guarantee they reside within the same SAM (Sharpness-Aware Minimization) conceptual ball. In other words, when optimized separately, A and B may converge to different basins, due to factors like the initialization and optimization algorithms.
Inspired by Theorem 2, we change the target of the optimization process to achieve linear mode connectivity. In our framework, this is equivalent to minimizing the loss between points A and B directly. The premise is to ensure linear mode connectivity [16], as in (7).
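As a concrete illustration of this construction, the following sketch (an assumed helper, not part of the paper's released code) computes the virtual center $C$ and radius $\rho$ defined above from two models with identical architectures.

```python
import torch

@torch.no_grad()
def virtual_ball(model_a, model_b):
    # Parameter-wise midpoint C = (A + B) / 2 and half Euclidean distance rho.
    center = {}
    sq_dist = 0.0
    for (name, pa), (_, pb) in zip(model_a.named_parameters(),
                                   model_b.named_parameters()):
        center[name] = (pa + pb) / 2.0
        sq_dist += torch.sum((pa - pb) ** 2).item()
    rho = 0.5 * sq_dist ** 0.5
    return center, rho
```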

4.2. Train Subspace

We introduce the target radius ρ target , defined as the distance at which the path between points A and B is linearly connected. This means that there exists a straight line segment connecting A and B, and all points along this segment lie within a relatively flat region of the loss function. Mathematically, we define linear connectivity as follows:
$\text{Linear Mode Connectivity:}\ \exists\, \gamma(\alpha) \in \mathbb{R}^d,\ \alpha \in [0, 1],$
$\text{such that}\ \gamma(0) = A,\ \gamma(1) = B,\ \forall \alpha \in [0, 1],\ \|\nabla L_S(\gamma(\alpha))\|_2 = 0,$
$\text{where}\ \gamma(\alpha) = (1 - \alpha) A + \alpha B,\ \alpha \in [0, 1].$
In the context of linear mode connectivity, the ideal linear interpolation between two optima $\theta_1$ and $\theta_2$ should satisfy $\|\nabla L_S(\gamma(\alpha))\|_2 = 0$ for all $\alpha$, implying a connected low-loss path in the parameter space. Subspace learning [17] provides a framework to guarantee linear mode connectivity, defined as
$L_{subspace} = \mathbb{E}_{\alpha \sim U[0,1]} \left\| \nabla L_S(\gamma(\alpha)) \right\|_2.$
However, subspace learning requires a large number of weighted sampling operations over the two sets of weights. In practical applications, the model definition needs to be significantly modified, which is inconvenient for integration. Note that, by Theorem 1, SAM provides a linear connectivity guarantee within a hypersphere of radius ρ. Inspired by this, we combine subspace learning with the SAM process to explore a flatter basin.
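As an illustration of the sampling step that subspace learning relies on, the sketch below draws $\alpha \sim U[0,1]$, forms the interpolation $\gamma(\alpha) = (1-\alpha)A + \alpha B$, and measures the gradient norm of the training loss there on one batch. `load_interpolation` and the other names are illustrative assumptions rather than the authors' code.

```python
import copy
import torch

def load_interpolation(model_a, model_b, alpha):
    # Build a model whose parameters are gamma(alpha) = (1 - alpha) * A + alpha * B.
    model = copy.deepcopy(model_a)
    with torch.no_grad():
        for p, pa, pb in zip(model.parameters(),
                             model_a.parameters(),
                             model_b.parameters()):
            p.copy_((1.0 - alpha) * pa + alpha * pb)
    return model

def subspace_grad_norm(model_a, model_b, loss_fn, inputs, targets):
    alpha = torch.rand(1).item()                      # alpha ~ U[0, 1]
    model = load_interpolation(model_a, model_b, alpha)
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
```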

5. Proposed Algorithm

To better find flatter local minimum basins, SAM (Sharpness-Aware Minimization) and subspace learning provide several useful ingredients. The details are illustrated in Figure 2.
To provide an intuitive description of our motivation, we present the following facts:
1.
Ideal SAM guarantees connectivity within the hyperball of radius ρ centered at the convergent solution.
2.
A larger ρ attempts convergence to flatter regions. However, if ρ is too large, the algorithm may oscillate due to the absence of basins in the solution space that meet the current ρ requirement.
3.
Based on Theorem 2, the connectivity of two points A and B in the solution space determines whether they lie within the hyperball of ideal SAM.
4.
If SAM optimization with radius ρ is applied to endpoints A and B separately, under ideal conditions, A and B will converge, and if the distance between A and B is less than or equal to 2 ρ , then A and B are linearly connected.
5.
When Fact 4 is satisfied, the flat regions guaranteed by A and B are adjacent. The combined flat region guaranteed by A and B is larger than either of them individually, and thus can be regarded as a flatter region.
6.
If the line connecting A and B is treated as the diameter of a virtual ball, and Fact 4 is satisfied, then the virtual ball formed by A and B may also conform to a larger ideal SAM hyperball with radius ρ .
Summarizing the above facts, we propose a new optimization approach to implicitly obtain a larger flat region. We define an optimization process where we model the line connecting A and B as a virtual ball with radius ρ . Assuming A and B are optimized using ρ -SAM and exhibit nonlinear connectivity upon convergence, we use distance regularization to bring A and B closer, and restart training until linear connectivity is achieved, at which point we obtain a larger flat region. We further model the process of A and B approaching each other as a process with T iterative steps, as detailed in Section 6. During optimization, the target radius is determined by the iterates A t and B t at step t, defined as
$\rho_{target} = \min_{\rho} \left\{ \rho \;\middle|\; \rho \ge \frac{\|A_t - B_t\|_2}{2} \right\},$
$\text{s.t.}\ \|\nabla L_S(\gamma(\alpha))\|_2 = 0,\ \alpha \sim U[0, 1],$
$\gamma(0) = A_t,\ \gamma(1) = B_t.$
As optimization progresses, A t and B t move closer, until they enter the same or adjacent loss basins, satisfying
$\lim_{t \to T} \frac{\|A_t - B_t\|_2}{2} = \rho_{target}, \qquad \|\nabla L_S(A_t)\|_2 = \|\nabla L_S(B_t)\|_2 = 0.$
Our algorithm has the potential to converge if ρ -SAM converges. If the manual setting of ρ for the SAM algorithm converges, there exists at least one region in the solution space where both points A and B can converge simultaneously. The extreme case is when A and B collapse into a single point as they approach each other.
Our algorithm offers two key advantages. First, our algorithm eliminates the need for laborious manual tuning of ρ —instead of exhaustively searching for an optimal value, it only requires an initial ρ that ensures SAM convergence, then automatically expands the search for flatter regions. Second, our algorithm inherently avoids poor solutions by design: supported by convergence analysis and Facts 5–6, the method stabilizes SAM while providing a provable lower-bound guarantee on solution quality.

6. Twin Stars Entwined (TSE) Problem

6.1. Problem Definition

The problem mentioned in Section 5 which we aim to solve involves identifying a pair of local minima, $(\theta_t^A, \theta_t^B)$, in the solution space at step $t$, and then locating the next pair of local minima, $(\theta_{t+1}^A, \theta_{t+1}^B)$, at step $t+1$. We require that the distance between these two minima decreases with each step, meaning that $Dist(\theta_{t+1}^A, \theta_{t+1}^B) \le Dist(\theta_t^A, \theta_t^B)$, where $Dist(\cdot)$ represents the Euclidean distance. If we view the discovered pairs of local minima as a trajectory over time, the sequence $\{(\theta_0^A, \theta_0^B), \ldots, (\theta_t^A, \theta_t^B)\}$ can be imagined as two stars, A and B, gradually drawing closer to each other in the vast universe of the solution space. We refer to this problem as the Twin Stars Entwined (TSE) problem.
We propose a toy optimization approach for the Twin Stars Entwined (TSE) problem, defined by the following optimization problems:
$\min_{\theta_{t+1}^A}\ \alpha_A \left\| \theta_{t+1}^A - \theta_t^A \right\|_2^2 + (1 - \alpha_A) \left\| \theta_{t+1}^A - \theta_t^B \right\|_2^2$
$\min_{\theta_{t+1}^B}\ \alpha_B \left\| \theta_{t+1}^B - \theta_t^A \right\|_2^2 + (1 - \alpha_B) \left\| \theta_{t+1}^B - \theta_t^B \right\|_2^2.$
In this toy optimization approach, the current solutions θ t A and θ t B serve as anchor points to determine the next positions. The updates to the solutions, θ t + 1 A and θ t + 1 B , are guided by their corresponding values in the previous step, θ t A and θ t B . For example, setting α A = 0.9 will make the optimal solution θ t + 1 A close to θ t A . Similarly, if α B = 0.1 , the resulting optimal solution θ t + 1 B will be closer to θ t B . As the iterations progress, the toy optimization process forms a trajectory where the two solutions gradually converge towards each other.
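To make the role of the anchor weights explicit, note that each of Equations (11) and (12) is a convex quadratic in its argument with a closed-form minimizer; a short derivation for endpoint A (our own addition, using the notation above) is:

```latex
% Setting the gradient of the objective for endpoint A to zero:
2\alpha_A\left(\theta^{\star} - \theta_t^A\right)
  + 2\left(1 - \alpha_A\right)\left(\theta^{\star} - \theta_t^B\right) = 0
\;\Longrightarrow\;
\theta^{\star} = \alpha_A\,\theta_t^A + \left(1 - \alpha_A\right)\theta_t^B .
```

Thus, $\alpha_A = 0.9$ places the unconstrained minimizer nine tenths of the way toward $\theta_t^A$, which is why the anchor weights steer $\theta_{t+1}^A$ and $\theta_{t+1}^B$ toward their respective previous positions.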
We employ weight importance evaluation to enhance the toy TSE optimization. Taking endpoint A as an example, we aim to minimize the distance between $\theta_{t+1}^A$ and $\theta_t^A$. It is crucial to apply weighted importance when evaluating distances. Based on the dropout feasibility assumptions and the lottery ticket hypothesis [28,32], changes in redundant parameters do not impact model performance. We adopt the weight importance evaluation method used in continual learning tasks, specifically utilizing the Fisher Information Matrix [18,33]. To simplify, we only consider the diagonal elements of the Fisher matrix, assuming parameter independence. The Fisher matrix is computed as follows:
$F_{\theta_i} = \frac{1}{|S|} \sum_{j=1}^{|S|} \left( \frac{\partial \log p(y_j \mid x_j; \theta)}{\partial \theta_i} \right)^2.$
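The following is a minimal PyTorch-style sketch of this diagonal Fisher estimate (an assumed utility, not the authors' implementation); it assumes the data loader yields single examples so that the per-sample squared gradient in the formula above is computed exactly.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader, device="cpu"):
    # Accumulate squared per-sample gradients of log p(y | x; theta).
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_samples = 0
    model.eval()
    for x, y in data_loader:            # assumes batch_size=1 for an exact estimate
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=1)
        log_likelihood = log_probs.gather(1, y.view(-1, 1)).sum()
        log_likelihood.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        n_samples += x.size(0)
    return {n: f / max(n_samples, 1) for n, f in fisher.items()}
```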
We rewrite the toy optimization from Equations (11) and (12) as a loss function:
$L_{Dis}(\theta, \theta^A, \theta^B, \alpha) = \alpha \left\| F(\theta^A) \odot (\theta - \theta^A) \right\|_2^2 + (1 - \alpha) \left\| F(\theta^B) \odot (\theta - \theta^B) \right\|_2^2.$
Here, θ represents the current gradient descent point in the solution space, while θ A and θ B are the outputs from the previous round. Weight importance introduces flexibility into the search space. If the weight importance is zero, this allows the parameter to change freely, enabling backpropagation to refine the model without disrupting existing knowledge. This flexibility is vital because we cannot ensure the existence of a stopping condition between the points in the current step.
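A compact sketch of this regularizer follows. It reads the distance loss above literally as a squared norm of the Fisher-weighted difference; `fisher_a`, `fisher_b`, `theta_a`, and `theta_b` are illustrative names (e.g., the outputs of the Fisher sketch above and the previous-round endpoints), not identifiers from the released code.

```python
import torch

def distance_loss(model, theta_a, theta_b, fisher_a, fisher_b, alpha):
    # Fisher-weighted distances from the current model to the two anchors.
    loss = 0.0
    for (name, p), pa, pb in zip(model.named_parameters(),
                                 theta_a.parameters(),
                                 theta_b.parameters()):
        loss = loss + alpha * torch.sum((fisher_a[name] * (p - pa.detach())) ** 2)
        loss = loss + (1.0 - alpha) * torch.sum((fisher_b[name] * (p - pb.detach())) ** 2)
    return loss
```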
Weight importance enhances the flexibility of TSE but does not guarantee generalization. By incorporating TSE into SAM optimization, we improve the optimizer's perception of flatness and ensure better generalization. However, SAM struggles with perceiving loss barriers [16], as its forward steps may bypass loss barriers and move directly to another basin of loss-consistent contours. The perception of relationships between basins involves two key aspects. If a low-loss barrier path connects two basins, the gradient descent should follow this path [16,17,34]. If no such path exists, the optimizer must escape the current local minimum. Assuming a connected path between two global minima [2], a low-energy point [16] can guide the gradient descent. Thus, we can reformulate the optimization problem as
$\min_{\theta} \max_{\|\epsilon\|_2 \le \rho} L_S(\theta + \epsilon) + \beta L_{Dis}(\theta + \epsilon, \theta^A, \theta^B, \alpha).$
This process is repeated independently twice at each step. In the first step, gradient descent searches for θ t + 1 A , aiming to stay close to θ t A with α = 0.9 . In the second step, gradient descent searches for θ t + 1 B , aiming to stay close to θ t B with α = 0.1 . The details of each step are outlined in Algorithm 3.
Algorithm 3 TSE-SAM
  • Input: Dataset $S$, loss function $L$, batch size $b$, step size $\eta > 0$, neighborhood size $\rho > 0$, initial model $\theta_0^{init}$, pre-step models $\theta^A$ and $\theta^B$, trade-off factor $\alpha$
  • Output: Model trained with TSE
  • Start at $\theta = \theta_0^{init}$, parameter set $[\theta] = [\theta, \theta^A, \theta^B]$
  • repeat
  •    Sample batch $B = \{(x_1, y_1), \ldots, (x_b, y_b)\}$
  •    Compute the gradients $\nabla_\theta L_B(\theta)$ and $\nabla_\theta L_B^{Dis}([\theta], \alpha)$ of the batch's training loss
  •    Joint gradient: $\nabla \hat{L}_B(\theta) = \nabla L_B(\theta) + \nabla L_B^{Dis}([\theta], \alpha)$
  •    Compute $\hat{\epsilon}(\theta) = \rho \, T_\theta^2 \nabla \hat{L}_B(\theta) / \left\| T_\theta \nabla \hat{L}_B(\theta) \right\|_2$
  •    Compute the gradient approximation for the SAM objective: $g = \nabla_\theta L_B(\theta) \big|_{\theta + \hat{\epsilon}(\theta)}$
  •    Update weights: $\theta_{t+1} = \theta_t - \eta g$
  •    $t = t + 1$
  • until converged
  • return $\theta_t$
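Conceptually, a TSE-SAM step is a sharpness-aware step taken on the joint objective. The sketch below reuses the `sam_step` and `distance_loss` helpers from the earlier sketches (all assumed names), omits the ASAM scaling operator $T_\theta$ of Algorithm 3 for brevity, and, for simplicity, evaluates the joint objective at both the perturbation and the update step.

```python
def tse_sam_step(model, loss_fn, inputs, targets, base_optimizer,
                 theta_a, theta_b, fisher_a, fisher_b, alpha,
                 rho=0.05, beta=1.0):
    # Joint objective: batch loss plus the Fisher-weighted distance regularizer.
    def joint_loss_fn(outputs, labels):
        return loss_fn(outputs, labels) + beta * distance_loss(
            model, theta_a, theta_b, fisher_a, fisher_b, alpha)

    # One sharpness-aware step on the joint objective (see the SAM sketch above).
    return sam_step(model, joint_loss_fn, inputs, targets, base_optimizer, rho)
```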

6.2. Iterative Strategy for Twin Stars Entwined (TSE) Problem

We design an iterative search process to solve the Twin Stars Entwined (TSE) problem. Initially, we randomly initialize $\theta_0^{init}$. We apply two training strategies, so that the gradient trajectories starting from $\theta_0^{init}$ can move in two different directions and converge to two distinct solutions. We do not start the iterations from two separate random initializations, in order to accelerate convergence. The initialization is defined as follows:
$\theta_0^A = \mathrm{SAM}(\theta_0^{init}, S)$
$\theta_0^B = \mathrm{ASAM}(\theta_0^{init}, S).$
After obtaining the two converged models A and B, we select the center of the ball as the starting point for the next round of the search. This idea is inspired by subspace optimization [17], aiming to also achieve low loss at the interpolation points. It is defined as follows:
$\theta_{t+1}^{init} = \frac{\theta_t^A + \theta_t^B}{2}.$
If the stopping condition is not met, we consider the radius ρ represented by the current models A and B to still be too large. Using A and B from the previous iteration as anchor points, we conduct a new round of search, defined as follows:
$\theta_{t+1}^A = \mathrm{TSE}(\theta_{t+1}^{init}, \theta_t^A, \theta_t^B, \alpha_A)$
$\theta_{t+1}^B = \mathrm{TSE}(\theta_{t+1}^{init}, \theta_t^A, \theta_t^B, \alpha_B).$
The objective of our heuristic is to ensure that $dist(\theta_{t+1}^A, \theta_{t+1}^B) < dist(\theta_t^A, \theta_t^B)$, where $dist(\cdot, \cdot)$ is the distance between two models. TSE has demonstrated empirical effectiveness in achieving this objective.
Stop condition. Given the objective that TSE should stop in flatter regions, as discussed in Section 4, the onset of linear mode connectivity serves as the stopping criterion for the iterative search. We measure linear mode connectivity using linear interpolation, defined as
$\theta_i = \theta^A + \frac{i}{N} (\theta^B - \theta^A).$
We derive a set $\{\theta_i \mid i = 0, 1, \ldots, N\}$ comprising $N+1$ elements and compute the average and variance of each model's loss on the training set. Subsequently, we employ the variance as a stopping criterion. Essentially, our iterative search halts when the variance falls below a threshold $\delta$. Otherwise, our iterative search concludes upon reaching the maximum iteration step $I$, as outlined in Algorithm 4. The mean and variance are computed as
$\mu(\theta^A, \theta^B) = \frac{1}{N} \sum_{i=0}^{N} L_S(\theta_i)$
$\sigma(\theta^A, \theta^B) = \sum_{i=0}^{N} \left( L_S(\theta_i) - \mu(\theta^A, \theta^B) \right)^2.$
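A small sketch of this stopping test follows (assumed helpers, not the released code): it evaluates the training loss of $N+1$ interpolated models between the two endpoints, reusing `load_interpolation` from the earlier sketch, and returns the mean and variance to compare against the threshold $\delta$.

```python
import torch

@torch.no_grad()
def interpolation_variance(model_a, model_b, evaluate_loss, n_points=10):
    # Losses of the N + 1 interpolated models theta_i along the segment A-B.
    losses = []
    for i in range(n_points + 1):
        model_i = load_interpolation(model_a, model_b, i / n_points)
        losses.append(evaluate_loss(model_i))
    losses = torch.tensor(losses)
    mu = losses.mean()
    sigma = torch.sum((losses - mu) ** 2)
    return mu.item(), sigma.item()

# Stop the iterative search once sigma falls below the threshold delta:
# _, sigma = interpolation_variance(model_a, model_b, evaluate_loss)
# stop = sigma < delta
```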
Algorithm 4 Iterative Search
  • Input: Dataset $S$, max rounds $I$, trade-off factors $\alpha_A$ and $\alpha_B$, threshold $\delta$
  • Output: $\theta^A$, $\theta^B$
  • Initialize weights $\theta_0^{init}$
  • Get initial point A: $\theta_0^A = \mathrm{SAM}(\theta_0^{init}, S)$
  • Get initial point B: $\theta_0^B = \mathrm{ASAM}(\theta_0^{init}, S)$
  • for $t = 0$ to $I - 1$ do
  •    Get the next-step initial point: $\theta_{t+1}^{init} = (\theta_t^A + \theta_t^B) / 2$
  •    Get point A: $\theta_{t+1}^A = \mathrm{TSE}(\theta_{t+1}^{init}, S, \theta_t^A, \theta_t^B, \alpha_A)$
  •    Get point B: $\theta_{t+1}^B = \mathrm{TSE}(\theta_{t+1}^{init}, S, \theta_t^A, \theta_t^B, \alpha_B)$
  •    if $\sigma(\theta_{t+1}^A, \theta_{t+1}^B) < \delta$ then
  •      Stop loop
  •    end if
  • end for
  • return $\theta_t^A$, $\theta_t^B$
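The outer loop can be sketched compactly as follows; `sam_train`, `asam_train`, and `tse_train` stand in for full training runs with the respective optimizers, and together with the earlier helpers they are assumptions for illustration rather than the authors' interfaces.

```python
def iterative_search(theta_init, max_rounds, alpha_a, alpha_b, delta,
                     sam_train, asam_train, tse_train, evaluate_loss):
    # Initial endpoints from the SAM and ASAM training strategies.
    theta_a = sam_train(theta_init)
    theta_b = asam_train(theta_init)
    for _ in range(max_rounds):
        # Restart from the midpoint of the current endpoints.
        theta_mid = load_interpolation(theta_a, theta_b, 0.5)
        theta_a, theta_b = (tse_train(theta_mid, theta_a, theta_b, alpha_a),
                            tse_train(theta_mid, theta_a, theta_b, alpha_b))
        # Stop once the interpolation variance signals linear mode connectivity.
        _, sigma = interpolation_variance(theta_a, theta_b, evaluate_loss)
        if sigma < delta:
            break
    return theta_a, theta_b
```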

7. Experiment Setup

7.1. Image Classification: CIFAR-10/100

To evaluate the effectiveness of our proposed approach, we compared TSE with four baseline optimizers: Stochastic Gradient Descent (SGD), Sharpness-Aware Minimization (SAM) [9], Adaptive SAM (ASAM) [14], and GAM [15]. These methods served as baselines to assess the improvements introduced by our approach in terms of optimization stability and generalization performance. To ensure reproducibility, we retained the original hyperparameter settings without modification. We extended the baseline code from sam-main (the baseline code for CIFAR-10/100 was sourced from the GitHub repository sam-main: https://github.com/davda54/sam, accessed on 1 February 2025) by adding the TSE training pipeline for the CIFAR-10/100 datasets and the architecture definitions of resnet [35], wideresnet [36], and pyramidnet [37]. Our code is available at https://github.com/Timiain/TSE.git (accessed on 1 February 2025).
In the training setup, following the sam-main default settings, ρ was set to 0.05 for SAM and 2 for Adaptive SAM (ASAM), as recommended in the original papers. The experiments were conducted on the CIFAR-10 and CIFAR-100 datasets. The base optimizer was Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a linearly decreasing learning rate scheduler. The initial learning rate was 0.1, and L2 regularization with a weight decay of 0.0005 was used to enhance generalization. Label smoothing was applied with a smoothing factor of 0.1. The batch size was set to 128, and the model was trained for 200 epochs in each iteration, with a dropout rate of 0.0. Note that the SAM family is an optimization framework whose first step adds a forward perturbation to the gradient and whose second step generates a new gradient after the first-step update. The specific strategy for updating the model's learnable parameters was implemented by the base optimizer. The experiments were run on a single Titan X GPU.
We then divided the training process into two stages, the Initial and TSE stages. In the Initial stage, we separately trained models A and B with the SAM and ASAM optimizers. In the TSE stage, we set I = 5 as the maximum number of iterations. In each iteration, new models A and B were trained with hyperparameters and initial steps consistent with the GitHub sam-main setting. Our key modifications were as follows:
  • The first-step perturbation was replaced with the perturbation defined by the TSE Algorithm 3.
  • The training loss was updated to Equation (19), which incorporates an additional EWC regularization term.

7.2. Semantic Segmentation: Vaihingen

To validate the adaptability of our TSE algorithm, we selected a remote sensing semantic segmentation task, which is distinct from image classification. For ease of replication, we selected the Vaihingen dataset [38] along with its state-of-the-art (SOTA) baselines: UnetFormer [39] and LSKNet [40]. Firstly, UnetFormer and LSKNet are UNet-like architectures that blend self-attention mechanisms with convolutional networks. UnetFormer comprises a CNN encoder and a decoder that integrates a Swin-like [41] Transformer block and CNN structures in each decoder layer. LSKNet introduces a Large Kernel Selection (LSK) mechanism for dynamically selecting kernels in the encoder, combined with the UnetFormer decoder. Secondly, there is a significant visual disparity in semantic content between the remote sensing semantic segmentation dataset and the CIFAR classification dataset.
We used the UNetFormer (baseline code sourced from the GitHub repository Unetformer: https://github.com/WangLibo1995/GeoSeg, accessed on 1 February 2025) and LSKNet (baseline code of LSKNet-UNetformer sourced from the GitHub repository LSKNet: https://github.com/zcablii/GeoSeg, accessed on 1 February 2025) repositories for replication. We inherited the hyperparameter and optimizer scheduler settings from the chosen baseline implementations. Specifically, the base optimizer was AdamW [42], and the learning rate scheduler was set to CosineAnnealingWarmRestarts. The total number of training epochs for each baseline was 105. We modified the baselines by adding SAM [9], ASAM [14], and TSE. Similarly, we divided the training process into the Initial and TSE stages, the same as in Section 7.1. We retained the recommended settings from the original papers, with ρ = 0.05 for SAM and ρ = 2 for ASAM, without manual tuning of the hyperparameter ρ.

8. Experimental Results

8.1. Image Classification: CIFAR-10/100

In order to validate the performance of our method, we conducted verification on multiple network structures, including resnet [35], wideresnet [36], and pyramidnet [37]. We present the best results from our iterative search in Table 1 and Table 2. As indicated in (11) and (12), we reproduced the results of SAM [9] and ASAM [14]. To evaluate the algorithm's performance more accurately, we also compared against the latest results reported for GAM [15].
We divided the experimental setup into two stages. When training with the SAM and ASAM optimizers, we set the number of training epochs to 200, fixed the initial learning rate at 0.03, and used a linearly decreasing learning rate scheduler. The batch size was 128. For the ρ hyperparameter, we did not search for better values; all experiments used a fixed ρ, with SAM using ρ = 0.05 and ASAM using ρ = 2.0, as suggested by the authors. Both the SAM and ASAM optimizers utilized Stochastic Gradient Descent (SGD) with a momentum of 0.9. We set I = 5 as the maximum number of iterations. To ensure significant divergence from the initialized model during the iterative search steps, we initially set a high learning rate of 0.1 for each iteration.
Table 1 presents the maximum test accuracies of the different optimization algorithms (SGD, SAM, ASAM, GAM, and TSE) on the CIFAR-10 dataset. TSE generally performed the best across most models, especially on ResNet18, WRN16-8, WRN28-2, WRN28-10, and PyramidNet-110-48. GAM achieved the highest accuracy on ResNet50 and PyramidNet-110-270. Overall, TSE led in most cases, with GAM excelling for specific models like ResNet50 and PyramidNet-110-270. Table 2 shows the maximum test accuracies of the various optimization algorithms (SGD, SAM, ASAM, GAM, and TSE) on the CIFAR-100 dataset. Our TSE consistently outperformed the other algorithms for most models.

8.2. Semantic Segmentation: Vaihingen

We established a new baseline suitable for applying SAM-like optimization algorithms to the remote sensing semantic segmentation task. As shown in Table 3, TSE achieved a mIoU of 85.1% (85.07%) on UNetFormer, while TSE in LSKNet-S achieved a mIoU of 85.5% (85.47%). We observed that directly using SAM and ASAM with default parameters on a new task yielded varying performance across the different network architectures. Specifically, SAM showed positive effects both on UNetFormer and LSKNet, whereas ASAM had a negative impact on LSKNet-S. SAM and ASAM, without manual adjustment of hyperparameters, may not bring improvements in new scenarios, and our TSE process could mitigate this phenomenon.

8.3. Analysis of Iterative Search

In this section, we analyzed the behavior of our proposed iterative search algorithm.

8.3.1. Can ρ Be Infinite?

To confirm our hypothesis that ρ cannot be infinite, we conducted a hyperparameter search experiment on ρ. We set the grid search values to ρ ∈ {10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}}, as shown in Figure 3. When ρ = 10, the results of both SAM and ASAM showed degradation, consistent with the report in [14]. The experiments in Figure 4 demonstrate that setting the radius ρ to infinity is not feasible, indicating that the requirement to implicitly explore flatter surfaces is reasonable.

8.3.2. Results of Adaptive Radius

We proposed a heuristic strategy aimed at adaptively searching for an appropriate connectivity radius ρ for each specific network architecture. We recorded the average Euclidean distance between corresponding layers of models A and B at each iteration, as illustrated in Figure 4. In Figure 4, the Y-axis represents the Euclidean distance between points A and B. It is observed that the inter-model distance gradually decreased as the iterations progressed. The Euclidean distance at the stopping criterion varied across the different models and dataset configurations. The experimental results also demonstrate the difficulty of manually setting the radius ρ for SAM.
Another interesting finding is that, for each dataset, the ranking of the radius size at the stopping criterion across the different networks was consistent with the model architecture. For example, the largest radii were observed for ResNet18 and WRN16-8, while the smallest radius was found for PyramidNet-110-64. This finding provides a guide for further analysis of the geometric properties of the loss landscape associated with different network architectures.

8.3.3. Results of Linear Mode Connectivity

We employed visualization techniques, as described in [43], to illustrate the relationship between pairs of snapshots produced by our algorithm, as shown in Figure 5 and Figure 6. For this demonstration, we selected PyramidNet110 with a configuration of α = 48 as a representative example. Initially, both snapshots were trained from the same starting weight, allowing us to compare the differences between them.
When independently trained with SAM (Sharpness-Aware Minimization) and ASAM (Adaptive Sharpness-Aware Minimization), the resulting pairs of snapshots exhibited notable differences in their loss landscapes. Specifically, these snapshots showed distinct loss barriers, reflecting the impact of the different training methods. However, as the iteration steps progressed, the interpolation loss—the loss computed between the two snapshots generated by our algorithm—gradually decreased, indicating that the relationship between the snapshots became more aligned. This behavior suggests that our algorithm successfully minimized the loss discrepancy between the snapshots over time, leading to a smoother convergence in the training process.
We observed a distinct pattern in the behavior of the solution space. When two points were randomly selected within this space, the resulting loss was often quite high. This suggests that starting from different initialization points can lead to a reduction in loss during optimization. However, there is a significant risk that the solutions will converge to disconnected local minima, leading to suboptimal results. In this scenario, the interpolation between the two models will exhibit both high loss and low accuracy, as the solutions are not well-aligned.
We applied the proposed TSE search. As shown in Figure 4, when endpoints A and B gradually converged, we observed a consistent reduction in the interpolation loss between the two points. Initially, the loss was at the $10^4$ level, but by step 5 of the optimization process, it decreased to the $10^2$ level, demonstrating significant progress. Throughout this process, the accuracy of the interpolated models remained consistently high, indicating a linear connectivity between the two models, which supported smoother transitions.
Furthermore, through detailed visualization analysis, the interpolation results provided additional confirmation that our model closely approximated a flatter loss surface. This suggests that the models, as they converged, approached a region of the solution space with less steep gradients, which was beneficial for stability and generalization.

8.3.4. Results of Model Diversity

In deep learning, the diversity of models can be measured by calculating the hierarchical cosine similarity of the model’s learnable parameters. Specifically, the learnable parameters of different models with the same architecture in various layers can be compared using cosine similarity to quantify their similarity in the solution space. As cosine similarity moves closer to 0, this indicates a low correlation at the parameter level [17,44], suggesting that the models may have learned complementary features, which helps improve the generalization ability of the ensemble model. In contrast, a higher cosine similarity may indicate significant redundancy among the models, which affects their diversity. By utilizing this metric, model architecture selection, ensemble learning methods, and distillation strategies can be optimized to enhance the overall performance.
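As a concrete reference for this metric, the following sketch (an assumed helper, not from the released code) computes the cosine similarity between the flattened parameters of corresponding layers of two models with the same architecture.

```python
import torch

@torch.no_grad()
def layerwise_cosine_similarity(model_a, model_b):
    # Cosine similarity of each layer's flattened parameter tensors.
    sims = {}
    for (name, pa), (_, pb) in zip(model_a.named_parameters(),
                                   model_b.named_parameters()):
        sims[name] = torch.nn.functional.cosine_similarity(
            pa.flatten(), pb.flatten(), dim=0).item()
    return sims
```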
We visualize the similarity of the endpoints of our search strategy with Figure 7, Figure 8 and Figure 9. We calculated the cosine similarity of the parameters at each layer when models A and B converged, and used kernel density analysis to visualize the distribution of layer-wise cosine similarities between the endpoints. The blue region represents the similarity after the first iteration, where we observed peak distributions close to −1 and 1, indicating a higher correlation between the models at this point. The orange region is more concentrated and centered around 0, suggesting that there was greater divergence between endpoints A and B at this stage.
We designed an ensemble learning experiment to assess the performance of the two models, Model A and Model B, on a test dataset. The objective was to compare the individual predictions from the models and evaluate the effectiveness of combining them. In each experiment batch, the predictions from Model A and Model B were combined using a weighted averaging method. Afterward, the softmax function was applied to these combined predictions to compute the final class probabilities.
The experimental results, as presented in Table 4, show the performance of both models individually, as well as when combined via bagging. Specifically, we focused on integrating the logits (the raw output values before the softmax transformation) from both models after they made their final predictions. By combining these logits, we aimed to leverage the strengths of both models and improve the overall performance.
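A minimal sketch of this ensemble evaluation is shown below; the equal weighting `w = 0.5` is an illustrative assumption rather than the setting reported in Table 4.

```python
import torch

@torch.no_grad()
def ensemble_predict(model_a, model_b, inputs, w=0.5):
    # Weighted average of the two models' logits, followed by softmax.
    logits = w * model_a(inputs) + (1.0 - w) * model_b(inputs)
    probs = torch.softmax(logits, dim=1)
    return probs.argmax(dim=1), probs
```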
The results from the experiment demonstrate that combining models with different characteristics led to better predictive outcomes. The diversity between Model A and Model B allowed the ensemble method to capitalize on their complementary strengths. Additionally, our approach introduced an implicit form of orthogonal regularization, meaning that the models’ outputs, when integrated, exhibited reduced correlation. This regularization effect can help improve generalization and prevent overfitting, which further contributes to the enhanced performance of the ensemble model.

8.3.5. Results of Robustness to Label Noise

As demonstrated in previous work [9,14], both SAM (Sharpness-Aware Minimization) and ASAM (Adaptive Sharpness-Aware Minimization) have shown strong robustness to label noise in training data, making them effective in noisy environments. Building on these findings, we performed experiments using symmetric label noise at various noise levels: 0.2, 0.4, 0.6, and 0.8. This symmetric noise was introduced to mimic the effect of noisy or mislabeled data during the training process.
In our experiments, we evaluated the performance of our algorithm against that of standard SGD (Stochastic Gradient Descent), SAM, and ASAM at these different noise levels. The results indicate that, unlike SGD, which suffered from a significant performance drop as label noise increased, SAM and ASAM maintained relatively stable accuracies, even under noisy conditions. However, our proposed algorithm consistently outperformed both SAM and ASAM in terms of test accuracy across all noise levels, except for the 80% noise ratio. This demonstrates the superior robustness of our approach, which effectively mitigated the impact of label noise, while boosting the overall model performance.
The detailed experimental results, which highlight the performance improvements of our algorithm (TSE) under various noisy conditions, are shown in Table 5.

9. Discussion and Limitations

9.1. Failure Case Analysis

The main challenge with the algorithm arises from obtaining the models in the initial step. Factors influencing this include incorrectly set hyperparameters, such as ρ and the learning rate. If the initial models A and B are difficult to converge, the iterative training process may produce jumps. To avoid excessive consumption of computational resources, the training process can be further analyzed for biases through a visual analysis at each step, as shown in Figure 10.
Visualizing the weights obtained at each step helps with error analysis. First, compare the weights obtained by the different optimizers and their performance metrics. In Figure 10, the linear interpolations among SGD, SAM, and ASAM exhibit higher loss barriers, suggesting that these optimization methods converge to distinct solutions. Meanwhile, there was no significant difference in performance, suggesting that the current training hyperparameter settings were reliable. Second, analyzing the performance connectivity between the current and previous steps helps determine whether the search iteration has introduced any biases, such as checking whether the distance regularization term produced the expected effect. Third, a comprehensive analysis of the connectivity between the solutions at each step can help determine if the process is converging to a flat region. In a good optimization process, the linear interpolation loss barriers between the models at later steps show a decreasing trend, indicating that the search direction is likely becoming flatter. Otherwise, it may be necessary to return and adjust the training hyperparameters.

9.2. Consumption Analysis

Our algorithm only adds a regularization term to each training step, resulting in no significant increase in memory consumption apart from the overhead introduced by the regularization term. The regularization term requires storing the Fisher matrix of the network's learnable weights.
The computational time of our algorithm is difficult to estimate directly. Therefore, we provide a table listing the number of forward and backward propagations during training. Assuming that the computational cost of forward and backward propagation for one epoch of stochastic gradient descent (SGD) is taken as 1 unit, and one full training run consists of n epochs, we define the maximum number of iterations of our algorithm as I. The table below shows the worst-case computational cost of our algorithm:
Table 6 describes the worst-case scenario, where the number of training epochs per iteration is the same as in standalone training. To accelerate the process, an early stopping strategy can be applied during iterations, while setting a smaller number of epochs. The two additional passes added in each iteration are required to generate the Fisher matrix.

9.3. Stop Condition Analysis

The loss variance between two endpoints is a robust criterion for terminating iterations, ensuring consistent performance. Table 7 shows the loss variance of linear interpolation models between the endpoint models at each step. As iterations increased, this variance decreased monotonically. Optimal results typically occurred near the minimum variance step, except for ResNet-50 on CIFAR-100. Due to shared hyperparameters, PyramidNet-110-270 required more steps to reduce variance. We proposed a threshold of 0.1 as a stopping criterion. Additionally, visualizing the loss curves of the interpolation models at the conclusion of each iteration step provides significant insights, as shown in Figure 5 and Figure 6. Such visualizations enable the assessment of endpoint model convergence and facilitate a comparative analysis of the performance of the two endpoints, thereby informing decisions on whether to proceed with further iterations.
Note that we prefer to set a threshold where iterations are terminated once the variance falls below this threshold, thereby avoiding excessive iterations. One of the motivations behind our algorithm was to model the endpoints in search of flatter regions. Excessive iteration steps may cause the endpoints to gradually converge and eventually collapse into a single point.

9.4. Meta and Heuristic Learning

Our approach primarily emphasizes regularization and fine-tuning, aligning more closely with greedy algorithms within the heuristic optimization paradigm. However, the discussion of heuristic methods and meta-learning extends beyond the core scope of this paper. Nevertheless, our algorithm can be regarded as a foundation for further applications of heuristic algorithms, which, in turn, have the potential to enhance our method.
First, from the perspective of meta-learning, our method (TSE) lacks a memory mechanism. Meta-learning focuses on learning to learn, where the optimization process may carry more generalizable information about the loss landscape geometry [45,46,47]. In contrast, our TSE relies solely on the model state from the previous step for updates. A potential direction for improvement would be to incorporate multi-step historical information to enhance the optimization performance. Furthermore, integrating more comprehensive metrics could facilitate a more accurate assessment of whether the current search direction is appropriate.
Second, heuristic algorithms offer advantages in refining regularization terms and selecting search starting points. For instance, combinatorial optimization strategies such as evolutionary algorithms [48] and genetic algorithms [49] can optimize the selection of iteration starting points, while providing greater flexibility in adjusting the weights of regularization terms. The incorporation of these methods has the potential to further improve the efficiency and robustness of model optimization.

9.5. Future Directions

Our algorithm and its visualizations provide a more intuitive way to search for flat regions, offering an adaptive approach to flat region detection. The method follows a bounding strategy: it begins by assuming a sufficiently large flat region, and if the current loss landscape does not satisfy this assumption, the region is reduced and the search is repeated. Compared to SAM, which sets a fixed radius ρ, our approach is more purposeful. SAM lacks intuitive feedback to guide further hyperparameter tuning. As shown in Figure 4, the stop distances achieved for different network architectures indicate that SAM requires specific ρ settings for each architecture, which involves extensive tuning and hyperparameter adjustments. Our method can be seen as a tool for intuitively understanding the flatness properties of the loss landscape in different network solution spaces.
Our method has led to two interesting observations, providing a foundation for further research. Assuming the results represent the flattest loss regions, future research may focus on the following:
1.
Model Architecture and Flat Region Size. As shown in Figure 4, the same network architecture corresponds to different flattest regions, depending on the dataset. Similarly, for the same dataset, different network architectures exhibit distinct flattest regions. It is possible that a normalization framework could unify this phenomenon. Furthermore, the size of the flattest region could guide the design of model architectures.
2.
Euclidean Distance vs. Cosine Distance. As the Euclidean distance between two points decreases, the learned parameter vectors of the two models tend to become more orthogonal. This suggests that, although the Euclidean distance diminishes, model diversity increases. This observation is beneficial for better application of ensemble learning. When using ensemble learning to obtain a set of models, orthogonal regularization is often applied without considering the Euclidean distance between models. A potential issue arises: the generalization ability of an ensemble of models after orthogonal regularization is related to whether these models reside in the same loss basin. Our TSE algorithm provides a controllable tool to support further research in this area.
3.
Combining heuristic algorithms and meta-learning. A meta-learning algorithm memorizes gradient trajectory information, which is beneficial for designing better regularization and optimization iteration steps. Heuristic algorithms provide more diverse and efficient combinatorial forms, facilitating the selection of starting points and anchor points for the optimization iteration algorithm.

10. Conclusions

In this paper, we proposed an optimization strategy based on linear mode connectivity to adaptively adjust the hyperparameter ρ in Sharpness-Aware Minimization (SAM). We innovatively employed a proof by contradiction to establish the theoretical relationship between the SAM hypersphere and connectivity. Furthermore, we introduced an iterative algorithm that progressively approximates flat regions. By analyzing the connectivity between A and B, our approach enables the adaptive adjustment of the SAM hypersphere radius ρ without requiring additional manual tuning. Additionally, we proposed a distance regularization constraint and a corresponding TSE training process to guide A and B toward convergence in the same flat local minimum basin or adjacent low-loss basin, thereby enhancing optimization stability and generalization performance. Experimental results demonstrate that our method achieved superior convergence and generalization performance across multiple network architectures, significantly enhancing the model effectiveness.

Author Contributions

Conceptualization, H.L. (Hailun Liang) and Y.L.; methodology, H.L. (Hailun Liang); software, H.L. (Hailun Liang); validation, Y.L., H.Z., H.W., L.H. and H.L. (Haoyi Lin); formal analysis, H.L. (Hailun Liang), H.Z., H.W. and L.H.; investigation, H.L. (Hailun Liang), Y.L.; resources, H.L. (Hailun Liang); data curation, H.L. (Hailun Liang); writing—original draft preparation, H.L. (Hailun Liang); writing—review and editing, Y.L., H.Z., H.W., L.H. and H.L. (Haoyi Lin); visualization, H.L. (Hailun Liang) and H.Z.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This document is the result of a research project funded by the National Key Research and Development Plan under Grant 2021YFE0205700, Science and Technology Development Fund of Macau project 0070/2020/AMJ, 00123/2022/A3, 0096/2023/RIA2.

Data Availability Statement

The CIFAR dataset is publicly available and can be accessed at [https://www.cs.toronto.edu/~kriz/cifar.html] (accessed on 1 February 2025). The Vaihingen dataset is publicly available and can be accessed at [https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx] (accessed on 1 February 2025).

Acknowledgments

All contributions have been included by the authors, and there are no other acknowledgments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Allen-Zhu, Z.; Li, Y.; Song, Z. A convergence theory for deep learning via over-parameterization. Int. Conf. Mach. Learn. 2019, 97, 242–252. [Google Scholar]
  2. Simsek, B.; Ged, F.; Jacot, A.; Spadaro, F.; Hongler, C.; Gerstner, W.; Brea, J. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances. Int. Conf. Mach. Learn. 2021, 139, 9722–9732. [Google Scholar]
  3. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 2021, 64, 107–115. [Google Scholar] [CrossRef]
  4. Yue, X.; Nouiehed, M.; Al Kontar, R. Salr: Sharpness-aware learning rates for improved generalization. IEEE Trans. Neural Netw. Learn. Syst. 2020, 35, 12518–12527. [Google Scholar] [CrossRef] [PubMed]
  5. Sun, X.; Zhang, Z.; Ren, X.; Luo, R.; Li, L. Exploring the vulnerability of deep neural networks: A study of parameter corruption. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11648–11656. [Google Scholar] [CrossRef]
  6. Chaudhari, P.; Choromanska, A.; Soatto, S.; LeCun, Y.; Baldassi, C.; Borgs, C.; Chayes, J.; Sagun, L.; Zecchina, R. Entropy-sgd: Biasing gradient descent into wide valleys. J. Stat. Mech. Theory Exp. 2019, 2019, 124018. [Google Scholar] [CrossRef]
  7. Mobahi, H. Training recurrent neural networks by diffusion. arXiv 2016, arXiv:1601.04114. [Google Scholar]
  8. Hochreiter, S.; Schmidhuber, J. Flat minima. Neural Comput. 1997, 9, 1–42. [Google Scholar] [CrossRef]
  9. Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. arXiv 2020, arXiv:2010.01412. [Google Scholar]
  10. Böttcher, L.; Wheeler, G. Visualizing high-dimensional loss landscapes with Hessian directions. arXiv 2022, arXiv:2208.13219. [Google Scholar] [CrossRef]
  11. Sun, S.; Cao, Z.; Zhu, H.; Zhao, J. A survey of optimization methods from a machine learning perspective. IEEE Trans. Cybern. 2019, 50, 3668–3681. [Google Scholar] [CrossRef] [PubMed]
  12. Wen, K.; Ma, T.; Li, Z. How does sharpness-aware minimization minimize sharpness? arXiv 2022, arXiv:2211.05729. [Google Scholar]
  13. Li, X.C.; Tang, J.L.; Zhang, B.; Li, L.; Zhan, D.C. Exploring and exploiting the asymmetric valley of deep neural networks. arXiv 2024, arXiv:2405.12489. [Google Scholar]
  14. Kwon, J.; Kim, J.; Park, H.; Choi, I.K. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. Int. Conf. Mach. Learn. 2021, 139, 5905–5914. [Google Scholar]
  15. Zhang, X.; Xu, R.; Yu, H.; Zou, H.; Cui, P. Gradient norm aware minimization seeks first-order flatness and improves generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Paris, France, 1–6 October 2023; pp. 20247–20257. [Google Scholar]
  16. Draxler, F.; Veschgini, K.; Salmhofer, M.; Hamprecht, F. Essentially no barriers in neural network energy landscape. Int. Conf. Mach. Learn. 2018, 80, 1309–1318. [Google Scholar]
  17. Wortsman, M.; Horton, M.C.; Guestrin, C.; Farhadi, A.; Rastegari, M. Learning neural network subspaces. Int. Conf. Mach. Learn. 2021, 139, 11217–11227. [Google Scholar]
  18. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
  19. Keskar, N.S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; Tang, P.T.P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv 2016, arXiv:1609.04836. [Google Scholar]
  20. Kaur, S.; Cohen, J.; Lipton, Z.C. On the maximum hessian eigenvalue and generalization. Proc. Mach. Learn. Res. 2023, 187, 51–65. [Google Scholar]
  21. Jia, Z.; Su, H. Information-theoretic local minima characterization and regularization. Int. Conf. Mach. Learn. 2020, 119, 4773–4783. [Google Scholar]
  22. Zhuang, J.; Gong, B.; Yuan, L.; Cui, Y.; Adam, H.; Dvornek, N.; Tatikonda, S.; Duncan, J.; Liu, T. Surrogate gap minimization improves sharpness-aware training. arXiv 2022, arXiv:2203.08065. [Google Scholar]
  23. Andriushchenko, M.; Croce, F.; Müller, M.; Hein, M.; Flammarion, N. A modern look at the relationship between sharpness and generalization. arXiv 2023, arXiv:2302.07011. [Google Scholar]
  24. Kim, M.; Li, D.; Hu, S.X.; Hospedales, T. Fisher sam: Information geometry and sharpness aware minimisation. Int. Conf. Mach. Learn. 2022, 162, 11148–11161. [Google Scholar]
  25. Yun, J.; Yang, E. Riemannian SAM: Sharpness-Aware Minimization on Riemannian Manifolds. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  26. Entezari, R.; Sedghi, H.; Saukh, O.; Neyshabur, B. The role of permutation invariance in linear mode connectivity of neural networks. arXiv 2021, arXiv:2110.06296. [Google Scholar]
  27. Neyshabur, B.; Sedghi, H.; Zhang, C. What is being transferred in transfer learning? Adv. Neural Inf. Process. Syst. 2020, 33, 512–523. [Google Scholar]
  28. Frankle, J.; Dziugaite, G.K.; Roy, D.; Carbin, M. Linear mode connectivity and the lottery ticket hypothesis. Int. Conf. Mach. Learn. 2020, 119, 3259–3269. [Google Scholar]
  29. Juneja, J.; Bansal, R.; Cho, K.; Sedoc, J.; Saphra, N. Linear connectivity reveals generalization strategies. arXiv 2022, arXiv:2205.12411. [Google Scholar]
  30. Ainsworth, S.K.; Hayase, J.; Srinivasa, S. Git re-basin: Merging models modulo permutation symmetries. arXiv 2022, arXiv:2209.04836. [Google Scholar]
  31. Freeman, C.D.; Bruna, J. Topology and geometry of half-rectified network optimization. arXiv 2016, arXiv:1611.01540. [Google Scholar]
  32. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  33. Shi, W.; Chen, Y.; Zhao, Z.; Lu, W.; Yan, K.; Du, X. Create and find flatness: Building flat training spaces in advance for continual learning. arXiv 2023, arXiv:2309.11305. [Google Scholar]
  34. Li, Z.; Lin, J.; Li, Z.; Zhu, D.; Ye, R.; Shen, T.; Lin, T.; Wu, C. Improving Group Connectivity for Generalization of Federated Deep Learning. arXiv 2024, arXiv:2402.18949. [Google Scholar]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  36. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146. [Google Scholar]
  37. Han, D.; Kim, J.; Kim, J. Deep pyramidal residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5927–5935. [Google Scholar]
  38. ISPRS: 2D Semantic Labeling-Vaihingen. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (accessed on 1 February 2025).
  39. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote. Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  40. Li, Y.; Li, X.; Dai, Y.; Hou, Q.; Liu, L.; Liu, Y.; Cheng, M.-M.; Yang, J. Lsknet: A foundation lightweight backbone for remote sensing. Int. J. Comput. Vis. 2025, 133, 1410–1431. [Google Scholar] [CrossRef]
  41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  42. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  43. Li, H.; Xu, Z.; Taylor, G.; Studer, C.; Goldstein, T. Visualizing the loss landscape of neural nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 2–8 December 2018; Volume 31. [Google Scholar]
  44. Fort, S.; Hu, H.; Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. arXiv 2019, arXiv:1912.02757. [Google Scholar]
  45. Guiroy, S.; Verma, V.; Pal, C. Towards understanding generalization in gradient-based meta-learning. arXiv 2019, arXiv:1907.07287. [Google Scholar]
  46. Flennerhag, S.; Rusu, A.A.; Pascanu, R.; Visin, F.; Yin, H.; Hadsell, R. Meta-learning with warped gradient descent. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 6–9 May 2019. [Google Scholar]
  47. Lee, J.; Yoo, J.; Kwak, N. SHOT: Suppressing the hessian along the optimization trajectory for gradient-based meta-learning. Adv. Neural Inf. Process. Syst. 2023, 36, 61450–61465. [Google Scholar]
  48. Bartz-Beielstein, T.; Branke, J.; Mehnen, J.; Mersmann, O. Evolutionary algorithms. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2014, 4, 178–195. [Google Scholar] [CrossRef]
  49. Holland, J.H. Genetic algorithms. Sci. Am. 1992, 267, 66–73. [Google Scholar] [CrossRef]
Figure 1. Illustration of gradient search with TSE: θ1 and θ2 are local minima obtained independently via gradient descent without processing (N/P), separated by a large loss barrier in the loss landscape. Maintaining linear mode connectivity between θ1 and θ2 helps explore flatter regions of the loss landscape. We obtain an interpolated solution between θ1 and θ2 and use it as the new starting point for restarting gradient descent. We propose an improved sharpness-aware iterative optimization method, termed TSE, to obtain two solutions that are both flat and linearly connected. Our strategy first uses the interpolation between θ1 and θ2 as the starting point, and then incorporates distance regularization with θ1 and θ2 during the subsequent training. The red dots represent the terminal states of one model at the end of each step, and the red lines indicate the direction of change between these steps; the blue dots and lines convey the corresponding information for the other model. The stopping conditions are (a) reaching the edge of a basin whose radius ρ is greater than the initial ρ, or (b) identifying two minima that are connected by a linear path in the loss landscape.
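The iterative scheme described in this caption can be summarized in a short Python sketch. The helpers train_one_round, loss_barrier, and interpolate_state are hypothetical stand-ins for the per-round training (SAM plus distance regularization), the interpolation-barrier evaluation, and the state-dict blend; the tolerance barrier_tol and the step cap max_steps are illustrative values, not numbers taken from the paper.

```python
def tse_outer_loop(model_a, model_b, rho, lambda_dist,
                   barrier_tol=0.05, max_steps=5):
    """Iteratively retrain two models from their linear midpoint until the
    interpolation loss barrier between them is small (linear mode connectivity).

    Hypothetical helpers assumed for this sketch:
      interpolate_state(sd_a, sd_b, alpha)             -> element-wise blend of two state dicts
      loss_barrier(sd_a, sd_b)                         -> peak interpolation loss minus endpoint mean
      train_one_round(model, partner, rho, lambda_dist) -> SAM training with a distance
                                                           penalty toward `partner`
    """
    for step in range(max_steps):
        sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
        if loss_barrier(sd_a, sd_b) < barrier_tol:
            break  # stopping condition (b): the two minima are linearly connected
        # Restart both models from the midpoint of the previous solutions.
        midpoint = interpolate_state(sd_a, sd_b, alpha=0.5)
        model_a.load_state_dict(midpoint)
        model_b.load_state_dict(midpoint)
        # Retrain each model while pulling it toward the other.
        train_one_round(model_a, model_b, rho, lambda_dist)
        train_one_round(model_b, model_a, rho, lambda_dist)
    return model_a, model_b
```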
Figure 2. Illustration of our ρ selection strategy. We extend SAM by introducing a distance regularization term that encourages models with the same network architecture to converge to nearby regions in the solution space. To determine a suitable ρ, we start with a relatively large value defined by the two endpoints and gradually reduce it using a heuristic search. Our method effectively enables the selection of favorable basins: even with a fixed ρ, models initially located in neighboring basins may converge into a broader or adjacent basin, implicitly simulating an increase in ρ through the merging process.
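One simple way to realize the "start large, then shrink" heuristic in this caption is to tie ρ to the current distance between the two endpoint models, as sketched below. The scale factor and floor value are illustrative assumptions; the paper's actual schedule follows the TSE algorithm rather than this exact formula. The endpoint_distance quantity is essentially what Figure 4 tracks over iterations.

```python
import torch


def endpoint_distance(model_a, model_b):
    """Euclidean distance between the flattened parameters of two models."""
    vec_a = torch.cat([p.detach().reshape(-1) for p in model_a.parameters()])
    vec_b = torch.cat([p.detach().reshape(-1) for p in model_b.parameters()])
    return torch.norm(vec_a - vec_b).item()


def adaptive_rho(model_a, model_b, scale=0.5, rho_min=0.05):
    """Large initial radius defined by the two endpoints, shrinking as they approach."""
    return max(scale * endpoint_distance(model_a, model_b), rho_min)
```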
Figure 3. Illustration of test accuracy curves obtained from the SAM and ASAM algorithms with ρ ranging over {10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹}.
Figure 4. Illustration of Euclidean distance between output models using our algorithm steps. The x-axis represents the number of iterations, and the y-axis represents the Euclidean distance. As the number of iterations increases, the distance between points A and B gradually decreases.
Figure 5. Illustration of the loss and accuracy curves of linear interpolation on CIFAR-10 at different steps of our proposed algorithm. As the number of iterations increased, the interpolation loss barrier between A and B disappeared, and the losses of A and B gradually decreased. The accuracy of the interpolation model improved and stabilized. Eventually, both models converged to a flat region.
Figure 6. Illustration of the loss and accuracy curves of linear interpolation on CIFAR-100 at different steps of our proposed algorithm. As the number of iterations increased, the interpolation loss barrier between A and B disappeared, and the losses of A and B gradually decreased. The accuracy of the interpolation model improved and stabilized. Eventually, both models converged to a flat region.
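The curves in Figures 5 and 6 can be reproduced by evaluating checkpoints along the straight line between A and B. The sketch below is one way to do this; the 21-point grid is an illustrative choice, integer buffers (e.g., BatchNorm counters) are simply copied from A, and in practice BatchNorm running statistics of interpolated weights may need to be re-estimated.

```python
import torch


def interpolate_state(sd_a, sd_b, alpha):
    """Blend two state dicts: (1 - alpha) * A + alpha * B; integer buffers copied from A."""
    return {
        k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] if sd_a[k].is_floating_point() else sd_a[k]
        for k in sd_a
    }


@torch.no_grad()
def interpolation_curve(model, sd_a, sd_b, loader, loss_fn, device, n_points=21):
    """Loss and accuracy along the linear path between two checkpoints A and B."""
    curve = []
    for alpha in torch.linspace(0.0, 1.0, n_points).tolist():
        model.load_state_dict(interpolate_state(sd_a, sd_b, alpha))
        model.eval()
        total_loss, correct, count = 0.0, 0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            total_loss += loss_fn(logits, y).item() * y.size(0)
            correct += (logits.argmax(dim=1) == y).sum().item()
            count += y.size(0)
        curve.append((alpha, total_loss / count, correct / count))
    return curve
```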
Figure 7. Kernel Density Estimation (KDE) of the cosine similarity between each layer's parameters within snapshots for the CIFAR-10 dataset. The cosine similarity ranges from −1 to 1; the estimated probability density extends slightly beyond this range because of the 0.1 bandwidth.
Figure 8. Kernel Density Estimation (KDE) of the cosine similarity between each layer's parameters within snapshots for the CIFAR-100 dataset. The cosine similarity ranges from −1 to 1; the estimated probability density extends slightly beyond this range because of the 0.1 bandwidth.
Figure 9. Kernel Density Estimation (KDE) of the cosine similarity between each layer's parameters within snapshots for the Vaihingen dataset. The cosine similarity ranges from −1 to 1; the estimated probability density extends slightly beyond this range because of the 0.1 bandwidth.
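Figures 7–9 can be reproduced roughly as follows: compute the cosine similarity between corresponding layer parameters of two snapshots, then estimate its density with a Gaussian kernel of bandwidth 0.1 (which is also why some density mass appears outside the valid [−1, 1] range). The scikit-learn KernelDensity estimator and the grid limits are our own choices for this sketch.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.neighbors import KernelDensity


def layerwise_cosine_similarity(sd_a, sd_b):
    """Cosine similarity between corresponding layer parameters of two snapshots."""
    sims = []
    for name, w_a in sd_a.items():
        if not w_a.is_floating_point():
            continue  # skip integer buffers such as BatchNorm counters
        w_b = sd_b[name]
        sims.append(F.cosine_similarity(w_a.reshape(1, -1), w_b.reshape(1, -1)).item())
    return np.array(sims)


def similarity_kde(sims, bandwidth=0.1):
    """Gaussian KDE of the layer-wise similarities with the 0.1 bandwidth used in the figures."""
    grid = np.linspace(-1.5, 1.5, 301).reshape(-1, 1)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(sims.reshape(-1, 1))
    return grid.ravel(), np.exp(kde.score_samples(grid))
```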
Figure 10. Visualization and error analysis.
Table 1. Maximum test accuracies for SGD, SAM, ASAM, GAM, and TSE on the CIFAR-10 dataset. Bold numbers indicate the best accuracy, signifying the best classification performance. The symbol ’*’ denotes the result reported for ResNet-101 in GAM [15].
CIFAR-10 (Acc%)     SGD      SAM      ASAM     GAM        TSE
resnet18            95.98    96.22    96.64    96.93      96.98
resnet50            96.2     95.98    96.55    97.35 *    97.12
wrn16-8             96.87    96.86    97.31    -          97.78
wrn28-2             95.78    95.68    96.4     95.94      96.55
wrn28-10            96.86    96.22    97.2     97.40      97.74
pyramid110-48       95.85    95.98    93.96    -          96.92
pyramid110-270      97.20    97.25    97.37    97.60      97.36
Table 2. Maximum test accuracies for SGD, SAM, ASAM, GAM, and TSE on the CIFAR-100 dataset. Bold numbers indicate the best accuracy, signifying the best classification performance. The symbol ’*’ denotes the result reported for ResNet-101 in GAM [15].
CIFAR-100 (Acc%)    SGD      SAM      ASAM     GAM        TSE
resnet18            79.63    79.55    81.27    80.7       82.03
resnet50            78.74    79.56    81.38    83.2 *     83.45
wrn16-8             82.4     82.63    84.34    -          84.77
wrn28-2             77.7     77.84    79.64    77.89      80.73
wrn28-10            80.14    82.57    81.75    84.37      85.1
pyramid110-48       78.35    78.63    75.14    -          82.98
pyramid110-270      83.74    84.25    84.12    85.31      85.88
Table 3. Maximum test mIoUs for AdamW, SAM, ASAM, and TSE on the Vaihingen dataset. Bold numbers indicate the best mean Intersection over Union (mIoU), signifying the best semantic segmentation performance.
Vaihingen (mIoU%)   AdamW    SAM      ASAM     TSE
UNetFormer          82.7     84.6     84.7     85.1
LSKNet-S            85.1     85.3     85.0     85.5
Table 4. Test results for ensemble model on CIFAR-10/100 and Vaihingen. The bold numbers indicate that the results from model ensemble (bagging) outperform those of individual models by more than 0.1%.
CIFAR-10 (Acc%)        SGD      TSE-A    TSE-B    A + B Bagging
resnet18               95.98    96.98    96.95    96.96
resnet50               96.2     97.02    97.12    97.27
wrn16-8                96.87    97.78    97.56    97.68
wrn28-2                95.78    96.41    96.55    96.59
wrn28-10               96.86    97.7     97.66    97.71
pyramidnet-110-48      95.85    96.92    96.62    96.93
pyramidnet-110-270     97.2     97.36    97.27    97.53

CIFAR-100 (Acc%)       SGD      TSE-A    TSE-B    A + B Bagging
resnet18               79.63    82.03    81.92    82.24
resnet50               78.74    83.45    83.31    83.90
wrn16-8                82.4     84.48    84.77    85.13
wrn28-2                77.7     79.92    80.73    81.16
wrn28-10               80.14    85.1     84.8     85.29
pyramidnet-110-48      78.35    82.98    82.83    83.20
pyramidnet-110-270     83.74    85.88    85.61    86.34

Vaihingen (mIoU%)      AdamW    TSE-A    TSE-B    A + B Bagging
UNetFormer             82.7     85.07    84.9     85.08
LSKNet-S               85.1     85.4     85.47    85.56
Table 5. Maximum test accuracies of ResNet-50 models trained on CIFAR-10 with label noise. The bold numbers indicate the best accuracy, signifying the best classification performance.
Noise Rate    SGD      SAM      ASAM     TSE
0             96.2     95.98    96.55    97.12
0.2           91.5     91.71    93.18    93.46
0.4           86.83    87.21    89.15    89.47
0.6           79.17    79.03    82.88    85.56
0.8           25.1     28.19    24.02    26.07
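As context for Table 5, the label-noise setting can be simulated by corrupting a fraction of the CIFAR-10 training labels before training. The sketch below assumes symmetric (uniform) label noise, the common protocol for such experiments; the paper does not spell out its exact corruption procedure, so treat this as an illustrative assumption.

```python
import numpy as np


def add_symmetric_label_noise(labels, noise_rate, num_classes=10, seed=0):
    """Flip a fraction `noise_rate` of labels to a uniformly chosen different class."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip_idx = rng.choice(len(noisy), size=int(noise_rate * len(noisy)), replace=False)
    for i in flip_idx:
        candidates = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(candidates)
    return noisy
```

For a torchvision CIFAR-10 dataset, this could be applied as trainset.targets = list(add_symmetric_label_noise(trainset.targets, 0.2)).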
Table 6. Consumption analysis.
SGD    SAM    TSE
n      2n     (2 × 2n + 2) I
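Reading the entries of Table 6 as gradient evaluations, with n evaluations per training pass for SGD, 2n for SAM's ascent and descent, and (2 × 2n + 2) I for TSE's two SAM-trained models plus two extra evaluations over I iterations (this reading of the notation is our own), the relative overhead works out as follows.

```latex
% Assuming n gradient evaluations per training pass and I TSE iterations
% (our interpretation of Table 6):
\[
  \frac{\mathrm{TSE}}{\mathrm{SGD}} = \frac{(2 \times 2n + 2)\,I}{n}
  = \frac{(4n + 2)\,I}{n} \approx 4I \quad (n \gg 1),
  \qquad
  \frac{\mathrm{TSE}}{\mathrm{SAM}} \approx 2I .
\]
```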
Table 7. Linear interpolation loss variance of iteratively output checkpoint models.
CIFAR-10             Initial    Step 1    Step 2    Step 3     Step 4     Step 5
resnet18             0.793      0.767     0.009     0.003      0.002 *    0.0006
resnet50             0.74       0.789     0.362     0.12       0.08       0.07 *
wrn16-8              0.783      0.098     0.01      0.003 *    0.01       0.004
wrn28-2              0.77       0.67      0.11      0.04       0.04       0.04 *
wrn28-10             0.79       0.15      0.04      0.01       0.006      -
pyramidnet110-48     0.7        0.73      0.34      0.06       0.02 *     0.01
pyramidnet110-270    0.71       0.71      0.79      0.76       0.66       0.62 *

CIFAR-100            Initial    Step 1    Step 2    Step 3     Step 4     Step 5
resnet18             1.588      1.433     0.232     0.007      0.01       0.01 *
resnet50             1.59       1.16      0.32      0.15       0.158 *    0.04
wrn16-8              1.65       0.32      0.09      0.04       0.01       0.005 *
wrn28-2              1.48       1.82      0.92      0.21       0.2        0.06 *
wrn28-10             1.71       0.37      0.05      0.01       0.006 *    -
pyramidnet110-48     1.45       1.57      0.97      0.25       0.09 *     0.05
pyramidnet110-270    1.51       1.53      1.65      1.43       1.04       0.9 *
’*’ denotes the best test result at the current step.
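The quantity reported in Table 7 can be obtained directly from the interpolation sweep sketched after Figure 6: collect the losses along the A-B path and take their variance. A minimal helper, assuming that earlier interpolation_curve sketch:

```python
import numpy as np


def interpolation_loss_variance(curve):
    """Variance of the losses along the A-B interpolation path.

    `curve` is the list of (alpha, loss, accuracy) tuples returned by the
    interpolation_curve sketch given after Figure 6.
    """
    losses = np.array([loss for _, loss, _ in curve])
    return float(np.var(losses))
```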
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
