1. Introduction
The optical phase carries critical information about the propagation of light waves and can be utilized to analyze the properties of objects by detecting changes in the optical wavefront phase. This technology has been widely applied in areas such as atmospheric turbulence detection, optical element defect analysis, and biological sample research [1,2,3]. Among these, single-shot focal plane wavefront recognition is a technique that reconstructs the incident wavefront from a single focal plane intensity image.
Traditional wavefront recognition methods mainly rely on the two-dimensional intensity distribution of the diffracted light field to reconstruct wavefront aberrations. In 1972, Gerchberg and Saxton introduced the GS algorithm, which uses Fourier transforms to iteratively compute the intensity distributions of the far field and near field, thus recovering the wavefront phase [4]. The GS algorithm has a simple structure, high energy concentration, and can detect higher-order aberrations. However, it suffers from low inversion precision, a tendency to converge to local optima, and a fundamental multiple-solution ambiguity: the near-field complex amplitude and its 180°-rotated conjugate produce the same far-field intensity distribution [5].
In 1979, Gonsalves and Chidlaw introduced the phase diversity (PD) method for wavefront reconstruction. This method uses two images, one in focus and one defocused, to iteratively recover the wavefront phase by combining intensity and defocus information. While the PD method resolves the multiple-solution issue by adding a constraint, it complicates the optical system and increases computational demands [6]. In 1978, Fienup introduced the hybrid input-output (HIO) algorithm, which alternates between projections onto object-domain and frequency-domain constraint sets to reconstruct the image. Building on the GS algorithm, the HIO algorithm improves convergence speed, especially in coherent diffraction imaging, but it still fails to resolve the multiple-solution issue [7]. Fienup later proposed using asymmetric apertures to break the system's symmetry [8]. Although this approach effectively removes the ambiguity, it also increases the complexity of the optical design and requires more precise alignment for optimal performance. In 2019, Kong Qingfeng and colleagues at the Chinese Academy of Sciences proposed an improved GS algorithm based on Walsh function modulation for wavefront reconstruction, eliminating the 180° rotational symmetry of the modulation phase and successfully recovering the wavefront phase [9]. However, it remains an iterative method and still suffers from limited inversion accuracy and convergence difficulties.
Neural networks have also been applied to wavefront reconstruction, and recent advances in deep learning have opened new approaches. As early as 1992, Jorgenson and Aitken proposed using adaptive linear predictors and backpropagation neural networks (NNs) to predict wavefront distortions; their neural network predictors achieved a mean squared error (MSE) roughly half that of linear systems [10]. In 2020, Liu et al. applied deep learning to wavefront phase prediction in open-loop adaptive optics, using a long short-term memory (LSTM)-based neural network to compensate for control-loop delays. Their approach outperformed traditional methods by providing stable and accurate wavefront predictions under varying turbulence conditions, highlighting the potential of neural networks for real-time wavefront phase recovery [11]. However, the lack of model interpretability raises concerns about robustness in practical applications. In 2023, Chen et al. introduced PWFS-ResUnet, a network that reconstructs wavefront phase maps from plenoptic wavefront sensor (PWFS) slope measurements, effectively mitigating nonlinear issues in traditional approaches [12]. Building upon attention mechanisms [13,14], Feng et al. (2023) proposed the SH-U-Transformer, a Transformer-based network that directly reconstructs wavefront distributions from Shack–Hartmann wavefront sensor (SHWFS) spot-array images [15]. In 2024, Zhang et al. presented ADSA-Net, which integrates an additive self-attention (ADSA) module to significantly enhance both accuracy and efficiency in multi-aperture object feature recognition [16]. Similarly, Hu et al. (2024) developed an automated detection system based on an improved attention U-Net, enabling fast and accurate micropore defect detection in composite thin-film materials [17]. Meanwhile, Zhao et al. combined adaptive image processing with convolutional neural networks incorporating a simple, parameter-free attention module (SimAM), greatly improving the performance and robustness of superimposed orbital angular momentum (OAM) beam topological charge identification under turbulent conditions [18]. Finally, Kazemzadeh et al. improved image transmission fidelity through turbulent media by introducing a global attention mechanism (GAM) [19].
To address the limitations of existing methods—such as low inversion precision, convergence challenges, system complexity, and multiple solutions—we propose a novel single-frame focal plane wavefront recognition method based on the Transformer architecture. By utilizing the feature extraction and sequence modeling strengths of Transformers, our approach aims to improve upon traditional iterative methods and deep learning-based techniques.
This study explores the proposed method, covering its theoretical foundation, algorithm design, and experimental validation. The results show improvements in wavefront reconstruction accuracy and demonstrate the system’s simplicity, as well as its ability to resolve multi-solution issues, making it a promising solution for optical wavefront sensing and imaging.
3. Development of Neural Network Systems
Convolutional neural networks (CNNs) have shown exceptional performance in image processing, particularly in recognizing complex features and capturing local details. However, their ability to integrate global context remains limited. In contrast, Transformer architectures excel in natural language processing (NLP), with superior capabilities in sequence data processing and global information aggregation. To address this gap, we developed a Transformer-based neural network system that accepts far-field focal plane images as input and outputs the first 36 Zernike polynomial coefficients representing near-field wavefront aberrations, thereby enabling accurate reconstruction of incident optical wavefronts. The experimental workflow is shown in Figure 3:
The core architecture of our neural network is based on the Convolutional Attention Network (CoAtNet) [23], with comparative implementations using MobileViT (a lightweight vision Transformer) [24] and ResNet34. The CoAtNet architecture combines Transformer principles with convolutional operations, enhancing both expressive power and computational efficiency. As shown in Figure 4, the network uses a sequential arrangement of convolutional blocks and self-attention modules for automated feature extraction through deep learning.
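To make the input-output structure concrete, the following PyTorch sketch shows a simplified hybrid network of this kind: convolutional stages for local feature extraction, self-attention stages for global context aggregation, and a regression head producing the 36 Zernike coefficients. It is an illustrative stand-in for the CoAtNet backbone, not the configuration used in this work; the class names, stage widths, and depths are assumptions.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # simplified stand-in for CoAtNet's convolutional (MBConv) stages
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.GELU(),
            nn.Conv2d(c_out, c_out, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class HybridWavefrontNet(nn.Module):
    # conv stages (local detail) -> Transformer stages (global context) -> 36 Zernike coefficients
    def __init__(self, dim=128, depth=4, heads=4, n_zernike=36):
        super().__init__()
        self.stem = ConvBlock(1, 32)     # 256 -> 128
        self.s1 = ConvBlock(32, 64)      # 128 -> 64
        self.s2 = ConvBlock(64, dim)     # 64  -> 32
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                         dim_feedforward=4 * dim, batch_first=True)
        self.s3 = nn.TransformerEncoder(enc, depth)   # self-attention stages
        self.head = nn.Linear(dim, n_zernike)

    def forward(self, x):                       # x: (B, 1, 256, 256) focal-plane image
        f = self.s2(self.s1(self.stem(x)))      # (B, dim, 32, 32) feature map
        seq = f.flatten(2).transpose(1, 2)      # (B, 1024, dim) token sequence
        seq = self.s3(seq)
        return self.head(seq.mean(dim=1))       # (B, 36) Zernike coefficients

print(HybridWavefrontNet()(torch.randn(2, 1, 256, 256)).shape)   # torch.Size([2, 36])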
To improve network performance, we added a Normalization-based Attention Module (NAM) between stages S2 and S3 of the CoAtNet framework and conducted ablation studies against the baseline architecture. The NAM module uses channel attention to recalibrate channel-wise weights, applying batch normalization to compute channel importance scores and modulating feature maps through sigmoid-activated weight allocation [25].
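A minimal PyTorch sketch of this channel-attention branch, following the description above (the class name and placement call are illustrative; see [25] for the exact NAM formulation):

import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    # Channel attention driven by batch-normalization scale factors (gamma):
    # channels with larger |gamma| receive larger importance scores.
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=True)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        weight = self.bn.weight.abs() / self.bn.weight.abs().sum()   # channel importance scores
        x = x * weight.view(1, -1, 1, 1)
        return residual * torch.sigmoid(x)                           # sigmoid-activated reweighting

# inserted between CoAtNet stages S2 and S3: features = NAMChannelAttention(channels)(features)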
We also implemented the MobileViT and ResNet34 architectures as comparative benchmarks. The MobileViT core module consists of Unfold-Transformer-Fold operations, where the Unfold/Fold steps reshape feature tensors into and out of the token sequences processed by the Transformer. Our configuration uses three cascaded MobileViT blocks interleaved with convolutional and fully connected layers, as shown in Figure 5. The ResNet34 architecture serves as a standard deep residual network, providing baseline performance for comparison.
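As an illustration of the Unfold-Transformer-Fold pattern, the sketch below implements a single simplified MobileViT-style block in PyTorch. The patch size, depth, and channel width are placeholder values, and the actual MobileViT blocks [24] additionally use depthwise convolutions and separate projection layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniMobileViTBlock(nn.Module):
    def __init__(self, channels=64, patch=2, depth=2, heads=4):
        super().__init__()
        self.local_rep = nn.Conv2d(channels, channels, 3, padding=1)   # local representation
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.patch = patch

    def forward(self, x):
        y = self.local_rep(x)
        B, C, H, W = y.shape
        p = self.patch
        # Unfold: split the feature map into non-overlapping p x p patches and regroup pixels
        # so that the Transformer attends across patches (global mixing).
        seq = F.unfold(y, kernel_size=p, stride=p)                      # (B, C*p*p, L)
        L = seq.shape[-1]
        seq = seq.view(B, C, p * p, L).permute(0, 2, 3, 1).reshape(B * p * p, L, C)
        seq = self.transformer(seq)
        # Fold: reassemble the token sequence back into a feature map of the original size.
        y = seq.reshape(B, p * p, L, C).permute(0, 3, 1, 2).reshape(B, C * p * p, L)
        y = F.fold(y, output_size=(H, W), kernel_size=p, stride=p)
        return self.fuse(torch.cat([x, y], dim=1))                      # fuse with the input

block = MiniMobileViTBlock()
print(block(torch.randn(1, 64, 32, 32)).shape)                          # torch.Size([1, 64, 32, 32])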
The network processes 256 × 256 single-channel grayscale images as input and generates 36 × 1 output vectors representing the 1st to 36th order Zernike polynomial coefficients. A comparative analysis of the architectural complexity and parameter counts for the four networks is presented in Table 2.
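As a concrete example of this input/output adaptation, a torchvision ResNet34 can be modified to accept single-channel 256 × 256 images and output 36 coefficients, and its parameters counted for a Table 2-style comparison (a sketch; the exact ResNet34 configuration used here may differ):

import torch
import torchvision

model = torchvision.models.resnet34(weights=None)                                      # no pretraining
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)   # single-channel input
model.fc = torch.nn.Linear(model.fc.in_features, 36)                                   # 36 Zernike coefficients

x = torch.randn(1, 1, 256, 256)
print(model(x).shape)                                # torch.Size([1, 36])
print(sum(p.numel() for p in model.parameters()))    # total parameter count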
4. Experiments and Analysis
The simulation used power spectrum inversion to generate 100,000 atmospheric phase screens with randomly sampled atmospheric coherence lengths (0.03–0.15 m), turbulence parameters (outer scale: 100 m; inner scale: 0.1 m), and optical specifications (wavelength: 1064 nm; pupil diameter: 54 mm; focal length: 50 mm). Ground-truth labels, the first 36 Zernike polynomial coefficients of each screen, were obtained by Moore–Penrose pseudoinverse fitting. Phase modulation was implemented using the W4, W5, and W12 Walsh functions, followed by near-field to far-field intensity conversion via Fresnel diffraction, yielding 100,000 far-field intensity maps (90,000 for training, 10,000 for validation).
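The sketch below illustrates this sample-generation pipeline in NumPy under simplifying assumptions: a Fraunhofer (single-FFT) approximation of the focal-plane intensity in place of full Fresnel propagation, a toy four-term Zernike-like basis instead of the 36-term set, and an illustrative 0/π quadrant mask standing in for the W4/W5/W12 Walsh plates; all variable names are placeholders.

import numpy as np

N = 256
y, x = np.mgrid[-1:1:N * 1j, -1:1:N * 1j]
r2 = x**2 + y**2
pupil = (r2 <= 1.0).astype(float)

# toy aberration: random combination of tilt, defocus and astigmatism terms
rng = np.random.default_rng(42)
basis = np.stack([x, y, 2 * r2 - 1, x**2 - y**2])             # truncated Zernike-like basis (4, N, N)
true_coeffs = rng.normal(scale=0.5, size=4)
phi = np.tensordot(true_coeffs, basis, axes=1)                # pupil-plane phase [rad]

# ground-truth label: least-squares Zernike fit via the Moore-Penrose pseudoinverse
Z = basis.reshape(4, -1)[:, pupil.ravel() > 0].T              # (pixels inside pupil, 4)
labels = np.linalg.pinv(Z) @ phi.ravel()[pupil.ravel() > 0]   # recovers true_coeffs

# Walsh-type binary phase modulation and far-field (focal-plane) intensity: the network input
walsh = np.pi * ((x > 0) & (y > 0)).astype(float)
field = pupil * np.exp(1j * (phi + walsh))
intensity = np.abs(np.fft.fftshift(np.fft.fft2(field)))**2

print(labels.round(3), intensity.shape)                        # label vector and (256, 256) image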
The computational platform used an NVIDIA A100 GPU (40 GB memory, NVIDIA, Santa Clara, CA, USA) and an Intel Xeon Gold 6338 CPU (2.00 GHz, Intel, Santa Clara, CA, USA). Training ran for up to 300 epochs with a batch size of 256 and adaptive learning rate scheduling (initial rate: 0.01); optimization stopped at 300 epochs or upon loss convergence, with the RMSE between predicted and ground-truth Zernike coefficients as the loss metric.
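A minimal training-loop sketch consistent with this protocol is given below; the Adam optimizer and ReduceLROnPlateau scheduler are assumptions, since only an adaptive learning-rate schedule with an initial rate of 0.01 is specified.

import torch
from torch.utils.data import DataLoader

def rmse_loss(pred, target):
    # RMSE between predicted and ground-truth Zernike coefficient vectors
    return torch.sqrt(torch.mean((pred - target) ** 2))

def train(model, train_set, val_set, device="cuda"):   # use "cpu" if no GPU is available
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_set, batch_size=256)
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=10)
    for epoch in range(300):
        model.train()
        for images, coeffs in loader:                   # (B, 1, 256, 256), (B, 36)
            opt.zero_grad()
            loss = rmse_loss(model(images.to(device)), coeffs.to(device))
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(rmse_loss(model(im.to(device)), c.to(device)).item()
                      for im, c in val_loader) / len(val_loader)
        sched.step(val)                                 # adapt the learning rate on plateaus
        print(f"epoch {epoch}: validation RMSE {val:.4f}")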
Figure 6 shows the loss function trajectories across training iterations, while Table 3 presents the root mean square error (RMSE) of the Zernike coefficients on the validation set, highlighting performance variations among the evaluated models.
As shown by the training dynamics in Figure 6, all models reached performance saturation around epoch 200, with the W12-modulated datasets showing the lowest converged loss values. The quantitative metrics in Table 3 indicate that the NAM-CoAtNet model trained with W12-modulated data achieves the lowest root mean square error (RMSE = 0.0081) on the validation set, validating the effectiveness of our NAM-enhanced architecture in extracting discriminative features from diffraction-based far-field intensity patterns.
The reconstructed wavefronts derived from the inferred Zernike coefficients were evaluated using the normalized wavefront error (NWE), defined as the ratio of the root mean square (RMS) of the wavefront residual to the RMS of the original wavefront. This dimensionless metric provides a standardized error measure for cross-system comparative analysis, formulated as

NWE = \frac{\sqrt{\sum_{i=1}^{N} (\hat{W}_i - W_i)^2}}{\sqrt{\sum_{i=1}^{N} W_i^2}},

where \hat{W}_i denotes the reconstructed wavefront value at the i-th pixel, W_i represents the corresponding ground truth value at the i-th spatial position, and N is the number of pixels within the pupil. Quantitative evaluations were performed on 800 validation set images through systematic inference experiments, with the empirical results summarized in Table 4.
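In code, the metric is a direct transcription of this definition (a sketch; the optional mask restricting the computation to pupil pixels is an assumption):

import numpy as np

def normalized_wavefront_error(w_rec, w_true, mask=None):
    # NWE = RMS of the wavefront residual divided by RMS of the original wavefront
    if mask is not None:
        w_rec, w_true = w_rec[mask], w_true[mask]
    residual_rms = np.sqrt(np.mean((w_rec - w_true) ** 2))
    reference_rms = np.sqrt(np.mean(w_true ** 2))
    return residual_rms / reference_rms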
Table 4 shows the normalized wavefront error (NWE) of the Zernike coefficient predictions across the evaluated neural architectures on the validation set. The data reveal that the CoAtNet framework performs best with W12-modulated data (NWE: 7.2%), a reduction of 2.6 and 3.3 percentage points compared with ResNet34 (NWE: 9.8%) and MobileViT (NWE: 10.5%), respectively. The NAM-enhanced variant (NAM-CoAtNet) further reduces the error to 5.3% NWE, the best prediction accuracy among the evaluated models. Cross-modulation analysis shows that all three baseline architectures (ResNet34, MobileViT, and CoAtNet) achieve their minimal NWE values with W12 modulation, highlighting the importance of W12 phase encoding in improving Zernike coefficient estimation precision.
Notably, NAM-CoAtNet shows significant accuracy improvements over vanilla CoAtNet under both W5 and W12 modulation conditions. This performance gain validates the NAM module's ability to optimize channel attention, enabling prioritized processing of wavefront-critical features through adaptive feature recalibration (see Section 3).
Figure 7 compares the predicted and reference Zernike coefficients under W12 modulation, showing strong correlation (Pearson's r > 0.94) across modes Z16–Z36. These results validate the effectiveness of integrating convolutional and Transformer layers with attention mechanisms for diffraction feature extraction.
In total, 500 pairs of complex conjugate phase screens were generated, modulated using W12, and diffracted to obtain far-field intensity patterns. The trained model was then used to predict the wavefronts, where W represents the original wavefront and W* denotes the complex conjugate wavefront. The normalized wavefront error (NWE) between the reconstructed and original wavefronts was calculated, with the results summarized in Table 5. Additionally, images of the far-field diffraction patterns corresponding to the Zernike coefficients predicted by the NAM-CoAtNet network are shown in Figure 8.
The experimental results confirm that Walsh function-modulated far-field intensity patterns enable accurate near-field wavefront reconstruction through model inference. The NAM-CoAtNet architecture achieves optimal performance with normalized wavefront errors of 5.4% (original wavefront) and 6.3% (phase-conjugated counterpart), demonstrating its ability to distinguish complex-conjugate phase relationships through diffraction pattern analysis.
To assess practical applicability, we designed an offset-augmented dataset simulating 256 × 256 focal plane images with lateral displacements of 0–50 pixels from the optical axis. Using identical training protocols (90,000 training/10,000 validation samples), the system maintains stable reconstruction accuracy, as shown in Figure 9, with less than 8% NWE degradation under the maximum displacement condition.
As shown in Figure 9, the system preserves the original wavefront morphology and maintains consistent coefficient variation patterns, demonstrating robust feature-encoding capabilities against displacement.
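One plausible way to synthesize such laterally displaced focal-plane images (the exact augmentation procedure is not detailed above) is to shift each simulated image off the optical axis and zero-pad the exposed border, as in the sketch below.

import numpy as np

def shift_focal_image(image, dx, dy):
    # Shift the focal-plane image by (dy, dx) pixels without wrap-around,
    # emulating a lateral displacement of the spot from the optical axis.
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    if dy > 0:
        shifted[:dy, :] = 0
    elif dy < 0:
        shifted[dy:, :] = 0
    if dx > 0:
        shifted[:, :dx] = 0
    elif dx < 0:
        shifted[:, dx:] = 0
    return shifted

# e.g. a random displacement of up to 50 pixels:
# shift_focal_image(img, np.random.randint(0, 51), np.random.randint(0, 51))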
To address simulation-to-reality discrepancies, we established the systematic validation platform shown in Figure 10:
A 100 mW laser operating at 1064 nm was collimated by a beam expander to generate a stabilized beam. The collimated beam passed through a deformable mirror (wavefront corrector) programmed to simulate atmospheric turbulence-induced aberrations. A beam splitter then divided the beam into two paths: the transmitted beam was directed to a Shack–Hartmann wavefront sensor (36 × 36 lenslet array, μm resolution) for aberration measurement, while the reflected beam was modulated by a W5 Walsh phase plate and focused through a 54 mm aperture onto an imaging camera.
Figure 11 shows the light field detected by the Shack–Hartmann wavefront sensor (SHWS) following Walsh phase modulation, along with the far-field intensity distribution captured at the focal plane through the Walsh phase plate.
The experimental results are shown in Table 6:
According to Table 6, the normalized wavefront error (NWE) values of all models in the real-world experiments are consistently higher than in the simulations, with MobileViT and ResNet34 showing significantly worse performance. This discrepancy may arise from practical challenges in real-world settings, such as ambient light interference, instrumental errors, and measurement inaccuracies, which can compromise data quality and degrade results.
Among the models tested, NAM-CoAtNet shows the lowest NWE (7.7%) in experiments, indicating its stronger adaptability and robustness. However, its experimental NWE remains higher than the simulation results, highlighting the gap between ideal and practical conditions. Future work could focus on improving experimental protocols (e.g., environmental controls, sensor calibration) or enhancing model architectures to better handle real-world perturbations.