Article

Enhancement of Optical Wireless Discrete Multitone Channel Capacity Based on Li-Fi Using Sparse Coded Mask Modeling

1 Department of Electronic Engineering, Myongji University, 116 Myongji-ro, Cheoin-gu, Yongin 17058, Republic of Korea
2 School of Computer Science, Kookmin University, 77 Jeongneung-ro, Sungbuk-gu, Seoul 02707, Republic of Korea
* Authors to whom correspondence should be addressed.
Photonics 2025, 12(4), 395; https://doi.org/10.3390/photonics12040395
Submission received: 3 March 2025 / Revised: 3 April 2025 / Accepted: 17 April 2025 / Published: 18 April 2025
(This article belongs to the Special Issue Optical Signal Processing for Advanced Communication Systems)

Abstract: A sparse coded mask modeling technique is proposed to increase the transmission capacity of an optical wireless link based on Li-Fi. The learning model for the discrete multitone (DMT) signal waveform is implemented using the proposed technique, which is designed based on a masked auto-encoder. The entire length of the DMT signal waveform, encoded using quadrature phase shift keying (QPSK) or 16-quadrature amplitude modulation (16-QAM) symbols, is divided into equal intervals to generate DMT patches, which are subsequently compressed based on the specified masking ratio. After 1-m optical wireless transmission, the DMT signal waveform is reconstructed from the received DMT patch through a decoding process and then QPSK or 16-QAM symbols are recovered. Using the proposed technique, we demonstrate that we can increase the transmission capacity by up to 1.85 times for a 10 MHz physical bandwidth. Additionally, we verify that the proposed technique is feasible in Li-Fi networks with illumination environments above 240 lux.

1. Introduction

Light fidelity (Li-Fi) technology based on white light-emitting diodes (LEDs) has been recognized as a complementary solution to existing wireless fidelity (Wi-Fi) transmission technologies [1,2]. It is particularly expected to serve as an alternative communication service in environments where high security is prioritized or where the effects of electromagnetic interference must be eliminated. Additionally, with the widespread installation of LED lighting infrastructure, this technology offers the advantage of providing both lighting and communication services simultaneously to users. Recently, it has proven its value in specialized fields such as underwater, space, and nuclear power plant environments [3,4].
However, the optical wireless transmission performance of Li-Fi using LED lighting is limited by the low-frequency response of the LED chips themselves, making it difficult to achieve transmission rates greater than 1 Mb/s. To address this, various techniques have been proposed to increase the transmission rate of existing Li-Fi links based on white LEDs, for example, equalizer circuits, blue lens filters, and modulation techniques that use bandwidth efficiently [5,6,7,8]. These techniques often require additional physical hardware or strict adherence to the Nyquist sampling theorem. Recently, there have been active studies applying deep neural network techniques to digital signal processing, moving beyond traditional signal processing methods [9,10,11,12,13,14].
In this paper, we propose a masked autoencoder (MAE)-based sparse coded mask modeling technique to increase the channel capacity of discrete multitone (DMT) signals transmitted and received over a Li-Fi link based on a white LED. Since DMT signals are modulated with user data across multiple orthogonal subcarriers and generated in parallel simultaneously, the set of concurrent time-series subcarrier signals can be recognized as multivariate time-series data (MTSD). Research on analyzing and predicting such diverse MTSD has been conducted [15,16,17,18]. In Section 1.1, we provide a detailed explanation of the motivation based on the masked autoencoder, which serves as the foundation for proposing and implementing our sparse coded mask modeling technique.

1.1. Technical Motivation Based on MAE

1.1.1. Fundamental Reason: True Time-Domain Compression and Self-Supervised Reconstruction

Physical Signal Shortening vs. Traditional Equalization
  • Traditional deep learning methods for optical wireless links (e.g., CNN- or DNN-based equalizers) assume full waveform transmission and focus on compensating channel impairments.
  • Our goal, however, is to physically “shorten” the transmitted DMT waveform (by deleting large fractions of samples), thereby freeing up more channel capacity.
  • Because the MAE forces the model to learn how to “fill in” missing segments, we can remove a substantial portion (up to 75–85%) of the waveform and still reconstruct it accurately at the receiver.
  • This self-supervised “inpainting” approach—where the network learns from unmasked patches to predict masked ones—differentiates the MAE from other autoencoders or supervised methods.
Self-Supervised Learning with Minimal Overhead
  • A masked autoencoder does not require specialized pilot sequences or external labels; the only references for missing parts are the original waveforms themselves.
  • It is thus naturally suited to handling missing data in a self-supervised manner, minimizing extra signaling overhead often necessary in end-to-end or classical channel-estimation approaches.
Better Generalization and Reduced Overfitting
  • Through random masking, an MAE forces the model to learn global structure rather than memorizing local correlations.
  • Our experiments show that increasing the masking ratio can paradoxically decrease errors (EVM) because the model becomes more robust to variations in the DMT waveform.
  • Standard autoencoders (without masking) risk local overfitting, as they see the entire waveform in each training pass.

1.1.2. Technical Justification: Treating DMT as Multivariate Time-Series

DMT Waveforms as Patches
  • We divide the original DMT waveform into patches, then randomly delete (mask) a fraction of them. The MAE reconstructs these missing patches by contextualizing the unmasked patches.
  • This works well because DMT signals are effectively multivariate time-series data, with multiple subcarriers modulated in parallel.
  • The MAE’s self-attention mechanism can exploit both time-domain continuity and cross-subcarrier correlations more effectively than purely local methods.
Positional Encoding for Time-Series Reconstruction
  • We embed each patch with a positional encoding, so the MAE knows where each patch belongs.
  • This is crucial for a time-series context: unlike images or simple feed-forward networks, we need to preserve temporal and subcarrier ordering.
  • By leveraging the transformer-style attention, the MAE can identify subtle relationships across patches, enabling accurate reconstructions even with high masking ratios.

1.1.3. Why Not a Standard Autoencoder or Other Alternatives?

Standard Autoencoders
  • Typical autoencoders assume access to all (or nearly all) input features, then compress them into a latent dimension.
  • They do not physically remove data but only map the full input to a lower-dimensional latent space. Hence, they do not yield a genuine time-domain compression that can be exploited for increasing channel capacity.
  • By contrast, an MAE literally masks out entire patches, demanding that the decoder reconstruct them from only partial information.
CNN or FNN-based Approaches
  • CNNs can handle spatially or temporally correlated data but typically focus on local receptive fields; they struggle to capture long-range dependencies if large chunks of data are missing.
  • FNNs fully connect all inputs but are not naturally suited to partial-observation “inpainting” tasks.
  • MAEs (particularly those using self-attention) adapt more easily to scenarios with large missing portions by leveraging global context in an end-to-end learned fashion.
End-to-End Deep Learning without Masking
  • While end-to-end methods can improve channel equalization, they still transmit the complete waveform.
  • Consequently, they cannot provide the effective capacity gains we achieve by physically not sending up to 75–85% of the samples.

1.1.4. Practical Impact of the Proposed Technique on Li-Fi Transmission

Increasing Channel Capacity Without New Hardware
  • Li-Fi’s bandwidth limitations arise from LED response constraints. Our approach circumvents these constraints by truly compressing the waveform in time.
  • Even with standard optics and the same nominal 10 MHz bandwidth, we achieve up to 1.85× capacity by transmitting fewer samples per symbol interval and reconstructing them later.
Robustness Under Realistic Conditions
  • As we showed experimentally (with 1 m free-space Li-Fi, indoor illumination levels of 300–440 lux), the MAE-based approach remains feasible and robust.
  • A high masking ratio (75–85%) still yields acceptable EVM under the first-generation GFEC threshold.
In summary, the motivation for using a masked autoencoder (MAE) in our Li-Fi DMT system is that it uniquely enables:
  • Actual time-domain compression (masking) and subsequent self-supervised reconstruction,
  • Improved generalization through forced global pattern learning, and
  • Real capacity gains without hardware changes, going beyond mere channel equalization.
These benefits are not readily realized by standard autoencoders, CNNs, or purely end-to-end supervised networks. Hence, we designed our system around the MAE paradigm to capitalize on its inherent ability to inpaint missing signals and ultimately boost Li-Fi transmission capacity.
The process of implementing the proposed technique can be summarized as follows. The entire DMT signal waveform is divided into N sub-waveforms, and K of these sub-waveforms are randomly selected and masked (deleted) during encoding. As a result, the encoded DMT signal has a reduced length compared to the original signal. After the encoded DMT signal is transmitted through the Li-Fi wireless link, the decoding process learns from the information in the adjacent unmasked waveforms in order to recover the K deleted sub-waveforms. To the best of our knowledge, this is the first time that an MAE-based technique has been proposed to increase the channel capacity of a DMT signal by reducing (masking) its length and then recovering it. We aim to convey the key contribution of our proposed technique more clearly by detailing the differences between our approach and previously proposed techniques for improving the transmission performance of optical signals, as discussed in Section 1.2.

1.2. Other Existing Works and Key Contributions of Proposed Technique

First, as shown in Table 1, we have organized and presented the differences between the previously proposed techniques mentioned in the references and our proposed method to clearly highlight the distinctions.
The key contribution of our proposed technique can be summarized as follows.
True Time-Domain Compression Through Random Masking
  • Unlike many deep learning approaches that transmit the complete waveform and focus on post-transmission equalization, our method removes (masks) up to 75–85% of the DMT signal pre-transmission.
  • A masked autoencoder then reconstructs these missing segments at the receiver, which increases the effective channel capacity up to 1.85× over a 10 MHz bandwidth.
Viewing Multicarrier DMT Waveforms as Time-Series ‘Patches’
  • We divide the QPSK/16-QAM DMT waveforms into 1-D patches and apply positional encoding to identify each patch’s location.
  • This design captures long-range dependencies via self-attention, facilitating robust patch-wise masking and inpainting—even under high masking ratios.
Extensive Experimental Validation
  • We systematically measure EVM versus masking ratio, number of patches, and batch size in an actual Li-Fi testbed (1 m range, 300–440 lux).
  • Notably, our results show that higher masking ratios can sometimes reduce errors because the MAE is forced to learn global waveform structure instead of overfitting local features.
  • Our approach remains practically feasible above ~240 lux, typical of indoor illumination levels.
In sum, our work differentiates itself by physically removing large portions of the transmitted signal and then reconstructing them—a methodology not found in conventional deep learning-based optical enhancements. This leads to a true increase in data throughput for Li-Fi systems, corroborated by real-world experiments. We emphasize once more that this “partial signal transmission + global context reconstruction” is a novel concept enabling substantial capacity gains under standard indoor lighting conditions.

1.3. Conceptual Comparison with Representative State-of-the-Art (SOTA) Techniques

Our study focused primarily on validating the effectiveness of masking and reconstructing DMT signals using a deep learning-based self-supervised model, specifically a transformer-based MAE, which is a novel approach in the context of optical wireless Li-Fi systems. As such, there are few existing works that employ the same kind of signal-level reconstruction via learned masking strategies, especially in the time domain of DMT signals.
In contrast, many SOTA Li-Fi enhancement techniques focus on:
  • Hardware improvements, such as LED driver optimization or photodiode sensitivity enhancement.
  • Equalization/filtering methods, such as RLS/FIR/DFE adaptive filters.
  • Modulation techniques, like CAP, OFDM with high-order QAM, or VLC-optimized schemes.
These methods, while effective, differ fundamentally in their domain of operation (e.g., analog front-end, frequency-domain digital processing) and are not directly comparable at the waveform compression-reconstruction level, which is the key innovation of our method.
Table 2 shows the conceptual and qualitative comparison below, focusing on performance attributes.
  • Our method supports waveform compression up to 85% masking, which traditional methods do not address.
  • Unlike FNN/CNN models that are limited to local features or final-stage demapping, the MAE captures long-range temporal dependencies and reconstructs full time-series waveforms.
  • Our approach also works without requiring strict channel models or Nyquist compliance, which many traditional DSP techniques depend on.
The remainder of this paper is organized as follows. Section 2 explains how the proposed sparse coded masked model encodes by masking the DMT signal and then decodes to recover it. Section 3 describes the implementation of the Li-Fi optical wireless link used to experimentally verify the proposed technique. Section 4 presents various experimental results on the transmission performance of the proposed technique measured through the implemented Li-Fi optical wireless link. Finally, Section 5 provides a summary and conclusion.

2. Augmentation of Li-Fi Wireless DMT Transmission Capacity Through Sparse Coded Mask Optimization

2.1. Sparse Coded Mask Modeling for DMT Signal

As mentioned in the introduction, the DMT signal can be considered a type of MTSD. The reasons for this are as follows: First, each subcarrier in the DMT signal can be viewed as a modulated signal over time, and thus can be recognized as a function of time. In this respect, it is very similar to time-series data. Second, if the DMT signal is analyzed in the time domain for each subcarrier, each subcarrier signal can be considered a separate time-series dataset. In this case, these individual time-series datasets coexist simultaneously and interact with each other (by maintaining orthogonality between subcarriers). Therefore, the entire set of the DMT signal can be viewed as MTSD. Based on the results of the existing studies [15,16,17,18], we anticipate that it should be possible to mask (delete) and then recover DMT signals, which exhibit characteristics that are very similar to MTSD.
Figure 1 shows the process of increasing the transmission capacity of DMT signals during optical wireless transmission using the proposed sparse coded mask modeling technique. A DMT signal, $X$, with $N$ subcarriers, encoded with quadrature phase shift keying (QPSK) symbols, is fed into a masked autoencoder and then divided into $k$ segments. These $k$ segments are transformed into DMT patches through a one-dimensional convolution operation. Afterward, a positional encoding process is applied, adding one positional encoding (marked in black in Figure 1) to the $k$ DMT patches. Out of the $k$ DMT patches, only $d$ (where $d \le k$) patches are randomly retained, while the $k - d$ remaining patches are masked (deleted). Finally, the latent representation of the DMT signal, which fully reflects the original characteristics of the DMT signal, is obtained as the output signal of the encoder. Therefore, since the length of the encoded DMT signal (the latent representation) after masking is reduced compared to the original DMT signal, the length of the encoded optical DMT signal, after being directly modulated by the white LED and then transmitted, is also reduced. This means that, from the perspective of a typical data transmission system, the DMT signal is transmitted in a compressed form. After optical wireless transmission, the encoded DMT signal received at the photo-detector is recovered to the original DMT signal through a decoding process. Therefore, by compressing the length of the original DMT signal during optical wireless transmission, the channel capacity within one second is increased.
First, the process by which the length of the QPSK–encoded DMT signal is reduced (compressed) using the proposed MAE–based sparse coded mask modeling technique can be mathematically described as follows.
The DMT signal, $U_m$, corresponding to one symbol in Figure 1, can be expressed as Equation (1):

$$U_m = \frac{1}{2N}\sum_{n=0}^{2N-1} C_n \exp\left(\frac{j2\pi n m}{2N}\right), \quad m = 0, 1, \ldots, 2N-1 \quad (1)$$

where $U_m$ is a sequence of real-valued numbers composed of $2N$ points, obtained by applying the IFFT (Inverse Fast Fourier Transform) over all $2N$ points, $N$ is the number of subcarriers, and $C_n = A_n + jB_n$ is the complex value determined by QPSK modulation, with $A_n$ the in-phase component and $B_n$ the quadrature component.
Next, a cyclic prefix (CP) of length $N_{cp}$ was added to the DMT signal $U_m$ in order to maintain orthogonality, as shown in Equation (2):

$$X_m = \frac{1}{2N}\sum_{n=0}^{2N-1} C_n \exp\left(\frac{j2\pi n (m - N_{cp})}{2N}\right), \quad m = 0, 1, \ldots, 2N + N_{cp} - 1 \quad (2)$$

where $X_m$ is the DMT signal with CP. After parallel-to-serial conversion, its total length becomes $(2N + N_{cp})$. This is fed into the MAE.
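To make the signal construction in Equations (1) and (2) concrete, a minimal NumPy sketch is given below. The Hermitian-symmetric spectrum construction and the choice of N = 256 (so that one symbol spans 2N + N_cp = 528 samples, consistent with the per-symbol length reported in Section 3.1) are illustrative assumptions rather than the exact implementation used in the experiments.

```python
# Minimal NumPy sketch of Equations (1)-(2): a QPSK-encoded DMT symbol with a
# cyclic prefix. N = 256 and the Hermitian-symmetric spectrum are illustrative
# assumptions chosen so that one symbol spans 2N + N_cp = 528 real samples.
import numpy as np

N = 256          # number of subcarriers used in this sketch
N_cp = 16        # cyclic prefix length

# QPSK symbols C_n = A_n + jB_n; the DC bin is left unmodulated
bits = np.random.randint(0, 2, (N, 2))
C = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)

# 2N-point Hermitian-symmetric spectrum so that the IFFT output U_m is real-valued
spectrum = np.zeros(2 * N, dtype=complex)
spectrum[1:N] = C[1:]                      # positive-frequency bins
spectrum[N + 1:] = np.conj(C[1:][::-1])    # mirrored negative-frequency bins

U = np.fft.ifft(spectrum).real             # Equation (1): 2N real samples per symbol

# Equation (2): prepend the last N_cp samples as a cyclic prefix
X = np.concatenate([U[-N_cp:], U])
print(X.shape)                             # (528,) = 2N + N_cp samples
```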
The MAE process consists of two stages. First, the waveform of the DMT signal with CP, $X_m$, is divided into $S$ segments, where the length of a single segment $x_i(m)$ is $(2N + N_{cp})/S$, as shown in Equation (3). Next, these segments are transformed into $S$ DMT patches $P_i(m)$ using a one-dimensional (1-D) convolution filter (kernel size $K = 16$, stride = 16), as shown in Equation (4):

$$X_m = \sum_{i=1}^{S} x_i(m) \quad (3)$$

$$P_i(m) = \sum_{k=0}^{K-1} h_k \cdot x_i(m-k), \quad i = 1, 2, \ldots, S \quad (4)$$

where $P_i(m)$ is the $i$-th extracted feature patch, $x_i(m)$ is the $i$-th DMT signal segment, $S$ represents the total number of patches, and the kernel $h_k$ is applied uniformly across all segments. The summation iterates over the kernel length $K$, ensuring localized feature extraction. This convolution operation allows the network to learn meaningful representations of the input signal by capturing its temporal dependencies.
In our proposed technique, the 1-D convolution filter is applied to the DMT signal to obtain a feature representation before the masking operation. The primary purpose of this step is to transform each segment of the DMT signal into a feature space suitable for masked autoencoder (MAE)-based processing. The 1-D convolution filter is designed as a learnable kernel during the training phase: it is initialized randomly and optimized through backpropagation to extract the most informative features from the input DMT signal. All DMT signal segments are processed using the same 1-D convolution filter. The rationale behind using the same filter for all segments is to maintain consistency in feature extraction, ensuring that the learned representations generalize across the entire signal. This shared filtering process enhances the robustness of the feature extraction while preserving the structural characteristics of the input DMT waveform.
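A minimal PyTorch sketch of this shared 1-D convolutional patch embedding (Equation (4)) is shown below; the 2272-sample frame, kernel size/stride of 16, and embedding dimension of 1024 follow the values quoted later in this paper, while the specific layer configuration is an assumption for illustration.

```python
# Minimal sketch of the shared 1-D convolutional patch embedding of Equation (4);
# the frame length (2272 samples), kernel size/stride (16), and embedding
# dimension (1024) follow the values quoted in this paper.
import torch
import torch.nn as nn

embed_dim = 1024
patch_conv = nn.Conv1d(in_channels=1, out_channels=embed_dim,
                       kernel_size=16, stride=16)   # learnable kernel h_k, shared by all segments

x = torch.randn(1, 1, 2272)           # one serialized DMT frame: (batch, channel, samples)
patches = patch_conv(x)               # (1, 1024, 142): one feature vector per 16-sample segment
patches = patches.transpose(1, 2)     # (1, 142, 1024): 142 embedded patches of dimension 1024
print(patches.shape)
```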
Also, the positional encoding patch is represented as shown in Equation (5):

$$E_{position}(n, m) = \begin{cases} \sin\left(\dfrac{n}{10000^{\,m/L_{model}}}\right), & m = 2j \\ \cos\left(\dfrac{n}{10000^{\,m/L_{model}}}\right), & m = 2j + 1 \end{cases} \quad (5)$$

where $n \in \{1, \ldots, k\}$ is the position index of the DMT signal segment in the sequence, $m$ is the feature dimension index within the model, $L_{model}$ is the total feature dimension after encoding, and $j \in \{1, \ldots, L_{model}/2\}$ is an integer index. This formulation is inspired by the sinusoidal positional encoding commonly used in transformer-based models. It ensures that different frequency components are assigned to each position in a way that helps the model learn relative positions between different DMT signal segments.
To be more specific, the position index of the DMT signal, $n$, represents the absolute position of a given DMT signal segment within the entire sequence. This ensures that each segment of the input signal is uniquely encoded with a position-dependent value. Since DMT signals are time-series data, maintaining this position information is crucial for preserving sequential dependencies. The feature dimension index, $m$, corresponds to the dimension of the feature vector in the model. The model transforms the original DMT signal into a high-dimensional feature space, where each feature dimension is indexed by $m$. The equation differentiates even-indexed and odd-indexed feature dimensions, ensuring orthogonality between different components of the encoding.
The use of sine and cosine functions in Equation (5) serves to inject position-dependent variations into the input features, which allows the model to differentiate between different time steps in a continuous manner. The encoding is structured as follows: the even dimensions ($m = 2j$) use the sine function, which varies gradually over time, while the odd dimensions ($m = 2j + 1$) use the cosine function, which is phase-shifted by $\pi/2$, providing an alternating representation. This alternating pattern ensures that the encoding remains linearly separable, meaning that the model can compute meaningful distances between different positions in the sequence.
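The following short sketch implements the positional encoding of Equation (5) as written; the number of positions and the feature dimension are illustrative values.

```python
# Minimal sketch of the sinusoidal positional encoding of Equation (5);
# the number of positions and feature dimension are illustrative values.
import numpy as np

def positional_encoding(num_positions: int, L_model: int) -> np.ndarray:
    """Return E[n, m] following Equation (5)."""
    E = np.zeros((num_positions, L_model))
    for n in range(num_positions):
        for m in range(L_model):
            angle = n / (10000 ** (m / L_model))
            E[n, m] = np.sin(angle) if m % 2 == 0 else np.cos(angle)
    return E

E_position = positional_encoding(num_positions=142, L_model=1024)
print(E_position.shape)   # (142, 1024)
```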
In the random masking process, $d$ patches are randomly selected ($d \le S$) out of the $S$ patches. The remaining ($S - d$) patches are all masked, or in other words, deleted. Eventually, the retained patches, together with the positional encoding patch $E_{position}$, become the latent representation of the DMT signal. In this way, the length of the transmitted DMT signal is reduced; from the perspective of a typical data transmission system, this can be referred to as compression. The set of randomly selected patches with $d$ elements, $P_{patch}$, can be expressed as shown in Equation (6).

$$P_{patch} = \left\{ P_1, P_2, \ldots, P_m, \ldots, P_S \;\middle|\; P_i(m) = \sum_{k=0}^{K-1} h_k \cdot x_i(m-k) \right\} \quad (6)$$
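A minimal sketch of the random masking step is given below, assuming the patch tensor produced by the embedding sketch above; the index bookkeeping (ids_keep, ids_shuffle) is an illustrative convention rather than the authors' code.

```python
# Minimal sketch of the random masking step around Equation (6): keep d of the
# S embedded patches and delete the rest. The 75% masking ratio matches the
# example in this paper; the index bookkeeping is an illustrative convention.
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, S, embed_dim); returns the kept patches plus index bookkeeping."""
    batch, S, dim = patches.shape
    d = int(S * (1 - mask_ratio))                 # number of patches retained
    noise = torch.rand(batch, S)                  # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)     # random permutation of patch indices
    ids_keep = ids_shuffle[:, :d]                 # indices of the retained (transmitted) patches
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))
    return kept, ids_keep, ids_shuffle

kept, ids_keep, ids_shuffle = random_mask(torch.randn(1, 142, 1024))
print(kept.shape)   # (1, 35, 1024): only ~25% of the patches are transmitted
```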
At the receiver side, which includes the photo-detector and the decoder, as depicted in the right box of Figure 1, the primary task is to reconstruct the DMT patches that were deleted (masked) at the transmitter side, which includes the white LED and the masked autoencoder. The reconstruction is performed using a masked autoencoder (MAE)-based deep learning model, which learns to infer the missing DMT patches from the transmitted ones.
The overall receiver-side process can be divided into the following steps. The optically modulated signal is received by a photo-detector (PD) and then the detected electrical signal is amplified and converted back into digital form via an analog-to-digital converter (ADC). Next, the received signal is divided into DMT patches, just like at the transmitter. The positions of the deleted patches are determined using positional encoding, ensuring that the decoder can infer the correct locations. Next, the received DMT patches (both unmasked and masked ones) are input into a deep learning decoder that reconstructs the missing patches. The model learns to predict the missing patches using adjacent patch information. As shown in Equation (7), the reconstruction function can be represented as follows:
$$\hat{P}_m = f_{decoder}(P_1, P_2, \ldots, P_m, \ldots, P_S) \quad (7)$$

where $\hat{P}_m$ is the reconstructed DMT patch, $P_i$ are the available (unmasked) patches, and $f_{decoder}$ is the neural network used for reconstruction. Next, the reconstructed DMT signal is reassembled by combining both original and reconstructed patches. The signal is then converted from parallel to serial and undergoes an inverse discrete Fourier transform (IDFT). After reconstructing the complete DMT signal, the cyclic prefix (CP) is removed. The demodulation process extracts the QPSK symbols using the discrete Fourier transform (DFT).
We proposed a masked autoencoder (MAE)-based sparse coded mask modeling technique, which is a self-supervised learning approach. The model follows an encoder–decoder architecture, where the encoder compresses the unmasked patches into a latent representation, and the decoder reconstructs the missing patches. The encoder only receives a subset of the original patches (i.e., the unmasked patches). These patches are embedded into a high-dimensional space using a 1-D convolutional embedding layer. The encoded representation is then fed into a transformer-based neural network to learn contextual relationships between patches. On the other hand, the decoder takes both the encoded patches and positional encoding information. A self-attention mechanism is applied to infer missing patches based on adjacent ones. The final output is a reconstructed version of the complete DMT signal, including the originally missing patches.
As shown in Equation (8), the loss function is calculated by comparing the reconstructed patches with the original patches using the mean squared error (MSE) loss, outputting the difference between them. The result of the calculated loss function indicates the degree of similarity between the reconstructed patches and the original patches.
$$L = \frac{1}{N} \sum_{m=1}^{N} \left( \hat{P}_m - P_m \right)^2 \quad (8)$$

where $\hat{P}_m$ is the reconstructed patch and $P_m$ is the original patch. Algorithm 1 shows the detailed design and algorithm of the learning model.
Algorithm 1 Masked Autoencoder (MAE)-Based Sparse Coded Mask Modeling
1: Input Received DMT patches (both available and masked)
2: Step 1 Patch Embedding
Apply a 1-D convolutional layer to transform each patch into a feature space.
Embed position information using positional encoding.
3: Step 2 Encoding
Feed the unmasked patches into a transformer encoder.
Learn contextual relationships using multi-head self-attention (MHSA) layers.
4: Step 3 Masked Patch Reconstruction
Input the encoded features into a decoder network.
Predict the missing patches based on surrounding patch information.
5: Step 4 Loss Calculation
Compute the MSE loss between the predicted and original DMT patches.
6: Step 5 Output the Reconstructed DMT Signal
Reassemble the patches and apply the inverse transformation to reconstruct the full DMT waveform.
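A compact PyTorch sketch of the encoder-decoder flow of Algorithm 1 is given below. The layer counts, the learnable mask token, and the use of nn.TransformerEncoder are assumptions made for illustration; only the overall structure (encode the unmasked patches, re-insert mask tokens with positional information, decode, and compute the MSE of Equation (8)) follows the algorithm above.

```python
# Compact sketch of Algorithm 1 in PyTorch. Layer counts, the learnable mask
# token, and the use of nn.TransformerEncoder are illustrative assumptions.
import torch
import torch.nn as nn

class DMTMaskedAutoencoder(nn.Module):
    def __init__(self, embed_dim=1024, depth=4, heads=8, out_len=16):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True), num_layers=depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, out_len)   # features -> time-domain samples per patch

    def forward(self, kept, ids_shuffle, S, pos_enc):
        # kept: (batch, d, embed_dim) unmasked, position-encoded patches (Step 1)
        batch, d, dim = kept.shape
        latent = self.encoder(kept)                              # Step 2: MHSA encoding
        mask_tokens = self.mask_token.expand(batch, S - d, dim)  # placeholders for deleted patches
        tokens = torch.cat([latent, mask_tokens], dim=1)         # still in shuffled order
        ids_restore = torch.argsort(ids_shuffle, dim=1)          # undo the random shuffle
        tokens = torch.gather(
            tokens, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dim))
        decoded = self.decoder(tokens + pos_enc)                 # Step 3: Equation (7)
        return self.head(decoded)                                # reconstructed patch samples

# Steps 4-5: MSE loss of Equation (8) over the reconstructed patches
model = DMTMaskedAutoencoder()
pos_enc = torch.zeros(1, 142, 1024)                  # e.g., the encoding of Equation (5)
kept = torch.randn(1, 35, 1024)                      # placeholder: 35 of 142 patches kept
ids_shuffle = torch.argsort(torch.rand(1, 142), dim=1)
recon = model(kept, ids_shuffle, S=142, pos_enc=pos_enc)
loss = nn.MSELoss()(recon, torch.zeros_like(recon))  # placeholder targets for illustration
```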

2.2. Impact of the Masking Ratio on the Channel Capacity

2.2.1. Revisiting Our Masked Autoencoder and Signal Compression

Definition of Masking Ratio
  • Suppose the original DMT waveform for one symbol duration has M samples.
  • A masking ratio $r$ means that a fraction $r$ of the total patches (or samples) is deleted, and only the remaining $(1 - r)$ fraction is actually transmitted over the optical link.
  • In essence, the time-domain length of the transmitted signal (after masking) becomes $(1 - r) \times M$.
Physical Shortening in the Time Domain
  • Because these masked samples are not transmitted at all, we effectively reduce the symbol’s physical duration.
  • In a typical data transmission system, if we can compress a waveform to a smaller time interval while still recovering it accurately at the receiver, we can fit more waveforms within a fixed total time interval (e.g., 1 s).

2.2.2. Linking the Masking Ratio to Channel Capacity

(1) Idealized Capacity Gain Analysis
Baseline (No Masking) Case
  • Let the original DMT waveform occupy a duration $T$. Within each symbol interval, we transmit $M$ samples.
  • We define the “baseline” capacity, $C_{base}$, in bits per second (bps), which can be expressed as Equation (9).

    $$C_{base} = \frac{\mathrm{Bits/symbol}}{T} \quad (9)$$
  • If we are using a certain modulation format (e.g., QPSK, 16-QAM), “Bits/symbol” is determined by the number of bits encoded in each DMT symbol.
Masked Transmission
  • When a fraction $r$ is masked, the transmitted portion has duration $(1 - r) \times T$.
  • If we assume that we strictly use that shorter time to send the compressed DMT waveform and can pack more compressed symbols into the same total time, the capacity can be described as Equation (10).

    $$C_{masked} = \frac{\mathrm{Bits/symbol}}{(1 - r)\,T} \quad (10)$$

  • This suggests an ideal capacity improvement factor, as shown in Equation (11).

    $$\frac{C_{masked}}{C_{base}} = \frac{\mathrm{Bits/symbol}/[(1 - r)T]}{\mathrm{Bits/symbol}/T} = \frac{1}{1 - r} \quad (11)$$
Interpretation
  • If $r = 0.75$ (75% masked), then $1/(1 - r) = 4$. In an ideal scenario with negligible overhead, we could quadruple throughput.
  • If $r = 0.85$, then $1/(1 - r) \approx 6.67$.
(2) Practical Considerations and the 1.85× Factor
Overheads and Guard Intervals
  • In practice, our experiment must incorporate positional encoding, guard intervals, zero padding, and other overheads that reduce the net gain from the theoretical $1/(1 - r)$.
  • We also rely on the masked autoencoder’s ability to reconstruct the missing patches reliably. If the masking ratio is too high, the system may fail to meet our EVM threshold for reliable demodulation.
Experimental Feasibility
  • In our Li-Fi testbed, we found that at masking ratios of up to 85%, the error vector magnitude (EVM) for QPSK/16-QAM still remains below the first-generation GFEC limit.
  • Accounting for all overhead (MAE training overhead, patch boundary, zero padding, etc.), our experiments showed an effective capacity gain of up to ~1.85× (instead of the pure theoretical factor near 6.67).
  • This value (1.85×) reflects a realistic balance between high masking ratios and maintaining acceptable EVM under practical Li-Fi conditions (1 m distance, ≥240 lux illumination).

2.2.3. Mathematical Derivation

Let us restate key equations to show how the “masking ratio to channel capacity” link can be expanded in the paper:
Masked DMT Signal Length, $L_{masked}$, can be described as Equation (12).

$$L_{masked} = (1 - r) \times L_{orig} \quad (12)$$

where $L_{orig}$ is the original sample length and $0 \le r \le 1$.
Per-Symbol Transmission Time
  • If the system transmits each masked symbol in $(1 - r) \times T_{orig}$, then the instantaneous symbol rate can be increased proportionally if we keep the same overall bandwidth or sampling rate.
Maximum Achievable Data Rate
  • Denote $\beta$ bits per DMT symbol (depending on QAM order and number of subcarriers). In an ideal scenario with no overhead, $C_{ideal}(r)$ can be expressed as Equation (13).

    $$C_{ideal}(r) = \frac{\beta}{(1 - r)\,T_{orig}} \quad (13)$$
Realistic Rate Improvement
  • Let $\alpha$ be an experimental efficiency factor ($0 \le \alpha \le 1$) representing overhead or imperfection. Our measured rate gain is organized as Equation (14).

    $$C_{real}(r) \approx \alpha \times \frac{\beta}{(1 - r)\,T_{orig}} \quad (14)$$

  • In practice, we empirically observed $\alpha$ to be significantly below 1, resulting in about a 1.85× final measured improvement for $r = 0.85$.
These derivations clarify that while the theoretical capacity can rise dramatically with large $r$, real-world overhead and constraints limit how much we can exploit the reduced signal length. Nevertheless, the improvement remains significant (nearly doubling the net throughput at 85% masking).
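The following short numeric sketch ties Equations (11)-(14) together; the efficiency factor alpha is simply back-computed from the reported 1.85× gain at r = 0.85 and is shown for illustration only.

```python
# Worked numeric sketch of Equations (11)-(14): ideal gain 1/(1 - r) versus the
# overhead-limited gain alpha / (1 - r). Here alpha is back-computed from the
# reported 1.85x improvement at r = 0.85, purely for illustration.
r = 0.85                            # masking ratio
ideal_gain = 1.0 / (1.0 - r)        # Equation (11): ~6.67x
measured_gain = 1.85                # effective improvement reported in this work
alpha = measured_gain / ideal_gain  # experimental efficiency factor of Equation (14)
print(f"ideal gain    : {ideal_gain:.2f}x")
print(f"alpha         : {alpha:.2f}")
print(f"realistic gain: {alpha * ideal_gain:.2f}x")
```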

2.2.4. Additional Factors Influencing Masking Ratio

EVM Threshold
  • As shown in our experimental plots, once r exceeds a certain level, the reconstruction’s accuracy degrades, pushing EVM beyond forward error correction thresholds.
  • Thus, there is an optimal or near-optimal range for r (around 75–85% in our testbed) where capacity is maximized while maintaining acceptable symbol quality.
Number of Patches, Positional Encoding Overhead
  • We also vary the number/size of patches (Figure 5) and the batch size (Figure 6). These parameters can shift how effectively the model learns global vs. local structures.
  • If patches are too large, we lose the fine-grained advantage of self-attention. If they are too small, overhead and complexity rise.
Complex Modulation Formats
  • Surprisingly, for 16-QAM (a more complex modulation format), we observed better reconstruction performance at higher masking ratios compared to QPSK. This is due to the presence of more detailed signal features.
  • Future expansions can consider advanced modulation orders or other waveforms, further exploring how the masking ratio and overhead interplay to affect net capacity gains.
In summary, the masked autoencoder technique enables true time-domain compression: a fraction $r$ of the original signal is omitted, and in principle, we could fit $1/(1 - r)$ times more transmissions in the same interval. Practically, overhead, reconstruction fidelity, and EVM constraints diminish the raw gain factor, but we still achieve up to a 1.85× capacity increase at an 85% masking ratio in a real Li-Fi environment. The derivations above highlight the interplay between theoretical capacity improvements and practical constraints.
For comparison, feedforward neural network (FNN) and convolutional neural network (CNN) techniques can be used as alternatives to the MAE technique. However, they may exhibit the following limitations. FNNs process data in a fully connected manner; however, they do not effectively capture the sequential dependencies between DMT patches. While FNNs could be used for final-stage regression tasks, they are not ideal for reconstructing missing patches. CNNs, on the other hand, are effective for processing spatially correlated data, and a 1-D CNN could be employed to reconstruct missing patches by convolving over known adjacent patches. However, CNNs lack the ability to model long-range dependencies, which is crucial for accurately reconstructing masked patches over an extended DMT sequence. In contrast, the transformer-based MAE leverages self-attention mechanisms to capture long-range dependencies between masked and unmasked patches. Unlike CNNs, which focus on local information, the MAE can infer missing patches using a global context, making it the most suitable choice for reconstructing masked DMT patches in this study, while CNNs and FNNs remain potential alternatives.

3. Experimental Setup

3.1. Li-Fi Transmission Link Testbed Based on Sparse Coded Mask Model

Figure 2 illustrates the optical wireless Li-Fi transmission link implemented for the experimental verification of the proposed technique. The dotted box represents the offline processing conducted using MATLAB® 2023a interfaced with Python 3.13.2. As shown in the dotted box on the left, the DMT signal encoded by QPSK symbols is designed to have 512 subcarriers, allocated from DC up to 10 MHz. To eliminate spike noise occurring near DC, zero-padding was applied to the first 20 subcarriers. A cyclic prefix (CP) of length 16 was added to maintain orthogonality during the optical wireless transmission, and then the parallel-processed DMT signal was converted to serial form. Each DMT symbol consists of 528 samples, and a total of four DMT symbols were transmitted, resulting in 2112 samples. To ensure clear separation between consecutive DMT symbols, 40 zero-padding samples were inserted four times, leading to a total of 160 zero-padding samples. Consequently, the final DMT signal length is 2272 samples, which is not an integer multiple of 528 due to the added zero-padding. This design was chosen to enhance symbol boundary clarity and effectively minimize inter-symbol interference. Waveforms measured at points (A) to (D) are shown at the bottom of Figure 2. Here, the masking ratio was 75%, which means that 75% of the entire DMT signal waveform was removed and only 25% was retained for transmission.
This DMT signal is evenly divided into 16 patches, with each patch having a length of 142 samples (=2272/16). These 16 patches are then treated as 16 DMT waveform images and processed through a 1-dimensional convolutional filter (embedding dimension: 1024, kernel size: 16, stride: 16) to perform patch embedding. As a result, 142 embedded patches, each with a dimension of 1 × 1024, are obtained. Next, random masking is applied to compress the DMT signal. In the case of 75% random masking, the total length of the unmasked patches is 568 samples (=142 × 16 × 0.25). Here, the batch size is set to 32, and the learning rate is set to 0.00025. The remaining unmasked patches, together with the positional encoding patches, were combined to generate the latent representation of the DMT signal. This was loaded into an arbitrary waveform generator (AWG) sampled at 40 MS/s. The signal output from the AWG was amplified by a low-noise amplifier (LNA) and then biased with a DC power supply using a bias-tee. The optical signal emitted from the white LED was directly modulated by the biased AWG output signal and transmitted over a 1-m free-space optical link to an avalanche photodiode (APD).
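For reference, the frame layout and masking arithmetic described above can be checked with a few lines of Python (values taken from the text):

```python
# Quick arithmetic check of the frame layout described above (values from the text).
n_symbols, samples_per_symbol = 4, 528        # 512 subcarrier samples + 16 CP samples
zero_pad = 4 * 40                             # symbol-boundary zero padding
frame_len = n_symbols * samples_per_symbol + zero_pad   # 2272 samples per frame
patch_len = frame_len // 16                   # 16 patches of 142 samples each
unmasked_len = int(frame_len * (1 - 0.75))    # 568 samples transmitted at 75% masking
print(frame_len, patch_len, unmasked_len)     # 2272 142 568
```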
The light intensity measured at the APD was 300 lux. After optical detection at the APD, the received signal waveform corresponding to the latent representation of DMT signal was captured and then stored by a real-time oscilloscope (MSO 71604C, Tektronix, Beaverton, OR, USA) sampled at 120 MS/s. Since the captured signal consisted of transmitted DMT patches and positional encoding patches, the positions of the transmitted patches were determined using the positional encoding patches to reconstruct the masked patches.
Subsequently, the masked patches are reconstructed through an MAE-based decoding process, which is performed in the reverse order of the encoding process, extracting the feature points. Using the reconstructed patches together with the transmitted patches, the DMT signal waveform was recovered and then the CP was removed. The mean squared error (MSE) was used to measure the difference between the recovered DMT signal and the input DMT signal. Learning was performed to optimize various hyperparameters, such as patch size, batch size, and learning rate, by minimizing the obtained MSE value.
Following the demodulation process using a Discrete Fourier Transform (DFT), the QPSK symbols were obtained. In this paper, our primary focus was not on achieving the highest transmission capacity in the field of Li-Fi research, but rather on verifying whether the proposed technique functions conceptually in line with the theoretical framework. Therefore, we implemented an optical wireless Li-Fi transmission link with a relatively low data rate to experimentally validate the proposed technique.
The specifications of the optical and RF components used to implement the experimental setup shown in Figure 2, as well as the values of the transmission link parameters, are summarized in Table 3.
The primary objective of this research was to verify the feasibility and theoretical validity of the proposed technique in a controlled and measurable environment, rather than achieving maximum transmission distance or data rate. The key reasons for selecting a 1 m optical link are as follows.
  • Controlled Channel Characteristics: Short-range optical transmission minimizes the impact of unpredictable environmental variables such as multipath reflections, non-line-of-sight propagation, and ambient noise. This enables us to isolate and evaluate the effect of our proposed masking, compression, and reconstruction techniques on the DMT waveform.
  • Fine-grained Signal Capture: At shorter distances, signal distortion is minimized, allowing us to accurately measure subtle waveform differences introduced by the MAE process. These include time-domain waveform shape, constellation distortion (EVM), and patch recovery quality.
  • Reproducibility and System Stability: A 1 m setup ensures consistent illumination intensity (300–440 lux) and signal power at the photo-detector, making it easier to reproduce results and compare performance under controlled variations in masking ratio, patch number, and batch size.
  • Focus on Concept Validation: As stated in Section 3, the main focus of this work was to validate the concept and behavior of the proposed sparse coded mask modeling technique when applied to Li-Fi DMT signals, rather than to develop a fully optimized long-range Li-Fi system.
Although the current experiment was performed over a short distance, the proposed technique is inherently scalable and can be adapted to longer-range or more complex Li-Fi scenarios. Here is why:
  • Transmission over Extended Distances: The proposed MAE-based framework is agnostic to transmission distance as long as the signal-to-noise ratio (SNR) at the receiver is sufficient to recover the transmitted DMT patches. In fact, as shown in Figure 7, our system successfully recovered signals even at SNR levels as low as 22.5 dB, indicating robustness to the moderate noise or signal degradation that may occur over longer distances.
  • Adaptation to Ambient Illumination and Multipath Effects: The proposed method already accounts for variable lighting conditions, as demonstrated in Figure 8. Additionally, the use of transformer-based MAE architecture enables modeling of non-local dependencies, which may help in learning signal patterns distorted by multipath effects in larger indoor environments.
  • Feasibility in Indoor Infrastructure: Many Li-Fi use cases (e.g., smart offices, classrooms, hospitals) are characterized by short to medium range communication, typically within 2–5 m. Therefore, the proposed 1 m setup serves as a relevant first step and is representative of core mechanisms applicable to these environments.
  • Possible Enhancements for Long-Range Transmission: To extend the operational range, the following practical solutions can be employed:
    -
    Optical Lens System to focus and collimate the LED beam, increasing directionality and intensity.
    -
    Multiple Input Multiple Output (MIMO) architectures to aggregate spatial diversity and mitigate path loss.
    -
    Higher sensitivity photodiodes (e.g., SiPM or APD arrays) to enhance detection under lower illumination at longer ranges.
    -
    Adaptive masking ratio tuning based on channel state information (CSI) to preserve signal recoverability as the channel degrades.
The proposed MAE model consists of a patch encoder and decoder designed to process multivariate time-series DMT waveforms. Each DMT frame is split into fixed-length patches (e.g., 16-sample segments), and a random subset (e.g., 75–85%) is masked before being input to the encoder. The encoder embeds the unmasked patches using a 1-D positional embedding and passes them to a multi-layer transformer.
The decoder reconstructs the full DMT frame by predicting the missing (masked) patches using both the encoded representations and learnable mask tokens. Reconstruction loss is computed only over the masked patches to guide self-supervised learning.
To ensure reproducibility and transparency, the key hyperparameters are summarized in Table 4.
We tuned these hyperparameters through an empirical grid search using a validation set split from the waveform dataset. We observed that smaller patch sizes resulted in improved fine-grained reconstruction but increased model size, while 16-sample patches provided the best trade-off between performance and complexity. Similarly, a masking ratio above 85% degraded EVM sharply, while the 75–85% range maintained reliable reconstruction.

3.2. Key Transmission Performance Metrics

3.2.1. Bit Error Rate (BER) vs. Error Vector Magnitude (EVM)

Why we focus on EVM
  • In our experiments, we employed Error Vector Magnitude (EVM) as the primary performance indicator. EVM is widely used in optical and wireless communications as it provides a direct measure of constellation distortion and correlates closely with BER once we set a threshold (e.g., the first-generation forward error correction (GFEC) limit).
  • Specifically, in systems that use QAM-based modulation (QPSK, 16-QAM), an EVM under certain percentages typically assures a BER level under 10⁻³ or 10⁻⁴, depending on the coding scheme.
Estimated BER from EVM
  • Although we did not explicitly measure BER in real time, it is possible to relate EVM to BER via analytical expressions or standardized thresholds. For instance, the first GFEC threshold for QPSK is typically around 26.3% EVM, correlating to a BER near 3.8 × 10⁻³. For 16-QAM, a 12.3% EVM is often used as a typical threshold for meeting a BER near 10⁻³.
  • As reported in our experimental results, when we keep EVM below those threshold values, we can infer that the resulting BER is within acceptable limits for standard FEC to correct any residual errors.
  • Furthermore, a clear theoretical relationship exists between EVM and BER. Typically, a lower EVM indicates higher signal quality, directly translating to a lower BER. This relationship is mathematically defined as follows:
$$BER = 0.5 \times \mathrm{erfc}\!\left(\frac{1}{EVM\sqrt{2}}\right)$$

where $\mathrm{erfc}(\cdot)$ is the complementary error function. Thus, our extensive EVM results effectively provide a strong indirect indication of BER performance under varying masking ratios and modulation schemes (QPSK and 16-QAM). Our results clearly demonstrate optimal masking ratios around 75%, balancing improved throughput and accurate signal reconstruction.
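The EVM-to-BER relation quoted above can be evaluated with SciPy's complementary error function; since the exact mapping depends on modulation order and on whether EVM is expressed as an RMS fraction or a percentage, the sketch below is indicative only.

```python
# Indicative sketch of the EVM-to-BER relation quoted above.
import numpy as np
from scipy.special import erfc

def ber_from_evm(evm: float) -> float:
    """BER = 0.5 * erfc(1 / (EVM * sqrt(2))), with EVM expressed as a fraction."""
    return 0.5 * erfc(1.0 / (evm * np.sqrt(2.0)))

# Example: EVM values measured at a 75% masking ratio (Figure 4). The mapping to
# BER is modulation-dependent, so these numbers are rough indications only.
for evm in (0.17, 0.027):            # QPSK: 17%, 16-QAM: 2.7%
    print(f"EVM = {evm:.3f} -> estimated BER = {ber_from_evm(evm):.2e}")
```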

3.2.2. Computational Complexity and Processing Overhead

Offline Training vs. Online Inference
  • The proposed masked autoencoder (MAE) involves a training phase (offline) and an inference (deployment) phase (online).
  • Training Complexity:
    -
    This is typically performed offline on a GPU-equipped workstation. The training process can indeed be computationally demanding. However, once training converges, the resulting model parameters are frozen and used for real-time inference.
  • Inference Complexity:
    -
    The inference side typically involves fewer matrix multiplications and can run on relatively modest hardware, such as embedded GPUs or even advanced DSP/FPGA solutions, depending on implementation details.
  • Hence, while there is a significant one-time overhead for training, the real-time overhead for the user is predominantly in the decoder that reconstructs masked patches. This step can be optimized.
Comparison with Conventional Li-Fi
  • Traditional Li-Fi systems might use simpler digital signal processors for equalization or channel estimation, leading to relatively low-latency operations.
  • However, those systems transmit the entire waveform, whereas our approach physically omits a large fraction of the samples and recovers them with deep learning. This trade-off can yield a net advantage in throughput if the MAE’s inference overhead is kept manageable.
  • To provide a rough computational comparison:
    -
    If we consider a typical Li-Fi link employing FFT-based OFDM or DMT, we have an IFFT/FFT block plus straightforward channel equalization.
    -
    Our method adds the overhead of a 1-D convolution and transformer-based reconstruction at the receiver. The precise complexity depends on hyperparameters such as the number of patches, embedding dimension, and number of attention heads.
  • In future work, we plan to include scalability metrics (e.g., how many MAC operations or FLOPs per symbol) to give a more quantitative sense of overhead. Early estimates suggest that for moderate patch sizes and a well-optimized network, real-time operation at tens of MHz sampling rates can be feasible with standard GPU or FPGA acceleration.
Possible Reduction in Complexity
  • Techniques like model pruning, quantization, or lightweight transformers can reduce complexity further.
  • Our approach can also be integrated with partial layering or smaller embedding sizes for faster inference, at some potential trade-off in reconstruction fidelity.

3.2.3. Performance Under Varying Distances and Real-World Channel Impairments

Current Setup
  • Our current experiments are performed at 1 m with an LED-based Li-Fi link operating around 300–440 lux. We deliberately tested different lux levels to simulate typical indoor illumination conditions, but we have not yet covered a range of link distances (e.g., 2 m, 5 m, etc.).
Impact of Distance
  • As distance grows, the received optical power drops (following an inverse-square law in free-space optical scenarios), and the SNR typically decreases.
  • Lower SNR can degrade the autoencoder’s ability to reconstruct heavily masked waveforms, pushing EVM higher.
  • However, if we can ensure an adequate SNR above the forward error correction threshold, the overall concept should still apply. The system might require a lower masking ratio at longer distances or a higher optical power/better photodiode sensitivity to maintain acceptable EVM.
Real-World Channel Impairments
  • In practice, ambient light interference, multi-LED flicker, device aging, and reflection-induced multipath can further degrade SNR or introduce nonlinearities.
  • Our masked autoencoder approach is relatively robust because it learns the global structure of the DMT waveform. However, more severe channel distortions may require additional training data or an expanded neural network capacity.
  • We intend to investigate multi-path or large-signal distortion scenarios in future work. The masked approach could also be combined with a pre-distortion or equalization stage if necessary.
Future work
  • A more in-depth test across various distances and real-world building environments (e.g., offices, corridors) would help demonstrate the generalizability of our approach. We aim to include such results in an extended version of this research or in subsequent publications.
In summary, BER, computational overhead, and distance/channel impairments are as follows:
  • BER Discussion: While we measured EVM in our current paper, we acknowledge that a direct BER vs. SNR analysis can provide additional clarity. We will integrate approximate BER estimates and potentially run separate BER tests in future expansions.
  • Computational Overhead: Our method introduces a deep learning step for reconstruction, which can be performed in real-time with optimized hardware. A direct complexity comparison to traditional Li-Fi techniques is planned to give numerical overhead estimates.
  • Distance and Channel Impairments: We have validated the method at 1 m and under typical indoor illuminance, but broader testing at multiple distances and channels is an important next step.

4. Experimental Results

It is necessary to first theoretically verify whether the learning model implemented using the proposed technique is properly trained for the DMT signal. Therefore, we calculated how the training loss converges according to the masking ratio to evaluate the performance of the learning model.
Figure 3 shows the mean square error (MSE) as the training loss of the learning model according to the masking ratio of the DMT patches. Here, the masking ratio refers to the proportion of the waveform removed from the overall DMT signal, representing the compression ratio from the perspective of digital signal processing. The solid black line represents the MSE curve at a masking ratio of 25%, the solid gray line represents the MSE curve at a masking ratio of 55%, and the gray dashed line represents the MSE curve at a masking ratio of 75%. Since the difference in MSE across masking ratios was not discernible to the naked eye for epochs 400 to 500, an enlarged view of the MSE curves over that epoch range is shown in the inset of Figure 3. Notably, as illustrated in the inset, the MSE at a masking ratio of 75% was consistently the lowest within the epoch range of 400 to 500. These results suggest that the errors associated with recovering masked DMT patches decrease progressively as the masking ratio increases from 25% to 75%. To experimentally validate these simulation results, the error vector magnitude (EVM) was measured for two different symbol types (QPSK and 16-quadrature amplitude modulation (QAM)) under varying masking ratios using the optical wireless Li-Fi transmission link implemented in Figure 2.
Figure 4 shows the error vector magnitude (EVM) of the QPSK symbols and 16-QAM symbols repeatedly measured while changing the masking ratio. The filled squares represent the EVM of the QPSK symbols, while the open squares indicate the EVM of the 16-QAM symbols. Each inset shows the constellation diagram of the QPSK and 16-QAM symbols when the masking ratio is 75%. Here, the total number of DMT patches was 16, and the batch size was 32. The masking ratio again refers to the proportion of the waveform removed from the overall DMT signal, representing the compression ratio from the perspective of digital signal processing. As shown in Figure 4, the EVM of both symbol types, QPSK and 16-QAM, decreased progressively as the masking ratio increased from 15% to 75%. From the perspective of general digital signal processing, an increase in the masking ratio, i.e., the compression ratio, leads to a higher error rate during the recovery process, which would inevitably increase the EVM. However, with the application of the proposed technique, it was observed that, as the masking ratio increases to 75%, the recovery process for each signal exhibits a reduction in error. Notably, the EVM decreased to 17% for QPSK symbols and 2.7% for 16-QAM symbols.
These results can be explained by the following technical reasons. At low masking ratios, the proposed MAE-based learning model is exposed to a substantial portion of the original DMT signal waveform data encoded with each symbol type (QPSK and 16-QAM). As a result, the model may become overly reliant on, or overfitted to, the abundant original DMT data during the recovery process. Consequently, the proposed learning model primarily emphasizes local relationships rather than capturing the overall structure of the entire DMT signal waveform, and it tends to reconstruct the missing waveform based on the surrounding information from the unmasked DMT patches. If these processes are repeated, the model fails to capture the overall characteristics of the randomly varying time-series data, leading to an increased error rate during the recovery of the DMT waveform data.
In contrast, at higher masking ratios, the proposed learning model is required to reconstruct a larger portion of the missing waveform using limited information from the unmasked DMT patch data. This constraint enhances the ability of the learning model to deeply analyze the DMT patch data and perform recovery based on learned patterns. This process allows generalized pattern learning to occur, which helps prevent overfitting. In other words, the proposed learning model does not simply predict the masked portions of the DMT waveform data randomly using limited DMT data information. Instead, it learns the overall structure and patterns of the DMT data, enabling it to effectively predict the missing DMT data waveform based on this understanding. Ultimately, this recovery process leads to a reduction in EVM by minimizing errors.
In addition, we can see that, as the masking ratio increases beyond 75%, the EVM of each signal increases. This increase in EVM can be attributed to the following reason: when the masking ratio becomes excessively high, the amount of DMT patch data available for the learning model is significantly reduced. As a result, the learning process becomes challenging, leading to a decrease in recovery performance. Therefore, it is necessary to optimize the masking ratio according to the characteristics of various signals in order to apply the proposed technique to signals other than the DMT signal.
Another notable observation is that the EVM of the 16-QAM symbols was consistently lower than that of the QPSK symbols at the same masking ratio. This implies that the DMT waveform encoded with 16-QAM symbols, which has a more complex shape than that encoded with QPSK symbols, is reconstructed more faithfully to the original. In a typical digital signal processing environment, the DMT waveform encoded with 16-QAM symbols exhibits greater complexity than that encoded with QPSK symbols. Consequently, the intricate structure of the 16-QAM-encoded waveform is more susceptible to loss during compression, and one would expect more errors during recovery as the compression ratio increases.
However, contrary to conventional digital signal processing environments, the proposed technique recovers the more complex DMT waveform encoded with 16-QAM symbols better, as shown in Figure 4. This can be attributed to the following reasons. Complex DMT waveforms inherently contain more information owing to their detailed features, such as frequent abrupt changes in the waveform; in other words, they encompass a greater number of structural features of the DMT data. The proposed learning model can therefore leverage a richer set of structural information during training, enabling it to better comprehend the overall DMT signal structure and to reconstruct the waveform more accurately, closer to the original. As a result, the error rate is reduced when the missing DMT waveform is recovered from the unmasked DMT patches.
Finally, it was observed that both QPSK and 16-QAM symbols can be successfully transmitted at masking ratios of up to 85%, based on the first-generation generic forward error correction (GFEC) EVM thresholds (QPSK: 26.3%, 16-QAM: 12.3%), as shown in Figure 4. These results indicate that, for the 10 MHz physical bandwidth of the experimental setup in Figure 2, the transmission rate can be enhanced to up to 37 Mb/s (20 Mb/s × (1 + 0.85)) for QPSK symbols and up to 74 Mb/s (40 Mb/s × (1 + 0.85)) for 16-QAM symbols.
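For reference, the EVM values quoted above follow the standard definition of the root-mean-square error vector normalized to the reference constellation power. The short sketch below illustrates one common normalization convention; the authors' measurement equipment may apply a slightly different one.

```python
# Sketch of a standard EVM computation for judging symbol quality against the GFEC thresholds.
import numpy as np

def evm_percent(received, reference):
    """RMS error vector magnitude normalized by the RMS reference power, in percent."""
    error_power = np.mean(np.abs(received - reference) ** 2)
    reference_power = np.mean(np.abs(reference) ** 2)
    return 100 * np.sqrt(error_power / reference_power)

rng = np.random.default_rng(0)
qpsk_ref = (1 / np.sqrt(2)) * rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=1024)
received = qpsk_ref + 0.05 * (rng.standard_normal(1024) + 1j * rng.standard_normal(1024))
print(f"EVM = {evm_percent(received, qpsk_ref):.1f}%")   # roughly 7% for this noise level
```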
Figure 5 shows how the EVM of the QPSK and 16-QAM symbols changes as the total number of DMT patches increases from 8 to 64 in powers of 2. In these measurements, the masking ratio was 75% and the batch size was 32. The two insets in Figure 5 show the constellations of the QPSK and 16-QAM symbols measured when the number of DMT patches was set to 8.
Figure 5. Measured EVM variation against the total number of DMT patches. Filled squares: EVM of QPSK symbols, Open squares: EVM of 16-QAM symbols.
As shown in Figure 5, the EVM of both signals decreases as the number of DMT patches increases from 8 to 32. This reduction can be attributed to the following technical factors. As the total number of DMT patches decreases, the length of each patch increases, so the temporal range of DMT information contained within each patch expands. Random masking at a fixed masking ratio therefore removes relatively long waveform segments when the total number of patches is small, increasing the likelihood of failing to capture and learn the detailed time-series patterns within critical DMT patches. Ultimately, this leads the proposed learning model to an increased error rate in the reconstruction of the overall DMT signal.
Conversely, as the total number of segmented patches increases and the length of each patch becomes shorter, the time-series pattern information contained in each DMT patch becomes more fine-grained. With these shorter DMT patches, the recovery of masked patches becomes considerably easier by leveraging the surrounding unmasked patches, even at the same masking ratio. In other words, smaller DMT patches enable the proposed learning model to learn the detailed time-series patterns and variations of the DMT signal more accurately, thereby reducing the recovery error rate. As shown in Figure 5, the recovery error rate continued to decrease up to 32 patches, leading to a gradual reduction in the EVM of both the QPSK and 16-QAM symbols.
However, when the total number of divided DMT patches reached 64, an increase in EVM was observed for both symbols. This increase in EVM can be attributed to the following factors: the waveform of a DMT signal encodes critical information regarding its continuity and temporal pattern variations. When the length of each DMT patch becomes excessively short, maintaining this continuity and capturing these pattern variations may become challenging. Furthermore, as the number of divided patches increases excessively, the complexity of the proposed learning model also rises, adversely impacting its ability to achieve accurate learning. For these technical reasons, it could be concluded that the EVM for both symbols increased when the total number of DMT patches was raised to 64.
Based on the experimental results presented in Figure 5, it can be concluded that optimizing the number of divided patches is crucial when transmitting data using waveforms other than the DMT signal.
Figure 6 shows the EVM variation of the QPSK and 16-QAM symbols with respect to the batch size. In these measurements, the masking ratio was 75% and the number of DMT patches was 16. Since the batch size is the number of samples used for each update of the proposed learning model, it can affect the error rate in the recovery of the DMT signal. We therefore measured how the transmission performance of the two symbol types (QPSK and 16-QAM) changes with the batch size. As shown in Figure 6, as the batch size increases from 4 to 64, the EVM of the QPSK symbols increases from 16.5% to 21.1%, and that of the 16-QAM symbols increases from 2.5% to 5.4%.
Figure 6. Measured EVM variation against the batch size. Filled squares: EVM of QPSK symbols, Open squares: EVM of 16-QAM symbols.
The increase in EVM with batch size can be attributed to the following reasons. It is known that a small batch size helps the learning model avoid overfitting and learn various patterns by adding noise to the gradient estimate at each update, thereby improving the recovery rate [19,20,21]. Similarly, for the DMT signal, the masked part of a DMT patch can be recovered effectively by capturing the delicate, time-series variations of the DMT waveform. On the other hand, as the batch size increases, the diversity within each batch of DMT patch data decreases, which may bias the proposed learning model toward specific patterns and cause overfitting; this increases errors in the recovery of the DMT signal waveform. However, a small batch size increases the number of parameter updates per epoch, which may lead to longer training times. Therefore, the batch size should be optimized within the EVM threshold, considering the latency allowed by the network.
As shown in Figure 7, the BER of both QPSK and 16-QAM symbols decreases consistently as the input SNR increases from 15 dB to 40 dB. The QPSK signal achieves a BER below the GFEC threshold at around 23 dB, while the 16-QAM signal reaches this point slightly earlier, around 22 dB. This performance trend is consistent with the EVM results presented earlier, where 16-QAM showed lower error magnitudes than QPSK.
Figure 7. Measured BER versus input SNR for QPSK and 16-QAM symbols after 1 m optical wireless transmission using the proposed MAE-based sparse coded mask modeling technique. The masking ratio was fixed at 75%. The dashed line indicates the first-generation GFEC threshold (BER = 1 × 10−4).
The superior BER performance of 16-QAM may be attributed to its higher symbol resolution, allowing the masked autoencoder to better infer complex symbol patterns during the reconstruction process. At higher SNRs (above 30 dB), the BER of 16-QAM drops to near-zero levels, demonstrating the effectiveness of the proposed reconstruction approach even for high-order modulation schemes.
These findings validate the robustness of the proposed MAE-based sparse coded mask modeling technique not only in terms of constellation-level accuracy (via EVM) but also in terms of bit-level reliability, as verified through BER analysis.
Figure 8 shows the repeatedly measured EVM of the QPSK and 16-QAM symbols as the intensity of light incident on the APD changes after 1 m of optical wireless transmission. The experimental conditions were a masking ratio of 75%, 16 DMT patches, and a batch size of 32. The received light intensity was varied within the range of 200 to 440 lux, reflecting typical indoor illuminance levels, since Li-Fi networks employ LEDs for indoor illumination. As shown in Figure 8, an EVM floor was observed for both modulation formats within the received light intensity range of 300 to 440 lux, indicating that the received SNR does not improve further as the light intensity increases beyond 300 lux. Furthermore, based on the first-generation GFEC EVM threshold, the received light intensity must be approximately 240 lux or higher to ensure successful transmission of both QPSK and 16-QAM symbols.
Figure 8. Measured EVM changes in QPSK and 16-QAM symbols against the received light intensity. Filled squares: EVM of QPSK symbols, Open squares: EVM of 16-QAM symbols.
In this study, we primarily focused on identifying the optimal masking ratio and patch number/size for different DMT subcarrier modulation formats. However, other DMT system parameters, such as the IFFT size and cyclic prefix (CP) length, could also significantly affect these optimal values. When the IFFT size is large, the subcarrier spacing becomes narrower, increasing the frequency resolution and enabling a more precise spectral representation. This allows the MAE model to utilize finer frequency details, potentially supporting a higher masking ratio, and smaller patch sizes may be more effective since the higher resolution retains sufficient detail for reconstruction. Conversely, when the IFFT size is small, the subcarrier spacing widens, reducing the frequency resolution but making the signal more resistant to frequency-selective fading. In this case, a high masking ratio might degrade the reconstruction quality owing to the limited frequency resolution, and larger patch sizes may be necessary to retain sufficient information for accurate reconstruction. Similarly, a long CP introduces more redundancy into the signal, allowing a higher masking ratio because the MAE can leverage this redundancy, while smaller patch sizes can still preserve enough information for accurate reconstruction. A short CP provides less redundancy, so a high masking ratio might lead to excessive information loss; a lower masking ratio is then needed, and larger patches may be necessary to capture enough information within each patch. Future studies therefore need to investigate various IFFT sizes and CP lengths to understand their respective roles in optimizing MAE-based DMT signal reconstruction.
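For concreteness, the sketch below (our own illustration under stated assumptions, not the authors' implementation) generates a real-valued DMT frame using the parameters summarized next: the 512 points per DMT symbol are treated as the IFFT size, a 16-sample CP is prepended, QPSK subcarriers are loaded with Hermitian symmetry, and four symbols are zero-padded to 2272 samples. The exact subcarrier allocation and scaling of the experimental system may differ.

```python
# Minimal DMT frame generation sketch (assumed IFFT size 512, CP length 16, QPSK loading).
import numpy as np

rng = np.random.default_rng(1)
N_FFT, CP_LEN = 512, 16
qpsk = (1 / np.sqrt(2)) * np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])

def dmt_symbol(n_fft=N_FFT, cp_len=CP_LEN):
    """Return one DMT symbol (CP + IFFT output) with QPSK on the data subcarriers."""
    n_data = n_fft // 2 - 1                       # usable subcarriers under Hermitian symmetry
    data = rng.choice(qpsk, size=n_data)
    spectrum = np.zeros(n_fft, dtype=complex)
    spectrum[1:n_data + 1] = data
    spectrum[-n_data:] = np.conj(data[::-1])      # Hermitian symmetry -> real time-domain signal
    time_signal = np.fft.ifft(spectrum).real
    return np.concatenate([time_signal[-cp_len:], time_signal])  # prepend cyclic prefix

frame = np.concatenate([dmt_symbol() for _ in range(4)])          # four DMT symbols
frame = np.pad(frame, (0, 2272 - len(frame)))                     # zero-padding to 2272 samples
print(len(frame))                                                 # 2272
```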
Based on the experimental results presented in Section 4, we can summarize the increase in transmission capacity of the Li-Fi system achieved by the proposed method, as follows.
(A) Summary of the Experimental Transmission System
  • Physical bandwidth used: 10 MHz
  • Modulation formats: QPSK and 16-QAM
  • Number of subcarriers per DMT symbol: 512
  • Cyclic prefix (CP): length of 16
  • Total samples per transmission: 2272 (including four DMT symbols and zero-padding)
  • Masking ratio: up to 85%
  • Sampling rate: 40 MS/s (arbitrary waveform generator)
(B) Data Rate Estimation
QPSK transmission rate
  • Baseline rate (without masking): since QPSK carries 2 bits per symbol, baseline = 10 MHz × 2 bits = 20 Mbps.
  • Enhanced rate with 85% masking (i.e., a 1.85× increase in channel capacity): effective rate = 20 Mbps × (1 + 0.85) = 37 Mbps.
16-QAM transmission rate
  • Baseline rate: since 16-QAM carries 4 bits per symbol, baseline = 10 MHz × 4 bits = 40 Mbps.
  • Enhanced rate with 85% masking: effective rate = 40 Mbps × (1 + 0.85) = 74 Mbps.
Therefore, our experimental setup demonstrates that the proposed method can achieve up to 37 Mbps with QPSK and 74 Mbps with 16-QAM, under a 10 MHz bandwidth scenario.
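The arithmetic above can be summarized in a few lines; the helper function below simply reproduces the baseline and masking-enhanced rates quoted in the text and is provided only as a reading aid.

```python
# Reproduce the data-rate estimation: baseline = bandwidth x bits/symbol, enhanced = baseline x (1 + masking ratio).
def effective_rate_mbps(bandwidth_mhz, bits_per_symbol, masking_ratio):
    baseline = bandwidth_mhz * bits_per_symbol           # Mb/s without masking
    return baseline, baseline * (1 + masking_ratio)      # Mb/s with masking gain

for name, bits in [("QPSK", 2), ("16-QAM", 4)]:
    base, enhanced = effective_rate_mbps(10, bits, 0.85)
    print(f"{name}: baseline {base} Mb/s -> enhanced {enhanced:.0f} Mb/s")
# QPSK: baseline 20 Mb/s -> enhanced 37 Mb/s
# 16-QAM: baseline 40 Mb/s -> enhanced 74 Mb/s
```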
The optimization of parameters such as the masking ratio, the number of patches, and the batch size is a crucial component of the proposed MAE-based sparse coded mask modeling technique. These parameters can be summarized as follows:
Masking Ratio
  • The masking ratio determines the degree of compression applied to the DMT signal. As demonstrated in the experimental results (Figure 3 and Figure 4), increasing the masking ratio up to 75% reduced the error vector magnitude (EVM) for both QPSK and 16-QAM symbols. This somewhat counterintuitive result is attributed to the self-supervised learning capability of the MAE, which encourages global pattern learning and reduces overfitting when less information is directly provided. However, if the masking ratio exceeds 85%, reconstruction performance starts to degrade owing to the lack of sufficient unmasked patches, as discussed above.
Patch Numbers
  • As shown in Figure 5, increasing the number of DMT patches from 8 to 32 led to a decrease in EVM, indicating improved reconstruction. Smaller patches preserve finer time-domain details, making it easier for the MAE model to learn both local and global structures. However, further increasing the patch count to 64 resulted in a higher EVM owing to excessive fragmentation, which hinders the learning of temporal continuity and increases model complexity. A trade-off therefore exists between the information content of each patch and the temporal resolution, requiring careful tuning depending on the modulation scheme and waveform complexity.
Batch Size
  • Figure 6 illustrates the influence of the batch size on the EVM. A smaller batch size introduces stochastic gradient noise during training, promoting generalization and helping the MAE learn diverse temporal patterns. Larger batch sizes tend to stabilize training but may lead to overfitting to a limited set of data patterns. A batch size of 32 was selected to balance memory efficiency and reconstruction performance in our implementation. A consolidated view of these settings is sketched after this list.
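For readers reimplementing the model, the settings discussed above and listed in Table 4 can be collected in a single configuration; the dictionary below is illustrative only, and the field names are our own rather than the authors' training script.

```python
# Illustrative configuration collecting the hyperparameters reported in Table 4 and in the text.
mae_config = {
    "patch_size": 16,            # samples per patch (Table 4: ablation over {8, 16, 32})
    "masking_ratio": 0.75,       # 0.75-0.85 gave the lowest EVM with acceptable BER
    "transformer_depth": 4,      # layers (evaluated over {2, 4, 6})
    "embedding_dim": 128,
    "attention_heads": 4,
    "dropout": 0.1,
    "batch_size": 32,            # chosen in Figure 6 to balance memory and EVM
    "epochs": 200,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}
```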
The practical feasibility of integrating the proposed masked autoencoder-based sparse coded mask modeling into real-world Li-Fi systems can be considered as follows:
  • Hardware Compatibility: The proposed method operates entirely in the digital baseband domain. The masking and reconstruction processes are applied to the digital representation of the DMT waveform before digital-to-analog conversion. Therefore, no modifications to the optical front-end (e.g., LEDs, photodiodes) are required, making the method highly compatible with existing Li-Fi transceiver hardware.
  • Integration into DSP Chains: The MAE-based reconstruction module can be embedded in the digital signal processing (DSP) pipeline on the receiver side. Given that many commercial Li-Fi systems already include FPGA- or SoC-based DSP platforms, deploying the model with an efficient inference engine (e.g., TensorRT, ONNX, or custom hardware accelerators) is feasible. Quantization-aware training can further reduce the computational overhead for edge deployment.
  • Latency and Complexity: The proposed system is based on a relatively lightweight transformer encoder–decoder architecture (4 layers, 128-dimensional embeddings), and the inference time per DMT frame is in the sub-millisecond range on modern embedded GPUs. Furthermore, the patch-based nature of the method supports parallelization across hardware threads (a minimal architecture sketch is given after this list).
  • Adaptability to Dynamic Environments: The masking ratio can be adjusted dynamically according to real-time channel conditions (e.g., under high interference or signal loss). The model's robustness to high masking ratios (up to 85%) provides flexibility and resilience under varying Li-Fi deployment scenarios (e.g., mobility, partial blockage).
  • Training and Update Strategy: Since the MAE model follows a self-supervised training scheme, it can be periodically retrained or fine-tuned using unlabeled waveform data collected in the field. This supports lifelong learning in deployed systems without manual annotation.
  • Energy Efficiency Considerations: Because the proposed method reduces the total number of transmitted samples via masking, it has the potential to reduce the average power consumption on the transmitter side, an important factor in energy-sensitive Li-Fi applications.
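As referenced in the latency item above, a minimal PyTorch sketch of a patch-based masked autoencoder sized to the reported figures (4 transformer layers, 128-dimensional embeddings, 4 attention heads, 1-D convolutional patch embedding with positional encoding) is shown below. It is an illustrative reimplementation under our own assumptions, not the authors' code; in particular, the per-batch shared mask, the decoder configuration, and the 142-sample patch length (2272 samples divided into 16 patches) are simplifications.

```python
# Illustrative patch-based masked autoencoder for DMT waveform reconstruction.
import torch
import torch.nn as nn

class DMTMaskedAutoencoder(nn.Module):
    def __init__(self, patch_len=142, num_patches=16, dim=128, heads=4, depth=4):
        super().__init__()
        # 1-D convolutional patch embedding: one token per non-overlapping patch
        self.embed = nn.Conv1d(1, dim, kernel_size=patch_len, stride=patch_len)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))      # learned positional encoding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))         # placeholder for masked patches
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.TransformerEncoder(layer, num_layers=depth)  # same layer spec, separate weights
        self.head = nn.Linear(dim, patch_len)                          # per-patch waveform reconstruction

    def forward(self, waveform, keep_mask):
        # waveform: (batch, 1, num_patches * patch_len); keep_mask: (num_patches,) bool, True = visible
        tokens = self.embed(waveform).transpose(1, 2) + self.pos       # (batch, P, dim)
        encoded = self.encoder(tokens[:, keep_mask])                   # encode only the visible patches
        full = self.mask_token.expand(tokens.size(0), tokens.size(1), -1).clone()
        full[:, keep_mask] = encoded                                   # masked slots keep the mask token
        return self.head(self.decoder(full + self.pos))                # (batch, P, patch_len)

model = DMTMaskedAutoencoder()
x = torch.randn(32, 1, 16 * 142)                  # a batch of 32 DMT blocks (stand-in data)
keep = torch.zeros(16, dtype=torch.bool)
keep[::4] = True                                  # keep 4 of 16 patches -> 75% masking ratio
reconstruction = model(x, keep)
print(reconstruction.shape)                        # torch.Size([32, 16, 142])
```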
In summary, the proposed model is both computationally tractable and practically deployable within existing Li-Fi system architectures.
The generalizability of the proposed technique also extends beyond DMT-based Li-Fi transmission. Since the masked autoencoder framework is particularly effective at reconstructing multivariate time-series data, the following potential applications are envisioned.
OFDM-based Visible Light Communications (VLC)
  • Similar to DMT, Orthogonal Frequency Division Multiplexing (OFDM) systems also encode signals over multiple subcarriers. The proposed masking and reconstruction strategy can be applied to compress OFDM waveforms for high-throughput VLC systems, especially in scenarios constrained by physical LED bandwidth.
mmWave and Terahertz Wireless Communications
  • In millimeter-wave (mmWave) and Terahertz (THz) systems, signal waveforms often suffer from path loss and hardware nonlinearities. Applying masked autoencoder-based compression and recovery could help enhance robustness and reduce the overhead associated with channel estimation or signal regeneration.
Underwater Optical Communications
  • The underwater environment exhibits strong multipath fading and scattering. The MAE-based technique can be adapted to reconstruct distorted waveforms received after transmission through turbid water, thereby improving effective throughput and error resilience.
Biomedical Time-Series Signal Compression
  • Electrocardiogram (ECG) and Electroencephalogram (EEG) signals are also examples of multivariate time-series data. Our method can be used for efficient compression and reconstruction of these signals in wearable devices with limited transmission bandwidth, while preserving signal integrity for downstream diagnostic use.
Structural Health Monitoring (SHM)
  • Sensor networks deployed for SHM generate large volumes of time-series vibration or strain data. By applying sparse coded masking, data can be compressed at the edge before transmission to a central server, reducing bandwidth consumption while retaining reconstruction fidelity for anomaly detection.

5. Conclusions

A new technique has been proposed to increase the transmission capacity of Li-Fi-based optical wireless links. The MAE-based sparse coded mask modeling technique presented in this paper was implemented through the following sequential steps. First, the entire length of the DMT signal waveform, encoded using QPSK or 16-QAM symbols, is divided equally into a predefined number of DMT patches. These DMT patches are then compressed according to a predefined masking ratio using the MAE technique, and the compressed patches are transmitted over the Li-Fi-based optical wireless channel. After optical wireless transmission, the received DMT patches are decoded to reconstruct the DMT signal and recover the original QPSK or 16-QAM symbols.
In summary, the experimental results demonstrate that both QPSK and 16-QAM symbols can be successfully transmitted over a 1-m optical wireless link with a masking ratio of up to 85%, meeting the first-generation GFEC EVM threshold. This indicates that the transmission capacity can be enhanced by up to 1.85 times for the 10 MHz physical bandwidth. In addition, the EVM of the QPSK and 16-QAM symbols, obtained by demodulating the DMT signal reconstructed from patches compressed with a 75% masking ratio, was measured repeatedly under varying received light intensity. The results confirm that the proposed technique is viable for Li-Fi networks operating in illuminance environments of 240 lux or higher.

Author Contributions

Conceptualization, Y.-Y.W. and S.M.Y.; methodology, H.H.; software, D.C.; validation, Y.-Y.W. and S.M.Y.; formal analysis, H.H.; investigation, H.H. and D.C.; resources, Y.-Y.W.; data curation, H.H. and D.C.; writing—original draft preparation, Y.-Y.W.; writing—review and editing, Y.-Y.W. and S.M.Y.; visualization, H.H.; supervision, Y.-Y.W.; project administration, Y.-Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Haas, H. LiFi is a paradigm-shifting 5G technology. Rev. Phys. 2018, 3, 26–31. [Google Scholar] [CrossRef]
  2. Albraheem, L.I.; Alhudaithy, L.H.; Aljaser, A.A.; Aldhaan, M.R.; Bahliwah, G.M. Toward designing a Li-Fi-based hierarchical IoT architecture. IEEE Access 2018, 6, 40811–40825. [Google Scholar] [CrossRef]
  3. Elamassie, M.; Miramirkhani, F.; Uysal, M. Performance characterization of underwater visible light communication. IEEE Trans. Commun. 2018, 67, 543–552. [Google Scholar] [CrossRef]
  4. Wu, X.; Safari, M.; Haas, H. Access point selection for hybrid Li-Fi and Wi-Fi networks. IEEE Trans. Commun. 2017, 65, 5375–5385. [Google Scholar] [CrossRef]
  5. Islim, M.S.; Ferreira, R.; He, X.; Xie, E.; Videv, S.; Viola, S.; Watson, S.; Bamiedakis, N.; Penty, R.; White, I.; et al. Towards 10 Gb/s orthogonal frequency division multiplexing-based visible light communication using a GaN violet micro-LED. Photonics Res. 2017, 5, A35–A43. [Google Scholar] [CrossRef]
  6. Binh, P.H.; Hung, N.T. High-speed visible light communications using ZnSe-based white light emitting diode. IEEE Photonics Technol. Lett. 2016, 28, 1948–1951. [Google Scholar] [CrossRef]
  7. Cossu, G.; Khalid, A.M.; Choudhury, P.; Corsini, R.; Ciaramella, E. 3.4 Gbit/s visible optical wireless transmission based on RGB LED. Opt. Express 2012, 20, B501–B506. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, Y.; Huang, X.; Tao, L.; Shi, J.; Chi, N. 4.5-Gb/s RGB-LED based WDM visible light communication system employing CAP modulation and RLS based adaptive equalization. Opt. Express 2015, 23, 13626–13633. [Google Scholar] [CrossRef] [PubMed]
  9. Shupeng, L.; Huang, H.; Zou, Y. A Novel Pre-Equalization Scheme for Visible Light Communications with Direct Learning Architecture. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference, Dubai, United Arab Emirates, 22 April 2024. [Google Scholar]
  10. Wang, D.; Song, Y.; Li, J.; Qin, J.; Yang, T.; Zhang, M.; Chen, X.; Boucouvalas, A.C. Data-driven Optical Fiber Channel Modeling: A Deep Learning Approach. J. Light. Technol. 2020, 38, 4730–4743. [Google Scholar] [CrossRef]
  11. Zhongya, L.; Shi, J.; Zhao, Y.; Li, G.; Chen, J.; Zhang, J.; Chi, N. Deep learning based end-to-end visible light communication with an in-band channel modeling strategy. Opt. Express 2022, 30, 28905–28921. [Google Scholar]
  12. Wu, X.; Huang, Z.; Ji, Y. Deep neural network method for channel estimation in visible light communication. Opt. Commun. 2020, 462, 12572. [Google Scholar] [CrossRef]
  13. Wenqing, N.; Chen, H.; Hu, F.; Shi, J.; Ha, Y.; Li, G.; He, Z.; Yu, S.; Chi, N. Neural-Network-Based Nonlinear Tomlinson-Harashima Precoding for Bandwidth-Limited Underwater Visible Light Communication. J. Light. Technol. 2022, 40, 2296–2306. [Google Scholar]
  14. Ulkar, M.G.; Baykas, T.; Pusane, A.E. VLCnet: Deep Learning Based End-to-End Visible Light Communication System. J. Light. Technol. 2020, 38, 5937–5948. [Google Scholar] [CrossRef]
  15. Tang, P.; Zhang, X. MTSMAE: Masked Autoencoders for Multivariate Time-Series Forecasting. In Proceedings of the 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), Macao, China, 1 November 2022. [Google Scholar]
  16. Zijiao, C.; Qing, J.; Xiang, T.; Yue, W.L.; Zhou, J. Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 20 June 2023. [Google Scholar]
  17. Gong, M.; Sun, J.; Xie, X.; Zheng, Y. Multivariate Time Series Prediction based on Improved Transformer Model in Computing System. In Proceedings of the 2023 2nd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE), Chengdu, China, 4 November 2023. [Google Scholar]
  18. Ma, Q.; Liu, Z.; Zheng, Z.; Huang, Z.; Zhu, S.; Yu, Z.; Kwok, J.T. A Survey on Time-Series Pre-Trained Models. IEEE Trans. Knowl. Data Eng. 2024, 36, 7536–7555. [Google Scholar] [CrossRef]
  19. Na, W.; Liu, K.; Zhang, W.; Xie, H.; Jin, D. Deep Neural Network with Batch Normalization for Automated Modeling of Microwave Components. In Proceedings of the 2020 IEEE MTT-S International Conference on Numerical Electromagnetic and Multiphysics Modeling and Optimization (NEMO), Hangzhou, China, 8 December 2020. [Google Scholar]
  20. Wu, S.; Li, G.; Deng, L.; Liu, L.; Wu, D.; Xie, Y.; Shi, L. L1-Norm Batch Normalization for Efficient Training of Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 2043–2051. [Google Scholar] [CrossRef] [PubMed]
  21. Lin, R. Analysis on the Selection of the Appropriate Batch Size in CNN Neural Network. In Proceedings of the 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE), Guilin, China, 26 February 2022. [Google Scholar]
Figure 1. Masking and reconstruction process of DMT signal using the proposed sparse coded masked modeling technique based on masked autoencoder (MAE). The DMT signal is segmented, masked, transmitted via Li-Fi link, and reconstructed at the receiver.
Figure 2. Experimental setup for Li-Fi-based optical wireless transmission. Top: Photographs of the physical experiment showing the transmitter (AWG, LED, LNA), receiver (APD, oscilloscope), and optical alignment. Insets include close-up views of the LED and APD modules. Bottom: Measured time-domain waveforms at key stages (AD), with zoomed-in views (A-1,D-1) highlighting temporal characteristics before and after reconstruction.
Figure 3. MSE curves of the proposed learning model against masking ratio.
Figure 4. Variation in measured EVM as the masking ratio increases from 15% to 95%. Filled squares: EVM of QPSK symbols. Open squares: EVM of 16-QAM symbols.
Table 1. Comparative analysis of existing techniques and our proposed one.
Other Techniques | Key Aspects of Other Techniques | Differences from Our Proposed Technique
[9] S. Li et al. “A Novel Pre-Equalization Scheme for Visible Light Communications with Direct Learning Architecture”
  • Introduces a direct learning + pre-equalization strategy for LED-based Visible Light Communication (VLC).
  • A deep learning model applies pre-equalization at the transmitter so the receiver only needs simplified compensation.
  • Our Approach: We apply an MAE-based method that physically ‘masks out’ up to 75–85% of the DMT waveform before transmission and then reconstruct it at the receiver.
  • By contrast, [9] transmits the entire waveform and focuses on reducing channel distortion, we shorten the time-domain signal itself.
  • This effectively increases channel capacity by transmitting fewer samples within the same bandwidth—an entirely different perspective.
[10] D. Wang et al. “Data-driven Optical Fiber Channel Modeling: A Deep Learning Approach”
  • Proposes a deep neural network (DNN) to perform channel modeling and noise compensation for optical fiber transmission.
  • Primarily addresses nonlinear channel distortions over long-distance fiber links.
  • Our Approach: Targets short-range Li-Fi (visible light) rather than fiber.
  • Uses a self-supervised MAE framework that can reconstruct heavily masked (deleted) DMT patches.
  • Unlike fiber modeling, we focus on actual time-domain compression (masking) of the waveform to enhance throughput—this is not merely channel compensation but a fundamental signal-length reduction that increases capacity.
[11] Z. Li et al. “Deep learning based end-to-end visible light communication with an in-band channel modeling strategy”
  • Proposes an end-to-end deep learning framework that integrates a channel model inside the network.
  • Avoids separate channel estimation steps by embedding channel effects internally, thereby performing direct signal distortion compensation in a single pass.
  • Our Approach: Not merely mapping inputs to outputs in an end-to-end manner; we delete (mask) a portion of the time-domain signal and then reconstruct it.
  • This results in a reduced physical signal length, thereby boosting capacity, rather than just improving channel equalization.
  • We physically transmit only part of the waveform—a fundamentally different approach from standard end-to-end solutions that use the entire waveform.
[12] X. Wu et al. “Deep neural network method for channel estimation in visible light communication”
  • Uses CNN/DNN-based channel estimation to mitigate channel impairments in VLC.
  • Focuses on accurate channel response prediction and equalization.
  • Emphasizes re-creating the original signal without compressing it.
  • Our Approach: We deliberately mask (delete) large chunks of the DMT waveform, then employ MAE to reconstruct it at the receiver, thereby maximizing data throughput.
  • Prior works assume the original waveform is fully transmitted; by contrast, we send only a partial signal.
  • We also leverage multivariate time-series patch structures, using global context to fill in missing data, which diverges from straightforward channel estimation.
[13] W. Na et al. “Neural-Network-Based Nonlinear Tomlinson-Harashima Precoding for Bandwidth-Limited Underwater Visible Light Communication”
  • Addresses nonlinear distortion in underwater (UW) VLC links, combining Tomlinson-Harashima Precoding with DNN.
  • Optimizes power efficiency under severe underwater channel conditions (scattering, absorption, etc.).
  • Our Approach: Focuses on indoor Li-Fi over 1 m rather than underwater channels and uses MAE-based compression at up to 85% masking.
  • We do not merely “compensate for channel distortions”. Instead, we reduce the time-domain signal length (i.e., actual sample count) while preserving data integrity.
  • This leads to a 1.85× capacity gain within the same 10 MHz bandwidth.
[14] M. G. Ulkar et al. “VLCnet: Deep Learning Based End-to-End Visible Light Communication System”
  • Proposes an end-to-end CNN-based framework called VLCnet.
  • Uses regression-based modeling to compensate for VLC link noise and distortions.
  • Shows potential gains in a simulation environment but does not physically remove significant portions of the signal in a compression sense.
  • Our Approach: Implements time-domain compression by adjusting the “masking ratio” (up to 75–85%), so that only a fraction of patches is sent.
  • We perform 1-D convolution + positional encoding on subdivided “patches”, enabling the MAE decoder to reconstruct missing sections based on global self-attention.
  • Compared to CNN-based end-to-end solutions, our MAE can handle high masking ratios and still yield good EVM performance.
Table 2. The conceptual and qualitative comparison below, focusing on performance attributes.
Method | Compression Capability | Reconstruction Ability | Modulation Compatibility | Adaptability | Hardware Cost | Representative References
Adaptive equalization
  • None
  • Channel distortion compensation
  • QPSK, 16-QAM, etc.
  • Requires parameter tuning
  • Medium
Wang et al., J. Lightwave Technol., 2020 [10]
CAP modulation (Carrierless amplitude modulation)
  • None
  • High-speed modulation/demodulation
  • 16/64-QAM
  • Fixed architecture
  • High
Wang et al., Opt. Express, 2015 [8]
CNN/FNN-based channel modeling or demodulation
  • Limited (mainly feature extraction)
  • Local recovery possible
  • QPSK, 16-QAM
  • Focused on local features
  • High
Ulkar et al., J. Lightwave Technol., 2020 [14]
Proposed MAE-based sparse mask modeling
  • Up to 85% masking
  • Full DMT waveform reconstruction
  • QPSK, 16-QAM
  • Self-supervised, generalizable
  • Low (offline learning)
This paper
Table 3. Li-Fi link parameters.
Parameter | Value/Specification | Component/Reference
Transmitter LED | OSRAM® LUW W5AM, ThinGaN, SMT package | OSRAM datasheet
Peak luminous output | 116 lm | Manufacturer specification
Emission spectrum | 400–700 nm (visible white light) | ThinGaN LED technology
Optical beam divergence angle | ~170° (FWHM) | Based on LED viewing angle at 50% light output
Transmitting lens | Biconvex lens, focal length: 60 mm, diameter: 50.8 mm | Thorlabs LB1723-A
Lens spectral passband | 350–700 nm | Thorlabs LB1723-A
Estimated transmitted optical power | ~10 mW (estimated from luminous flux and beam divergence) | Derived from 116 lm specification
Bias-tee (LED driver coupling) | Mini-Circuits® ZFBT-6GW-FT+, 100 kHz–6 GHz, insertion loss: 0.15 dB, max power: 30 dBm | Mini-Circuits® datasheet
Free-space transmission distance | 1 m | -
Receiver photo-detector | Avalanche photodiode (APD), measured illumination: 300 lux | Hamamatsu C5330-11
Receiver field of view (FOV) | 30° | -
Operating illuminance environment | ≥240 lux | Minimum required for successful transmission
Table 4. Key hyperparameters of sparse code mask modeling.
Key Hyperparameter | Value
Patch size | 16 samples (selected via ablation over {8, 16, 32})
Masking ratio | 75–85% (optimized based on lowest EVM with acceptable BER)
Transformer depth | 4 layers (evaluated over {2, 4, 6})
Embedding dimension | 128
Number of attention heads | 4
Dropout rate | 0.1
Training epochs | 200
Optimizer | Adam with learning rate = 1 × 10−4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
