1. Introduction
Frequency-hopping (FH) communication is a spread-spectrum technique that rapidly switches the carrier frequency according to a pseudo-random sequence, providing robustness against narrowband interference, multipath fading, and interception [1, 2]. FH is therefore widely used in military tactical radios and in civilian Internet-of-Things networks [3]. For non-cooperative receivers, however, the hopping sequence is unknown, and the electromagnetic environment typically contains fixed-frequency interference, sweep jamming, and overlapping emissions from multiple FH networks, making reliable FH detection difficult [4, 5].
FH signal detection constitutes the first step of communication reconnaissance; its accuracy directly constrains subsequent parameter estimation [6] and network-station sorting [7, 8]. Traditional approaches, including energy detection, time-frequency analysis with fixed thresholds, and clustering methods [9, 10], rely on hand-crafted features and degrade at low signal-to-noise ratios (SNRs), where noise energy overwhelms the hop pulses. Sparse Bayesian reconstruction [2, 11] improves noise tolerance by modeling the spectrum as a superposition of a few active frequencies, but its runtime scales poorly with the number of simultaneous hops and emitters. Statistical estimators based on cyclostationary analysis [7] and blind source separation [8] can handle multi-emitter environments but require either strict orthogonality assumptions or prior knowledge of the number of active sources, conditions that are rarely satisfied in congested electromagnetic environments. Parameter estimation via deep neural networks [6, 12] reduces reliance on hand-crafted features but still needs an upstream detection step to localize hops before estimating their parameters, leaving the low-SNR detection problem unresolved.
Low-SNR failures share consistent mechanisms across these pipelines. Energy and fixed-threshold methods suffer from threshold drift: small changes in the noise floor flip detections on or off, leading to fragmented hop rectangles. Time-frequency clustering becomes unstable because the within-cluster variance is dominated by noise rather than hop energy. Deep detectors trained on higher SNR regimes encounter low contrast and boundary blur, which inflate false positives on noise bands and shrink hop boxes during localization.
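The fragmentation caused by threshold drift can be illustrated with a minimal sketch. The power trace, the threshold values, and the helper name `detect_runs` are hypothetical and chosen only for illustration; they are not taken from the dataset or from any of the cited methods.

```python
def detect_runs(power_db, threshold_db):
    """Return (start, end) index pairs of contiguous samples above a fixed threshold."""
    runs, start = [], None
    for i, p in enumerate(power_db):
        if p > threshold_db and start is None:
            start = i                      # a new run begins
        elif p <= threshold_db and start is not None:
            runs.append((start, i - 1))    # the current run ends
            start = None
    if start is not None:                  # run reaches the end of the trace
        runs.append((start, len(power_db) - 1))
    return runs


# Hypothetical power trace (dB) over time in one frequency bin: noise, one hop, noise.
trace = [-3.0, -2.5, 4.2, 3.1, 4.0, 2.9, 4.1, -2.8, -3.1]

print(detect_runs(trace, threshold_db=2.0))  # the hop is detected as one run
print(detect_runs(trace, threshold_db=3.0))  # a 1 dB shift splits the same hop
```

In this toy trace, a 1 dB change in the effective threshold relative to the hop energy is enough to break one hop into two fragments, which is the behavior described above.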
In recent years, YOLO-based object detectors have been applied to time-frequency spectrograms to bypass explicit parameter estimation. Huang et al. [13] combined cyclic Wigner distribution images with YOLOv5 for multi-emitter sorting under low SNR. Jiang et al. [14] proposed DFN-YOLO for narrowband signal detection in broadband spectra. Kang et al. [15] employed YOLOv8 for joint detection and classification of communication and radar signals. Chen et al. [16] applied YOLOv3 to FH signal identification. Wang et al. [12] used a deep neural network (DNN)-based detector for FH parameter estimation. Zhu et al. [17] showed that spectrograms alone suffice for variable-speed FH signal sorting. These studies show that YOLO-family detectors can locate signal features in time-frequency images. However, most of these works adopt off-the-shelf YOLO architectures without modifying the network to account for the specific structure of time-frequency data.
The YOLO architecture has advanced rapidly through anchor-free detection [18], structural re-parameterization [19, 20], programmable gradient information [21], and modular training design [22, 23]. Transformer-based real-time detectors such as the Real-Time Detection Transformer (RT-DETR) [24] have also reached competitive accuracy with single-stage methods. These advances target natural-image benchmarks, however, and do not address the directional noise structure or multi-scale hop patterns specific to time-frequency spectrograms.
Directly applying general-purpose detectors to FH spectrograms leaves two domain-specific problems unaddressed. First, background noise in spectrograms is correlated along the frequency axis due to the windowing sidelobe effect of the short-time Fourier transform (STFT), yet standard convolution kernels apply isotropic smoothing across both axes, mixing noise into temporal features. Second, FH hops from different emitters span a wide range of bandwidths and durations; a single receptive field cannot simultaneously capture both narrowband fast hops and wideband slow hops. Deeper or wider networks may address these issues in principle, but introduce overfitting risk on the typically moderate-sized FH datasets [25].
This paper proposes YOLO11-FH, a modified YOLO11 detector that introduces two signal-processing-motivated modules into the convolutional backbone, together with a lightweight head design. The domain-driven novelty is twofold: (i) we translate STFT noise-correlation physics into a directional smoothing operator (FSB) and (ii) we match hop-scale diversity with parallel receptive fields (TFMRB). The remaining changes (LightHead) are pragmatic adaptations to the single-class setting rather than new architectural claims.
FreqSmoothBlock (FSB): A depthwise convolution with a 3-tap kernel oriented along the frequency axis is inserted at the shallowest backbone stage (P2/4) to perform directional smoothing exclusively along that axis, adding negligible parameters ( per block, where C is the channel count) while suppressing frequency-axis noise before it propagates to deeper layers.
TFMultiResBlock (TFMRB): Three parallel convolution branches with dilation rates of 1, 2, and 3 produce effective receptive fields of , , and to cover the typical scale range of FH hop rectangles; this module replaces a heavier C3k2 block at the P3/8 stage, reducing backbone parameters while improving multi-scale representation.
A LightHead design further simplifies the detection head by reducing the Bottleneck repeat count from 2 to 1 in all C3k2 modules and disabling the deep C3 submodule at the P5 scale, lowering head redundancy for the single-class FH detection task.
Experiments on a simulated dataset with SNRs ranging from dB to dB show that YOLO11-FH achieves an mAP@0.5 of 96.04% with 2.51 M parameters. Ablation studies confirm that each module contributes positively to the overall performance improvement.
The remainder of this paper is organized as follows. Section 2 defines the signal model and STFT preprocessing. Section 3 details the proposed YOLO11-FH architecture. Section 4 presents experimental results, including comparisons with recent YOLO variants and ablation studies. Section 5 discusses the findings and limitations, and Section 6 concludes the paper.
5. Discussion
5.1. Emitter-Count Robustness
Figure 7 and Table 6 summarize accuracy as the number of simultaneous emitters increases from 3 to 10 on the test split (120 images per setting). The mAP@0.5 curve drops steadily, from 0.995 at three emitters to 0.828 at ten emitters, consistent with more hop collisions and heavier time-frequency overlap. The mAP@0.5:0.95 metric and precision/recall values follow the same downward trend, indicating that both localization quality and detection coverage are affected as the scene becomes more congested. Even so, the model retains an mAP@0.5 of 0.828 at ten emitters, which suggests that the proposed backbone modifications remain effective under dense multi-emitter conditions.
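The distinction between mAP@0.5 and mAP@0.5:0.95 rests on the intersection-over-union (IoU) threshold used to accept a predicted box. The sketch below uses the standard IoU definition for axis-aligned boxes; the box coordinates are hypothetical and only illustrate how a prediction that is shortened along the time axis can still count as a hit at a threshold of 0.5 while failing stricter thresholds.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical hop rectangle (time on x, frequency on y) and a prediction
# that covers only 70% of its duration.
truth = (0.0, 0.0, 10.0, 4.0)
shrunk = (1.5, 0.0, 8.5, 4.0)

score = iou(truth, shrunk)
print(score >= 0.5)   # accepted at the mAP@0.5 threshold
print(score >= 0.75)  # rejected at a stricter threshold inside the 0.5:0.95 sweep
```

Because heavier overlap between emitters tends to distort box extents, such partially correct boxes affect mAP@0.5:0.95 before they affect mAP@0.5.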
5.2. Real-World Validation on RFUAV Subset
The model was fine-tuned on a small real-world RFUAV subset and evaluated on its validation split (train: 762 images, val: 110 images) [39]. Table 7 summarizes the validation results after 20 epochs: mAP@0.5 of 0.8519, mAP@0.5:0.95 of 0.5860, precision of 0.9182, and recall of 0.7888. This provides a first check of transfer to real RF spectrograms derived from UAV signals, while the limited validation size suggests that a larger held-out set would yield a more stable estimate.
5.3. Interpretation of FSB
FSB can be viewed as a learnable one-dimensional filter applied along the frequency axis. If the batch normalization scaling factor approaches unity and the SiLU activation operates near its linear region, the depthwise convolution approximates a 3-tap weighted average, a classical noise suppression operator. Unlike a fixed filter, the coefficients of FSB are optimized end-to-end to minimize detection loss, making them adaptive to the noise statistics of the training data. Visualizing the learned kernel weights and comparing FSB against a fixed 3-tap moving average filter would provide direct evidence for this interpretation; we leave this analysis for future work.
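The fixed-filter limit described above can be sketched in a few lines. This is not the FSB implementation: the taps are fixed rather than learned, batch normalization and SiLU are omitted, the array layout `spec[f][t]` (frequency rows, time columns) and the edge replication are assumptions, and the names `smooth_frequency` and `noise_power_gain` are hypothetical.

```python
def smooth_frequency(spec, weights=(0.25, 0.5, 0.25)):
    """3-tap weighted average along the frequency axis of spec[f][t].

    Edge bins are replicated so the output keeps the input shape.
    Samples from different time frames are never mixed.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = []
    for f in range(n_freq):
        below = spec[max(f - 1, 0)]
        here = spec[f]
        above = spec[min(f + 1, n_freq - 1)]
        out.append([weights[0] * below[t] + weights[1] * here[t] + weights[2] * above[t]
                    for t in range(n_time)])
    return out


def noise_power_gain(weights):
    """Output/input noise power ratio for uncorrelated noise: sum of squared taps."""
    return sum(w * w for w in weights)


# Hypothetical 5-bin x 3-frame patch with one isolated spike at bin 2, frame 1.
patch = [[0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0],
         [0.0, 4.0, 0.0],
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]

smoothed = smooth_frequency(patch)
print([row[1] for row in smoothed])          # spike spread over bins 1 to 3 of frame 1
print(noise_power_gain((0.25, 0.5, 0.25)))   # power ratio for uncorrelated noise
```

The spike is spread across neighboring frequency bins of the same frame while the adjacent frames stay untouched, which is the directional behavior attributed to FSB. The power-gain figure holds only for uncorrelated noise; for the frequency-correlated noise of an STFT the reduction would differ.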
The chosen 3-tap kernel covers three adjacent frequency bins, which is narrower than the main-lobe width of the Hamming window (∼4 bins). This is treated as a conservative lower bound that reduces correlated noise while protecting narrowband hop edges. A wider kernel, such as or , might capture more of the correlated noise structure but could also blur narrowband hop edges. Ablation over the kernel size was not conducted in this study; this is a limitation that warrants further investigation.
FSB is applied after logarithmic compression and grayscale mapping, so the feature space is nonlinear relative to raw STFT magnitudes. FSB is thus interpreted as a learnable suppressor of frequency-axis correlation in the CNN feature space, rather than as a fixed filter tied to raw spectral statistics. The consistent gains in Table 5 suggest that the learned directional smoothing remains effective despite the nonlinearity.
The residual structure also offers a practical training property. At initialization, the depthwise weights are drawn from a near-zero distribution, so the branch output is small and the module output approximates the identity. This means FSB can be inserted into a partly trained YOLO backbone, or fine-tuned on a downstream FH dataset, without disrupting the existing feature statistics. The same property allows the module to be added to any standard STFT-based pipeline at negligible engineering cost. He et al. [31] showed formally that zero-initialized residual branches leave the gradient flow unchanged at the start of training; FSB inherits this stability by construction.
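The identity-at-initialization property can be checked numerically with a minimal sketch. The layer order assumed here (3-tap convolution, then SiLU, then a residual addition, with batch normalization treated as an identity map) is an assumption for illustration and may differ from the actual FSB definition; `residual_smooth` and the sample column are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def residual_smooth(column, weights):
    """x + SiLU(3-tap conv(x)) along one frequency column, zero-padded.

    Batch normalization is omitted (assumed to act as the identity).
    """
    padded = [0.0] + list(column) + [0.0]
    out = []
    for i, x in enumerate(column):
        branch = (weights[0] * padded[i]
                  + weights[1] * padded[i + 1]
                  + weights[2] * padded[i + 2])
        out.append(x + silu(branch))
    return out


# Hypothetical feature values along the frequency axis at one time frame.
column = [0.2, 1.5, 3.0, 1.4, 0.1]

print(residual_smooth(column, (0.0, 0.0, 0.0)))     # zero taps: output equals input
print(residual_smooth(column, (0.01, 0.01, 0.01)))  # near-zero taps: small perturbation
```

With all taps at zero the branch contributes nothing and the block passes its input through unchanged; with near-zero taps the output deviates only slightly, which is why such a block can be inserted into an existing backbone without shifting its feature statistics much.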
5.4. TFMRB Versus Deeper Backbones
An alternative to multi-resolution enhancement is making the backbone deeper. For FH spectrogram detection with a moderate amount of training data (9600 images), deeper networks risk overfitting. TFMRB obtains multi-scale representation within a single layer by parallelizing branches of different dilation rates [34], adding receptive-field diversity without increasing depth. The parameter count is lower than the C3k2 block it replaces, supporting its efficiency. Modern convolutional architectures such as ConvNeXt [40, 41] and EfficientNetV2 [42] have demonstrated that well-designed CNN backbones match vision transformers when equipped with improved training recipes; a similar principle underlies TFMRB, which achieves multi-scale representation through structural design rather than increased depth.
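The relationship between dilation rate and receptive field follows the standard formula k_eff = k + (k - 1)(d - 1) for a kernel of size k and dilation d. The sketch below evaluates it for the three dilation rates used by TFMRB; the base kernel size of 3 and the channel count of 64 are assumptions made for illustration, since neither value is stated in this section.

```python
def effective_kernel(kernel, dilation):
    """Spatial span covered by a dilated convolution kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def conv_weights(c_in, c_out, kernel):
    """Weight count of a standard 2D convolution; it does not depend on dilation."""
    return c_in * c_out * kernel * kernel


# Assumed base kernel size (3) and channel count (64), for illustration only.
for d in (1, 2, 3):
    span = effective_kernel(3, d)
    print(f"dilation {d}: span {span}x{span}, weights {conv_weights(64, 64, 3)}")
```

Under these assumptions each branch covers a wider span at the same weight count, which is the sense in which parallel dilated branches add receptive-field diversity without adding depth.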
5.5. Jamming Signal Annotation
In the current dataset, jamming signals (LFM, NAM, NFM, RF noise, and pulsed) are not annotated and are treated as background clutter. This means the detector must implicitly learn to distinguish FH hop rectangles from spectrally similar interference patterns. At low SNR, certain jamming types, particularly narrowband FM (NFM) and pulsed interference, can produce time-frequency signatures that partially overlap with FH hops in shape, potentially increasing false positives. The 96.04% mAP@0.5 suggests the detector handles this confusion adequately on the simulated data, but a dedicated study of per-jamming-type false positive rates would strengthen the evaluation. The steepest absolute performance gap between and dB coincides with the SNR range where narrowband interference begins to rival the hop pulse energy. Among the five jamming types, pulsed interference and narrowband FM are the most compact in the time-frequency plane and share the rectangular footprint of FH hops, differing mainly in duration. LFM sweeps span a continuous frequency range and are less likely to be mistaken for the short, band-limited hop pulses; the detector is expected to handle LFM correctly in the majority of images at all tested SNR levels. Quantifying per-type confusion requires re-annotating the dataset with jamming labels, a straightforward extension of the existing annotation pipeline. This would enable analysis of which interference type is the dominant source of false positives at each SNR and would clarify whether FSB or TFMRB has a greater impact on suppressing specific interference patterns. In real-world scenarios where novel interference types may appear, explicit multi-class annotation (FH pulse vs. specific interference types) could improve detection robustness.
Regarding narrowband FM (NFM) interference, FSB smooths correlated frequency bins but does not impose hop-like geometry; empirically, the module reduces horizontal band noise without increasing false positives on NFM-like clutter. Future work will quantify this explicitly with per-jamming-type false positive rates.
5.6. Limitations
The evaluation has several limitations that should be acknowledged. All 9600 images were generated using MATLAB simulation; the AWGN-plus-jamming model covers six SNR levels and five interference types, but real electromagnetic environments exhibit non-stationary noise, device-specific thermal noise, and spectral artifacts not present in the simulation. Validation on captured FH signals, for example, using software-defined radio recordings, is necessary to confirm practical utility. Beyond the simulated-data constraint, only FH-versus-background detection is addressed; multi-class detection (e.g., distinguishing FH from radar pulses) and simultaneous emitter identification remain unexplored directions. The neck uses isotropic upsampling in the FPN, which may blur sharp temporal edges for extreme aspect-ratio hops. Exploring anisotropic or deformable upsampling is a promising extension. Finally, all experiments were conducted at the nano scale, so behavior at small, medium, or large model sizes may differ. Explicit noise-model mismatch and non-instantaneous switching simulations were not evaluated, which limits assessment of robustness to real hardware effects.
5.7. Applicability to Other Spectrogram-Based Detection
The noise structure addressed by FSB and the scale diversity addressed by TFMRB arise from general properties of STFT spectrograms rather than FH-specific signal characteristics. Any spectrogram computed with a windowed Fourier transform exhibits main-lobe frequency-axis noise correlation; the correlation length is set by the window type and overlap ratio [26] and is independent of the modulation class being observed. Signals with variable bandwidth and duty cycle, including radar pulses, orthogonal frequency-division multiplexing (OFDM) subcarriers, and satellite transponder emissions, produce the same multi-scale bounding-box challenge that TFMRB was designed to address.
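The dependence of the frequency-axis correlation on the window alone can be made concrete. For white noise passed through a windowed DFT, the covariance between bins k and k + m is proportional to the DFT of the squared window evaluated at lag m, so it does not depend on the signal content. The sketch below evaluates this for a Hamming window; the window length of 256, the periodic window definition, and the absence of zero-padding are assumptions and need not match the preprocessing used in this paper.

```python
import cmath
import math


def periodic_hamming(n):
    """Periodic Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def bin_correlation(window, lag):
    """Normalized correlation magnitude between DFT bins k and k + lag
    when the input is white noise multiplied by `window`."""
    n = len(window)
    num = sum(w * w * cmath.exp(-2j * math.pi * lag * i / n)
              for i, w in enumerate(window))
    den = sum(w * w for w in window)
    return abs(num) / den


window = periodic_hamming(256)  # assumed window length, for illustration
for lag in (1, 2, 3):
    # Under these assumptions: roughly 0.63 at lag 1, 0.13 at lag 2, ~0 at lag 3.
    print(lag, round(bin_correlation(window, lag), 3))
```

Under these assumptions, noise in adjacent bins is strongly correlated, bins two apart are weakly correlated, and the correlation vanishes beyond that, regardless of which signal class occupies the band. Zero-padding or a different window would change these values.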
In practice, integrating either module into an existing YOLO-based spectrogram pipeline requires only channel-count alignment. TFMRB replaces a single C3k2 block at P3/8; FSB is inserted after the second strided convolution at P2/4. Neither change touches the neck, the detection heads, or the loss function. Applied to the joint communication-and-radar detection scenario of Kang et al. [15], for instance, both modules could be dropped in without redesigning the training pipeline. Whether the mAP gains transfer quantitatively to other signal classes remains an open empirical question, but the low integration cost makes the experiment straightforward.
6. Conclusions
This paper presented YOLO11-FH, a modified YOLO11 architecture for frequency-hopping signal detection in time-frequency spectrograms. Two modules, FreqSmoothBlock (frequency-axis depthwise convolution) and TFMultiResBlock (parallel dilated convolutions at rates of 1, 2, and 3), were inserted into the backbone, and the detection head was simplified. On a simulated dataset spanning to dB SNR with five jamming types, YOLO11-FH reached 96.04% mAP@0.5 and 76.18% mAP@0.5:0.95, outperforming the YOLO11n baseline by 0.95 and 2.91 pp. The accuracy–efficiency bubble chart (Figure 4) showed that the proposed model breaks through the performance ceiling of unmodified YOLO variants on this task, while the horizontal ablation bar chart (Figure 5) confirmed that mAP@0.5:0.95 improves roughly three times more than mAP@0.5 per module addition, pointing to localization quality as the primary beneficiary. Comparisons with YOLOv10n, YOLO12n, YOLOv13n, YOLO26n, and RT-DETR under identical training conditions confirmed the advantage of domain-specific modules over general-purpose architectural advances. Section 5.7 argued that both modules address structural properties common to all STFT spectrograms, making them applicable to other signal-detection tasks with minimal integration effort.
The current evaluation is limited to simulated data. Future work will (1) validate the approach on real FH signals captured by software-defined radio, (2) ablate the FSB kernel size and visualize the learned filter coefficients, (3) extend to multi-class detection with simultaneous emitter identification, (4) compare with existing FH-specific detection methods under a standardized benchmark, (5) evaluate noise-model mismatch and non-instantaneous switching effects, and (6) evaluate the transfer of FSB and TFMRB to other spectrogram-based detection tasks such as radar pulse and OFDM signal detection.