YOLO11-FH: Frequency-Axis Smoothing and Multi-Resolution Enhancement for Frequency-Hopping Signal Detection in Low-SNR Spectrograms

Zhu, Huijie; Wang, Wei; Yang, Cui; Xiang, Youjun; Li, Jiawei; Xu, Yuheng

doi:10.3390/signals7030048


by Huijie Zhu ¹, Wei Wang ¹, Cui Yang ²,*, Youjun Xiang ², Jiawei Li ² and Yuheng Xu ²

¹ National Key Laboratory of Electromagnetic Space Security, Jiaxing 314033, China

² School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, China

* Author to whom correspondence should be addressed.

Signals 2026, 7(3), 48; https://doi.org/10.3390/signals7030048

Submission received: 23 March 2026 / Revised: 8 May 2026 / Accepted: 19 May 2026 / Published: 25 May 2026


Abstract

Frequency-hopping (FH) signals appear as small rectangular pulses in time-frequency spectrograms. At low signal-to-noise ratios (SNRs), noise along the frequency axis, caused by short-time Fourier transform (STFT) spectral leakage, blurs pulse boundaries, while the varying scales of hop rectangles exceed the capacity of a single receptive field. This paper presents YOLO11-FH, a modified YOLO11 detector that introduces two signal-processing-motivated modules. A FreqSmoothBlock (FSB) uses a $(3, 1)$ depthwise convolution to smooth exclusively along the frequency axis, while adding only $5C$ parameters. A TFMultiResBlock (TFMRB) fuses three parallel dilated convolution branches (dilation rates of 1, 2, and 3) to cover different hop scales, replacing a heavier C3k2 module. The detection head is further simplified by halving the Bottleneck repeat count and disabling the deep submodule at the P5 scale. On a simulated FH dataset (SNRs ranging from $-15$ dB to $-10$ dB, five jamming types), YOLO11-FH achieves 96.04% mean average precision (mAP)@0.5 and 76.18% mAP@0.5:0.95, outperforming the YOLO11n baseline by 0.95 and 2.91 percentage points (pp) with 2.9% fewer parameters.

Keywords:

frequency-hopping signal detection; time-frequency spectrogram; YOLO11; depthwise convolution; multi-resolution feature extraction; low SNR

1. Introduction

Frequency-hopping (FH) communication is a spread-spectrum technique that rapidly switches the carrier frequency according to a pseudo-random sequence, providing robustness against narrowband interference, multipath fading, and interception [1,2]. FH is therefore widely used in military tactical radios and in civilian Internet-of-Things networks [3]. For non-cooperative receivers, however, the hopping sequence is unknown, and the electromagnetic environment typically contains fixed-frequency interference, sweep jamming, and overlapping emissions from multiple FH networks, making reliable FH detection difficult [4,5].

FH signal detection constitutes the first step of communication reconnaissance; its accuracy directly constrains subsequent parameter estimation [6] and network-station sorting [7,8]. Traditional approaches, including energy detection, time-frequency analysis with fixed thresholds, and clustering methods [9,10], rely on hand-crafted features and degrade at low signal-to-noise ratios (SNRs), where noise energy overwhelms the hop pulses. Sparse Bayesian reconstruction [2,11] improves noise tolerance by modeling the spectrum as a superposition of a few active frequencies, but its runtime scales poorly with the number of simultaneous hops and emitters. Statistical estimators based on cyclostationary analysis [7] and blind source separation [8] can handle multi-emitter environments but require either strict orthogonality assumptions or prior knowledge of the number of active sources, conditions that are rarely satisfied in congested electromagnetic environments. Parameter estimation via deep neural networks [6,12] reduces reliance on hand-crafted features but still needs an upstream detection step to localize hops before estimating their parameters, leaving the low-SNR detection problem unresolved.

Low-SNR failures share consistent mechanisms across these pipelines. Energy and fixed-threshold methods suffer from threshold drift: small changes in the noise floor flip detections on or off, leading to fragmented hop rectangles. Time-frequency clustering becomes unstable because the within-cluster variance is dominated by noise rather than hop energy. Deep detectors trained on higher SNR regimes encounter low contrast and boundary blur, which inflate false positives on noise bands and shrink hop boxes during localization.

In recent years, YOLO-based object detectors have been applied to time-frequency spectrograms to bypass explicit parameter estimation. Huang et al. [13] combined cyclic Wigner distribution images with YOLOv5 for multi-emitter sorting under low SNR. Jiang et al. [14] proposed DFN-YOLO for narrowband signal detection in broadband spectra. Kang et al. [15] employed YOLOv8 for joint detection and classification of communication and radar signals. Chen et al. [16] applied YOLOv3 to FH signal identification. Wang et al. [12] used a deep neural network (DNN) based detector for FH parameter estimation. Zhu et al. [17] showed that spectrograms alone suffice for variable-speed FH signal sorting. These studies show that YOLO-family detectors can locate signal features in time-frequency images. However, most of these works adopt off-the-shelf YOLO architectures without modifying the network to account for the specific structure of time-frequency data.

The YOLO architecture has advanced rapidly through anchor-free detection [18], structural re-parameterization [19,20], programmable gradient information [21], and modular training design [22,23]. Transformer-based real-time detectors such as the Real-Time Detection Transformer (RT-DETR) [24] have also reached competitive accuracy with single-stage methods. These advances target natural-image benchmarks, however, and do not address the directional noise structure or multi-scale hop patterns specific to time-frequency spectrograms.

Directly applying general-purpose detectors to FH spectrograms leaves two domain-specific problems unaddressed. First, background noise in spectrograms is correlated along the frequency axis due to the windowing sidelobe effect of the short-time Fourier transform (STFT), yet standard convolution kernels apply isotropic smoothing across both axes, mixing noise into temporal features. Second, FH hops from different emitters span a wide range of bandwidths and durations; a single receptive field cannot simultaneously capture both narrowband fast hops and wideband slow hops. Deeper or wider networks may address these issues in principle, but introduce overfitting risk on the typically moderate-sized FH datasets [25].

This paper proposes YOLO11-FH, a modified YOLO11 detector that introduces two signal-processing-motivated modules into the convolutional backbone, together with a lightweight head design. The domain-driven novelty is twofold: (i) we translate STFT noise-correlation physics into a directional smoothing operator (FSB) and (ii) we match hop-scale diversity with parallel receptive fields (TFMRB). The remaining changes (LightHead) are pragmatic adaptations to the single-class setting rather than new architectural claims.

FreqSmoothBlock (FSB): A depthwise convolution with kernel size $(3, 1)$ is inserted at the shallowest backbone stage (P2/4) to perform directional smoothing exclusively along the frequency axis, adding negligible parameters ($5C$ per block, where $C$ is the channel count) while suppressing frequency-axis noise before it propagates to deeper layers.
TFMultiResBlock (TFMRB): Three parallel $3 \times 3$ convolution branches with dilation rates of 1, 2, and 3 produce effective receptive fields of $3 \times 3$, $5 \times 5$, and $7 \times 7$ to cover the typical scale range of FH hop rectangles; this module replaces a heavier C3k2 block at the P3/8 stage, reducing backbone parameters while improving multi-scale representation.

A LightHead design further simplifies the detection head by reducing the Bottleneck repeat count from 2 to 1 in all C3k2 modules and disabling the deep C3 submodule at the P5 scale, lowering head redundancy for the single-class FH detection task.

Experiments on a simulated dataset with SNRs ranging from $-15$ dB to $-10$ dB show that YOLO11-FH achieves an mAP@0.5 of 96.04% with 2.51 M parameters. Ablation studies confirm that each module contributes positively to the overall performance improvement.

The remainder of this paper is organized as follows. Section 2 defines the signal model and STFT preprocessing. Section 3 details the proposed YOLO11-FH architecture. Section 4 presents experimental results, including comparisons with recent YOLO variants and ablation studies. Section 5 discusses the findings and limitations, and Section 6 concludes the paper.

2. Signal Model and Preprocessing

2.1. Signal Model

The carrier frequency of an FH signal is governed by a pseudo-random hopping sequence. Within an observation interval $[0, T]$, the received signal is modeled as

$$r(t) = s(t) + i(t) + n(t), \quad 0 \leq t \leq T, \tag{1}$$

where $s(t)$ is the superposition of FH signals from multiple co-channel emitters, $i(t)$ denotes interfering signals (e.g., fixed-frequency interference, linear frequency modulation sweep jamming), and $n(t)$ is additive white Gaussian noise (AWGN).
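As a rough illustration of Eq. (1), the sketch below scales complex white Gaussian noise so that a clean signal reaches a chosen SNR. The names `add_awgn` and `measured_snr_db`, the unit-power tone standing in for $s(t)$, and the omission of the interference term $i(t)$ are assumptions made for brevity; they are not taken from the MATLAB simulation used in this work.

```python
import cmath
import math
import random


def add_awgn(signal, snr_db, rng):
    """Return signal + complex white Gaussian noise at the requested SNR (dB)."""
    sig_power = sum(abs(x) ** 2 for x in signal) / len(signal)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power / 2.0)  # std of each real/imaginary component
    return [x + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of `noisy` relative to the known clean signal."""
    sig_power = sum(abs(x) ** 2 for x in clean)
    noise_power = sum(abs(y - x) ** 2 for x, y in zip(clean, noisy))
    return 10.0 * math.log10(sig_power / noise_power)


rng = random.Random(0)
clean = [cmath.exp(2j * math.pi * 0.1 * n) for n in range(20000)]  # stand-in for s(t)
noisy = add_awgn(clean, snr_db=-10.0, rng=rng)
print(round(measured_snr_db(clean, noisy), 2))
```

At $-10$ dB the noise power is ten times the signal power, which is the regime in which hop pulses become hard to separate from the background.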

The signal from the $k$-th FH emitter is expressed as

$$s_k(t) = A_k \sum_{i=1}^{N_k} \mathrm{rect}_{T_k}(t - i T_k - \theta_k)\, e^{j 2 \pi f_{i,k} (t - i T_k - \theta_k)}, \tag{2}$$

where $A_k$ is the amplitude, $N_k$ is the number of hops, $T_k$ is the hop duration (assumed constant within each emitter), $f_{i,k}$ is the carrier frequency of the $i$-th hop, $\theta_k$ is the initial time offset, and $\mathrm{rect}_T(\cdot)$ is a rectangular window of width $T$. This instantaneous switching model is a standard idealization; real hardware exhibits phase continuity and finite switching time, which introduces spectral splatter that is not captured in this simulation.
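Eq. (2) can be evaluated pointwise as in the following minimal sketch. It assumes that the rectangular window equals 1 on $[0, T_k)$ and 0 elsewhere (the text does not state whether the window is centered), and the function name `fh_emitter` and the three hop frequencies are hypothetical values chosen for illustration.

```python
import cmath
import math


def rect(t, width):
    """Rectangular window, assumed here to be 1 on [0, width) and 0 elsewhere."""
    return 1.0 if 0.0 <= t < width else 0.0


def fh_emitter(t, amplitude, hop_freqs, hop_duration, theta):
    """Evaluate Eq. (2) at time t for one emitter (hop index i starts at 1)."""
    total = 0j
    for i, f in enumerate(hop_freqs, start=1):
        tau = t - i * hop_duration - theta
        total += rect(tau, hop_duration) * cmath.exp(2j * math.pi * f * tau)
    return amplitude * total


# Hypothetical emitter: three hops of 5 ms each, no initial offset.
hops = [50e6, 120e6, 80e6]
sample = fh_emitter(t=0.0125, amplitude=2.0, hop_freqs=hops, hop_duration=5e-3, theta=0.0)
print(abs(sample))  # t falls inside hop i = 2, so the magnitude equals the amplitude
```

Because the sum starts at $i = 1$, the first hop under this window convention begins at $t = T_k + \theta_k$ rather than at $\theta_k$.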

2.2. STFT Preprocessing

The STFT converts the one-dimensional received signal into a two-dimensional time-frequency representation:

$$\mathrm{STFT}_r(t, f) = \int_{-\infty}^{+\infty} r(\tau)\, g(\tau - t)\, e^{-j 2 \pi f \tau}\, d\tau, \tag{3}$$

where $g(\tau)$ is the analysis window function.
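A discrete counterpart of Eq. (3) is sketched below using only the Python standard library. The symmetric Hamming definition, the function name `stft_magnitude`, and the toy sizes (a 64-sample window instead of 512) are assumptions chosen to keep the example short and fast; the phase reference is taken relative to the start of each frame, which does not affect the magnitude.

```python
import cmath
import math


def hamming(n_w):
    """Symmetric Hamming window (one common definition; an assumption here)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def stft_magnitude(x, n_w, hop):
    """Magnitude of a discrete STFT: one n_w-point DFT per windowed frame."""
    g = hamming(n_w)
    frames = []
    for start in range(0, len(x) - n_w + 1, hop):
        seg = [x[start + m] * g[m] for m in range(n_w)]
        spectrum = [
            abs(sum(seg[m] * cmath.exp(-2j * math.pi * k * m / n_w) for m in range(n_w)))
            for k in range(n_w)
        ]
        frames.append(spectrum)
    return frames  # indexed as frames[frame_index][frequency_bin]


# Toy sizes (much smaller than the 512-sample window used in this work).
n_w, hop = 64, 16
tone = [cmath.exp(2j * math.pi * 8 * n / n_w) for n in range(4 * n_w)]  # centered on bin 8
spec = stft_magnitude(tone, n_w, hop)
peak_bin = max(range(n_w), key=lambda k: spec[0][k])
print(len(spec), peak_bin)
```

A production pipeline would use an FFT rather than the direct DFT shown here; the direct form is used only because it mirrors the integral term by term.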

In this work, a Hamming window of length $N_w = 512$ samples is used with 75% overlap (hop size $H = 128$ samples) and a fast Fourier transform (FFT) length of 512 points. The sampling rate is $f_s = 400$ MHz, yielding a time resolution of $H / f_s = 0.32$ μs per frame and a frequency resolution of $f_s / N_w \approx 781$ kHz. The magnitude spectrum is logarithmically compressed and mapped to a $640 \times 640$ pixel grayscale image, with the horizontal axis spanning $[0, 25]$ ms and the vertical axis spanning $[10, 200]$ MHz.
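The resolution figures quoted above follow directly from the window length, overlap, and sampling rate. The short check below reproduces them; the variable names are illustrative.

```python
fs_hz = 400e6        # sampling rate
n_w = 512            # window length in samples
overlap = 0.75       # fractional overlap between consecutive frames

hop = int(n_w * (1.0 - overlap))     # hop size H in samples
time_res_us = hop / fs_hz * 1e6      # time step per frame, in microseconds
freq_res_khz = fs_hz / n_w / 1e3     # FFT bin spacing, in kilohertz

print(hop, time_res_us, freq_res_khz)
```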

2.3. Frequency-Axis Noise Correlation

The windowing operation in the STFT produces a noise correlation structure that has no counterpart in natural image detection. For additive white Gaussian noise $n(t)$ with power spectral density $\sigma_n^2$, the expected squared magnitude at STFT bin $k$ is uniform across all $k$ because white noise has a flat power spectrum. However, adjacent bins are not statistically independent. The cross-correlation between bins $k$ and $k + p$ evaluates to

$$\rho(p) = \frac{\sum_{m=0}^{N_w - 1} g^2(m)\, e^{-j 2 \pi p m / N_w}}{\sum_{m=0}^{N_w - 1} g^2(m)}, \tag{4}$$

which is the normalized discrete Fourier transform of the squared window envelope $g^2(m)$ [26]. We normalize $g(m)$ such that $\sum_{m=0}^{N_w - 1} g^2(m) = 1$. For a Hamming window, the main lobe of $g^2(m)$ spans roughly $8 f_s / N_w$, so $\rho(p)$ remains non-negligible for $|p| \leq 4$. In our configuration ($N_w = 512$, $f_s = 400$ MHz), this corresponds to a correlation extent of about $3.1$ MHz (four bins of about 781 kHz on each side), far wider than the 100 kHz FH hop bandwidth.
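Eq. (4) can be evaluated numerically for a given window. The sketch below does so for a 512-sample Hamming window; the symmetric window definition and the function name `bin_correlation` are assumptions, and the printed magnitudes are meant for inspecting how quickly the correlation decays with the bin offset $p$ rather than for reproducing any figure from this work.

```python
import cmath
import math


def hamming(n_w):
    """Symmetric Hamming window (one common definition; an assumption here)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (n_w - 1)) for m in range(n_w)]


def bin_correlation(g, p):
    """Eq. (4): normalized DFT of the squared window, evaluated at bin offset p."""
    n_w = len(g)
    num = sum(g[m] ** 2 * cmath.exp(-2j * math.pi * p * m / n_w) for m in range(n_w))
    den = sum(v ** 2 for v in g)
    return num / den


g = hamming(512)
for p in range(6):
    print(p, round(abs(bin_correlation(g, p)), 4))
```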

The practical consequence is that noise appears as horizontal bands of varying intensity along the frequency axis, precisely the direction in which FH hop rectangles have their sharpest edges. A standard $3 \times 3$ kernel applies equal weight to both axes and therefore cannot distinguish a noise band from a hop edge. The $(3, 1)$ kernel in FSB is motivated directly by this structure: smoothing only along the frequency axis averages the correlated bins without blurring the temporal boundaries of hops.

2.4. Bounding Box–Physical Parameter Mapping

Each FH hop appears as a localized rectangle in the time-frequency image. The YOLO detector outputs normalized bounding boxes $(x_c, y_c, w, h) \in [0, 1]^4$, which can be mapped to physical parameters via

$$f_c = f_{\min} + y_c \cdot (f_{\max} - f_{\min}), \tag{5}$$

$$B_w = h \cdot (f_{\max} - f_{\min}), \tag{6}$$

$$t_c = t_{\min} + x_c \cdot (t_{\max} - t_{\min}), \tag{7}$$

$$T_d = w \cdot (t_{\max} - t_{\min}), \tag{8}$$

where $f_c$ is the center frequency, $B_w$ is the bandwidth, $t_c$ is the time position, and $T_d$ is the hop duration.
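Eqs. (5) to (8) translate into a few lines of code. In the sketch below, the default axis limits are the image extents given in Section 2.2, the function name `box_to_physical` is illustrative, and the example box is hypothetical. The mapping follows Eq. (5) literally, so it assumes that $y_c$ increases with frequency.

```python
def box_to_physical(xc, yc, w, h, f_min=10e6, f_max=200e6, t_min=0.0, t_max=25e-3):
    """Apply Eqs. (5)-(8) to a normalized YOLO box; returns SI units (Hz, s)."""
    f_c = f_min + yc * (f_max - f_min)   # center frequency, Eq. (5)
    b_w = h * (f_max - f_min)            # bandwidth, Eq. (6)
    t_c = t_min + xc * (t_max - t_min)   # time position, Eq. (7)
    t_d = w * (t_max - t_min)            # hop duration, Eq. (8)
    return f_c, b_w, t_c, t_d


f_c, b_w, t_c, t_d = box_to_physical(0.5, 0.5, 0.2, 0.01)
print(f_c / 1e6, b_w / 1e6, t_c * 1e3, t_d * 1e3)  # MHz, MHz, ms, ms
```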

3. Proposed Method: YOLO11-FH

3.1. Overall Architecture

YOLO11-FH builds on the YOLO11n baseline, which follows the standard backbone–neck–head paradigm. The backbone extracts multi-scale features, the neck (based on a Feature Pyramid Network [27] with Path Aggregation [28]) fuses features across scales, and three detection heads predict bounding boxes at P3/8, P4/16, and P5/32 resolutions.

Three modifications are applied to the baseline architecture. An FSB module is inserted at stage P2/4 (layer 2), immediately after the second strided convolution. The C3k2 block with residual C3 submodules at stage P3/8 is replaced by a TFMRB module (layer 5). In the detection head, all C3k2 blocks have their Bottleneck repeat count reduced from 2 to 1, and the P5-level C3k2 block switches from the deep C3 variant (c3k = True) to a standard Bottleneck (c3k = False).

Table 1 summarizes the complete layer-by-layer configuration. The overall architecture is illustrated in Figure 1.

3.2. FreqSmoothBlock (FSB)

3.2.1. FSB Motivation

In STFT-based spectrograms, the noise power spectral density is approximately uniform along the frequency axis for additive white Gaussian noise. However, the windowing sidelobe effect causes adjacent frequency bins to be correlated: for a window of length $N_w$, the main lobe of the Hamming window spans approximately four frequency bins ($4 f_s / N_w \approx 3.1$ MHz in our configuration), causing noise energy in each bin to leak into its immediate neighbors [26]. Within this main-lobe width, adjacent bins exhibit strong correlation that manifests as band-shaped noise along the frequency axis, blurring the sharp boundaries of FH hop rectangles.

Standard convolutional layers in YOLO employ square kernels (e.g., $3 \times 3$) that smooth both axes equally, mixing frequency-axis noise into the temporal structure. A natural alternative is to smooth only along the frequency axis, analogous to applying a one-dimensional low-pass filter to each frequency column of the spectrogram. We select a $(3, 1)$ kernel as a conservative lower bound of the ∼4-bin correlation width: it suppresses the most correlated bins while preserving narrow hop edges that could be blurred by wider kernels.

3.2.2. FSB Architecture

FSB adopts a residual structure with a depthwise convolution core. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the output is

$$Y = X + \sigma(\mathrm{BN}(\mathrm{DWConv}_{(3,1)}(X))), \tag{9}$$

where $\mathrm{DWConv}_{(3,1)}$ is a depthwise convolution with kernel size $(3, 1)$ and $C$ groups, $\mathrm{BN}$ is batch normalization, and $\sigma$ denotes the Sigmoid Linear Unit (SiLU) activation. The $(3, 1)$ kernel processes three adjacent frequency bins per channel while leaving the temporal dimension unchanged.

The residual pathway keeps the module close to an identity mapping whenever the branch output is small, as is the case at initialization, which limits the risk of degrading the baseline features. The depthwise branch progressively learns to suppress frequency-axis noise through gradient descent, acting as an adaptive one-dimensional filter with data-driven coefficients.
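The behavior of Eq. (9) can be illustrated on a single channel with plain Python lists. This is a minimal sketch rather than the implementation used in this work: the function name `fsb_single_channel`, the zero padding at the band edges, the fixed batch-normalization statistics (inference form), and the hand-picked tap values are all assumptions.

```python
import math


def silu(x):
    """Sigmoid Linear Unit: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def fsb_single_channel(x, taps, gamma=1.0, beta=0.0, mean=0.0, var=1.0, eps=1e-5):
    """Eq. (9) for one channel: Y = X + SiLU(BN(DWConv_(3,1)(X))).

    x is a list of rows indexed [frequency][time]; taps holds the three
    depthwise weights. BN is applied in inference form with fixed statistics.
    """
    n_f, n_t = len(x), len(x[0])
    y = [[0.0] * n_t for _ in range(n_f)]
    for f in range(n_f):
        for t in range(n_t):
            acc = 0.0
            for k, w in enumerate(taps):      # neighbors f - 1, f, f + 1
                ff = f + k - 1
                if 0 <= ff < n_f:             # zero padding at the band edges
                    acc += w * x[ff][t]
            bn = gamma * (acc - mean) / math.sqrt(var + eps) + beta
            y[f][t] = x[f][t] + silu(bn)
    return y


# A single bright time column: the (3, 1) kernel must not spread it in time.
x = [[1.0 if t == 2 else 0.0 for t in range(5)] for _ in range(4)]
y = fsb_single_channel(x, taps=[0.25, 0.5, 0.25])
print([row[1] for row in y], [round(row[2], 3) for row in y])
```

Two properties are visible in this toy case: energy never leaks into neighboring time columns, and with all-zero taps (and zero BN shift) the branch output vanishes so that the block reduces to the identity.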

The concept of depthwise separable convolution originates from MobileNet [29] and its successor MobileNetV2 [30]. FSB differs in its directional application: only the frequency axis is processed, aligning the convolution direction with the physical noise distribution in spectrograms.

The internal structure of FSB is shown in Figure 2.

3.2.3. Parameter Cost

FSB introduces $3C$ depthwise convolution weights and $2C$ batch normalization parameters ($\gamma$ and $\beta$), totaling $5C$. At the P2/4 stage with $C = 32$ (after width scaling by 0.25 in the nano configuration), this amounts to 160 additional parameters, negligible relative to the 2.58 M baseline.
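The parameter count can be checked with a one-line function. The sketch assumes, as the $5C$ total implies, that the depthwise convolution has no bias term; the function name is illustrative and the baseline size is the rounded 2.58 M figure quoted above.

```python
def fsb_param_count(channels):
    """3 depthwise weights plus BN gamma and beta per channel (no conv bias)."""
    return 3 * channels + 2 * channels


extra = fsb_param_count(32)
baseline_params = 2.58e6  # rounded baseline size quoted in the text
print(extra, f"{100.0 * extra / baseline_params:.4f}%")
```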

3.2.4. FSB Placement

FSB is placed at layer 2 (P2/4), the shallowest backbone stage with $1/4$ spatial resolution relative to the input. At this level, the feature map retains high spatial resolution, and the frequency-axis noise has not yet been compressed by downsampling. Early suppression prevents noise from corrupting features in subsequent layers.

3.3. TFMultiResBlock (TFMRB)

3.3.1. TFMRB Motivation

FH signals in a spectrogram originate from multiple emitters with independent physical parameters. The $(B_w, T_d)$ pair (bandwidth and hop duration) determines the height and width of each hop rectangle in the image. Different emitters may operate at hop rates ranging from 100 to 320 hops/s with varying bandwidths, producing bounding boxes that span a wide range of aspect ratios and scales. A single-receptive-field backbone layer cannot efficiently capture this diversity.

3.3.2. TFMRB Architecture

TFMRB fuses features from three parallel dilated convolution branches. For input $X \in \mathbb{R}^{C_1 \times H \times W}$:

$$B_k = \mathrm{Conv}_{3 \times 3}^{d = k}(X), \quad k \in \{1, 2, 3\}, \tag{10}$$

$$Y = \mathrm{Proj}(X) + \mathrm{Conv}_{1 \times 1}([B_1; B_2; B_3]), \tag{11}$$

where each $\mathrm{Conv}_{3 \times 3}^{d = k}$ maps $C_1$ channels to $\lfloor C_2 \cdot e \rfloor$ hidden channels with expansion ratio $e = 0.5$, $[\cdot; \cdot; \cdot]$ denotes channel-wise concatenation, $\mathrm{Conv}_{1 \times 1}$ compresses the fused features from $3 \lfloor C_2 e \rfloor$ to $C_2$ channels, and $\mathrm{Proj}$ is a $1 \times 1$ projection applied when $C_1 \neq C_2$.

The three dilation rates produce effective receptive fields of $3 \times 3$, $5 \times 5$, and $7 \times 7$, which correspond to progressively larger regions in the time-frequency plane. The residual connection follows standard practice [31] to stabilize training. The structure of TFMRB is shown in Figure 3.
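The receptive-field sizes follow from the standard relation $k_{\mathrm{eff}} = k + (k - 1)(d - 1)$ for a kernel of size $k$ with dilation $d$. The sketch below evaluates it for the three branches and also counts the convolution weights implied by Eqs. (10) and (11). The weight count ignores normalization parameters and biases, the function names are illustrative, and the choice $C_1 = C_2 = 64$ is an assumption made for the example rather than a figure reported in this work.

```python
def effective_kernel(kernel, dilation):
    """Effective extent of a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def tfmrb_weight_count(c1, c2, e=0.5, dilations=(1, 2, 3)):
    """Convolution weights only; BN parameters and biases are ignored."""
    hidden = int(c2 * e)                            # floor(C2 * e)
    branches = len(dilations) * 3 * 3 * c1 * hidden  # three 3x3 branches, Eq. (10)
    fuse = len(dilations) * hidden * c2             # 1x1 conv over the concatenation
    proj = c1 * c2 if c1 != c2 else 0               # 1x1 projection only if needed
    return branches + fuse + proj


print([effective_kernel(3, d) for d in (1, 2, 3)])
print(tfmrb_weight_count(64, 64))  # 64 channels is an illustrative choice
```

Dilation enlarges the receptive field without adding weights, so all three branches cost the same number of parameters.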

3.3.3. Comparison with ASPP

The multi-scale fusion principle is related to Atrous Spatial Pyramid Pooling (ASPP) [32] and Inception modules [33]; the dilated convolution concept itself was introduced by Yu and Koltun [34] for dense prediction. ASPP includes a global average pooling branch for semantic segmentation, which introduces excessive global context that is counterproductive for detecting localized FH pulses. TFMRB omits the global branch and uses moderate dilation rates (1, 2, 3) focused on local multi-scale features. We note that similar parallel dilated convolution designs have appeared in recent signal processing works; the specific contribution here is the integration into a YOLO backbone for FH spectrogram detection, with dilation rates matched to the physical scale range of hop rectangles.

3.3.4. TFMRB Placement

TFMRB is placed at layer 5 (P3/8), replacing the original C3k2 (256, c3k = True) block. This location provides moderate spatial resolution and semantic abstraction, balancing the ability to distinguish multi-scale hops with computational efficiency. TFMRB has fewer parameters than the C3k2 block it replaces, since the latter nests a deeper C3 submodule with two Bottleneck layers.

3.4. LightHead Design

The YOLO11 detection head performs feature refinement through C3k2 modules at three scales. In the default configuration, each C3k2 module stacks two Bottleneck layers ($r = 2$), and the P5-level module employs a deep C3 variant (c3k = True).

For single-class FH signal detection with a moderately sized training set (∼6700 training images), this head capacity exceeds the task requirement.

The Bottleneck repeat count of all head C3k2 modules is reduced from 2 to 1 (layers 14, 17, 20, 23). In addition, the C3k2 module at the P5 level (layer 23) switches from c3k = True to c3k = False, replacing the nested C3 submodule with a lighter standard Bottleneck. At this scale, feature maps have the largest receptive field; the identity mapping from a residual C3 submodule may propagate shallow-layer noise to deep features, potentially interfering with the detection of small FH targets.

This design is consistent with the head redundancy reported in YOLOv10 [25], where simplifying the detection head improved efficiency without sacrificing accuracy on moderate-complexity tasks.

4. Experimental Results

4.1. Dataset

A simulated FH signal dataset was generated using MATLAB R2023a. The dataset comprises 9600 time-frequency spectrogram images, each labeled with YOLO-format bounding boxes for FH hop pulses. The split ratio is 70%/20%/10% for train/validation/test. Key parameters are summarized in Table 2.
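The split sizes implied by these ratios can be computed directly. The dictionary keys below are illustrative names.

```python
total_images = 9600
ratios = {"train": 0.70, "val": 0.20, "test": 0.10}

# Round to the nearest integer to absorb floating-point error in the products.
counts = {name: round(total_images * r) for name, r in ratios.items()}
print(counts)
```

The resulting training share is consistent with the roughly 6700 training images mentioned in Section 3.4, and the test share matches the eight emitter-count settings of 120 images each used in Section 5.1.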

Each image is generated by (1) randomly selecting a subset of emitters from the predefined pool, (2) synthesizing their composite FH signal with interference and AWGN at the specified SNR, (3) computing the STFT with a 512-point Hamming window and 75% overlap, (4) applying logarithmic compression and mapping the magnitude spectrum to a $640 \times 640$ pixel image, and (5) computing YOLO-format bounding boxes from the known hop parameters. Only FH hop pulses are labeled; jamming signals are treated as background clutter. The five jamming types are linear frequency modulation (LFM) sweep, narrowband amplitude modulation (NAM), narrowband frequency modulation (NFM), wideband RF noise, and pulsed interference.
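Step (5) amounts to inverting Eqs. (5) to (8). A minimal sketch is given below; the function name `hop_to_yolo_label`, the single class index 0, and the example hop parameters are hypothetical, and hops that extend beyond the image borders are not clipped here.

```python
def hop_to_yolo_label(f_c, b_w, t_start, t_d, f_min=10e6, f_max=200e6,
                      t_min=0.0, t_max=25e-3, class_id=0):
    """Return 'class xc yc w h' with coordinates normalized to [0, 1]."""
    xc = (t_start + t_d / 2.0 - t_min) / (t_max - t_min)  # box center in time
    yc = (f_c - f_min) / (f_max - f_min)                  # box center in frequency
    w = t_d / (t_max - t_min)                             # hop duration
    h = b_w / (f_max - f_min)                             # hop bandwidth
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Hypothetical hop: 105 MHz carrier, 1.9 MHz wide, starting at 10 ms, lasting 5 ms.
print(hop_to_yolo_label(f_c=105e6, b_w=1.9e6, t_start=10e-3, t_d=5e-3))
```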

4.2. Implementation Details

All models are trained under identical conditions using stochastic gradient descent (SGD) with automatic mixed precision (AMP). Table 3 lists the full configuration. Vertical flip, rotation, shear, and mixup augmentations are disabled to preserve the physical orientation of time-frequency axes.

The default YOLO11 loss configuration is used, combining CIoU [35] for bounding box regression with binary cross-entropy for classification. Evaluation metrics include mAP@0.5, mAP@0.5:0.95, precision (P), recall (R), parameter count (Params), and floating-point operations (GFLOPs). Single-image inference speed (FPS) is reported using the Ultralytics benchmark on a single NVIDIA T4 at $640 \times 640$ resolution and batch size 1.
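Both mAP metrics are built on the intersection over union (IoU) between predicted and ground-truth boxes: mAP@0.5 uses a single IoU threshold of 0.5, whereas mAP@0.5:0.95 averages over thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below shows how a small frequency offset on a thin hop box already fails the stricter thresholds; the function name and the example boxes are illustrative.

```python
def iou_xywh(a, b):
    """IoU of two boxes given as (xc, yc, w, h) in normalized coordinates."""
    ax1, ax2 = a[0] - a[2] / 2.0, a[0] + a[2] / 2.0
    ay1, ay2 = a[1] - a[3] / 2.0, a[1] + a[3] / 2.0
    bx1, bx2 = b[0] - b[2] / 2.0, b[0] + b[2] / 2.0
    by1, by2 = b[1] - b[3] / 2.0, b[1] + b[3] / 2.0
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0.0 else 0.0


# The ten IoU thresholds averaged by mAP@0.5:0.95.
thresholds = [0.5 + 0.05 * i for i in range(10)]

truth = (0.50, 0.500, 0.20, 0.02)
pred = (0.50, 0.504, 0.20, 0.02)   # shifted by one fifth of the box height
score = iou_xywh(truth, pred)
passed = [score >= t for t in thresholds]
print(round(score, 3), passed)
```

The prediction counts as correct at the four loosest thresholds only, which is why boundary sharpness influences mAP@0.5:0.95 much more strongly than mAP@0.5.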

4.3. Comparison with State-of-the-Art Object Detectors

Table 4 compares YOLO11-FH with five recent YOLO architectures, all at the nano scale and trained from scratch (random initialization) on the same dataset. The comparison models include YOLOv10n [25], the unmodified YOLO11n baseline, YOLO12n [36], YOLOv13n [37], and YOLO26n [38]. A non-YOLO baseline, RT-DETR [24], is also reported.

From Table 4, several observations stand out. General-purpose detectors show limited gains on this domain-specific task: from YOLOv10n (2024) to YOLO26n (2026), mAP@0.5 rises from 95.24% to 95.71%, a span of 0.47 percentage points across two years of architectural innovation that includes attention mechanisms (YOLO12), hypergraph-based perception (YOLOv13), and end-to-end design (YOLO26). Domain-specific modules yield a larger gain: YOLO11-FH reaches 96.04% mAP@0.5 (+0.33 pp over YOLO26n, +0.95 pp over the baseline), with the mAP@0.5:0.95 gap even more pronounced at 76.18% vs. 74.98% for YOLO26n (+1.20 pp) and 73.27% for the baseline (+2.91 pp). The mAP@0.5:0.95 metric, which penalizes imprecise boxes, shows the largest absolute improvement because FSB suppresses noise at hop boundaries while TFMRB captures diverse hop scales, both directly reducing localization errors. The comparison is limited to general-purpose object detectors and RT-DETR. Prior FH detection works, such as Huang et al. [13] (CWD+YOLOv5) and Chen et al. [16] (YOLOv3-based), use different datasets, preprocessing pipelines, and evaluation metrics, so direct numerical comparison is impractical without full reimplementation. The controlled comparison above isolates the effect of domain-specific modules under identical experimental conditions.

RT-DETR provides a non-YOLO reference point: it reaches 95.80% mAP@0.5 but with a much larger model (32.8 M params, 108 GFLOPs) and lower FPS. This supports the claim that domain-specific modules let a nano-scale model surpass general-purpose detectors without paying a large efficiency penalty.

Figure 4 plots the detection models in the accuracy–efficiency plane, with bubble area encoding GFLOPs. The general-purpose variants cluster between 95.1% and 95.7% mAP@0.5 irrespective of parameter count, suggesting an architectural ceiling near 95.7% on this task when no domain knowledge is incorporated. YOLO11-FH lies 0.33 pp above this cluster at a comparable parameter cost to that of YOLO11n. The slight overhead relative to YOLOv10n and YOLO26n ($+0.14$ and $+0.14$ M, respectively) comes entirely from the three parallel branches in TFMRB; pairing those branches with a leaner backbone would close the gap.

4.4. Ablation Study

Table 5 presents the ablation results. Starting from the unmodified YOLO11n baseline, each proposed module is progressively added.

4.4.1. Effect of FSB

Adding FSB alone improves mAP@0.5 by 0.81 pp and mAP@0.5:0.95 by 2.25 pp. Precision increases from 92.95% to 94.77%, indicating that frequency-axis smoothing reduces false positives caused by noise-induced spurious detections. Recall also improves (88.16% → 88.73%), suggesting that cleaner feature maps enable the detector to recover faint hop pulses that were previously lost in noise. The parameter increase is negligible (160 additional parameters, +0.006%), and GFLOPs remain unchanged at 6.3.

4.4.2. Effect of TFMRB

TFMRB alone improves mAP@0.5 by 0.76 pp and mAP@0.5:0.95 by 2.11 pp. The multi-resolution branches enable the backbone to capture hop rectangles of diverse sizes more effectively. Because TFMRB replaces the heavier C3k2 (256, c3k = True) block, the parameter count actually decreases by 42,064 ($-1.6\%$), while GFLOPs increase marginally from 6.3 to 6.5 due to the three parallel convolution branches.

4.4.3. Complementarity of FSB and TFMRB

Combining FSB and TFMRB yields mAP@0.5 of 96.02% and mAP@0.5:0.95 of 76.06%, outperforming each module used alone. The combined mAP@0.5:0.95 gain ($+2.79$ pp) is sub-additive compared with the sum of individual gains ($2.25 + 2.11 = 4.36$ pp), indicating partially overlapping benefits. This is expected: FSB cleans the features that TFMRB subsequently processes, so the two modules share part of the same improvement pathway.
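The sub-additivity can be made explicit with the numbers quoted in this section and the 73.27% baseline mAP@0.5:0.95 from Section 4.3. The variable names are illustrative, and `shared_gain` is simply the difference between the additive expectation and the measured combined gain.

```python
baseline_map = 73.27                 # baseline mAP@0.5:0.95 (%)
gain_fsb, gain_tfmrb = 2.25, 2.11    # gains of each module used alone (pp)
combined_map = 76.06                 # FSB + TFMRB mAP@0.5:0.95 (%)

combined_gain = combined_map - baseline_map
additive_gain = gain_fsb + gain_tfmrb
shared_gain = additive_gain - combined_gain  # portion of the gain common to both modules

print(round(combined_gain, 2), round(additive_gain, 2), round(shared_gain, 2))
```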

4.4.4. Effect of LightHead

Adding LightHead to the FSB + TFMRB configuration provides a modest additional improvement: mAP@0.5 increases by 0.02 pp and mAP@0.5:0.95 by 0.12 pp, while reducing parameter count from 2.54 M to 2.51 M ($-1.2\%$). Accuracy that is maintained, and marginally improved, with fewer parameters is consistent with the original detection head being mildly over-parameterized for this single-class task.

4.4.5. Overall Impact

The complete YOLO11-FH model achieves 96.04% mAP@0.5 and 76.18% mAP@0.5:0.95, with 2.51 M parameters and 6.5 GFLOPs. Relative to the baseline, the net parameter change is $-2.9\%$, and GFLOPs increase by $+3.2\%$. The accuracy improvement, particularly the 2.91 pp gain in mAP@0.5:0.95, is primarily attributable to the two backbone modules (FSB and TFMRB), while LightHead contributes parameter efficiency. Figure 5 summarizes the incremental gains; the orange bars (mAP@0.5:0.95) are consistently about three times longer than the blue bars (mAP@0.5), which reflects that both modules primarily sharpen localization rather than boost coarse detection probability.

We acknowledge that the ablation table does not include all possible module combinations (e.g., TFMRB + LightHead without FSB, or FSB + LightHead without TFMRB). The progressive addition order (TFMRB alone, FSB alone, FSB + TFMRB, then FSB + TFMRB + LightHead) was chosen to isolate the individual contribution of each module and their pairwise interaction. A full factorial design is left for future work.

4.5. Qualitative Detection Examples

Figure 6 shows representative detections on low-SNR spectrograms. The predicted boxes align with hop rectangles despite dense interference, while spurious responses along horizontal noise bands are reduced.

Regarding deployment cost, the parameter overhead of YOLO11-FH is modest by design. FSB adds 160 parameters at P2/4; TFMRB saves 42,064 by replacing the heavier C3k2 block; and LightHead provides a further reduction, so that the net count is $-2.9\%$ relative to the baseline. GFLOPs increase by 0.2 (from 6.3 to 6.5) owing to the three parallel branches in TFMRB. In fixed-point hardware implementations, the three branches can execute in parallel without a latency penalty, making the GFLOPs increase largely irrelevant for throughput. On platforms such as software-defined radio processors where memory bandwidth is the primary constraint, the smaller parameter count translates directly to a lower model footprint, which is favorable for real-time monitoring applications.

5. Discussion

5.1. Emitter-Count Robustness

Figure 7 and Table 6 summarize accuracy as the number of simultaneous emitters increases from 3 to 10 on the test split (120 images per setting). The mAP@0.5 curve drops steadily, from 0.995 at three emitters to 0.828 at ten emitters, consistent with more hop collisions and heavier time-frequency overlap. The mAP@0.5:0.95 metric and precision/recall values follow the same downward trend, indicating that both localization quality and detection coverage are affected as the scene becomes more congested. Even so, the model sustains an mAP@0.5 of about 0.83 at ten emitters, which suggests that the proposed backbone modifications remain effective under dense multi-emitter conditions.

5.2. Real-World Validation on RFUAV Subset

The model was fine-tuned on a small real-world RFUAV subset and evaluated on its validation split (train: 762 images, val: 110 images) [39]. Table 7 summarizes the validation results after 20 epochs: mAP@0.5 of 0.8519, mAP@0.5:0.95 of 0.5860, precision of 0.9182, and recall of 0.7888. This provides a first check of transfer to real RF spectrograms derived from UAV signals, while the limited validation size suggests that a larger held-out set would yield a more stable estimate.

5.3. Interpretation of FSB

FSB can be viewed as a learnable one-dimensional filter applied along the frequency axis. If the batch normalization scaling factor approaches unity and the SiLU activation operates near its linear region, the depthwise convolution approximates a 3-tap weighted average, a classical noise suppression operator. Unlike a fixed filter, the coefficients of FSB are optimized end-to-end to minimize detection loss, making them adaptive to the noise statistics of the training data. Visualizing the learned kernel weights and comparing FSB against a fixed 3-tap moving average filter would provide direct evidence for this interpretation; we leave this analysis for future work.
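The noise-suppression effect of the classical 3-tap average mentioned above can be illustrated on synthetic data. The sketch below uses independent Gaussian samples as a stand-in for one frequency column of noise, which is a simplifying assumption: for the correlated bins described in Section 2.3 the reduction will differ. The function name and the uniform taps are illustrative and do not represent learned FSB weights.

```python
import random
import statistics


def smooth_3tap(column, taps=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average over three adjacent frequency bins (interior bins only)."""
    return [
        taps[0] * column[k - 1] + taps[1] * column[k] + taps[2] * column[k + 1]
        for k in range(1, len(column) - 1)
    ]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # one frequency column of noise
smoothed = smooth_3tap(noise)

# For independent samples, uniform 3-tap averaging cuts the variance to about 1/3.
ratio = statistics.pvariance(smoothed) / statistics.pvariance(noise)
print(round(ratio, 3))
```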

The $(3, 1)$ kernel covers three adjacent frequency bins, which is smaller than the main-lobe width of the Hamming window (∼4 bins). This is treated as a conservative lower bound that reduces correlated noise while protecting narrowband hop edges. A wider kernel, such as $(5, 1)$ or $(7, 1)$, might capture more of the correlated noise structure but could also blur narrowband hop edges. Ablation over the kernel size was not conducted in this study; this is a limitation that warrants further investigation.
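
The trade-off between kernel width and edge preservation can be sketched with a uniform moving average, used here as a simple stand-in for the learned kernel. The 3-bin hop width is a hypothetical choice, and the variance figure holds only for uncorrelated noise, which is more optimistic than the correlated noise discussed in this paper.

```python
def moving_average(signal, k):
    """Centered k-tap moving average with zero padding (k must be odd)."""
    half = k // 2
    n = len(signal)
    return [sum(signal[j] for j in range(max(i - half, 0), min(i + half + 1, n))) / k
            for i in range(n)]


# Frequency profile of one time frame: a hop occupying 3 adjacent bins
# (hypothetical width chosen for illustration).
profile = [0.0] * 8 + [1.0] * 3 + [0.0] * 8

results = {}
for k in (3, 5, 7):
    smoothed = moving_average(profile, k)
    peak = max(smoothed)
    bins_touched = sum(1 for v in smoothed if v > 0)
    results[k] = (peak, bins_touched)
    # 1 / k is the output variance of averaged i.i.d. unit-variance noise.
    print(f"k={k}: peak={peak:.3f}, bins touched={bins_touched}, "
          f"white-noise variance x{1 / k:.3f}")
```

For this profile the 3-tap filter keeps the full peak and widens the hop from 3 to 5 bins, whereas the 5-tap and 7-tap filters reduce the peak to 0.6 and about 0.43 and widen it to 7 and 9 bins. This is the blurring risk that motivates the conservative $(3, 1)$ choice.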

FSB is applied after logarithmic compression and grayscale mapping, so the feature space is nonlinear relative to raw STFT magnitudes. FSB is thus interpreted as a learnable suppressor of frequency-axis correlation in the CNN feature space, rather than as a fixed filter tied to raw spectral statistics. The consistent gains in Table 5 suggest that the learned directional smoothing remains effective despite the nonlinearity.

The residual structure also offers a practical training property. At initialization, the depthwise weights are drawn from a near-zero distribution, so the branch output is small and the module output approximates the identity. This means FSB can be inserted into a partly trained YOLO backbone, or fine-tuned on a downstream FH dataset, without disrupting the existing feature statistics. The same property allows the module to be added to a standard STFT-based pipeline at negligible engineering cost. The identity shortcut follows the residual design of He et al. [31], which keeps a direct path for gradient flow at the start of training; FSB inherits this stability by construction.
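
The near-identity behaviour can be checked numerically on a single frequency column. The sketch below is a simplified model, not the trained module: batch normalization is treated as an identity map, the weight scales are hypothetical, and zero padding is used at the edges.

```python
import math
import random


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def fsb_like(column, weights):
    """Residual 3-tap branch on one frequency column: y = x + SiLU(w * x).

    Batch normalization is omitted (treated as identity), which is an
    assumption of this sketch; zero padding is used at the edges.
    """
    n = len(column)
    out = []
    for i in range(n):
        acc = 0.0
        for tap, w in zip((-1, 0, 1), weights):
            j = i + tap
            if 0 <= j < n:
                acc += w * column[j]
        out.append(column[i] + silu(acc))
    return out


random.seed(0)
column = [random.gauss(0.0, 1.0) for _ in range(64)]


def max_deviation(scale):
    """Largest |output - input| for random weights of the given magnitude."""
    weights = [scale * random.uniform(-1, 1) for _ in range(3)]
    out = fsb_like(column, weights)
    return max(abs(o - x) for o, x in zip(out, column))


for scale in (1e-3, 1e-1, 1.0):
    print(f"weight scale {scale:g}: max deviation {max_deviation(scale):.4f}")
```

With exactly zero weights the output equals the input, and the deviation grows with the weight scale. Whether the same holds with batch normalization active depends on its scale parameter and running statistics, which this sketch does not model.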

5.4. TFMRB Versus Deeper Backbones

An alternative to multi-resolution enhancement is making the backbone deeper. For FH spectrogram detection with a moderately sized dataset (9600 images in total), deeper networks risk overfitting. TFMRB obtains multi-scale representation within a single layer by parallelizing branches of different dilation rates [34], adding receptive-field diversity without increasing depth. The parameter count is lower than that of the C3k2 block it replaces, supporting its efficiency. Modern convolutional architectures such as ConvNeXt [40,41] and EfficientNetV2 [42] have demonstrated that well-designed CNN backbones can match vision transformers when equipped with improved training recipes; a similar principle underlies TFMRB, which achieves multi-scale representation through structural design rather than increased depth.
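
The receptive-field diversity of the three branches follows from the standard relation for dilated kernels: a $k$-tap kernel with dilation $d$ spans $d(k-1)+1$ positions. The sketch below evaluates this for the dilation rates used in TFMRB; the function names are illustrative.

```python
def effective_extent(k, d):
    """Span covered by a k-tap kernel with dilation d: d * (k - 1) + 1."""
    return d * (k - 1) + 1


def sample_offsets(k, d):
    """Offsets, relative to the centre, sampled by a dilated k-tap kernel."""
    half = k // 2
    return [d * i for i in range(-half, half + 1)]


for d in (1, 2, 3):
    extent = effective_extent(3, d)
    print(f"dilation {d}: extent {extent} x {extent}, offsets {sample_offsets(3, d)}")
```

The three branches therefore span 3, 5, and 7 feature-map positions per axis while each still holds nine weights per input-output channel pair, which is why the added scale coverage does not require a larger kernel or a deeper stack.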

5.5. Jamming Signal Annotation

In the current dataset, jamming signals (LFM, NAM, NFM, RF noise, and pulsed) are not annotated and are treated as background clutter. This means the detector must implicitly learn to distinguish FH hop rectangles from spectrally similar interference patterns. At low SNR, certain jamming types, particularly narrowband FM (NFM) and pulsed interference, can produce time-frequency signatures that partially overlap with FH hops in shape, potentially increasing false positives. The 96.04% mAP@0.5 suggests the detector handles this confusion adequately on the simulated data, but a dedicated study of per-jamming-type false positive rates would strengthen the evaluation. The steepest absolute performance gap between $-13$ and $-14$ dB coincides with the SNR range where narrowband interference begins to rival the hop pulse energy.

Among the five jamming types, pulsed interference and narrowband FM are the most compact in the time-frequency plane and share the rectangular footprint of FH hops, differing mainly in duration. LFM sweeps span a continuous frequency range and are less likely to be mistaken for the short, band-limited hop pulses; the detector is expected to handle LFM correctly in the majority of images at all tested SNR levels. Quantifying per-type confusion requires re-annotating the dataset with jamming labels, a straightforward extension of the existing annotation pipeline. This would enable analysis of which interference type is the dominant source of false positives at each SNR and would clarify whether FSB or TFMRB has a greater impact on suppressing specific interference patterns. In real-world scenarios where novel interference types may appear, explicit multi-class annotation (FH pulse vs. specific interference types) could improve detection robustness.

Regarding narrowband FM (NFM) interference, FSB smooths correlated frequency bins but does not impose hop-like geometry; empirically, the module reduces horizontal band noise without increasing false positives on NFM-like clutter. Future work will quantify this explicitly with per-jamming-type false positive rates.
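
One possible way to carry out the per-type analysis, once jamming labels exist, is to attribute each false positive to the jamming box it overlaps most. The sketch below is hypothetical throughout: the box coordinates, the `(x1, y1, x2, y2)` format, the IoU threshold of 0.5, and the function names are assumptions for illustration, and one-to-one matching of predictions to ground truth is skipped for brevity.

```python
from collections import Counter


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def attribute_false_positives(preds, fh_boxes, jam_boxes, thr=0.5):
    """Count false positives by the jamming type they overlap most.

    preds: predicted FH boxes; fh_boxes: ground-truth FH boxes;
    jam_boxes: (jamming_type, box) pairs from a hypothetical re-annotation.
    """
    counts = Counter()
    for p in preds:
        if any(iou(p, g) >= thr for g in fh_boxes):
            continue  # counted as a true positive
        best_type, best_iou = "background", 0.0
        for jam_type, box in jam_boxes:
            overlap = iou(p, box)
            if overlap > best_iou:
                best_type, best_iou = jam_type, overlap
        counts[best_type] += 1
    return counts


# Hypothetical boxes for a single image.
preds = [(10, 10, 20, 14), (40, 40, 50, 44), (80, 80, 90, 84), (200, 200, 210, 204)]
fh_boxes = [(10, 10, 20, 14)]
jam_boxes = [("pulsed", (41, 40, 51, 44)), ("LFM", (0, 60, 100, 100))]

fp_counts = attribute_false_positives(preds, fh_boxes, jam_boxes)
print(dict(fp_counts))
```

Summing such counts per SNR level would give the per-jamming-type false positive rates mentioned above; a stricter attribution rule, for example a minimum overlap before a type is assigned, may be preferable in practice.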

5.6. Limitations

The evaluation has several limitations that should be acknowledged. All 9600 images were generated using MATLAB simulation; the AWGN-plus-jamming model covers six SNR levels and five interference types, but real electromagnetic environments exhibit non-stationary noise, device-specific thermal noise, and spectral artifacts not present in the simulation. Validation on captured FH signals, for example, using software-defined radio recordings, is necessary to confirm practical utility. Beyond the simulated-data constraint, only FH-versus-background detection is addressed; multi-class detection (e.g., distinguishing FH from radar pulses) and simultaneous emitter identification remain unexplored directions. The neck uses isotropic upsampling in the FPN, which may blur sharp temporal edges for extreme aspect-ratio hops. Exploring anisotropic or deformable upsampling is a promising extension. Finally, all experiments were conducted at the nano scale, so behavior at small, medium, or large model sizes may differ. Explicit noise-model mismatch and non-instantaneous switching simulations were not evaluated, which limits assessment of robustness to real hardware effects.

5.7. Applicability to Other Spectrogram-Based Detection

The noise structure addressed by FSB and the scale diversity addressed by TFMRB arise from general properties of STFT spectrograms rather than FH-specific signal characteristics. Any spectrogram computed with a windowed Fourier transform exhibits main-lobe frequency-axis noise correlation; the correlation length is set by the window type and overlap ratio [26] and is independent of the modulation class being observed. Signals with variable bandwidth and duty cycle, including radar pulses, orthogonal frequency-division multiplexing (OFDM) subcarriers, and satellite transponder emissions, produce the same multi-scale bounding-box challenge that TFMRB was designed to address.
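
The dependence of the frequency-axis correlation on the window can be illustrated with a short calculation. For white, zero-mean noise, the correlation between DFT bins $k$ and $k+m$ of a windowed frame is proportional to the DFT of the squared window evaluated at $m$. The sketch below assumes i.i.d. noise, no zero padding, a frame length of 256, and the periodic form of the Hamming window; these are choices made for the example and are not taken from the paper's preprocessing settings.

```python
import cmath
import math


def bin_correlation(window, lag):
    """|E[X_k X*_{k+lag}]| / E[|X_k|^2] for the windowed DFT of white noise.

    For i.i.d. zero-mean noise this equals the magnitude of the DFT of
    window**2 at `lag`, normalized by the sum of window**2.
    """
    n = len(window)
    num = sum(w * w * cmath.exp(-2j * math.pi * lag * i / n)
              for i, w in enumerate(window))
    den = sum(w * w for w in window)
    return abs(num) / den


N = 256  # frame length (assumed for illustration)
rectangular = [1.0] * N
hamming = [0.54 - 0.46 * math.cos(2 * math.pi * i / N) for i in range(N)]

for lag in (1, 2, 3):
    print(f"lag {lag}: rectangular {bin_correlation(rectangular, lag):.3f}, "
          f"Hamming {bin_correlation(hamming, lag):.3f}")
```

Under these assumptions the rectangular window leaves neighbouring bins uncorrelated, while the Hamming window gives a correlation of about 0.63 at one bin and about 0.13 at two bins, vanishing from three bins onward. The result does not depend on what signal is present, only on the window.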

In practice, integrating either module into an existing YOLO-based spectrogram pipeline requires only channel-count alignment. TFMRB replaces a single C3k2 block at P3/8; FSB inserts after the second strided convolution at P2/4. Neither change touches the neck, the detection heads, or the loss function. Applied to the joint communication-and-radar detection scenario of Kang et al. [15], for instance, both modules could be dropped in without redesigning the training pipeline. Whether the mAP gains transfer quantitatively to other signal classes remains an open empirical question, but the low integration cost makes the experiment straightforward.

6. Conclusions

This paper presented YOLO11-FH, a modified YOLO11 architecture for frequency-hopping signal detection in time-frequency spectrograms. Two modules, FreqSmoothBlock (frequency-axis depthwise convolution) and TFMultiResBlock (parallel dilated convolutions at rates of 1, 2, and 3), were inserted into the backbone, and the detection head was simplified. On a simulated dataset spanning $-15$ to $-10$ dB SNR with five jamming types, YOLO11-FH reached 96.04% mAP@0.5 and 76.18% mAP@0.5:0.95, outperforming the YOLO11n baseline by 0.95 and 2.91 pp. The accuracy–efficiency bubble chart (Figure 4) showed that the proposed model breaks through the performance ceiling of unmodified YOLO variants on this task, while the horizontal ablation bar chart (Figure 5) confirmed that mAP@0.5:0.95 improves roughly three times more than mAP@0.5 per module addition, pointing to localization quality as the primary beneficiary. Comparisons with YOLOv10n, YOLO12n, YOLOv13n, YOLO26n, and RT-DETR under identical training conditions confirmed the advantage of domain-specific modules over general-purpose architectural advances. Section 5.7 argued that both modules address structural properties common to all STFT spectrograms, making them applicable to other signal-detection tasks with minimal integration effort.

The current evaluation relies mainly on simulated data, with only a small real-world check on the RFUAV subset (Section 5.2). Future work will (1) validate the approach on real FH signals captured by software-defined radio, (2) ablate the FSB kernel size and visualize the learned filter coefficients, (3) extend to multi-class detection with simultaneous emitter identification, (4) compare with existing FH-specific detection methods under a standardized benchmark, (5) evaluate noise-model mismatch and non-instantaneous switching effects, and (6) evaluate the transfer of FSB and TFMRB to other spectrogram-based detection tasks such as radar pulse and OFDM signal detection.

Author Contributions

Conceptualization, C.Y., H.Z. and W.W.; Methodology, H.Z., W.W., C.Y., Y.X. (Youjun Xiang) and J.L.; Software, H.Z.; Validation, H.Z., W.W., C.Y. and Y.X. (Youjun Xiang); Formal analysis, H.Z. and W.W.; Investigation, H.Z., W.W., C.Y. and Y.X. (Youjun Xiang); Resources, C.Y., Y.X. (Youjun Xiang) and J.L.; Data curation, H.Z., W.W. and Y.X. (Yuheng Xu); Writing—original draft preparation, H.Z.; Writing—review and editing, H.Z., W.W., C.Y., Y.X. (Youjun Xiang), Y.X. (Yuheng Xu) and J.L.; Visualization, H.Z., W.W. and Y.X. (Yuheng Xu); Supervision, C.Y.; Project administration, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hoang, L.M.; Zhang, J.A.; Nguyen, D.N.; Hoang, D.T. Frequency hopping joint radar-communications with hybrid sub-pulse frequency and duration modulation. IEEE Wirel. Commun. Lett. 2022, 11, 2300–2304.
2. Zhao, L.; Wang, L.; Bi, G.; Zhang, L.; Zhang, H. Robust frequency-hopping spectrum estimation based on sparse Bayesian method. IEEE Trans. Wirel. Commun. 2014, 14, 781–793.
3. Li, H.; Guo, Y.; Sui, P.; Yu, X.; Yang, X.; Wang, S. Frequency-Hopping Signal Network-Station Sorting Based on Maxout Network Model and Generative Method. Math. Probl. Eng. 2019, 2019, 9152728.
4. Mao, J.; Luo, F.; Hu, X. Distributed passive positioning and sorting method for multi-network frequency-hopping time division multiple access signals. Sensors 2024, 24, 7168.
5. Yang, X.; Ye, Y.; Liu, Y. Estimation and Simulation of the Capability for Frequency Hopping Signal Sorting. In Proceedings of the 2024 16th International Conference on Communication Software and Networks (ICCSN); IEEE: Piscataway, NJ, USA, 2024; pp. 207–210.
6. Lin, M.; Tian, Y.; Zhang, X.; Huang, Y. Parameter estimation of frequency-hopping signal in UCA based on deep learning and spatial time–frequency distribution. IEEE Sens. J. 2023, 23, 7460–7474.
7. Wang, Z.; Zhang, B.; Zhu, Z.; Wang, Z.; Gong, K. Signal sorting algorithm of hybrid frequency hopping network station based on neural network. IEEE Access 2021, 9, 35924–35931.
8. Wang, Y.; Li, Y.; Sun, Q.; Li, Y. A novel underdetermined blind source separation algorithm of frequency-hopping signals via time-frequency analysis. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 4286–4290.
9. Wei, Y.; Xiang, Q.; Qian, Y.; Li, C.; Ou-Yang, S.R.; Zhuo, X. Improved Adaptive Multi-Density DBSCAN Method for Radar Signal Sorting in Complex Electromagnetic Environment. In Proceedings of the 2024 6th International Conference on Communications, Signal Processing, and Their Applications (ICCSPA); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
10. Ye, J.; Zou, J.; Gao, J.; Zhang, G.; Kong, M.; Pei, Z.; Cui, K. A new frequency hopping signal detection of civil UAV based on improved K-means clustering algorithm. IEEE Access 2021, 9, 53190–53204.
11. Bazzi, A.; Slock, D.T.; Meilhac, L. Sparse recovery using an iterative Variational Bayes algorithm and application to AoA estimation. In Proceedings of the 2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Limassol, Cyprus, 12–14 December 2016; pp. 197–202.
12. Wang, Y.; He, S.; Wang, C.; Li, Z.; Li, J.; Dai, H.; Xie, J. Detection and parameter estimation of frequency hopping signal based on the deep neural network. Int. J. Electron. 2022, 109, 520–536.
13. Huang, D.; Yan, X.; Hao, X.; Dai, J.; Wang, X. Low SNR multi-emitter signal sorting and recognition method based on low-order cyclic statistics CWD time-frequency images and the YOLOv5 deep learning model. Sensors 2022, 22, 7783.
14. Jiang, K.; Peng, K.; Feng, Y.; Guo, X.; Tang, Z. DFN-YOLO: Detecting Narrowband Signals in Broadband Spectrum. Sensors 2025, 25, 4206.
15. Kang, X.; Chen, H.M.; Chen, G.; Chang, K.C.; Clemons, T.M. Joint detection and classification of communication and radar signals in congested RF environments using YOLOv8. In Proceedings of the MILCOM 2024-2024 IEEE Military Communications Conference (MILCOM); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
16. Chen, J.; Zhang, K.; Ding, F. Study on Frequency Hopping Signal Detection and Identification Based on YoLov3. In Proceedings of the China Conference on Command and Control; Springer: Singapore, 2024; pp. 101–110.
17. Zhu, W.; Jin, H.; Wang, J.; Lei, Y.; Lou, C.; Liu, C. Variable-Speed Frequency-Hopping Signal Sorting: Spectrogram Is Sufficient. Electronics 2023, 12, 4533.
18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
19. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
20. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742.
21. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–21.
22. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
23. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
24. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
25. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
26. Harris, F.J. On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 1978, 66, 51–83.
27. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
28. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
29. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
32. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
33. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
34. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
35. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586.
36. Tian, Y.; Ye, Q.; Doermann, D.S. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
37. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
38. Sapkota, R.; Cheppally, R.H.; Sharda, A.; Karkee, M. YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection. arXiv 2025, arXiv:2509.25164.
39. Shi, R.; Yu, X.; Wang, S.; Zhang, Y.; Xu, L.; Pan, P.; Ma, C. RFUAV: A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification. arXiv 2025, arXiv:2503.09033.
40. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986.
41. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 16133–16142.
42. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10096–10106.

Figure 1. Overall architecture of YOLO11-FH. Shaded blocks indicate modified components relative to the YOLO11n baseline: FSB at layer 2 (P2/4), TFMRB at layer 5 (P3/8), and LightHead modifications in the detection head.

Figure 2. Structure of the FreqSmoothBlock (FSB). A $(3, 1)$ depthwise convolution processes three adjacent frequency bins per channel, followed by batch normalization and SiLU activation. The residual connection preserves the original features. Colors indicate data flow only.

Figure 3. Structure of the TFMultiResBlock (TFMRB). Three parallel $3 \times 3$ dilated convolution branches (dilation rates $d = 1, 2, 3$) capture features at different scales. Outputs are concatenated and fused by a $1 \times 1$ convolution. A residual connection is applied when input and output channel counts differ. Colors indicate different dilation branches only.

Figure 4. Accuracy–efficiency trade-off for nano-scale object detection models on the FH dataset. Horizontal axis: parameter count (M). Vertical axis: mAP@0.5 (%). Bubble area encodes GFLOPs. General-purpose detectors (blue) cluster below the 95.7% ceiling, while YOLO11-FH (red star) clears this ceiling by 0.33 pp at a similar parameter budget.

Figure 5. Incremental mAP gains of each ablation configuration over the YOLO11n baseline. The mAP@0.5:0.95 improvement (red-orange bars) is consistently about three times larger than the mAP@0.5 gain (blue bars), confirming that FSB and TFMRB primarily improve bounding box localization rather than coarse detection probability.

Figure 6. Qualitative detection examples on low-SNR spectrograms. Green boxes denote predicted FH hops.

Figure 7. mAP@0.5 versus the number of simultaneous emitters on the test split. Each point aggregates 120 images for a fixed emitter count across six SNR levels.

Table 1. YOLO11-FH Architecture (nano scale). Bold rows indicate modified layers relative to the YOLO11n baseline.

Layer	Module	Output	Change
Backbone
0	Conv $64, 3 \times 3, s = 2$	P1/2	—
1	Conv $128, 3 \times 3, s = 2$	P2/4	—
2	FSB $128$	P2/4	New
3	C3k2 $256, r = 2$	—	—
4	Conv $256, 3 \times 3, s = 2$	P3/8	—
5	TFMRB $256$	P3/8	Replaces C3k2
6	Conv $512, 3 \times 3, s = 2$	P4/16	—
7	C3k2 $512, r = 2$	—	—
8	Conv $1024, 3 \times 3, s = 2$	P5/32	—
9	C3k2 $1024, r = 2$	—	—
10	SPPF $1024$ †	—	—
11	C2PSA $1024$ †	—	—
Head (LightHead)
14	C3k2 $512$ ($r = 1$)	P4	$r: 2 \to 1$
17	C3k2 $256$ ($r = 1$)	P3	$r: 2 \to 1$
20	C3k2 $512$ ($r = 1$)	P4	$r: 2 \to 1$
23	C3k2 $1024$ ($r = 1$, `c3k = F`)	P5	$r: 2 \to 1$, `T→F`
24	Detect	P3/P4/P5	—

† SPPF = spatial pyramid pooling-fast; C2PSA = cross-stage partial with spatial attention. Both are unmodified from YOLO11n.

Table 2. Dataset configuration.

Parameter	Value
Total samples	9600
Emitter count per image	3–10 (from a pool of 10 predefined emitters)
SNR levels	$\{-15, -14, -13, -12, -11, -10\}$ dB
Samples per (emitter count, SNR)	200
Modulation types	QPSK, 16QAM, AM, FM, MSK
Jamming types	LFM, NAM, NFM, RF noise, pulsed
Jamming power ratio	60%
Sampling rate	400 MHz
Observation duration	25 ms
Signal bandwidth	100 kHz
Hop rates	100–320 hops/s
Frequency range (display)	10–200 MHz
Image resolution	$640 \times 640$ pixels
Detection class	1 (FH_Signal)
Train/Val/Test split	70%/20%/10%
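
The entries of Table 2 can be cross-checked with a few lines of arithmetic. The sketch below recomputes the dataset size, the split sizes, and the per-record sample and hop counts from the tabulated values; it assumes the test split is balanced across emitter counts, which is consistent with the 120 images per setting reported for Table 6.

```python
emitter_counts = range(3, 11)      # 3 to 10 emitters per image
snr_levels_db = range(-15, -9)     # -15, -14, ..., -10 dB
samples_per_cell = 200             # per (emitter count, SNR) pair

total = len(emitter_counts) * len(snr_levels_db) * samples_per_cell
split = {"train": 0.70, "val": 0.20, "test": 0.10}
sizes = {name: round(total * frac) for name, frac in split.items()}

# Assumes the test split is balanced across emitter counts.
test_per_emitter_count = sizes["test"] // len(emitter_counts)

sampling_rate_hz = 400e6
duration_s = 25e-3
samples_per_record = round(sampling_rate_hz * duration_s)
hops_per_record = (100 * duration_s, 320 * duration_s)  # at 100 and 320 hops/s

print(total, sizes, test_per_emitter_count)
print(samples_per_record, hops_per_record)
```

With these values the grid yields 9600 images, split into 6720, 1920, and 960, and each 25 ms record holds ten million samples and between 2.5 and 8 hop intervals per emitter.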

Table 3. Training hyperparameters.

Hyperparameter	Value
Hardware	2× NVIDIA T4 (16 GB each), data-parallel
Framework	Ultralytics v8.3, PyTorch 2.1
Epochs	100
Batch size	128 (64 per GPU)
Optimizer	SGD, momentum 0.937, weight decay $5 \times 10^{- 4}$
Learning rate	${lr}_{0} = 0.01$ , cosine annealing
Warm-up	5 epochs (momentum 0.8, bias lr 0.1)
Input size	$640 \times 640$
Initialization	Random (from YAML configuration)
Mosaic	$p = 1.0$
Horizontal flip	$p = 0.5$
HSV jitter	$h = 0.015$ , $s = 0.7$ , $v = 0.4$
Mixed precision	AMP enabled

Table 4. Comparison with State-of-the-Art Object Detectors (nano scale). The best results are shown in bold, the second-best results are underlined.

Method	mAP₅₀ (%)	mAP₅₀₋₉₅ (%)	P (%)	R (%)	Params (M)	GFLOPs	FPS
YOLOv10n	95.24	75.09	93.36	87.81	2.27	6.5	—
YOLO11n	95.09	73.27	92.95	88.16	2.58	6.3	50.0
YOLO12n	95.36	74.21	93.58	88.26	2.59	6.5	48.8
YOLOv13n	95.53	74.63	93.76	88.45	2.48	6.4	35.1
YOLO26n	95.71	74.98	94.03	88.63	2.37	5.4	53.2
RT-DETR	95.80	73.00	93.50	89.20	32.80	108.0	20.4
Ours	96.04	76.18	94.84	88.92	2.51	6.5	48.4

Table 5. Ablation Study. FSB = FreqSmoothBlock, TFMRB = TFMultiResBlock, LH = LightHead. Checkmark indicates the module is enabled. $Δ$ indicates change relative to baseline. Bold indicates the best results.

FSB	TFMRB	LH	mAP₅₀ (%)	Δ (pp)	mAP₅₀₋₉₅ (%)	Δ (pp)	Params (M)	GFLOPs
			95.09	—	73.27	—	2.58	6.3
	✓		95.85	+0.76	75.38	+2.11	2.54	6.5
✓			95.90	+0.81	75.52	+2.25	2.58	6.3
✓	✓		96.02	+0.93	76.06	+2.79	2.54	6.5
✓	✓	✓	96.04	+0.95	76.18	+2.91	2.51	6.5
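
The "roughly three times" relation between the two gains can be recomputed from the rows of Table 5. The sketch below derives each $Δ$ from the tabulated mAP values and takes their ratio; the configuration labels are shorthand introduced for this example.

```python
baseline = (95.09, 73.27)  # (mAP@0.5, mAP@0.5:0.95) of YOLO11n, in percent

# Ablation rows transcribed from Table 5.
configs = {
    "TFMRB": (95.85, 75.38),
    "FSB": (95.90, 75.52),
    "FSB+TFMRB": (96.02, 76.06),
    "FSB+TFMRB+LH": (96.04, 76.18),
}

ratios = {}
for name, (m50, m5095) in configs.items():
    d50 = round(m50 - baseline[0], 2)
    d5095 = round(m5095 - baseline[1], 2)
    ratios[name] = round(d5095 / d50, 2)
    print(f"{name}: +{d50:.2f} pp vs +{d5095:.2f} pp, ratio {ratios[name]:.2f}")
```

The ratio lies between about 2.8 and 3.1 for all four configurations, which matches the statement that the stricter mAP@0.5:0.95 metric benefits about three times as much as mAP@0.5.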

Table 6. Per-emitter-count accuracy on the test split.

Emitters	mAP@0.5	mAP@0.5:0.95	P	R
3	0.9950	0.8647	0.9831	0.9954
4	0.9806	0.8173	0.9552	0.9827
5	0.9778	0.8015	0.9392	0.9799
6	0.9585	0.7532	0.9134	0.9625
7	0.9263	0.6991	0.8822	0.9330
8	0.8969	0.6469	0.8612	0.9062
9	0.8566	0.5965	0.8419	0.8684
10	0.8275	0.5679	0.8288	0.8432

Table 7. Validation results after 20-epoch fine-tuning on the RFUAV subset.

Split	mAP@0.5	mAP@0.5:0.95	P	R
RFUAV val	0.8519	0.5860	0.9182	0.7888

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
