Figure 1. Detection strategy of the proposed PTANet. The input consists of paired RGB-T images, and the output is the detection results for the visible image. The heatmaps illustrate the model’s focus at different stages. Yellow bounding boxes indicate persons, while red bounding boxes denote background clutter.
Figure 2. Structure of PTANet. The red dashed arrows denote the person segmentation auxiliary branch, which is used only during the training phase. The blue dashed arrows represent the cross-modality background mask, which is used only during the inference phase. GAFFM indicates the proposed global adaptive feature fusion module.
Figure 3. Comparison of (a) traditional transformer-based fusion methods and (b) our method.
Figure 4. Structure of the GAFFM. The inputs are the visible and thermal features, and the output is the fused feature produced by the GAFFM.
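As a concrete reference for the fusion step in Figure 4, the sketch below shows one common way to implement transformer-style cross-modality fusion in PyTorch. It is an illustrative approximation rather than the exact GAFFM: the class name, head count, and the concatenate-then-project fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative sketch of transformer-style RGB-T feature fusion.

    Not the exact GAFFM; it only shows the general pattern of querying
    one modality with the other and merging the attended context.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_rgb.shape
        # Flatten the spatial maps into token sequences: (B, H*W, C).
        rgb_tok = f_rgb.flatten(2).transpose(1, 2)
        t_tok = f_t.flatten(2).transpose(1, 2)
        # Visible tokens attend to thermal tokens (cross-attention).
        attended, _ = self.attn(query=rgb_tok, key=t_tok, value=t_tok)
        attended = attended.transpose(1, 2).reshape(b, c, h, w)
        # Merge the attended thermal context with the visible features.
        return self.proj(torch.cat([f_rgb, attended], dim=1))

# Example: fuse two 64-channel feature maps of size 20 x 20.
fuse = CrossModalFusion(channels=64)
out = fuse(torch.randn(2, 64, 20, 20), torch.randn(2, 64, 20, 20))
```

Table 8 compares design variants of this step, such as swapping Q and K or making the cross-attention symmetric.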
Figure 5. Statistics of person sizes in three drone-based datasets: (a) VTUAV-det, (b) RGBTDronePerson, and (c) VTSaR. Green dots indicate persons smaller than 10 × 10 pixels, red dots represent persons larger than 10 × 10 but smaller than 32 × 32 pixels, and blue dots correspond to persons larger than 32 × 32 pixels.
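The size categories in Figure 5 are straightforward to reproduce. The sketch below assumes the thresholds are applied to box area, in the spirit of the COCO small-object convention; the caption does not state whether area or side length is compared.

```python
def size_bucket(w_px: float, h_px: float) -> str:
    """Bucket a person box by area, mirroring the thresholds in Figure 5."""
    area = w_px * h_px
    if area < 10 * 10:
        return "tiny"     # green dots: smaller than 10 x 10 pixels
    if area < 32 * 32:
        return "small"    # red dots: between 10 x 10 and 32 x 32 pixels
    return "regular"      # blue dots: larger than 32 x 32 pixels
```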
Figure 6. Structure of the PSAB. The red arrow indicates the upsampling operation.
Figure 7. Original image (a) and target binary mask (b).
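Figures 6 and 7 together suggest a training-only auxiliary pattern: a lightweight head predicts a person mask from an intermediate feature map, is upsampled to mask resolution (the red arrow in Figure 6), and is supervised by the binary target mask of Figure 7b. The PyTorch sketch below is a minimal rendition; the single-convolution head, upsampling factor, and plain BCE loss are assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegAuxHead(nn.Module):
    """Minimal auxiliary segmentation head in the spirit of PSAB."""
    def __init__(self, in_channels: int, scale: int = 8):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        logits = self.head(feat)
        # Upsample back to mask resolution (the red arrow in Figure 6).
        return F.interpolate(logits, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)

def aux_loss(logits: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """BCE against the binary target mask (Figure 7b); training only."""
    return F.binary_cross_entropy_with_logits(logits, target_mask)
```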
Figure 8. Structure of the CMBM. The thermal image and the red band of the visible image are used as input to generate the background mask, which is downsampled to the spatial resolution of the visible-branch feature map and element-wise multiplied with it to produce the final visible-branch feature.
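Because Table 10 lists OTSU thresholding as the CMBM baseline, the mask-generation step in Figure 8 can be approximated as follows. How the thermal and red-band masks are combined, and the mask polarity, are assumptions in this sketch; only the OTSU threshold, the downsampling, and the element-wise multiplication are stated above.

```python
import cv2
import numpy as np

def background_mask(thermal: np.ndarray, vis_red: np.ndarray,
                    feat_hw: tuple) -> np.ndarray:
    """OTSU-based mask in the spirit of CMBM.

    `thermal` and `vis_red` are single-channel uint8 images; `feat_hw`
    is the (H, W) of the visible-branch feature map.
    """
    _, m_t = cv2.threshold(thermal, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, m_r = cv2.threshold(vis_red, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Assumed combination rule: keep pixels salient in either modality.
    mask = cv2.bitwise_or(m_t, m_r)
    h, w = feat_hw
    mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
    # The binary mask is then multiplied element-wise with the feature map.
    return (mask > 0).astype(np.float32)
```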
Figure 9. Visualization results of different models on VTUAV-det. (a) Ground truth; (b) CFT; (c) SuperYOLO*; and (d) PTANet. Ground-truth targets are shown in red, undetected targets in yellow, and false alarms in blue. Scene types are labeled as ST (small targets), WI (low-light conditions), and CB (cluttered backgrounds).
Figure 10. Annotated visible and thermal person targets on different datasets. (a) VTUAV-det. (b) RGBTDronePerson. (c) VTSaR. Red squares indicate person targets.
Figure 11. Visualization results of different models on RGBTDronePerson. (a) Ground truth; (b) YOLOv5n (thermal); (c) QFDet’; and (d) PTANet. Ground-truth targets are shown in red, undetected targets in yellow, and false alarms in blue. Scene types are labeled as ST (small targets), WI (low-light conditions), and CB (cluttered backgrounds).
Figure 12. Visualization results of different models on VTSaR. (a) Ground truth; (b) TFDet; (c) SuperYOLO*; and (d) PTANet. Ground-truth targets are shown in red and undetected targets in yellow. Scene types are labeled as ST (small targets), WI (low-light conditions), and CB (cluttered backgrounds).
Figure 13. Precision–Recall (PR) curves of PTANet for person detection on three datasets. (a) VTUAV-det. (b) RGBTDronePerson. (c) VTSaR.
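The PR curves in Figure 13 follow the standard construction: detections are ranked by confidence, and precision and recall are accumulated down the ranking; AP is the area under the resulting curve. A minimal NumPy version, assuming TP/FP matching has already been performed at the relevant IoU threshold (0.5 for mAP50):

```python
import numpy as np

def pr_curve(scores: np.ndarray, is_tp: np.ndarray, n_gt: int):
    """Compute a PR curve from ranked detections.

    `scores` are detection confidences, `is_tp` flags whether each
    detection matched a ground-truth box, and `n_gt` is the number of
    ground-truth persons in the dataset.
    """
    is_tp = is_tp.astype(bool)
    order = np.argsort(-scores)              # rank by descending confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    precision = tp / np.maximum(tp + fp, 1)  # guard against division by zero
    recall = tp / max(n_gt, 1)
    return precision, recall
```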
Figure 14. Detection heatmaps for different person postures. (a) Original images. (b) Heatmaps for the baseline model. (c) Heatmaps for PTANet.
Figure 15. Jetson Orin NX board and its hardware specifications.
Figure 16. Heatmaps of prediction results before and after adding the proposed modules. (a) Original images. (b) Heatmaps for the baseline model. (c) Heatmaps for the model with the proposed modules added.
Figure 17. Examples of target masks generated by different approaches. (a) Original image. (b) Box-driven mask. (c) Gaussian center heatmap. (d) Boundary erosion. (e) Boundary relaxation.
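The four mask styles in Figure 17 (compared quantitatively in Table 9) differ only in how the box-driven mask is post-processed. The sketch below illustrates all four; the kernel sizes and the Gaussian width are arbitrary illustration values, not the paper's settings.

```python
import cv2
import numpy as np

def box_driven_mask(boxes, h, w):
    """Fill each ground-truth box with ones (Figure 17b)."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1
    return mask

def gaussian_center_heatmap(boxes, h, w, sigma_scale=0.25):
    """Place a box-scaled Gaussian at each target center (Figure 17c)."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sigma = max(x2 - x1, y2 - y1) * sigma_scale + 1e-6
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g.astype(np.float32))
    return heat

def boundary_erosion(mask, k=3):
    """Shrink each region to suppress ambiguous box borders (Figure 17d)."""
    return cv2.erode(mask, np.ones((k, k), np.uint8))

def boundary_relaxation(mask, k=3):
    """Dilate each region to include nearby context (Figure 17e)."""
    return cv2.dilate(mask, np.ones((k, k), np.uint8))
```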
Table 1. Performance comparison of different methods on the VTUAV-det dataset. Bold values indicate results of the proposed method.
| Method | Modality | Backbone | mAP50 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| YOLOv8n [44] | RGB | DarkNet-53 | 39.8 | 2.87 | 8.1 |
| YOLOv5n [54] | RGB | DarkNet-53 + Focus | 41.8 | 1.9 | 4.5 |
| SSD512 [55] | RGB | VGG16 | 10.2 | 24.39 | 268 |
| Faster-RCNN [56] | RGB | ResNet50-FPN | 35.5 | 41.35 | 168 |
| YOLOv8n [44] | T | DarkNet-53 | 73.5 | 2.87 | 8.1 |
| YOLOv5n [54] | T | DarkNet-53 + Focus | 71.6 | 1.9 | 4.5 |
| SSD512 [55] | T | VGG16 | 66.6 | 24.39 | 268 |
| Faster-RCNN [56] | T | ResNet50-FPN | 67.6 | 41.35 | 168 |
| TINet [17] | RGB + T | ResNet50-FPN | 59.40 | 100.86 | 92.80 |
| SuperYOLO* [57] | RGB + T | DarkNet-53 | 76.45 | 4.62 | 56.2 |
| CFT [22] | RGB + T | CSPDarkNet-PANet | 76.0 | 206.03 | 224.40 |
| QFDet [5] | RGB + T | ResNet50-FPN | 70.4 | 60.19 | 81.43 |
| QFDet’ [5] | RGB + T | ResNet50-FPN | 75.5 | 60.25 | 242.82 |
| PTANet (ours) | RGB + T | DarkNet-53 | **79.5** | **4.72** | **29.2** |
Table 2. Performance comparison of different methods on the RGBTDronePerson dataset. Bold values indicate results of the proposed method.
| Method | Modality | Backbone | mAP50 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| YOLOv8n [44] | RGB | DarkNet-53 | 3.81 | 2.87 | 8.1 |
| YOLOv5n [54] | RGB | DarkNet-53 + Focus | 1.88 | 1.9 | 4.5 |
| SSD512 [55] | RGB | VGG16 | 0.7 | 24.39 | 268 |
| Faster-RCNN [56] | RGB | ResNet50-FPN | 0.3 | 41.35 | 168 |
| YOLOv8n [44] | T | DarkNet-53 | 38.8 | 2.87 | 8.1 |
| YOLOv5n [54] | T | DarkNet-53 + Focus | 45.9 | 1.9 | 4.5 |
| SSD512 [55] | T | VGG16 | 42.3 | 24.39 | 268 |
| Faster-RCNN [56] | T | ResNet50-FPN | 38.1 | 41.35 | 168 |
| TINet [17] | RGB + T | ResNet50-FPN | 28.30 | 100.86 | 92.80 |
| SuperYOLO* [57] | RGB + T | DarkNet-53 | 32.94 | 4.62 | 56.2 |
| CFT [22] | RGB + T | CSPDarkNet-PANet | 34.2 | 206.03 | 224.40 |
| QFDet [5] | RGB + T | ResNet50-FPN | 42.08 | 60.19 | 81.43 |
| QFDet’ [5] | RGB + T | ResNet50-FPN | 46.72 | 60.25 | 242.82 |
| VTSaRNet [9] | RGB + T | DarkNet-53 + Focus | 40.42 | 6.9 | 15.2 |
| PTANet (ours) | RGB + T | DarkNet-53 | **47.8** | **4.72** | **29.5** |
Table 3. Performance comparison of different methods on the VTSaR dataset. Bold values indicate results of the proposed method.
| Method | Modality | Backbone | mAP50 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|
| YOLOv8n [44] | RGB | DarkNet-53 | 96.2 | 2.87 | 8.1 |
| YOLOv5n [54] | RGB | DarkNet-53 + Focus | 94.88 | 1.9 | 4.5 |
| SSD512 [55] | RGB | VGG16 | 93.5 | 24.39 | 268 |
| Faster-RCNN [56] | RGB | ResNet50-FPN | 84.8 | 41.35 | 168 |
| YOLOv8n [44] | T | DarkNet-53 | 94.4 | 2.87 | 8.1 |
| YOLOv5n [54] | T | DarkNet-53 + Focus | 94.0 | 1.9 | 4.5 |
| SSD512 [55] | T | VGG16 | 76.5 | 24.39 | 268 |
| Faster-RCNN [56] | T | ResNet50-FPN | 73.8 | 41.35 | 168 |
| TFDet [58] | RGB + T | DarkNet-53 + Focus | 96.9 | 69.89 | 111.1 |
| SuperYOLO* [57] | RGB + T | DarkNet-53 | 96.99 | 4.62 | 56.2 |
| QFDet [5] | RGB + T | ResNet50-FPN | 94.6 | 60.19 | 81.43 |
| QFDet’ [5] | RGB + T | ResNet50-FPN | 95.9 | 60.25 | 242.82 |
| PTANet (ours) | RGB + T | DarkNet-53 | **97.3** | **4.72** | **29.2** |
Table 4. Performance of PTANet on small person targets.
| Model | Dataset | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|
| PTANet | VTUAV-det | 63.9 | 27.4 |
| PTANet | RGBTDronePerson | 32.1 | 11.7 |
| PTANet | VTSaR | 95.4 | 44.8 |
Table 5. Statistical tests for PTANet.
| Model | Seed | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|
| PTANet | 0 | 79.5 | 41.5 |
| PTANet | 1 | 79.3 | 41.0 |
| PTANet | 2 | 79.2 | 40.9 |
Table 6. Performance of PTANet on Jetson Orin NX.
| Model | Precision | Model Size (MB) | Inference Time (ms) |
|---|---|---|---|
| PTANet | FP32 | 21.8 | 16.707 |
| PTANet | FP16 | 13.1 | 12.539 |
| PTANet | INT8 | 8.4 | 11.177 |
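The precisions in Table 6 correspond to the usual Jetson deployment path: export the trained model to ONNX, then build TensorRT engines at FP32, FP16, or INT8. The sketch below uses a stand-in module, since the real PTANet takes paired RGB-T inputs; the 640 × 640 input size and ONNX opset are assumptions.

```python
import torch
import torch.nn as nn

class DummyDetector(nn.Module):
    """Stand-in for the trained detector with paired RGB-T inputs."""
    def __init__(self):
        super().__init__()
        self.rgb_conv = nn.Conv2d(3, 8, 3, padding=1)
        self.t_conv = nn.Conv2d(1, 8, 3, padding=1)

    def forward(self, rgb, thermal):
        return self.rgb_conv(rgb) + self.t_conv(thermal)

model = DummyDetector().eval()
rgb = torch.randn(1, 3, 640, 640)
thermal = torch.randn(1, 1, 640, 640)
torch.onnx.export(model, (rgb, thermal), "ptanet.onnx", opset_version=13,
                  input_names=["rgb", "thermal"], output_names=["pred"])

# On the Jetson, precision-specific TensorRT engines can then be built:
#   trtexec --onnx=ptanet.onnx --saveEngine=fp16.engine --fp16
#   trtexec --onnx=ptanet.onnx --saveEngine=int8.engine --int8
# (meaningful INT8 accuracy additionally requires a calibration dataset.)
```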
Table 7. Results of the ablation study for the modules in PTANet.
| GAFFM | PSAB | CMBM | mAP50 (%) on VTUAV-det | mAP50 (%) on RGBTDronePerson | Params (M) |
|---|---|---|---|---|---|
|  |  |  | 78.4 | 44.9 | 4.08 |
| ✔ |  |  | 78.9 | 45.1 | 4.44 |
|  | ✔ |  | 78.5 | 45.1 | 4.36 |
|  |  | ✔ | 78.6 | 45.7 | 4.08 |
| ✔ | ✔ |  | 79.4 | 47.5 | 4.72 |
| ✔ |  | ✔ | 78.9 | 45.8 | 4.44 |
|  | ✔ | ✔ | 78.8 | 46.2 | 4.36 |
| ✔ | ✔ | ✔ | 79.5 | 47.8 | 4.72 |
Table 8. Performance comparison of GAFFM variants.
| Method | mAP50 (%) | mAP50:95 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|
| GAFFM | 78.9 | 40.7 | 4.44 | 29.2 |
| Add | 78.6 | 40.4 | 4.44 | 29.2 |
| Concat | 78.8 | 40.5 | 4.60 | 29.5 |
| Swapping Q and K | 79.0 | 40.7 | 4.44 | 29.2 |
| Symmetric Cross-Attention | 79.2 | 40.7 | 4.44 | 42.1 |
| Multi-scale Tokens | 79.3 | 40.8 | 4.85 | 29.6 |
Table 9. Performance using different forms of target masks on the VTUAV-det dataset.
| Mask | mAP50 (%) | mAP50:95 (%) |
|---|---|---|
| Box-Driven (ours) | 78.5 | 40.5 |
| Gaussian Center Heatmap | 78.3 | 40.0 |
| Boundary Erosion | 78.9 | 40.7 |
| Boundary Relaxation | 78.0 | 40.3 |
Table 10. Performance of CMBM under perturbations and alignment errors.
| Method | mAP50 (%) | Latency Overhead (ms) |
|---|---|---|
| Baseline (OTSU) | 79.5 | +0.0 |
| Morphology | 79.5 | +0.5 |
| Adaptive Windowed Threshold | 79.4 | +0.5 |
| Channel Switching (G) | 79.5 | +0.0 |
| Channel Switching (B) | 79.4 | +0.0 |
| Misregistration (+4 px) | 79.4 | +0.3 |