1. Introduction
Wildfires present an increasing threat to ecosystems, communities, and climate stability, underscoring the critical need for proactive wildfire risk assessment. Monitoring tree conditions to identify early signs of stress or health degradation plays an essential role in mitigating these risks. Recent advances in unmanned aerial vehicle (UAV) technology and computer vision algorithms have demonstrated significant promise in automating complex tasks. Recent studies show that combining UAVs and computer vision can transform traditional tasks in areas including airport maintenance [1], aircraft inspection [2], search and rescue [3], railway inspection [4], and wildlife preservation [5]. However, the potential of using diverse color spaces to enhance object detection performance remains underexplored, despite its promise for uncovering environmental changes.
Previous studies have predominantly focused on enhancing algorithms to achieve improved outcomes [6,7,8,9]. In addition, traditional computer vision workflows often rely on the standard sRGB color space [10]. Although sRGB is widely used, it is an 8-bit compressed color format that may not capture the fine spectral nuances crucial for detecting early indicators of tree stress. In contrast, alternative color spaces offer enhanced data fidelity. Previous studies investigating alternative color space formats for object detection have demonstrated that certain color spaces converted from raw images yield improved performance [6,11,12,13]. For example, Linear RGB can preserve 16-bit depth, while Log RGB extends to 32-bit, both capturing richer and more nuanced spectral information. Recent research by Bruce Maxwell demonstrates that the Linear RGB and Log RGB color spaces can improve image classification accuracy by better retaining fine details and subtle variations in image data [14].
Modern UAVs, equipped with advanced sensors, are capable of capturing high-resolution raw images in formats such as DNG, which retain extensive spectral information. These raw images can be converted into various color space formats, enabling exploration of how different spectral perspectives influence object detection performance. By leveraging the raw image capture capabilities of UAVs and systematically analyzing diverse color spaces, including Linear RGB, Log RGB, and others, this study aims to identify optimal approaches to enhance tree condition detection.
By integrating advanced color representations with state-of-the-art transformer-based detection architectures—including DDQ [15], CO-DETR [16], and the latest YOLO variant, YOLO11 [17]—this research leverages recently published methodologies to systematically investigate how color space diversity enhances the detection of subtle environmental changes. These models were selected not only for their strong benchmark performance but also for their recency, as they represent cutting-edge advances in transformer and CNN (convolutional neural network) architectures for object detection. By coupling their capabilities with raw UAV-derived color spaces (e.g., Linear RGB, Log RGB), this work bridges the gap between high-fidelity spectral data and actionable wildfire risk insights, while ensuring alignment with the latest innovations in computer vision.
2. Related Work
Two primary machine learning approaches are commonly used to predict fire susceptibility: traditional machine learning algorithms and deep neural networks (DNNs). Traditional algorithms are favored for their simplicity and ease of interpretation, making them well suited for tabular data and smaller datasets. In contrast, deep neural networks, though often regarded as opaque and complex, offer greater flexibility, enabling their application to diverse data types and larger datasets.
2.1. Machine Learning Methods for Fire Susceptibility Prediction
Several studies have used classical machine learning algorithms, including random forest (RF) and support vector machine (SVM), to model wildfire susceptibility. For instance, S. Tavakkoli Piralilou et al. [18] applied machine learning techniques such as artificial neural networks (ANNs), SVM, and RF to train and validate predictive models. Their work addressed the challenge of class imbalance in the prediction of wildfire susceptibility by employing the synthetic minority oversampling technique (SMOTE). This approach generates synthetic samples for the minority class (wildfire locations) to balance the dataset and enhance the performance of the model.
In another study, the susceptibility to wildfires was analyzed using machine learning methods and the Google Earth Engine dataset in Gangwon-do, Republic of Korea [19]. The researchers developed forest fire susceptibility maps (FFSMs) using classification and regression trees (CART), boosted regression trees (BRT), and RF algorithms. They evaluated model performance using metrics such as the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The input data included variables such as distance to urban areas, rainfall, annual average temperature, drainage density, normalized difference vegetation index (NDVI), topographic wetness index, aspect, slope, river distance, road distance, and elevation. The study also highlighted the significant role of human factors in influencing wildfire occurrences.
2.2. Object Detection Algorithms for Forest Fire
Recent research has explored a variety of methodologies for detecting forest fires, leading to the development of advanced and effective models. Neural networks trained on environmental data have achieved high detection accuracy. Autonomous unmanned aerial vehicles (UAVs) have also been proposed to reduce fire risks, highlighting the potential of self-adaptive dispatch techniques [20]. Furthermore, transfer learning approaches have proven effective in forest fire detection, with models such as ResNet-18 demonstrating notable accuracy [21].
Innovations in detection include integrating deep learning with LiDAR data and deploying mixed-learning models on UAV platforms to achieve high classification accuracy and real-time performance. The YOLO algorithm has been widely applied in forest fire detection tasks, with enhanced variants such as Fire-YOLO surpassing existing frameworks in performance [22]. Lightweight models, such as those that utilize MobileNetV3, have been designed to reduce parameter counts while maintaining detection accuracy. Spatial convolutional neural networks based on YOLOv3 enable real-time forest fire monitoring for effective prevention.
A comprehensive study compared classical object detection methods, including Faster R-CNN, YOLO variants, and SSD, to enhance real-time detection accuracy and minimize false positives, thus offering practical solutions for forest safety [23]. In addition, an ensemble learning approach was proposed to integrate individual models such as YOLOv5 and EfficientDet for fire detection, while EfficientNet was used to capture global information and reduce false positives. Experimental results demonstrated an improvement in detection performance by 2.5% to 10.9% and a reduction in false positives by 51.3%, all achieved without introducing additional latency [24].
Recent advances in fire detection systems have focused on improving YOLOv5/v8-based architectures through various optimization techniques. Geng et al. (2024) introduced YOLOFM, which employs a FocalNext network and QAHARep-FPN to enhance multi-scale information integration while reducing model parameters, achieving significant improvements in accuracy metrics (mAP50-95 increased by 7.9%) [7]. Taking a different approach, Han et al. (2024) developed LUFFD-YOLO, specifically designed for UAV-based forest fire detection, which uses GhostNetV2 and novel ESDC2f and HFIC2f structures to better detect small fires while reducing the parameter count by 13% [25]. Complementing these approaches, Ren et al. introduced FCLGYOLO, which uniquely addresses the relationships among positive samples and implements a feature invariance and covariance constraint structure, showing particular effectiveness in challenging conditions such as heavy smoke or tree occlusions [26]. All three studies demonstrate the ongoing evolution of fire detection algorithms, each contributing novel architectural improvements that balance detection accuracy with computational efficiency, though they differ in their specific optimization strategies and target applications.
2.3. Object Detection Algorithms for Forest Fuel Types
Recent research has explored various deep learning techniques using UAV imagery for individual tree detection (ITD) in forest environments. Yu et al. compared classical computer vision methods with the Mask R-CNN algorithm and found that Mask R-CNN, particularly with a multi-band combination, provided the highest accuracy. Their study also analyzed trade-offs between accuracy, algorithm complexity, and data availability [27]. Similarly, Jiang et al. introduced an improved Faster R-CNN framework optimized for visible-light imagery [28]. Harris et al. evaluated canopy health using UAV-based orthoimagery and photogrammetric structure-from-motion software [29]. Another study improved the Faster R-CNN model by incorporating the Swin Transformer, achieving notable performance in urban forest detection tasks [30]. In addition, a comprehensive framework integrated with object detection was developed to advance forest fire risk assessment, offering visual insights for end users.
For forest health and risk analysis, Wang proposed the lightweight LDS-YOLO architecture to detect dead trees, which demonstrated high precision and compact parameter size [31]. Kislov extended this work to identify forest damage caused by windthrows and bark beetles using a CNN algorithm based on U-Net [32]. In China, a novel framework that integrates deep learning, geographic data, and multi-source information achieved high precision in the prediction of forest fire risk [33].
Alonso-Benito et al. utilized four classification algorithms to map forest fuel types using Terra-ASTER sensor imagery on Tenerife Island, Spain. Their comparison with reference datasets revealed that an object-based image analysis approach achieved the highest accuracy (95%), outperforming pixel-based methods such as maximum likelihood, neural networks, and support vector machines by 12%. The inclusion of contextual information in the object-based method significantly improved the differentiation of fuel types with similar spectral characteristics, demonstrating its efficacy for accurate mapping of forest fuel types [34].
Ruiz et al. proposed an object-oriented classification framework combining low-density LiDAR data and multispectral imagery to classify Mediterranean forests into structural types, including grasslands, shrubs, forests, mixed forests, and dense young forests. Their evaluation of four classification algorithms showed that integrating LiDAR with multispectral data improved overall accuracy (90.75%). This method demonstrated the efficiency of object-oriented classification in stratifying Mediterranean forests into structural and fuel-related categories [35].
2.4. Log RGB Image Data for Computer Vision
The exploration of deep network architectures and training methods in computer vision has been extensive, with significant contributions from various foundational architectures such as AlexNet, VGG-16/19, GoogLeNet, ResNet, MobileNet, and DenseNet. These architectures have been pivotal in advancing image classification tasks, particularly with the advent of large-scale databases like ImageNet.
However, the preprocessing of image data has received comparatively less attention. Most existing large-scale databases, including ImageNet, COCO, Pascal VOC, and others, consist of images collected from the Web, where the preprocessing steps are often unknown. This has led to a reliance on sRGB images, which are optimized for human viewing rather than for computational tasks. Maxwell et al. (2024) investigated this gap by exploring the use of RAW and Linear RGB data for various computer vision tasks [14]. For instance, the PascalRAW and LOD datasets support object detection and instance segmentation, while the ROD database provides high-dynamic-range linear imagery for object detection. Additionally, there has been some exploration of using Log RGB data, which has shown promise in improving performance on small datasets and on specific tasks such as illumination-invariant HOG features and material priors.
That work further demonstrates that networks trained on Log RGB data exhibit improved performance on image classification tasks and invariance to intensity and color balance modifications. This is achieved by using data derived directly from RAW images to ensure their integrity and by introducing a new 10-category, 10k-image RAW dataset (RAW10) for further exploration. The findings suggest that existing databases may also benefit from this type of preprocessing, as gains from high-quality Log RGB data can be partially or fully realized from data in 8-bit sRGB-JPG format by inverting the sRGB transform and taking the log.
2.5. Research Gap
Despite significant advances in the field of environmental monitoring and object detection, several critical areas remain underexplored. The following points outline how our study aims to address specific areas where further investigation is warranted.
Impact of Diverse Color Spaces in Object Detection Tasks: While diverse color spaces such as Linear RGB and Log RGB are known to offer richer spectral and dynamic range information, their potential impact on object detection tasks remains largely unexplored. Current research relies primarily on the standard sRGB color space, which may limit the accuracy of models in capturing subtle details critical for environmental monitoring. This study represents a pioneering effort to systematically investigate the influence of alternative color spaces on object detection in the context of tree condition analysis.
Utilization of Raw Drone Images for Wildfire Risk Assessment: Despite the capabilities of modern UAVs to capture high-resolution raw images (e.g., DNG format) with extensive spectral and dynamic range information, their use in wildfire risk assessment remains underutilized. Existing methodologies predominantly rely on compressed image formats, which may not fully exploit the potential of UAV-captured data. This research addresses this gap by using raw drone imagery to improve tree condition detection, offering a novel approach to wildfire risk assessment.
Application of SOTA Object Detection Models in Wildfire Risk Assessment: While vision transformer (ViT) models have revolutionized many computer vision tasks due to their superior ability to capture global and contextual features, especially for detecting subtle and complex features in crowded scenes, their application in wildfire risk assessment has not yet been fully explored. Most existing studies rely on traditional convolutional neural networks, potentially overlooking the advanced capabilities of ViTs. This research aims to be the first to employ state-of-the-art ViT models to detect tree conditions in areas prone to wildfires, using their transformative potential for more accurate and robust environmental monitoring. In addition, we compare the performance of ViT models with the latest YOLO model to better demonstrate the generalizability of our research.
To the best of our knowledge, no prior research has optimized Log RGB specifically for wildfire risk assessment tasks related to fuel type detection and flammability evaluation. Existing studies have focused predominantly on active fire detection and standard sRGB processing. In contrast, our work uniquely uses Log RGB to assess fuel conditions and potential flammability, filling an important gap in the literature.
3. Methodology
Our study demonstrates how the combination of UAV technology, log-transformed RGB imagery, and state-of-the-art object detection models can enhance object detection performance for the forest fire risk assessment process, as illustrated in Figure 1. Our methodology consists of four main phases. First, we collect high-resolution aerial imagery using unmanned aerial vehicles (UAVs). Second, we process the collected imagery through various color space transformations. Third, we train state-of-the-art object detection models on the transformed data to detect and classify forest tree health conditions. Finally, we analyze the detection results to estimate the levels of wildfire risk for each surveyed area. This comprehensive pipeline enables an accurate assessment of forest conditions and potential fire hazards using advanced computer vision techniques.
3.1. Data Collection
The original images used in this study were captured in digital negative format (DNG) using a high-resolution camera with a resolution of 8192 × 5460 pixels. This format was chosen for its ability to retain extensive spectral and dynamic range information, which is crucial for accurate image analysis. The images were captured using the DJI Zenmuse P1 drone payload, known for its advanced imaging capabilities.
3.2. Color Space Conversion
Let us now discuss the color space conversions used in this paper. When cameras capture images, the result is typically due to physical interactions between incident illumination and the reflecting material. The Bi-Illuminant Dichromatic Reflection (BIDR) model is often used to describe incident illumination, which consists of two separate components: direct illumination and ambient illumination [36]. For reflecting materials, the dichromatic reflection model is commonly used, which also has two components: body reflection and surface reflection [37]. In Linear RGB, the reflected light is the product of the material reflectance and the incident illumination. As shown in Formula (1), the image value $I$ is derived from the combination of the body reflection $R_b$ and the ambient illumination $l_a$, plus the direct illumination $l_d$ modified by $\gamma$, which represents both geometric shading and cast shadows that affect the intensity of direct illumination:
$$I = R_b \left( l_a + \gamma\, l_d \right). \quad (1)$$
Compared with Linear RGB, we apply a log transformation to the body reflection model to separate the confounded body reflection and illumination terms, as shown in Formula (2):
$$\log I = \log R_b + \log\!\left( l_a + \gamma\, l_d \right). \quad (2)$$
Applying this log transformation specifically leverages the additional information contained in the RAW UAV images. RAW images provide higher spectral fidelity, preserving more precise and consistent color information under varying illumination conditions. The Log RGB transformation exploits this spectral fidelity and higher dynamic range by transforming multiplicative illumination effects into additive ones, significantly reducing the variability caused by changing lighting conditions. Consequently, the transformation improves the visibility of the features, particularly in areas with subtle variations in shadow or highlights, and, thus, improves the model’s ability to recognize fine details related to tree health and stress conditions.
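To make this effect concrete, the short NumPy sketch below (an illustration only, not the project's conversion code) shows how a purely multiplicative change in illumination on a linear RGB observation becomes a constant additive offset after the log transform:

```python
import numpy as np

def to_log_rgb(linear_rgb: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map strictly positive linear RGB values to Log RGB."""
    return np.log(np.clip(linear_rgb, eps, None))

# Toy example: the same surface seen under two illumination levels.
reflectance = np.array([0.30, 0.45, 0.20])          # body reflection per channel
dim, bright = 0.5 * reflectance, 2.0 * reflectance  # multiplicative illumination change

# In Linear RGB the two observations differ by a factor of 4 per channel ...
print(bright / dim)                                 # -> [4. 4. 4.]
# ... while in Log RGB the same change is a constant additive offset,
# which is easier for a detector to become invariant to.
print(to_log_rgb(bright) - to_log_rgb(dim))         # -> [log(4), log(4), log(4)]
```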
In the XYZ color space, colors are represented by three values: X, Y, and Z. These values correspond to the responses of the human eye's three types of color receptors (cones) to different wavelengths of light. The Y value represents the luminance (brightness) of the color, while the X and Z values represent the chromaticity (color quality). For a specific spectral power distribution $P(\lambda)$, the XYZ values are calculated as follows in Formula (3):
$$X = \int_{\lambda} P(\lambda)\,\bar{x}(\lambda)\,d\lambda, \quad Y = \int_{\lambda} P(\lambda)\,\bar{y}(\lambda)\,d\lambda, \quad Z = \int_{\lambda} P(\lambda)\,\bar{z}(\lambda)\,d\lambda, \quad (3)$$
where $\bar{x}(\lambda)$, $\bar{y}(\lambda)$, and $\bar{z}(\lambda)$ are the CIE color matching functions.
The LMS color space is based on the human visual system's response to light. It stands for long, medium, and short wavelengths, which correspond to the three types of cone cells in the human eye that are sensitive to different parts of the light spectrum. Here is a quick breakdown: L (long): sensitive to red light. M (medium): sensitive to green light. S (short): sensitive to blue light. For the same spectral power distribution $P(\lambda)$, the LMS values are defined as follows in Formula (4):
$$L = \int_{\lambda} P(\lambda)\,\bar{l}(\lambda)\,d\lambda, \quad M = \int_{\lambda} P(\lambda)\,\bar{m}(\lambda)\,d\lambda, \quad S = \int_{\lambda} P(\lambda)\,\bar{s}(\lambda)\,d\lambda, \quad (4)$$
where $\bar{l}(\lambda)$, $\bar{m}(\lambda)$, and $\bar{s}(\lambda)$ are the cone sensitivity functions.
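As a numerical illustration of Formulas (3) and (4), the sketch below approximates the integrals with discrete sums and then maps XYZ to LMS with the Bradford chromatic-adaptation matrix used in the preprocessing of Section 5.1. The spectrum and matching functions are random stand-ins, so the output values carry no physical meaning:

```python
import numpy as np

# Hypothetical coarse samples of a spectral power distribution P(lambda) and of
# the color matching functions on the same wavelength grid (random stand-ins).
wavelengths = np.arange(400, 701, 10)   # nm
dl = 10.0                               # grid spacing in nm
P = np.random.rand(wavelengths.size)
x_bar, y_bar, z_bar = (np.random.rand(wavelengths.size) for _ in range(3))

# Formula (3) as discrete Riemann sums: X = sum P * xbar * dl, etc.
X = float(np.sum(P * x_bar) * dl)
Y = float(np.sum(P * y_bar) * dl)
Z = float(np.sum(P * z_bar) * dl)

# LMS can also be reached from XYZ with a chromatic-adaptation matrix;
# the standard Bradford matrix is shown here.
BRADFORD = np.array([[ 0.8951,  0.2664, -0.1614],
                     [-0.7502,  1.7135,  0.0367],
                     [ 0.0389, -0.0685,  1.0296]])
L, M, S = BRADFORD @ np.array([X, Y, Z])
print(X, Y, Z, L, M, S)
```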
Finally, we calculate the D-Log values based on the white paper on D-Log and D-Gamut of the DJI Cinema Color System, which defines the D-Log transfer curve piecewise over two intensity ranges (Formulas (5) and (6)).
To evaluate the impact of different color spaces on the performance of the object detection models, the DNG images were converted into six different color spaces: sRGB, Linear RGB, Log RGB, XYZ, LMS, and D-Log. The conversion process was as follows, and the output images are shown in Figure 2:
sRGB: Converted to JPEG format.
Linear RGB, Log RGB, XYZ, LMS, and D-Log: Converted to TIFF format.
3.3. Preprocessing
The converted images underwent a series of preprocessing steps to ensure that they were suitable for input into the object detection models. These steps included resizing, normalization, and augmentation to enhance the robustness of the models. The details for this part will be discussed in the Experiment section.
3.4. Model Training
The DDQ [15], CO-DETR [16], and YOLO11 [17] models were selected and trained separately on each set of images corresponding to the different color spaces. The training process involved using a consistent set of hyperparameters and training epochs to ensure a fair comparison between the color spaces. The details for this part will be discussed in the next section.
3.5. Evaluation Metrics
The performance of the models was evaluated using mean average precision (mAP) and average recall (AR) metrics. These metrics were chosen because together they provide a comprehensive assessment of a model's accuracy and its ability to detect objects across different color spaces.
To understand these metrics, it is essential to first define precision and recall using the confusion matrix framework. For object detection tasks, we have
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$
where:
True positives (TP): Correctly detected objects.
False positives (FP): Incorrect detections or false alarms.
False negatives (FN): Missed objects that should have been detected.
Mean average precision (mAP) is a key metric widely used to evaluate the performance of object detection models. It is calculated as the mean of the average precision (AP) across all classes, defined as
$$\text{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$
where $N$ is the total number of classes, and $AP_i$ represents the precision–recall integration for class $i$, calculated as
$$AP_i = \int_{0}^{1} p_i(r)\,dr.$$
In practice, mAP50 (also written as mAP@0.5) is commonly used, which specifically measures the mean average precision at an intersection over union (IoU) threshold of 0.5. This means a detection is considered correct if the predicted bounding box overlaps with the ground truth box by at least 50%. The formula remains the same, but the calculation of true positives is based on this IoU threshold:
$$\text{mAP50} = \frac{1}{N}\sum_{i=1}^{N} AP_i^{\,\text{IoU}\ge 0.5}.$$
Average recall (AR) complements mAP by evaluating the proportion of true positive detections over a range of intersection over union (IoU) thresholds. It is defined as
$$\text{AR} = \frac{1}{T}\sum_{t=1}^{T} R_t,$$
where $T$ represents the total number of IoU thresholds and $R_t$ is the recall at a specific IoU threshold $t$. Together, mAP and AR provide complementary insights into the accuracy and robustness of an object detection model. While mAP balances precision and recall to give an overall performance metric, AR focuses specifically on the model's ability to find all relevant objects in the image.
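The minimal Python sketch below (a simplified illustration, not the COCO evaluation code used for our reported results) shows how the per-class AP integral and the averaged recall can be computed from precision–recall points:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Integrate the precision-recall curve (AP_i) with all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing before integrating.
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum(np.diff(r) * p[1:]))

def average_recall(recalls_per_iou: np.ndarray) -> float:
    """AR = (1/T) * sum_t R_t over the evaluated IoU thresholds."""
    return float(np.mean(recalls_per_iou))

# Toy usage with made-up curves for a single class:
ap = average_precision(np.array([0.2, 0.5, 0.8]), np.array([1.0, 0.8, 0.6]))
ar = average_recall(np.array([0.9, 0.8, 0.6, 0.4]))   # recall at several IoU thresholds
print(f"AP: {ap:.3f}, AR: {ar:.3f}")
```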
4. SOTA Models for Object Detection
We selected three models for our color performance experiments: Dense Distinct Query (DDQ), CO-DEtection TRansformers (CO-DETR), and You Only Look Once (YOLO). All three models have demonstrated strong results on widely recognized object detection benchmarks and offer efficient end-to-end training. Their proven accuracy and ability to handle detailed scenes make them well suited for our wildfire risk assessment study.
4.1. DDQ
Object detection models traditionally use one-to-one label assignment to eliminate the need for post-processing steps like non-maximum suppression (NMS), enabling end-to-end training. However, this approach faces a challenge: sparse queries often fail to achieve high recall, while dense queries lead to many similar predictions, complicating optimization. To solve this, DDQ (Dense Distinct Query) introduces a novel approach that combines the strengths of sparse and dense queries, improving object detection performance [15].
The key idea behind DDQ is to first generate a dense set of queries, similar to traditional object detectors, and then select distinct queries for one-to-one matching during training. It balances the coverage of dense queries with the diversity of distinct ones, addressing both the recall limitations of sparse queries and the optimization difficulty of dense ones. As a result, DDQ improves the performance of various object detection models, including FCN, R-CNN, and DETRs.
DDQ-DETR (Figure 3), a specific application of this method, achieves impressive results on the MS-COCO dataset, reaching 52.1 AP in just 12 epochs using a ResNet-50 backbone. This outperforms all other detectors in the same setting. Furthermore, DDQ excels at handling crowded scenes, achieving 93.8 AP on the CrowdHuman dataset. These results highlight the ability of DDQ to significantly increase detection accuracy while maintaining the advantages of end-to-end training. Accurate detection of multiple types of vegetation stress and debris is vital for wildfire risk assessment, as a single drone image can contain crowded clusters of trees and mixed background elements. DDQ addresses these scenarios well by retaining broad query coverage while filtering out duplicates, leading to higher recall and precise bounding boxes. Given the need to spot subtle signs of tree health degradation early, the enhanced performance of DDQ in complex scenes makes it a strong choice for our wildfire assessment task.
4.2. CO-DETR
DETR (DEtection TRansformers) has revolutionized object detection by using transformers for end-to-end training, eliminating the need for anchors or non-maximum suppression (NMS). However, DETR's one-to-one matching of predicted queries with ground truth often results in sparse positive samples, which limits the encoder's ability to learn useful features and the decoder's ability to focus attention. CO-DETR addresses this by introducing a collaborative hybrid assignment scheme, which enhances training efficiency and improves both encoder and decoder performance [16], as shown in Figure 4.
CO-DETR uses multiple auxiliary heads trained with one-to-many label assignments, similar to methods like ATSS and Faster R-CNN. This approach provides richer supervision for the encoder and improves the attention mechanism in the decoder. Additionally, CO-DETR generates additional positive queries from these auxiliary heads, which helps the decoder to focus more effectively on positive samples during training. Importantly, CO-DETR does not increase the computational cost or the size of the model at inference time because the auxiliary heads are discarded after training. This makes it a highly efficient solution for improving DETR-based models without adding complexity.
Experimental results show that CO-DETR improves the performance of models such as DINO-Deformable-DETR, improving COCO AP from 58.5% to 59.5%. With a ViT-L backbone, CO-DETR achieves 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods with fewer parameters. Detecting subtle signs of tree stress often involves discovering small or partially occluded targets under dynamic illumination conditions. CO-DETR’s training strategy, which includes multiple auxiliary heads, delivers more robust decoder supervision for those challenging targets. Its efficiency at inference and capacity to learn from additional positive samples suit large-scale drone imagery, making it a compelling choice for supporting continuous, wide-area wildfire monitoring.
4.3. YOLO11
YOLO11 represents the latest advancement in the YOLO series, introducing key architectural innovations that enhance real-time object detection performance. Building on its predecessors, YOLO11 incorporates the C3k2 block for efficient feature extraction, SPPF for multi-scale pooling, and the novel C2PSA module with parallel spatial attention to improve focus on critical regions, as shown in Figure 5. These refinements enable YOLO11 to achieve higher mean average precision (mAP) on benchmark datasets like COCO while reducing parameter counts by up to 22% compared to YOLOv8, making it more efficient for deployment across edge devices and cloud platforms. Additionally, YOLO11 extends its capabilities beyond traditional object detection to support instance segmentation, pose estimation, and oriented bounding box detection, broadening its applicability in fields such as autonomous driving, medical imaging, and industrial automation [17].
The evolution of YOLO models highlights a consistent trend toward balancing accuracy and computational efficiency, with YOLO11 emerging as a state-of-the-art solution for real-time vision tasks. Unlike earlier versions, YOLO11 optimizes its backbone, neck, and head structures with attention mechanisms and adaptive kernel sizing, enabling superior performance in detecting small or occluded objects. Benchmark studies demonstrate that YOLO11 variants (nano to x-large) outperform previous iterations in both speed and accuracy, particularly in low-latency scenarios. Its versatility across multiple computer vision tasks, coupled with improved feature extraction and parameter efficiency, positions YOLO11 as a leading choice for next-generation vision systems, though challenges remain in computational demands for larger models.
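For reference, the snippet below sketches how YOLO11 can be fine-tuned and applied with the Ultralytics library. The dataset YAML name and file paths are placeholders for illustration rather than the project's actual files:

```python
from ultralytics import YOLO

# Load the lightweight YOLO11 nano checkpoint used in our comparison.
model = YOLO("yolo11n.pt")

# Fine-tune on a YOLO-format dataset description ("tree_conditions.yaml" is a
# placeholder for a YAML listing the four classes and image/label folders).
model.train(
    data="tree_conditions.yaml",
    epochs=200,      # upper bound; early stopping usually ends training sooner
    imgsz=1024,
    batch=2,
    patience=10,     # early-stopping patience, matching Table 1
)

# Run inference on a converted tile and visualize the detections.
results = model.predict("example_log_rgb.tiff", conf=0.25)
results[0].show()
```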
5. Experiment
5.1. Raw Data Preprocessing
Given the limited availability of RAW datasets, we used the DJI Zenmuse P1 to collect data. The DJI Zenmuse P1 is a high-performance aerial imaging system, equipped with a 45 MP full-frame sensor and designed specifically for photogrammetry missions. A total of 335 RAW images were selected, representing various types of trees, including healthy trees, dead trees, beetle-infested trees, and forest debris. Approximately 80% of the dataset consists of healthy trees, while the other categories are underrepresented due to the natural composition of the forest. These images were captured primarily over areas in the Lower Mainland of Vancouver and were saved in RAW DNG format. The full dataset originally comprised over 2000 images. After a rigorous clean-up and deduplication process, we selected 335 high-quality images. These images were collected from three separate geographical locations, with approximately 110 images from each location. Each location provided a diverse range of tree species and distinct lighting scenarios, and each image contains at least 100 labeled trees, ensuring that the dataset is robust enough for generalizability. The dataset was divided into 234 images for training, 68 for validation, and 33 for testing, maintaining a 70/20/10 split.
Figure 2 presents examples from the training set. The decision to prioritize quality over dataset size was based on the work of other researchers, who point out the advantage of good training examples over a larger volume of input data [38].
Six different image formats were generated from the RAW DNG files: JPG-sRGB, TIFF-Linear, TIFF-Log, TIFF-XYZ, TIFF-LMS, and TIFF-D-Log for model training. Initial preprocessing involved using the rawPy library to adjust the white balance parameters of the camera metadata while restricting pixel saturation to below 0.001%. The images were then resized using OpenCV’s INTER_AREA interpolation, maintaining a short-edge resolution of 480 pixels. The processed data retained 16-bit depth per channel, covering a numerical range of [0, 65,535].
To produce the sRGB dataset, the RAW images were resized proportionally to a width of 1024 pixels, converted to a 32-bit floating-point format, and normalized to [0, 1]. The sRGB color space conversion was applied, followed by scaling to an 8-bit integer range [0, 255] for JPG compression via OpenCV's default settings. For the Linear RGB variant, the resized images were stored directly as 16-bit TIFF files. The Log RGB conversion required resizing, 32-bit float conversion, and applying a natural logarithm transformation to non-zero pixel values before saving as 16-bit TIFFs [14].
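The following sketch outlines the sRGB, Linear RGB, and Log RGB branches of this conversion with rawPy and OpenCV. It is a simplified illustration: the filenames are placeholders, the saturation-clipping and 480-pixel intermediate steps are omitted, and the sRGB transfer curve is approximated with a simple gamma:

```python
import cv2
import numpy as np
import rawpy

def load_linear_16bit(dng_path: str) -> np.ndarray:
    """Demosaic a DNG with camera white balance and no gamma (linear output)."""
    with rawpy.imread(dng_path) as raw:
        rgb16 = raw.postprocess(
            use_camera_wb=True,
            gamma=(1, 1),          # keep the data linear
            no_auto_bright=True,
            output_bps=16,
        )
    return rgb16                    # uint16, range [0, 65535]

linear = load_linear_16bit("DJI_0001.DNG")          # placeholder filename
h, w = linear.shape[:2]
linear = cv2.resize(linear, (1024, int(h * 1024 / w)), interpolation=cv2.INTER_AREA)

# Linear RGB: store as 16-bit TIFF.
cv2.imwrite("sample_linear.tiff", cv2.cvtColor(linear, cv2.COLOR_RGB2BGR))

# Log RGB: natural log of non-zero values, rescaled back to the 16-bit range.
log_img = np.log(np.clip(linear.astype(np.float32), 1, None))
log_img = (log_img / np.log(65535.0) * 65535.0).astype(np.uint16)
cv2.imwrite("sample_log.tiff", cv2.cvtColor(log_img, cv2.COLOR_RGB2BGR))

# sRGB: apply an approximate sRGB gamma on [0, 1] data and save as 8-bit JPG.
srgb = np.clip(linear.astype(np.float32) / 65535.0, 0, 1) ** (1 / 2.2)
cv2.imwrite("sample_srgb.jpg", cv2.cvtColor((srgb * 255).astype(np.uint8), cv2.COLOR_RGB2BGR))
```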
In addition, three different color spaces were created for training: XYZ, LMS, and D-Log.
For the XYZ dataset, the raw data were demosaiced directly into the XYZ output color space. The data were then resized to a width of 1024 pixels with proportional height, converted to float32, scaled to the range [0, 1], and saved in TIFF format with default settings.
To create the LMS dataset, the data were first transformed into XYZ, a per-pixel dot product with the Bradford transformation matrix was then applied, and the result was saved in TIFF format.
For the D-Log dataset, the raw data were read and resized to a width of 1024 pixels with proportional height. The corresponding D-Log curve was applied, and the data were saved as 16-bit TIFF images.
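A corresponding sketch for the XYZ and LMS branches is shown below (placeholder filenames; the D-Log curve itself is omitted here because its exact coefficients come from the DJI white paper):

```python
import cv2
import numpy as np
import rawpy

# Standard Bradford chromatic-adaptation matrix (XYZ -> LMS).
BRADFORD = np.array([[ 0.8951,  0.2664, -0.1614],
                     [-0.7502,  1.7135,  0.0367],
                     [ 0.0389, -0.0685,  1.0296]], dtype=np.float32)

def dng_to_xyz(dng_path: str) -> np.ndarray:
    """Demosaic a DNG directly into the CIE XYZ output color space, scaled to [0, 1]."""
    with rawpy.imread(dng_path) as raw:
        xyz16 = raw.postprocess(
            output_color=rawpy.ColorSpace.XYZ,
            gamma=(1, 1),
            no_auto_bright=True,
            output_bps=16,
        )
    return xyz16.astype(np.float32) / 65535.0

xyz = dng_to_xyz("DJI_0001.DNG")                        # placeholder filename
h, w = xyz.shape[:2]
xyz = cv2.resize(xyz, (1024, int(h * 1024 / w)), interpolation=cv2.INTER_AREA)
cv2.imwrite("sample_xyz.tiff", cv2.cvtColor(xyz, cv2.COLOR_RGB2BGR))   # float32 TIFF

# LMS: per-pixel dot product with the Bradford matrix, saved the same way.
lms = xyz @ BRADFORD.T
cv2.imwrite("sample_lms.tiff", cv2.cvtColor(lms, cv2.COLOR_RGB2BGR))
```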
5.2. Core Experiments
To evaluate our hypothesis, we selected three state-of-the-art (SOTA) object detection models, DDQ (DDQ-DETR), CO-DETR, and YOLO11, for testing on six diverse datasets. These models were chosen due to their strong performance in object detection tasks and their ability to handle a range of complex visual data. All three models were trained using the same experimental setup to ensure a fair and consistent comparison.
Training configurations are included in Table 1. The hyperparameters chosen were critical in ensuring proper training while maintaining computational efficiency. The models were trained for up to 200 epochs, providing enough time to converge to an optimal set of weights without excessive overfitting. A batch size of 2 was selected to accommodate memory constraints while allowing for efficient updates to the model's parameters, though smaller batches tend to result in noisier gradients. To further mitigate overfitting, early stopping was implemented with a patience of 10 epochs, halting training when the validation loss failed to improve over consecutive epochs. This approach prevents unnecessary training and ensures that the model generalizes well.
The purpose of loss functions is to optimize different aspects of an anchor-free object detection model, ensuring it learns to classify objects, localize them accurately, and improve the quality of predicted boxes. The following are the loss functions for the three models:
$\mathcal{L}_{\text{L1}}$ is the L1 loss for bounding box regression. It measures the absolute difference between the predicted and ground-truth bounding box coordinates, ensuring accurate localization. L1 loss is robust to outliers and does not overpenalize large errors like L2 loss, making it more stable during training. However, it does not consider the shape, size, or overlap of bounding boxes, which is why additional IoU-based losses are often used.
$\mathcal{L}_{\text{focal}}$ is the focal loss for classification. It is designed to address class imbalance in object detection by reducing the influence of easy-to-classify background samples and focusing more on hard, misclassified examples. The focal loss formula includes a modulating factor that down-weights well-classified samples, where $\gamma$ (typically 2.0) controls this effect. The parameter $\alpha$ (typically 0.25) balances positive and negative samples, ensuring better learning for underrepresented object classes.
$\mathcal{L}_{\text{GIoU}}$ is the generalized IoU loss for better box localization. Unlike standard IoU, which only considers overlap, GIoU accounts for the distance between predicted and ground-truth boxes by incorporating the area of the smallest enclosing rectangle. This makes it effective even when boxes do not overlap, guiding the model toward better localization. GIoU loss is particularly useful for refining bounding box predictions in cases where L1 loss alone is insufficient.
$\mathcal{L}_{\text{centerness}}$ ensures that the predicted bounding boxes are centered on objects. In anchor-free object detectors, centerness is used to suppress low-quality predictions by assigning higher confidence scores to boxes closer to the object's center. The centerness score is computed as the geometric mean of the normalized distances to the box edges, reducing the contribution of off-center predictions. This helps improve precision and reduces redundant detections around the same object.
$\mathcal{L}_{\text{CIoU}}$ improves bounding box localization by extending the generalized IoU (GIoU) loss. Unlike GIoU, which penalizes non-overlapping boxes by incorporating the area of the smallest enclosing rectangle, CIoU (complete IoU) additionally considers the aspect ratio and center distance between predicted and ground-truth boxes. This makes it sensitive to both shape alignment and positional accuracy, addressing scenarios where boxes partially overlap or have mismatched dimensions. CIoU is particularly effective for refining anchor-free predictions by directly optimizing the geometric relationship between boxes.
$\mathcal{L}_{\text{BCE}}$ handles class probability estimation using binary cross-entropy with logits. It independently scores each class presence, avoiding softmax-based competition between classes. This design simplifies multi-class detection by treating classes as separate binary tasks, which is computationally efficient and robust to class imbalance. The loss focuses on distinguishing foreground (object) from background while penalizing misclassifications proportionally to their confidence deviations.
$\mathcal{L}_{\text{DFL}}$ (distribution focal loss) refines bounding box regression by modeling coordinates as a learned distribution over discrete bins. Instead of regressing box offsets directly, DFL predicts probabilities for neighboring bins and computes expectations to estimate precise coordinates. This "soft" regression approach reduces quantization errors caused by hard bin assignments, enabling sub-bin accuracy. By optimizing cross-entropy between predicted and target distributions, DFL stabilizes training and improves localization fidelity, especially for small or ambiguous objects.
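For illustration, the PyTorch snippet below gives compact reference implementations of two of these losses (focal and GIoU); they are simplified sketches rather than the exact implementations inside MMDetection or Ultralytics:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    """Binary focal loss: down-weights easy examples via (1 - p_t)^gamma."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def giou_loss(pred, target):
    """GIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-6)
    # Smallest enclosing box.
    ex1, ey1 = torch.min(pred[:, 0], target[:, 0]), torch.min(pred[:, 1], target[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], target[:, 2]), torch.max(pred[:, 3], target[:, 3])
    enclose = ((ex2 - ex1) * (ey2 - ey1)).clamp(min=1e-6)
    giou = iou - (enclose - union) / enclose
    return (1 - giou).mean()
```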
Training and inference were conducted on an NVIDIA A100 80 GB GPU, ensuring consistent hardware for all experiments. Given the varying color spaces and pixel value distributions across the datasets, we first calculated the mean and standard deviation of pixel values for each dataset’s color space (e.g., sRGB). These values were used to normalize the input images, ensuring that each dataset’s color distribution was appropriately aligned for model training.
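A simple way to obtain these per-dataset statistics is sketched below (the directory path is a placeholder); the resulting mean and standard deviation are then plugged into each model's normalization configuration:

```python
import glob

import cv2
import numpy as np

def channel_stats(image_dir: str, pattern: str = "*.tiff"):
    """Accumulate per-channel mean and std over a dataset of 3-channel images."""
    sums = np.zeros(3, dtype=np.float64)
    sq_sums = np.zeros(3, dtype=np.float64)
    count = 0
    for path in glob.glob(f"{image_dir}/{pattern}"):
        img = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float64)
        pixels = img.reshape(-1, 3)
        sums += pixels.sum(axis=0)
        sq_sums += (pixels ** 2).sum(axis=0)
        count += pixels.shape[0]
    mean = sums / count
    std = np.sqrt(sq_sums / count - mean ** 2)
    return mean, std

mean, std = channel_stats("data/log_rgb/train")   # placeholder directory
```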
The DDQ and CO-DETR models were implemented and trained using the MMDetection framework, an open-source PyTorch-based toolbox that supports a wide range of object detection models. The Adam optimizer was used with default settings, and a learning rate scheduler adjusted the learning rate during training to ensure efficient convergence. The YOLO11 model was trained directly using the official Ultralytics Python library. The lightweight model version, YOLO11n, was selected for its computational efficiency, enabling a fair comparison with the heavyweight vision transformer models.
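As an illustration of this setup, the snippet below shows how an MMDetection 3.x experiment can be launched programmatically. The config path follows MMDetection's naming convention but is illustrative, and the normalization values are placeholders standing in for the per-dataset statistics computed above:

```python
from mmengine.config import Config
from mmengine.runner import Runner

# Illustrative config name; the actual experiment configs were adapted
# per color space and dataset.
cfg = Config.fromfile("configs/ddq/ddq-detr-4scale_r50_8xb2-12e_coco.py")

cfg.train_dataloader.batch_size = 2          # matches Table 1
cfg.train_cfg.max_epochs = 200
# Per-dataset normalization statistics (placeholder values).
cfg.model.data_preprocessor.mean = [123.7, 116.3, 103.5]
cfg.model.data_preprocessor.std = [58.4, 57.1, 57.4]

runner = Runner.from_cfg(cfg)
runner.train()
```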
Model performance was evaluated using mean average precision (mAP) and average recall (AR) to measure detection accuracy across all object classes, and FLOPs (floating-point operations) to show the computational efficiency of each model.
6. Results
All the training sessions were completed within 100 epochs due to the early stopping settings. We take the Log RGB dataset and the DDQ model as an example. As shown in Figure 6, both the classification loss ($\mathcal{L}_{\text{focal}}$) and the bounding box regression loss ($\mathcal{L}_{\text{L1}}$) decrease rapidly during the first 2000 iterations. The decline then slows and flattens after approximately 3000 iterations, indicating that training has converged properly.
The experiments evaluate the performance of DDQ, CO-DETR, and YOLO11 across six color spaces: sRGB, Linear RGB, Log RGB, XYZ, LMS, and D-Log. The four classes under study are Alive Tree, Beetle Fire Tree, Dead Tree, and Debris. Table 2, Table 3 and Table 4 list the class-wise mAP50 scores for DDQ, CO-DETR, and YOLO11, respectively.
YOLO11 significantly outperforms transformer-based models (DDQ and CO-DETR) on the “Beetle Fire Tree” class, achieving an impressive 61.8% mAP50 in Log RGB color space compared to 32% and 17.2% for DDQ and CO-DETR, respectively. This suggests that YOLO’s architecture excels at detecting objects with distinct visual characteristics like fire-damaged trees. Conversely, CO-DETR demonstrates superior performance on “Alive Tree” and “Debris” classes in Log RGB space, reaching 62.5% and 49.8% mAP50, respectively, likely due to the transformer’s ability to capture complex contextual relationships between healthy trees and their surroundings. Across all tests, Log RGB consistently produces the best results for all models (with average mAP50 of 37.8% for DDQ, 45.7% for CO-DETR, and 44.3% for YOLO11), indicating that this color space better preserves the distinguishing features critical for forest element detection. The performance difference between models can be attributed to their fundamental architectural differences—YOLO’s single-shot detection approach excels at objects with distinct visual boundaries, while transformer-based models like CO-DETR better handle objects requiring more contextual understanding.
Now, let us explore more details across all classes and color spaces.
Table 5 and Table 6, along with Figure 7, Figure 8, Figure 9 and Figure 10, clearly show that Log RGB achieves the best performance in both mAP50 and mAR. Linear RGB comes in second, with both outperforming sRGB. In contrast, the XYZ, LMS, and D-Log color spaces do not improve upon sRGB. Using sRGB as the benchmark, Log RGB improves mAP50 by 7% for DDQ, 57% for CO-DETR, and 22.7% for YOLO11, while for mAR the gains are 7.6% for DDQ, 55% for CO-DETR, and 46.13% for YOLO11.
The superior performance of Linear RGB and especially Log RGB can be attributed to several key factors. Linear RGB preserves the actual physical interactions of light with surfaces, capturing subtle variations in forest-related scenes. Moreover, the Log RGB transformation offers a consistent representation of features under different lighting conditions. Its logarithmic transformation converts multiplicative illumination effects into additive ones, making it easier for neural networks to learn invariant features regardless of natural lighting variations. In addition, Log RGB compresses high-intensity values while keeping detail in dark areas, which improves the signal-to-noise ratio. This benefit is crucial for a wildfire prediction framework based on forest fuel mapping across different seasons.
A deeper technical analysis reveals several reasons why Log RGB and Linear RGB outperform sRGB. First, Log RGB preserves a significantly greater dynamic range by compressing high-intensity values while maintaining details in dark regions. This compression mitigates the saturation effects often observed in sRGB images, which are limited by their 8-bit representation, and allows the network to access a richer spectrum of information. Second, the logarithmic transformation in Log RGB effectively converts multiplicative illumination variations into additive differences. This conversion enhances illumination invariance, enabling the model to focus on intrinsic object features rather than being misled by variable lighting conditions. Third, Linear RGB maintains the true linearity of the captured light data, preserving subtle spectral variations that are otherwise lost during gamma correction in sRGB. This linear representation ensures better retention of fine environmental details, such as texture and edge information, which are critical for distinguishing between different tree conditions. Lastly, by minimizing quantization errors and reducing noise through higher bit-depth processing, both Log RGB and Linear RGB provide more robust input data for the vision transformer models. Together, these factors lead to improved feature extraction and, consequently, higher detection accuracy.
The improvements in mAP50 and mAR with Log RGB are particularly evident in underrepresented classes (e.g., Beetle Fire Tree and Debris). Although sRGB performs well for classes with abundant samples, the extra detail provided by Linear RGB and Log RGB makes them more effective when sample numbers are low. Finally, the balance between high detection accuracy and computational demand, as highlighted by the efficiency difference between DDQ, CO-DETR, and YOLO11, indicates that further model optimization is needed to enhance performance while reducing resource use.
The experimental results also show that the models trained on XYZ and LMS perform worse than those trained on sRGB. This is likely because sRGB images, stored as 8-bit integers, are both computationally efficient and memory-friendly. In contrast, XYZ and LMS require floating-point precision to maintain their linearity and wide gamut, which increases computational overhead and can introduce noise during conversion. Similarly, the D-Log space performs the poorest, likely due to its limited standardization and tooling compared to more established log formats. However, YOLO11 still reaches 0.4 mAP50 when trained on the D-Log dataset, which could indicate potential advantages of D-Log for certain architectures.
In terms of computational complexity, Table 7 shows that both the DDQ and CO-DETR models have high FLOPs due to their transformer backbones. However, CO-DETR is nearly twice as efficient as DDQ. Training takes about 351 s per epoch for DDQ versus 280 s for CO-DETR, and the per-image prediction time is approximately 0.294 s for DDQ and 0.206 s for CO-DETR. Although these speeds are acceptable for offline processing, they are not fast enough for real-time detection. YOLO11 has a traditional CNN architecture, which is much more efficient than DDQ and CO-DETR; it achieves a high inference speed of 8 ms per image and shows potential for real-time applications.
Experimental results show that training with Log RGB leads to a clear improvement in both mAP50 and mAR across all three models—DDQ, CO-DETR, and YOLO11. This improvement is largely due to the ability of Log RGB to capture fine details that are critical to detecting less frequent classes such as Beetle Fire Tree and Debris. These classes benefit from the extra detail provided by both Log RGB and Linear RGB, making them far more effective than when using the standard sRGB. Moreover, while other color spaces like XYZ, LMS, and D-Log were tested, they did not show any improvement over sRGB. This underlines the importance of selecting the appropriate color space, in this case, Log RGB, to achieve better detection performance while balancing accuracy and efficiency.
7. Discussion
Our findings indicate that the use of Log RGB significantly improves the detection of subtle environmental changes critical to wildfire risk assessment. The enhanced dynamic range and superior illumination invariance of Log RGB allow more accurate identification of tree stress indicators, which are essential for early warning systems in forest management.
From a practical standpoint, these results suggest that integrating Log RGB processing into existing UAV systems is feasible and can lead to substantial improvements in monitoring capabilities. With minor software updates, current drone platforms can convert raw imagery into Log RGB format, enabling real-time or near-real-time data collection and analysis without the need for extensive hardware modifications. This flexibility holds particular promise for agencies and organizations looking to optimize forest health monitoring and fire prevention efforts with minimal additional investment.
Moreover, the improved performance observed with Log RGB may be beneficial beyond the assessment of the risk of wildfire. The methodology could be extended to other environmental monitoring applications where preserving subtle image details is crucial, such as precision agriculture, invasive species detection, and habitat monitoring for conservation purposes. For example, this approach can help identify early-stage disease symptoms in crops or forests, improve resource allocation, and reduce economic losses. Furthermore, Log RGB’s ability to capture richer spectral and dynamic range information may support post-disaster damage assessments, including floods or storms, where accurately identifying structural and vegetative damage is vital. Overall, the practical implications of our work underscore the value of advanced color space optimization in developing robust and efficient UAV-based monitoring systems. Through the targeted integration of Log RGB, professionals in diverse sectors can gain deeper insight, respond more effectively to environmental challenges, and better protect both natural and human-managed ecosystems, regardless of environmental conditions, such as seasons of the year, light amount, or noise ratios.
8. Conclusions and Future Work
In conclusion, our research shows that the Linear RGB and Log RGB color spaces can improve the accuracy of object detection for wildfire risk assessment. In particular, Log RGB yielded a substantial improvement, increasing mAP50 by 27.16% and mAR by 34.44% compared to the conventional sRGB color space.
Our experiments further revealed that alternative color spaces such as XYZ, LMS, and D-Log did not perform as well. This underperformance can be attributed to several technical factors. The XYZ and LMS color spaces, while designed to mimic human visual perception, require precise floating-point representations that can introduce additional noise and computational overhead, leading to potential information loss during transformation. Similarly, the D-Log conversion process is less standardized, which may result in inadequate preservation of fine image details. In contrast, the Log RGB transformation efficiently preserves the dynamic range and subtle contrast variations, allowing the model to detect nuanced changes in tree conditions more effectively.
The findings of this research not only emphasize the importance of considering alternative color spaces for environmental monitoring, but also highlight the potential for transformative advances in automated wildfire risk assessment through the synergy of UAV technology, machine learning, and spectral diversity analysis. By bridging the gap between raw spectral data and actionable insights, our study contributes to ongoing efforts to manage wildfires and preserve ecosystems. It is important to note that our study does not address active fire detection, but, rather, concentrates on assessing fuel conditions and flammability—a research direction that has not been explored previously with Log RGB optimizations.
Looking ahead, several important directions remain for future research. First, larger and more geographically diverse datasets would help validate the broader applicability of our findings and ensure robust performance across different forest ecosystems. Second, the methods presented here could be extended to other natural disaster risk assessments, such as identifying early signs of drought stress, flood damage, or storm-induced tree falls, where high-dynamic-range imaging can reveal subtle environmental cues. Third, testing our approach under varying sensor types and different UAV platforms could further establish its practicality for large-scale operational deployments. Finally, integrating segmentation techniques or multimodal data (e.g., thermal or LiDAR) in conjunction with these optimized color spaces can improve the overall predictive power, opening new possibilities for real-time forest health monitoring and early intervention strategies.