Article

Military Image Captioning for Low-Altitude UAV or UGV Perspectives

1 School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
2 Science and Technology on Electromechanical Dynamic Control Laboratory, Xi’an 710065, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(9), 421; https://doi.org/10.3390/drones8090421
Submission received: 22 July 2024 / Revised: 19 August 2024 / Accepted: 21 August 2024 / Published: 24 August 2024

Abstract

Low-altitude unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs), which boast high-resolution imaging and agile maneuvering capabilities, are widely utilized in military scenarios and generate a vast amount of image data that can be leveraged for textual intelligence generation to support military decision making. Military image captioning (MilitIC), as a visual-language learning task, provides innovative solutions for military image understanding and intelligence generation. However, the scarcity of military image datasets hinders the advancement of MilitIC methods, especially those based on deep learning. To overcome this limitation, we introduce an open-access benchmark dataset termed the Military Objects in Real Combat (MOCO) dataset. It features real combat images captured from the perspective of low-altitude UAVs or UGVs, along with a comprehensive set of captions. Furthermore, we propose a novel encoder–augmentation–decoder image-captioning architecture with a map augmentation embedding (MAE) mechanism, MAE-MilitIC, which leverages both image and text modalities as a guiding prefix for caption generation and bridges the semantic gap between visual and textual data. The MAE mechanism maps both image and text embeddings onto a semantic subspace constructed by relevant military prompts, and augments the military semantics of the image embeddings with attribute-explicit text embeddings. Finally, we demonstrate through extensive experiments that MAE-MilitIC surpasses existing models in performance on two challenging datasets, which provides strong support for intelligence warfare based on military UAVs and UGVs.

1. Introduction

On the battlefield, low-altitude UAVs and UGVs are frequently employed in combat surveillance, targeted patrols, and strategic reconnaissance missions [1] to ensure swift responses to potential threats. Image data generated by UAVs and UGVs provide tactical information upon which critical decisions and actions are taken, especially for situational awareness and ISR (intelligence, surveillance, and reconnaissance)-type applications [2]. As a vital pathway to achieving situational awareness, the common operational picture (COP) can be delineated using scene descriptions and reference terms familiar to the commander [3]. Images that have been pre-tagged with relevant keywords or sentences can provide an advantage for the timely sharing of scene information in a COP [4]. Therefore, utilizing image captioning for the sentence-level pre-tagging of military images is a feasible solution to enhance the comprehension of military intelligence.
Image captioning is a multimodal task that generates descriptive sentences for images, which has evolved from traditional retrieval-based and template-based methods to recent deep learning methods. As indicated in Figure 1, image captioning based on deep learning [5,6,7,8] mainly follows an encoder–decoder architecture, where an input image is processed by a visual encoder, and a caption is generated by an autoregressive language decoder. Then, considering a zero-shot and text-only training setup, some works [9,10,11,12,13,14] extend the encoder with a vision–language (VL) model to obtain an encoder (VL)–decoder architecture. Military image captioning (MilitIC), as an applied branch of image captioning, focuses on generating appropriate descriptive sentences for military-related images. The MilitIC task involves extracting image features, which include objects, environments, and their interactions, and converting this information into coherent natural language sentences to support military analysis and intelligence decryption. In MilitIC, the related works [15,16,17] transfer generic image-captioning methods to the military domain. They also follow the vanilla encoder–decoder architecture to achieve promising performance.
However, there remain huge challenges in generating precise semantic descriptions from military imagery with complex and diverse military scenarios. In a brief investigation, the following factors are particularly prominent:
  • Dataset scarcity: There is no established benchmark dataset in the field of MilitIC. The datasets from military target detection and camouflaged object detection are not readily applicable to this field.
  • Deficiency of a specific method: There is a scarcity of methods specifically proposed for MilitIC tasks that can effectively highlight military semantics in caption generation.
To address these challenges, we created a MilitIC benchmark dataset MOCO and put forward a novel encoder–augmentation–decoder architecture with the MAE mechanism named MAE-MilitIC. Figure 1c and Figure 2 showcase our dataset and novel architecture, respectively. MAE-MilitIC leverages both image and text modalities as a guiding prefix for caption generation, in contrast to the other two architectures, which utilize only a single modality. Rather than directly guiding caption generation by an initial image or text embedding in the VL joint space, we propose an MAE mechanism to build a semantic subspace, with the aim of closing the modality gap between the image and text and enhancing the military semantics of the image through explicit military text. For the MAE mechanism in Figure 1d, we first mapped the embeddings of the input image onto a semantic subspace constructed by relevant texts that described the military content that was to be enhanced and recorded a residual vector. Subsequently, this mapped embedding in the semantic subspace was augmented with explicit military text, with a user-specified augmenting power to control the intensity of the augmentation. Finally, the residual was added back to reconstruct a vector close to the “image region” of the VL joint space. It was demonstrated that the embeddings constructed by the MAE served as better guidance for precise caption generation.
In short, we highlight the three major contributions of this paper:
  • We created the first open-access MilitIC benchmark dataset, named the Military Objects Dataset in Real Combat (MOCO), which not only includes real and complex military scenes but also provides a vast number of rich captions. MOCO is available at https://github.com/Panlizhi/MOCO (accessed on 1 August 2024).
  • We propose a novel encoder–augmentation–decoder architecture, MAE-MilitIC, to leverage image–text modalities to guide caption generation, and introduce the MAE mechanism to augment the image embeddings with explicit military text in the semantic subspace.
  • Extensive qualitative and quantitative experiments on two challenging datasets demonstrated that the proposed MAE-MilitIC outperformed existing state-of-the-art image-captioning models.

2. Related Works

Benefiting from the rapid advancement of artificial intelligence and deep learning technology, significant progress has been made in image captioning. In the following sections, we separately delve into the recent advancements in captioning for natural images and military images. Furthermore, we provide an overview of military image datasets.

2.1. Image Captioning

Image captioning is the task of comprehending the image’s content in terms of objects, attributes, and relationships, and expressing the represented semantic content using appropriately formulated sentences.
(1) Traditional methods: The early works on image captioning used retrieval-based methods and template-based methods with hand-crafted natural language generation techniques. Retrieval-based methods [18,19,20,21,22] utilize sentences generated from similar images to describe the query image. These methods involve building a candidate sentence bank, followed by retrieving corresponding sentences based on similarity calculations between the query image and the candidate sentences. Template-based methods [23,24,25,26,27,28,29,30] generate descriptive sentences for a given image through a two-step process of object detection and template filling. These methods match the various extracted features to pre-defined template slots, thereby creating a concise description for the image.
(2) Deep learning methods: Deep learning-based image captioning mainly follows an encoder–decoder architecture, where an input image is processed by a visual encoder, and a caption is generated by an autoregressive language decoder. Pioneering works [31,32,33,34] employed CNN and RNN/LSTM architectures to extract features and generate descriptions. Additionally, transformers were applied in encoder/decoder methods to enhance the algorithm’s ability for long-range modeling and high parallel computation [5,6,7,8].
Recently, progress in large vision-and-language pre-trained models (such as SimVLM [35] and CLIP [36]) has improved image-captioning works. The pre-training paradigm enables models to be trained on vast image–text pairs and learn a generic representation. Some studies explored the pre-trained model CLIP in image-captioning tasks due to its strong ability to extract and align features from images and text. As a supervised method, ClipCap [37] produces a fixed-length prefix for each caption by mapping CLIP visual embeddings into the space of a generative language model. The training-free methods ZeroCap [9] and MAGIC [10] leverage the excellent visual–text alignment of CLIP to guide the language model toward a desired visual direction and make the caption semantically related to a given image. Some of the text-only training methods, such as CapDec [11], DeCap [12], and ViECap [38], map visual features to text features and generate accurate captions with fewer image data. Similarly, CLOSE [13] proposed the task of zero-shot cross-modal transfer, where semantics are learned from textual data and transferred to visual data based on the CLIP alignment space. In addition, DeCap [12], SmallCap [39], and MeaCap [14] use CLIP to build a memory bank, retrieve relevant captions, and combine them for more effective generation prompts. RLHMN [40] and IVRC [41] emphasize hierarchical multi-granularity semantic alignment between the visual and text modalities, and highlight caption refinement strategies.

2.2. Military Image Captioning (MilitIC)

The three primary application areas of image captioning are general-purpose image captioning, medical image captioning, and remote sensing image captioning. Research on image-captioning tasks specifically tailored for the military domain is relatively scarce. Military image captioning focuses on generating appropriate descriptive sentences for military-related images and provides strong support for military intelligence generation. Different from existing military object detection methods, which merely capture the location and class of objects, MilitIC is able to produce accurate descriptions of images, including objects, environments, and their interactions. Information on the interaction between military environments and military targets is crucial for situational awareness tasks. The basic application of MilitIC is the automatic generation of intelligence from image data. More advanced applications involve using UAV images for battlefield reconnaissance by adhering to certain human-defined alarm rules for early attack warnings, and even employing image–text bimodality for image–text intelligence retrieval. These applications have significantly enhanced the capabilities of intelligence processing and decision support in military operations.
To our knowledge, the research [15] is the first to propose military image captioning based on deep learning. It presents a proof-of-concept demonstration following the show-and-tell model [32] and extends the Flickr8K dataset [42] with military-specific images to ensure relevance to the Department of Defense domain. The study [16] explored the benefits of using image captioning to support military decision making. It utilized an encoder–decoder model inspired by [32] and trained it with the COCO dataset [43], which is non-military-specific. The work [17] presents a concept that extracts entities and attributes from military multi-modal data in the Visual Genome dataset [44], converts them into knowledge graphs, and reconstructs contextual phrases. Notably, a video-captioning model for the military domain is proposed in [45], which utilized the existing method I3D [46] to extract entity and motion features from a given video, and then employed a GloVe-Transformer [47,48] generating model to generate video captions.

2.3. Military Image Dataset

The size, quality, and availability of datasets have a significant impact on deep learning methods. In the military domain, due to the sensitivity of equipment information, the overall quality of datasets is often not high. This section will provide an overview of the typical military image datasets, which can be seen in Table 1.
The Ground Military Target Dataset (GMTD) [49] is a ground military vehicle dataset for object detection tasks. The Small Firearms Dataset (SFD) [50] includes various small firearms, such as pistols and rifles. This dataset has been collected for the purpose of using small unmanned aerial systems to search for abandoned small firearms. The work [51] newly added camouflaged soldier images in different environments to the COD10K+ dataset. The research [52] provided a Military High-Level Camouflage Object Detection (MHCD) dataset and aimed to identify objects that are deeply embedded in their backgrounds. The Military Aircraft Detection Dataset (MADD) [53], which was designed for object detection, includes 49 distinct military aircraft types, with certain types and their variants grouped into a single class.
Our work provided a real combat dataset called MOCO, which includes 7449 images with 37,245 captions, and has the advantage of offering a rich set of captions for real military combat scenarios. This dataset is particularly beneficial for research on military image captioning, as it provides a large and diverse collection of images accompanied by detailed captions. The open-access nature of the MOCO dataset also ensures that it can be widely utilized by the research community, thus fostering innovation and collaboration in the field of military image analysis.
Considering the size and availability of datasets, this work utilized the MADD and MOCO as benchmark datasets for performance validation.

3. MOCO: A Military Objects Dataset in Real Combat for Military Image Captioning

The deficiency of existing benchmark datasets in terms of the image scale, scene type, and descriptive diversity limits the advancement of military captioning approaches. In order to explore military image captioning for UAVs, we collected a large-scale military target dataset called MOCO. We emphasized images with a low-altitude UAV perspective for their rich detail in capturing military personnel and equipment movements, in contrast with the vague information in high-altitude captures. In close combat, the imaging range and usage scenarios of UAVs and UGVs tend to be similar, so this work also collects images from UGV perspectives to provide richer scene information for the captioning method.
Table 1 comprehensively compares existing benchmark datasets and MOCO in terms of the category, scale, task applicability, and availability. Some typical images and captions shown in Figure 2 include military personnel (soldier, combat engineer), combat equipment (tank, boat, UAV, UGV), and the battlefield environment (trench, anti-tank obstacle). Figure 3a illustrates that the MOCO dataset contains images of various sizes ranging from 224 × 224 to 1024 × 1024 pixels, as well as high-resolution 2K images at 1440 × 1440 pixels.
The MOCO dataset not only provides a variety of real combat images but also offers more professional military descriptions. As presented in Figure 3b, high-frequency words in the word cloud include descriptions of military equipment, personnel, and the environment, where a rich vocabulary is provided by extensive professional human captions. The captions can precisely capture the state characteristics of military targets—whether they are moving, parked, or standing—and illustrate key interactions between objects, including firing and operating. This enables more detailed descriptions of military scenarios and allows for fine-grained battlefield perception.

4. MAE-MilitIC: Map Augmentation Embedding to Enhance Semantics for Military Image Captioning

We introduce a novel encoder–augmentation–decoder structure, MAE-MilitIC, which focuses on the inaccurate description issues and semantic augmentation methods in specific tasks, such as military image captioning. The general framework of our MAE-MilitIC is illustrated in Figure 4. Adopting ClipCap [37] as the baseline, which has an encoder–decoder structure (shown in Figure 1a), we introduce an MAE mechanism to enhance the semantics of image embeddings and improve the performance of image captioning.
In MAE, we have three main objectives to fulfill: (i) the augmented image embeddings should act as a more precise guide for the image-captioning task compared with the original one (see Equation (5)); (ii) the augmented image embeddings should be more aligned to the attribute-explicit text in a semantic subspace (see Equation (6)); (iii) the augmented image embeddings should ensure modal consistency (see Equation (10)).

4.1. Backbone of MAE-MilitIC

The general framework of our MAE-MilitIC is illustrated in Figure 4. Given an input image I and a text caption T, our goal is to learn the generation of a meaningful caption for an input image, which can be formulated as a conditional probability optimization:
$\max_{\theta} \log p_{\theta}(T \mid I)$ (1)
where θ denotes the model’s trainable parameters.
I and T are fed into the image encoder $Enc_{img}$ and text encoder $Enc_{txt}$ to obtain the image embeddings $e_I$ and the text embeddings $e_T$ in the VL joint space $\mathcal{V}$, respectively. $\mathcal{V}$ is established through large-scale VL pre-training, which enables the preliminary alignment of image and text. Furthermore, we propose the MAE mechanism to construct a semantic subspace $\mathcal{W}$ to bridge the modality gap and enhance the semantics of the image embeddings for a specific task. The details of the MAE mechanism are provided in the following section.
The text decoder $Dec_{txt}$ includes a generative language model for translating embeddings into text captions. Following recent works [37,54], we considered the augmented image embeddings as a prefix for the input of the generative language model. Therefore, the generating process with a prefix–text concatenation can be formalized as follows:
$\max_{\theta} \log p_{\theta}(T \mid e_I) \;\Rightarrow\; \max_{\theta} \log p_{\theta}(\hat{T} \mid (e_I, T))$ (2)
where $\hat{T}$ is the predicted text and $(e_I, T)$ represents the concatenation operation of the prefix embeddings and the text embeddings. Since the essential image semantic information is encapsulated in the prefix embedding, the language model can utilize it to predict the next token without considering future tokens. Thus, the generating process of $Dec_{txt}$ can be modified to
$\max_{\theta} \log p_{\theta}\big(T_k \mid (e_I, [T_i]_{i<k})\big)$ (3)
where $T_k$ is the $k$-th word of the predicted caption.
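To make the prefix-conditioned generation in Equation (3) concrete, the following minimal sketch assumes a HuggingFace GPT-2 decoder and a placeholder prefix tensor standing in for the mapped image embeddings; the prefix length, random prefix, and variable names are illustrative rather than the authors' exact implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

# Placeholder prefix standing in for MAE(Enc_img(I)) after projection to the GPT-2 width.
batch_size, prefix_len, embed_dim = 1, 10, gpt2.config.n_embd
prefix = torch.randn(batch_size, prefix_len, embed_dim)

caption = "A tank is moving across the battlefield."
token_ids = tokenizer(caption, return_tensors="pt").input_ids      # (1, seq_len)
token_embeds = gpt2.transformer.wte(token_ids)                     # (1, seq_len, embed_dim)

# Concatenate prefix and caption embeddings; each caption token is predicted from the
# prefix and the preceding tokens, as in Equation (3).
inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
logits = gpt2(inputs_embeds=inputs_embeds).logits                  # (1, prefix_len + seq_len, vocab)
next_token_logits = logits[:, prefix_len - 1:-1, :]                # logits aligned with the caption tokens
```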
In order to enhance the image semantics in $\mathcal{W}$, the MAE module is incorporated into Equation (3). The specific generating process of $Dec_{txt}$ can then be formalized as follows:
$\hat{T}_k = Dec_{txt}\big(MAE(Enc_{img}(I)), [T_i]_{i<k}\big)$ (4)
where $MAE(\cdot)$ denotes the MAE mechanism. We constructed a generation loss $\mathcal{L}_{GEN}$ to guide the generating process of $Dec_{txt}$:
$\mathcal{L}_{GEN} = \sum_{k=1}^{n} CE(\hat{T}_k, T_k)$ (5)
where $T_k$ is the true caption token and $CE(\cdot,\cdot)$ corresponds to the cross-entropy.
The optimization goal of MAE-MilitIC includes two parts: one is to maintain the similarity between the generated texts $\hat{T}$ and the true texts $T$, as shown in Equation (5); the other is to close the gap between the generated text embeddings and the augmented image embeddings in the semantic subspace $\mathcal{W}$, which is expressed as a mapping augmentation loss $\mathcal{L}_{MAE}$:
$\mathcal{L}_{MAE} = \mathrm{CosSim}\big(\mathrm{txt2emb}(\hat{T}), A(m_I)\big)$ (6)
where $\mathrm{CosSim}(\cdot,\cdot)$ is the cosine similarity and $A(m_I)$ is the augmented image embedding produced by the MAE. $\mathrm{txt2emb}(\cdot)$ represents the text-to-embedding conversion operation.
Considering caption generation and semantic augmentation, the total loss function can be formulated as
$\mathcal{L} = (1-\lambda)\,\mathcal{L}_{GEN} - \lambda\,\mathcal{L}_{MAE}$ (7)
where $\lambda$ is a trade-off parameter.
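A minimal sketch of this combined objective in Equations (5)–(7), assuming the decoder logits over the caption tokens, the embedding of the generated caption ($\mathrm{txt2emb}(\hat{T})$), and the augmented image embedding $A(m_I)$ are already available; the function and argument names are our own, and the sign on the similarity term follows the reconstruction of Equation (7) above.

```python
import torch
import torch.nn.functional as F

def total_loss(caption_logits, caption_ids, gen_caption_emb, augmented_img_emb, lam=0.1):
    """Sketch of L = (1 - lambda) * L_GEN - lambda * L_MAE.

    caption_logits:    (batch, seq_len, vocab) decoder logits for the caption tokens
    caption_ids:       (batch, seq_len) ground-truth token ids
    gen_caption_emb:   (batch, dim) embedding of the generated caption, txt2emb(T_hat)
    augmented_img_emb: (batch, dim) augmented image embedding A(m_I)
    """
    # Token-level cross entropy over the caption tokens (Equation (5)).
    l_gen = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_ids.reshape(-1),
    )
    # Cosine similarity between generated-text and augmented image embeddings (Equation (6)).
    l_mae = F.cosine_similarity(gen_caption_emb, augmented_img_emb, dim=-1).mean()
    # Minimizing the total loss maximizes the similarity term (Equation (7)).
    return (1.0 - lam) * l_gen - lam * l_mae
```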

4.2. Map Augmentation Embedding (MAE)

On the basis of the pre-aligned VL joint space $\mathcal{V}$, an MAE mechanism is proposed to enhance the semantics of image embeddings for a specific task, such as a (military) image-captioning task. We constructed the semantic subspace $\mathcal{W}$ using relevant texts that describe the semantics we aimed to disentangle and enhance, mapped $e_I$ and $e_T$ onto $\mathcal{W}$ to acquire the mapped image and text embeddings $m_I$ and $m_T$, and then augmented $m_I$ with the semantic information contained in $m_T$.

4.2.1. Semantic Subspace W

The works [55,56,57] discovered that since the VL joint space is a vector space, an attribute subspace can be constructed by using a set of relevant text prompts as basis vectors, and if an attribute on the image embedding changes, the corresponding semantics in the subspace will be changed. Inspired by this finding, we considered that the disentangled military attributes from text in the subspace can be used to augment the semantics of image embeddings, and the augmented image embeddings will guide the language model to generate more precise captions that encompass specific military knowledge. The results illustrated in Figure 5 and Figure 6 confirmed our proposed viewpoint.
From the MOCO dataset in Figure 2, one can see that the military image primarily consisted of single or multiple elements of military personnel, combat equipment, and the battlefield environment. Thus, we selected basic text prompts from these three elements, which are given in Section 5.2.
After selecting $N = 100$ basic text prompts, we obtained $N$ basis vectors $b_n$ through $Enc_{txt}$ to form a set of basis vectors $B = \{b_n\}_{n=1}^{N}$. We introduced dimension-reduction techniques to find the basis semantic vectors for $\mathcal{W}$, as the subspace was a semantically meaningful structure. For example, if we aimed to augment the information regarding the battlefield environment, we could use text embeddings from a set of base environmental elements, such as smoke, clouds, rain, mist, grassland, and desert, as the basis semantic vectors.
In order to preserve more of the basis vectors and eliminate ambiguous semantic information in them, we employed two dimension-reduction techniques to construct the semantic subspace: Gram–Schmidt orthogonalization (GSO) and principal component analysis (PCA). GSO applies the Gram–Schmidt process to obtain an orthonormal basis $\hat{b}_n$, yielding a set of basis vectors $\hat{B}_{GSO} = \{\hat{b}_n\}_{n=1}^{N}$ that constructs the semantic subspace $\mathcal{W}_{GSO}$. PCA extracts $\hat{N} < N$ principal components as the basis set $\hat{B}_{PCA} = \{\hat{b}_n\}_{n=1}^{\hat{N}}$, which constructs the semantic subspace $\mathcal{W}_{PCA}$. In effect, the space spanned by $\hat{B}_{GSO}$ or $\hat{B}_{PCA}$ approximates the subspace $\mathcal{W}$ that encompasses all the related texts in the target corpus, such as a military corpus.
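The two subspace constructions can be sketched as follows, assuming the $N$ prompt embeddings produced by $Enc_{txt}$ are stacked into a tensor of shape (N, dim); using QR factorization as a Gram–Schmidt equivalent and torch.pca_lowrank for PCA are our own implementation choices, not necessarily those of the original work.

```python
import torch

def build_subspace_basis(prompt_embeddings: torch.Tensor, method: str = "gso", n_components: int = 10):
    """Build an orthonormal basis for the semantic subspace W from N prompt embeddings (N, dim)."""
    if method == "gso":
        # Reduced QR yields the same orthonormal basis as Gram-Schmidt applied to the prompts.
        q, _ = torch.linalg.qr(prompt_embeddings.T)   # (dim, N)
        return q.T                                    # (N, dim): B_hat_GSO
    elif method == "pca":
        # PCA keeps N_hat < N principal directions (centering is applied internally).
        _, _, v = torch.pca_lowrank(prompt_embeddings, q=n_components)
        return v.T                                    # (n_components, dim): B_hat_PCA
    raise ValueError(f"unknown method: {method}")
```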

4.2.2. Mapping Step M

After constructing the semantic subspace $\mathcal{W}$, we mapped $e_I$ and $e_T$ onto $\mathcal{W}$ to acquire the mapped image and text embeddings $m_I$ and $m_T$:
$m_I = M_{\mathcal{V}\to\mathcal{W}}(e_I), \quad m_T = M_{\mathcal{V}\to\mathcal{W}}(e_T)$ (8)
where $M_{\mathcal{V}\to\mathcal{W}}$ is the mapping operation from $\mathcal{V}$ to $\mathcal{W}$. In the subspace, the original vector can be represented as a linear combination of the basis vectors. Then, the operation $M_{\mathcal{V}\to\mathcal{W}}$ is achieved by computing dot products with the basis vectors as follows:
$m_I = \sum_{n=1}^{N} (e_I^{\top}\hat{b}_n)\,\hat{b}_n, \quad m_T = \sum_{n=1}^{N} (e_T^{\top}\hat{b}_n)\,\hat{b}_n$ (9)
where $(\cdot)^{\top}$ indicates the transpose operation. It should be noted that when using the PCA method, $N$ in Equation (9) should be replaced with $\hat{N}$.
Furthermore, if too much text information is added to the image embeddings during the augmentation step, the rich semantic information in the image will be drowned out. We therefore recorded the residual $r$ between $e_I$ and $m_I$:
$r = e_I - m_I$ (10)
which was added back to the augmented image embeddings $A(m_I)$ to ensure that the result stayed close to the image region rather than the text region.
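A minimal sketch of the mapping step and residual (Equations (9) and (10)), assuming a single embedding vector and an orthonormal basis of shape (N, dim) from the previous step; the function name is illustrative.

```python
import torch

def map_to_subspace(e: torch.Tensor, basis: torch.Tensor):
    """Project an embedding e (dim,) onto the subspace spanned by the rows of basis (N, dim)."""
    coeffs = basis @ e        # dot products e^T b_n for each basis vector
    m = coeffs @ basis        # sum_n (e^T b_n) b_n, the mapped embedding (Equation (9))
    r = e - m                 # residual of Equation (10)
    return m, r
```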

4.2.3. Augmentation Step A

To enable the MAE to incorporate attribute-explicit information from the mapped text $m_T$ such that it can serve well as a guide for the caption-generating process, we enhanced the impact of the intended attributes on $m_I$ while mitigating the influence of unintended attributes on $m_I$.
Mathematically, we first calculated the coefficients $\alpha_n$ of $m_I$ and the coefficients $\beta_n$ of $m_T$, expressed under the basis $\hat{B} = \{\hat{b}_n\}_{n=1}^{\hat{N}}$ in $\mathcal{W}$:
$\alpha_n = m_I^{\top}\hat{b}_n$ (11)
$\beta_n = m_T^{\top}\hat{b}_n$ (12)
Next, to strengthen the influence of the semantic information in $m_T$ on $m_I$, we weakened all the components of the mapped image $m_I$, added back the mapped text $m_T$, and introduced the residual $r$:
$A(m_I) = (1-\varepsilon)\, m_I + \varepsilon\, \dfrac{\sum_{n=1}^{N}\alpha_n}{\sum_{n=1}^{N}\beta_n}\, m_T + r$ (13)
where $\varepsilon \in \mathbb{R}^{+}$ is a hyperparameter that controls the augmenting power. In Equation (13), the first term is used to reduce the influence of both the intended and unintended attributes in $m_I$, the second term refers to adding back the influence of the intended attributes in $m_T$, and the third term ensures modal consistency between the augmented image embeddings $A(m_I)$ and $e_I$. It should be noted that for the second term, we weakened the unintended attributes in $m_T$, which are the components where $m_T$ has a low value, while maintaining the sum of coefficients of $m_I$.
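A minimal sketch of the augmentation step, under the reconstruction of Equation (13) above (in particular, the scaling of $m_T$ by the ratio of coefficient sums is our reading of the formula); names and the default $\varepsilon$ are illustrative.

```python
import torch

def augment(m_img: torch.Tensor, m_txt: torch.Tensor, residual: torch.Tensor,
            basis: torch.Tensor, eps: float = 0.1):
    """Sketch of the augmentation step A(m_I) in Equation (13)."""
    alpha = basis @ m_img          # coefficients of the mapped image (Equation (11))
    beta = basis @ m_txt           # coefficients of the mapped text (Equation (12))
    # Weaken all image components, add back the mapped text scaled so that the
    # coefficient mass of m_I is preserved, then restore the residual.
    scale = alpha.sum() / beta.sum()
    return (1.0 - eps) * m_img + eps * scale * m_txt + residual
```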

5. Experiments and Results

5.1. Evaluation Metrics

Evaluating the accuracy of automatically generated image captions is challenging due to the diverse ways an image can be described and the potential variations in meaning even when sharing common words. Similar to a language translation task or text summarization task, image captioning uses adapted language metrics to score the similarity of generated captions to references. The following four metrics are employed in this article for the evaluation of image captioning.
BLEU [58] is a popular metric for machine translation evaluation and one of the first metrics used to evaluate image captions. It computes the geometric mean of n-gram precision scores multiplied by a brevity penalty in order to avoid overly short sentences. The BLEU-N score for a candidate caption against a ground truth sentence is calculated using
$\mathrm{BLEU} = BP \times \exp\left(\sum_{n=1}^{N} w_n \cdot \log p_n\right)$ (14)
where $BP$ is the brevity penalty, which ensures that shorter translations are not unfairly favored; $w_n$ are the weights for each n-gram precision; and $p_n$ is the precision of the n-gram, which measures the degree of match between $n$ consecutive terms (i.e., n-grams) in the candidate text and the corresponding n-grams in the reference text. In this work, $N$ takes the values of 1, 2, 3, and 4. The 1-gram, which is a single word, reflects the accuracy of the individual vocabulary, while higher-order n-grams (where $n > 1$) more significantly reflect the fluency and contextual accuracy of the translation.
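For illustration, BLEU-1 through BLEU-4 for a single candidate–reference pair can be computed with NLTK; this tooling and the example sentences are our own choices, not necessarily those used to produce the reported scores.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a", "tank", "is", "moving", "across", "the", "battlefield"]
candidate = ["a", "tank", "moves", "across", "the", "battlefield"]

smooth = SmoothingFunction().method1  # avoids zero scores when a higher-order n-gram is unmatched
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights w_n for BLEU-n
    score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```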
ROUGE-L [59] is a package of measures initially developed for the evaluation of text summaries. ROUGE-L computes the F-measure based on the longest common subsequence (LCS), i.e., a set of words shared by two sentences that occur in the same order, without requiring consecutive matches.
METEOR [60] is another machine translation metric. It relies on the use of stemmers, WordNet synonyms, and paraphrase tables to identify matches between a candidate sentence and reference sentences.
SPICE [61] is a metric designed for image-captioning evaluation. It measures the quality of generated captions by computing an F-measure based on the propositional semantic content of candidate and reference sentences represented as scene graphs.
The ranges of BLEU, METEOR, ROUGE-L, and SPICE fall between 0 and 1. If the score of the metrics is high, it means that the predicted and the reference sentences marked by humans are highly similar.

5.2. Dataset Settings

We evaluated MAE-MilitIC on two datasets: MOCO and MADD, as detailed in Table 1. The MOCO dataset was split into a training set with 7192 images and 35,960 captions, and a testing set with 257 images and 1285 captions.
In MADD, the labels were customized for military target detection but were not suitable for military image captioning. Therefore, we established caption sentences by building a transformation mechanism: the object name (object.name) was inserted into the prompt template “This image includes [object.name].” For example, if the label of an aircraft image is F/A-18, then the following caption may be constructed: “This image includes an F/A-18 Hornet aircraft.” For MADD, the dataset was divided into two parts: 13,088 images and 13,088 captions for training, and 1000 images and 1000 captions for testing.
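A minimal sketch of this label-to-caption transformation; the label strings are illustrative examples rather than entries copied from MADD.

```python
def label_to_caption(object_name: str) -> str:
    # Insert the detection label into the fixed caption template.
    return f"This image includes {object_name}."

labels = ["an F/A-18 Hornet aircraft", "a B-2 Spirit bomber"]
captions = [label_to_caption(name) for name in labels]
# ['This image includes an F/A-18 Hornet aircraft.', 'This image includes a B-2 Spirit bomber.']
```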
We selected basic text prompts that include three elements: military personnel, combat equipment, and battlefield environment. Some examples of basic text prompts that form the basis of the semantic subspace are shown in Table 2.

5.3. Experimental Settings

We performed experiments on a PC with an Intel(R) Xeon(R) Gold 6149 CPU @ 3.10 GHz (16 cores), 24 GB RAM, and an NVIDIA V100 GPU. The operating system and deep learning platform were Ubuntu 20.04 and PyTorch 1.10.0, respectively. The configuration settings are as follows.
For the image encoder $Enc_{img}$ and text encoder $Enc_{txt}$, we employed the corresponding encoders (RN50×4) from CLIP [36], which were pre-trained on 400 million image–text pairs. For the text decoder $Dec_{txt}$, we used a transformer block combined with a pre-trained language model, following the work [37]. The transformer block included eight multi-head self-attention layers, each with eight heads. The pre-trained language model was GPT-2 (Generative Pre-trained Transformer 2), following the implementation of Wolf et al. [62]. To preserve the text generation capabilities that the pre-trained language model learned from large-scale data, our work froze its parameters and left only the transformer block as the trainable component.
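A minimal sketch of this setup, assuming the OpenAI CLIP package and HuggingFace Transformers; the image path, example prompt, and the omission of the trainable mapping block are illustrative simplifications.

```python
import clip                      # OpenAI CLIP package
import torch
from PIL import Image
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP RN50x4 provides both the image encoder Enc_img and the text encoder Enc_txt.
clip_model, preprocess = clip.load("RN50x4", device=device)

# Frozen GPT-2 decoder; only the mapping transformer block (not shown here) stays trainable.
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
for p in gpt2.parameters():
    p.requires_grad_(False)

# Example encoding; the image path and prompt are placeholders.
img = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    e_img = clip_model.encode_image(img)                                              # e_I
    e_txt = clip_model.encode_text(clip.tokenize(["a tank in a trench"]).to(device))  # e_T
```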
In the training process, we trained for 10 epochs using a batch size of 40. For the optimization, we used AdamW [63] with a weight decay fix, as introduced by Loshchilov et al. [64], with a learning rate of $2 \times 10^{-5}$ and 5000 warm-up steps. The augmenting power $\varepsilon$ was 0.1 and the loss weight coefficient $\lambda$ was 0.1. In the testing process, the beam-search algorithm was adopted for sentence generation, with the beam size set to 5. Finally, the length of the generated sentences was limited to no more than 67 words.
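A minimal sketch of the training configuration described above, assuming hypothetical objects mapping_block, train_loader, and compute_total_loss; the linear warm-up scheduler is our own choice for realizing the 5000 warm-up steps.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Hypothetical objects: `mapping_block` is the trainable transformer between CLIP and GPT-2,
# `train_loader` yields batches of size 40, and `compute_total_loss` implements Equation (7).
optimizer = torch.optim.AdamW(mapping_block.parameters(), lr=2e-5)
num_training_steps = len(train_loader) * 10                       # 10 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=5000, num_training_steps=num_training_steps
)

for epoch in range(10):
    for batch in train_loader:
        loss = compute_total_loss(batch, lam=0.1, eps=0.1)        # lambda = 0.1, epsilon = 0.1
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```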

5.4. Experimental Results

Considering that the previous MilitIC methods were not open source and had limited training data, we employed general image-captioning methods for comparison with our approach. As two typical methods for the encoder–decoder and encoder (VL)–decoder architectures, ClipCap [37] and CapDec [11] show good performance in general image captioning. In the experiments, we compared MAE-MilitIC with ClipCap and CapDec. These methods first produce visual features with CLIP, which is pre-trained over millions of image–text pairs, and then utilize a large language model to generate the captions. ClipCap employs the CLIP visual features as a guide for caption generation. CapDec utilizes the single text modality and introduces a text-only training setup during training. However, the single-modality approach is not optimal and fails to leverage the complementary nature of the two modalities for mutual enhancement. Thus, we introduced MAE-MilitIC to augment the semantics of image embeddings with text embeddings in the semantic subspace, which provides better guidance for caption generation.
Table 3 shows the quantitative results of the three methods on MOCO. The GSO variant of MAE-MilitIC achieved the best results, while the PCA variant yielded the second-best results. As a method of using images for guidance, ClipCap achieved a moderate performance, and as a method of training using text, CapDec exhibited poor captioning performance. The reason was that the last two methods failed to overcome the inherent modality gap present in the VL joint space. Furthermore, as shown in Figure 5, we collected 200 images and 200 textual descriptions from the MOCO test data and visualized their embeddings in the VL joint space and semantic subspace. As shown in Figure 5a, the image and text embeddings in the VL joint space almost lie in two non-overlapping regions and exhibit a certain modal gap. In contrast, Figure 5b,c illustrate that the augmentation embeddings, as a better substitute for image embeddings, can be closer to the attribute-explicit text embeddings in the semantic subspace. Therefore, the use of both image and text data for aligning military semantics is necessary, and the aforementioned results also prove this point.
The quantitative results of MAE-MilitIC and the baselines on MADD are shown in Table 4. As can be seen, MAE-MilitIC demonstrated an excellent overall caption performance. We note that CapDec obtained much better results on the ROUGE-L metric. The reason was that the caption template in MADD contained the unified subsequence “This image includes”, while ROUGE-L heavily focuses on the longest common subsequence (LCS), which artificially inflated CapDec’s ROUGE-L score. Therefore, a single inflated metric does not necessarily equate to high-quality captions. The two variants of MAE-MilitIC achieved the best caption generation performance on most metrics. Figure 6 demonstrates that our MAE mechanism could further align the image and text modalities in the military semantic subspace and offer a superior guiding prefix for image captioning.
The visual results of MAE-MilitIC and previous works on MOCO are presented in Figure 7. In this figure, we present the visual results for eight military scenarios from four perspectives, including long-range perspectives (the first, fifth, and sixth images), low-altitude perspectives (the second and fourth images), a partial view (incomplete imaging, the third image), and close-up perspectives (the seventh and eighth images). As can be seen, our MAE-MilitIC can maintain semantic consistency in objects, actions, and environment with the ground truth, and overall successfully depict the image. As a text-only training method, CapDec missed out on major visual content and produced a large number of hallucinations, with the former due to the weak visual ability it inherited from CLIP, and the latter because it did not align vision and text embeddings well. ClipCap, without semantic enhancement, was not accurate enough to describe military scenarios. For example, in the fifth image of Figure 7, it failed to describe the military high-level semantic concepts of outposts and trenches, and in the third image, it was unable to capture environmental information. The fourth image of Figure 7 illustrates the limited counting capability of the three methods, which we attributed to a weaker visual encoder that captured only two military soldiers (the corresponding heatmaps are shown in Figure 8). Moreover, as shown in Figure 8, we utilized the relevancy heatmap [65] to demonstrate the consistency between the images and the main caption components, which was generated using the image encoder $Enc_{img}$ and text encoder $Enc_{txt}$. The results indicate that MAE-MilitIC described the image content accurately and produced almost no hallucinations. Additional visualization results for complex military scenarios can be found in Appendix A.
In summary, from various image perspectives, the main objects and environments in the MAE-MilitIC captions were well reflected in the images, which indicate that our method could accurately describe the images and avoid the generation of hallucinations. This could be attributed to the semantic subspace effectively preserving the relevant target semantics while weakening the semantics of irrelevant targets.

5.5. Ablation Studies

In this section, we optimized the parameters to identify the best configuration for our method and evaluated the effectiveness of each step in the MAE mechanism.

5.5.1. Parameter Optimization

To achieve the best model performance, we employed the grid search method for parameter optimization. We selected candidate values for $\lambda$ from [0, 0.1, 0.3, 0.5] and for $\varepsilon$ from [0.1, 0.3, 0.5], giving 12 parameter combinations in total. After training on the MOCO dataset, the experimental results are shown in Figure 9a. The best BLEU-1 result was 0.494 when the parameter combination was set to $\lambda = 0.1$ and $\varepsilon = 0.1$. For the MAE-MilitIC variant based on PCA mapping, the number of principal components $\hat{N}$ was selected from [5, 10, 20, 30, 40]. As illustrated in Figure 9b, the metrics BLEU-1, METEOR, and ROUGE-L consistently achieved their optimal values when $\hat{N}$ was set to 10. Thus, we determined the values of the three parameters to be $\lambda = 0.1$, $\varepsilon = 0.1$, and $\hat{N} = 10$.
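A minimal sketch of this grid search, assuming a hypothetical helper train_and_evaluate that trains MAE-MilitIC with the given weights and returns the BLEU-1 score on the MOCO test split.

```python
from itertools import product

lambdas = [0.0, 0.1, 0.3, 0.5]
epsilons = [0.1, 0.3, 0.5]

# Evaluate all 12 combinations and keep the one with the best BLEU-1.
results = {(lam, eps): train_and_evaluate(lam, eps) for lam, eps in product(lambdas, epsilons)}
best_lam, best_eps = max(results, key=results.get)
print(f"best combination: lambda={best_lam}, epsilon={best_eps}, BLEU-1={results[(best_lam, best_eps)]:.3f}")
```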

5.5.2. Ablation of MAE-MilitIC

To explore the impacts of the main components in MAE-MilitIC, i.e., the mapping step, the augmentation step, and the residual item, we conducted comprehensive ablation studies on the MOCO dataset. We evaluated MAE-MilitIC with both GSO and PCA mapping, whose results are provided in Table 5.
As we can see, for the two variants of MAE-MilitIC, retaining only the augmentation step yielded the most obvious increase in results. This indicates that utilizing both image and text data, as well as employing one modality to augment the other, was effective. Furthermore, as shown in the fourth row of Table 5, when using both the mapping and augmentation steps, the performance continued to improve. By mapping to a semantic subspace instead of the VL joint space, the augmentation embedding yielded better results. This confirmed that the mapping step could highlight clear military semantics. Finally, after adding the residual item to ensure modal consistency between $A(m_I)$ and $e_I$, the best caption generation performance was achieved, as shown in the fifth row of Table 5. Moreover, for the results of MAE-MilitIC (PCA), retaining only augmentation yielded more competitive results than keeping both augmentation and mapping, yet it performed worse than retaining all three components. It can be inferred that using a mapping step that ignores modal consistency can harm the method’s performance.
In summary, the mapping step, the augmentation step, and the residual item were all helpful for the image-captioning task. Our MAE-MilitIC achieved the best performance in generating accurate military image captions.

6. Conclusions

In this study, we endeavored to improve the performance of MilitIC by offering a reliable solution for the advancement of military image understanding and intelligence generation. One of the contributions of our work was to build the first open-access MilitIC benchmark dataset MOCO, which includes detailed images of real combat scenarios and a rich set of captions. Another contribution was the proposal of a novel encoder–augmentation–decoder architecture MAE-MilitIC, which allows the MAE mechanism to leverage image–text modalities for guiding caption generation and to augment image embeddings with explicit military text in a semantic subspace. The efficacy of MAE-MilitIC was assessed across MOCO and MADD datasets by utilizing various baselines for captioning military images.
This research contributes to understanding military images from UAVs and UGVs, and generates accurate captions for challenging scenarios. Future work will leverage the modal alignment and semantic augmentation capabilities of the MAE mechanism to perform text-based image retrieval, which will quickly identify relevant visual patterns in large image data from UAVs and UGVs.

Author Contributions

Methodology, L.P.; software, L.P.; validation, X.G., K.X. and Y.X.; formal analysis, L.P.; investigation, L.P.; resources, X.G. and K.X.; data curation, L.P.; writing—original draft preparation, L.P.; writing—review and editing, L.P. and C.S.; visualization, X.G., K.X. and Y.X.; supervision, C.S. and Y.X.; project administration, C.S.; funding acquisition, C.S. All authors read and agreed to the published version of this manuscript.

Funding

This research was funded by the National Natural Science Foundation of China grant number 61973038.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study can be obtained by applying to MOCO at https://github.com/Panlizhi/MOCO (accessed on 1 August 2024). For the MOCO data, we developed a clear policy for data security and access control. Each potential user will be asked to provide background check information for consideration.

DURC Statement

Current research is limited to intelligence generation based on drone images, which is beneficial to the field of security reconnaissance, such as danger perception and early attack warnings, and does not pose a threat to public health or national security. The authors acknowledge the dual-use potential of the research involving military image captioning and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, the authors strictly adhere to relevant national and international laws about DURC. The authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Different from the perspective transformation shown in Figure 8, this section demonstrates the captioning ability of our MAE-MilitIC across various complex military scenarios. We present the visual results of ten typical military scenarios in Figure A1. In a low-quality and blurry military image (Figure A1, the first image), our method was able to capture the tank’s position and surrounding environment and generate a smooth description. For the human–computer interaction scenario (Figure A1, the eighth image), the model succeeded in understanding the basic interaction mode or action, and then expressed it accurately. In the dimly lit scene shown in the ninth image of Figure A1, the proposed method had the ability to accurately describe the human silhouette in the shadows.
Therefore, it can be said that our method demonstrated strong generalization capabilities for complex military scenarios.
Figure A1. The relevancy heatmap between the image and the MAE-MilitIC captions on MOCO.

References

  1. Peng, H.; Zhang, Y.; Yang, S.; Song, B. Battlefield image situational awareness application based on deep learning. IEEE Intell. Syst. 2019, 35, 36–43. [Google Scholar] [CrossRef]
  2. Monteiro, J.; Kitamoto, A.; Martins, B. Situational awareness from social media photographs using automated image captioning. In Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, 19–21 October 2017; IEEE: Piscataway Township, NJ, USA, 2017; pp. 203–211. [Google Scholar]
  3. Robertson, J. Integrity of a common operating picture in military situational awareness. In Proceedings of the 2014 Information Security for South Africa, Johannesburg, South Africa, 13–14 August 2014; IEEE: Piscataway Township, NJ, USA, 2014; pp. 1–7. [Google Scholar]
  4. Schwartz, P.J.; O’Neill, D.V.; Bentz, M.E.; Brown, A.; Doyle, B.S.; Liepa, O.C.; Lawrence, R.; Hull, R.D. AI-enabled wargaming in the military decision making process. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications II. SPIE, Online, 27 April–8 May 2020; Volume 11413, pp. 118–134. [Google Scholar]
  5. Yu, J.; Li, J.; Yu, Z.; Huang, Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4467–4480. [Google Scholar] [CrossRef]
  6. Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. Rstnet: Captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15465–15474. [Google Scholar]
  7. Fang, Z.; Wang, J.; Hu, X.; Liang, L.; Gan, Z.; Wang, L.; Yang, Y.; Liu, Z. Injecting semantic concepts into end-to-end image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18009–18019. [Google Scholar]
  8. Wang, Y.; Xu, J.; Sun, Y. End-to-end transformer based model for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2585–2594. [Google Scholar]
  9. Tewel, Y.; Shalev, Y.; Schwartz, I.; Wolf, L. Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17918–17928. [Google Scholar]
  10. Su, Y.; Lan, T.; Liu, Y.; Liu, F.; Yogatama, D.; Wang, Y.; Kong, L.; Collier, N. Language models can see: Plugging visual controls in text generation. arXiv 2022, arXiv:2205.02655. [Google Scholar]
  11. Nukrai, D.; Mokady, R.; Globerson, A. Text-Only Training for Image Captioning using Noise-Injected CLIP. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Online, 7–11 December 2022; pp. 4055–4063. [Google Scholar]
  12. Li, W.; Zhu, L.; Wen, L.; Yang, Y. Decap: Decoding clip latents for zero-shot captioning via text-only training. arXiv 2023, arXiv:2303.03032. [Google Scholar]
  13. Gu, S.; Clark, C.; Kembhavi, A. I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 2672–2683. [Google Scholar]
  14. Zeng, Z.; Xie, Y.; Zhang, H.; Chen, C.; Wang, Z.; Chen, B. MeaCap: Memory-Augmented Zero-shot Image Captioning. arXiv 2024, arXiv:2403.03715. [Google Scholar]
  15. Das, S.; Jain, L.; Das, A. Deep learning for military image captioning. In Proceedings of the 2018 21st International Conference on Information Fusion (FUSION), Cambridge, UK, 10–13 July 2018; IEEE: Piscataway Township, NJ, USA, 2018; pp. 2165–2171. [Google Scholar]
  16. Ghataoura, D.; Ogbonnaya, S. Application of image captioning and retrieval to support military decision making. In Proceedings of the 2021 International Conference on Military Communication and Information Systems (ICMCIS), Hague, The Netherlands, 4–5 May 2021; IEEE: Piscataway Township, NJ, USA, 2021; pp. 1–8. [Google Scholar]
  17. Lee, C.E.; Baek, J.; Son, J.; Ha, Y.G. Deep AI military staff: Cooperative battlefield situation awareness for commander’s decision making. J. Supercomput. 2023, 79, 6040–6069. [Google Scholar] [CrossRef]
  18. Pan, J.Y.; Yang, H.J.; Duygulu, P.; Faloutsos, C. Automatic image captioning. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME)(IEEE Cat. No. 04TH8763), Taipei, Taiwan, 27–30 June 2004; IEEE: Piscataway Township, NJ, USA, 2004; Volume 3, pp. 1987–1990. [Google Scholar]
  19. Farhadi, A.; Hejrati, M.; Sadeghi, M.A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; Forsyth, D. Every picture tells a story: Generating sentences from images. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part IV 11. Springer: Berlin/Heidelberg, Germany, 2010; pp. 15–29. [Google Scholar]
  20. Ordonez, V.; Kulkarni, G.; Berg, T. Im2text: Describing images using 1 million captioned photographs. Adv. Neural Inf. Process. Syst. 2011, 24, 1143–1151. [Google Scholar]
  21. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; Mikolov, T. Devise: A deep visual-semantic embedding model. Adv. Neural Inf. Process. Syst. 2013, 26, 2121–2129. [Google Scholar]
  22. Karpathy, A.; Joulin, A.; Li, F. Deep fragment embeddings for bidirectional image sentence mapping. In Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2, Valencia, Spain, 2–4 May 2014; pp. 1889–1897. [Google Scholar]
  23. Yao, B.; Yang, X.; Lin, L.; Lee, M.; Zhu, S.C. I2T: Image parsing to text description. Proc. IEEE 2010, 98, 1485–1508. [Google Scholar] [CrossRef]
  24. Aker, A.; Gaizauskas, R. Generating image descriptions using dependency relational patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 1250–1258. [Google Scholar]
  25. Yang, Y.; Teo, C.; Daume III, H.; Aloimonos, Y. Corpusguided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, 27–31 July 2011; pp. 444–454. [Google Scholar]
  26. Li, S.; Kulkarni, G.; Berg, T.; Berg, A.; Choi, Y. Composing simple image descriptions using webscale ngrams. In Proceedings of the 5th Conference on Computational Natural Language Learning, Portland, OR, USA, 23–24 June 2011; pp. 220–228. [Google Scholar]
  27. Gupta, A.; Verma, Y.; Jawahar, C. Choosing linguistics over vision to describe images. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, Toronto, ON, Canada, 22–26 July 2012; pp. 606–612. [Google Scholar]
  28. Mitchell, M.; Grishman, R. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 23–27 April 2012; pp. 747–756. [Google Scholar]
  29. Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. BabyTalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2891–2903. [Google Scholar] [CrossRef]
  30. Kuznetsova, P.; Ordonez, V.; Berg, T.; Choi, Y. TreeTalk: Composition and compression of trees for image descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 351–362. [Google Scholar] [CrossRef]
  31. Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huang, Z.; Yuille, A. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv 2014, arXiv:1412.6632. [Google Scholar]
  32. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  33. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning PMLR, Lille, France, 7–9 July 2015; pp. 2048–2057. [Google Scholar]
  34. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
  35. Wang, Z.; Yu, J.; Yu, A.W.; Dai, Z.; Tsvetkov, Y.; Cao, Y. Simvlm: Simple visual language model pretraining with weak supervision. arXiv 2021, arXiv:2108.10904. [Google Scholar]
  36. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  37. Mokady, R.; Hertz, A.; Bermano, A.H. Clipcap: Clip prefix for image captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar]
Figure 1. Comparison of deep-learning-based image-captioning architectures. The encoder–decoder architecture uses image embeddings as the guiding prefix (blue square) for caption generation in both training and testing. The encoder (VL)–decoder architecture uses text embeddings as the guiding prefix during training and image embeddings during testing. The proposed encoder–augmentation–decoder architecture uses an image guiding prefix that is augmented by text through the MAE mechanism. (a) Encoder–decoder architecture. (b) Encoder (VL)–decoder architecture. (c) Encoder–augmentation–decoder architecture (our MAE-MilitIC). (d) The calculation of the MAE mechanism.
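As a concrete illustration of panel (d), the following sketch shows one way the map augmentation embedding (MAE) step described above could be realized in code. It is a reconstruction from the figure description rather than the authors' released implementation; the helper names, the default subspace size k, and the roles assigned to the weights lam (λ) and eps (ε) are assumptions made for illustration.

```python
import numpy as np

def subspace_basis(prompt_embeds: np.ndarray, method: str = "gso", k: int = 32) -> np.ndarray:
    """Orthonormal basis (k x d) of the semantic subspace spanned by the
    embeddings of the military text prompts (prompt_embeds has shape N x d)."""
    if method == "gso":
        # Gram-Schmidt orthogonalization of the prompt embeddings via QR.
        q, _ = np.linalg.qr(prompt_embeds.T)   # q: d x min(N, d), orthonormal columns
        return q[:, :k].T
    # PCA variant: top-k principal directions of the centered prompt embeddings.
    centered = prompt_embeds - prompt_embeds.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def mae_prefix(img_emb: np.ndarray, txt_emb: np.ndarray, basis: np.ndarray,
               lam: float = 0.5, eps: float = 0.1) -> np.ndarray:
    """Map both embeddings onto the subspace, augment the mapped image
    embedding with the mapped text embedding, and add a residual of the
    original image embedding before using the result as the guiding prefix."""
    project = lambda e: basis.T @ (basis @ e)  # orthogonal projection onto span(basis)
    augmented = project(img_emb) + lam * project(txt_emb)
    prefix = augmented + eps * img_emb         # residual connection
    return prefix / np.linalg.norm(prefix)     # keep a CLIP-style unit norm
```

In this sketch the basis is built once from the prompt embeddings and reused for every image, which keeps the augmentation step cheap at inference time.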
Figure 2. Typical images and captions in MOCO.
Figure 3. Statistics of the MOCO dataset. (a) Average size of images in MOCO. (b) Word cloud of image captions in MOCO.
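The word cloud in Figure 3b can be reproduced with the commonly used wordcloud and matplotlib packages; the snippet below is a minimal sketch that assumes the MOCO captions are available as a list of strings.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def caption_wordcloud(captions: list[str]) -> None:
    """Render a word cloud from all captions joined into a single string."""
    cloud = WordCloud(width=800, height=400,
                      background_color="white").generate(" ".join(captions))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```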
Figure 4. Framework of the main backbone.
Figure 5. PCA visualization of image, text, and augmentation embeddings in two spaces on MOCO. Note that in the VL joint space, the image and text embeddings were almost non-overlapping. However, compared with the image embeddings, the augmentation embeddings exhibited a higher degree of overlap with the text embeddings in the semantic subspace. (a) VL joint space. (b) Semantic subspace W_GSO. (c) Semantic subspace W_PCA.
Figure 6. PCA visualization of image, text, and augmentation embeddings in two spaces on MADD. (a) VL joint space. (b) Semantic subspace W_GSO. (c) Semantic subspace W_PCA.
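Figures 5 and 6 can be reproduced in spirit with scikit-learn and matplotlib. The sketch below assumes the image, text, and augmentation embeddings are already available as NumPy arrays; it fits a single 2-D PCA on their union so that all three sets are plotted in the same plane.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_embedding_space(img_emb, txt_emb, aug_emb, title="VL joint space"):
    """2-D PCA scatter of image, text, and augmentation embeddings (each N x d)."""
    pca = PCA(n_components=2).fit(np.vstack([img_emb, txt_emb, aug_emb]))
    for emb, label, marker in [(img_emb, "image", "o"),
                               (txt_emb, "text", "^"),
                               (aug_emb, "augmentation", "s")]:
        xy = pca.transform(emb)
        plt.scatter(xy[:, 0], xy[:, 1], s=8, alpha=0.5, marker=marker, label=label)
    plt.title(title)
    plt.legend()
    plt.show()
```

Calling the function once with embeddings from the VL joint space and once with embeddings mapped into W_GSO or W_PCA gives the panel-by-panel comparison shown in the figures.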
Figure 7. Qualitative results of MAE-MilitIC and baselines from different perspectives of military objects in MOCO. In the captions, red indicates objects, blue signifies actions or interactions, and green represents the background or environment. A strikethrough denotes a hallucination (non-factual content).
Figure 8. The relevancy heatmaps between the images and the MAE-MilitIC captions on MOCO. Visual–text relevancy is indicated by red for a high correlation and blue for a low one. The captions generated by our method could accurately align with the image content and contained very few hallucinations.
Figure 9. The parameter optimization of MAE-MilitIC on MOCO. (a) The parameter optimization of λ and ε. (b) The parameter optimization of N̂ (with N̂ < N).
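The sweep in Figure 9a is an ordinary grid search over λ and ε. The sketch below shows how such a sweep might be organized; train_fn, eval_fn, and the candidate value grids are placeholders rather than the paper's actual settings.

```python
from itertools import product

def grid_search(train_fn, eval_fn,
                lambdas=(0.1, 0.3, 0.5, 0.7, 0.9),
                epsilons=(0.05, 0.1, 0.2, 0.4)):
    """Exhaustively try every (lambda, epsilon) pair and keep the one with
    the best validation score (e.g., BLEU-4 on a held-out split)."""
    best_pair, best_score = None, float("-inf")
    for lam, eps in product(lambdas, epsilons):
        model = train_fn(lam=lam, eps=eps)   # train one MAE-MilitIC variant
        score = eval_fn(model)               # score it on the validation set
        if score > best_score:
            best_pair, best_score = (lam, eps), score
    return best_pair, best_score
```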
Table 1. Military image datasets. Note that our dataset includes more target categories and captions, and it is an open-access resource.
Dataset | Categories | Images and Annotations | Task | Open Access
GMTD | Tank, transporter, and rocket launcher | 11,036 images, 20,132 targets | Object detection for ground equipment | –
SFD | Pistol and rifle | 3140 images, 3140 labels | Object detection for guns | –
COD10K+ | Person | 2592 images, 2592 labels | Object detection for camouflage | –
MHCD | Person, airplane, vehicle, and warship | 3000 images, 3000 labels | Object detection for camouflage | Yes
MADD | Aircraft | 14,088 images, 14,088 labels | Object detection for aircraft | Yes
MOCO (ours) | Person, tank, rocket, fighter jet, helicopter, UAV, UGV, boat, and defensive structure | 7449 images, 37,245 captions | Image captioning for real combat | Yes
Table 2. Examples of basic text prompts and their main elements.
Text Prompt | Military Personnel | Combat Equipment | Battlefield Environment
Soldiers crossed the frontline under heavy fire. | Soldier | – | Fire
Snipers positioned on rooftops monitored enemy movements. | Sniper | – | Rooftop
Unmanned aerial drones conducted reconnaissance missions. | – | Unmanned aerial drones | –
UGVs transported supplies to forward positions. | – | UGV | –
Fighter jets engaged in combat maneuvers above the foggy mountains. | – | Fighter jet | Foggy mountains
The jet conducted a reconnaissance mission under the cover of darkness. | – | Jet | Darkness
The tank engaged enemy forces in the open plains. | – | Tank | Open plains
The military vehicle traversed the muddy battlefield under a heavy downpour. | – | Military vehicle | Heavy downpour
The soldiers that operated the anti-aircraft unit targeted enemy aircraft in the sky. | Soldier | Anti-aircraft unit | Sky
Armored vehicles deployed smoke screens to conceal troop movements. | – | Armored vehicle | Smoke screens
The howitzer launched rounds into the night sky. | – | Howitzer | Night sky
Cannons positioned on the coastal cliffs targeted ships at sea. | – | Cannon | Sea
The combat engineers are setting up fences and trenches. | Combat engineer | – | Trench
Engineering vehicles repaired damaged roadway. | – | Engineering vehicle | Roadway
Table 3. Quantitative results of MAE-MilitIC and baselines on MOCO. The best score in each column is achieved by MAE-MilitIC (GSO).
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | SPICE
ClipCap | 0.441 | 0.275 | 0.173 | 0.111 | 0.158 | 0.337 | 0.102
CapDec | 0.363 | 0.184 | 0.086 | 0.041 | 0.111 | 0.264 | 0.059
MAE-MilitIC (GSO) | 0.494 | 0.322 | 0.212 | 0.141 | 0.163 | 0.353 | 0.113
MAE-MilitIC (PCA) | 0.460 | 0.290 | 0.181 | 0.112 | 0.158 | 0.336 | 0.102
Table 4. Quantitative results of MAE-MilitIC and baselines on MADD.
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | SPICE
ClipCap | 0.547 | 0.443 | 0.369 | 0.271 | 0.230 | 0.576 | 0.437
CapDec | 0.542 | 0.445 | 0.370 | 0.265 | 0.241 | 0.586 | 0.430
MAE-MilitIC (GSO) | 0.553 | 0.452 | 0.379 | 0.282 | 0.236 | 0.583 | 0.449
MAE-MilitIC (PCA) | 0.555 | 0.448 | 0.371 | 0.269 | 0.241 | 0.578 | 0.459
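The numbers in Tables 3 and 4 are standard captioning metrics. As an illustration only, and not the paper's exact evaluation pipeline, BLEU-1 through BLEU-4 and ROUGE-L can be computed with the widely used nltk and rouge-score packages (METEOR and SPICE typically require the Java-based COCO caption evaluation tools).

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def caption_scores(references, hypotheses):
    """references: one list of reference captions (strings) per image;
    hypotheses: the generated caption (string) for each image, same order."""
    refs_tok = [[r.lower().split() for r in refs] for refs in references]
    hyps_tok = [h.lower().split() for h in hypotheses]
    smooth = SmoothingFunction().method1
    scores = {f"BLEU-{n}": corpus_bleu(refs_tok, hyps_tok,
                                       weights=tuple([1.0 / n] * n),
                                       smoothing_function=smooth)
              for n in range(1, 5)}
    # Multi-reference ROUGE-L is handled here by taking the best F-score
    # over the references; the paper's aggregation may differ.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores["ROUGE-L"] = sum(
        max(scorer.score(r, h)["rougeL"].fmeasure for r in refs)
        for refs, h in zip(references, hypotheses)) / len(hypotheses)
    return scores
```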
Table 5. Ablation studies of MAE-MilitIC on MOCO.
Methods | Mapping | Augmentation | Residual | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | SPICE
MAE-MilitIC (GSO) | | | | 0.335 | 0.183 | 0.102 | 0.057 | 0.115 | 0.279 | 0.068
 | | | | 0.440 | 0.276 | 0.174 | 0.112 | 0.160 | 0.331 | 0.106
 | | | | 0.475 | 0.303 | 0.193 | 0.124 | 0.156 | 0.346 | 0.107
 | | | | 0.494 | 0.322 | 0.212 | 0.141 | 0.163 | 0.353 | 0.113
MAE-MilitIC (PCA) | | | | 0.396 | 0.220 | 0.123 | 0.073 | 0.127 | 0.299 | 0.076
 | | | | 0.440 | 0.276 | 0.174 | 0.112 | 0.160 | 0.331 | 0.106
 | | | | 0.453 | 0.278 | 0.172 | 0.105 | 0.150 | 0.325 | 0.100
 | | | | 0.460 | 0.290 | 0.181 | 0.112 | 0.158 | 0.336 | 0.102