In this section, we present experiments on standardized tasks using public datasets that verify the effectiveness of the basic modules through performance comparisons with similar models. Specifically, to assess the impact of cross-modal alignment, we performed species image recognition experiments on the iNaturalist 2017 dataset, as shown in
Table 5, and cross-modal retrieval experiments on the MSCOCO dataset, as shown in
Table 6. This set of experiments aims to verify whether the alignment between images and text is achieved after projecting them into a shared space using a pair of momentum encoders. We also conducted multi-species text-to-image retrieval on the NACID dataset, as shown in
Table 7. The purpose of this set of experiments is to validate the clustering of language descriptions around images in the cross-modal space through cross-modal correlation search. After validating the fundamental functionality of the model, we performed cross-modal question-answering experiments on the public ScienceQA dataset and compared our results with leading models on this dataset, as shown in
Table 8. Finally, we demonstrated the performance of our proposed method in completing cross-modal question-answering tasks in forestry ecology, as illustrated in
Figure 10. The purpose of this set of experiments is to validate the ability of a pair of momentum encoders and the language generation module to collaboratively perform question-answering inference.
4.1. Main Results
We abbreviate the proposed method as PaMA (Parameterization before Meta-Analysis). To validate the model, we conducted experiments on standardized tasks using public datasets and compared the results with state-of-the-art (SOTA) models on the corresponding leaderboards. First, to validate the image encoder, we performed image classification on the iNaturalist 2017 dataset [38]. Second, to validate the cross-modal encoder, we conducted image–text cross-modal retrieval experiments on the MSCOCO dataset [63]. Finally, to validate the question-answering model, we conducted experiments on the ScienceQA dataset [46].
As shown in
Table 5, we conducted experiments on the image classification task of the iNaturalist 2017 dataset and compared the results with the SOTA methods on the leaderboard. The results show that the image encoder trained with the momentum method proposed in
Section 3.5 achieved a top-1 accuracy 1.7% to 51% higher than that of comparable methods, demonstrating the effectiveness of momentum encoding.
The performance of the image encoder determines whether image information can be accurately extracted and expressed in cross-modal question-answering tasks within forestry ecology. In other words, if the image encoder lacks sufficient accuracy in visual object recognition and does not possess high discriminative power for species identification within the domain of forestry ecology, it cannot support the cross-modal question-answering tasks of this study. Comparing our method with similar approaches on standardized tasks in public datasets can more rigorously validate its effectiveness and performance level. The results indicate that the image encoder in our proposed method can accurately identify visual objects in images, ranking among the top methods. Because the selected iNaturalist 2017 dataset includes 5089 species, these results also demonstrate that the proposed image encoder can distinguish fine-grained visual species targets. In summary, the image encoder in our method is capable of encoding visual information for question-answering tasks in forestry ecology.
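For concreteness, the following is a minimal sketch of how such a momentum encoder can be maintained, assuming a MoCo-style exponential moving average (EMA) of the trained encoder's parameters; the function names and the momentum coefficient are illustrative assumptions rather than the exact implementation of Section 3.5.

```python
import copy
import torch

def build_momentum_encoder(encoder: torch.nn.Module) -> torch.nn.Module:
    """Create a copy of the encoder that is updated only by momentum, not backprop."""
    momentum_encoder = copy.deepcopy(encoder)
    for p in momentum_encoder.parameters():
        p.requires_grad = False  # updated via EMA below
    return momentum_encoder

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m: float = 0.995):
    """EMA update: theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_q, p_k in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```

In a typical momentum-contrastive setup, the gradient-updated encoder encodes one side of each pair while the momentum encoder encodes the other, and `momentum_update` is called once per training step.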
As mentioned in
Section 3, our proposed method includes a pair of encoders: an image encoder for extracting visual features and a text encoder for extracting linguistic features. These encoders project the extracted features from both modalities into a shared space, aiming to bring similar features closer and push dissimilar ones further apart. A suitable experiment for evaluating this pair of encoders is cross-modal retrieval, whose goal is to retrieve the corresponding text from a text dataset given an image, or the corresponding images from an image dataset given a text query. Higher retrieval accuracy indicates better performance of the cross-modal encoders. Conducting standardized experiments on public datasets and comparing the results with similar methods can both validate the effectiveness of the proposed method and measure its performance level. For these reasons, we chose the cross-modal retrieval experiment on the MSCOCO dataset, which has many comparable methods and a high degree of standardization, to verify the performance of our proposed pair of cross-modal encoders. The experimental results are shown in
Table 6. PaMA achieved performance comparable to the SOTA models on the leaderboard in both image-to-text and text-to-image retrieval, demonstrating the effectiveness of our proposed cross-modal momentum encoder. The evaluation metric is R@K, the proportion of queries for which the correct match appears among the top K retrieved items; a higher R@K indicates higher retrieval accuracy.
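For readers unfamiliar with the metric, the sketch below shows how R@K can be computed from L2-normalized image and text embeddings in the shared space; it assumes one ground-truth caption per image (MSCOCO actually provides five), so it is an illustrative simplification of the evaluation protocol rather than the exact one used here.

```python
import torch

def recall_at_k(image_emb, text_emb, ks=(1, 5, 10)):
    """R@K for paired embeddings: row i of image_emb matches row i of text_emb."""
    img = torch.nn.functional.normalize(image_emb, dim=-1)
    txt = torch.nn.functional.normalize(text_emb, dim=-1)
    sim = img @ txt.t()                               # cosine similarity matrix
    gt = torch.arange(sim.size(0), device=sim.device)
    results = {}
    for name, scores in (("i2t", sim), ("t2i", sim.t())):
        ranks = scores.argsort(dim=-1, descending=True)
        # position of the ground-truth item in each query's ranked list
        hit_rank = (ranks == gt.unsqueeze(1)).nonzero()[:, 1]
        for k in ks:
            results[f"{name}_R@{k}"] = (hit_rank < k).float().mean().item()
    return results
```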
To more intuitively demonstrate the cross-modal retrieval capabilities of the model, the top 3 retrieval results for conservation image data using forestry ecology descriptions are presented in
Table 7.
The experimental results indicate that the performance of the cross-modal encoders in our proposed method ranks among the top compared to similar approaches, ensuring the accuracy of data feature extraction in cross-modal question-answering tasks within forestry ecology. Specifically, this pair of cross-modal encoders can effectively represent data from their respective modalities and reflect the correlation between image and text features through cross-modal representation space embedding. Higher performance of the cross-modal encoders implies greater accuracy in extracting features from each modality and in computing cross-modal similarity. As shown in
Table 7, the accuracy of cross-modal retrieval is generally high, although errors do occur, such as mistaking Ludwigia alternifolia for Lotus corniculatus. Nevertheless, the cross-modal alignment between the literature knowledge and the image data is semantically consistent. In summary, the pair of cross-modal encoders in our proposed method provides robust support for cross-modal question-answering in forestry ecology.
The experiments above indicate that using fact features within images as centroids for cross-modal representation clustering in a shared semantic space is feasible. The results in
Table 5 demonstrate that the image encoder can effectively classify image data even after projecting visual features into the cross-modal space, indicating that the embedding of image features in space conforms to the distribution of similarities among individual samples in the image set. The results in
Table 6 indicate that after projecting the representations of each modality into the shared space, the image encoder and text encoder can accurately calculate their mutual similarity, demonstrating the effectiveness of the proposed embedding clustering method.
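A minimal sketch of the embedding-clustering objective implied above, assuming an InfoNCE-style formulation in which each image embedding acts as the centroid that its paired description is pulled toward; the temperature value, the symmetric loss, and the function name are assumptions for illustration, not the exact training objective.

```python
import torch
import torch.nn.functional as F

def centroid_contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    """Treat each image embedding as the centroid its paired caption should cluster around."""
    centroids = F.normalize(image_emb, dim=-1)     # one centroid per image
    texts = F.normalize(text_emb, dim=-1)
    logits = texts @ centroids.t() / temperature   # text-to-centroid similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # pull each text toward its own centroid, push it away from the others (and vice versa)
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2i + loss_i2t)
```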
After validating the effectiveness of the cross-modal encoder module, we proceeded to verify the overall effectiveness of the proposed method, specifically its performance in executing cross-modal question-answering tasks when all modules are integrated into a forestry ecology question-answering model. Consistent with the principles of the previous experiments, we conducted standardized experiments on public datasets. These experiments not only validated the model’s effectiveness but also assessed the performance level of the proposed method by comparing it with similar approaches.
The scores in
Table 8 represent the percentage of correct answers. The questions in the ScienceQA dataset are divided into several categories, and the publicly available leaderboard mainly lists the following: NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1–6 = grades 1–6, and G7–12 = grades 7–12; "Avg" denotes the average score across these eight categories. Compared to GPT-3.5, PaMA achieved an average score 16.46% higher, with a particularly notable increase of 21.94% in the IMG category, demonstrating that the proposed method possesses the capability for cross-modal question-answering.
The performance of PaMA on QA tasks stems from two factors: the effectiveness of the model's encoding and decoding, and the effect of orthogonal information superposition. As seen in the data transformation process illustrated in
Figure 9, features are first extracted from the raw data by the encoders, then fused by the feature fusion network, and finally decoded by the language generator. In this encoding–decoding process, ensuring the effectiveness of both encoding and decoding is crucial for successfully completing the QA task, and the experimental results show that our proposed method achieves this. However, the performance of models such as GPT-3.5 in the IMG category, as shown in
Table 8, differs markedly from that of models incorporating visual information. What could be the reason for this? First, in terms of the language models being compared, our method adopts GPT-2, which is less proficient in natural language processing than GPT-3.5. Second, all comparisons are made on the same ScienceQA dataset and task; the difference lies in our model's incorporation of momentum visual encoding and cross-modal feature fusion.
How does this difference enable a relatively weaker language model to perform better? The primary reason is the effect of orthogonal information superposition. Because GPT is fundamentally a predictive model, as indicated in Equation (5), more known conditioning information lowers the entropy of the prediction and raises its accuracy. The information contained in the textual portion of the same dataset is fixed, and GPT-3.5, being the stronger language model, can already extract more of it than GPT-2. Therefore, the only way our GPT-2-based model can significantly surpass GPT-3.5 in the IMG category is by effectively introducing the additional information carried by visual features.
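This argument can be restated information-theoretically; the following is a standard restatement under the autoregressive assumption, not a reproduction of Equation (5). With answer tokens $A$, textual context $T$, and visual features $V$, the generator factorizes as

$$p(A \mid T, V) \;=\; \prod_{t=1}^{|A|} p\!\left(a_t \,\middle|\, a_{<t},\, T,\, V\right),$$

and because conditioning on additional evidence never increases uncertainty,

$$H(A \mid T, V) \;\le\; H(A \mid T),$$

so fusing visual features with the textual context gives the language generator strictly more information to condition on, which is how a GPT-2-based decoder can compensate for its weaker language modeling.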
From the perspective of cross-modal representation spaces, the effective superposition of visual and linguistic information depends on two factors. The first is whether the encoders can extract features from the data samples to the maximum extent; the second is the quality of the cross-modal shared semantic space embedding, which determines whether the embedded feature points reflect the data distribution and the similarities among samples. These two aspects are precisely what the cross-modal momentum encoder, i.e., cross-modal embedding clustering, addresses. In other words, the cross-modal momentum encoder is an effective method for performing cross-modal embedding clustering.
To what extent, then, does the fusion module ResAtt contribute to the cross-modal superposition of visual and linguistic information? To answer this question, we conducted two ablation experiments.
4.2. Ablation Study
In cross-modal question-answering tasks within forestry ecology, the model needs to accurately distinguish between species with high intra- and inter-species visual similarity. Additionally, it must align these fine-grained visual details with specialized forestry ecology knowledge. To more precisely encode image and textual data in the forestry ecology domain, we optimized the model using a momentum encoder. To evaluate this optimization, we replaced the momentum encoder with two widely used visual encoders and compared the resulting models against the model with the momentum encoder. This comparison validated the effectiveness of the momentum encoder and its contribution to the overall model performance. Before guiding the language model to generate answers for users, we performed cross-modal information fusion on the input-side visual and language features. To assess its contribution to the overall model performance, we conducted ablation experiments comparing models with and without the cross-modal fusion module. Our ablation experiments were conducted on 6532 samples that include both textual and image context, and the results are presented in
Table 9.
In
Table 9, PaMA/ResNet represents replacing the momentum visual encoder with ResNet [
64], PaMA/CLIP represents replacing the momentum visual encoder with CLIP [
31], “with ResAtt” indicates the presence of a feature fusion network, and “without ResAtt” indicates the absence of a feature fusion network.
Compared to the 89.37% accuracy of PaMA in the IMG column of
Table 8, the models with a feature fusion network in
Table 9 experienced a decrease in accuracy of approximately 5% to 8%, while models without a feature fusion network saw a decrease of about 7% to 12%. In the "with ResAtt" column, the smaller decrease indicates that, when the feature fusion network is retained, our proposed momentum encoder performs better than ResNet and CLIP, highlighting the higher quality of visual features extracted by momentum-based contrastive learning. In the "without ResAtt" column, the larger decrease shows that replacing the momentum visual encoder and removing the feature fusion network together lead to even greater performance drops, demonstrating that both the momentum encoder and the feature fusion network are indispensable.
Moreover, the reasoning chain in the question-answering task is relatively long, resulting in a lengthy textual context. In such cases, our proposed feature fusion network with visual residual connections is more suitable for integrating visual and language features within extended textual context, thereby effectively reducing uncertainties during the language model’s prediction process.
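To illustrate what a fusion block with a visual residual connection can look like, the sketch below re-injects pooled visual features after text-to-image cross-attention, so that visual information is not diluted as the textual context grows; the module name ResAttBlock, the layer sizes, and the single-layer design are assumptions for illustration and not the exact ResAtt architecture.

```python
import torch
import torch.nn as nn

class ResAttBlock(nn.Module):
    """Illustrative fusion block: cross-attention from text tokens to image tokens,
    with a residual path that re-injects pooled visual features."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # text_tokens: (B, L_t, dim), image_tokens: (B, L_v, dim)
        attended, _ = self.cross_attn(query=text_tokens,
                                      key=image_tokens,
                                      value=image_tokens)
        # residual visual signal: pooled image features added to every text position,
        # so visual information persists even when the text sequence grows long
        visual_residual = self.visual_proj(image_tokens.mean(dim=1, keepdim=True))
        return self.norm(text_tokens + attended + visual_residual)
```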
4.3. Qualitative Analysis of Forestry Ecological Question-Answering
A human–machine question-answering instance using PaMA is shown in
Figure 10.
Response 1 in
Figure 10 illustrates the effective cross-modal alignment achieved after projecting visual and language features into a shared space. This alignment is demonstrated by the accurate interpretation of image-text pairs created for forestry ecology on the NACID dataset during the inference process. Furthermore, it indicates that the embeddings of feature points conform to the original similarity distribution of the dataset, thereby proving the effectiveness of the image encoder and text encoder in embedding clustering.
Responses 2–5 demonstrate the process by which the model extracts knowledge from the literature to respond to user queries. This result indicates that PaMA’s language generation module can accurately predict subsequent text based on preceding text and output fluent natural language. It also demonstrates the effectiveness of our proposed method, which utilizes factual information from images as centroids for cross-modal embedding. In terms of PaMA’s model structure, two designs are crucial to achieving this effectiveness. First, the ResAtt module introduces residual connections for visual features, ensuring that visual information continually contributes to the language module’s predictions during question-answering, rather than being overshadowed by increasingly long language sequences. Second, the invariant factual information contained within images keeps the clustering centroids of visual and language embeddings fixed within the shared semantic space. This avoids issues such as centroid drift caused by linguistic ambiguity, which in single-modality question-answering can lead to incoherent or hallucinated language generation due to high uncertainty [
65]. The effectiveness of our proposed cross-modal momentum encoder ensures the quality of feature extraction and embedding in the shared semantic space, enabling orthogonal information superposition across modalities and reducing uncertainty during the language generation process, thereby improving model performance.
Reference [
62] proposes a two-stage framework that first infers the rationale and then the answer, allowing the answer inference process to leverage cross-modal information provided by the rationale, resulting in more accurate answers. Reference [
46] presents a method that first uses an image captioning model to map images to text, then employs this text to drive a language generation model for inference. This approach leads to information loss from the image data and lacks a shared cross-modal representation space, resulting in insufficient or inaccurate mutual information representation and computation. From the combined analysis of these two references, it is evident that cross-modal question-answering is a reasoning process that integrates vision and language, grounded in cross-modal representation space. These representations enable accurate and effective mutual information computation from data of different modalities. The rationale guides the reasoning process and plays a crucial role in enhancing the accuracy of cross-modal question-answering, as demonstrated in Reference [
62] and our proposed method. Furthermore, the experimental comparison results shown in
Table 8 indicate that our proposed cross-modal embedding clustering method performs better, demonstrating the effectiveness of our approach.
Admittedly, our proposed method also has its limitations. From the perspective of the model, its computations and outputs are black-box in nature, lacking interpretability. This makes it challenging to rigorously validate the reliability of the model’s output during practical applications. From the perspective of training data, the public datasets used for model testing do not cover all scenarios in forestry ecology question-answering. Specifically, the datasets employed in the experiments, such as iNaturalist 2017, MSCOCO, and ScienceQA, do not share the same distribution as the data encountered in actual forestry ecology question-answering tasks. Even with the improvements made to the iNaturalist 2017 dataset by expert annotation as described in this study, it is still impossible to cover all real-world situations. Feeding such manually annotated data into AI models extends the subjective judgments of human experts, rather than creating universally applicable intelligent models. From the perspective of the tasks used for testing (such as the image classification and cross-modal retrieval tasks mentioned earlier), the predefined input–output relationships and evaluation standards constrain the ultimate behavior of the model, and these constraints and evaluations are difficult to align with the conditions encountered in practical applications. The primary limitation from the multi-task perspective is that different tasks exhibit varying degrees of data fitting; specifically, the model may overfit on some tasks while underfitting on others. If a particular task leads the model to learn noise rather than meaningful features, it can further degrade model convergence. Moreover, balancing multiple tasks relies on empirical adjustments of a limited set of hyperparameters, which further increases the model’s uncertainty. In summary, our research has limitations and biases in terms of interpretability, dataset distribution, and task design.