Figure 1.
Motivation for our work. Existing LVLMs for RS still face the following challenges: (1) Hallucinations in LVLMs for RS have consistently troubled researchers. Multi-agent debate for collaborative inference is often employed to mitigate hallucinations, but it demands significant computational power and places a substantial burden on hardware devices. In contrast, our model collaboration approach, which pairs a lightweight model with an LVLM, achieves efficient RS VQA. (2) There are significant scale differences between objects in RS images, such as small objects (e.g., cars) and large objects (e.g., bridges and buildings), yet existing LVLMs often overlook these scale variations. Extracting multi-scale features can enhance an LVLM’s perception of RS image details. (3) Because RS images contain many small objects, reducing the resolution to fit the visual encoder makes these small objects even harder to distinguish; a relatively small input resolution is therefore insufficient to capture the details present in RS images. Enlarging the visual encoder’s input resolution can benefit RS VQA.
Figure 2.
Framework of the proposed Co-LLaVA. Multi-scale visual features are generated by our multi-scale feature fusion (MFF) module. CoCa and the large language model (LLM) each leverage the visual and text features to produce an answer. If the visual understanding results from CoCa and the LLM differ, we use both answers as a prompt and re-input them into the LLM to obtain the final answer.
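To make the collaboration step in Figure 2 concrete, the sketch below gives one plausible reading of the answer-exchange loop: CoCa and the LLM answer independently, and only when their answers disagree are both answers fed back to the LLM as an additional prompt for a final decision. The names `run_coca`, `run_llava`, and the prompt wording are illustrative placeholders, not the paper’s actual implementation.

```python
# Minimal sketch of the model-collaboration (MC) inference flow in Figure 2.
# All function names and the prompt text are hypothetical placeholders.

def collaborative_vqa(image, question, run_coca, run_llava):
    """Return a final answer via lightweight-model/LVLM collaboration.

    run_coca(image, question) -> str : answer from the lightweight CoCa* model
    run_llava(image, prompt)  -> str : answer from the LVLM (LLaVA-style)
    """
    # Step 1: both models answer the question independently.
    answer_coca = run_coca(image, question)
    answer_llm = run_llava(image, question)

    # Step 2: if the two answers already agree, accept the LVLM's answer.
    if answer_coca.strip().lower() == answer_llm.strip().lower():
        return answer_llm

    # Step 3: otherwise, hand both answers back to the LVLM as a prompt
    # and let it rethink before committing to a final answer.
    collab_prompt = (
        f"{question}\n"
        f"One model answered: {answer_llm}\n"
        f"Another model answered: {answer_coca}\n"
        "Considering the image and both answers, give the final answer."
    )
    return run_llava(image, collab_prompt)
```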
Figure 3.
Typical image–question–response trios from Test Set 1 of RSVQA-HR. Given an image–question pair, CoCa* generates an answer. Co-LLaVA w/o MC (without the help of CoCa*) generates another answer from the same pair. When the two answers differ, Co-LLaVA (with the help of CoCa*) considers both and generates the final answer. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 4.
Typical image–question–response trios from Test Set 2 of RSVQA-HR. In the bottom-left corner of the figure, when CoCa* and Co-LLaVA w/o MC (without the help of CoCa*) both generate wrong answers, Co-LLaVA can still produce a third, correct response after rethinking over the exchanged answers. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 5.
Typical image–question–response trios from the test set of RSVQA-LR. When Co-LLaVA w/o MC (without the help of CoCa*) generates an answer different from CoCa*’s, Co-LLaVA can revise its response with the help of CoCa* in most cases. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 6.
Typical image–question–response trios from the test set of the subset of CRSVQA. With the help of CoCa*, Co-LLaVA generates the correct response in most cases. S.U., O.D., and R.R. stand for scene understanding, object detection, and relationship reasoning, respectively. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 7.
Response comparison with other methods (i.e., GeoChat and RS-LLaVA) on the test set of RSVQA-LR. (a) Both GeoChat and RS-LLaVA provide incorrect answers for the “Rural/Urban” and “Comparison” question types, while Co-LLaVA delivers the correct responses. (b) For the “Count” question type, GeoChat and RS-LLaVA also yield incorrect answers, whereas Co-LLaVA provides the correct answer. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Table 1.
Accuracy comparison with other methods on Test Set 1 of the RSVQA-HR dataset. The values after “±” are the standard deviations. The bold numbers indicate the highest accuracy among existing models.
| Method | # Parameters | Count | Presence | Comparison | Area | AA | OA |
|---|---|---|---|---|---|---|---|
| Lightweight Models: | | | | | | | |
| RSVQA [37] | 85.69M | 68.63 | 90.43 | 88.19 | 85.24 | 83.12 | 83.23 |
| EasyToHard [39] | 148.83M | 69.06 | 91.39 | 89.75 | 85.92 | 83.97 | 84.16 |
| Bi-Modal [38] | - | 69.80 | 92.03 | 91.83 | 86.27 | 84.98 | 85.30 |
| SHRNet [5] | 105.56M | 70.04 | 92.45 | 91.68 | 86.35 | 85.13 | 85.39 |
| MADNet [40] | - | 70.02 | 92.36 | 91.87 | 86.58 | 85.21 | 85.51 |
| Large Vision Language Models: | | | | | | | |
| LLaVA-v1.5 [13] | 7B | 43.34 | 63.97 | 64.69 | 1.27 | 43.32 | 38.19 |
| MiniGPT-v2 [58] | 7B | - | 64.80 | 59.17 | - | - | - |
| MiniGPT-4 [59] | 7B | - | 52.91 | 54.76 | - | - | - |
| Shikra [60] | 13B | - | 58.85 | 57.40 | - | - | - |
| RSGPT [42] | 13B | - | 91.86 | 92.15 | - | - | - |
| SkyEyeGPT [61] | 7B | - | 84.95 | 85.63 | - | - | - |
| Co-LLaVA | 7B | 70.12 ± 0.33 | 92.56 ± 0.24 | 92.20 ± 0.27 | 85.49 ± 0.44 | 85.09 ± 0.08 | 85.55 ± 0.18 |
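As a reading aid for the AA and OA columns (here and in Tables 2–4), the sketch below shows how the two aggregates are conventionally computed in the RSVQA literature: AA averages the per-question-type accuracies, while OA pools all questions together. This convention and the variable names are assumptions, since the excerpt does not define the columns explicitly.

```python
# Hedged sketch of the usual AA/OA aggregation; 'results' maps each question
# type to (number of correct answers, number of questions). This mirrors common
# RSVQA practice and is not code from the Co-LLaVA paper.

def average_and_overall_accuracy(results):
    per_type_acc = {t: correct / total for t, (correct, total) in results.items()}
    # AA: unweighted mean of the per-type accuracies.
    aa = sum(per_type_acc.values()) / len(per_type_acc)
    # OA: accuracy over all questions pooled, so frequent types weigh more.
    oa = sum(c for c, _ in results.values()) / sum(n for _, n in results.values())
    return per_type_acc, aa, oa

# Example with made-up counts:
# average_and_overall_accuracy({"Count": (700, 1000), "Presence": (920, 1000)})
```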
Table 2.
Accuracy comparison with other methods on Test Set 2 of the RSVQA-HR dataset. The values after “±” are the standard deviations. The bold numbers indicate the highest accuracy among existing models, and underlined numbers indicate the second-best accuracy among existing models.
| Method | # Parameters | Count | Presence | Comparison | Area | AA | OA |
|---|---|---|---|---|---|---|---|
| Lightweight Models: | | | | | | | |
| RSVQA [37] | 85.69M | 61.47 | 86.26 | 85.94 | 76.33 | 77.50 | 78.23 |
| EasyToHard [39] | 148.83M | 61.95 | 87.97 | 87.68 | 78.62 | 79.06 | 79.29 |
| Bi-Modal [38] | - | 63.06 | 89.37 | 89.62 | 80.12 | 80.54 | 81.23 |
| SHRNet [5] | 105.56M | 63.42 | 89.81 | 89.44 | 80.37 | 80.76 | 81.37 |
| MADNet [40] | - | 63.38 | 89.69 | 89.82 | 80.58 | 80.87 | 81.51 |
| Large Vision Language Models: | | | | | | | |
| LLaVA-v1.5 [13] | 7B | 42.14 | 68.15 | 65.72 | 0.64 | 44.19 | 39.64 |
| MiniGPT-v2 [58] | 7B | - | 66.34 | 59.40 | - | - | - |
| MiniGPT-4 [59] | 7B | - | 50.43 | 52.60 | - | - | - |
| Shikra [60] | 13B | - | 57.28 | 56.63 | - | - | - |
| GeoChat [16] | 7B | - | 58.45 | 83.19 | - | - | - |
| EarthGPT [41] | 13B | - | 62.77 | 79.53 | - | - | - |
| RSGPT [42] | 13B | - | 89.87 | 89.68 | - | - | - |
| SkyEyeGPT [61] | 7B | - | 83.50 | 80.28 | - | - | - |
| H2RSVLM [43] | 7B | - | 65.00 | 83.70 | - | - | - |
| SkySenseGPT [25] | 7B | - | 69.14 | 84.14 | - | - | - |
| Co-LLaVA | 7B | 63.51 ± 0.26 | 89.85 ± 0.16 | 90.73 ± 0.13 | 80.14 ± 0.35 | 81.06 ± 0.19 | 81.84 ± 0.28 |
Table 3.
Accuracy comparison with other methods on the test set of the RSVQA-LR dataset. The values after “±” are the standard deviations. The bold numbers indicate the highest accuracy among existing models, and underlined numbers indicate the second-best accuracy among existing models.
| Method | # Parameters | Count | Presence | Comparison | Rural/Urban | AA | OA |
|---|---|---|---|---|---|---|---|
| Lightweight Models: | | | | | | | |
| RSVQA [37] | 85.69M | 67.01 | 87.46 | 81.50 | 90.00 | 81.49 | 79.08 |
| EasyToHard [39] | 148.83M | 69.22 | 90.66 | 87.49 | 91.67 | 84.76 | 83.09 |
| Bi-Modal [38] | - | 72.22 | 91.06 | 91.16 | 92.66 | 86.78 | 85.56 |
| SHRNet [5] | 105.56M | 73.87 | 91.03 | 90.48 | 94.00 | 87.34 | 85.85 |
| MADNet [40] | - | 72.85 | 90.96 | 91.68 | 95.00 | 87.62 | 85.97 |
| Large Vision Language Models: | | | | | | | |
| LLaVA-v1.5 [13] | 7B | 26.13 | 54.45 | 65.72 | 59.00 | 51.32 | 50.66 |
| MiniGPT-v2 [58] | 7B | - | 49.85 | 63.09 | 59.00 | - | - |
| MiniGPT-4 [59] | 7B | - | 43.86 | 57.55 | 62.00 | - | - |
| Shikra [60] | 13B | - | 46.47 | 60.31 | 63.62 | - | - |
| GeoChat [16] | 7B | - | 91.09 | 90.33 | 94.00 | - | - |
| RSGPT [42] | 13B | - | 91.17 | 91.70 | 94.00 | - | - |
| SkyEyeGPT [61] | 7B | - | 88.63 | 75.00 | 88.93 | - | - |
| LHRS-Bot [26] | 7B | - | 88.51 | 90.00 | 89.07 | - | - |
| H2RSVLM [43] | 7B | - | 89.58 | 89.79 | 88.00 | - | - |
| SkySenseGPT [25] | 7B | - | 91.07 | 92.00 | 95.00 | - | - |
| RS-LLaVA [27] | 7B | 74.38 | 92.80 | 91.33 | 94.00 | 88.13 | 86.95 |
| Co-LLaVA | 7B | 73.53 ± 0.32 | 91.44 ± 0.22 | 92.73 ± 0.30 | 98.00 ± 1.00 | 88.92 ± 0.12 | 86.75 ± 0.23 |
Table 4.
Accuracy comparison with LLaVA-v1.5 and GeoChat on the test set of the subset of the CRSVQA dataset. The results for LLaVA-v1.5 and GeoChat were generated using open-source model weights. S.U., O.D., and R.R. stand for scene understanding, object detection, and relationship reasoning, respectively. The values after “±” are the standard deviations.
| Method | # Parameters | S.U. | O.D. | R.R. | AA | OA |
|---|---|---|---|---|---|---|
| LLaVA-v1.5 [13] | 7B | 21.92 | 36.84 | 22.37 | 27.04 | 25.91 |
| GeoChat [16] | 7B | 20.55 | 28.95 | 18.42 | 22.64 | 21.59 |
| Co-LLaVA | 7B | 79.45 ± 0.22 | 61.84 ± 0.44 | 78.29 ± 0.52 | 73.19 ± 0.39 | 74.42 ± 0.19 |
Table 5.
Test accuracy (%) of each variant of our method Co-LLaVA on four test sets (Test Set 1 of RSVQA-HR, Test Set 2 of RSVQA-HR, test set of RSVQA-LR, and test set of the subset of CRSVQA). S.U., O.D., and R.R. stand for scene understanding, object detection, and relationship reasoning, respectively.
RSVQA-HR (Test Set 1):

| Methods | Count | Presence | Comparison | Area |
|---|---|---|---|---|
| (w/o) MFF and MC | 69.49 | 92.26 | 91.94 | 85.23 |
| (w/o) MC | 69.78 | 92.40 | 92.05 | 85.44 |
| Co-LLaVA | 70.12 | 92.56 | 92.20 | 85.49 |

RSVQA-HR (Test Set 2):

| Methods | Count | Presence | Comparison | Area |
|---|---|---|---|---|
| (w/o) MFF and MC | 63.14 | 89.35 | 90.66 | 79.95 |
| (w/o) MC | 63.42 | 89.51 | 90.69 | 80.01 |
| Co-LLaVA | 63.51 | 89.85 | 90.73 | 80.14 |

RSVQA-LR (Test Set):

| Methods | Count | Presence | Comparison | Rural/Urban |
|---|---|---|---|---|
| (w/o) MFF and MC | 72.54 | 90.80 | 91.46 | 94.00 |
| (w/o) MC | 72.66 | 91.02 | 92.61 | 96.00 |
| Co-LLaVA | 73.53 | 91.44 | 92.73 | 98.00 |

Subset of CRSVQA (Test Set):

| Methods | S.U. | O.D. | R.R. |
|---|---|---|---|
| (w/o) MFF and MC | 77.65 | 59.87 | 73.68 |
| (w/o) MC | 78.08 | 60.47 | 74.34 |
| Co-LLaVA | 79.45 | 61.84 | 78.29 |