Article

A Framework for Agricultural Intelligent Analysis Based on a Visual Language Large Model

1 School of Software Technology, Zhejiang University, Ningbo 315048, China
2 Innovation Centre for Information, Binjiang Institute of Zhejiang University, Hangzhou 310053, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8350; https://doi.org/10.3390/app14188350
Submission received: 9 July 2024 / Revised: 12 September 2024 / Accepted: 13 September 2024 / Published: 17 September 2024

Abstract

Smart agriculture has become an inevitable trend in the development of modern agriculture, especially promoted by the continuous progress of large language models such as the chat generative pre-trained transformer (ChatGPT) and the general language model (ChatGLM). Although these large models perform well in general knowledge learning, they still have limitations and make errors when facing agricultural professional knowledge such as crop disease identification and growth stage judgment. Agricultural data involve images, texts, and other modalities, which play an important role in agricultural production and management. In order to better learn the characteristics of different modal data in agriculture, realize cross-modal data fusion, and thus understand complex application scenarios, we propose AgriVLM, a framework that fine-tunes a visual language model on a large amount of agricultural data to analyze agricultural data. It can fuse multimodal data and provide more comprehensive agricultural decision support. Specifically, it utilizes Q-Former as a bridge between an image encoder and a language model to achieve cross-modal fusion of agricultural image and text data. Then, we apply Low-Rank Adaptation (LoRA) to fine-tune the language model and align agricultural image features with the pre-trained language model. The experimental results show that AgriVLM performs well in crop disease recognition and growth stage recognition, with recognition accuracy exceeding 90%, demonstrating its capability to analyze different modalities of agricultural data.

1. Introduction

Agriculture is the foundation of economic development and plays an important role in social stability. With the continuous development of modern agricultural technology, agricultural production generates a large amount of agricultural image and text data, which contain a variety of information and are significant for improving the efficiency of agricultural production and promoting sustainable development. Traditional methods of agricultural data analysis are inadequate for such complex data and cannot efficiently and accurately extract valuable information; hence, smart agriculture has become the trend of future agriculture [1,2]. Therefore, how to quickly and accurately analyze data in different modalities, such as agricultural images and texts, and mine useful information to assist people in making decisions is one of the current hot topics in agricultural informatization research.
Recently, with the development of artificial intelligence technology, deep learning has been widely used to analyze complex data in agriculture, providing powerful technical support for the intelligence of agricultural production. Among these techniques, large language models (LLMs) [3,4,5] have achieved outstanding results in the field of natural language processing, providing a new viewpoint and method for analyzing agricultural text data. LLMs are usually built on the transformer [6] architecture and learn linguistic rules and features from large text corpora. They have shown a high level of capability in tasks such as natural language understanding and text generation. For example, the GPT series [7,8,9], ChatGLM [10], and the large language model Meta AI (LLaMA) [11] perform excellently in inference and analysis tasks and have been applied in various fields such as medicine [12,13], education [14], finance [15,16], and law [17,18]. However, large language models mainly focus on analyzing text data and fall short when analyzing visual data, i.e., images. In contrast, visual language large models are able to deeply analyze both image and text data using their powerful cross-modal understanding ability [19], which can be an effective way to address this shortcoming. They offer a new direction for the analysis of different modal data in the field of agriculture and provide a more comprehensive solution for the understanding and application of agricultural knowledge.
Most current visual language large models, such as ViLT [20], Qwen-VL [21], and VisCPM [22], learn from a huge amount of image–text pair data: they analyze and recognize target objects in images, extract and generate important information in text, and simultaneously learn the correlation and matching between images and text. However, although these models have shown excellent capabilities in general domains, when they are applied to agricultural knowledge understanding, especially essential tasks such as crop disease recognition and growth stage judgment, their performance is often unsatisfactory because they lack an in-depth understanding of specific agricultural knowledge, which leads to errors and limitations.
In order to reduce the impact of these problems, we propose AgriVLM, a method that extracts the features of agricultural image and text data through a visual language large model. AgriVLM uses a Vision Transformer as the image encoder to extract image features and then employs Q-Former to connect the image encoder and the pre-trained language model. Q-Former receives the text data and the output of the image encoder, combines them with query vectors to learn visual features related to the text, and exports a visual representation that the LLM can understand. Moreover, we apply Low-Rank Adaptation (LoRA) to fine-tune the visual language large model on a large amount of preprocessed agricultural data to achieve feature alignment. The experimental results show that the model can analyze different modal data in agriculture and achieve precise recognition of crop diseases and growth stages, with the accuracy of all tasks exceeding 90%. At the same time, it shows excellent ability in handling agricultural knowledge conversations.

2. Datasets and Methods

In this section, we introduce the experimental data used to conduct the experiments and provide a detailed description of our proposed method. Finally, we also introduce the evaluation metrics of the model.

2.1. Datasets

By analyzing the growth situation of crops, the growth trend of crops can be determined to the greatest extent, providing timely and reliable growth information for crop production management personnel and supporting accurate estimation of crop yields [23]. At the same time, crop diseases are numerous and often occur at high density, which can easily cause a large reduction in crop yields and seriously restrict agricultural production [24]. Therefore, the precise recognition of crop growth stages and crop diseases plays an important role in improving crop yields.
In order to verify the effectiveness of the AgriVLM model on crop growth stages, disease recognition, and poultry recognition, this paper collected five datasets to conduct experiments on the model.
The first is the cabbage growth stage dataset, which includes four growth stages, namely the germination stage, seedling stage, rosette stage, and dormancy stage, with a total of 400 images. The second is the strawberry growth stage dataset, which contains 787 images divided into four growth stages: the flourishing stage, flowering stage, fruiting stage, and maturity stage. The third is the chili pepper disease dataset, which includes both healthy and bacterial spot data and contains 2475 images. The fourth is the tomato disease dataset, which includes healthy samples and four kinds of disease data: septoria leaf spot, late blight, target spot, and yellow leaf curl virus, for a total of 5500 images across five categories. The last is the poultry recognition dataset, which contains 1213 images of chickens and pigs. We conduct experiments on these five datasets.
Finally, in order to realize the agricultural data analysis service process, the model acquires agricultural knowledge by learning from a large amount of agricultural conversation data. The agricultural conversation data come from the “China Agricultural Technology Extension” APP, which contains a large number of answers from agricultural experts and technicians to questions raised by users. Here, we collected a total of 15,000 pieces of conversation data from the APP to conduct the experiments.

2.2. Methodology of Research

The overall framework of AgriVLM is shown in Figure 1. AgriVLM uses a large amount of preprocessed agricultural image and text data to fine-tune a visual language large model for intelligent analysis; that is, AgriVLM consists of three important components: data preprocessing, the visual language large model, and fine-tuning. The visual language large model uses a Vision Transformer (ViT) [25] to extract agricultural image features. Then, a set of learnable query vectors in Q-Former [26] extracts the visual features most relevant to the text from the visual model to achieve image–text feature alignment. Finally, the output of Q-Former is connected to a large language model to achieve intelligent analysis of agricultural data. Because the number of parameters in the visual language large model is huge, it is impractical to optimize it from scratch. Therefore, AgriVLM uses LoRA [27] to fine-tune the visual language large model and further learn agricultural image and text features.
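As a concrete illustration of this pipeline, the following minimal PyTorch sketch shows how visual tokens produced by a frozen image encoder and a Q-Former-style bridge can be projected and prepended to the language model's input embeddings. The module names, dimensions, and the `inputs_embeds` interface are assumptions for illustration, not the released AgriVLM implementation.

```python
import torch
import torch.nn as nn

class AgriVLMSketch(nn.Module):
    """Illustrative forward pass: frozen ViT features -> Q-Former with learnable
    queries -> linear projection -> prepend to the LLM's text embeddings."""

    def __init__(self, vit, q_former, llm, num_queries=32, q_dim=768, llm_dim=4096):
        super().__init__()
        self.vit = vit              # frozen image encoder (e.g., a ViT)
        self.q_former = q_former    # lightweight transformer bridge
        self.llm = llm              # frozen language model (ChatGLM-style)
        self.queries = nn.Parameter(torch.zeros(1, num_queries, q_dim))
        self.proj = nn.Linear(q_dim, llm_dim)  # map Q-Former output to the LLM embedding size

    def forward(self, image, text_embeds):
        img_feats = self.vit(image)                        # (B, N_patches, q_dim)
        queries = self.queries.expand(image.size(0), -1, -1)
        vis_tokens = self.q_former(queries, img_feats)     # text-relevant visual tokens
        vis_embeds = self.proj(vis_tokens)                 # align with LLM embeddings
        # concatenate visual tokens in front of the input text embeddings
        return self.llm(inputs_embeds=torch.cat([vis_embeds, text_embeds], dim=1))
```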

2.2.1. Data Preprocessing

Dataset division is a vital step in deep learning and machine learning that helps in training and optimizing the model, evaluating the performance of the model, and understanding the generalization ability and robustness of the model. For the crop growth stage, disease recognition, and poultry recognition tasks, the data division of the training and test sets is shown in Table 1.
Suitable prompts can provide context, constraints, and guidance, enabling the large model to better understand the question and generate more accurate responses that match human expectations. Here, AgriVLM achieves accurate recognition of each task through the following manually designed prompt. For the different recognition tasks, the prompt is designed as “Please tell me which type of {dataset task category} {categories} this image belongs to”, and the answer of the model is the name of the corresponding category. In particular, for the cabbage growth stage recognition task, the prompt is designed as “Please tell me which of the cabbage growth stages germination, seedling, rosette, and dormancy this image belongs to?”. AgriVLM answers with the name of one of the four stages, e.g., “germination”. The prompts for the other tasks are designed in the same way (a small helper illustrating this template is sketched below). In this way, the type of task that the model is required to perform is clearly indicated, and the model is effectively guided to generate output that conforms to a specific format. This not only ensures clarity of the task but also improves the standardization of the output.
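The short Python helper below illustrates the prompt template described above; the function name and exact formatting are illustrative assumptions, and the generated string follows the paper's cabbage example only approximately.

```python
def build_prompt(task_category: str, categories: list[str]) -> str:
    """Build a recognition prompt following the template
    'Please tell me which of the {task category} {categories} this image belongs to?'."""
    options = ", ".join(categories[:-1]) + ", and " + categories[-1]
    return f"Please tell me which of the {task_category} {options} this image belongs to?"

# Example: the cabbage growth stage prompt used in the paper
print(build_prompt("cabbage growth stages",
                   ["germination", "seedling", "rosette", "dormancy"]))
# -> Please tell me which of the cabbage growth stages germination, seedling,
#    rosette, and dormancy this image belongs to?
```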
Low-quality conversation data may increase the training time and computational requirements of the model and reduce its accuracy, whereas high-quality conversation data play an important role in improving model performance, the interaction experience, and the model's generalization ability. Therefore, in order to improve the quality of the agricultural conversation data, we removed the meaningless and repeated conversations, finally obtaining 5593 usable records; a simple filtering sketch is given below. The definition of meaningless data is shown in Table 2.
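A minimal sketch of this cleaning step is shown below. The duplicate check and the example patterns marking a reply as meaningless are assumptions modeled on Table 2; the actual cleaning rules applied to the 15,000 conversations are not specified in that much detail.

```python
import re

# Example patterns for meaningless replies, loosely based on Table 2 (assumed, not exhaustive).
MEANINGLESS_PATTERNS = [
    r"Learn from all agricultural technical experts",
    r"Thanks for sharing",
]

def clean_conversations(pairs):
    """Drop exact duplicates and question-answer pairs whose answer is meaningless."""
    seen, cleaned = set(), []
    for question, answer in pairs:
        key = (question.strip(), answer.strip())
        if key in seen:
            continue  # repeated conversation
        if any(re.search(p, answer) for p in MEANINGLESS_PATTERNS):
            continue  # off-topic or meaningless reply
        seen.add(key)
        cleaned.append(key)
    return cleaned
```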

2.2.2. Visual Language Large Model

The visual language model used in AgriVLM mainly consists of three parts: an image encoder, a pre-trained large language model, and Q-Former [26].
The image encoder used by AgriVLM to extract information from agricultural images is ViT [25]. ViT first divides the input image into patches of the same size and then maps each patch to a fixed-dimensional feature space through a linear projection layer, ensuring that each patch can be effectively understood and processed by the model. At the same time, to ensure that the model can make full use of the spatial relationships between patches, ViT uses positional encoding to provide the model with the position of each patch in the original image, which enables the model to understand the structure of the image more accurately. After the above processing, each patch is converted into a feature vector. Finally, these feature vectors are fed into the transformer encoder module, which learns the image features by capturing the global dependencies of the image using the self-attention mechanism. A minimal patch embedding sketch is given below.
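The sketch below shows the patch-splitting, linear projection, and positional encoding steps in PyTorch. The image size, patch size, and embedding dimension are common ViT defaults used here only for illustration; they are not necessarily the values used in AgriVLM.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal ViT-style patch embedding: split the image into fixed-size patches,
    project each patch linearly, and add a learnable positional encoding."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a convolution with stride = patch size both cuts and linearly projects the patches
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, dim)
        return x + self.pos_embed                # position information for each patch

# The resulting patch vectors are then fed into a standard transformer encoder.
```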
The large language model employs ChatGLM [10] to understand agriculture-related knowledge and further analyze agricultural text features. ChatGLM has 6.2 billion parameters and is trained on approximately 1 trillion bilingual corpora, including Chinese and English. With the support of technologies such as supervised fine-tuning, self-feedback, and reinforcement learning, ChatGLM performs excellently in bilingual question-and-answering in Chinese and English, and can understand user intent and generate fluent, natural responses. The most important component of ChatGLM is the GLM pre-training framework based on an autoregressive blank filling task [28].
The GLM framework trains the model by autoregressively predicting the blank parts of the text to learn the inner rules and structure of the language. Specifically, for a given input text x = [x1, x2, …, xn], GLM randomly selects multiple consecutive text spans {s1, s2, …, sm} for masking, where each span si contains consecutive tokens of length l in x and is replaced by a special token [Mask], forming the masked text. GLM predicts the masked tokens si through autoregression. When predicting the tokens in the mask, the model can refer to the information of previous spans. Therefore, GLM randomly shuffles the order of each text span to fully capture the interdependencies between the individual spans.
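To make the blank-filling objective concrete, the toy function below masks a few spans of a token sequence and shuffles them for autoregressive prediction, as described above. Special-token handling, span sampling, and position encoding are heavily simplified; this is an illustration of the idea, not GLM's actual preprocessing.

```python
import random

MASK = "[Mask]"

def glm_blank_infilling(tokens, spans):
    """Toy span masking: each selected span is replaced by a single [Mask] token,
    and the spans are shuffled before being predicted autoregressively."""
    spans = sorted(spans)                        # (start, length), assumed non-overlapping
    masked, targets, cursor = [], [], 0
    for start, length in spans:
        masked.extend(tokens[cursor:start])      # keep the unmasked text
        masked.append(MASK)                      # one [Mask] stands in for the whole span
        targets.append(tokens[start:start + length])
        cursor = start + length
    masked.extend(tokens[cursor:])
    random.shuffle(targets)                      # shuffled span order, as described above
    return masked, targets

# Example: glm_blank_infilling(["x1", "x2", "x3", "x4", "x5", "x6"], [(1, 2), (4, 1)])
# might return (["x1", "[Mask]", "x4", "[Mask]", "x6"], [["x5"], ["x2", "x3"]])
```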
AgriVLM uses Q-Former as a bridge connecting the image encoder and the language model [26] to achieve alignment of image and text features. Q-Former is a lightweight transformer with a two-stage pre-training strategy. The first stage is visual language representation learning, which enables Q-Former to learn the visual representation most relevant to the text and achieve semantic fusion of cross-modal representations. Q-Former consists of two modules: an image transformer and a text transformer. The image transformer interacts with the frozen image encoder, while the text transformer plays the role of both text encoder and text decoder. Specifically, Q-Former uses three different training tasks to learn features, namely image–text contrastive learning (ITC), image–text matching (ITM), and image-grounded text generation (ITG). The main idea of ITC is consistent with Contrastive Language–Image Pre-training (CLIP) [29], which uses contrastive learning to learn the similarity of image–text pairs; the training objective is to make the similarity of positive pairs greater than that of negative pairs (a sketch of this objective follows this paragraph). ITM is a binary classification task used to predict whether an image–text pair matches. The objective of ITG is to train Q-Former to generate text for a given image, which requires the model to understand the visual information in the image and convert this information into natural language. The second stage is visual-to-language generative learning. The Q-Former output is mapped through a linear layer into a vector with the same dimension as the word embedding of the large language model and is concatenated in front of the input text embedding of the large language model. In this way, the output of Q-Former is connected to the frozen large language model, and the visual representation learned by Q-Former can be interpreted by the large language model.
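As a concrete example of the ITC objective mentioned above, the sketch below computes a symmetric contrastive loss between query outputs and text features. Taking the maximum similarity over the learned queries follows the BLIP-2 formulation; the tensor shapes and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def itc_loss(query_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss: matched pairs (the diagonal) should be
    more similar than mismatched pairs. query_feats: (B, num_queries, D) from the
    image transformer; text_feats: (B, D) from the text transformer."""
    q = F.normalize(query_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    # per image-text pair, keep the highest similarity over the learned queries
    sim = torch.einsum("bqd,cd->bcq", q, t).max(dim=-1).values / temperature   # (B, B)
    labels = torch.arange(sim.size(0), device=sim.device)
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
```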

2.2.3. LoRA Parameter Fine-Tuning

LLMs usually contain hundreds of millions or even billions of parameters. Training an LLM from scratch requires a large amount of computing resources, which makes the computational cost extremely high. Meanwhile, the huge number of parameters may lead to overfitting. In order to solve these problems, AgriVLM uses LoRA [27] to fine-tune the model parameters and learn the characteristics of agricultural data. LoRA fine-tuning allows the model to capture the characteristics of a specific task while maintaining its pre-trained knowledge, allowing it to quickly adapt to new tasks. LoRA effectively avoids the risk of overfitting while lowering the computational cost, providing an efficient and reliable solution for processing agricultural data.
LoRA effectively adjusts the parameters of the model by adding two trainable low-rank decomposition matrices to the model parameter space. Its overall workflow is shown in Figure 2. In the fine-tuning process, LoRA is mainly applied to the transformer layers of the language model; specifically, a pair of trainable low-rank matrices is added to each transformer layer. Assume that the parameters of the pre-trained model need to be updated during fine-tuning, which is expressed as $W_{pm} + W_{lora}$, where $W_{pm} \in \mathbb{R}^{d \times k}$ are the parameters of the pre-trained model and $W_{lora}$ are the parameters to be updated. The parameters are updated as shown in Equation (1):

$$W_{pm} + W_{lora} = W_{pm} + B_{lora} A_{lora} \tag{1}$$
LoRA freezes the $W_{pm}$ parameters during training and only updates the parameters in $A_{lora}$ and $B_{lora}$, where $B_{lora} \in \mathbb{R}^{d \times r}$, $A_{lora} \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Matrix $A_{lora}$ is initialized using a random Gaussian distribution in order to keep the expressive capability of the model; random Gaussian initialization helps the model learn effective feature representations quickly in the early stage of training, which can improve performance. Matrix $B_{lora}$ is initialized as a zero matrix in order to reduce the impact on the original model parameters while ensuring the stability of the model during fine-tuning. Finally, $r$ is a hyperparameter, the rank of the matrices $A_{lora}$ and $B_{lora}$; since $r \ll \min(d, k)$, the number of parameters updated during fine-tuning is significantly reduced, saving computational resources.
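The following sketch implements Equation (1) as a wrapper around a frozen linear layer; the scaling factor `alpha` and the initialization constants are illustrative assumptions rather than AgriVLM's exact settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Equation (1) as a layer: the pre-trained weight W_pm stays frozen, and only the
    low-rank factors A_lora (Gaussian init) and B_lora (zero init) are trained."""

    def __init__(self, base: nn.Linear, r: int = 10, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # freeze W_pm
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A_lora: random Gaussian
        self.B = nn.Parameter(torch.zeros(d, r))          # B_lora: zeros, so the update starts at 0
        self.alpha = alpha

    def forward(self, x):
        # x W_pm^T plus the low-rank update x (B_lora A_lora)^T
        return self.base(x) + self.alpha * (x @ self.A.t() @ self.B.t())
```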
When LoRA fine-tunes a transformer layer, it is mainly applied to the weight matrices used to learn the Q, K, and V vector representations in the multi-head self-attention module. To be specific, for the training input data $x$, Q, K, and V are calculated as shown in Equation (2):

$$\begin{aligned}
Q &= x\left(W_{pm}^{q} + W_{lora}^{q}\right) = xW_{pm}^{q} + xW_{lora}^{q} \\
K &= x\left(W_{pm}^{k} + W_{lora}^{k}\right) = xW_{pm}^{k} + xW_{lora}^{k} \\
V &= x\left(W_{pm}^{v} + W_{lora}^{v}\right) = xW_{pm}^{v} + xW_{lora}^{v}
\end{aligned} \tag{2}$$
where $W_{pm}^{*}$ represents a pre-trained weight matrix and $W_{lora}^{*}$ a trainable weight matrix. After calculating Q, K, and V, the softmax operation is applied, and the computation for matrices Q and K with LoRA layers can be expressed as in Equation (3):

$$\mathrm{Softmax}(Q, K) = \mathrm{Softmax}\!\left(xW_{pm}^{q}\left(W_{pm}^{k}\right)^{T}x^{T} + xW_{pm}^{q}\left(W_{lora}^{k}\right)^{T}x^{T} + xW_{lora}^{q}\left(W_{pm}^{k}\right)^{T}x^{T} + xW_{lora}^{q}\left(W_{lora}^{k}\right)^{T}x^{T}\right) \tag{3}$$
Finally, the calculation of self-attention is shown in Equation (4):
$$\mathrm{Attention} = \mathrm{Softmax}(Q, K)\,V = \mathrm{Softmax}(Q, K)\,xW_{pm}^{v} + \mathrm{Softmax}(Q, K)\,xW_{lora}^{v} \tag{4}$$
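Building on the `LoRALinear` sketch above, the helper below wraps the query, key, and value projections of a self-attention module, mirroring Equation (2). The attribute names `q_proj`, `k_proj`, and `v_proj` are placeholders; actual layer names vary between model implementations.

```python
import torch.nn as nn

def add_lora_to_attention(attn: nn.Module, r: int = 10) -> nn.Module:
    """Wrap the Q, K, and V projections of a self-attention module with LoRALinear
    (defined above), so that they compute x(W_pm + W_lora) as in Equation (2)."""
    for name in ("q_proj", "k_proj", "v_proj"):       # placeholder attribute names
        setattr(attn, name, LoRALinear(getattr(attn, name), r=r))
    return attn
```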

2.3. Evaluation Metrics

We utilize four key indicators, namely accuracy, precision, recall, and F1-score, to comprehensively evaluate the performance of the model in classification prediction tasks. Accuracy is the proportion of correct predictions among all samples, representing the overall correctness of the model. Precision is the proportion of actual positive samples among those predicted as positive. Recall is the ratio of correctly predicted positive samples to all samples that should have been predicted as positive; it focuses on the ability of the model to recognize positive samples. The F1-score combines precision and recall, providing a more comprehensive and balanced view, and therefore reflects the overall performance of the model on the classification prediction task more effectively. The corresponding calculation of each metric is shown in Equations (5)–(8):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$
where TP represents positive samples predicted to be positive, TN represents negative samples predicted to be negative, FP represents negative samples predicted to be positive, and FN represents positive samples predicted to be negative.
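A direct translation of Equations (5)–(8) into Python is given below; the zero-division guards are an added convenience for edge cases and are not part of the original formulas.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute accuracy, precision, recall, and F1-score from raw counts,
    following Equations (5)-(8)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard against division by zero
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Example: classification_metrics(tp=90, tn=85, fp=10, fn=15) -> (0.875, 0.9, ~0.857, ~0.878)
```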

3. Experiments and Analysis of Results

We conducted experiments and analysis on five different sub-tasks and compared them with two existing classic models to illustrate the effectiveness of our model. At the same time, we also studied factors that may affect the experimental results, including fine-tuning methods, prompt designs, and different languages.

3.1. Experimental Setup

The AgriVLM model is implemented with Python 3.10 and PyTorch 2.1.0 and is deployed on a Linux server. The GPU used is an A800 with 80 GB of memory, and the corresponding CUDA version is 12.0. In addition, the hyperparameter settings for the different recognition tasks are shown in Table 3.

3.2. The Results of Five Recognition Tasks

AgriVLM evaluates the prediction effectiveness of the model on different tasks through four metrics: accuracy, precision, recall, and F1-score. The performance of the model on different tasks such as crop growth stage, disease recognition, and poultry recognition are shown in Table 4.
As can be seen in Table 4, the recognition accuracy of AgriVLM in different agricultural application scenarios has reached more than 90%. The accuracy of the five tasks is 0.925, 0.9739, 0.9756, 0.934, and 0.9952, respectively. The AgriVLM model has an excellent performance in recognition tasks in different application scenarios in agriculture, demonstrating extremely high reliability.
In order to show the ability of the model to recognize different categories more intuitively, the confusion matrix of the results of each task is plotted, which is shown in Figure 3. In the confusion matrix, the rows and columns represent the predicted and true labels, respectively. The diagonal elements of the confusion matrix indicate the number of samples in each category correctly recognized by the model, and the higher the value of the diagonal elements, the higher the recognition accuracy of the model for the corresponding category. On the contrary, the non-diagonal elements indicate the number of samples in one category that the model incorrectly recognized as other categories [30]. Therefore, it can be observed from the confusion matrices of the five tasks in Figure 3 that the model performs quite well and achieves an accurate recognition of different categories, illustrating the effectiveness of the model.
In addition, to validate the advantages of the model, we also compared it with classic visual language models such as the Large Language and Vision Assistant (LLaVA) [31] and Flamingo [32]. The experimental results are shown in Table 5. From this table, it can be seen that our method achieves higher prediction accuracy than the existing methods. More specifically, in the poultry recognition task, where the differences between categories are obvious, the gap between the three methods is small, but in the tasks with less obvious category differences, especially tomato disease recognition, our method is significantly better than the other methods.

3.3. The Results of Different Fine-Tuning Methods

AgriVLM realizes the recognition of crop growth stages, pests and diseases, and other tasks by fine-tuning a visual language large model on a large amount of agricultural data. In order to verify the effect of different fine-tuning methods on the experimental results, we also used the P-tuning method [33] for fine-tuning and compared it with LoRA. The key idea of P-tuning is to add some continuous embeddings to the embedding layer of the model and train only this part. This method greatly reduces the complexity and cost of fine-tuning because only the added embedding parameters need to be trained, while the original model parameters remain unchanged. The accuracy of the different fine-tuning methods on different tasks is shown in Figure 4.
The results shown in this figure indicate that LoRA generally outperforms P-tuning. The two fine-tuning methods achieve the same result in poultry recognition, which is due to the relatively obvious differences between the two categories that make them easier for the model to recognize. In chili pepper disease recognition, the difference between the two methods is also small, only 0.82%. These results illustrate that for simple tasks, such as binary classification or data with obvious differences, there is little disparity in the learning effect between the two fine-tuning methods. However, in the multi-class tasks of cabbage growth stage, strawberry growth stage, and tomato disease recognition, the accuracy gap between the two fine-tuning methods keeps increasing, reaching 1.25%, 4.35%, and 5%, respectively, which shows that the stability and adaptability of LoRA are better than those of P-tuning.
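For contrast with the LoRA sketch above, the snippet below illustrates the P-tuning idea described in this subsection: a small set of trainable continuous prompt embeddings is prepended to the (frozen) input embeddings, and only these embeddings are trained. The number of virtual tokens and the initialization scale are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrefixPrompt(nn.Module):
    """P-tuning-style continuous prompt: trainable embeddings are prepended to the
    frozen input embeddings, and only these new embeddings are updated."""

    def __init__(self, num_virtual_tokens: int = 16, embed_dim: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(1, num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):                  # input_embeds: (B, T, embed_dim)
        prompt = self.prompt.expand(input_embeds.size(0), -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```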

3.4. The Impact of Prompts with Different Granularity on the Experimental Results

A proper prompt can guide the large language model to learn more generalized knowledge representations and features, thus improving its ability on different prediction tasks. Meanwhile, a prompt can help the large language model to better understand and analyze the input data by including contextual information related to the prediction task, thus improving the accuracy and reliability of the model's predictions. Here, taking the cabbage growth stage recognition task as an example, we designed three different prompts, shown in Table 6; the prompts for the other tasks are designed analogously. All three prompts give the model a clear classification goal. The difference is that Prompt1 is closer to the way people naturally express themselves than Prompt2, while Prompt3 further enhances the contextual interaction by enriching the information in the answer, providing more fine-grained information. The impact of the different prompt designs on the results is shown in Figure 5.
As can be seen in Figure 5, on the whole, the results of Prompt3 are better than those of Prompt1 and Prompt2, indicating that rich contextual information can improve the accuracy of the model's predictions, while Prompt1 and Prompt2 each have their own strengths and weaknesses on different tasks. However, when Prompt2 outperforms Prompt1, the gap between the two is small, whereas when Prompt1 outperforms Prompt2, the gap is significant, illustrating that a prompt that both clarifies the task goal and conforms to natural human expression is more stable.

3.5. The Results of Different Languages

The language model used by AgriVLM is ChatGLM, which is trained on about 1 trillion tokens of Chinese–English bilingual corpora and supports bilingual question answering in Chinese and English. In order to verify the behavior of the model in different languages, we translated the above Prompt1 into English and then fine-tuned the model; the experimental results are shown in Figure 6.
As can be seen from this figure, the results differ between languages for the same prompt. This may be because the number of English image–text pairs in the data used to pre-train the visual language large model is much higher than the number of Chinese pairs: the Chinese pre-training data include 30 million high-quality image–text pairs from the CogView dataset, while the English data consist of 300 million filtered image–text pairs. This difference in quantity gives the model stronger learning ability in English than in Chinese, so the accuracy with English prompts is, on the whole, better than with Chinese prompts.

3.6. Case Study in Agriculture

For agricultural workers, how to acquire relevant agricultural knowledge quickly and easily has become a vital problem to be solved. Here, the visual language large model is fine-tuned on agricultural conversation data to achieve tasks such as growth stage analysis and agricultural knowledge question answering. The results of the two tasks are shown in Table 7 and Table 8.
From the above two tables, we can see that AgriVLM can identify the growth stages of common fruits such as strawberries, apples, and peaches, helping farmers better understand the growth conditions and needs of the fruits and make more scientific and proper planting plans, including choosing the right sowing time, irrigation schemes, and so on.
At the same time, AgriVLM can also answer questions related to crop insect pests and diseases, fertilization, and poultry disease control. This helps people make reasonable fertilization plans to improve soil fertility and thus increase crop yield, and it helps control the occurrence of poultry diseases, improving the survival rate and production efficiency of poultry and preventing the spread of poultry diseases to human beings.

4. Discussion

In this study, we designed AgriVLM, which uses ViT to extract image features, then, through Q-Former, connects the image encoder and language model, and finally, through LoRA, performs fine-tuning to achieve an intelligent analysis of agricultural knowledge. We conducted many experiments based on this model. Figure 7 systematically shows the various sub-tasks addressed by the model, including different crop growth stages, diseases, and poultry recognition tasks. The light-blue part in this figure represents the part where the model has already been experimented with, while the light-green part represents the part where experiments can be conducted in the future.
The experimental results in different sub-tasks show that the model performs quite well with an accuracy of more than 90%. Moreover, by comparing it with other existing classic visual language large models, it was found that the recognition ability of our model is significantly better than other methods, which illustrates the effectiveness of AgriVLM. In addition, in order to explore the influence of fine-tuning methods, prompt designs, and languages on experimental results, we conducted further experiments. We compared the effects of two different fine-tuning methods, LoRA and P-tuning, on the results and found that for simple tasks with obvious category differences, the performance difference between the two is not very obvious, but as the difficulty of the task increases, the performance disparity between the two methods will continue to increase. Then, we compared the effects of prompt designs of different granularities on the results and found that fine-grained prompts with rich contextual information can further improve the accuracy of model predictions. Finally, we compared the impact of different languages on the results and found that due to the differences in pre-training data, English prompts improved the accuracy of the model to a certain extent.

5. Conclusions

This paper proposes an agricultural analysis framework based on the visual language large model named AgriVLM, which realizes the analysis of agricultural data by injecting knowledge from the agricultural field into the model. Experimental results prove that it performs well in agricultural image recognition and agricultural knowledge-related understanding. AgriVLM provides a new direction for data analysis in the agricultural field and provides a reference for agricultural knowledge question-and-answer tasks. It also provides a way for the development of visual language large models in the agricultural field.
Although AgriVLM has achieved great results in agricultural image recognition and conversation tasks, how to design more effective knowledge injection methods to better improve the model’s generation quality and recognition accuracy remains to be studied. At the same time, how to obtain higher-quality visual data representations is also a challenge. AgriVLM only integrates image and text data generated during the agricultural production process, and ignores many other modal data such as audio and video. In the future, intelligent analysis of agriculture can be further achieved by fusing multi-modal data.

Author Contributions

Conceptualization, B.L.; methodology, B.L.; software, P.Y.; validation, P.Y.; formal analysis, P.Y.; investigation, P.Y.; resources, B.L.; data curation, P.Y.; writing—original draft preparation, P.Y.; writing—review and editing, B.L.; visualization, P.Y.; supervision, B.L.; project administration, B.L.; funding acquisition, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Scientific and Technological Innovation 2030-“New Generation Artificial Intelligence” Major Project, grant number 2021ZD0113603.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Goel, R.K.; Yadav, C.S.; Vishnoi, S.; Rastogi, R. Smart agriculture—Urgent need of the day in developing countries. Sustain. Comput. Inform. Syst. 2021, 30, 100512. [Google Scholar] [CrossRef]
  2. Yang, X.; Shu, L.; Chen, J.; Ferrag, M.A.; Wu, J.; Nurellari, E.; Huang, K. A Survey on Smart Agriculture: Development Modes, Technologies, and Security and Privacy Challenges. IEEE/CAA J. Autom. Sin. 2021, 8, 273. [Google Scholar] [CrossRef]
  3. Teubner, T.; Flath, C.M.; Weinhardt, C.; van der Aalst, W.; Hinz, O. Welcome to the Era of ChatGPT et al. Bus. Inf. Syst. Eng. 2023, 65, 95–101. [Google Scholar] [CrossRef]
  4. Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z.; et al. Summary of ChatGPT-Related research and perspective towards the future of large language models. Meta-Radiol. 2023, 1, 100017. [Google Scholar] [CrossRef]
  5. Birhane, A.; Kasirzadeh, A.; Leslie, D.; Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 2023, 5, 277–280. [Google Scholar] [CrossRef]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  7. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-training. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 11 June 2018).
  8. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; p. 2011. [Google Scholar]
  9. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Leoni Aleman, F.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  10. Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. GLM-130B: An Open Bilingual Pre-trained Model. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  11. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  12. Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef] [PubMed]
  13. Zhou, Z.; Yang, T.; Hu, K. Traditional Chinese Medicine Epidemic Prevention and Treatment Question-Answering Model Based on LLMs. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 5–8 December 2023; pp. 4755–4760. [Google Scholar]
  14. Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
  15. Zhang, X.V.; Yang, Q. XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 4435–4439. [Google Scholar]
  16. Li, H.; Gao, H.; Wu, C.; Vasarhelyi, M.A. Extracting Financial Data from Unstructured Sources: Leveraging Large Language Models. J. Inf. Syst. 2024, 1–22. [Google Scholar] [CrossRef]
  17. Yang, X.; Wang, Z.; Wang, Q.; Wei, K.; Zhang, K.; Shi, J. Large language models for automated Q&A involving legal documents: A survey on algorithms, frameworks and applications. Int. J. Web Inf. Syst. 2024, 20, 413–435. [Google Scholar] [CrossRef]
  18. Deroy, A.; Ghosh, K.; Ghosh, S. Applicability of large language models and generative models for legal case judgement summarization. Artif. Intell. Law 2024. [Google Scholar] [CrossRef]
  19. Gan, Z.; Li, L.; Li, C.; Wang, L.; Liu, Z.; Gao, J. Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends. Found. Trends Comput. Graph. Vis. 2022, 14, 163–352. [Google Scholar] [CrossRef]
  20. Kim, W.; Son, B.; Kim, I. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Virtual, 18–24 July 2021; pp. 5583–5594. [Google Scholar]
  21. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar] [CrossRef]
  22. Hu, J.; Yao, Y.; Wang, C.; Wang, S.; Pan, Y.; Chen, Q.; Yu, T.; Wu, H.; Zhao, Y.; Zhang, H.; et al. Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages. arXiv 2023, arXiv:2308.12038. [Google Scholar] [CrossRef]
  23. Chen, X.; Shi, Y.; Li, X. Research on Recognition Method of Chinese Cabbage Growth Periods Based on Swin Transformer and Transfer Learning. Appl. Eng. Agric. 2023, 39, 381–390. [Google Scholar] [CrossRef]
  24. Lee, H.; Park, Y.-S.; Yang, S.; Lee, H.; Park, T.-J.; Yeo, D. A Deep Learning-Based Crop Disease Diagnosis Method Using Multimodal Mixup Augmentation. Appl. Sci. 2024, 14, 4322. [Google Scholar] [CrossRef]
  25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
  26. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; p. 814. [Google Scholar]
  27. Yu, Y.; Yang, C.H.H.; Kolehmainen, J.; Shivakumar, P.G.; Gu, Y.; Ren, S.R.R.; Luo, Q.; Gourav, A.; Chen, I.F.; Liu, Y.C.; et al. Low-Rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition. In Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan, 16–20 December 2023; pp. 1–8. [Google Scholar]
  28. Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 320–335. [Google Scholar]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  30. Orchi, H.; Sadik, M.; Khaldoun, M.; Sabir, E. Automation of Crop Disease Detection through Conventional Machine Learning and Deep Transfer Learning Approaches. Agriculture 2023, 13, 352. [Google Scholar] [CrossRef]
  31. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26296–26306. [Google Scholar]
  32. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  33. Liu, X.; Ji, K.; Fu, Y.; Du, Z.; Yang, Z.; Tang, J. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 61–68. [Google Scholar]
Figure 1. The framework of AgriVLM.
Figure 2. The workflow of LoRA fine-tuning.
Figure 3. The confusion matrices of five different recognition tasks. (A) Cabbage growth stage. (B) Strawberry growth stage. (C) Chili pepper disease. (D) Tomato disease. (E) Poultry recognition.
Figure 4. Comparison of LoRA and P-tuning results.
Figure 5. Comparison of results for different prompts.
Figure 6. Comparison of results for different languages.
Figure 7. Agricultural sub-tasks handled by AgriVLM.
Table 1. The division of different datasets.

|              | Cabbage | Strawberry | Chili Pepper | Tomato | Poultry |
|--------------|---------|------------|--------------|--------|---------|
| Training set | 320     | 557        | 2229         | 5000   | 1003    |
| Test set     | 80      | 230        | 246          | 500    | 211     |
Table 2. Conversation data filtering.

| Question | Answer | Availability |
|----------|--------|--------------|
| How many growth stages is wheat growth divided into? | Learn from all agricultural technical experts through the platform! | Not available |
| What are the techniques for cultivating tomatoes? | Tomatoes cost five yuan a kilogram at the market. | Not available |
| What are the freeze preventive measures for wheat? | Thanks for sharing this with the teachers, it’s great. | Not available |
| What is the best temperature for strawberries? | The best temperature for strawberries is between 18 and 25 °C. | Available |
Table 3. The hyperparameter settings for different recognition tasks.

|             | Cabbage | Strawberry | Chili Pepper | Tomato | Poultry |
|-------------|---------|------------|--------------|--------|---------|
| lr          | 0.0002  | 0.0002     | 0.0002       | 0.0002 | 0.0002  |
| batch_size  | 16      | 16         | 16           | 16     | 16      |
| train_iters | 1500    | 2000       | 2000         | 10,000 | 400     |
| layer       | 4       | 4          | 4            | 6      | 2       |
| r           | 10      | 10         | 10           | 10     | 10      |

Where lr represents the learning rate of the training process, train_iters is the number of training iterations, layer represents the number of LoRA fine-tuned layers, and r is the rank used in LoRA fine-tuning.
Table 4. The results of five different recognition tasks.

|           | Cabbage | Strawberry | Chili Pepper | Tomato | Poultry |
|-----------|---------|------------|--------------|--------|---------|
| Accuracy  | 0.925   | 0.9739     | 0.9756       | 0.934  | 0.9952  |
| Precision | 0.9348  | 0.97       | 0.9782       | 0.94   | 0.9954  |
| Recall    | 0.925   | 0.97       | 0.9713       | 0.93   | 0.9951  |
| F1        | 0.9242  | 0.97       | 0.9745       | 0.93   | 0.9952  |
Table 5. The results of different methods on five recognition tasks.

|          | Cabbage | Strawberry | Chili Pepper | Tomato | Poultry |
|----------|---------|------------|--------------|--------|---------|
| LLaVA    | 0.6     | 0.8434     | 0.7967       | 0.4240 | 0.9810  |
| Flamingo | 0.4375  | 0.4957     | 0.7195       | 0.2    | 0.9810  |
| AgriVLM  | 0.925   | 0.9739     | 0.9756       | 0.934  | 0.9952  |
Table 6. Three different prompt designs.

|         | Question | Answer |
|---------|----------|--------|
| Prompt1 | Please tell me which of the cabbage growth stages germination, seedling, rosette, and dormancy this image belongs to? | Germination |
| Prompt2 | Please extract the key information from the image and classify this image into one of the following given categories: germination, seedling, rosette, and dormancy. | Germination |
| Prompt3 | Please tell me which of the cabbage growth stages germination, seedling, rosette, and dormancy this image belongs to? | This image belongs to the seedling of the cabbage germination, seedling, rosette, and dormancy stages. |
Table 7. Case studies of AgriVLM: growth stage analysis.

| Input image | Question | Answer |
|-------------|----------|--------|
| Apple image | Q: Please tell me which stage of the apple growth stage this image belongs to: flourish stage, flowering stage, fruiting stage, and maturity stage? | A: The flowering stage. |
| Strawberry image | Q: Please tell me which stage of the strawberry growth stage this image belongs to: flourish stage, flowering stage, fruiting stage, and maturity stage? | A: The fruiting stage. |
| Peach image | Q: Please tell me which stage of the peach growth stage this image belongs to: flourish stage, flowering stage, fruiting stage, and maturity stage? | A: The maturity stage. |
Table 8. Case studies of AgriVLM: agriculture questions and answers.

| Input image | Question | Answer |
|-------------|----------|--------|
| Chili pepper image | Q: What are the main pests and diseases of chili pepper? | A: The main pests and diseases of chili pepper are bacterial wilt, powdery mildew, soft rot, blight, anthracnose, and so on. |
| | Q: What are the harm symptoms of chili pepper blight disease? | A: In plants with chili pepper blight, the leaves turn dark green to brown and become necrotic between the veins; the stems turn pale or brown, and irregular water-soaked spots appear at first and then expand into round black-brown spots. |
| Wheat image | Q: What causes the yellowing of wheat leaves? | A: The yellowing of wheat leaves is related to insufficient soil fertility, lack of water, and poor root development. |
| | Q: What are the main reasons for wheat lodging? | A: 1. Lack of water: the root system cannot absorb enough water, resulting in slow growth and a tendency to lodge. 2. Over-fertilization: excessive nitrogen fertilizer leads to oversized, thin leaves that lose their supporting capacity and lodge easily. 3. Low-temperature freezing: wheat is prone to lodging after freezing in cold weather. |
| Pig image | Q: What drugs can be used to treat pig parasitic diseases? | A: Drugs for the treatment of swine parasitic diseases include ivermectin, albendazole, avermectin, and so on. |
| | Q: How many times a year can fattening pigs be vaccinated against swine fever? | A: Fattening pigs can be vaccinated against swine fever twice a year. |