1. Introduction
India is an agricultural nation whose economy relies heavily on agriculture. According to estimates released by the Ministry of Statistics & Programme Implementation (MoSPI), the share of agriculture and allied sectors in Gross Value Added (GVA) was 20.3% in 2020–2021, fell to 19% in 2021–2022, and fell again to 18.3% in 2022–2023 [1]. As one of the world’s agricultural powerhouses, India is the world’s top producer of several spices, cereal crops such as rice and wheat, fruits and vegetables, and commercial crops such as tea. It not only produces for its own consumption but is also a key exporter of agricultural goods to several nations, with rice and sugar being the major agricultural exports. India’s export trade was around Rs 380,000 crore in 2021–2022 (Department of Commerce, Government of India). The majority of India’s people live in rural areas, and agriculture is their major source of income. Over 50% of the Indian population depends on agriculture, about 54% of all workers are agricultural workers [2], and 57% of rural households are engaged in agriculture.
The total food grain production of the country is estimated to be 3296.87 lakh tons [3]. The major cereal crops, wheat and rice, are grown in states such as Uttar Pradesh (UP), Punjab, Haryana, West Bengal and Andhra Pradesh. Cash crops such as sugarcane are grown in UP and Maharashtra. Tamil Nadu, Kerala, Andhra Pradesh, Rajasthan and Gujarat grow several oilseeds. Fiber crops such as cotton, jute, silk and hemp are grown in Maharashtra, Gujarat, UP and Kerala. Plantation crops such as tea, coffee and rubber are grown in Assam, Karnataka and Kerala. Spices such as pepper, ginger and turmeric are grown in Tamil Nadu, Kerala, UP and Andhra Pradesh. This illustrates how heavily India depends on agriculture and how almost the entire nation contributes to crop production.
The Indian subcontinent is positioned in southern Asia, bordered by the Himalayas in the north, with the Indian Ocean influencing its climate. It has a tropical monsoon climate, and many parts receive monsoon showers, but mean rainfall varies widely: parts of Meghalaya receive the heaviest rainfall, whereas locations such as Ladakh and the Thar Desert remain dry for most of the year. The Ganga plains and coastal areas receive rainfall during July and August, whereas places like Goa and Hyderabad receive rainfall during June and July. Several coastal areas of Tamil Nadu and West Bengal also face cyclonic rains during October and November. The Indian climate and agriculture are greatly influenced by these rains. India also has wide annual temperature ranges: the coastal regions have lower, more moderate temperatures, the desert regions of Rajasthan have extremely high temperatures, and the Himalayan regions are cold. The monsoon winds, together with temperature and rainfall, greatly affect India’s agricultural production.
India also has a wide variety of soils. Alluvial soil, the most predominant, covers about 40% of the total land area and is present throughout the northern plains and river valleys. Black soil covers about 15% of the land area, is found in the Deccan Plateau, and is used to cultivate cotton, pulses and sugarcane. Crops such as wheat, oilseeds and cotton are also grown in red and yellow soils, which are found in regions of low rainfall such as Odisha and Chhattisgarh. Peaty soil, which is rich in humus, is found in parts of Tamil Nadu, Uttarakhand, Bihar and West Bengal. Laterite, mountain, desert and saline soils are also found in parts of India.
These climatic, soil and other factors greatly influence India’s agriculture and its production capabilities. They interact in complex ways to determine crop production and yield. Changing climate patterns have also made it more difficult to predict crop production from year to year.
Crop production and yield are important parameters for predicting whether supply will meet demand, that is, whether the crops produced meet both the consumption needs and the export needs of the nation. Yield prediction enables farmers to estimate how much their farms will produce, which helps them prepare storage, optimize the use of fertilizers and natural resources, increase efficiency, decrease costs, and choose the crops that give the highest yield.
Crop yield prediction is therefore a crucial task in many areas. Crop production depends on several parameters such as the crop itself, the soil, the region and the climate, and these parameters interact in a complex fashion to determine production and yield. Traditionally, farmers predict crop production and yield by considering rainfall trends and the number of crops sown, but the changing climate and the intrinsic complexity of the influencing factors have made such predictions more difficult and less accurate.
To overcome these limitations, crop production can be predicted in other ways, mainly with linear models, crop models and Machine Learning (ML) models. Linear models assume that the parameters under consideration act additively, and they fall short of effective prediction because of this assumption of linearity. Crop models describe the interactions explicitly through equations; they can be very complicated and expensive to construct, and they require domain-specific knowledge. Crop models may also be computationally slow and resource-intensive.
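To make the additivity assumption concrete, a linear model can be written in the generic form below. The symbols are notation introduced here for illustration and are not taken from the models used in this study: the predicted yield is \hat{y}, the input parameters (for example rainfall or temperature) are x_1, ..., x_p, and the fitted coefficients are \beta_0, ..., \beta_p.

```latex
\hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i
```

Each parameter contributes its own term independently of the others. An interaction such as the effect of rainfall depending on temperature would need an additional product term like \beta_{12} x_1 x_2, which a purely additive model does not contain.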
ML models, on the other hand, are efficient in agricultural prediction and are not as resource-intensive or complicated as crop models. They learn from the data provided to them by identifying patterns, and once trained they can use this learned knowledge to predict crop production and yield from unseen data. The availability of large amounts of data, such as climate, soil and region datasets, enables such ML models to be built and trained to high accuracy. Moreover, the complex relationships among the weather, region and soil parameters that influence crop production make ML models well suited to this problem.
Machine learning is especially useful in agriculture, where a huge amount of data is available and can be used to make predictions or decisions about plants and animals. It also helps reduce the uncertainty in agricultural trends arising from climatic changes, thereby providing more consistent predictions. ML has widespread applications in agriculture, such as disease detection, grading, irrigation, weather monitoring and animal welfare monitoring.
In this paper, we have explored several ML models and trained them on the datasets available to predict the crop production and yield. The regions taken into account are the Indian states of Andhra Pradesh, Telangana, Karnataka, Kerala and Tamil Nadu.
Figure 1 highlights the areas considered in this research for crop yield prediction.
The crops that have been considered include rice, sorghum, cotton, sugarcane and rabi. Soil data consists of various soil attributes for each of the chosen districts from the chosen states.
The prevalently used ML models can be categorized into regression, clustering, Bayesian models and Artificial Neural Networks. Clustering models group data points based on similar characteristics and hence divide a dataset into subgroups that share inherent similarities. Regression and Bayesian models are suitable for predicting various agricultural parameters. We have trained several ML models, including Linear Regression, Gradient Boosting Regression, Random Forest Regression, K-nearest Neighbors Regression, LGBM Regression, Decision Tree Regression and Bagging Regression. The models have been evaluated using Mean Absolute Error (MAE), Root Relative Squared Error (RRSE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the R-squared score.
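The evaluation metrics named above can be sketched as follows. This is a minimal illustration under the standard textbook definitions, not the implementation used in this study; the function name `regression_metrics` and the sample yields are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, RRSE and R-squared for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares, and total sum of squares around the mean.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true is constant; RRSE and R-squared are undefined")
    mse = ss_res / n
    return {
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        # RRSE compares the model with a baseline that always predicts the mean.
        "RRSE": math.sqrt(ss_res / ss_tot),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Made-up yields (tons per hectare), for illustration only.
observed = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 3.0, 3.5, 5.0]
scores = regression_metrics(observed, predicted)
```

Under these definitions RRSE and R-squared carry the same information, since the square of RRSE equals one minus R-squared; MAE, MSE and RMSE remain in the units of the target (or its square) and so depend on the scale of the crop yield being predicted.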
Predicting crop production and yield with machine learning models greatly benefits farmers and helps forecast supply so that demand can be met efficiently.
Our main contributions are:
We compiled a unique dataset encompassing essential factors that influence crop growth, namely meteorological, soil and agricultural factors, for the Indian states mentioned above.
A statistical feature analysis was conducted on the influence of specific features related to meteorological, soil, and crop data on crop production and yield. This feature selection analysis helped in identifying the most contributing features to the prediction of crop production and yield. The impact of these features on crop production and yield was examined, focusing on their relationship to the outcome variables.
A comprehensive analysis has been conducted on the trained ML models, evaluating their performance using selected metrics. The ML models were also compared with each other based on these metrics.
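One common way to carry out a statistical feature analysis of the kind listed in the contributions is to rank features by their Pearson correlation with the target. The sketch below assumes this technique purely for illustration; the text above does not state which statistic was used, and the function names, column names and values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column is treated as uninformative
    return cov / (spread_x * spread_y)


def rank_features(features, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scored = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical columns with made-up values, for illustration only.
features = {
    "rainfall_mm": [600, 750, 800, 950, 1100],
    "mean_temp_c": [31, 29, 30, 28, 27],
    "soil_ph": [6.5, 7.1, 6.8, 6.6, 7.0],
}
yield_t_per_ha = [2.1, 2.6, 2.8, 3.3, 3.9]
ranking = rank_features(features, yield_t_per_ha)
```

With these made-up values the ranking places `rainfall_mm` first, `mean_temp_c` second (with a negative correlation) and `soil_ph` last. Pearson correlation only captures linear relationships, so a feature that matters through interactions with other features could rank low under this measure.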
2. Literature Review
Anakha Venugopal et al. [4] proposed a mobile application that predicts the name of a crop and calculates its yield. The dataset included meteorological data such as temperature, wind speed and humidity, as well as crop production data. However, soil data was not included, which limited the analysis of soil-related factors. The classification algorithms used in the paper are Logistic Regression, Naïve Bayes and Random Forest. Among these, the Random Forest model showed the highest accuracy of 92.81%, followed by Naïve Bayes with 91.50% and Logistic Regression with 87.80%. Thomas van Klompenburg et al. [5] conducted a Systematic Literature Review (SLR) to gather the algorithms and attributes used in crop prediction studies. Their analysis revealed that the most used features in these studies are rainfall, temperature and soil type, and that the most commonly used algorithm is the Artificial Neural Network (ANN). The paper also discussed the evaluation metrics used in these studies, finding that Root Mean Squared Error (RMSE) was the most popular choice, and provided an additional analysis of deep learning-based studies. However, the review does not delve into the challenges related to data quality, feature selection or model interpretability. Sonal Agarwal et al. [6] proposed a hybrid approach for crop yield prediction that combines machine learning and deep learning techniques. The enhanced model achieved an accuracy of 97%, compared with 93% for the existing model. Support Vector Machine (SVM) was used as the machine learning algorithm, and Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNN) were used as the deep learning algorithms. The study lacks an in-depth exploration of the trade-offs between different hybrid models or their computational requirements.
A.B. Sarr et al. [7] investigated crop yield prediction methods specifically for Senegal. They used three machine learning models (SVM, Random Forest and Neural Network) and one multiple linear regression method, the Least Absolute Shrinkage and Selection Operator (LASSO), to predict the yield of essential food staple crops in Senegal. They trained the models with three combinations of predictors: vegetation data, climate data and a combination of both. The best performance was shown by the models trained with both vegetation and climate data. However, this study overlooked the influence of soil conditions on crop yield prediction. S. S. Kale et al. [8] aimed to predict the yield of different crops using neural network regression. The dataset was obtained from Indian government websites for districts in Maharashtra, India. The model predicted with 82% accuracy using a three-layered ANN with the Rectified Linear Unit (ReLU) activation function and the Adam optimizer. N. Bali et al. [9] explored various machine learning algorithms and techniques used in crop yield prediction, assessed advanced techniques such as deep learning for such estimations, and explored the efficiency of hybridized models. They concluded that precipitation and temperature were the most influential factors, along with the agronomic practices adopted by farmers. ANN and the Adaptive Neuro-Fuzzy Inference System (ANFIS), a hybrid of fuzzy and ANN models, showed the best accuracy. In addition to these studies, various others have successfully incorporated neural networks into crop yield prediction [10,11,12,13].
Research conducted by Hames Sherif [14] identified the important factors responsible for staple crop production in semi-arid and desert climates in Africa in order to predict yield. The machine learning models used were a Multiple Linear Regression (MLR) model and a random forest regressor, and their performance was compared using RMSE and the R-squared score. The random forest regressor was found to predict more accurately than the MLR model. However, the run time for training the random forest model was significantly higher than that of the MLR model. One limitation of this research is that the accuracy of the data obtained from various sources is difficult to assess, as it is largely collected by member countries and includes imputations for missing data whose accuracy is unknown. H.A. Burhan [15] evaluated regression machine learning methods for predicting the yield of major crops in Turkey. The data used to train these models included pesticide use, meteorological factors and crop yield values. The best R-squared scores were shown by the random forest and decision tree regression methods, whereas support vector regression performed extremely poorly.
M. Kuradusenge et al. [16] used yield and weather data from a district in Rwanda to predict the harvest of Irish potatoes and maize. The models used were polynomial, random forest and support vector regressors. Among these, random forest showed the best performance, with R-squared scores of 0.875 for potatoes and 0.817 for maize. However, the paper did not include other weather-related features such as humidity, wind speed and solar radiation, or soil data, when training the models, so the predictions do not incorporate the impact of these factors despite their significance. F. Abbas et al. [17] used four machine learning algorithms, namely linear regression, elastic net, K-nearest Neighbor (KNN) and Support Vector Regression (SVR), to predict potato tuber yield from crop and soil data. The SVR model showed the best performance, while the KNN model performed poorly. Furthermore, several other studies have discussed the influence of soil factors on crop yield prediction [18,19].
P. Das et al. [20] presented a novel hybrid approach that combines the soft computing algorithm multivariate adaptive regression spline (MARS), used for feature selection, with SVR and ANN models to predict grain yield. The MARS-based hybrid models performed better than the regular models. Y. Shen et al. [21] proposed an architecture combining a long short-term memory neural network and random forest (LSTM-RF) to predict wheat yield using multispectral canopy water stress indices (CWSI) and vegetation indices (VIs) as training data. The combined LSTM-RF model achieved an R-squared score of 0.71, better than the 0.61 achieved by the LSTM model alone. Other research works that utilized LSTM for improved crop yield prediction performance include [22,23,24,25].
N. Banu Priya et al. [26] aimed to predict crop yield using data mining techniques and machine learning algorithms, in order to improve the accuracy of crop production estimates and manage agricultural risk. Random forest regression, decision tree regression and gradient boosting regression were used, and random forest regression achieved an R-squared score of 88%. However, a limitation is that the study uses relatively simple factors such as the state, district, crop and season, which may not fully capture the complexity of crop yield variability. David B. Lobell et al. [27] examined the ability of statistical models to predict crop production with respect to changes in mean temperature and precipitation, as simulated by a crop model (CERES-Maize). The results suggested that statistical models, when compared with crop models, are a useful but imperfect tool for predicting crop production; they were also observed to perform better at broader scales, leading to the conclusion that statistical models will continue to play an important role in predicting the impacts of climate change. One limitation of this study is that these models rely on historical yields and simplified weather measurements, which may not fully capture the complexity of climate impacts on agriculture.
Douglas K. Bolton et al. [28] used country-level data from the United States Department of Agriculture (USDA) together with data from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) to develop empirical models for predicting soybean and maize yield in the Central United States. It was found that including phenology data greatly improved model performance. However, crop phenology data may not be readily available, especially for regional crop varieties, and it can also complicate the interaction relationships and increase model complexity. Yogesh Gandge et al. [29] aimed to predict crop yield to help farmers and the government plan better, using data mining techniques to efficiently extract the features and data used for prediction. They found that there is still scope for improvement, in particular through better unified models and a bigger dataset, to predict crop yield with greater accuracy.
Keerthana Mummaleti et al. [30] analyzed the use and implementation of ensemble techniques for predicting crop type from location parameters, retrieving 7 features from various databases with 28,252 instances. The paper concluded that an ensemble of decision tree regression and AdaBoost regression gave the best accuracy, and it recommended which crop should be cultivated in a region based on weather conditions. Although this study uses various models to predict the crop type, it does not consider other important climatic factors or soil quality, which can significantly affect the predictions. Shah Ayush et al. [31] suggested the optimal climate factors for maximizing crop yield and predicted crop yield using multivariate polynomial regression, support vector machine regression and random forest models. The study uses yield and weather data from the United States Department of Agriculture and found that support vector regression obtained the best results. One limitation of the paper is that it does not consider other factors such as soil quality, which can also impact yields.
S. Misra Veenadhari et al. [32] predicted the influence of climatic parameters on crop yields in selected districts of Madhya Pradesh, India. A prediction accuracy of 76% to 90% was achieved for the selected crops and districts, with an overall prediction accuracy of 82%. The paper’s limitation lies in its exclusive focus on climate-related factors, neglecting other crucial agro-input variables that influence crop productivity. V. Sellam et al. [33] analyzed the influence of environmental parameters such as Area Under Cultivation (AUC), Annual Rainfall (AR) and Food Price Index (FPI) on crop yield. This was done using regression analysis, and their linear regression model with a least squares fit achieved an R-squared of 0.7. The limitation of the paper is its reliance on a single predictive model and data from a single country, suggesting the need for future studies to explore other machine learning algorithms and to expand the scope of the research to other regions. P. Mishra et al. [34] used Gradient Boosting Regression to improve the prediction of crop yields for districts in France. The model achieved an R-squared score of 0.51, which was significantly better than the other models, namely AdaBoost, KNN, linear and random forest models. The limitation of this paper is that it focuses only on predicting maize yields in France and does not consider other crops or regions. Gradient Boosting Regression was also used in other studies [35,36,37] to improve the accuracy of crop yield prediction.
Leveraging data mining techniques, particularly KNN, V. Latha Jothi et al. [38] used historical data such as rainfall, temperature and groundwater levels to predict future crop production, which also aids in analyzing past groundwater levels and predicting future ones for improved agricultural planning. The limitation of this research is the difficulty of estimating rainfall precisely, which is an important factor for crop yield prediction. Similar work using KNN models for crop yield prediction was done in [39,40,41,42].
5. Conclusions
In conclusion, this research paper has compiled data from various sources to analyze the primary factors influencing crop yield in selected districts. Our findings highlight the importance of soil factors, meteorological conditions, and agricultural practices. Each of these factors was thoroughly investigated by compiling a primary dataset for each category, which was later merged into a comprehensive dataset.
The study also examined the influence of specific features within each factor on crop yield. The original dataset was utilized to train various regression machine learning models, and their performance was compared using metrics such as the R-squared score and RMSE. The Extra Trees Regressor model achieved the highest R-squared score of 0.9615, indicating its good prediction accuracy. Furthermore, the ML models were categorized into distinct groups based on their underlying techniques and methodologies, specifically linear, neighbors-based, and tree-based models.
Analyzing the average performances of these model groups revealed that the tree-based models demonstrated the highest average R-squared score of 0.9353, followed by neighbors-based models with a score of 0.9002, and linear models with a score of 0.8568. Additionally, the study briefly discusses the performance of the models in predicting crop yields for each specific crop, which is presented in a tabulated format.
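The group-level comparison described above amounts to averaging per-model scores within each model family. A minimal sketch of that arithmetic follows; the function name, model names and score values are placeholders chosen for illustration and are not the results reported in this paper.

```python
from collections import defaultdict


def average_by_group(scores, groups):
    """Average per-model scores within each model group."""
    collected = defaultdict(list)
    for model, score in scores.items():
        collected[groups[model]].append(score)
    return {group: sum(values) / len(values) for group, values in collected.items()}


# Placeholder mapping and scores, for illustration only (NOT the paper's results).
groups = {
    "linear_a": "linear",
    "linear_b": "linear",
    "knn_a": "neighbors-based",
    "tree_a": "tree-based",
    "tree_b": "tree-based",
}
r2_scores = {
    "linear_a": 0.5,
    "linear_b": 0.7,
    "knn_a": 0.8,
    "tree_a": 0.9,
    "tree_b": 0.7,
}
group_means = average_by_group(r2_scores, groups)
```

A plain mean weights every model in a group equally, so a group with a single weak model can pull its average below that of a group whose best model is worse; reporting the per-group maximum alongside the mean avoids that ambiguity.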
Overall, this research paper provides valuable insights into the factors influencing crop yield and demonstrates the effectiveness of machine learning models in predicting and understanding agricultural outcomes. The findings contribute to the existing body of knowledge and underscore the significance of considering various factors in optimizing crop production.