Optimization of Crop Recommendations Using Novel Machine Learning Techniques

Lahza, Husam; Naveen Kumar, K. R.; Sreenivasa, B. R.; Shawly, Tawfeeq; Alsheikhy, Ahmed A.; Hiremath, Arun Kumar; Lahza, Hassan Fareed M.

doi:10.3390/su15118836


by Husam Lahza ¹, K. R. Naveen Kumar ², B. R. Sreenivasa ³,*, Tawfeeq Shawly ⁴, Ahmed A. Alsheikhy ⁵, Arun Kumar Hiremath ² and Hassan Fareed M. Lahza ⁶

¹ Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 23589, Saudi Arabia
² Department of Computer Science & Engineering, Bapuji Institute of Engineering & Technology, Davangere 577004, Karnataka, India
³ Department of Information Science & Engineering, Bapuji Institute of Engineering & Technology, Davangere 577004, Karnataka, India
⁴ Department of Electrical Engineering, Faculty of Engineering at Rabigh, King Abdulaziz University, Jeddah 23589, Saudi Arabia
⁵ Department of Electrical Engineering, College of Engineering, Northern Border University, Arar 91431, Saudi Arabia
⁶ Department of Information Systems, College of Computers and Information Systems, Umm Al-Qura University, Makkah 21955, Saudi Arabia
* Author to whom correspondence should be addressed.

Sustainability 2023, 15(11), 8836; https://doi.org/10.3390/su15118836

Submission received: 4 April 2023 / Revised: 18 May 2023 / Accepted: 25 May 2023 / Published: 30 May 2023


Abstract: A farmer can use machine learning to decide which crops to sow, how to care for those crops throughout the growing season, and how to predict crop yields. According to the World Health Organization, agriculture is essential to a nation’s rapid economic development. Food security, access, and adoption are the three cornerstones of the organization. Without a doubt, the main priority is to ensure that there is enough food for everyone, and increasing agricultural yield can help ensure a sufficient supply. Crop yields vary substantially across the country. This variation is the foundation for research into whether cluster analysis can be used to identify crop yield patterns in a field. Previous studies were only marginally successful in accomplishing their primary objectives because of unstable conditions and imprecise methodology. The vast majority of farmers base their predictions of crop yield on prior observations of crop growth on their farms, which can be deceptive. According to the literature, standard preprocessing methods and random cluster value selection are not always reliable. The proposed study overcomes the shortcomings of conventional methodology by highlighting the significance of machine learning-based classification/partitioning and hierarchical approaches in offering a trained analysis of yield prediction in the state of Karnataka. The dataset used for the study was collected from the ICAR-Taralabalu Krishi Vigyan Kendra, Davangere, Karnataka. Crop area and crop yield are the important variables in the two dataset analysis methods used in the study to detect anomalies. The study emphasizes the importance of a mathematical model and algorithm for identifying yield trends, which can assist farmers in selecting crops that have a large seasonal impact on yield productivity.

Keywords:

hierarchical clustering; precision agriculture; yield prediction; cluster analysis; dendrogram; partition clustering

1. Introduction

Precision agriculture, as it is now known, was pioneered by environmentally conscious farmers, and it was practiced before the invention of computers. These farmers succeeded in identifying both the actions required to increase crop yields and the variables that contributed to the field’s unpredictability. They achieved this by taking field notes during the planting and harvesting seasons. Based on the information gathered, they would then select the most effective plan of action for the following year. Data-generating equipment and sensors have long been on the rise in agriculture [1], enabling farmers to make data-driven decisions. This type of farming is known as smart farming. The authors of [2] provide a thorough overview of the various goals and strategies used in smart farming. One of the major problems in precision agriculture is predicting agricultural production, and various models have been proposed and tried so far. Because crop yield is affected by a variety of factors such as soil, weather, fertilizers, and seeds, multiple datasets must be used [3]. As a result, estimating agricultural production is a difficult task. Agricultural productivity forecasting models make it simple to estimate the actual yield, but improving yield prediction accuracy is still desirable [4,5].

The majority of climate change simulations are based on deterministic biophysical crop models [6]. Because they are based on detailed representations of plant physiology, these models can still be used to assess response mechanisms and potential adaptation strategies [7]. Statistical models, on the other hand, outperform them when making predictions at a larger spatial scale [8]. Several studies [9] have used statistical models to demonstrate a strong link between excessive heat and poor crop performance. These statistical models rely on traditional econometric methods. Recent research has attempted to merge crop models with statistical models, for example by incorporating crop model output and crop model insights into statistical model parameterization [10]. These efforts have been made to better understand how statistical models and crops may interact [11].

Numerous studies go into detail on the various challenges of developing high-performance forecasting models. Choosing the best algorithm for high performance therefore becomes a time-consuming and important task. Furthermore, the chosen systems and algorithms must be extremely efficient at handling large amounts of data [12]. In addition, locating zones within a region that have behaved similarly over time is more useful than predicting specific yields within a sector. However, some factors that can affect yield, such as soil type, climate, and harvesting techniques, may vary from season to season. As a result, even if crop yield remains constant from year to year, the yield of one season cannot account for differences in the field [13]. The large area and yield deviations make it difficult to measure variations precisely. For many crops, yield estimates are built into the agricultural planning process. To protect the vested interests of farmers and the government, the research work categorizes the divisions based on yield and region. On some major agricultural issues, the government may act rashly. However, this approach assists farmers in selecting the appropriate crop for various areas in order to obtain adequate yields. This is accomplished by taking into account cluster values based on the heuristic scores described in the following section. The work also elaborates on the importance of avoiding the cultivation of unnecessary crops in order to protect vital resources, such as time and money. Most farmers plant crops based on past experience, resulting in a lower yield. Various clustering algorithms are used in the study to identify the appropriate clusters, and comparative analyses are performed to determine the optimal cluster values. The section that follows contains a detailed description of the study.

The research performs the following activities:

Applies hierarchical and partitioning approaches to develop clusters based on factors such as location, output, and productivity;
Performs a comparative analysis to identify the best method for structuring zones into clusters;
Recommends areas or fields with the potential to produce high crop yields based on the scale value specified.

2. Materials and Methods

2.1. Dataset Overview

The data were collected from various sources, including the Krishi Kendra (agricultural office) in Davangere district, Karnataka. Area and production statistics were collected from the Ministry of Agriculture and Farmers Welfare, Karnataka, India [14]. The dataset source is accessible in the records of the Karnataka Government [12,15]. The data cover the years 1998 to 2018. The preliminary data collection was carried out for various districts in Karnataka. The dataset consists of a large number of observations of the following variables: state, district, year, season, crop, yield (in tons per hectare), area (in hectares), and production (in kilograms).

A summary of the statistical data for each variable is presented in Figure 1. Lines 5 and 6 give the details of the variable “crop”. The total number of crops available is 43, and each needs to be identified by its specific name. Only the state of Karnataka is used for the study. As the total number of locations is 30, the names of the different districts also need to be known. Data were collected from 1997–1998 to 2017–2018, which amounts to 21 years of data, and the crops were studied throughout these years. Lines 20 to 23 give the details of the variable “season”. Crops are classified into kharif, rabi, and summer crops. The kharif season runs from June to September, the rabi season from October to February, and the summer season from March to May. Whole_year specifies the crops cultivated in a year irrespective of the three seasons. A value of total in the season column specifies the overall yield of a crop. Lines 25 to 29 give the details of the variable “area”, which specifies the approximate amount of land, in hectares, used for agricultural purposes. Lines 31 to 36 give the details of the variable “production”, which considers the aggregate area and gives the production of crops in kilograms. Missing values may be due to personal use, non-production, or unentered data. Lines 38 to 42 give the details of the variable “yield”, which gives the crop output in tons per hectare.

2.2. Proposed Framework for Determination of Yield Trend

The flow chart of the working model, shown in Figure 2, starts by loading the dataset and clearing unwanted data. The cleaned data are then summarized. Once this is done, the next step is to extract the numeric variables. Extracting numeric variables helps in generating a correlation table, which shows the correlation coefficient between a set of variables, such as area, production, and yield.

If correlated variables are present, the redundant ones are filtered out, as they would give the same results. The data are then passed to the data inspection methods (probability density plot and boxplot). If the density plots reveal multi-modal data, the districts concerned are filtered out, and the remaining data are passed to the boxplot to reveal the presence of outliers.

Afterward, the data are passed to the outlier detection methods (bpRule and Grubbs’ test). Outliers are removed only when both methods identify them.

Once the data are free from outliers, the next step of the flowchart is picking the optimal k for clustering. The clustering variable is either area or yield. For area, the data are grouped by location, and the mean partitions by area and by yield are obtained. For yield, the data are again grouped by location, and the mean partition by yield is obtained. After the clustering, the output is categorized season-wise and presented, and inferences are drawn.

Most of the existing methods use multiple clustering algorithms to analyse large datasets. CLARA (Clustering Large Applications) is an extension of the k-medoids (PAM) technique for dealing with data containing many objects (more than a few thousand observations). Its purpose is to reduce processing time and RAM storage problems. This is achieved by sampling.

2.3. Algorithm for Determination of Yield Trends

Let D represent the training set of tuples, X = (x_1, x_2, …, x_{26,906}), each described by an 8-dimensional feature vector holding eight measurements: state, district, year, season, crop, yield (in tons per hectare), area (in hectares), and production (in kilograms).

Firstly, Algorithm 1 considers numeric variables to check if the variables are changing together at a constant rate by using Pearson Correlation [16] given in Equation (1).

r = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}}

(1)

Then, linearly related variables are removed from the attribute list.
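As an illustration of this feature-selection step, the following minimal Python sketch computes the standard Pearson correlation coefficient for two numeric variables. The function name `pearson_r` and the toy values are assumptions made for illustration; they are not taken from the study's dataset or code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the two sums of squares.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Toy values (hypothetical), chosen so that the two variables move together.
area = [10.0, 20.0, 30.0, 40.0]
production = [21.0, 39.0, 62.0, 80.0]
r = pearson_r(area, production)
```

A value of r close to 1 or −1 marks one of the two variables as a candidate for removal from the attribute list.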

Algorithm 1: Determination of Yield Trends

Input: Data, D, which is a crop yield dataset;
attrList, the set of numeric candidate attributes;
Output: A set of k territories’ crop yield based on the scales defined.
//find the correlated variables;
apply FeatureSelection(D, attrList);
if redundantVars then
    attrList := attrList − redundantVars;
end
var := “area”;
DetectionAndInspection(D, var, locations);
//find the best k;
k := kSelection(mean(D[var]), locations);
//construct k partitions;
pObj := pGroup(mean(D[var]), k, locations);
var := “yield”;
for each outcome j of the k partitions do
    let D_j be the set of observations in D satisfying outcome j;
    DetectionAndInspection(D_j, var, locations);
    k_j := kSelection(mean(D_j[var]), locations);
    hObj := hGroup(mean(D_j[var]), k_j, locations);
end

The algorithm calls DetectionAndInspection(), which computes and draws kernel density estimates and identifies multi-modal distributions. It also removes the locations, together with their observations, for which a multi-modal distribution is found.

The simplest univariate outlier detection method, the Boxplot Rule, is also applied to the numeric variable area. It tags as an outlier any value outside the interval

[Qtl_{1} - 1.5 \times IQR, \; Qtl_{3} + 1.5 \times IQR],

where $Qtl_{1}$ ($Qtl_{3}$) is the first (third) quartile and $IQR = Qtl_{3} - Qtl_{1}$ is the interquartile range.
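The Boxplot Rule can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the function name `boxplot_outliers` and the toy values are hypothetical, and the quartiles are computed with the "inclusive" interpolation method, which is an assumption about the quartile convention.

```python
import statistics

def boxplot_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Toy area values (hypothetical); 100 lies far from the rest.
toy_area = [10, 12, 11, 13, 12, 11, 100]
flagged = boxplot_outliers(toy_area)
```

For these toy values the quartiles are 11 and 12.5, so the interval is [8.75, 14.75] and only the value 100 is flagged.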

A related method, Grubbs’ test, starts by calculating the following z score for each observation x:

z = \frac{|x - \bar{x}|}{s_{x}}

where $\bar{x}$ is the sample mean of the variable x and $s_{x}$ is its sample standard deviation. Using this score, an observation is declared an outlier if Equation (2) holds:

z \geq \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t_{\alpha / (2 N), N - 2}^{2}}{N - 2 + t_{\alpha / (2 N), N - 2}^{2}}}

(2)

where N is the sample size and $t_{\alpha / (2 N), N - 2}$ is the critical value of the t-distribution with N − 2 degrees of freedom at the significance level of $\alpha / (2 N)$ [8].
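The test can be sketched as follows. To keep the example free of external dependencies, the critical t value is passed in by the caller (in practice it would come from a t-distribution table or a statistics library). The function names and the toy values are hypothetical, and `t_crit = 4.38` is only an approximate, illustrative value for N = 7 and α = 0.05.

```python
import math
import statistics

def grubbs_threshold(n, t_crit):
    """Right-hand side of the Grubbs inequality for sample size n."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))

def grubbs_flags(values, t_crit):
    """Flag each observation whose z score reaches the Grubbs threshold."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    limit = grubbs_threshold(n, t_crit)
    return [abs(v - mean) / sd >= limit for v in values]

# Toy values (hypothetical); t_crit is an assumed, approximate critical value.
toy_area = [10, 12, 11, 13, 12, 11, 100]
flags = grubbs_flags(toy_area, t_crit=4.38)
```

In line with the framework described earlier, an observation would be removed only if this test and the Boxplot Rule both flag it.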

The key issues with any clustering algorithm are:

Validating that the obtained clustering solution is precise;
Obtaining an appropriate number of clusters for the yield dataset (compactness or cluster separation).

Internal validation methods, such as the Calinski–Harabasz Index and the Average Silhouette Width Index, are adopted to overcome these issues.

2.3.1. Calinski–Harabasz Index

The Calinski–Harabasz Index, also known as the Variance Ratio Criterion, is the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters; the higher the score, the better the clustering.

For a dataset D of size $n_{D}$ that is to be clustered into k groups, the Calinski–Harabasz score s is given by

s = \frac{\mathrm{tr} (B_{k})}{\mathrm{tr} (W_{k})} \times \frac{n_{D} - k}{k - 1}

where $\mathrm{tr} (B_{k})$ is the trace of the between-cluster dispersion matrix and $\mathrm{tr} (W_{k})$ is the trace of the within-cluster dispersion matrix, given by

W_{k} = \sum_{q = 1}^{k} \sum_{x \in C_{q}} (x - c_{q}) {(x - c_{q})}^{T}

(3)

B_{k} = \sum_{q = 1}^{k} n_{q} (c_{q} - c_{D}) {(c_{q} - c_{D})}^{T}

(4)

Here $C_{q}$ represents the set of points in cluster q, $c_{q}$ the centre of cluster q, $c_{D}$ the centre of D, and $n_{q}$ the number of points in cluster q [17].
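For a single numeric variable, such as the mean area per location, the two traces reduce to sums of squared deviations, and the score can be sketched as below. The function name `calinski_harabasz` and the toy values are hypothetical; the sketch assumes at least two clusters and a non-zero within-cluster dispersion.

```python
def calinski_harabasz(values, labels):
    """Calinski-Harabasz score for 1-D values and their cluster labels."""
    n = len(values)
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= k < n clusters")
    overall = sum(values) / n  # centre of the whole dataset
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centre = sum(members) / len(members)
        # Between-cluster dispersion: size-weighted squared centre offsets.
        between += len(members) * (centre - overall) ** 2
        # Within-cluster dispersion: squared offsets from the cluster centre.
        within += sum((v - centre) ** 2 for v in members)
    return (between / within) * (n - k) / (k - 1)

# Toy values (hypothetical) with an obvious two-group structure.
toy = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
good = calinski_harabasz(toy, [0, 0, 0, 1, 1, 1])
bad = calinski_harabasz(toy, [0, 1, 0, 1, 0, 1])
```

The well-separated labelling scores far higher than the interleaved one, which is the behaviour the criterion relies on when candidate values of k are compared.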

2.3.2. Average Silhouette Width Index

For each observation i, the average distance to all objects in the same group as i is obtained and called $a_{i}$. For each observation, we also calculate the average distance to the cases belonging to the other groups, calling this value $b_{i}$. Finally, the silhouette coefficient of any observation, $s_{i}$, is given by [18]

s_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})}
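A minimal sketch of the average silhouette width for a single numeric variable follows. It assumes the common convention that the distance to the other groups is taken for the nearest other group (with two clusters this makes no difference) and that a singleton cluster receives a width of 0. The function names and toy values are hypothetical, and at least two clusters are required.

```python
def silhouette_widths(values, labels):
    """Silhouette coefficient of every observation (1-D, absolute distance)."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # a_i: average distance to the other members of i's own group.
        a_i = sum(abs(values[i] - values[j]) for j in own) / len(own)
        # b_i: smallest average distance to the members of another group.
        b_i = min(
            sum(abs(values[i] - values[j]) for j in members) / len(members)
            for other, members in groups.items()
            if other != lab
        )
        widths.append((b_i - a_i) / max(a_i, b_i))
    return widths

def average_silhouette_width(values, labels):
    widths = silhouette_widths(values, labels)
    return sum(widths) / len(widths)

# Toy values (hypothetical) with two well-separated groups.
toy = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
asw = average_silhouette_width(toy, [0, 0, 0, 1, 1, 1])
```

Values of the average width close to 1 indicate compact, well-separated clusters; evaluating it for several candidate k values and comparing the results is one way to pick k.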

Initially, on a dataset of 26,906 instances, the algorithm calls the partition method on the variable area to form clusters. The area division obtained is further partitioned into a yield-wise group using the Hierarchical Method with linkage criteria.

Given datasets with a very large number of observations, it would be extremely computationally expensive [19] to run the partition and hierarchical methods on every observation. To reduce this computation, and to prevent the observations of one location from splitting up and falling into different clusters [20], the mean() of the observations at each location is used.
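The per-location averaging can be sketched as below. The record layout (a list of dictionaries with a "district" key), the function name `mean_by_location`, and the toy rows are assumptions made for illustration only.

```python
from collections import defaultdict

def mean_by_location(records, variable):
    """Average `variable` over all observations of each district."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in records:
        value = row.get(variable)
        if value is None:  # skip missing entries
            continue
        sums[row["district"]] += value
        counts[row["district"]] += 1
    return {loc: sums[loc] / counts[loc] for loc in sums}

# Toy rows (hypothetical district names and values).
records = [
    {"district": "district_a", "area": 100.0, "yield": 2.0},
    {"district": "district_a", "area": 300.0, "yield": 4.0},
    {"district": "district_b", "area": 50.0, "yield": 1.5},
]
mean_area = mean_by_location(records, "area")
```

Each location then contributes exactly one value to the clustering, so it cannot be assigned to more than one cluster.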

The Agglomerative Clustering algorithm supports four linkage criteria: Complete_linkage, Single_linkage, Average_linkage, and Ward’s method [18,19].

Maximum or Complete_linkage clustering measures the difference between two groups by the largest distance between any two observations, one from each group. It is given in Equation (5) as the distance D(X, Y) between clusters X and Y:

$D (X, Y) = \max_{x \in X, y \in Y} d (x, y)$

(5)

Minimum or Single_linkage clustering measures the difference between two groups by the smallest distance between any two observations, one from each group. It is given in Equation (6) as the distance D(X, Y) between clusters X and Y:

$D (X, Y) = \min_{x \in X, y \in Y} d (x, y)$

(6)

Average_linkage measures the difference between two groups by the average distance between any two observations, one from each group. It is given in Equation (7) as the mean distance between the elements of clusters X and Y:

$D (X, Y) = \frac{1}{|X| \cdot |Y|} \sum_{x \in X} \sum_{y \in Y} d (x, y)$

(7)

Ward’s method aims to minimize the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged. The initial distance between individual points is given in Equation (8) as the squared Euclidean distance:

$d_{i j} = d (\{X_{i}\}, \{X_{j}\}) = {\| X_{i} - X_{j} \|}^{2}$

(8)
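The four criteria can be compared on a small example. The sketch below works on one numeric variable with the absolute difference as d(x, y); the function names and the toy clusters are hypothetical, and only the initial point-to-point distance of Ward's method is shown.

```python
from itertools import product

def pairwise(xs, ys):
    """All distances d(x, y) with x taken from xs and y from ys."""
    return [abs(x - y) for x, y in product(xs, ys)]

def complete_linkage(xs, ys):
    return max(pairwise(xs, ys))   # largest pairwise distance

def single_linkage(xs, ys):
    return min(pairwise(xs, ys))   # smallest pairwise distance

def average_linkage(xs, ys):
    d = pairwise(xs, ys)
    return sum(d) / len(d)         # mean over |X| * |Y| pairs

def ward_initial(x, y):
    return (x - y) ** 2            # squared Euclidean distance of two points

# Toy clusters (hypothetical values).
X = [1.0, 2.0]
Y = [4.0, 7.0]
distances = (complete_linkage(X, Y), single_linkage(X, Y), average_linkage(X, Y))
```

For these clusters the pairwise distances are 3, 6, 2, and 5, giving 6 (complete), 2 (single), and 4 (average).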

All the above-discussed linkage methods are applied to the dataset to determine the yield patterns, and the complete details of the methods are explained in Figure 15 and Figure 16. From Table 6, the highest score instance is considered for model construction.

3. Data Analysis

Data preprocessing is a significant part of the data mining process. Analyzing data that have not been carefully screened for issues such as out-of-range values, unlikely combinations of data, and missing values can yield false results. If such obsolete and redundant data are present, exploration of the data becomes more complicated during the modeling process.

The following are the critical problems recognized during data preprocessing.

3.1. Eliminating the Unwanted Observations

There are a few districts where multi-year data have not been consistent over several years, which could lead to incorrect output because crops in such districts have a lower threshold value (40%). These districts have been identified and removed, as shown in Figure 3.

3.2. Removing the Impossible Data Combination

Table 1 shows that the value “total” is not a season but rather the combining of various seasons in which a specific crop is grown while considering location, production, and output (for example, Kharif + summer). Rows with a cumulative value are highlighted and deleted.

3.3. Fill in the Missing Values

It is important to note that the output column (Figure 1, line 32) has sixty missing values. An in-depth examination of the data in Table 2 reveals that the yield column has zero value, implying that no crop yield occurred. Furthermore, the number of harvesting areas is limited. As a result, there was no agricultural production. In the output, zeros are used to replace the rows with missing values.
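The two row-level cleaning steps described in Sections 3.2 and 3.3 can be sketched together. The record layout, the function name `preprocess`, and the toy rows are assumptions for illustration; in particular, the sketch assumes the column with missing values is the production column and that the cumulative rows carry the season value "total".

```python
def preprocess(rows):
    """Drop cumulative 'total' rows and replace missing production with 0."""
    cleaned = []
    for row in rows:
        # "total" is a sum over seasons, not a season, so the row is removed.
        if row["season"].strip().lower() == "total":
            continue
        fixed = dict(row)  # copy so the input rows are left unchanged
        if fixed.get("production") is None:
            fixed["production"] = 0.0
        cleaned.append(fixed)
    return cleaned

# Toy rows (hypothetical values).
rows = [
    {"season": "kharif", "area": 120.0, "production": 300.0},
    {"season": "rabi", "area": 2.0, "production": None},
    {"season": "total", "area": 122.0, "production": 300.0},
]
cleaned = preprocess(rows)
```

Filling with zero is only reasonable here because, as noted above, the affected rows report no yield.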

3.4. Removing the Impossible Data Mixes

Figure 4 and Figure 16 show a side-by-side comparison of boxplot results for a portion of the region and region results. There are two images in total. The boxplot with the outlier is on the left, and the boxplot without the outlier is on the right. The graphicPlot() function generates a boxplot with parameters. The boxplot is also given a rug, which shows the parameter’s concrete values and its horizontal dotted line at the mean value [18]. We can infer that anomalies have skewed the mean value by matching this dotted line with the inner region of the solid line of the box displaying the median line. Two methods for identifying outliers in districts with defined anomalies are the Boxplot Rule and the Grubbs’ test. As a result, they are capable of detecting and eliminating these anomalies.

4. Results and Discussion

Eight variables are used in the presented work, of which area, production, and yield are numeric variables. To avoid unnecessary analysis, a table of correlated variables is generated (Table 3) [20]. From this table, area and production appear to be highly correlated (about 95%). So, area and yield variables are considered for the analysis.

The shaded density plot on the area in Figure 4 suggests that four districts exhibit multi-modal distributions. From these plots, it will be easier for the two outlier detection methods to identify the outliers correctly.

Two outlier detection methods, the boxplot rule and Grubbs’ test [18], identified one or two outliers in three districts, as shown in Figure 5. The outliers are very few, yet they distort the mean value, so eliminating them will not mislead the analysis. The work divides the yield zones, resulting in an actual clustering problem. Because of the lack of external information, knowing the value of the k parameter in advance is critical for proper groupings.

The partitioning method was used in conjunction with two criteria to estimate the best number of clusters:

Calinski–Harabasz Index (“CH”).
Average Silhouette Width Index (“ASW”).

Both criteria are evaluated over a range of k values to estimate the best k.

“ASW” suggested 5 clusters, and “CH” suggested 10 clusters (Figure 6). “ASW” shows the sign of exemplary cluster configuration. Out of clusters 2, 4, 5, and 6, the k value of 2 is selected, as their cluster criterion values are nearly the same, as shown in Table 4.

Further, a small k makes it easier to rank the clusters. The k value of 10 proposed by the “CH” criterion is not chosen because it indicates data overfitting. After setting the k value to 2, cluster analysis is performed on two features, the mean(area) and the mean(yield) of each location, with Euclidean distance as the metric. We use mean() for the variables (area and yield) because we do not want the locations and their associated observations to split up and fall into different clusters, as this distorts the distribution and leads to false interpretations [20].

The partition method reports its objective function at two phases: the construct phase and the exchange phase [21]. In the construct phase, the algorithm looks for a good initial set of medoids, and in the exchange phase, it tries to fine-tune the initial estimates given by the rough clusters determined in the construct phase. The values of the objective function in Figure 7 (construct phase: 22350; exchange phase: 16357) show that it changed significantly from the construct phase to the exchange phase. The partition method selected two reference medoids, Bangalore_rural and Mandya. After dividing the areas into two groups, area 1 and area 2, these clusters are identified and renamed as small and large areas based on the mean values obtained in Table 5.
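The two phases can be illustrated with a simplified k-medoids sketch for one numeric variable. This is not the implementation used in the study: the function names, the greedy construct step, the first-improvement exchange step, and the toy values are all assumptions made for illustration.

```python
def total_cost(values, medoids):
    """Objective: sum of distances from each value to its nearest medoid."""
    return sum(min(abs(v - values[m]) for m in medoids) for v in values)

def pam(values, k):
    """Return (medoid indices, labels, construct-phase cost, final cost)."""
    n = len(values)
    # Construct phase: greedily add the medoid that lowers the objective most.
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(values, medoids + [i]))
        medoids.append(best)
    construct_cost = total_cost(values, medoids)
    # Exchange phase: swap a medoid with a non-medoid while the objective drops.
    cost = construct_cost
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(values, trial)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    # Assign every value to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda c: abs(v - values[medoids[c]]))
              for v in values]
    return medoids, labels, construct_cost, cost

# Toy mean-area values (hypothetical).
toy = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
medoids, labels, construct_cost, final_cost = pam(toy, k=2)
```

On these toy values the construct phase ends with an objective of 5 and the exchange phase lowers it to 4, mirroring on a small scale the drop between the two reported objective values.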

The second part focuses on the variable “yield”. The density plot shown in Figure 8 indicates multi-modal distribution in one district, Bagalkot. Figure 9 has identified three districts, namely Hassan, Koppal, and Shimoga.

Outlier detection methods have detected two outliers in small areas, as shown in Figure 10, and three outliers in large areas, as shown in Figure 11.

Picking the k value in small and large areas uses the “ASW” metric along with hierarchical linkage criteria such as complete_linkage, average_linkage, single_linkage, and ward.D2 to estimate the best k. A small area picks up two clusters, whereas a large area picks up four clusters. The resultant scores, thus obtained, are presented in Table 6 [18].

After determining the best k value, cluster analysis is carried out on small and large areas using Agglomerative nesting (Agnes). Figure 12 depicts the results obtained from Agnes. Lines 2 and 11 specify the call to the Agnes function. The first argument to this function is the Euclidean distance matrix of variable yield from a small and large area cluster. On the other hand, the second argument specifies the criterion in this case as “average” to select the two groups for merging at each step. Lines 3 and 12 define the agglomerative coefficient, quantifying the amount of clustering structure discovered (values closer to 1 suggest a strong clustering structure).

We obtained 0.88 and 0.93 for the small and large areas, respectively, indicating that we have a fairly reasonable clustering structure for both groups. Lines 4 and 13 specify the order of objects, i.e., a vector containing a permutation of the original observations to allow plotting. Lines 6 and 15 define the height, a vector containing the distances between merging clusters at each stage.
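The agglomerative coefficient can be illustrated with a naive average-linkage sketch on one numeric variable. For each observation, the height of its first merge is divided by the height of the final merge, and the coefficient is the average of one minus that ratio. The function name `agnes_average` and the toy values are hypothetical; the sketch is not optimized and assumes the values are not all identical.

```python
def agnes_average(values):
    """Average-linkage agglomerative clustering of 1-D values.

    Returns (merge heights in order, agglomerative coefficient).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    clusters = [[i] for i in range(n)]
    first_merge = [None] * n  # height at which each observation first merges
    heights = []

    def avg_dist(a, b):
        total = sum(abs(values[i] - values[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Closest pair of clusters under average linkage.
        h, p, q = min(
            (avg_dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        for i in clusters[p] + clusters[q]:
            if first_merge[i] is None:
                first_merge[i] = h
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
        heights.append(h)

    final_height = heights[-1]
    coefficient = sum(1 - m / final_height for m in first_merge) / n
    return heights, coefficient

# Toy mean-yield values (hypothetical).
toy = [1.0, 2.0, 4.0, 10.0, 11.0]
heights, ac = agnes_average(toy)
```

For these toy values the coefficient is about 0.84; as in the text, values closer to 1 suggest a stronger clustering structure.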

A dendrogram is a tree-representation diagram. This diagrammatic representation, which is widely used in hierarchical clustering, shows how the clusters generated by the relevant analysis are arranged.

In this research, we used visual criteria, e.g., Average Silhouette Width(ASW) Index, Calinski-Harabasz(CH) Index, and hierarchical linkage criteria such as complete_linkage, average_linkage, single_linkage and ward.D2 to estimate the best k. Basically, we want to know how well the original distance matrix is approximated in the cluster space, so a measure of the cophenetic correlation is also useful. The concordance with Ward.D2 hierarchical clustering gives an idea of the stability of the cluster solution. In the current research two clusters, which were found in Figure 6, were used for the investigation, as shown in Figure 13.

A dendrogram is plotted and compared to a banner plot to determine whether the data have been properly clustered. Figure 13, Figure 14, Figure 15 and Figure 16 give comparative interpretations of the dendrogram and banner plot. The banner plot’s x and y axes represent the height and order in which objects are clustered in Figure 14.

The banner plot’s white area (Figure 14) dictates the unclustered data [22]. On the other hand, the white lines (Figure 14) indicate the red blocks where the clusters were arranged. As a result, it can be seen that objects 9 and 13 have a larger bar than objects 1 and 9. Moreover, in the objects between 2 and 5, there is no bar at all, and this indicates that objects numbered from 1, 9, so on to 5 belong to one cluster, which is completely dissimilar to the objects 2, 12, so on to 15 which belong to another cluster [23]. The dendrogram shows a similar pattern, as shown in Figure 13.

The unclustered data are determined by the white area of the banner plot (Figure 16). The white lines (Figure 16) represent the red blocks where the clusters were arranged. As a result, objects 1 and 7 have a larger bar than objects 1 and 3. Furthermore, there is no bar between objects 3 and 2, indicating that objects 1, 7, and 3 belong to one cluster, while objects 2 and 9 belong to a different cluster. Figure 15 depicts a similar pattern in the dendrogram.

After dividing the districts in the small and large areas into different clusters, the two clusters of small areas, zone 1 and zone 2, are identified and renamed as low and high yield districts based on the pie plot obtained in Table 7. Similarly, for the four clusters from large areas, zone 1 to zone 4, it is concluded that zone 1 is low yield, zone 2 is high yield, and zone 3 and zone 4 are moderate yield districts. The results of average yield distributions of small and large areas are presented in Figure 17 and Figure 18, respectively.

Table 8 shows that the rice crops are cultivated and harvested during all three seasons. Moreover, larger areas are allocated to this crop for cultivation during the kharif season. On top of that, this season has a high impact on production compared to the summer season, followed by the Rabi season.

5. Conclusions

The research emphasizes the importance of prioritizing relevant crops for different districts, eliminating risks and investments to achieve yield benefits rather than sowing or harvesting flop crops (crops that are not relevant to the regions). The work extends to help government sectors and farmers by providing detailed information on the type of seasons and crops that have a high impact on yield and production, and also by resolving issues related to agricultural activities in weaker locations. Two dataset inspection methods (Probability Density Function and Boxplot) are used to identify multi-modal distributions and to identify and eliminate outliers. The research was conducted on the agricultural plains of Karnataka and used the dataset furnished by the Ministry of Agriculture and Farmers Welfare, Karnataka, India, and ICAR-Taralabalu Krishi Vigyan Kendra, Davanagere, Karnataka, India. The proposed work is carried out on linkage parameters, along with two heuristic methods and two clustering algorithms (hierarchical and partition), which are used to estimate the best value of k. In addition, dendrograms are created and compared to the banner plot to check that the cluster formation is accurate. The algorithm designed in this work computes the mean score on the area and yield variables, which allows fast execution and assists in carrying out precise analysis to draw the best yield patterns that can be recommended. Finally, based on economic significance, the work recommends the best crops by eliminating risks and investments.

Author Contributions

Conceptualization, H.L., K.R.N.K. and B.R.S.; data curation, T.S. and A.A.A.; formal analysis, H.L. and K.R.N.K.; funding acquisition, H.L.; investigation, B.R.S. and A.K.H.; methodology, K.R.N.K., B.R.S. and A.K.H.; supervision, B.R.S.; validation, H.F.M.L., T.S. and A.A.A.; writing—original draft, K.R.N.K.; writing—review and editing, H.L., K.R.N.K. and B.R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was funded by the Institutional Fund Projects under grant no. (IFPIP: 959-611-1443). The authors gratefully acknowledge the technical and financial support provided by the Ministry of Education and King Abdulaziz University, DSR, Jeddah, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data were obtained from ICAR-Taralabalu Krishi Vigyan Kendra and are available at www.taralabalukvk.com (accessed on 17 August 2022).

Acknowledgments

The authors also thank Bapuji Institute of Engineering & Technology (BIET) for providing support and infrastructure.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Summary of the dataset used in the proposed work.

Figure 2. Data flow diagram for determination of yield trend.

Figure 3. Removal of district observations with a lower threshold value.

Figure 4. Shaded density plot—Area.

Figure 5. Boxplot with outliers—Area.

Figure 6. Plot of CH vs. ASW—Area.

Figure 7. Partition object—Area.

Figure 8. Shaded density plot—small area.

Figure 9. Shaded density plot—large area.

Figure 10. Boxplots with outliers—small area.

Figure 11. Boxplots with outliers—large area.

Figure 12. Agnes objects of small and large areas—yield.

Figure 13. Dendrogram plot for a small area.

Figure 14. Banner plot for a small area.

Figure 15. Dendrogram plot—large area.

Figure 16. Banner plot—large area.

Figure 17. Average yield distribution—small area.

Figure 18. Average yield distribution—large area.

Table 1. Impossible combination in the season column.

Crop	Year	Season	Area	Production	Yield
Rice	1998–1999	kharif	197	316	1.6
Rice	1999–2000	kharif	128	202	1.58
Rice	2000–2001	kharif	171	311	1.82
Rice	2001–2002	kharif	171	411	2.4
Rice	2001–2002	summer	13	19	1.46
Rice	2001–2002	total	184	430	2.34
Rice	2002–2003	kharif	112	230	2.05
Rice	2002–2003	summer	15	16	1.07
Rice	2002–2003	total	127	246	1.94
Rice	2003–2004	kharif	93	210	2.26
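
As a minimal sketch of why the rows in Table 1 are problematic, the snippet below checks that each "total" row is simply the sum of the seasonal rows for the same crop and year, and then drops those aggregate rows so that area and production are not counted twice. It assumes that "total" is the season value being removed; the tuple layout and function names are illustrative, and years are written with ASCII hyphens.

```python
# Rows transcribed from Table 1: (crop, year, season, area, production).
ROWS = [
    ("Rice", "2001-2002", "kharif", 171, 411),
    ("Rice", "2001-2002", "summer", 13, 19),
    ("Rice", "2001-2002", "total", 184, 430),
    ("Rice", "2002-2003", "kharif", 112, 230),
    ("Rice", "2002-2003", "summer", 15, 16),
    ("Rice", "2002-2003", "total", 127, 246),
]

def total_matches_seasons(rows, crop, year):
    """Return True when the 'total' row equals the sum of the seasonal rows."""
    group = [r for r in rows if r[0] == crop and r[1] == year]
    seasonal = [r for r in group if r[2] != "total"]
    totals = [r for r in group if r[2] == "total"]
    if len(totals) != 1 or not seasonal:
        return False
    area = sum(r[3] for r in seasonal)
    production = sum(r[4] for r in seasonal)
    return (area, production) == (totals[0][3], totals[0][4])

def drop_total_rows(rows):
    """Keep only genuine seasons so that aggregates are not double counted."""
    return [r for r in rows if r[2] != "total"]

cleaned = drop_total_rows(ROWS)
print(total_matches_seasons(ROWS, "Rice", "2001-2002"))
print(len(ROWS), "rows before,", len(cleaned), "rows after")
```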

Table 2. Missing values in the production column.

Crop	Year	Season	Area	Production
moong	2015–2016	kharif	1	NA
linseed	2016–2017	rabi	2	NA
urad	2005–2006	kharif	2	NA
cowpea	2015–2016	summer	1	NA
rapeseed	2015–2016	kharif	2	NA
urad	2016–2017	kharif	1	NA

Table 3. Correlation between feature variables.

var1	var2	cor
production	area	95%
yield	production	36%
yield	area	26%
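
A correlation of the kind reported in Table 3 can be computed as sketched below, assuming the Pearson coefficient is the measure used. The function name is illustrative, and the inputs are only the seasonal rice rows of Table 1, so the result is not expected to reproduce the figures in Table 3, which are based on the full dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)

# Area and production of the seasonal rice rows in Table 1 ("total" rows excluded).
area = [197, 128, 171, 171, 13, 112, 15, 93]
production = [316, 202, 311, 411, 19, 230, 16, 210]
r = pearson(area, production)
print(f"area vs. production: {r:.0%}")
```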

Table 4. ASW crit scores for the variable area.

clust:	2	3	4	5	6
crit:	0.6	0.6	0.7	0.7	0.7
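
For readers unfamiliar with the score in Table 4, the following is a minimal sketch of the average silhouette width for one-dimensional data with absolute difference as the distance. It is not the implementation used in the study; the function name, the singleton convention (score 0), and the sample values are assumptions made for illustration.

```python
def average_silhouette_width(points, labels):
    """Mean silhouette over all points (1-D data, absolute-difference distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)

# Hypothetical values forming two well-separated groups.
values = [1, 2, 10, 11]
good = average_silhouette_width(values, [0, 0, 1, 1])
bad = average_silhouette_width(values, [0, 1, 0, 1])
print(round(good, 2), round(bad, 2))
```

A width close to 1 indicates compact, well-separated clusters, while values near or below 0 indicate points assigned to the wrong cluster.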

Table 5. Min to max cluster values for the variable area.

Clust	Min	Max	Mean
area 1	4	92,740	13,284
area 2	21,625	201,286	80,863

Table 6. ASW scores for the yield variable in small and large areas, by linkage method.

k	Complete (Small)	Complete (Large)	Single (Small)	Single (Large)	Average (Small)	Average (Large)	ward.D2 (Small)	ward.D2 (Large)
2	0.55	0.58	0.39	0.41	0.62	0.55	0.55	0.58
3	0.47	0.59	0.34	0.53	0.54	0.53	0.47	0.59
4	0.41	0.71	0.46	0.42	0.48	0.71	0.41	0.71
5	0.48	0.68	0.43	0.68	0.52	0.68	0.48	0.68
6	0.43	0.61	0.32	0.61	0.43	0.61	0.43	0.65
7	0.43	0.62	0.35	0.56	0.41	0.62	0.43	0.62
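
One way to read Table 6 is to pick, for each linkage method and area group, the k with the highest ASW. The sketch below does this on the values transcribed from the table; the dictionary layout and function name are illustrative, and the study may weigh other criteria (such as the Calinski–Harabasz index) when fixing k.

```python
# ASW scores transcribed from Table 6; list position 0 corresponds to k = 2.
KS = [2, 3, 4, 5, 6, 7]
ASW = {
    "complete": {"small": [0.55, 0.47, 0.41, 0.48, 0.43, 0.43],
                 "large": [0.58, 0.59, 0.71, 0.68, 0.61, 0.62]},
    "single":   {"small": [0.39, 0.34, 0.46, 0.43, 0.32, 0.35],
                 "large": [0.41, 0.53, 0.42, 0.68, 0.61, 0.56]},
    "average":  {"small": [0.62, 0.54, 0.48, 0.52, 0.43, 0.41],
                 "large": [0.55, 0.53, 0.71, 0.68, 0.61, 0.62]},
    "ward.D2":  {"small": [0.55, 0.47, 0.41, 0.48, 0.43, 0.43],
                 "large": [0.58, 0.59, 0.71, 0.68, 0.65, 0.62]},
}

def best_k(scores, ks=KS):
    """Return (k, score) for the highest score; ties go to the smallest k."""
    top = max(scores)
    return ks[scores.index(top)], top

for linkage, groups in ASW.items():
    for group, scores in groups.items():
        k, score = best_k(scores)
        print(f"{linkage:9s} {group:5s} best k = {k} (ASW = {score})")
```

On these values the small-area scores peak at k = 2 for three of the four linkages and the large-area scores peak at k = 4 for three of the four, which is consistent with the two and four zones listed in Table 7.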

Table 7. Minimum, maximum, and mean yield per cluster for small and large areas.

Small-area cluster	Min	Max	Mean	Large-area cluster	Min	Max	Mean
zone 1	0.4	7.62	3.19	zone 1	1.65	6.1	4.7
zone 2	2.73	11.6	7.17	zone 2	4.04	9.29	6.18
-	-	-	-	zone 3	5.16	10.9	7.38
-	-	-	-	zone 4	5.62	12.4	8.65

Table 8. Season-wise mean values for the rice crop, as an example.

Area (in hectares)		Production (in tons)		Yield (in tons per hectare)
Season: Kharif
mean:	37,112	mean:	95,330	mean:	2.47
Season: Rabi
mean:	3240.6	mean:	7468.4	mean:	2.4
Season: Summer
mean:	9845.6	mean:	30,887	mean:	2.9

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
