1. Introduction
The quality assessment of wine is traditionally determined by human experts, who rate it on a scale from 0 to 10. While this subjective measure is widely accepted, it is inherently limited by the constraints of human perception. The primary aim of this study is to explore the potential for predictive modeling to quantify wine quality based on measurable physicochemical properties. This approach provides a reproducible framework that is not only useful for understanding quality determinants but also enables the prediction of hypothetical wine compositions that might achieve the highest possible quality scores. These predictions aim to guide future experimentation and quality enhancement efforts in the wine industry.
The predictive modeling of wine quality aligns with broader trends in food and beverage research, where machine learning techniques have been increasingly employed to understand consumer preferences [1]. For instance, Yu et al. [2] utilized a hybrid partial least squares–artificial neural network (PLS-ANN) model to predict consumer liking scores for green tea beverages. Similarly, Sudha et al. [3] developed a fermented millet sprout milk beverage by combining physicochemical studies with consumer acceptability data.
We began by analyzing the 11 variables in the dataset, representing various physicochemical properties of the wines. To address skewness in the data, which can adversely affect the performance of predictive models, we applied log transformations to the skewed variables. This preprocessing step stabilized variance and facilitated more linear relationships among the variables.
To investigate the relationships between the physicochemical properties and wine quality, we initially employed a linear regression model. However, recognizing that wine quality is unlikely to have a purely linear relationship with its underlying properties, we expanded our analysis to explore quadratic and cubic relationships. Specifically, we fit second-degree and third-degree polynomial models to capture potential nonlinearities and interactions among the variables.
Through this comprehensive exploration, we found that the relationship between the variables and wine quality was best captured by a combination of linear, quadratic, and logarithmic terms. Building on this insight, we optimized the model by evaluating all possible subsets of the 11 variables—2047 combinations in total. For each subset, we calculated the performance of the second-degree polynomial model, splitting the dataset into a training sample (80%) and a test sample (20%) to ensure robustness and reproducibility. Using a random state of 42, we ensured consistent results across experiments.
Our analysis revealed that comparable accuracy could be achieved with a reduced set of seven variables, leading to the selection of an optimized second-degree polynomial ridge regression model. This approach balanced predictive accuracy with simplicity, reducing the risk of overfitting by minimizing the number of input variables.
The predictive modeling of wine quality also extends beyond traditional applications by exploring hypothetical compositions. While the dataset includes wines rated up to a maximum score of 8, we extended our analysis to predict quality scores approaching 10. These exploratory predictions aim to identify theoretical compositions that could guide future experimentation, offering new perspectives on the complex interplay of variables that contribute to wine quality. We explicitly acknowledge the limitations of extrapolating beyond the observed data range and emphasize that these predictions are intended as a foundation for further validation rather than definitive outcomes.
This study focuses on developing a robust and interpretable predictive model for wine quality based on physicochemical properties. While comparisons with other machine learning algorithms are included, they are presented to validate the effectiveness of the proposed model and highlight its unique contributions. The primary goal is not to provide a comprehensive tutorial on algorithmic capabilities, but to showcase how mathematical modeling can offer actionable insights into wine quality and inspire further research in this domain.
In summary, this study demonstrates how polynomial ridge regression can serve as an effective tool for wine quality prediction, providing actionable insights and a framework for exploring quality enhancements. By extending these methods to hypothetical compositions, we hope to inspire future applications in other domains, such as food recipe optimization, and contribute to advancements in the understanding of product quality. Future work should include validation of the proposed methodology with winemakers and experts to ensure practical applicability and alignment with industry standards. Such validation could not only refine the model but also provide additional insights into the sensory and contextual factors that influence wine quality, ultimately strengthening the study’s contributions to both research and industry.
2. Background
A paper by K. R. Dahal et al. [4], titled “Prediction of Wine Quality Using Machine Learning Algorithms”, was published in the Open Journal of Statistics on 18 March 2021. The paper explores the use of machine learning algorithms to predict the quality of wine based on various parameters. The authors compare the performance of four different ML models: ridge regression (RR), support vector machine (SVM), gradient boosting regressor (GBR), and a multi-layer artificial neural network (ANN). They find that the GBR model performs best, with a mean squared error (MSE), correlation coefficient (R), and mean absolute percentage error (MAPE) of 0.3741, 0.6057, and 0.0873, respectively. The paper demonstrates how statistical analysis can help identify key components influencing wine quality before production, aiding manufacturers in improving quality [5].
In the introduction, the authors highlight the importance of wine quality for both consumers and producers. Historically, wine quality was tested after production, which was costly if the quality was poor. With advancements in technology, manufacturers started testing during production, saving time and money. Machine learning has been used to determine wine quality using available data.
The data description and preprocessing section explains that the study uses the red wine dataset from the UCI Machine Learning Repository, containing 11 physicochemical properties and sensory scores from blind taste testers. The authors analyze the Pearson correlation coefficient to identify significant variables affecting quality, noting that alcohol has the highest correlation (0.435). They also address outliers and feature scaling before training the models.
In the methodology section, the paper describes the four algorithms used: ridge regression (RR), which is similar to linear regression but with a shrinkage penalty; support vector machine (SVM), which is kernel-based regression using the radial basis kernel (RBF); gradient boosting regressor (GBR), which is an ensemble algorithm building sequential weak learners; and artificial neural network (ANN), which is composed of layers of neurons using nonlinear transformations.
The results and discussion section reports that the authors use metrics such as MSE, MAPE, and R to evaluate the models. GBR performs best on the test dataset, with the highest R and lowest MSE and MAPE. ANN underperforms due to the small, skewed dataset. The authors identify alcohol and sulfates as the most important features controlling wine quality.
In conclusion, the paper demonstrates the effectiveness of machine learning in predicting wine quality, with GBR being the best-performing model. The authors conclude that machine learning provides an alternative approach to determining wine quality and screening key variables before production.
A paper by Terry Hui-Ye Chiu et al. [6], titled “A Generalized Wine Quality Prediction Framework by Evolutionary Algorithms”, was published in The International Journal of Interactive Multimedia and Artificial Intelligence on 21 April 2021. The paper focuses on developing a framework that combines different classifiers and their hyperparameters using genetic algorithms to predict wine quality. This approach addresses the variability in wine datasets and offers a robust method for wine quality prediction. The authors propose a hybrid model that evolves through genetic operations to optimize prediction performance.
To evaluate their approach, the authors conducted experiments on wine datasets and demonstrated the effectiveness of the proposed method. The framework allows for automatic discovery of suitable classifiers and hyperparameters, which is crucial for optimizing the prediction results. The results showed that the proposed approach performed better than several other models, highlighting its utility in predicting wine quality effectively.
A paper by Yogesh Gupta [7], titled “Selection of Important Features and Predicting Wine Quality Using Machine Learning Techniques”, was published in Procedia Computer Science on 31 December 2017. The paper explores the use of machine learning algorithms to predict wine quality using various features. The author examines the dependency of wine quality on different physicochemical characteristics and employs linear regression, neural networks, and support vector machines for the analysis. The study shows that by selecting important features, better prediction results can be achieved.
A paper by Piyush Bhardwaj et al. [8], titled “Machine learning application in wine quality prediction”, was published in the journal Machine Learning with Applications on 28 January 2022. The paper focuses on predicting wine quality using machine learning techniques with both synthetic and experimental data. The authors collected 18 Pinot noir wine samples with 54 different characteristics from diverse regions in New Zealand. They utilized synthetic data and various machine learning models, including Adaptive Boosting (AdaBoost), Random Forest (RF), and gradient boosting (GBOOST), among others. The AdaBoost classifier showed 100% accuracy in predicting wine quality. The study demonstrates that machine learning can effectively predict wine quality, particularly for New Zealand Pinot noir wines, by focusing on essential variables.
A paper by Amalia Luque et al. [9], titled “Determining the Importance of Physicochemical Properties in the Perceived Quality of Wines”, was published in IEEE Access on 18 October 2023. The paper explores how the quality of wine, which holds significant economic, nutritional, and cultural value, can be improved by understanding the impact of physicochemical properties on perceived quality.
The authors used several metrics to analyze the importance of different wine attributes, including a novel metric based on the Jensen–Shannon Divergence (JSD). They found that JSD performed better than previous metrics and demonstrated that the main physicochemical attributes influencing red wine quality were citric acidity, alcohol, sulfates, and fixed acidity, while for white wine, the key attributes were alcohol, free sulfur dioxide, and pH.
In addition, other studies have explored consumer preferences in food and beverage contexts, such as Yu et al. [2], who developed a partial least squares–artificial neural network (PLS-ANN) model to predict consumer liking scores for green tea beverages, and Sudha et al. [3], who integrated physicochemical properties with acceptability data for fermented millet sprout milk beverages. These studies highlight the broad applicability of machine learning models in predicting consumer-relevant metrics.
Knowledge Gaps and Study Motivation
While previous studies have demonstrated the utility of machine learning models in predicting wine quality, several gaps remain unaddressed. For example, Dahal et al. [4] and Bhardwaj et al. [8] achieved high predictive accuracies but did not explore the potential of these models to predict wine qualities beyond the observed range of scores. Similarly, Chiu et al. [6] focused on optimizing classifier performance but did not examine the interpretability of their models in relation to the physicochemical attributes that influence quality. Furthermore, while Gupta [7] and Luque et al. [9] emphasized feature importance, they did not explore systematic methods to optimize the feature selection process for simpler, more interpretable models.
These limitations highlight the need for a methodological framework that not only predicts wine quality but also extends its applicability to hypothetical scenarios and provides interpretable insights into the contributions of physicochemical properties. This study aims to address these gaps by introducing a polynomial ridge regression model optimized through exhaustive feature combination analysis. Additionally, the study explores the possibility of predicting hypothetical quality scores, thereby expanding the scope of existing research.
3. Methodology
3.1. Wine Dataset
This paper utilizes the well-known datasets provided by the University of California Irvine (UCI) Machine Learning Repository [10], with a particular focus on the Wine Quality Dataset [11]. This dataset includes entries from two subsets of Vinho Verde wines originating from the north of Portugal, categorized into red and white varieties. Our analysis focuses primarily on the subset of 1599 red wine samples, whose physicochemical features are presented in Table 1. Each sample is described by 11 distinct physicochemical traits and assigned an integer quality rating ranging from 0 (very poor quality) to 10 (excellent quality).
3.2. References to Data Processing Methods
The methodological framework of this study involves the application of several data processing and modeling techniques, including ridge regression, polynomial regression, neural networks, Random Forest, and Principal Component Analysis (PCA). To provide readers with resources for the further exploration of these techniques, we have included references to foundational studies:
Ridge Regression and Polynomial Regression: Hoerl, A. E., & Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 1970, pp. 55–67. [12]
Neural Networks: Rumelhart, D. E., Hinton, G. E., & Williams, R. J. Learning Representations by Back-Propagating Errors. Nature, 323(6088), 1986, pp. 533–536. [13]
Random Forests: Breiman, L. Random Forests. Machine Learning, 45(1), 2001, pp. 5–32. [14]
Principal Component Analysis (PCA): Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine, 2(11), 1901, pp. 559–572. [15]
These foundational works provide a detailed theoretical basis for the methods employed and serve as valuable resources for readers interested in the underlying principles of data processing and modeling techniques.
3.3. Methodology Overview
This study follows a systematic methodology to process, analyze, and model the wine quality dataset. The complete workflow is visually represented in
Figure 1, which illustrates the preprocessing steps, feature engineering, model training, and final selection.
The methodology consists of the following key stages:
3.3.1. Data Preprocessing and Feature Engineering
The initial phase involves loading the wine quality dataset, handling missing data, and standardizing features through normalization. Following this, feature expansion is performed via polynomial transformation, generating first-degree (original), second-degree (squared terms), and third-degree (cubed terms) features.
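To make this step concrete, the following is a minimal sketch of the preprocessing and polynomial feature expansion using scikit-learn. It assumes the UCI red wine file winequality-red.csv (semicolon-separated) is available locally and that column names follow the UCI dataset; these details are assumptions rather than specifications from the text.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Load the UCI red wine dataset (semicolon-separated file).
df = pd.read_csv("winequality-red.csv", sep=";")
X = df.drop(columns="quality")
y = df["quality"]

# Standardize the 11 physicochemical features to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Expand features with polynomial terms; degree=2 adds squared terms and
# pairwise interactions, degree=3 would add cubic terms as well.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_scaled)
print(X_poly.shape)  # (1599, 77) for 11 original features at degree 2
```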
3.3.2. Model Training and Evaluation
After feature expansion, multiple models are trained, including linear regression, ridge regression, and polynomial variants of these models. Each model is evaluated using Root Mean Square Error (RMSE) and accuracy metrics to assess predictive performance.
3.3.3. Feature Selection and Optimized Model
To refine the model, a feature selection process evaluates all 2047 possible combinations of the 11 wine characteristics. The optimal subset of features is determined using second-degree ridge regression, balancing accuracy and model simplicity. The best-performing equation is identified and validated against various machine learning models such as XGBoost, Random Forest, neural networks, and support vector machines.
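A minimal sketch of this exhaustive subset search is shown below, assuming scikit-learn, the same winequality-red.csv file as above, and a default ridge penalty (alpha = 1.0, which the text does not specify). RMSE on the 20% hold-out set is used as the selection criterion here for illustration.

```python
from itertools import combinations

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

df = pd.read_csv("winequality-red.csv", sep=";")
features = [c for c in df.columns if c != "quality"]
y = df["quality"]

results = []
# Enumerate every non-empty subset of the 11 features: 2^11 - 1 = 2047 subsets.
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        X = df[list(subset)]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=2, include_bias=False),
            Ridge(alpha=1.0),
        )
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        results.append((rmse, subset))

best_rmse, best_subset = min(results)
print(f"Best RMSE {best_rmse:.3f} with features: {best_subset}")
```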
3.3.4. Final Model and Prediction
Some models lack extrapolation capability, making them unsuitable for predicting wine quality outside the training range. Therefore, the best second-degree ridge regression model is selected as the final predictive model. This model is then used to predict the highest-quality wine and analyze the statistical significance of the selected features.
3.4. Exploratory Data Analysis
We conducted exploratory data analysis (EDA) to investigate the relationships between the different features in the Portuguese red wine dataset. This subsection provides detailed explanations of the statistical tests applied, as well as graphical representations summarizing the data.
3.4.1. Descriptive Statistics
Descriptive statistics provide a summary of the dataset. As shown in
Table 2, the dataset contains a total of 1599 samples with a variety of chemical properties (e.g., fixed acidity, volatile acidity, alcohol, etc.). The mean alcohol content is 10.42%, and the average wine quality is 5.64. Each feature varies significantly in range, as evident from the minimum, maximum, and standard deviation values.
3.4.2. Correlation Analysis
The correlation matrix in Figure 2 shows the pairwise Pearson correlation coefficients between the features. Pearson’s correlation coefficient r is calculated as
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}},$$
where $X$ and $Y$ are the variables and $\bar{X}$ and $\bar{Y}$ are their means.
It can be observed that alcohol has a moderately positive correlation with wine quality (0.48), whereas volatile acidity shows a negative correlation with wine quality (−0.39). The strong positive correlation between fixed acidity and citric acid (0.67) suggests that these two features are closely related.
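As an illustration, the correlation matrix can be reproduced with pandas, again assuming the winequality-red.csv file and UCI column names.

```python
import pandas as pd

df = pd.read_csv("winequality-red.csv", sep=";")

# Pairwise Pearson correlation coefficients between all features and quality.
corr = df.corr(method="pearson")

# Correlations of each physicochemical property with the quality score,
# sorted from most positive to most negative.
print(corr["quality"].drop("quality").sort_values(ascending=False))
```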
3.4.3. Variance Inflation Factor (VIF)
The variance inflation factor (VIF) is a statistical measure used to detect multicollinearity in regression analysis. It quantifies the extent to which the variance of an estimated regression coefficient increases due to correlations among predictors. A VIF value greater than 10 is often considered indicative of severe multicollinearity, while values exceeding 5 may indicate potential concerns [16,17,18,19]. The VIF is computed as the ratio of the variance of the estimated coefficients in a model to the variance of the coefficients in a hypothetical model devoid of multicollinearity, effectively measuring the inflation in variance caused by multicollinearity among predictors [20,21]. Researchers typically evaluate VIF values in conjunction with tolerance statistics, where a tolerance below 0.1 signifies multicollinearity issues [17,19,22]. Consequently, VIF serves as a vital diagnostic tool in regression modeling, enhancing the reliability of statistical inferences drawn from the data [23,24].
To check for multicollinearity, we calculated the VIF for each feature. The VIF for predictor $j$ is given by
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the coefficient of determination of a regression of predictor $j$ on all other predictors. The results, displayed in Table 3, show that fixed acidity and density have relatively high VIF values of 7.77 and 6.34, respectively, suggesting that these variables may be correlated with others in the dataset.
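A minimal sketch of the VIF computation with statsmodels follows, under the same dataset assumptions; a constant column is added so that the auxiliary regressions include an intercept.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("winequality-red.csv", sep=";")
X = df.drop(columns="quality")

# statsmodels regresses each predictor on all the others; the constant column
# keeps those auxiliary regressions from being forced through the origin.
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
    name="VIF",
)
print(vif.sort_values(ascending=False))
```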
3.4.4. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a widely utilized statistical technique for dimensionality reduction, particularly in the fields of data analysis and machine learning. It transforms high-dimensional data into a lower-dimensional form while preserving as much variance as possible. This is achieved by identifying the principal components, which represent the directions of maximum variance within the dataset. PCA is especially effective in simplifying complex datasets, facilitating the visualization and interpretation of the underlying structure of the data [25,26].
In practical applications, PCA is employed across various domains, including image processing, where it assists in feature extraction and face recognition by reducing the dimensionality of image data while retaining essential features [27,28,29]. Furthermore, PCA can enhance the performance of machine learning algorithms by alleviating the curse of dimensionality, thereby improving computational efficiency and model accuracy [30,31]. Its versatility and effectiveness make PCA a fundamental tool in data science and analytics. We calculate the principal components by solving the eigenvalue decomposition problem of the covariance matrix $\Sigma$, such that
$$\Sigma v = \lambda v,$$
where $v$ are the eigenvectors (principal components) and $\lambda$ are the eigenvalues (variance captured by each component). PCA identifies the directions (components) in which the data varies the most, with the first few components typically capturing most of the variability in the dataset.
In our analysis, the first two principal components (PC1 and PC2) were extracted, which are linear combinations of the original features in the dataset. These components can be described as follows:
First Principal Component (PC1): This is the linear combination of the original features that captures the maximum variance in the dataset. Essentially, PC1 represents the direction in the multi-dimensional feature space along which the data points exhibit the greatest spread or variation. Mathematically, it can be expressed as
$$\mathrm{PC1} = w_{1,1}x_1 + w_{1,2}x_2 + \cdots + w_{1,11}x_{11},$$
where $w_{1,1}, \ldots, w_{1,11}$ are the weights (or loadings) assigned to each original feature in forming the first principal component.
Second Principal Component (PC2): This is the linear combination of the original features that captures the next highest amount of variance, subject to the constraint that it is orthogonal (uncorrelated) to PC1. PC2 represents the direction of the second greatest spread or variation in the data. It is calculated similarly to PC1 but with a different set of weights:
$$\mathrm{PC2} = w_{2,1}x_1 + w_{2,2}x_2 + \cdots + w_{2,11}x_{11}.$$
The loadings ($w_{i,j}$) in the equations above indicate the contribution of each feature to the corresponding principal component. For example, a high absolute value of $w_{i,j}$ signifies that the feature has a significant influence on the corresponding component. These loadings are distinct from the scores, which represent the projection of the original data points onto the principal components.
The scatter plot in
Figure 3 illustrates the PCA scores for the first two principal components, PC1 and PC2, for the wine dataset. Each data point represents a sample wine, and the coordinates of these points correspond to their scores on PC1 and PC2. The scores are calculated by multiplying the original feature values with the loadings. The points are colored according to their quality scores, ranging from 3 (lowest quality) to 8 (highest quality), as evaluated by expert tasters based on 11 physicochemical parameters. This visualization highlights how the samples are distributed in the reduced two-dimensional space.
The overlapping regions in the plot suggest that there is no clear separation between high- and low-quality wines based solely on these two components, indicating that wine quality may depend on more complex interactions between features. The visual patterns in this plot provide insights into the relationships among the features and their impact on wine quality.
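A minimal sketch of the PCA projection and score plot with scikit-learn and matplotlib is given below, under the same file and column-name assumptions as before.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("winequality-red.csv", sep=";")
X = StandardScaler().fit_transform(df.drop(columns="quality"))

# Project the 11 standardized features onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatter plot of the PCA scores, colored by the expert quality rating.
plt.scatter(scores[:, 0], scores[:, 1], c=df["quality"], cmap="viridis", s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="quality")
plt.show()
```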
3.4.5. Distribution and Normality Tests
Figure 4 illustrates the distribution of all features in the dataset through histograms combined with kernel density estimation (KDE). KDE estimates the probability density function of a random variable. We performed normality tests using the Shapiro–Wilk test, where the test statistic W is defined as
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2},$$
where $x_{(i)}$ are the ordered data points and $a_i$ are constants derived from the covariance matrix of the order statistics of a normal sample. The Shapiro–Wilk test evaluates whether a sample comes from a normally distributed population, with higher values of W indicating closer conformity to a normal distribution.
The Shapiro–Wilk test is a widely utilized statistical method for assessing the normality of data distributions. Developed by Samuel Shapiro and Martin Wilk in 1965, it is particularly effective for small sample sizes, typically recommended for samples of fewer than 50 observations [32,33]. The test operates by comparing the observed distribution of data to a normal distribution, calculating a W statistic that reflects how well the data conform to normality [34].
Research indicates that the Shapiro–Wilk test has superior power compared to other normality tests, such as the Kolmogorov–Smirnov and Anderson–Darling tests, particularly in detecting deviations from normality due to skewness or kurtosis [35,36]. Its application is crucial in various fields, including psychology and medicine, where normality assumptions underpin many statistical analyses [33,37]. However, it is essential to interpret the results cautiously, as the test can be sensitive to sample size and outliers [38,39].
The results of the Shapiro–Wilk test indicate that almost all features in the dataset deviate significantly from normality, as evidenced by very low p-values. A selection of key results includes
Fixed Acidity: , p-value
Volatile Acidity: , p-value
Citric Acid: , p-value
Residual Sugar: , p-value
Chlorides: , p-value
Free Sulfur Dioxide: , p-value
Total Sulfur Dioxide: , p-value
These results show that the distributions of features such as residual sugar, chlorides, and total sulfur dioxide are particularly non-normal. The combination of histograms and KDE plots in
Figure 4 visually confirms these deviations, as the distributions appear skewed and non-symmetrical. This deviation from normality has implications for the choice of statistical methods, as many parametric tests and models assume normally distributed data.
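The per-feature Shapiro–Wilk statistics reported above can be reproduced with SciPy; the following is a minimal sketch under the same dataset assumptions.

```python
import pandas as pd
from scipy.stats import shapiro

df = pd.read_csv("winequality-red.csv", sep=";")

# Shapiro-Wilk normality test for each physicochemical feature.
for col in df.columns.drop("quality"):
    W, p = shapiro(df[col])
    print(f"{col:>22s}: W = {W:.3f}, p-value = {p:.2e}")
```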
3.4.6. Boxplots
To explore potential outliers and the spread of each feature, we generated boxplots. Boxplots visualize the distribution of the data through quartiles. The interquartile range (IQR) is calculated as
$$\mathrm{IQR} = Q_3 - Q_1,$$
where $Q_3$ and $Q_1$ are the 75th and 25th percentiles, respectively.
Figure 5 displays standardized boxplots of all features, allowing for direct comparison despite the differences in scale across the features. Additionally, a raw boxplot of the unstandardized features is provided in
Figure 6. Both figures highlight significant outliers, especially for total sulfur dioxide and chlorides.
3.4.7. ANOVA Results
We conducted one-way ANOVA tests to determine the effect of each feature on wine quality. One-way ANOVA (Analysis of Variance) is a statistical method used to determine if there are significant differences between the means of three or more independent groups. This technique is particularly useful when comparing multiple groups to assess whether at least one group mean is different from the others, based on a single independent variable. The fundamental principle of one-way ANOVA is to analyze the variance within each group and between groups, allowing researchers to infer population mean differences from sample data.
The one-way ANOVA test operates under certain assumptions, including the normality of data distribution and homogeneity of variances across groups. When these assumptions are met, the test can effectively identify significant differences among group means. In practical applications, one-way ANOVA is often followed by post hoc tests, such as Tukey’s HSD, to determine which specific groups differ from each other [40,41]. This method is widely utilized across various fields, including biology, psychology, and social sciences, demonstrating its versatility and importance in statistical analysis [42,43,44]. The test statistic (F) for ANOVA is calculated as
$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}},$$
where $MS_{\text{between}}$ is the mean sum of squares between groups and $MS_{\text{within}}$ is the mean sum of squares within groups. The results, shown below, indicate that all features significantly impact wine quality, as evidenced by their p-values being less than 0.05. This suggests that variations in these features are associated with differences in wine quality. A minimal SciPy sketch for reproducing these tests follows the results below.
Fixed Acidity: , p-value
Volatile Acidity: , p-value
Citric Acid: , p-value
Residual Sugar: , p-value
Chlorides: , p-value
Free Sulfur Dioxide: , p-value
Total Sulfur Dioxide: , p-value
Density: , p-value
pH: , p-value
Sulfates: , p-value
Alcohol: , p-value
The high F-statistics for several features, including chlorides, indicate particularly strong effects on wine quality. These results suggest that variations in these chemical properties contribute significantly to the differences in wine quality.
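The minimal SciPy sketch referenced above groups each feature by the observed quality levels and computes the one-way ANOVA F statistic; the grouping by quality score is our reading of the procedure described in the text.

```python
import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("winequality-red.csv", sep=";")

# One-way ANOVA: for each feature, compare its distribution across the
# observed quality levels (3 through 8) and report the F statistic.
for col in df.columns.drop("quality"):
    groups = [g[col].values for _, g in df.groupby("quality")]
    F, p = f_oneway(*groups)
    print(f"{col:>22s}: F = {F:7.2f}, p-value = {p:.2e}")
```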
3.5. Distribution Analysis of Wine Quality
We conducted an exploratory analysis to determine the distribution of the wine quality variable. This analysis involved fitting several distributions, including normal, log-normal, Weibull, gamma, and exponential, to the empirical data. The fitted distributions were then compared using goodness-of-fit tests and visualized alongside the empirical data for a comprehensive understanding of the underlying distribution. The results of the statistical tests and the fitted distribution plots are presented in this subsection.
3.5.1. Descriptive Statistics
The descriptive statistics of the wine quality variable are shown in
Table 4. The dataset contains 1599 samples, with a mean quality score of 5.64 and a standard deviation of 0.81. The minimum quality score is 3, while the maximum is 8. The majority of wines fall within the 5 to 6 quality range, with 75% of the wines having a quality score of 6 or below.
3.5.2. Goodness-of-Fit Tests
To evaluate which distribution best fits the wine quality data, we performed several goodness-of-fit tests, including the Kolmogorov–Smirnov (K-S) test, Anderson–Darling (A-D) test, and Shapiro–Wilk test. These tests were applied to the normal, log-normal, Weibull, gamma, and exponential distributions. The Kolmogorov–Smirnov (K-S) test is a non-parametric statistical method used to determine the goodness of fit between an empirical cumulative distribution function (CDF) and a theoretical distribution, or to compare two empirical distributions. It quantifies the maximum distance between the two CDFs, providing a measure of how well the data conform to the expected distribution [45,46]. Originally developed by Andrey Kolmogorov and Nikolai Smirnov, the K-S test is widely utilized in various fields, including environmental science, finance, and biomedical research, to assess normality and distributional assumptions [47,48].
The K-S test is particularly advantageous due to its applicability to small sample sizes and its robustness against deviations from normality, making it a preferred choice for many researchers [49,50]. Furthermore, its implementation has been enhanced through programming libraries, facilitating its integration into statistical software for broader accessibility [51]. Overall, the K-S test remains a fundamental tool in statistical analysis for evaluating the distributional properties of datasets.
Normal Distribution
The K-S test for the normal distribution yielded a statistic of 0.25, with a p-value far below 0.05, indicating a significant deviation from normality. The Anderson–Darling test further supports this result, with a statistic of 110.63, which exceeds the critical values at all significance levels. The Shapiro–Wilk test produced a statistic of 0.86 and a p-value far below 0.05, providing strong evidence against the normality assumption. The probability density function of the normal distribution is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
Log-Normal Distribution
The log-normal distribution showed a slightly better fit compared to the normal distribution. The K-S test yielded a statistic of 0.24, with a p-value far below 0.05, still indicating a significant departure from a log-normal distribution. The probability density function of the log-normal distribution is given by
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0.$$
Weibull Distribution
The Weibull distribution also did not fit the wine quality data well. The K-S test returned a statistic of 0.25 and a p-value far below 0.05, suggesting a poor fit. The Weibull distribution’s probability density function is
$$f(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^{k}}, \quad x \ge 0,$$
where $k$ is the shape parameter and $\lambda$ is the scale parameter.
Gamma Distribution
The gamma distribution provided a similar result, with a K-S statistic of 0.25 and a p-value far below 0.05, indicating that this distribution is also not a good fit for the data. The probability density function for the gamma distribution is
$$f(x) = \frac{x^{k-1} e^{-x/\theta}}{\theta^{k}\,\Gamma(k)}, \quad x > 0,$$
where $k$ is the shape parameter, $\theta$ is the scale parameter, and $\Gamma(k)$ is the gamma function.
Exponential Distribution
The exponential distribution showed the worst fit among all tested distributions. The K-S test yielded a statistic of 0.49, with a p-value of 0, suggesting that the exponential distribution does not describe the data at all. The probability density function for the exponential distribution is
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0,$$
where $\lambda$ is the rate parameter.
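A minimal sketch of how these distribution fits and K-S tests can be reproduced with SciPy is shown below: each candidate distribution is fitted by maximum likelihood and then tested against the data. The specific SciPy parameterizations (e.g., weibull_min) are assumptions, since the text does not name them.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("winequality-red.csv", sep=";")
quality = df["quality"].values

# Fit each candidate distribution by maximum likelihood, then run a
# Kolmogorov-Smirnov test of the data against the fitted distribution.
candidates = {
    "normal": stats.norm,
    "log-normal": stats.lognorm,
    "Weibull": stats.weibull_min,
    "gamma": stats.gamma,
    "exponential": stats.expon,
}
for name, dist in candidates.items():
    params = dist.fit(quality)
    ks_stat, p_value = stats.kstest(quality, dist.cdf, args=params)
    print(f"{name:>12s}: KS statistic = {ks_stat:.3f}, p-value = {p_value:.2e}")
```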
3.5.3. Visualization of Fitted Distributions
We visualized the empirical wine quality data along with the fitted distributions (normal, log-normal, Weibull, gamma, and exponential) in
Figure 7. The histogram of the empirical data is overlaid with the fitted distributions, where different colors represent each distribution. It is evident from the plot that the normal, log-normal, Weibull, and gamma distributions follow similar patterns, while the exponential distribution diverges significantly, as confirmed by the goodness-of-fit tests.
3.5.4. Kernel Density Estimation
In addition to fitting parametric distributions, we also computed a kernel density estimate (KDE) to visually inspect the shape of the quality distribution. This technique involves placing a kernel function, which is a smooth and symmetric function, at each data point and summing these contributions to create a continuous estimate of the density. The choice of kernel and the bandwidth parameter significantly influence the quality of the density estimate, with various kernels being proposed to optimize performance under different conditions [52,53].
KDE is widely applicable across various fields, including statistics, machine learning, and signal processing, serving as a fundamental tool for data visualization and analysis [54,55,56]. Advancements in computational techniques have improved the efficiency and stability of multivariate KDE, enabling the method to handle complex datasets [57,58]. The method’s flexibility allows it to adapt to the underlying structure of the data, making it a valuable approach for exploratory data analysis [59,60,61]. The KDE is a non-parametric way to estimate the probability density function of a variable and is defined as
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$
where $K$ is the kernel function, $h$ is the bandwidth, and $x_i$ are the data points. The KDE plot, shown in Figure 8, highlights the multimodal nature of the data, with peaks around quality scores of 5 and 6. This further reinforces the findings from the goodness-of-fit tests, as the parametric distributions struggled to capture this complexity.
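A minimal sketch of the KDE computation with SciPy's gaussian_kde follows, under the same dataset assumptions; the bandwidth rule is SciPy's default (Scott's rule), which the text does not specify.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

df = pd.read_csv("winequality-red.csv", sep=";")
quality = df["quality"].values

# Gaussian kernel density estimate of the quality scores; the bandwidth is
# chosen by Scott's rule by default but can be overridden via bw_method.
kde = gaussian_kde(quality)
grid = np.linspace(quality.min() - 1, quality.max() + 1, 200)
density = kde(grid)
print("Density peaks near quality score:", grid[np.argmax(density)])
```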
3.6. Limitations of Subjective Quality Scores and Mitigation Strategies
The wine quality dataset utilized in this study assigns quality scores based on evaluations from expert human tasters. These scores, while reflective of sensory attributes such as taste, aroma, and overall balance, are inherently subjective and may introduce potential biases. Such biases arise from differences in individual perceptions, cultural preferences, and contextual factors that could affect the consistency of quality ratings across samples. However, these scores are the industry standard and provide valuable insights into consumer-relevant quality metrics, making them an essential benchmark for wine quality assessment.
To address these limitations, the following considerations were integrated into our methodology:
Dataset Characteristics: The dataset provides a comprehensive representation of physicochemical properties that are objectively measurable and quantifiable, such as alcohol content, sulfates, residual sugar, and acidity. These features serve as reliable predictors independent of subjective interpretations.
Emphasis on Objective Features: Our analysis prioritizes the physicochemical attributes as primary predictors of wine quality, aiming to minimize the influence of subjective bias in the modeling process.
Potential for Cross-Validation with Alternative Datasets: While this study is limited to the UCI dataset, future research could incorporate external datasets that include alternative scoring systems or combine subjective scores with chemical and physical measurements for enhanced reliability.
Robust Predictive Modeling: The methodology focuses on predictive modeling that explores potential wine compositions and their predicted quality scores. This predictive approach shifts the emphasis from subjective quality ratings to the underlying relationships between measurable features and quality outcomes.
While recognizing the inherent subjectivity of the dataset, this study demonstrates that the proposed model can provide valuable insights into wine quality prediction through objective feature analysis. Future studies should consider employing multiple datasets with diverse scoring systems or incorporating sensory panel validation to further mitigate the impact of subjective biases.
3.7. Feature Selection and Limitations
The 11 physicochemical properties included in the dataset are objectively measurable and have been extensively validated in the literature as key predictors of wine quality. These features, such as alcohol content, acidity levels, and sulfates, are directly linked to wine’s taste, aroma, and structural characteristics. Moreover, their inclusion reflects industry standards for physicochemical analysis.
While the dataset does not include sensory attributes or geographic data, which may provide additional insights, these omissions were due to the dataset’s structure and availability. Future studies could incorporate additional features, such as
Sensory Attributes: Measurable aspects like aroma profile and color intensity, which can be quantified using gas chromatography–mass spectrometry (GC-MS) or spectrophotometric methods, respectively.
Geographic and Climatic Data: Information about vineyard location, soil type, and climatic conditions could capture regional influences on wine quality.
By considering these additional features in future research, the scope and predictive power of wine quality models can be expanded, enabling more comprehensive analyses.
3.8. Future Directions: Dataset Expansion and Data Augmentation
The current study relies on a single publicly available dataset from the UCI Machine Learning Repository [
10], which includes quality scores assigned by expert human tasters. While this dataset provides a robust basis for initial modeling, its limited range of quality scores (3 to 8) and the subjective nature of the evaluations present inherent constraints.
To address these limitations, future research could benefit from incorporating additional datasets from diverse wine-producing regions and vintages. Such datasets would allow for the cross-validation of the model across different contexts, enhancing its generalizability and applicability. Furthermore, exploring data augmentation techniques, such as the synthetic generation of feature combinations or bootstrapping methods, could expand the dataset’s diversity and mitigate the limitations imposed by the original dataset’s fixed range.
By applying these strategies, future studies could improve the robustness of the model and provide a more comprehensive understanding of wine quality prediction across diverse contexts.
3.9. Limitations of Extrapolation and Validation Requirements
One of the key considerations in this study is the scope of applicability of the predictive models. The models were developed and validated using observed data, making them most reliable for interpolation within the range of quality scores included in the dataset. Predictions beyond this range, particularly hypothetical scores exceeding 10, are inherently exploratory and should be interpreted with caution. Extrapolation relies on the assumption that the relationships captured by the model remain consistent outside the observed data range, which may not hold true in practice.
To ensure practical applicability, it is essential to validate these predictions experimentally. This involves identifying or creating wine samples with physicochemical compositions similar to those predicted by the model and subjecting them to sensory evaluation by expert tasters. Such validation would help confirm whether the predicted scores align with human quality assessments.
Furthermore, certain model predictions may suggest physicochemical compositions that are challenging to achieve due to physical constraints in grape composition or winemaking processes. For example, a low pH combined with reduced sulfur oxide content may lead to high predicted scores but could be physically unattainable or result in undesirable wine characteristics. Additionally, human perception of physicochemical changes is non-linear and influenced by complex sensory interactions, which may not fully align with the patterns captured by the model.
Recognizing these limitations, we emphasize that the model is a tool for hypothesis generation rather than definitive prediction. Its primary utility lies in identifying potential directions for quality improvement and guiding experimental validation efforts.
3.10. Nonlinear Relationships and Data Transformation
Wine quality is influenced by complex interactions among physicochemical parameters, which often exhibit nonlinear relationships with quality scores. For example,
Alcohol Content: A moderate alcohol level may positively influence wine quality, while excessive levels may detract from the sensory experience.
Volatile Acidity: High volatile acidity is typically associated with poor quality, but its impact may vary depending on other attributes like sugar content and pH.
pH and Acidity: These features have a nonlinear impact on quality due to their role in balancing flavor and preserving wine.
To address these nonlinearities, we employed polynomial regression models (2nd and 3rd degree) to capture the complex relationships between features and quality. Polynomial terms allowed the model to account for interactions and diminishing returns observed in certain attributes.
Additionally, skewness in several features, such as alcohol and residual sugar, necessitated preliminary data transformations. Log transformations were applied to stabilize variance and linearize relationships, enabling better model fitting and interpretation. For instance,
Log-transformed variables reduced the influence of extreme values, making the model more robust to outliers.
Transformed features exhibited higher correlations with quality scores, enhancing their predictive power.
Incorporating these steps also provides an educational framework for researchers and practitioners working on similar prediction problems in food science and beyond.
4. Statistical Assessment of Transformation Techniques for Wine Quality Features
4.1. Selecting the Right Transformation
Before proceeding with predictive modeling for wine quality, we applied various statistical tests to assess the need for transformations on the following features: Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide. The aim was to reduce skewness, stabilize variance, and improve normality through log transformation. Our analysis was guided by skewness, kurtosis, and normality tests (Shapiro–Wilk test), as well as variance stabilization checks (Levene’s test).
4.1.1. Square Root Transformation
The square root transformation is considered for variables exhibiting increasing variance with the mean (heteroscedasticity) [
62]. Compared to logarithmic transformation, it is less aggressive and can be applied to zero but not negative values.
Thinh et al. [
63] demonstrated its effectiveness in regression modeling, showing improved accuracy when applied to surface roughness data. Similarly, it has been utilized in various contexts, including enhancing model precision [
64] and facilitating compositional data analysis [
65].
Variance stabilization is a key advantage of this transformation, particularly when underlying model assumptions are violated, as noted in Bland–Altman analysis [
66]. The transformation’s applicability extends to phylogenetic studies [
67] and spectroscopic data preprocessing [
68], underscoring its broad utility.
4.1.2. Inverse Transformation
The inverse transformation is considered for cases where large values need compression while small values require amplification. This transformation can significantly alter data distribution and is particularly effective when dealing with left-skewed variables.
Saffari et al. [
69] demonstrated the utility of inverse transformation in modeling cognitive outcomes for early-stage neurocognitive disorders. By addressing the skewness in cognitive data, the study highlights the importance of selecting appropriate transformations to enhance model interpretability and accuracy.
This transformation’s impact on wine quality data will be further evaluated in subsequent sections, considering its suitability for improving regression model performance.
4.1.3. Box-Cox Transformation
The Box–Cox transformation provides a flexible family of power transformations, including logarithmic and square root transformations, depending on the parameter $\lambda$. It is widely employed to stabilize variance and improve normality in datasets, ensuring more reliable statistical modeling [70].
This transformation has been applied across multiple fields to address skewness and enhance predictive accuracy. In environmental science, it improves dataset homogeneity, reducing variance disparities [
71]. In predictive modeling, it has been shown to enhance normality and improve model performance by addressing nonlinearity [
72].
Mosquin et al. [
73] found that applying the Box–Cox transformation in streamflow evaluation significantly reduced skewness, leading to superior predictive accuracy in linear models. Similarly, in anomaly detection, it plays a crucial role in stabilizing residual distributions [
74], while in remote sensing applications, it corrects biases and minimizes errors in data distribution [
75].
Beyond data distribution corrections, this transformation has proven valuable in climatological forecasting by normalizing hydroclimatic variables [
76] and in trend and seasonality modeling for predictive applications, such as forecasting box office performance [
77]. Additionally, in medical and biological studies, it enhances regression modeling by normalizing tumor-infiltrating lymphocyte data for prognosis assessments [
78] and contributes to efficient parameter estimation in feed efficiency studies for livestock research [
79].
Moreover, the Box–Cox transformation is widely utilized in statistical modeling to normalize data distributions and improve parameter estimation [
80]. It plays a crucial role in ensuring that assumptions of normality and variance stability hold across different datasets, making it an essential tool in regression and predictive analytics.
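As an illustration, the Box–Cox transformation and its estimated $\lambda$ can be obtained with SciPy. The choice of the citric acid feature and the 0.1 offset here are illustrative assumptions, not choices stated in the text.

```python
import pandas as pd
from scipy.stats import boxcox, skew

df = pd.read_csv("winequality-red.csv", sep=";")

# Box-Cox requires strictly positive data, so a small constant is added to
# features that contain zeros (e.g., citric acid).
x = df["citric acid"] + 0.1
transformed, lam = boxcox(x)

print(f"Estimated lambda: {lam:.3f}")
print(f"Skewness before: {skew(x):.3f}, after: {skew(transformed):.3f}")
```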
4.1.4. Exponential and Logarithmic Transformation
Exponential and logarithmic transformations are considered for datasets spanning multiple orders of magnitude or following a power-law distribution. These transformations help improve model interpretability and predictive accuracy by stabilizing variance and addressing skewed distributions.
Bosse et al. [
81] demonstrated the benefits of transforming count data before applying scoring metrics such as the Continuous Ranked Probability Score (CRPS) or Weighted Interval Score (WIS), leading to more meaningful and interpretable results in epidemiological forecasting. Schmid et al. [
82] emphasized the role of exponential decay models in accounting for time dependency in meta-analyses, improving concordance probability estimates.
In ecological modeling, Zhang et al. [
83] explored the use of exponential regression in biomass estimation, highlighting its ability to improve the predictive capabilities of regression models. Similarly, Yu et al. [
84] demonstrated the effectiveness of logarithmic regression in revealing relationships between variables, reinforcing its utility in statistical modeling.
Rasheed et al. [
85] applied both exponential and logarithmic regression techniques in analyzing the properties of octane isomers, illustrating the adaptability of these transformations across scientific domains. The importance of inverse transformations was further demonstrated by Bueno-López et al. [
86], who addressed the challenges of de-transforming logarithmic conversions in regression models.
In medical research, logarithmic transformations have proven effective in predictive modeling. Tu et al. [
87] applied them to analyze factors influencing prognosis after liver transplantation, while Pizarro et al. [
88] utilized natural logarithmic transformations to assess cardiac autonomic dysfunction in adult congenital heart disease patients, demonstrating their relevance in clinical investigations.
Environmental applications also benefit from these transformations. Subi et al. [
89] incorporated logarithmic transformations to estimate organic matter content in farmland soil, highlighting their role in environmental modeling. Similarly, Hieu et al. [
90] used logarithmic models to estimate chlorophyll-a levels, reinforcing their effectiveness in remote sensing applications for environmental monitoring.
4.1.5. Power Transformation
Power transformations are considered when data exhibits nonlinear relationships that do not fit the criteria for logarithmic or root transformations. Raising data to a power (e.g., square or cube) can help improve model fit and stabilize variance.
Li et al. [
91] demonstrated the effectiveness of power transformations in enhancing the accuracy of regression-based statistical postprocessing models for short-term precipitation forecasts. Their findings highlight the importance of transformation techniques in refining predictive models.
4.1.6. Log Transformation
Log transformation is evaluated as a method to stabilize variance, improve normality, and mitigate the influence of outliers in the dataset. This transformation is particularly relevant for right-skewed variables, where applying a logarithmic function can enhance model performance.
The general form of a log transformation is
$$y = \log_b(x + c),$$
where
x is the original data value;
y is the transformed value;
b is the logarithm base, commonly the natural logarithm (base e) or base 10;
c is a small constant added to prevent undefined values when $x = 0$, typically 0.1 or 1, depending on the dataset.
Two commonly used forms of log transformation in data preprocessing are
Natural log transformation: $y = \ln(x + c)$;
Base-10 log transformation: $y = \log_{10}(x + c)$.
The choice of base depends on the scale and interpretability of transformed values. In the context of wine quality modeling, log transformation is considered for variables exhibiting skewness, as discussed in the previous section. The effectiveness of this transformation will be further assessed in subsequent analyses.
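A minimal sketch of applying this transformation to the skewed features with NumPy, assuming the same dataset and the 0.1 offset mentioned above:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("winequality-red.csv", sep=";")

# Log-transform the right-skewed features; the small constant keeps the
# transform defined for zero values (citric acid contains exact zeros).
skewed = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
]
for col in skewed:
    df[f"log_{col}"] = np.log(df[col] + 0.1)

print(df[[f"log_{c}" for c in skewed]].describe().round(2))
```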
4.1.7. Skewness and Kurtosis
Skewness measures the asymmetry of a distribution, while kurtosis quantifies the heaviness of its tails. A positive skewness indicates a longer tail on the right, while a negative skewness indicates a longer tail on the left [92,93]. Kurtosis, on the other hand, measures the “tailedness” of the distribution, reflecting the presence of outliers. High kurtosis (leptokurtic) indicates heavy tails and a sharp peak, while low kurtosis (platykurtic) suggests light tails and a flatter peak [94,95]. Both skewness and kurtosis are essential in various fields, including finance and environmental science, as they provide insights into the underlying data distribution, which can influence statistical modeling and hypothesis testing [96,97]. For a normal distribution, the skewness should be close to zero, and the kurtosis should be approximately three (equivalently, the excess kurtosis reported here should be close to zero). Log transformation often reduces these measures, bringing the data closer to normality. The results of skewness and kurtosis for the original and log-transformed features are summarized in Table 5. Log transformation successfully reduced the skewness and kurtosis of several features. For example, the skewness of fixed acidity was reduced from 0.98 to 0.45, while the kurtosis decreased from 1.12 to 0.14, indicating a more symmetrical and normally distributed feature. Similarly, volatile acidity saw a reduction in skewness from 0.67 to 0.27.
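The before-and-after comparison summarized in Table 5 can be reproduced along the following lines with SciPy (shown for two features only; note that scipy.stats.kurtosis reports excess kurtosis by default).

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

df = pd.read_csv("winequality-red.csv", sep=";")

# Compare skewness and excess kurtosis before and after the log transformation.
for col in ["fixed acidity", "volatile acidity"]:
    raw, logged = df[col], np.log(df[col] + 0.1)
    print(f"{col}: skew {skew(raw):.2f} -> {skew(logged):.2f}, "
          f"kurtosis {kurtosis(raw):.2f} -> {kurtosis(logged):.2f}")
```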
4.1.8. Normality Testing (Shapiro-Wilk Test)
We applied the Shapiro–Wilk test to each feature, both in its original and log-transformed states, to assess improvements in normality. Table 6 presents the test statistics and p-values. Notably, the log-transformed fixed acidity showed a significant improvement in normality, with a marked increase in its Shapiro–Wilk p-value, as shown in Figure 9. Similarly, volatile acidity and free sulfur dioxide demonstrated improved normality following the log transformation (Figure 10 and Figure 11).
4.1.9. Q-Q Plots
Quantile–Quantile (Q-Q) plots were generated to visually assess the effect of the log transformation on each feature. The Q-Q plot compares the quantiles of the sample data to the theoretical quantiles of a normal distribution. For normally distributed data, the points on the Q-Q plot should fall approximately along the 45-degree line; if the data follows the theoretical distribution closely, the points will lie approximately along a straight line [98,99]. Q-Q plots are particularly useful for assessing normality, as deviations from the line indicate departures from normality [100,101,102]. They are widely employed in various fields, including statistics, environmental science, and genomics, to evaluate the fit of data to a specified distribution and to identify outliers [103,104]. The visual nature of Q-Q plots (Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15) allows for a quick assessment of data distribution, making them a staple in statistical analysis [105,106].
The Q-Q plots confirm that the log transformation brought several features closer to a normal distribution. For example, as seen in Figure 11, the free sulfur dioxide feature improved significantly after log transformation. Similar improvements are visible in volatile acidity and total sulfur dioxide, as illustrated in Figure 10 and Figure 15, respectively.
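A minimal sketch of generating such Q-Q plots with SciPy's probplot, shown for free sulfur dioxide before and after the log transformation (the feature choice and figure layout are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("winequality-red.csv", sep=";")

# Q-Q plots of free sulfur dioxide against a normal distribution, before and
# after the log transformation.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(df["free sulfur dioxide"], dist="norm", plot=axes[0])
axes[0].set_title("Original")
stats.probplot(np.log(df["free sulfur dioxide"] + 0.1), dist="norm", plot=axes[1])
axes[1].set_title("Log-transformed")
plt.tight_layout()
plt.show()
```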
4.1.10. Variance Stabilization (Levene’s Test)
We conducted Levene’s test to determine whether the log transformation stabilized variance across different groups of the features. Levene’s test is a statistical procedure used to assess the homogeneity of variances across different groups. It was introduced by Levene in 1960 and is particularly advantageous because it is robust against violations of the normality assumption, making it suitable for data that may not follow a normal distribution [107,108]. The test evaluates whether the variances of multiple groups are equal, which is a critical assumption in various statistical analyses, including ANOVA [109,110].
Levene’s test operates by analyzing the absolute deviations of each observation from its group mean or median, providing a more reliable measure of variance equality compared to traditional methods like Bartlett’s test, especially in the presence of non-normal data [111,112]. The results of Levene’s test inform researchers whether to proceed with parametric tests that assume equal variances or to adopt alternative methods that do not [113,114]. The Levene’s test statistic is calculated as
$$W = \frac{N - k}{k - 1} \cdot \frac{\sum_{i=1}^{k} N_i \left(\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot}\right)^2}{\sum_{i=1}^{k} \sum_{j=1}^{N_i} \left(Z_{ij} - \bar{Z}_{i\cdot}\right)^2},$$
where $N$ represents the total number of samples, $k$ is the number of groups, $Z_{ij}$ is the transformed data (the absolute deviation of each observation from its group mean), and $\bar{Z}_{i\cdot}$ and $\bar{Z}_{\cdot\cdot}$ refer to the group and overall means, respectively. Our test results indicate significant improvements in variance homogeneity for several features, as shown in Table 7. For instance, the p-value for fixed acidity improved markedly after transformation, confirming the effectiveness of the log transformation.
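A minimal sketch of Levene's test with SciPy follows, assuming (as an illustrative choice, since the text does not specify the grouping) that groups are formed by the observed quality levels.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene

df = pd.read_csv("winequality-red.csv", sep=";")

# Levene's test for homogeneity of variance across quality levels, applied to
# an original feature and its log-transformed counterpart.
def levene_by_quality(values):
    groups = [values[df["quality"] == q] for q in sorted(df["quality"].unique())]
    return levene(*groups)

for col in ["fixed acidity", "volatile acidity"]:
    _, p_raw = levene_by_quality(df[col])
    _, p_log = levene_by_quality(np.log(df[col] + 0.1))
    print(f"{col}: p = {p_raw:.4f} (original) vs {p_log:.4f} (log-transformed)")
```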
4.1.11. Transformation Considerations for Wine Quality Data
To assess the suitability of transformations for our dataset, histograms of wine quality features (
Figure 4) were analyzed. Several features exhibit skewed distributions, suggesting that logarithmic transformations may improve normalization and enhance predictive modeling.
Fixed Acidity: Displays slight right-skewness. A log transformation could improve symmetry.
Volatile Acidity: Right-skewed distribution; log transformation may help reduce skewness.
Citric Acid: Concentration near zero with a long right tail, making log transformation highly suitable.
Residual Sugar: Highly right-skewed with outliers, benefiting from log transformation to normalize the distribution.
Chlorides: Exhibits extreme right-skewness, making log transformation a strong candidate.
Free Sulfur Dioxide and Total Sulfur Dioxide: Both right-skewed, suggesting log transformations could improve normality and aid in linear modeling.
Density, pH, Sulfates, and Alcohol: Appear relatively symmetrical, so transformation may not be necessary.
Log transformations for fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide could enhance predictive modeling by stabilizing variance and improving feature distributions. To accommodate zero values, a small constant (e.g., 0.1) is added before applying the transformation where necessary.
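A minimal sketch of this preprocessing step is given below; the file name and column labels follow the public dataset and are assumptions.

```python
import numpy as np
import pandas as pd

wine = pd.read_csv("winequality-red.csv", sep=";")  # assumed file name

# Features identified above as right-skewed candidates for a log transformation.
skewed = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
]

transformed = wine.copy()
# np.log1p computes log(1 + x), which handles zero values such as citric acid;
# adding a small constant (e.g., 0.1) before np.log is the alternative noted above.
transformed[skewed] = np.log1p(wine[skewed])

print(wine[skewed].skew().round(2))         # skewness before transformation
print(transformed[skewed].skew().round(2))  # skewness after transformation
```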
4.2. Conclusions
The statistical tests and visualizations (Q-Q plots and Levene’s test) indicate that the log transformation effectively improved normality and stabilized variance for most of the selected features. While some features, such as chlorides, did not show a significant improvement, the majority of features demonstrated better behavior post-transformation, making them more suitable for linear modeling.
5. Regression Models for Wine Quality Prediction
5.1. Linear Regression
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable
y and one or more independent variables
X. In the context of our wine quality dataset, the goal is to predict the quality of wine based on several chemical properties [
115]. It aims to establish a linear equation that best predicts the value of the dependent variable based on the independent variables. This method is widely utilized in various fields, including research statistics, healthcare, finance, and machine learning, due to its simplicity and interpretability [
116]. The core principle of linear regression involves fitting a straight line to the data points to minimize the difference between the observed values and the values predicted by the model [
117].
In linear regression, the dependent variable is often referred to as the response variable, while the independent variables are known as predictors or explanatory variables [
118]. The relationship between these variables is assumed to be linear, meaning that a change in the independent variable leads to a proportional change in the dependent variable [
119]. The model estimates the coefficients of the independent variables to quantify their impact on the dependent variable, allowing for predictions and inference based on the established linear relationship [
120].
Linear regression is particularly useful for understanding the association between variables, making predictions, and identifying trends in the data [
121]. It provides insights into how changes in the independent variables affect the outcome, enabling researchers to make informed decisions based on the model’s results [
122]. Linear regression also serves as a foundation for more complex regression techniques and machine learning algorithms, making it a crucial tool in data analysis and predictive modeling [
123].
One of the key advantages of linear regression is its interpretability, as the coefficients in the model represent the strength and direction of the relationships between variables [
124]. This transparency allows researchers to understand the impact of each predictor on the outcome, facilitating the identification of significant factors influencing the dependent variable [
125]. Additionally, linear regression can be extended to handle multiple predictors through multiple linear regression, enabling the analysis of more complex relationships in the data [
126].
Despite its advantages, linear regression has certain assumptions that need to be met for the model to be valid. These assumptions include linearity, independence of errors, homoscedasticity (constant variance of errors), and normality of residuals [
119]. Violations of these assumptions can lead to biased estimates and inaccurate predictions, highlighting the importance of assessing the model’s validity before drawing conclusions from the results [
127].
In practice, linear regression is applied in a wide range of scenarios, from predicting stock prices and estimating medical costs to analyzing climate data and forecasting economic indicators [
128]. Researchers have also explored variations of linear regression, such as quantile regression and Bayesian linear regression, to address specific research questions and improve model performance [
129].
5.1.1. The Mathematical Model
Linear regression assumes the following linear relationship between the dependent variable y (wine quality) and the independent variables X [
130]:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{11} x_{11} + \varepsilon,$$
where
$y$ is the dependent variable representing wine quality;
$x_1, \dots, x_{11}$ are the independent variables representing chemical properties such as fixed acidity, volatile acidity, and so on;
$\beta_0$ is the intercept, the value of $y$ when all $x_i = 0$;
$\beta_1, \dots, \beta_{11}$ are the coefficients, which indicate how each independent variable affects the quality of the wine;
$\varepsilon$ is the error term, representing the residuals, i.e., the difference between the actual and predicted values.
5.1.2. Objective: Minimizing the Error
The goal of linear regression is to find the optimal coefficients ($\beta_0, \beta_1, \dots, \beta_{11}$) that minimize the error between the predicted and actual values of $y$ [
131]. This error is typically measured using the sum of squared residuals, also known as the “least squares” method. The residual for each data point $i$ is calculated as
$$e_i = y_i - \hat{y}_i,$$
where $\hat{y}_i$ is the predicted value of $y_i$. The sum of squared residuals is then
$$SSR = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$
5.1.3. Ordinary Least Squares (OLS) [132]
To minimize the sum of squared residuals, the method of Ordinary Least Squares (OLS) is used. OLS calculates the coefficients $\beta$ that minimize $SSR$. The optimal coefficients are derived using the following normal equation:
$$\mathbf{b} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y},$$
where
$\mathbf{b}$ is the vector of coefficients $(\beta_0, \beta_1, \dots, \beta_{11})$;
$\mathbf{X}$ is the matrix of input features, augmented with a column of ones for the intercept;
$\mathbf{y}$ is the vector of observed values of $y$.
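For illustration, the normal equation can be evaluated directly with NumPy on synthetic data, as sketched below; in practice, a dedicated solver (e.g., np.linalg.lstsq or scikit-learn) is preferable for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 100, 11

# Synthetic stand-ins for the 11 physicochemical features and the quality scores.
X = rng.normal(size=(n_samples, n_features))
true_beta = rng.normal(size=n_features)
y = 5.0 + X @ true_beta + rng.normal(scale=0.5, size=n_samples)

# Augment with a column of ones for the intercept, as in the normal equation.
X_aug = np.column_stack([np.ones(n_samples), X])

# b = (X^T X)^{-1} X^T y
b = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y

print("Intercept:", round(b[0], 3))
print("First three coefficients:", np.round(b[1:4], 3))
```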
5.1.4. Model Evaluation
Once the coefficients are determined, the linear regression model is evaluated using various metrics, including
R-squared ($R^2$): Measures the proportion of variance in the dependent variable that is predictable from the independent variable(s). An $R^2$ close to 1 indicates a good fit.
Mean Squared Error (MSE): The average of the squared residuals.
Root Mean Squared Error (RMSE): The square root of the MSE.
5.1.5. Application to the Wine Dataset
In our wine quality dataset, we applied linear regression to model the relationship between the wine’s chemical properties and its quality rating. We derived a linear equation of the form
$$y = \beta_0 + \sum_{i=1}^{11} \beta_i x_i,$$
where $y$ is the wine quality, $x_1, \dots, x_{11}$ represent the chemical properties, and $\beta_0, \dots, \beta_{11}$ are the coefficients that define the influence of each property on the wine quality. By minimizing the error, we ensured that the equation closely approximated the actual quality ratings given the chemical properties. This method provided a clear and interpretable model for predicting wine quality.
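A minimal scikit-learn sketch of such a fit and its evaluation with the metrics above is shown below; the file and column names are assumptions, and the 80/20 split with random_state=42 mirrors the protocol described later.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

wine = pd.read_csv("winequality-red.csv", sep=";")  # assumed file name
X = wine.drop(columns="quality")
y = wine["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
print(f"R^2:  {r2_score(y_test, pred):.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {np.sqrt(mse):.4f}")
```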
5.2. Linear Regression with Second-Degree Polynomial
Linear regression can be extended to model nonlinear relationships by introducing polynomial features. This is known as polynomial regression, and it still maintains the properties of a linear model, but allows for more complex relationships between the variables. In the context of our wine quality dataset, this approach can capture the effects of interactions and quadratic terms between the chemical properties. The use of second-degree polynomials in regression analysis is particularly advantageous when the data exhibit a curved pattern that cannot be adequately captured by a simple linear model [
133].
Researchers in various fields have applied second-degree polynomial regression to address specific challenges and make accurate predictions. For instance, in agricultural research, second-degree polynomial regression has been utilized to predict sugarcane yield, demonstrating its effectiveness in capturing the complex distribution of variables and providing a better fit for the data [
134]. In healthcare, researchers have employed second-degree fractional polynomials in linear regression models to identify factors associated with post-stunting linear growth in under-five children, showcasing the versatility of this approach in understanding growth patterns [
133].
The application of polynomial regression extends beyond traditional fields to encompass machine learning and statistical modeling. In machine learning, polynomial regression has been integrated into predictive models to account for non-linear relationships between predictors and responses, offering a more nuanced understanding of complex data patterns [
135]. Additionally, in resource management in edge servers, polynomial regression has been utilized to model capacity factors, highlighting its utility in optimizing system performance and resource allocation [
136].
In environmental studies, second-degree polynomial regression has been instrumental in water quality index trend analysis, where polynomial regression models of varying degrees were compared to assess water quality trends accurately [
137]. Additionally, in the estimation of chlorophyll and nitrogen contents in plants, second-degree polynomials have been employed to establish correlations between non-destructive and destructive methods, emphasizing the role of polynomial regression in bridging different measurement techniques [
138].
In mathematical and computational research, polynomial regression has been leveraged to address diverse challenges. For instance, in the study of hyperbolic polynomials with a span less than four, researchers have explored the construction and properties of such polynomials using algebraic operations, showcasing the analytical power of polynomial regression in mathematical investigations [
139]. In the context of differential equations with polynomial coefficients, polynomial regression has been used to analyze the behavior of polynomial solutions, providing insights into the characteristics of these equations [
140].
5.2.1. The Mathematical Model
For second-degree polynomial regression, the model takes the form
$$y = \beta_0 + \sum_{i=1}^{11} \beta_i x_i + \sum_{i=1}^{11} \beta_{ii} x_i^2 + \sum_{i < j} \beta_{ij} x_i x_j + \varepsilon,$$
where
$y$ is the dependent variable;
$x_1, \dots, x_{11}$ are the independent variables;
$\beta_0$ is the intercept;
$\beta_i$ are the coefficients for the linear terms;
$\beta_{ii}$ are the coefficients for the quadratic terms;
$\beta_{ij}$ ($i < j$) are the coefficients for the interaction terms;
$\varepsilon$ is the error term.
5.2.2. Objective: Minimizing the Error
Like linear regression, the objective of polynomial regression is to minimize the sum of squared residuals:
$$SSR = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$
The coefficients can be calculated using the Ordinary Least Squares (OLS) method:
$$\mathbf{b} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y},$$
where
$\mathbf{b}$ is the vector of coefficients;
$\mathbf{X}$ is the matrix of polynomial input features (linear, quadratic, and interaction terms), augmented with a column of ones for the intercept;
$\mathbf{y}$ is the vector of observed values of $y$.
5.2.3. Application to the Wine Dataset
In our wine quality dataset, we applied second-degree polynomial regression to model the relationship between the wine’s chemical properties and its quality rating. This allowed us to capture more complex relationships between the variables. We derived an equation of the form
$$y = \beta_0 + \sum_{i=1}^{11} \beta_i x_i + \sum_{i=1}^{11} \beta_{ii} x_i^2 + \sum_{i < j} \beta_{ij} x_i x_j,$$
where $y$ is the wine quality, $x_1, \dots, x_{11}$ represent the chemical properties, and the $\beta$ terms are the coefficients that define the influence of each property and each pairwise interaction on the wine quality. By incorporating polynomial terms, this model provided a flexible and effective way to predict wine quality.
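A sketch of this second-degree fit using scikit-learn's PolynomialFeatures follows; the dataset file name and the default settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

wine = pd.read_csv("winequality-red.csv", sep=";")  # assumed file name
X = wine.drop(columns="quality")
y = wine["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# degree=2 generates the linear, quadratic, and pairwise interaction terms;
# LinearRegression adds the intercept itself, so include_bias=False.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, poly_model.predict(X_test)))
print(f"Second-degree polynomial regression RMSE: {rmse:.4f}")
```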
5.3. Ridge Regression
Ridge regression is a type of linear regression that addresses some of the limitations of ordinary linear regression, particularly when dealing with multicollinearity or overfitting. It is also known as Tikhonov regularization [
141]. Ridge regression, a modification of the Ordinary Least Squares (OLS) method, is a regularization technique used in linear regression to address multicollinearity and improve the stability of coefficient estimates [
142]. By penalizing the L2-norm of the coefficients, ridge regression introduces a degree of bias that helps regulate the estimated coefficients, making them more robust in the presence of high-dimensional regressors [
143]. This method is particularly valuable when dealing with data exhibiting multicollinearity, where independent variables are highly correlated, as ridge regression provides a solution to the issue of unstable coefficient estimates [
144].
The theoretical foundation and practical applications of ridge regression have been extensively developed, establishing it as a well-recognized technique in regression analysis [
145]. By adding a small positive quantity to the regression coefficients, ridge regression acts as a biased estimation method that offers more stable results compared to traditional linear regression, especially in scenarios with multicollinearity in the independent variables [
146]. The introduction of a biasing parameter in ridge regression mitigates the effects of multicollinearity and leads to more reliable estimation of coefficients [
147].
In model selection and forecasting, ridge regression has proven to be an efficient tool for handling multicollinearity and outliers in datasets, making it a preferred choice when dealing with non-normal residuals and data anomalies [
148]. Researchers have also explored the use of ridge regression in genetic studies, demonstrating its versatility across different fields and its ability to provide robust parameter estimation and prediction in the presence of outliers and multicollinearity [
149]. Furthermore, the application of ridge regression in logistic regression models has led to the development of improved estimators to overcome the challenges posed by multicollinearity in logistic regression analysis [
150].
A key advantage of ridge regression is its ability to produce consistent estimates of predictor variables even in the presence of multicollinearity, a common issue in regression analysis [
151]. By introducing a regularization term that shrinks the coefficients towards zero, ridge regression helps prevent overfitting and improves the generalization capabilities of the model [
152]. Ridge regression has also been compared to other regularization approaches like LASSO and Elastic Net, showcasing its effectiveness in handling multicollinearity and improving the stability of coefficient estimates [
142].
The optimization of ridge parameters in regression analysis is crucial for achieving accurate and reliable results. Researchers have proposed various methods for determining the optimal ridge parameter, such as using the ridge trace and selecting a value that stabilizes the regression coefficients [
153]. The choice of the ridge parameter significantly impacts the performance of ridge regression models, as it directly influences the bias–variance trade-off and the overall predictive accuracy of the model [
154].
5.3.1. The Mathematical Model
Ridge regression introduces a regularization term to the linear regression equation to penalize large coefficients [
155]. The prediction equation is the same as that of linear regression:
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_{11} x_{11} + \varepsilon.$$
However, the objective function that ridge regression minimizes includes an additional penalty term:
$$J(\beta) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{11} \beta_j^2,$$
where
$y$ is the dependent variable;
$x_1, \dots, x_{11}$ are the independent variables;
$\beta_0, \dots, \beta_{11}$ are the coefficients;
$\lambda$ is the regularization parameter.
The regularization parameter $\lambda$ controls the amount of shrinkage. If $\lambda$ is 0, ridge regression is identical to linear regression. As $\lambda$ increases, the coefficients shrink towards zero, helping to prevent overfitting.
5.3.2. Objective: Minimizing the Error with Regularization
The goal of ridge regression is to minimize the sum of squared residuals, similar to linear regression, but with an additional penalty on the size of the coefficients. This regularization helps prevent the model from fitting the noise in the data.
The coefficients are calculated using the following formula:
$$\mathbf{b} = \left(\mathbf{X}^{\top}\mathbf{X} + \lambda \mathbf{I}\right)^{-1}\mathbf{X}^{\top}\mathbf{y},$$
where
$\mathbf{b}$ is the vector of coefficients $(\beta_0, \beta_1, \dots, \beta_{11})$;
$\mathbf{X}$ is the matrix of input features, augmented with a column of ones for the intercept;
$\mathbf{y}$ is the vector of observed values of $y$;
$\mathbf{I}$ is the identity matrix;
$\lambda$ is the regularization parameter.
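The closed-form solution can be reproduced directly in NumPy, as sketched below on synthetic data. Note that the formula as written also penalizes the intercept; library implementations such as scikit-learn's Ridge typically leave the intercept unpenalized.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 11))  # synthetic stand-in for the feature matrix
y = 5.0 + X @ rng.normal(size=11) + rng.normal(scale=0.5, size=100)

lam = 1.0  # regularization parameter lambda (illustrative value)
X_aug = np.column_stack([np.ones(len(X)), X])  # column of ones for the intercept
I = np.eye(X_aug.shape[1])

# b = (X^T X + lambda * I)^{-1} X^T y, solved without forming the inverse explicitly
b = np.linalg.solve(X_aug.T @ X_aug + lam * I, X_aug.T @ y)

print("Ridge intercept:", round(b[0], 3))
print("Largest |coefficient|:", round(np.max(np.abs(b[1:])), 3))
```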
5.3.3. Comparison with Linear Regression
Ridge regression differs from ordinary linear regression by adding a penalty term to the loss function. While linear regression seeks to minimize the sum of squared residuals, ridge regression also minimizes the size of the coefficients. This makes ridge regression more robust when the data has multicollinearity or when the model is prone to overfitting.
5.3.4. Model Evaluation
Like linear regression, ridge regression can be evaluated using metrics such as
R-squared ($R^2$): Measures the proportion of variance in the dependent variable that is predictable from the independent variable(s). An $R^2$ close to 1 indicates a good fit.
Mean Squared Error (MSE): The average of the squared residuals.
Root Mean Squared Error (RMSE): The square root of the MSE.
5.3.5. Application to the Wine Dataset
In our wine quality dataset, we applied ridge regression to model the relationship between the wine’s chemical properties and its quality rating. The regularization helped prevent overfitting, and we derived an equation of the form
$$y = \beta_0 + \sum_{i=1}^{11} \beta_i x_i,$$
where $y$ is the wine quality, $x_1, \dots, x_{11}$ represent the chemical properties, and $\beta_0, \dots, \beta_{11}$ are the coefficients that define the influence of each property on the wine quality. By incorporating regularization, ridge regression provided a more robust model for predicting wine quality.
5.4. Ridge Regression with Second-Degree Polynomial
Ridge regression is a type of linear regression that includes a regularization term to prevent overfitting, particularly useful when dealing with complex relationships and multicollinearity. Ridge regression can also be extended to handle polynomial features, allowing it to capture nonlinear relationships effectively. In the context of our wine quality dataset, ridge regression with second-degree polynomials can model interactions and quadratic effects between the chemical properties. Ridge regression is employed to estimate the coefficients of multiple regression models, particularly when addressing highly correlated independent variables. This technique introduces a regularization term to prevent overfitting by shrinking the regression coefficients towards zero [
136]. Incorporating a second-degree polynomial into the model allows researchers to capture nonlinear relationships between variables, enabling more nuanced and accurate predictions [
156].
The integration of a second-degree polynomial within the ridge regression framework provides a robust approach to modeling data exhibiting curvature or nonlinearity. This hybrid model utilizes the regularization properties of ridge regression to address multicollinearity and prevent model overfitting while leveraging the flexibility of polynomial functions to capture intricate patterns in the data [
157]. By combining these techniques, researchers can strike a balance between bias and variance, leading to more robust and reliable regression models [
158].
In practical applications, ridge regression with a second-degree polynomial has demonstrated versatility and effectiveness across various fields. For instance, in predicting the progression of COVID-19 outbreaks, a hybrid polynomial–Bayesian ridge regression model was developed to forecast the spread of the virus, demonstrating the utility of this approach in epidemiological modeling [
156]. Similarly, in fault detection for ship systems operations, a multiple polynomial ridge regression model accurately detected developing faults in engine parameters, highlighting the applicability of this technique in predictive maintenance [
157].
By creating a ridge regression estimator within a polynomial framework, researchers can effectively handle scenarios with complex interdependencies among variables, ensuring robust and stable model performance [
158]. This approach has also been applied in the analysis of monthly and annual rainfall variability, demonstrating its efficacy in handling geographical variables and predicting climatological phenomena [
159].
Furthermore, the use of second-degree polynomial regression within the ridge regression framework has been pivotal in fields such as agriculture and environmental science. For instance, in predicting sugarcane yield, researchers utilized second-degree polynomial regression to achieve the best fit for the distribution of variables, emphasizing the importance of polynomial functions in capturing the nonlinear relationships inherent in agricultural data [
134]. Similarly, in water quality trend analysis, higher-degree polynomial models, including second-degree polynomials, provided a good fit to the data, highlighting the utility of polynomial regression in environmental research [
137].
5.4.1. The Mathematical Model
For ridge regression with second-degree polynomials, the model takes the form
$$y = \beta_0 + \sum_{i=1}^{11} \beta_i x_i + \sum_{i=1}^{11} \beta_{ii} x_i^2 + \sum_{i < j} \beta_{ij} x_i x_j + \varepsilon.$$
However, the objective function that ridge regression minimizes is
$$J(\beta) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j} \beta_j^2,$$
where
$y$ is the dependent variable;
$x_1, \dots, x_{11}$ are the independent variables;
$\beta_j$ are the coefficients (linear, quadratic, and interaction terms);
$\lambda$ is the regularization parameter.
The regularization parameter $\lambda$ controls the amount of shrinkage. As $\lambda$ increases, the coefficients shrink towards zero, helping to prevent overfitting.
5.4.2. Application to the Wine Dataset
In our wine quality dataset, we applied ridge regression with second-degree polynomials to model the relationship between the wine’s chemical properties and its quality rating. The regularization helped prevent overfitting, and we derived an equation of the form
$$y = \beta_0 + \sum_{i=1}^{11} \beta_i x_i + \sum_{i=1}^{11} \beta_{ii} x_i^2 + \sum_{i < j} \beta_{ij} x_i x_j,$$
where $y$ is the wine quality, $x_1, \dots, x_{11}$ represent the chemical properties, and the $\beta$ terms are the coefficients that define the influence of each property on the wine quality. By incorporating regularization, ridge regression provided a more robust model for predicting wine quality, particularly when the relationships were complex or when overfitting was a concern.
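A sketch of this combined model using a scikit-learn pipeline is shown below; the regularization strength (alpha) and the dataset file name are assumptions rather than the exact settings used in the study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

wine = pd.read_csv("winequality-red.csv", sep=";")  # assumed file name
X = wine.drop(columns="quality")
y = wine["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second-degree features followed by an L2-penalized linear fit.
ridge_poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),  # alpha plays the role of lambda; 1.0 is an assumed value
)
ridge_poly.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, ridge_poly.predict(X_test)))
print(f"Second-degree ridge regression RMSE: {rmse:.4f}")
```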
6. Applying Regression Models and Obtaining Equations
6.1. General Information for All Equations
6.1.1. Log Transformation
The equations are based on the log-transformed features for certain attributes, which means that for a given feature $x$, the transformed value $x'$ is computed as follows:
$$x' = \ln(1 + x).$$
The log transformation is applied to the following features:
Fixed acidity;
Volatile acidity;
Citric acid;
Residual sugar;
Chlorides;
Free sulfur dioxide;
Total sulfur dioxide.
6.1.2. How to Use the Equations
To use these equations for predicting the wine quality based on new data, the relevant features should be log-transformed as specified above.
For a new wine sample with attributes $x_1, \dots, x_{11}$, where
- $x_1$: fixed acidity;
- $x_2$: volatile acidity;
- $x_3$: citric acid;
- $x_4$: residual sugar;
- $x_5$: chlorides;
- $x_6$: free sulfur dioxide;
- $x_7$: total sulfur dioxide;
- $x_8$: density;
- $x_9$: pH;
- $x_{10}$: sulfates;
- $x_{11}$: alcohol.
The log-transformed values should be
$$x_i' = \ln(1 + x_i), \quad i = 1, \dots, 7.$$
For features $x_8, \dots, x_{11}$ (density, pH, sulfates, and alcohol), the original values are used directly.
The equations can then be used directly with these transformed values.
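The sketch below illustrates this preprocessing for a single new sample; the attribute values are invented for illustration only.

```python
import numpy as np

# Hypothetical raw measurements for a new sample, x1..x11 in the order above.
sample = {
    "fixed acidity": 7.4, "volatile acidity": 0.70, "citric acid": 0.00,
    "residual sugar": 1.9, "chlorides": 0.076, "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0, "density": 0.9978, "pH": 3.51,
    "sulfates": 0.56, "alcohol": 9.4,
}

log_features = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
]

# Apply ln(1 + x) to the skewed features; leave the remaining four unchanged.
transformed = {
    name: (np.log1p(value) if name in log_features else value)
    for name, value in sample.items()
}
print(transformed)
```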
6.1.3. Accuracy Calculation
The prediction accuracy is computed as
$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\!\left(\lfloor \hat{y}_i + 0.5 \rfloor = y_i\right),$$
where
$n$ is the number of samples,
$\hat{y}_i$ is the predicted quality for the $i$-th sample,
$y_i$ is the actual quality for the $i$-th sample,
$\mathbf{1}(\cdot)$ is the indicator function.
This formula computes the fraction of samples where the rounded predicted quality $\lfloor \hat{y}_i + 0.5 \rfloor$ matches the actual quality.
Given an input $\hat{y}$,
- (i) The value $\hat{y} + 0.5$ is computed first, which simply adds $0.5$ to the value of $\hat{y}$.
- (ii) The floor function is then applied to $\hat{y} + 0.5$, which together with step (i) rounds $\hat{y}$ to the nearest integer.
To explain this with an example, suppose $\hat{y} = 3.4$:
- (i) $\hat{y} + 0.5 = 3.9$;
- (ii) $\lfloor 3.9 \rfloor = 3$, because 3 is the largest integer less than or equal to 3.9.
Now, if $\hat{y} = 3.6$,
- (i) $\hat{y} + 0.5 = 4.1$;
- (ii) $\lfloor 4.1 \rfloor = 4$, because 4 is the largest integer less than or equal to 4.1.
Thus, the expression $\lfloor \hat{y} + 0.5 \rfloor$ is a common way to round a real number to the nearest integer: the floor function rounds down to the largest integer smaller than or equal to the given value, and adding $0.5$ before applying the floor function effectively rounds to the nearest integer. The indicator function, denoted $\mathbf{1}(\cdot)$, is a mathematical function used to indicate whether a condition is true (value 1) or false (value 0).
These results indicate that both models performed similarly, with the linear regression model being slightly more accurate.
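A minimal sketch of this rounded-accuracy computation is given below.

```python
import numpy as np

def rounded_accuracy(y_true, y_pred):
    """Fraction of samples where floor(y_pred + 0.5) equals the true quality."""
    rounded = np.floor(np.asarray(y_pred) + 0.5).astype(int)
    return float(np.mean(rounded == np.asarray(y_true)))

# Toy example: two predictions round to the correct quality, one does not.
y_true = [5, 6, 7]
y_pred = [5.4, 6.1, 6.2]
print(rounded_accuracy(y_true, y_pred))  # 0.666...
```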
6.1.4. Repeatability of Experiments
To ensure the repeatability of our experiments and to facilitate a fair comparison between different algorithms, we used a consistent random seed and a fixed test size for our data splits. Specifically, we set random_state=42 and test_size=0.2. This standardization ensures that if others follow the same procedures using Python, they will obtain identical results.
For logarithmic transformations, we used the natural logarithm with base e, which is approximately equal to 2.71828. For example, e to the power of 2 is approximately 7.3891, and the natural logarithm of 7.3891 is 2.
When using Python, it is important to note the distinction between np.log(x) and np.log1p(x). The function np.log(x) computes the natural logarithm of x. In contrast, np.log1p(x) computes the natural logarithm of $1 + x$, which is equivalent to $\ln(1 + x)$. The np.log1p(x) function is preferred because it helps avoid precision issues when x is close to zero.
For those using other statistical tools such as MATLAB, SAS, or SPSS, the equivalent operation is $\ln(1 + x)$, implemented either directly or through a dedicated log1p function. This consistency ensures that results across different platforms remain comparable.
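The distinction between np.log and np.log1p, and the natural-logarithm base, can be illustrated with a short sketch:

```python
import numpy as np

x = np.array([0.0, 1e-10, 0.5, 7.3891])

# np.log1p(x) computes ln(1 + x) accurately even when x is very small,
# whereas np.log(1 + x) can lose precision near zero.
print(np.log1p(x))
print(np.log(1 + x))

# ln(e^2) recovers 2, illustrating the natural-logarithm base used throughout.
print(np.log(np.exp(2)))  # 2.0
```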
6.2. Linear Regression Equation
Given the wine dataset, we first predicted the quality of the wine (
y) using linear regression; the equation is in
Table 8.
6.3. Ridge Regression Equation
Given the wine dataset, we then predicted the quality of the wine (
y) using ridge regression; the equation is in
Table 9.
6.4. Second-Degree Linear Regression Equation
Given the wine dataset, we then predicted the quality of the wine (
y) using linear regression with second-degree polynomial features; the equation is in
Table 10.
6.5. Second-Degree Ridge Regression Equation
Given the wine dataset, we then predicted the quality of the wine (
y) using ridge regression with second-degree polynomial features; the equation is in
Table 11.
6.6. Equations with Third-Degree Polynomial Features
When we derived the third-degree equations for predicting wine quality, our results were as follows: Linear Regression: RMSE = 0.8023, Accuracy = 0.5594; Ridge Regression: RMSE = 0.6778, Accuracy = 0.5875. This means that our accuracies decreased and our RMSE values increased relative to the second-degree models. Therefore, we can conclude that, up to now, the second-degree ridge regression formula is the best among all for predicting wine quality.
6.7. Best Second-Degree Ridge Regression Equation
In our study, we systematically explore all 2047 possible combinations of 11 wine characteristics to determine their predictive power using a second-degree polynomial ridge regression model. This analysis is facilitated by the following combination formula:
$$C(n, x) = \binom{n}{x} = \frac{n!}{x!\,(n - x)!},$$
where
$n!$ is the factorial of $n$, representing the product of all integers up to $n$;
$x!$ is the factorial of $x$;
$(n - x)!$ is the factorial of the difference between $n$ and $x$.
This formula calculates the number of different ways to select $x$ features from a set of $n$ features, irrespective of order. For our dataset comprising 11 features, we calculate the combinations for each subset size from 1 to 11, totaling $\sum_{x=1}^{11} \binom{11}{x} = 2^{11} - 1 = 2047$ non-empty subsets.
Each subset undergoes evaluation to determine the effectiveness of its corresponding second-degree polynomial ridge regression model in predicting wine quality. We continuously refine our search until identifying the subset that produces the highest accuracy, thereby optimizing our predictive model.
Given the following 11 features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, and alcohol, we considered all possible non-empty combinations of these features. For each combination, we fit a second-degree polynomial ridge regression model and evaluated its performance based on Root Mean Square Error (RMSE) and accuracy. The optimal combination, yielding the highest accuracy and lowest RMSE, was found to be fixed acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, pH, sulfates, and alcohol, which defines the optimized best second-degree ridge regression equation; the equation is in
Table 12.
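A sketch of this exhaustive subset search is given below; the dataset file name, the ridge penalty strength, and the use of the rounded-accuracy metric from Section 6.1.3 are assumptions about details not restated here.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

wine = pd.read_csv("winequality-red.csv", sep=";")  # assumed file name

# Log-transform the skewed features, as described in Section 6.1.1.
skewed = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
          "chlorides", "free sulfur dioxide", "total sulfur dioxide"]
wine[skewed] = np.log1p(wine[skewed])

features = [c for c in wine.columns if c != "quality"]
y = wine["quality"]

# This loop fits 2047 models and may take a few minutes.
best = {"accuracy": -1.0, "rmse": np.inf, "subset": None}
for size in range(1, len(features) + 1):
    for subset in itertools.combinations(features, size):
        X = wine[list(subset)]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        model = make_pipeline(
            PolynomialFeatures(degree=2, include_bias=False),
            Ridge(alpha=1.0),  # assumed penalty strength
        ).fit(X_tr, y_tr)
        pred = model.predict(X_te)
        acc = np.mean(np.floor(pred + 0.5) == y_te)
        rmse = np.sqrt(mean_squared_error(y_te, pred))
        if acc > best["accuracy"] or (acc == best["accuracy"] and rmse < best["rmse"]):
            best = {"accuracy": acc, "rmse": rmse, "subset": subset}

print(best)
```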
6.8. Summary of Mathematical Equations for Wine Quality Prediction
In our analysis, we aimed to predict the quality of wine using various mathematical models. Throughout the process, we derived different mathematical equations to predict wine quality. Our objective was to maximize accuracy while minimizing the Root Mean Square Error (RMSE).
6.8.1. Results from Different Models
As seen in
Table 13, the highest accuracy we achieved came from the optimized second-degree ridge regression model; its equation is highlighted in yellow below, and its RMSE is reported alongside its accuracy in Table 13.
6.8.2. Conclusions
In conclusion, the best second-degree ridge regression equation from
Table 12 performed the best in terms of accuracy, indicating that it is a reliable model for predicting wine quality. The other equations performed relatively well, but none surpassed the optimized ridge regression equation in accuracy.
7. Model Comparison for Wine Quality Prediction
In this section, we aimed to compare the accuracy of our optimized second-degree ridge regression equation with different machine learning models. The objective was to determine which model best predicted wine quality using a dataset with several chemical properties as features. We tested various models including linear regression, Lasso Regression, logistic regression, Elastic Net, K-Nearest Neighbors (KNN), neural networks, Naive Bayes, support vector machines (SVM), Principal Component Regression (PCR), partial least squares regression (PLSR), Random Forest, and extreme gradient boosting (XGBoost).
7.1. Data Preparation
We started by loading the dataset, which contains 11 chemical features and the target variable quality. To handle skewness in the data, we applied a logarithmic transformation to certain features for our best second-degree ridge model, such as fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide. For other models, no transformation was applied, and scaling was only applied when necessary.
7.2. Optimal Feature Selection for Ridge Regression
To identify the best performing model, we used ridge regression with second-degree polynomial features. We evaluated different combinations of the 11 features to find the best subset. We split the data into training and testing sets using an 80–20 split with a fixed random_state for consistency. Our best subset of features included fixed acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, pH, sulfates, and alcohol. This subset, which consisted of seven features, gave the highest accuracy and the lowest RMSE among all the subsets evaluated (see Table 14).
7.3. Model Evaluation
We then evaluated our model using the subset of features identified by the ridge regression, while the other models were tested using all 11 features. Each model was evaluated using accuracy and RMSE as metrics. The results are shown in
Table 14:
7.4. Conclusions
From
Table 14, it is clear that our ridge regression model with second-degree polynomial features performed exceptionally well, ranking fourth out of the fourteen models tested. The Random Forest Regressor and XGBoost models performed better, likely due to their ability to handle complex, nonlinear relationships effectively. Our mathematical model’s strong performance demonstrates the effectiveness of selecting the right subset of features and using polynomial transformations for prediction.
With a quality formula, we can understand the marginal effects of changes and determine the best direction to improve quality. This understanding empowers farmers and producers to make informed decisions, leading to higher quality wines. This project showcases how a well-crafted mathematical formula can outperform complex simulations, providing clear and actionable insights into improving wine quality.
Impact of Taster Variability on Model Evaluation
The quality scores in the dataset are inherently subjective, as they are assigned by human tasters based on sensory evaluation. Such assessments are influenced by individual preferences, cultural backgrounds, and environmental factors, resulting in variability in the dependent variable (“y”). This variability establishes a baseline level of noise in the dataset that cannot be eliminated by any predictive model.
When comparing different models, the differences in their errors, such as RMSE or accuracy, must be interpreted in the context of this inherent noise. For example, even if a model achieves a lower RMSE than another, the improvement might not hold significant practical value if the reduction in error is smaller than the natural variability in the quality scores. This underscores the importance of considering the limits of prediction accuracy imposed by the subjectivity of the target variable.
In this study, while models such as XGBoost and Random Forests demonstrated higher predictive accuracy, the ridge regression model was chosen for its interpretability and simplicity. These qualities allow stakeholders in wine production to understand the contribution of individual physicochemical properties to wine quality, facilitating actionable insights. The study’s emphasis on interpretability addresses practical concerns in the wine industry, where the ability to analyze and apply model results is as important as achieving high predictive accuracy.
Future work could explore ways to reduce the impact of subjective variability by incorporating additional objective measures or validating the predictions with sensory panels under controlled conditions.
7.5. Practical Applications of Predictive Modeling in Winemaking
The results from this study can provide winemakers with actionable strategies to enhance wine quality. The ridge regression model, with its interpretable coefficients, enables producers to identify key physicochemical attributes that contribute to higher quality scores and make informed adjustments to production processes.
7.5.1. Adjusting Alcohol Content for Optimal Balance
Our model indicates that alcohol has the most significant positive impact on wine quality, carrying the largest positive coefficient in the optimized equation. This suggests that maintaining an optimal alcohol level enhances sensory attributes and overall balance. Winemakers can control alcohol content by
Adjusting fermentation temperature to influence yeast activity.
Selecting yeast strains with different fermentation efficiencies to control ethanol production.
Implementing partial de-alcoholization techniques if necessary.
7.5.2. Managing Residual Sugar to Optimize Perception
The model assigns a positive coefficient to residual sugar, indicating that a moderate amount enhances quality perception. However, excessive sugar combined with sulfur dioxide can be detrimental. Practical steps for winemakers include
Controlling fermentation duration to leave the desired amount of residual sugar.
Using selective yeast strains that naturally metabolize sugars while preserving aroma.
Monitoring sulfur dioxide additions to avoid overpreservation effects that might negatively impact flavor.
7.5.3. Balancing Sulfur Dioxide Levels for Stability
Total sulfur dioxide positively influences quality, but free sulfur dioxide has a slightly negative impact, highlighting the importance of precise sulfur dioxide management. Winemakers can
Adjust SO2 additions based on pH and microbial stability requirements.
Use alternative preservation techniques (e.g., micro-oxygenation) to reduce SO2 reliance.
Regularly monitor free and bound SO2 to maintain balance.
7.5.4. Controlling Acidity and Sulfates for Sensory Harmony
Sulfates and pH show minor negative influences, suggesting that extreme values should be avoided to prevent off-flavors. Strategies include
Adjusting acid content through malolactic fermentation or tartaric acid additions.
Fine-tuning sulfate concentrations to enhance structure without increasing astringency.
Blending different batches to achieve a balanced acidity profile.
8. Predicting Out-of-Sample Wine Quality Values
In this section, we delve into the mathematical equation we developed for predicting wine quality. The formula used for our best second-degree polynomial ridge regression (given in Table 12) was specifically designed to capture the interplay between multiple physicochemical features, where the variables $x_1$ to $x_7$ represent the relevant features: fixed acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, pH, sulfates, and alcohol. The model captures both linear and quadratic effects, as well as cross-products between these variables.
Given the formula, we aimed to identify a composition with a predicted quality score just above 10. To achieve this, we evaluated the equation across a dense grid of possible values for each variable. These values span the ranges observed in the training set after applying the log transformation:
For each variable, we divided the range into 10 intervals, creating a grid of on the order of $10^7$ possible combinations across the seven variables. We then evaluated the equation for each combination, aiming to identify the minimum predicted value just above 10.
Through this exploration, we identified the composition of a hypothetical wine with the minimum predicted quality above 10; the minimum predicted quality found was 10.000000. The corresponding feature values are used as the fixed levels in the surface plots described below.
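The sketch below outlines how such a grid evaluation can be performed; the fitted pipeline, penalty strength, grid resolution, and column labels are assumptions, and a coarser grid is used to keep the example lightweight.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

wine = pd.read_csv("winequality-red.csv", sep=";")       # assumed file name
subset = ["fixed acidity", "residual sugar", "free sulfur dioxide",
          "total sulfur dioxide", "pH", "sulphates", "alcohol"]  # CSV spelling assumed
log_cols = ["fixed acidity", "residual sugar",
            "free sulfur dioxide", "total sulfur dioxide"]

X = wine[subset].copy()
X[log_cols] = np.log1p(X[log_cols])                       # transform skewed features
y = wine["quality"]

model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      Ridge(alpha=1.0)).fit(X, y)         # assumed penalty strength

# Build a grid over the observed (transformed) ranges; the study used 10 points
# per variable, while a coarser grid keeps this sketch lightweight.
n_points = 6
axes = [np.linspace(X[c].min(), X[c].max(), n_points) for c in subset]
grid = pd.DataFrame(np.array(np.meshgrid(*axes)).reshape(len(subset), -1).T,
                    columns=subset)

pred = model.predict(grid)
masked = np.where(pred > 10.0, pred, np.inf)              # keep only predictions > 10
if np.isfinite(masked).any():
    idx = int(np.argmin(masked))
    print("Minimum predicted quality above 10:", round(pred[idx], 6))
    print(grid.iloc[idx].to_dict())
else:
    print("No grid point exceeded a predicted quality of 10.")
```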
In
Figure 16, we present 3D surface plots of wine quality predictions using our optimal second-degree polynomial ridge regression model. As observed, the model predicts quality values exceeding 10, which surpass the typical upper limit of human taste perception. These graphs aim to enhance understanding of the quality predictions. By varying combinations of two parameters, we examine their impact on predicted quality values. These visualizations help in identifying how changes in these variables influence the quality outcome predicted by the model.
For each graph, we display the quality value along with two variables. Given the number of combinations of two variables from a set of seven, the total number of unique pair combinations is calculated using the combination formula $\binom{7}{2} = \frac{7!}{2!\,5!} = 21$, which yields twenty-one combinations.
When varying the values of these two variables to draw the quality graph, we hold the other five variables fixed at levels corresponding to the previously calculated minimum predicted quality value that exceeds 10.
These findings offer insights into the potential composition of a hypothetical wine just beyond the threshold of human perception.
The predictions of hypothetical quality scores exceeding 10 were designed to explore the potential of the model to identify optimal physicochemical compositions associated with the highest possible quality ratings. These scores, while hypothetical, provide a theoretical framework for guiding future experimental studies. It is important to emphasize that these predictions are not definitive but rather serve as a starting point for identifying promising wine compositions.
Scientifically, these predictions are interpreted as an extrapolation of the relationship between physicochemical properties and quality scores within the observed data range. While the model captures patterns in the dataset, its validity beyond the observed range (quality scores of 3 to 8) is inherently uncertain. Future validation using experimental and sensory evaluations will be necessary to assess the practical applicability of these predictions.
9. Conclusions
This study explored the use of mathematical models for predicting wine quality, aiming to provide a framework for extending predictions beyond traditional human rating scales. By analyzing 11 physicochemical variables of wine and applying various regression techniques, including linear regression, ridge regression, and polynomial models up to the third degree, we sought to develop an accurate and reliable predictive model.
Through a detailed evaluation, the second-degree polynomial ridge regression model emerged as the most effective approach, balancing predictive performance and simplicity. To optimize the model, we evaluated all possible subsets of the 11 variables, ultimately identifying a reduced set of seven variables that provided the best balance of accuracy and model interpretability. The optimized model achieved competitive accuracy and demonstrated robustness by outperforming several other machine learning models.
While ensemble methods such as XGBoost and Random Forests achieved higher accuracy and lower RMSE in this study, they often lack the interpretability necessary for understanding the individual contributions of physicochemical features. Neural networks, while capable of capturing complex, nonlinear relationships, also suffer from a lack of transparency, making it challenging to derive actionable insights.
In contrast, our optimized second-degree polynomial ridge regression model strikes a balance between predictive performance and interpretability. This is particularly valuable in the wine-making context, where understanding the marginal effects of individual features is critical for guiding production decisions and optimizing quality. The ridge regression model’s ability to provide a clear mathematical formula enables wine producers to identify and adjust specific attributes to improve wine quality effectively.
Future research could explore the use of ensemble methods or neural networks to benchmark accuracy further, while maintaining a focus on enhancing interpretability through techniques such as feature importance analysis or explainable AI (XAI) approaches.
One of the key contributions of this work is the exploration of hypothetical wine compositions with predicted quality scores extending beyond the conventional upper limit of 10. By leveraging the developed model, we demonstrated the potential to predict and analyze wine quality metrics that surpass human perceptual thresholds. While these predictions are hypothetical, they provide a novel perspective on how predictive modeling can extend beyond traditional boundaries. While this approach offers a novel perspective, it is essential to acknowledge the limitations of extrapolating beyond the observed data range. These predictions are not definitive but rather serve as a theoretical framework to guide future experimental validation. By highlighting potential physicochemical compositions, this study aims to inspire further research and innovation in optimizing wine quality.
This methodology has broader implications for optimizing and exploring the quality of other food products, such as bread, by providing insights into hypothetical compositions that may enhance quality metrics. For instance, in the case of bread, measurable chemical and physical features such as moisture content, protein content, crumb structure, crust color, and elasticity could be associated with a quality score assigned by food experts. By applying a similar modeling approach, it would be possible to predict the optimal composition of these features to achieve higher quality scores than currently observed.
Although the quality parameters of bread differ significantly from those of wine, this study illustrates the versatility of the proposed methodology for predicting and enhancing quality metrics across diverse food products. Future work will prioritize validation within the wine production industry while exploring potential adaptations of the methodology to other food domains.
The study demonstrates that a carefully optimized polynomial ridge regression model can achieve competitive performance while maintaining interpretability, making it suitable for practical applications in wine quality prediction. Comparisons with advanced machine learning models were conducted to validate the robustness of the proposed approach and highlight the trade-offs between accuracy and interpretability. By focusing on actionable insights and exploring hypothetical compositions, this work contributes to advancing the understanding of wine quality and its determinants, rather than serving as a general tutorial on data processing methods.
9.1. Analysis of Ridge Regression for Predicting Wine Quality
The ridge regression model applied to our wine quality dataset yielded the results shown in
Figure 17 and
Figure 18, which present a residuals vs. predicted values plot and a Q-Q plot of the residuals, respectively. This subsection outlines the strengths and limitations of the model, based on key performance metrics and visual diagnostic tests.
9.1.1. Model Performance Metrics
The performance metrics, as provided in the statistical analysis, are as follows:
$R^2$ on Training Set: −0.1397
$R^2$ on Test Set: −0.0636
Adjusted $R^2$ on Training Set: −0.1460
Cross-Validation MSE: 0.7415
Predicted quality after 5% perturbation: 10.3362
95% Confidence Interval for predicted quality: (4.1488, 6.9951)
The $R^2$ values indicate how much of the variance in wine quality the model explains. The negative values on both the training and test sets suggest that the model is currently underperforming. However, cross-validation results show an acceptable mean squared error (MSE) of 0.7415, indicating that the model’s predictive power could be enhanced with further refinement.
Furthermore, the perturbation analysis demonstrates that the model can respond to small variations in input features. A 5% increase in the features resulted in a predicted quality score of 10.3362, showing that the model is capable of reflecting changes in the input data.
9.1.2. Residual Analysis
Residual analysis is crucial for assessing the model’s accuracy. The residuals are defined as
$$e_i = y_i - \hat{y}_i,$$
where $e_i$ is the residual, $y_i$ is the observed wine quality, and $\hat{y}_i$ is the predicted quality.
Figure 17 illustrates the relationship between the predicted values and residuals. The plot shows a relatively even distribution of residuals around the zero line, which suggests that the model does not systematically overpredict or underpredict across the range of predicted wine quality scores. This even distribution is key to avoiding model bias.
Figure 18 presents the Q-Q plot of residuals, which compares the quantiles of the residuals with the quantiles of a standard normal distribution. The majority of residuals align with the theoretical quantile line, indicating that the residuals are approximately normally distributed. This supports the validity of the ridge regression model, as it fulfills the assumption of normally distributed residuals.
9.1.3. Predictive Confidence Intervals
We calculated the 95% confidence interval for the predicted quality scores. Confidence intervals provide a range of values within which the true prediction is likely to fall, and they are computed as follows:
$$\hat{y} \pm z_{\alpha/2} \cdot SE(\hat{y}),$$
where $z_{\alpha/2}$ is the critical value from the standard normal distribution (1.96 for a 95% interval) and $SE(\hat{y})$ is the standard error of the prediction. The calculated 95% confidence interval for predicted wine quality is between 4.1488 and 6.9951. This range suggests that the model can predict moderate-quality wines with confidence, though it struggles to accurately predict very high- or very low-quality wines.
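A sketch of this interval computation is given below, under the assumption that the standard error of a prediction is approximated by the residual standard deviation on held-out data; the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

def prediction_interval(point_prediction, residuals, confidence=0.95):
    """Approximate CI using the residual standard deviation as the standard error."""
    z = stats.norm.ppf(0.5 + confidence / 2)   # e.g., 1.96 for a 95% interval
    se = np.std(residuals, ddof=1)             # standard-error proxy (assumption)
    return point_prediction - z * se, point_prediction + z * se

# Toy residuals from a held-out set and an illustrative point prediction.
residuals = np.array([0.4, -0.6, 0.7, -0.5, 0.3, -0.8, 0.6])
low, high = prediction_interval(5.57, residuals)
print(f"95% CI: ({low:.4f}, {high:.4f})")
```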
In conclusion, although the ridge regression model currently shows some limitations in terms of explaining the full variability in wine quality (as indicated by the negative $R^2$ values), its residual analysis, cross-validation results, and predictive confidence intervals demonstrate that it can still provide valuable predictions. The model is responsive to small changes in input features and maintains an approximately normal distribution of residuals. With further refinement and feature selection, its predictive performance can be improved.
9.2. Comparison with Related Studies and Suggestions for Improvement
The accuracy achieved by the optimized second-degree polynomial ridge regression model demonstrates a balance between predictive performance and interpretability. Ridge regression was selected for its simplicity and ability to reveal the relationships between physicochemical properties and wine quality. This interpretability is particularly valuable in winemaking, where understanding feature contributions is essential for practical applications.
The choice of ridge regression aligns with the study’s objectives of balancing performance with interpretability and extending predictions beyond the observed data range. Although models such as Neural Network Classifier, Random Forest Classifier, and XGBoost achieved higher accuracies within the dataset’s range, their predictions are often limited to the patterns present in the training data. In contrast, ridge regression, with its assumptions of linearity and regularization, provides a framework that balances predictive accuracy with extrapolation capabilities.
Moreover, ridge regression offers direct insights into feature contributions, which are crucial for applications such as optimizing wine quality based on individual physicochemical properties. For this study’s goal of identifying potential wine compositions that achieve high-quality scores, ridge regression provides a practical and interpretable approach. Future research could enhance this framework by integrating ridge regression with advanced models, enabling the cross-validation of predictions and improving the robustness of extrapolated results.
There are limitations of relying solely on a single dataset, as it restricts the diversity of samples and may limit the model’s generalizability. Future studies could benefit from the following:
Cross-Validation with Diverse Datasets: Incorporating datasets from different wine-producing regions or vintages could help evaluate the robustness and applicability of the model across varied conditions. Such datasets could capture unique patterns in physicochemical properties and quality scores, thereby enhancing the model’s predictive power.
Data Augmentation Techniques: Synthetic generation of feature combinations could expand the dataset’s diversity, enabling the exploration of a broader range of physicochemical properties. For instance, Gaussian noise addition or feature interpolation methods could simulate new data points while maintaining the relationships observed in the original dataset.
By integrating diverse datasets and leveraging data augmentation techniques, future work can address the current limitations and further improve the robustness and applicability of predictive models for wine quality and other food quality assessments.
We also acknowledge that advanced machine learning models, such as neural networks and ensemble methods like Random Forests, could potentially achieve higher predictive accuracy. Neural networks, for instance, excel in capturing complex, non-linear relationships and may provide better performance within the observed data range. However, their extrapolation capabilities are limited and require careful training with augmented or synthetic data to predict values beyond the training range. Random Forests, on the other hand, inherently lack the ability to predict out-of-range values due to their reliance on decision-tree splits, which restrict predictions to the observed range of the training data.
Future studies could benefit from benchmarking the ridge regression model against these advanced techniques to evaluate trade-offs in accuracy, complexity, and interpretability. Additionally, research could explore strategies for enhancing the out-of-range prediction capabilities of advanced models through synthetic data augmentation or domain-specific feature engineering. This comprehensive comparison would provide valuable insights into the strengths and limitations of each approach, guiding the selection of models based on specific research and practical objectives.
For the present study, ridge regression aligns well with our emphasis on understanding physicochemical contributions to wine quality and exploring hypothetical compositions for optimization. By maintaining a focus on interpretability, this study aims to provide actionable insights for both researchers and practitioners in the wine-making industry.
The achieved accuracy for the optimized second-degree polynomial ridge regression model was evaluated in the context of similar studies that applied machine learning techniques for wine quality prediction. To the best of our knowledge, there is a lack of studies specifically focusing on continuous mathematical modeling for wine quality prediction. Most existing research employs machine learning models aimed at the classification or categorization of wine quality into discrete classes. In contrast, this study uniquely emphasizes continuous prediction, aiming to quantify quality scores beyond the observed range of the current dataset. This predictive approach provides novel insights into potential wine compositions and quality metrics that surpass traditional categorizations. The prediction of quality scores exceeding the observed range (e.g., 10) is exploratory and serves to identify theoretical compositions that could inform experimental designs. However, these predictions must be interpreted cautiously, as they extrapolate beyond the dataset’s range. Future studies should validate such predictions through experimental and sensory evaluations to ensure their scientific and practical relevance.
While the current model shows promise, there are opportunities to improve its predictive performance in future research. Specifically, incorporating additional objective and measurable variables not currently included in the dataset could provide more comprehensive insights. These variables include
Volatile Compounds: Measurable using gas chromatography–mass spectrometry (GC-MS), these compounds are key contributors to the aroma profile of wine.
Phenolic Content: Total phenolic compounds, including tannins, can be quantified using spectrophotometry and are linked to the astringency and structure of the wine.
Color Intensity and Hue: Quantifiable using spectrophotometric methods, these parameters provide insights into the wine’s appearance and aging potential.
Glycerol Content: Quantifiable using enzymatic assays, glycerol influences the wine’s body and mouthfeel.
In addition to incorporating these measurable variables, the following strategies could enhance model performance:
Expanding the Dataset: Including wines with a broader range of physicochemical properties and quality scores could improve the model’s generalizability and robustness. A more diverse dataset may capture underrepresented patterns, thereby enhancing accuracy.
Validation with Independent Datasets: Testing the model on independent datasets from different wine-producing regions could ensure robustness and applicability across diverse contexts.
By integrating additional objective variables and applying these strategies, future studies can address the current limitations and further enhance the predictive power and applicability of mathematical models for wine quality and other food quality assessments.
9.3. Interpretation of Ridge Regression Coefficients and Their Implications
The best second-degree ridge regression equation derived in this study provides valuable insights into the influence of various physicochemical properties on wine quality. The regression coefficients reveal the significance of individual features and their interactions, highlighting key factors that affect wine composition and quality scores.
9.3.1. Key Contributors to Wine Quality
The regression results indicate that certain physicochemical attributes play a dominant role in determining wine quality:
Alcohol is the most significant positive contributor to wine quality. This aligns with enological research, where higher alcohol content is often associated with better sensory attributes and structural balance in wine.
Residual sugar has a positive impact, suggesting that wines with moderate residual sugar levels are perceived as higher quality. However, excessive sugar, particularly when combined with high sulfur dioxide levels, can be detrimental.
Total sulfur dioxide contributes positively, indicating that controlled levels of sulfur dioxide help preserve wine quality. However, its interaction effects with free sulfur dioxide suggest that excessive levels can be undesirable.
Free sulfur dioxide exhibits a negative coefficient, reinforcing the importance of maintaining balanced sulfur dioxide levels to avoid off-flavors and undesirable oxidative effects.
Sulfates have a slight negative influence on quality, supporting the need for careful sulfate adjustments to prevent excessive astringency.
pH also exhibits a minor negative effect, underscoring the importance of maintaining optimal acidity levels.
9.3.2. Non-Linear Effects and Interaction Terms
The quadratic and interaction terms further highlight the complexity of wine chemistry:
Fixed acidity follows a non-linear pattern, in which moderate levels enhance quality but excessive acidity negatively affects perception.
Residual sugar shows that slight increases improve quality, whereas excessive sugar can reduce consumer preference.
The interaction between fixed acidity and residual sugar indicates that a well-balanced combination of these attributes enhances wine quality.
The interaction between free sulfur dioxide and sulfates suggests that excessive use of the two together has a strong negative effect, reinforcing the importance of precise sulfite management.
The interaction between residual sugar and alcohol highlights that too much sugar combined with high alcohol content can reduce overall quality.
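As a minimal illustration of how the linear, quadratic, and interaction terms discussed above can be mapped to named coefficients, the sketch below fits a second-degree polynomial ridge model and lists the largest-magnitude terms. The file name and ridge penalty are assumptions, and the coefficients obtained this way will not exactly match those reported in this study.

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("winequality-red.csv", sep=";")  # assumed UCI red-wine file
X, y = df.drop(columns=["quality"]), df["quality"]

# Expand the eleven features into linear, quadratic, and pairwise interaction terms.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
ridge = Ridge(alpha=1.0).fit(X_poly, y)

# Attach readable names such as 'alcohol', 'residual sugar^2',
# or 'free sulfur dioxide sulphates' to each coefficient.
coefs = pd.Series(ridge.coef_, index=poly.get_feature_names_out(X.columns))
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))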
9.3.3. Practical Implications for Winemaking
These findings provide actionable insights for wine producers:
Maintaining alcohol levels within optimal ranges is crucial for achieving high-quality wines.
Controlling residual sugar ensures a balance between sweetness and acidity, preventing undesirable sensory effects.
Managing sulfur dioxide levels appropriately minimizes oxidation while avoiding excessive sulfite-related defects.
Considering the interactions between acidity, sulfur dioxide, and other components can enhance consistency and quality control.
By leveraging these insights, winemakers can optimize production processes and improve wine quality predictions through data-driven decision-making. Future research could further refine these findings by incorporating sensory evaluation metrics and testing various winemaking conditions.
9.4. Implications for Predictive Modeling and Winemaking Applications
The statistical transformations and modeling approaches applied in this study provide crucial insights into the relationship between wine composition and quality. The results highlight the effectiveness of transformation techniques in stabilizing variance and improving the normality of key wine features, ultimately enhancing predictive model performance.
One key finding is that applying log transformations to variables such as fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide significantly improved normality and reduced skewness. These transformations directly impact model accuracy, as demonstrated in prior studies [63,73]. The improved model predictions enable winemakers to make data-driven decisions in optimizing wine composition for higher quality scores.
Furthermore, the application of polynomial ridge regression provides a more refined understanding of how different chemical components interact to influence wine quality. The results indicate that non-linear relationships play a critical role, particularly for features like residual sugar and sulfites, which exhibit non-Gaussian distributions prior to transformation. These findings suggest that conventional linear modeling may be insufficient for capturing the complexities of wine chemistry, reinforcing the necessity of advanced machine learning techniques in oenology research.
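A minimal sketch of this preprocessing and modeling sequence is given below, assuming the standard UCI red-wine file. The log1p transform is used in place of a plain logarithm so that zero values (e.g., citric acid) are handled, and the ridge penalty is an illustrative choice rather than the tuned value from this study.

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("winequality-red.csv", sep=";")  # assumed UCI red-wine file

# Log-transform the right-skewed variables named above; log1p tolerates zeros.
skewed = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
          "chlorides", "free sulfur dioxide", "total sulfur dioxide"]
df[skewed] = np.log1p(df[skewed])

X, y = df.drop(columns=["quality"]), df["quality"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Second-degree polynomial ridge regression on the transformed features.
model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0)).fit(X_tr, y_tr)
print("Held-out R^2:", model.score(X_te, y_te))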
From a practical standpoint, the results can aid winemakers in identifying key variables that require stricter control during the production process. For example, the transformation results indicate that minimizing variations in sulfur dioxide content and balancing acidity levels can lead to more consistent quality scores. Additionally, winemakers can leverage these insights to refine blending strategies, ensuring that optimal feature compositions align with high-quality wines.
Overall, the integration of statistical transformations and advanced modeling contributes to more accurate quality assessments and predictive frameworks for wine production. Future research can extend these methodologies by incorporating sensory evaluation data and real-time monitoring systems, further bridging the gap between data-driven analytics and practical winemaking applications.
9.5. Validity and Practical Application of Ridge Regression in Wine Quality Prediction
The application of ridge regression in this study provides an interpretable and robust approach to predicting wine quality. While the model achieves an accuracy of approximately 0.6, it remains valuable for practical applications in winemaking. Similar predictive models with moderate accuracy have been successfully used in various fields, including food science, healthcare, and engineering, demonstrating their ability to provide actionable insights despite statistical limitations.
9.5.1. Regression Models in Wine and Food Quality Assessment
Regression-based predictive modeling has been widely applied in wine and food quality research. Kasimati et al. investigated multiple regression techniques, including Ordinary Least Squares (OLS) and decision trees, for predicting total soluble solids in wine grapes, demonstrating that such models can achieve comparable levels of accuracy [160]. Similarly, Beh and Farver highlighted the advantages of using regression models to analyze wine characteristics, emphasizing their effectiveness in extracting meaningful patterns from physicochemical data [161].
Multivariate regression models have also proven effective in wine quality assessment. Yin et al. applied partial least squares regression (PLSR) to correlate grape physicochemical indices with wine quality, illustrating the predictive efficiency of these models in real-world applications [162]. Similarly, Aleixandre-Tudó et al. demonstrated the potential of multivariate regression methods in predicting sensory attributes in red wines, reinforcing the value of statistical models for practical winemaking decisions [163]. Furthermore, recent advancements in machine learning have improved the predictive power of regression-based approaches, with studies such as Zeng et al. demonstrating the benefits of ensemble learning for wine quality evaluation [164].
9.5.2. Predictive Models in Practical Applications
Predictive models with moderate statistical performance have been successfully employed in various domains, including food safety, healthcare, and engineering, where their predictions guide important decisions.
In food science, predictive microbiology models forecast microbial growth in food products to ensure quality and safety. Despite their moderate predictive accuracy, these models play a significant role in optimizing food production and extending shelf life [165,166]. Similarly, healthcare applications leverage machine learning models with accuracy levels around 0.6 to predict disease progression and patient outcomes, offering valuable insights for clinical decision-making [167,168,169]. Engineering applications also utilize predictive models for optimizing design and operational processes, where even models with moderate accuracy provide useful recommendations [170,171,172].
These examples highlight that predictive models do not need perfect accuracy to generate meaningful insights. Instead, they provide a structured approach to decision-making, allowing domain experts to integrate model-based recommendations with empirical knowledge.
9.5.3. Interpreting Model Performance in Wine Quality Prediction
While the model exhibits negative determination coefficients in certain cases, this does not necessarily render it unsuitable for practical application. Negative values typically indicate instances where the linear regression terms struggle to fully capture the variance in quality evaluations, particularly due to the complexity of sensory attributes and human perception in winemaking. However, the presence of non-linear interaction terms in the model compensates for these limitations, as demonstrated by the improved feature relationships captured through polynomial ridge regression.
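For reference, the coefficient of determination on a held-out set is

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},

which becomes negative whenever the residual sum of squares exceeds the total variance of the observed quality scores, i.e., whenever the model predicts worse than simply using the mean quality of the test sample.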
In practice, many quality assessment models function effectively despite moderate accuracy. For example, regression-based models have been used in the food industry to evaluate sensory attributes such as texture and flavor in processed foods, including cheese, sausage, and confectionery products [173]. These studies demonstrate that, even with statistical limitations, predictive models can contribute to process optimization and quality control.
9.5.4. Potential for Experimental Validation
To further validate the practical application of this model, future research can explore experimental testing by evaluating wines that align with the predicted physicochemical properties. This validation approach would involve the following steps:
Selecting wine samples with chemical compositions matching the model’s optimal predictions.
Conducting controlled sensory evaluations by trained wine tasters to assess whether predicted quality aligns with perceived quality.
Comparing experimental results with model outputs to refine predictive accuracy and enhance practical usability.
Such an approach would provide empirical confirmation of the model’s effectiveness and bridge the gap between statistical modeling and real-world winemaking applications.
Despite inherent variability in wine quality assessments, ridge regression offers a structured and interpretable approach to predicting key quality factors. The application of regression-based models in food science and winemaking has been well documented, with numerous studies demonstrating their value in optimizing production processes. While model accuracy remains moderate, predictive analytics continues to provide actionable insights for wine producers, guiding quality enhancement strategies. Future work integrating experimental validation with model-based predictions will further strengthen its applicability in winemaking.