1. Introduction
Data science combines data engineering, scientific methods, mathematics, visualization, and statistically based algorithms with a domain of application to make sense of large quantities of data. With the rise of the internet, data has become abundant; as a result, data science has become one of the most popular research areas of the 21st century. Within this field there are four major types of learning algorithms: Supervised Learning [1], Unsupervised Learning [2], Semi-supervised Learning [3], and Reinforcement Learning [4]. Each of these methods provides useful and distinct information to a domain with large amounts of data.
Wine has been enjoyed by people across the world for several thousand years. It is both delicious and so wildly varied that people often dedicate a great deal of their time and money to tasting, comparing, and discussing different wines with their friends and peers. According to the International Organization of Vine and Wine (OIV), the world’s authority on wine statistics, 293 million hectoliters (Mhl) of wine were produced across 36 countries in 2018. This constitutes a 17% increase in wine production from 2017 to 2018 [5]. The world’s total wine production in 2019 is estimated to be 263 Mhl, slightly below the average global wine production of 270 Mhl over the last ten years [6]. According to the OIV statistics, wine is a high-value product that heavily affects the economies of many wine-producing countries, such as France, Italy, and Spain.
Unsupervised machine learning algorithms infer patterns from a large dataset without reference to known or labeled outcomes [2]. What separates them from supervised machine learning algorithms is that there is no prior knowledge of what the results will show. Several studies have applied unsupervised learning techniques to wine-related data: References [7,8] clustered wine consumers to understand their behavior. References [9,10] used clustering algorithms to study the effects of moderate wine consumption on the human body. References [11,12,13] worked on the chemical analysis of wine. None of these studies examined the flavor of wine, and the datasets used for clustering contain fewer than 200 samples.
Wineinformatics [14,15] incorporates data science and wine-related datasets, including physicochemical laboratory data and wine reviews, to discover useful information for wine producers, distributors, and consumers. Physicochemical laboratory data usually comes from physicochemical composition analysis [16], covering properties such as acidity, residual sugar, and alcohol content, and is used to characterize wine. Most existing data mining studies in the wine domain use physicochemical data with fewer than 200 wine samples [17,18,19]. However, physicochemical analysis cannot express the sensory quality of wine. Wine reviews are produced by sommeliers, people who specialize in wine. These reviews usually cover aroma, flavors, tannins, weight, finish, appearance, and the interactions among these wine sensations [20]. Physicochemical laboratory data is easy for computers to read and analyze, whereas wine review data requires natural language processing and carries a degree of human bias; even so, we believe the analysis of wine reviews can provide useful information to broader audiences. Therefore, the Computational Wine Wheel was developed to accurately capture the keywords that commonly appear in wine reviews, including not only flavors but also non-flavor notes [21,22].
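The idea of capturing review keywords can be illustrated with a minimal Python sketch. It assumes a tiny, hypothetical phrase dictionary (the names `WHEEL`, `extract_attributes`, and `to_binary_vector` and all dictionary entries are illustrative; the actual Computational Wine Wheel is far larger and is not reproduced here). Each review is reduced to a set of normalized attributes and then to a binary vector suitable for later analysis.

```python
import re

# Hypothetical miniature dictionary: review phrase -> normalized attribute.
# These entries are illustrative assumptions, not the real Computational Wine Wheel.
WHEEL = {
    "black cherry": "CHERRY",
    "cherry": "CHERRY",
    "blackberry": "BLACKBERRY",
    "firm tannins": "TANNINS_FIRM",
    "long finish": "FINISH_LONG",
}

def extract_attributes(review):
    """Return the set of normalized attributes whose phrases occur in the review."""
    text = review.lower()
    found = set()
    # Try longer phrases first so "black cherry" is consumed before "cherry".
    for phrase in sorted(WHEEL, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text):
            found.add(WHEEL[phrase])
            text = re.sub(pattern, " ", text)  # remove matched text
    return found

def to_binary_vector(attrs, vocabulary):
    """Encode a set of attributes as a 0/1 vector over a fixed vocabulary."""
    return [1 if name in attrs else 0 for name in vocabulary]

VOCABULARY = sorted(set(WHEEL.values()))
review = "Ripe black cherry and blackberry flavors, with firm tannins and a long finish."
attrs = extract_attributes(review)
vector = to_binary_vector(attrs, VOCABULARY)
print(sorted(attrs), vector)
```

Matching whole phrases rather than single words is one possible design choice; it keeps a multi-word descriptor such as "long finish" from being split into unrelated tokens.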
The wine-making region in southwestern France known as Bordeaux produces some of the most highly regarded and sought-after wines. The massive and widespread popularity of Bordeaux wines can be partly attributed to a marriage in the 12th century. Bordeaux wine was served at the wedding of King Henry II and Eleanor of Aquitaine [23]. This established a connection between the region and the royal family, boosting its early popularity. The wedding also served to bring the Bordeaux region under British rule, leading to the widespread trade of the wine throughout the British Empire. Today, Bordeaux is the largest wine-producing region in France and one of the most influential wine regions in the world. Several studies have applied data mining and data science techniques to Bordeaux wines to understand the correlation between price and vintage from historical and economic data [24,25,26]. Several others built mathematical and computational models to study ontology and wine quality through grapevine yields [27,28,29]. These Bordeaux studies, as well as some current wine research [30,31,32,33], worked on small to medium-sized wine datasets. With the rise of the internet, data has become abundant; we believe Wineinformatics is the key to analyzing large volumes of existing wine-related data. Therefore, in our previous Wineinformatics research [34], we explored all 21st century Bordeaux wines by creating a publicly available dataset with 14,349 Bordeaux wines [35]. To the best of our knowledge, this dataset is the largest wine-region-specific dataset in the open literature.
Wineinformatics research has studied many interesting wine-related supervised learning problems, including regression and classification with large amounts of data. In [14], white-box and black-box classification models were built to evaluate wine reviewers’ consistency between wine grades and wine reviews written in natural language. Regression models were constructed to predict a wine’s grade, price, and region in [15]. In Reference [36], association rules were used to find the characteristics of Napa’s Cabernet Sauvignon. Naïve Bayes classifiers were used to find important flavor and non-flavor attributes corresponding to high-quality 21st century Bordeaux wines [34]. However, few studies have applied unsupervised learning approaches to Wineinformatics. In References [22,37], a TriMax triclustering algorithm was proposed to cluster 250 wines across five different vintages. The Fuzzy C-means clustering algorithm was applied to form information granules that support the performance of supervised learning techniques, which is better described as semi-supervised learning [38]. To the best of our knowledge, no literature has focused on how to use unsupervised learning to find beneficial information for wine distributors and consumers from large amounts of data, especially from region-specific datasets.
With the massive selection of Bordeaux wines on the market, wine vendors face many tough choices when selecting which wines to include in their offerings. No vendor can possibly supply all available wines, so each must choose a limited number that provides the best selection for its customers. Choosing these wines can be a difficult process, and this project aims to provide some insight by using unsupervised learning to group similar wines so that a vendor can make more informed decisions. This study allows wine distributors to compile a comprehensive list of selections from any group of wines without missing a particular type. For the scope of this project, we focus solely on wines from the Bordeaux region of France. The approaches we used can be easily applied to any selection of wines, depending on the need.
5. Conclusions
Wineinformatics is a new data science research area that focuses on large amounts of wine-related data. In this research, unsupervised analysis was applied to 14,349 wines to select representative 21st century Bordeaux wines. A systematic process that combines K-means clustering with an optimal K search and a filtration process was proposed and carried out in this work. Detailed clustering results from two different filtering methods were provided in the results section: the first method looks at the overall presence of each attribute, and the second focuses on the attribute distribution based on a user-defined pivot. Both have shown promise for generating unique clusters of wines, and both should be considered for any real-world use case.
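The clustering step summarized above can be sketched in plain Python under several assumptions that are not taken from this paper: the silhouette score is used as the criterion for the K search, seeding is a deterministic farthest-first pass (chosen here only for reproducibility), the representative wine of a cluster is the member nearest its centroid, and the filtration step is omitted. The toy `WINES` matrix and all function names are hypothetical.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then add the farthest point."""
    seeds = [list(points[0])]
    while len(seeds) < k:
        nxt = max(points, key=lambda p: min(dist2(p, s) for s in seeds))
        seeds.append(list(nxt))
    return seeds

def kmeans(points, k, max_iter=100):
    """Plain K-means (Lloyd's algorithm); returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def silhouette(points, labels):
    """Mean silhouette score; points in singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    def mean_dist(i, members):
        return sum(math.sqrt(dist2(points[i], points[j])) for j in members) / len(members)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def search_k(points, k_values):
    """Return (best_k, labels, centroids) for the K with the highest silhouette."""
    best = None
    for k in k_values:
        labels, centroids = kmeans(points, k)
        score = silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, k, labels, centroids)
    return best[1], best[2], best[3]

def representatives(points, labels, centroids):
    """Map each cluster to the index of its member closest to the centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members, key=lambda i: dist2(points[i], centroid))
    return reps

# Toy binary attribute matrix: 9 hypothetical wines, 9 hypothetical attributes.
WINES = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 0, 0, 1, 0, 1],
]
best_k, labels, centroids = search_k(WINES, range(2, 6))
reps = representatives(WINES, labels, centroids)
print(best_k, labels, reps)
```

On real data, a library implementation with multiple random restarts would likely be preferable to this hand-written loop; the sketch is only meant to make the sequence of steps concrete.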
The intended use of these methods is to help wine vendors make a selection given the limited number of wines they can realistically offer. These wines will hopefully represent a broad range of flavor profiles within a given dataset and therefore please the widest market. Wine connoisseurs can also try the representative wines of each cluster to understand the variety of a wine region with as few wines as possible. Another use of the clusters could be a recommendation system. A cluster groups similar wines; a consumer who enjoyed a representative wine from a cluster can be recommended other wines in that cluster at a higher (or lower) price.
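The recommendation idea can be sketched as follows, assuming cluster labels are already available and each wine has a known price. The function name `recommend`, the toy labels, and the prices are hypothetical, and ordering candidates by price proximity is only one possible choice.

```python
def recommend(liked, labels, prices, direction="higher"):
    """Return indices of wines in the liked wine's cluster that cost more
    (or less) than it, ordered by how close their price is to the liked wine."""
    if direction not in ("higher", "lower"):
        raise ValueError("direction must be 'higher' or 'lower'")
    same_cluster = [i for i, lab in enumerate(labels)
                    if lab == labels[liked] and i != liked]
    if direction == "higher":
        picks = [i for i in same_cluster if prices[i] > prices[liked]]
    else:
        picks = [i for i in same_cluster if prices[i] < prices[liked]]
    return sorted(picks, key=lambda i: abs(prices[i] - prices[liked]))

# Toy example: six hypothetical wines with cluster labels and prices.
toy_labels = [0, 0, 0, 1, 1, 0]
toy_prices = [25.0, 60.0, 18.0, 40.0, 22.0, 35.0]
higher = recommend(0, toy_labels, toy_prices, "higher")
lower = recommend(0, toy_labels, toy_prices, "lower")
print(higher, lower)
```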
The dataset presented in this paper focuses on 21st century Bordeaux wines, with vintages from 2000 to 2016. Future studies can adopt the same process to find representative wines for a different wine-making region or country, different vintage(s), or different pivot points such as price, weather, or terroir. This finding can inform Wineinformatics research on many different wine topics and has the potential to provide useful information to wine makers, consumers, and distributors.