The Chef’s Choice: System for Allergen and Style Classification in Recipes

Roither, Andreas; Kurz, Marc; Sonnleitner, Erik

doi:10.3390/app12052590

by Andreas Roither, Marc Kurz * and Erik Sonnleitner

Department of Mobility and Energy, University of Applied Sciences Upper Austria, 4232 Hagenberg, Austria

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(5), 2590; https://doi.org/10.3390/app12052590

Submission received: 1 December 2021 / Revised: 7 February 2022 / Accepted: 28 February 2022 / Published: 2 March 2022

(This article belongs to the Special Issue Applications of Machine Learning in Food Industry)

Abstract:

Allergens in food items can be dangerous for individuals affected by food allergies. Considering how many different ingredients and food items exist, it is hard to keep track of which food items contain relevant allergens. Food businesses in the EU are required to label foods with information about the 14 major food allergens defined by the EU legislation. This improves the situation for affected individuals. Nevertheless, more changes are necessary to provide reasonable protection for people with severe allergic reactions. Recipe websites and online content are usually not labelled with allergens. In addition, the 14 main allergen categories consist of a variety of different ingredients that are not always easy to remember. Scanning websites and recipes for specific allergens can consume a fair amount of time if the reader wants to make sure no allergen is missed. In this article, a dataset is processed and used for machine learning to classify cuisine style and allergens. The dataset used contains labelling for the 14 major allergen categories. Furthermore, a system is proposed that informs the user about style and allergens in a recipe with the help of a browser add-on. To measure the performance of the proposed system, a user study is conducted where participants label recipes with food allergens. A comparison between human and system performance as well as the time needed to read and label recipes concludes this article.

Keywords:

machine learning; food allergens; allergen classification; cuisine classification

1. Introduction

Allergen labelling has improved over the years; in particular, the introduction of laws that require companies to label their food items with the corresponding allergens has contributed greatly to this change. Food businesses are now required to label foods with information about the 14 major food allergens defined by the EU legislation [1]. The categories include: cereals containing gluten, crustaceans, eggs, fish, peanuts, soybeans, milk, nuts, celery, mustard, sesame seeds, sulphur dioxide and sulphites, lupin and molluscs. This is a big step in the right direction for individuals affected by food allergies. However, more changes are necessary to provide good protection for people with severe allergic reactions. One such improvement concerns online content and recipe websites. The lack of informational labels for recipe websites and online content can be frustrating. Some recipe websites offer categories like “fish”, and others explicitly list some allergen categories for their content. The downside of these websites is that they offer less content than ordinary unlabelled recipe websites. Even if the chef knows about the 14 major food allergens defined by the EU, remembering all ingredients that belong to a certain category can be quite a challenge. Some exotic recipes contain ingredients that are usually not found in simpler recipes and are thus unlikely to be known to the average user. Some recipes contain a multitude of ingredients, which makes reading through all the contents more time-consuming and challenging. If time is short and the patience of the reader is limited, some ingredients might be skipped or overlooked. This introduces the possibility of overlooking an important food allergen. Users affected by allergies are likely to be informed about the ingredients that contain the particular allergen they are allergic to; however, a pre-screening system could serve as an additional layer of security.

This paper shows how a dataset is processed and used for machine learning to classify cuisine (recipe styles) and allergens. The dataset contains labelling for the 14 major allergen categories. Furthermore, a prototype system is proposed that informs the user about allergens in a recipe. Together with a user study that measures the performance of humans labelling recipes with food allergens, the system’s performance is evaluated. The results of the system and the comparison to the user study determine if the system with the trained classifier can be used in real-world scenarios and is a viable alternative to human screening.

1.1. Motivation

A food allergy is an adverse immunological reaction triggered by the intake of certain food products [2] and is not classified as a single disease. The number of available food products on the market and the lack of food databases with correct allergen labelling make it challenging to provide accurate food allergen classification. The sheer number of different ingredients that can contain food allergens is hard to memorize. Although affected individuals are likely to know about the ingredients that affect them, some might escape their attention when reading and cause unnecessary stress. Individuals that have allergic reactions to certain food allergens have to be careful when looking for recipes online, as for some it is even a matter of life and death. Given the rise in the prevalence of food allergies [3], additional measures to inform individuals with food allergies have become more and more important. A system that aids in the selection of recipes could reduce the amount of time needed as well as the frustration of affected individuals. Integration with existing technologies like a web browser saves time and helps with scanning, compared to a dedicated app into which the text has to be copied.

1.2. Goal

The goal is to create a system that can be used for online recipe sites and provides an additional security layer with allergen detection and more information about detected allergens and the recipe style. The system should be simple to use and integrate into an existing system like an internet browser. In addition, custom filters that are triggered upon detection of certain keywords should be possible as well. Different machine learning classifiers will be evaluated for their performance and compared to the performance of humans. A user study that measures the performance of both humans and trained classifiers and a comparison between both conclude the paper.

1.3. Overview

The rest of this paper is structured as follows:

Section 2 outlines the related work in this field with related solutions. In addition, important findings from previous work that are used for this paper are discussed as well.

Section 3 presents a conceptual system architecture, discusses the requirements, and summarizes the methodological approach including the used technologies.

Section 4 discusses the used data sets and respective methodological approaches for (i) data preprocessing, (ii) filtering, (iii) up- and down-sampling, and (iv) classification.

Section 5 summarizes the proposed system as well as the trained models. They are evaluated by using metrics for multi-label and multi-class classification. Diagrams outline the difference between each trained model. Furthermore, the outcome of the user study is compared with the system’s performance.

Section 6 concludes the paper and describes future work and improvements that can be implemented to increase the performance and usability of the system.

2. Related Work

This section describes the related work for cuisine and allergen classification in the literature and personal findings. While allergen detection is the main focus of this paper, cuisine classification faces similar natural language processing problems. Thus, a good overview of cuisine classification papers also helps to solve the problems in allergen detection.

2.1. Cuisine Classification

Cuisine classification can be done in various ways; the most common are analysing an image to predict the dish and text-based approaches that use a recipe and its ingredients. Cuisine classification by recipe ingredients has been addressed many times in the machine learning literature. A good overview of the algorithms used, their results, and possible challenges can therefore be gained by reviewing these papers.

Support Vector Machines (SVM) have proven to be effective binary text classifiers [4]. While the paper describes good results for binary text classification, it did not look further into the multi-class problem, where each sample is assigned one single label. Following this approach, the cuisine classification problem can be converted into a multi-class classification problem. The authors Jason D. M. Rennie and Ryan Rifkin [5] tested this approach and show that by splitting the problem into a binary classification for each class, SVM can be utilised to perform multi-class text classification in conjunction with Error Correcting Output Coding (ECOC). They also show that SVM outperforms Naive Bayes [6] when it comes to multi-class text classification. The ECOC approach describes how several different binary classifiers are learned and how the output of those classifiers is used to predict the label for a specific sample. By splitting the problem into multiple binary classifiers, several different machine learning algorithms can be used on the data, such as Naive Bayes, Logistic Regression [7] or SVM, as mentioned above. The downside of this approach is that it does not take the correlation between classes into account, as each classifier is trained independently. As stated by Rennie and Rifkin [5], the performance of an ECOC classifier is affected by the independence of the binary classifiers, the binary classifier performance, and the loss function. They also suggest that binary classifiers can be better judged with a receiver operating characteristic (ROC) breakeven measure, which they define as follows: “We define the ROC breakeven as the average of the miss and false alarm rates at the point where the difference between false alarm rate and the miss rate is minimum” [5]. This allows them to evaluate the strength of a classifier better when the distribution of classes is uneven.
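The quoted definition of the ROC breakeven can be illustrated with a short sketch. It is not taken from Rennie and Rifkin [5]; the function name `roc_breakeven` and the choice to sweep every observed score as a decision threshold are assumptions made for illustration.

```python
def roc_breakeven(scores, labels):
    """Average of miss rate and false alarm rate at the threshold where
    the two rates are closest (labels: 1 = positive, 0 = negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    best_gap, best_value = None, None
    # Assumption: candidate thresholds are all observed scores plus one value
    # above the maximum, so "predict nothing as positive" is also covered.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for threshold in thresholds:
        predicted = [score >= threshold for score in scores]
        misses = sum(1 for p, y in zip(predicted, labels) if y == 1 and not p)
        false_alarms = sum(1 for p, y in zip(predicted, labels) if y == 0 and p)
        miss_rate = misses / positives
        false_alarm_rate = false_alarms / negatives
        gap = abs(false_alarm_rate - miss_rate)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_value = (miss_rate + false_alarm_rate) / 2
    return best_value


# Toy scores of one binary classifier; a lower breakeven means a stronger classifier.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(roc_breakeven(scores, labels))
```

Because both rates are normalised by the size of their own class, the measure does not reward a classifier for simply predicting the majority class.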

In “Cuisine classification using recipe’s ingredients”, Kalajdziski et al. [8] try different approaches for ingredients preprocessing. They create different sets for each approach. The Levenshtein distance algorithm [9], outlier analysis, and part of speech tagging (POS) [10] are used on the ingredients to create these different sets. The Levenshtein distance measures how similar two ingredient names are and is used to filter out ingredients that are essentially the same. Example of a pair: “soy sauce = soysauce”. The set with the Levenshtein distance algorithm applied showed better results than most other sets when tested with different classifiers. For the feature selection, TF-IDF [11] (term frequency-inverse document frequency), a bag of words with the 3500 most frequent ingredients, and a bag of words per class (450 most frequent) are applied to the ingredients. Their tests show that the SVM classifier with TF-IDF and the Levenshtein distance algorithm applied produced the best results. The paper also reveals the importance of preprocessing, as text from public sources is likely to contain noise. Other factors such as common cooking terms in ingredient texts could also contribute to extra noise in a recipe. However, Teng et al. [12] found that there is a regional preference for applying a specific heating method. In theory, these cooking terms could be used as a feature for model training. The paper shows that 5.8% of their classified recipes have heating terms that correspond to one of five US regions. By itself, 5.8% is quite low, but analysing more cooking terms could lead to increased classifier performance.
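As an illustration of how an edit distance can merge ingredient names that are essentially the same, the following sketch implements the standard Levenshtein distance and a simple merging step. The helper `merge_near_duplicates` and the threshold of one edit are assumptions for this example and are not the procedure of Kalajdziski et al. [8].

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def merge_near_duplicates(ingredients, max_distance=1):
    """Map every ingredient to the first previously seen ingredient that is
    at most max_distance edits away (hypothetical helper)."""
    canonical = []
    mapping = {}
    for name in ingredients:
        match = next((c for c in canonical
                      if levenshtein(name, c) <= max_distance), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


print(levenshtein("soy sauce", "soysauce"))
print(merge_near_duplicates(["soy sauce", "soysauce", "olive oil"]))
```

A fixed threshold is a simplification: short names such as “salt” and “malt” are also only one edit apart, so a practical filter would probably need a length-dependent threshold or a manual check.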

Another attempt at cuisine classification by Li, B., and Wang, M. [13] also involves building several feature sets and training different classification algorithms on them. The authors clean the data beforehand; digits, measurements, and punctuation are removed. They show that their feature set, in which low-content descriptive words (“fresh”, “natural”) are removed and keywords with identical meaning are combined (“fat”, “low fat”, “reduced fat”), performs better than the feature set with all original features. On the tested set, the Logistic Regression classifier achieved the highest performance of all tested classifiers. Comparable to the work by Kalajdziski et al. [8], their results also show that a better result can be achieved by filtering out similar words. However, their approach mainly focuses on clearing up inconsistencies and data reduction.

Machine learning has several challenges that affect the outcome of a trained model: over- and under-fitting, poor quality data, lack of good features, class imbalance, and many more. Han Su et al. discuss class imbalance in their paper “Automatic Recipe Cuisine Classification by Ingredients” [14] and use up- and down-sampling to equalise the number of recipes. While the initial recipe count from “food.com” numbered over 226,000, the authors reduced the more than 70 cuisines to 6 main cuisines since the number of recipes per cuisine was skewed. After up- and down-sampling, each category contained 1001 recipes. They found that their SVM classifier tends to classify Japanese recipes as Chinese cuisine. They also show that several ingredients that are used in Chinese recipes are also present in Japanese cuisine, hence the low classification accuracy score. Their results, however, do not show if there is a significant accuracy improvement when using up- and down-sampling. Lane et al. [15] show that balancing the distribution of different classes in a dataset can be beneficial for some machine learning approaches. However, algorithms that take balance into account can be adversely affected. While the paper is focused on favourability analysis, the authors report that balancing the training data by undersampling the majority class is an effective strategy, though they also mention that this varies by dataset and that preliminary experiments are needed beforehand.
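Up- and down-sampling to a fixed class size can be sketched as follows. Su et al. [14] do not describe their exact procedure, so random sampling without replacement (down-sampling) and random duplication (up-sampling) are assumptions here, as is the function name.

```python
import random
from collections import defaultdict


def resample_to_fixed_size(recipes, target_per_class, seed=0):
    """recipes: list of (cuisine, ingredients) pairs.
    Returns a list with exactly target_per_class recipes per cuisine."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for cuisine, ingredients in recipes:
        by_class[cuisine].append((cuisine, ingredients))

    balanced = []
    for cuisine in sorted(by_class):
        items = by_class[cuisine]
        if len(items) >= target_per_class:
            # Down-sampling: draw without replacement.
            balanced.extend(rng.sample(items, target_per_class))
        else:
            # Up-sampling: keep all originals and add random duplicates.
            missing = target_per_class - len(items)
            balanced.extend(items + [rng.choice(items) for _ in range(missing)])
    return balanced


toy = [("italian", ["pasta"])] * 5 + [("thai", ["lemongrass"])] * 2
print(len(resample_to_fixed_size(toy, target_per_class=3)))
```

Duplicated recipes should only ever end up in the training split; otherwise copies of the same recipe in train and test set would inflate the measured accuracy.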

2.2. Allergen Classification

The cuisine classification problem can be solved by following multi-class classification approaches. Allergen classification, however, deals with a multi-label classification problem as each sample can contain multiple allergens. While there are not many papers dealing with allergen detection in texts, there are several papers that discuss the multi-label classification for texts.

In a recent paper [16], the authors present an approach for multi-label classification of restricted foods in recipes. The 10 proposed class labels are not directly related to allergens, though some class names group foods with allergens together. The authors use an existing dataset (https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions, accessed on 27 February 2022), and while they do not mention preprocessing recipes, a list of 6734 different ingredients is compiled. Each word is then considered a feature and used to predict the different classes. Their proposed multi-label classification approach consists of different Random Forest Classifiers for each of the classes with a One-vs-Rest method [17]. For the evaluation, the Random Forest Classifiers are tested with a different amount of ingredients using a Bag of Word approach. Their results show that after 2700 ingredients (most frequent features), precision and recall did not improve significantly anymore. Using the Bag of Words approach, they found that by increasing the number of features, words relevant for specific classes have a bigger chance of being included in the feature set and thus improving the performance.
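The bag-of-words representation restricted to the most frequent ingredients can be sketched in a few lines. The function names and the toy recipes are illustrative assumptions; the classifier itself (one Random Forest per class in [16]) is left out.

```python
from collections import Counter


def build_vocabulary(recipes, max_features):
    """Return the max_features most frequent ingredients.
    Ties are broken alphabetically to keep the result deterministic."""
    counts = Counter(ing for ingredients in recipes for ing in ingredients)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:max_features]]


def vectorize(ingredients, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary entry occurs in the recipe."""
    present = set(ingredients)
    return [1 if name in present else 0 for name in vocabulary]


def binary_targets(label_sets, label):
    """One-vs-Rest target for a single class: 1 if the recipe carries the label."""
    return [1 if label in labels else 0 for labels in label_sets]


recipes = [["milk", "flour"], ["milk", "egg"], ["milk", "flour", "salt"]]
vocabulary = build_vocabulary(recipes, max_features=2)
print(vocabulary, vectorize(["flour", "salt"], vocabulary))
```

With a small `max_features`, a rare but decisive ingredient such as “salt” in this toy data is not part of the vocabulary, which mirrors the observation that a larger feature set gives class-relevant words a better chance of being included.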

For supervised learning, training models normally involves creating a train and a test set that are disjoint. The train set is used for model training, while the test set is used to validate the trained model on data that it has never seen before. To increase data utilisation or when little data is at hand, cross-validation [18] is used. Cross-validation splits the dataset (train set) into several subsets. For single-label classification, a stratified approach is used, where the split takes each class into account so that the proportion of each class in each subset is roughly equal to its proportion in the whole set. By employing this stratification approach, cross-validation shows improvement for both variance and bias [18]. Multi-labelled data, however, poses a problem for cross-validation: if the data is distributed randomly, subsets can end up lacking positive examples of a label or, even worse, not containing a label at all. This can result in problems for the evaluation and training of classifiers. Sechidis, K. et al. [19] propose a relaxed iterative stratification algorithm for multi-labelled data. In each iteration of the algorithm, the label with the fewest remaining entries is examined. “Rare” labels are chosen first, so they are not distributed in an undesirable way. More frequent labels are easier to distribute since more entries are available. For their tests, Random Sampling, Iterative Stratification, and Labelset-based Stratification were used. The results show that Random Sampling produced consistently worse results than the other two methods, and the authors even discourage the use of Random Sampling for multi-label experiments. Labelset-based Stratification performed better than Iterative Stratification. However, Iterative Stratification produced the fewest folds and fold-label pairs with zero positive entries. It also maintains the ratio of positive to negative entries for each label in the subsets.
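The idea of distributing the rarest label first can be sketched as below. This is a simplified, deterministic reading of the algorithm described by Sechidis et al. [19]: the original breaks ties randomly, whereas this sketch breaks them by fold index, and equally sized folds are assumed.

```python
def iterative_stratification(label_sets, n_folds):
    """Assign every sample (given as a set of labels) to one of n_folds folds.
    Returns a list with the fold index of each sample."""
    n_samples = len(label_sets)
    labels = sorted({label for s in label_sets for label in s})
    # Desired number of samples per fold, overall and per label.
    desired_total = [n_samples / n_folds] * n_folds
    desired = {
        label: [sum(label in s for s in label_sets) / n_folds] * n_folds
        for label in labels
    }
    assignment = [None] * n_samples
    remaining = set(range(n_samples))

    def place(index, fold):
        assignment[index] = fold
        remaining.discard(index)
        desired_total[fold] -= 1
        for label in label_sets[index]:
            desired[label][fold] -= 1

    while True:
        # Pick the label with the fewest (but at least one) remaining samples.
        counts = {label: sum(label in label_sets[i] for i in remaining)
                  for label in labels}
        candidates = [label for label in labels if counts[label] > 0]
        if not candidates:
            break
        rarest = min(candidates, key=lambda label: (counts[label], label))
        for index in sorted(i for i in remaining if rarest in label_sets[i]):
            # Prefer the fold that still needs this label most, then the fold
            # that needs the most samples overall, then the lowest fold index.
            fold = max(range(n_folds),
                       key=lambda f: (desired[rarest][f], desired_total[f], -f))
            place(index, fold)

    # Samples without any label fill up the folds that are still short.
    for index in sorted(remaining):
        fold = max(range(n_folds), key=lambda f: (desired_total[f], -f))
        place(index, fold)
    return assignment


samples = [{"milk"}, {"milk"}, {"milk", "nuts"}, {"nuts"},
           {"milk"}, {"milk"}, set(), set()]
print(iterative_stratification(samples, n_folds=2))
```

In the toy data, the two samples carrying the rare label “nuts” are placed first and therefore end up in different folds, which a purely random split would not guarantee.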

Since multi-label text classification and stratification have been covered, allergen classification research should also be included. Alemany-Bordera et al. [20] show the use of a bargaining agents-based system to classify allergens in recipes. The authors use the United States Department of Agriculture (USDA) nutrient database [21] for additional information on ingredients in a recipe. The collaborative multi-expert system consists of several trained models. Each agent has several different binary classification models, each trained for a different allergen. Additionally, each agent receives a dictionary of keywords with ingredients that contain allergens. This helps to improve the quality of the predictions. The same machine learning technique has been used to train the models. Still, each expert agent has a different model for each allergy, resulting in possibly different predictions from each agent. When a new ingredient is sent to the system, each agent uses the trained models to predict if it belongs to a certain class of allergen. The results of each agent are forwarded to another agent called the “decision-maker agent”. This last agent implements a voting system that uses the majority of votes for each class to produce a final classification. Therefore, this voting system is robust against single misclassification errors but cannot rectify the outcome if all agents produce wrong results. Their results show that the true positive rate (the probability that an actual positive class is classified as positive) improved for six out of seven allergen classes. However, poorly performing individual agents can also bring down the overall accuracy, which can be seen in their egg allergen classification results. Here, one expert has a higher true positive rate than the multi-agent system, which is not the case for all other allergen classes. Their results show that the multi-agent voting-based system improves the overall accuracy and increases the reliability of allergen detection for ingredients.
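The role of the decision-maker agent can be illustrated with a minimal majority vote. The data layout (one dictionary of boolean predictions per expert agent) and the rule that a tie counts as “not detected” are assumptions for this sketch, not details from Alemany-Bordera et al. [20].

```python
def decision_maker(agent_votes):
    """agent_votes: one dict per expert agent, mapping allergen -> bool.
    Returns the strict-majority decision for every allergen."""
    allergens = sorted({allergen for votes in agent_votes for allergen in votes})
    decision = {}
    for allergen in allergens:
        yes_votes = sum(1 for votes in agent_votes if votes.get(allergen, False))
        # Assumption: a tie is treated as "allergen not detected".
        decision[allergen] = yes_votes > len(agent_votes) / 2
    return decision


votes = [
    {"egg": True, "milk": False},
    {"egg": True, "milk": True},
    {"egg": False, "milk": False},
]
print(decision_maker(votes))
```

The single dissenting vote for each allergen is outvoted, which shows the robustness against individual errors; if all three agents had voted wrongly, the final decision would be wrong as well.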

Lastly, recommendation systems should also be mentioned. Although recommendation systems do not typically offer allergen detection, they present the user only with recipes they like. By learning the user’s dislikes for certain ingredients that contain allergens, a recipe recommendation system could be tuned to filter out recipes and only present allergen-free recipes. Additional functionality has to be added to the recommendation system, however. Otherwise, wrong ingredients could be flagged as being disliked by the user.

In their paper “User’s food preference extraction for personalised cooking recipe recommendation” [22], Ueda et al. extract liked and disliked ingredients using the user’s browsing and cooking history. By analysing the recipes that the user has browsed through but not cooked, scores for disliked ingredients are generated. While the authors state that the precision of extracted disliked ingredients was rather low at 14.7%, the extraction of liked ingredients reached up to 83% precision. Their approach suffers from the assumption that a user dislikes all ingredients of a recipe if they do not cook it. While the frequency of those ingredients is factored in, a potential user could avoid certain recipes simply because they are not in the mood for that specific type of recipe. A different approach from Freyne et al. [23] establishes a relationship between a recipe and its food items. Participating users rated different foods and recipes with: “hate, dislike, neutral, like, love”. One strategy involves a breakdown of each recipe rated by the user and assigning its food items a score. The system then scores a recipe based on the average of all food item ratings of the recipe. The score of the recipe is then used for the recommendation system. While this system does not guarantee that recipes with allergens are never recommended, food items (possible allergens) and recipes with a low score are unlikely to get suggested.
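The averaging strategy can be written down in a few lines. The numeric values assigned to the five rating words and the decision to ignore unrated food items are assumptions for illustration; Freyne et al. [23] may use a different scale.

```python
# Assumed numeric scale for the five rating words.
RATING_VALUES = {"hate": -2, "dislike": -1, "neutral": 0, "like": 1, "love": 2}


def recipe_score(food_items, item_ratings):
    """Average rating of the food items of a recipe.
    Unrated items are ignored; returns None if no item has a rating."""
    rated = [RATING_VALUES[item_ratings[item]]
             for item in food_items if item in item_ratings]
    if not rated:
        return None
    return sum(rated) / len(rated)


ratings = {"egg": "hate", "flour": "like"}
print(recipe_score(["egg", "flour", "milk"], ratings))
```

The example also shows why such a score is no safety mechanism: a strongly disliked item is partly compensated by liked items, so a recipe containing an allergen can still receive a moderate score.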

2.3. Findings

2.3.1. Misclassification for Similar Cuisines

As mentioned by Han Su et al. [14] in their work, Japanese cuisine tends to get classified as Chinese cuisine with their SVM classifier. Several ingredients tend to be similar in certain regions, making it harder for a classifier to distinguish between classes. Thus, a lower accuracy score for classes with similar ingredients for Asian or European countries is to be expected. A confusion matrix will clarify the results in the evaluation part for each classifier in this paper.

2.3.2. One-Vs-Rest and Feature Amount

Britto et al. [16] use a One-vs-Rest approach combined with a Random Forest classifier. Their results are quite promising, but more importantly, their tests show that beyond a certain number of features, 2700 in their experiment, the precision and recall metrics showed no further significant improvement. Having a great number of features allows words that are relevant for a class to appear, but comes with the downside that other words that are not as relevant might be included. Additionally, more features also mean that there is more for the classifier to learn, possibly increasing the time needed for training. For better evaluation results, a good balance between relevant words and the number of features should be established.

2.3.3. Noise in Public Recipes

As briefly mentioned above, concerning the paper by Kalajdziski et al. [8], preprocessing texts is crucial for model training and results. Public texts are prone to contain errors as some are user-generated, and not all authorities ensure a proper quality process to correct possible mistakes. However, not all texts have to contain errors in order for them to be considered to contain noise. Synonyms, abbreviations, adjectives, and stopwords can increase the noise in a text. Recipes are no exception as they are also user-created; preprocessing data for model training, in this case, is especially important.

Example of a recipe for “Whole Wheat and Honey Pizza Dough” (https://www.allrecipes.com/recipe/24372/whole-wheat-and-honey-pizza-dough/print, accessed on 27 February 2022): 1 (.25 ounce) package active dry yeast, 1 cup warm water, 2 cups whole wheat flour, 1/4 cup wheat germ, 1 teaspoon salt, 1 tablespoon honey.

The provided recipe contains special characters, numbers, cooking terms, and ingredients. The specific cooking terms could be country-specific or even the chef’s preference. In any case, cooking and measurement terms can introduce misclassification errors. As an example, the term “cup” could be favoured in Italy, which would make it a good feature for classification. If the term were used in other countries as well, however, it would no longer indicate a certain country and should not be learned by the classifier.
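A minimal cleaning step for the recipe above could look like the following sketch. The two word lists are hypothetical and deliberately short; they only cover the terms of this one example and are not the lists used for the proposed system.

```python
import re

# Hypothetical word lists that only cover the example recipe.
MEASUREMENT_TERMS = {"ounce", "package", "cup", "cups", "teaspoon", "tablespoon"}
DESCRIPTIVE_TERMS = {"active", "dry", "warm"}


def clean_ingredient(line):
    """Lower-case the text, drop digits and special characters, then remove
    measurement and descriptive terms."""
    line = re.sub(r"[^a-z\s]", " ", line.lower())
    words = [word for word in line.split()
             if word not in MEASUREMENT_TERMS and word not in DESCRIPTIVE_TERMS]
    return " ".join(words)


recipe = ("1 (.25 ounce) package active dry yeast, 1 cup warm water, "
          "2 cups whole wheat flour, 1/4 cup wheat germ, 1 teaspoon salt, "
          "1 tablespoon honey")
cleaned = [clean_ingredient(part) for part in recipe.split(",")]
print(cleaned)
```

The regular expression keeps only the letters a to z, so it would also strip accented characters; for recipes in other languages, a Unicode-aware pattern would be needed.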

2.3.4. Class Imbalance

Machine learning has several challenges that affect the outcome of a trained model: over- and under-fitting, poor quality data, lack of good features, and class imbalance. Class imbalance certainly plays an important role and should not be overlooked. Su et al. [14] use up- and down-sampling to improve the skewness of their data but do not mention if this improved their overall accuracy. Lane et al. [15] mention in their paper for favourability analysis that a balancing of the majority class in a dataset can benefit the learning process, though the effectiveness depends on the classifier used. Classifiers that take balance into account may be affected negatively.

2.3.5. Stratification of Multi-Labelled Data

Stratification for multi-labelled data is important for model training. When cross-validation or the split into train and test set results in having no entries for a specific label, the trained model may not recognise the label at all, as it has never seen a positive example of that label. The results from Sechidis et al. [19] show that Iterative Stratification performs better than Random Sampling. Iterative Stratification is used in this paper as well to reduce possible classification errors.

2.4. Related Solutions

Two food recommendation systems that factor in allergens are Spoon Guru (https://www.spoon.guru, accessed on 27 February 2022) and Foodmaestro (https://www.foodmaestro.me, accessed on 27 February 2022). Spoon Guru uses AI-based personalised nutrition algorithms to scan and analyse product and ingredient information for individual shoppers. Instead of offering products, they offer information and customer touchpoints, so it is more tailored to businesses rather than individuals. Foodmaestro uses brand-approved product information to structure and standardise information. Dieticians then approve the extracted and labelled information from trusted health organisations. They offer solutions for businesses and individual consumers by creating a profile that can include lifestyle restrictions and allergies. The app also features a scanner for products and the possibility to search for products, filtering out products according to the specified restrictions.

2.5. Proposed System and Existing Solutions Distinction

Food recommendation systems actively prevent the user from seeing a recipe that contains certain disliked ingredients. By evaluating the preferences of the user or using extra information like the browsing history [22], information about the user’s diet and preferences is saved and taken into account for future recommendations. The proposed system, in contrast, can scan any recipe, recommended or not, and display results about allergens for the user to see. Thus, the proposed system adds another layer of security and information when browsing for recipes, even if a specific provider’s personalised recommendation system is not used. While existing food recommendation solutions offer a variety of different recipes, the overall number of recipes is still limited, and some types of recipes might not even be listed in the provider’s recipe database. Because it works on any recipe page, the proposed system is more flexible: the user can explore other recipe sites online without being tied to a single recipe recommendation provider.

3. System Concept and Methodology

The requirements for our proposed system are the following: (i) style classification of recipes, (ii) allergen classification with ingredients, (iii) warnings for custom ingredients, (iv) architecture that allows extensions regardless of platform, (v) browser extension for common internet browsers.

The proposed system should be able to detect style and allergens from a recipe using its ingredients. A browser extension should visualise the results along with the confidence for the multi-class classification (style of a recipe) and a list of predictions with confidence for each allergen for the multi-label classification (allergens in a recipe). Additionally, the user should be able to specify a custom list of different ingredients that should be considered when analysing. Custom ingredients should trigger a different warning than the allergen detection alert.
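The information that the browser extension has to display can be summarised in a small data structure. Everything in this sketch is hypothetical: the class `RecipeAnalysis`, the substring matching for custom ingredients, and the confidence threshold of 0.5 are illustrative assumptions, not part of the implemented system.

```python
from dataclasses import dataclass, field


@dataclass
class RecipeAnalysis:
    style: str                      # result of the multi-class classification
    style_confidence: float
    allergens: dict                 # allergen name -> confidence (multi-label)
    custom_warnings: list = field(default_factory=list)


def find_custom_ingredients(ingredients, custom_list):
    """Return the user-defined ingredients that occur in the recipe
    (case-insensitive substring match)."""
    lowered = [ingredient.lower() for ingredient in ingredients]
    return [custom for custom in custom_list
            if any(custom.lower() in ingredient for ingredient in lowered)]


def build_messages(analysis, allergen_threshold=0.5):
    """Create separate message kinds for allergen alerts and custom warnings."""
    messages = []
    for allergen, confidence in sorted(analysis.allergens.items()):
        if confidence >= allergen_threshold:
            messages.append(("allergen", f"{allergen} ({confidence:.0%})"))
    for ingredient in analysis.custom_warnings:
        messages.append(("custom", ingredient))
    return messages


analysis = RecipeAnalysis(
    style="italian",
    style_confidence=0.8,
    allergens={"milk": 0.91, "fish": 0.1},
    custom_warnings=find_custom_ingredients(["Fresh Coriander", "olive oil"],
                                            ["coriander"]),
)
print(build_messages(analysis))
```

Keeping the message kind (“allergen” or “custom”) separate from the text allows the user interface to render custom ingredient warnings differently from allergen alerts, as required above.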

4. Data Acquisition

In 2014, Kaggle (https://www.kaggle.com, accessed on 27 February 2022) posted a challenge for cuisine classification. The attached dataset contains recipes with ingredients and the cuisine origin. Yummly (https://www.yummly.com, accessed on 27 February 2022) originally provided the data. The challenge consists of predicting the category of a dish by using only its ingredients. The available data is provided in the form of JSON files and includes a number of different cuisines with their respective ingredients.

The Open Food Facts dataset [24] contains a large number of different products (1,540,060 at the time of writing) and is maintained by a large community of volunteers. The dataset is freely available and can be downloaded in several formats such as “CSV”. Other options include a live JSON API (JavaScript object notation application program interface), which can be used to gather information about a scanned product. Training a classifier requires a complete dataset beforehand, which rules out the usage of the API. The “CSV” format is used in this paper as it is a simple data format that can be processed quickly.

4.1. Kaggle Dataset

The Kaggle dataset is kept simple as only a few columns exist: “id”, “cuisine”, and “ingredients”. A few example entries can be seen in Table 1. The “id” column will not be used as there is no valuable information for training a classifier. Upon further inspection, the “ingredients” column has already been preprocessed beforehand, as no cooking terms or measurement terms appear in the text.

Analysing the class distribution (Figure 1) reveals that the “Italian” cuisine has more entries than all other classes, and the general distribution between classes with a low amount of samples and classes with a high sample count appears to be quite unbalanced. Furthermore, two other classes, “Mexican” and “southern us”, also have quite a large amount of entries. The classes with the least entries are “Brazilian”, “Jamaican”, and “Russian”. This distribution could hint at a class imbalance problem that might interfere with the classifier training later on. However, as the other classes are more balanced, the class imbalance could be a problem that does not need to be addressed. More importantly, the majority of classes show quite a low sample count, ranging from roughly 1000 to 3000. Depending on the classifier used, the amount of data might not be enough to train an accurate classifier.
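Counting the samples per class is a quick way to check for such an imbalance. The records below are made up and only follow the column layout of the Kaggle dataset; the ratio between the largest and the smallest class is used here as a simple, assumed indicator.

```python
import json
from collections import Counter

# Made-up records in the layout "id", "cuisine", "ingredients".
SAMPLE_JSON = """
[
  {"id": 1, "cuisine": "italian", "ingredients": ["olive oil", "garlic", "tomatoes"]},
  {"id": 2, "cuisine": "italian", "ingredients": ["flour", "eggs"]},
  {"id": 3, "cuisine": "italian", "ingredients": ["basil", "olive oil"]},
  {"id": 4, "cuisine": "brazilian", "ingredients": ["lime", "onion"]}
]
"""


def class_distribution(json_text):
    """Number of recipes per cuisine; the "id" column is ignored."""
    records = json.loads(json_text)
    return Counter(record["cuisine"] for record in records)


def imbalance_ratio(distribution):
    """Size of the largest class divided by the size of the smallest class."""
    return max(distribution.values()) / min(distribution.values())


distribution = class_distribution(SAMPLE_JSON)
print(distribution.most_common(), imbalance_ratio(distribution))
```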

Looking at the top 40 words for the “Brazilian” cuisine in Figure 2 reinforces the assumption that the provided dataset has already been preprocessed, and numbers and other characters have already been filtered out. When comparing the top 40 ingredients between the “Brazilian” and the “Chinese” class (Figure 2 and Figure 3), both the ingredients themselves and their ranking differ. The assumption for the cuisine classification problem is that there are enough distinctions in the ingredient usage between different styles that the cuisine can be inferred from the ingredients of the recipe. The initial analysis of Figure 2 and Figure 3 seems to confirm this hypothesis. The ingredient “onion” appears in both cuisines, but in the “Brazilian” cuisine class, the ingredient ranks third while it ranks twentieth in the other class. Other ingredients like “lime” are not even in the top twenty ingredients of the other class.
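A ranking of this kind can be produced with a simple counter per cuisine. The sketch below uses four invented recipes and a hypothetical helper `top_ingredients`; it only illustrates the counting, not the actual figures.

```python
from collections import Counter

# Invented (cuisine, ingredients) pairs, for illustration only.
recipes = [
    ("brazilian", ["onion", "lime", "garlic"]),
    ("brazilian", ["onion", "salt", "lime"]),
    ("chinese", ["soy sauce", "garlic", "onion"]),
    ("chinese", ["soy sauce", "garlic", "ginger"]),
]


def top_ingredients(rows, cuisine, n=40):
    """Return the n most frequent ingredients of one cuisine with their counts."""
    counts = Counter()
    for label, ingredients in rows:
        if label == cuisine:
            counts.update(ingredients)
    return counts.most_common(n)


brazilian_top = top_ingredients(recipes, "brazilian", n=2)
chinese_top = top_ingredients(recipes, "chinese", n=2)
```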

In addition to the differences in ingredient usage, both classes also have a high number of ingredients that consist of more than one word, like “olive oil”. Instead of using single words to identify the relevancy to a certain cuisine class, a better option could be to create a sequence of words which is then used as a feature. Ingredients with more than two words are rarer than ingredients with one or two words, so the focus lies on single words and word pairs. Also important to note here is the difference in ingredient occurrence between the two classes. While “garlic” occurs around 180 times in the “Brazilian” class, the same ingredient appears over 1400 times in the “Chinese” cuisine class even though both are in the top five most used ingredients. This could be due to different ingredient diversity between classes or due to the fact that the “Brazilian” class has fewer samples than the “Chinese” class. Figure 1 seems to support the latter assumption.

4.2. Open Food Facts Dataset

The Open Food Facts dataset is maintained by a community, and as such, preprocessing is a significant first step before models are trained, as data is very likely to contain more than just pure ingredients. The initial dataset contains 182 columns. Most of them contain “nan” values, which means that they are unlikely to be useful as features, as they are rather inconsistent. The only columns that are going to be used are: “product_name”, “countries_en”, “ingredients_text” and “allergens”.

The country distribution shows that there is quite a difference between the contributing countries. France, the United States, and Spain make up a great portion of entries. Furthermore, we can identify that the dataset contains over 38 different countries, with most countries having 1000 entries. Since Open Food Facts does not enforce a single language, each country has ingredients that are not listed in English.

This also means that entries from countries where English is not the native language have to be removed, as it will be hard to distinguish between English and non-English ingredients. Even though France has the most contributions, the majority of its listed ingredients are in French and are likely to skew evaluations for the trained models. The word “nan” signifies the absence of data, meaning there are many entries with unusable data that have to be filtered out. Furthermore, words like “de cacao” or “en poudre” reveal that there are non-English words in the dataset as well. Considering that the top contributor by country is France, this result was to be expected.
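The column selection and filtering described above can be sketched with pandas as follows. The four rows, the extra `energy_100g` column, and the set `ENGLISH_COUNTRIES` are illustrative assumptions; only the four retained column names come from the text.

```python
import pandas as pd

# Invented rows standing in for the Open Food Facts CSV export.
products = pd.DataFrame({
    "product_name": ["Peanut bar", "Chocolat noir", "Oat drink", "Mystery"],
    "countries_en": ["United States", "France", "United Kingdom", "United States"],
    "ingredients_text": ["peanuts, sugar", "pâte de cacao, sucre", "water, oats", None],
    "allergens": ["peanuts", "milk", "gluten", None],
    "energy_100g": [2100.0, None, 190.0, None],
})

KEEP = ["product_name", "countries_en", "ingredients_text", "allergens"]
ENGLISH_COUNTRIES = {"United States", "United Kingdom"}  # assumed selection

# Keep the four relevant columns and drop rows without ingredients or labels.
subset = products[KEEP].dropna(subset=["ingredients_text", "allergens"])

# Keep only entries from countries assumed to list ingredients in English.
subset = subset[subset["countries_en"].isin(ENGLISH_COUNTRIES)].reset_index(drop=True)
```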

Looking at the rest of the available data from the Open Food Facts dataset, we also have to decide on a multi-label vs a multi-class classification approach. This distinction is easier to make for the Kaggle dataset, as it is not important to label a recipe with multiple cuisine classes as the requirements are not that strict. For food allergens, however, it is an entirely different case. A recipe can contain multiple allergens and is not restricted to a single one. This means we do not have mutually exclusive labels or categories. A recipe with gluten could also contain milk, for example. For this reason, we have to use a multi-label classification [25] approach. Another problem that comes with this decision is that for the multi-label classification approach, the class imbalance cannot be solved by randomly adding samples to and removing samples from certain classes, which will be covered in a later section.
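In the multi-label setting, each recipe is represented by a binary indicator vector with one position per allergen instead of a single class value. A small sketch with Scikit-learn's `MultiLabelBinarizer` and invented label sets:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Invented allergen sets; an empty set is a recipe without any labelled allergen.
allergen_sets = [{"gluten", "milk"}, {"nuts"}, set(), {"milk"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(allergen_sets)

# Columns follow the sorted label names: gluten, milk, nuts.
classes = list(mlb.classes_)
```

The first row has two active positions, which a multi-class encoding with mutually exclusive labels could not express.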

4.3. Dataset Preprocessing

The Kaggle dataset has already been preprocessed by the creators of the dataset. However, in order to streamline the process, both datasets, Kaggle and Open Food Facts, will be processed in the same way. The Kaggle dataset has no outliers or undesired words but can benefit from lemmatization. The Open Food Facts dataset, on the other hand, has an abundance of content provided by the community and thus needs preprocessing. Since allergen and cuisine classification is based on the ingredients of a recipe, other unwanted characters and words have to be filtered. A recipe that is available on a website will be shown as an example of why preprocessing is needed.

Example of a recipe for “Whole Wheat and Honey Pizza Dough” (https://www.allrecipes.com/recipe/24372/whole-wheat-and-honey-pizza-dough/print, accessed on 27 February 2022): 1 (.25 ounce) package active dry yeast, 1 cup warm water, 2 cups whole wheat flour, 1/4 cup wheat germ, 1 teaspoon salt, 1 tablespoon honey.

Preprocessing for the dataset includes several steps:

Removal of non-alphanumeric characters
Stopword removal
POS tagging
Lemmatization of ingredients
Filtering
One-Hot-Encoding
Up- and down-sampling
Feature transformation
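The first steps of this list can be sketched in plain Python and applied to the pizza dough recipe above. The word lists `STOPWORDS` and `MEASURES` and the function `clean_ingredients` are hypothetical and far smaller than what a real pipeline would need; digits are dropped along with punctuation, and POS tagging and lemmatization are left out because they require an external language model.

```python
import re

# Tiny illustrative word lists; a real pipeline would use much larger ones.
STOPWORDS = {"and", "of", "the", "a"}
MEASURES = {"cup", "cups", "ounce", "package", "teaspoon", "tablespoon"}


def clean_ingredients(text):
    """Lowercase, strip digits and punctuation, then drop stop and measurement words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOPWORDS and t not in MEASURES]


recipe = (
    "1 (.25 ounce) package active dry yeast, 1 cup warm water, "
    "2 cups whole wheat flour, 1/4 cup wheat germ, 1 teaspoon salt, 1 tablespoon honey"
)
tokens = clean_ingredients(recipe)
```

Words such as “active” or “warm” survive this sketch; removing them is what the POS tagging and filtering steps are for.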

Due to a large amount of data and the nature of the preprocessing steps, processing the Open Food Facts dataset with all steps can take up to an hour or even more. Since training a classifier is also quite time-consuming, a few steps are done only once beforehand, while other steps are completed as part of training the classifier.

Up- and Down-Sampling

Up- and down-sampling in the multi-label case is more complicated than in the multi-class case. Adding a sample to or removing a sample from one class inevitably also adds or removes an entry for every other class the sample belongs to, as one sample can belong to more than one class.

After the under-sampling has been applied, the majority classes are reduced, which can be seen in Figure 4. Under-sampling reduces data, which means there is information loss. This can potentially be a problem for machine learning and the end result. Given the big difference between the majority and minority classes, under-sampling has been applied to reduce the skewedness of the data.

For oversampling, a label powerset transformation with random oversampling is applied. The original code from Siladittya Manna [26] is used and changed to operate with the Open Food Facts dataset. First, an extra column is added to the dataframe for the label powerset transformation of the original labels. Then for each power label, the distance or gap between the highest powerset label and the current one is calculated. Using the distance, random samples are selected for each power label and added to the dataframe. The overall dataset is not perfectly balanced even after up- and down-sampling. Figure 5 shows that after random oversampling, the minority classes have more samples and the overall distribution looks better than before. It also shows that the class with the lowest amount of samples is “lupin”. This raises concerns for model training later on, where the number of samples for this class might influence the accuracy of the trained classifiers.
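The oversampling steps above can be sketched as follows. This is not the code from [26]; it is a simplified reconstruction on a six-row invented dataframe, and the column name `powerlabel` is an assumption.

```python
import pandas as pd

# Invented multi-label data with two allergen columns.
df = pd.DataFrame({
    "milk":   [1, 1, 1, 0, 0, 1],
    "gluten": [0, 0, 0, 1, 1, 1],
})
labels = ["milk", "gluten"]

# Label powerset: every distinct label combination becomes one class, e.g. "10".
df["powerlabel"] = df[labels].astype(str).agg("".join, axis=1)
counts = df["powerlabel"].value_counts()
target = counts.max()

# For each combination, draw as many random copies as it lacks to reach the largest one.
parts = [df]
for powerlabel, count in counts.items():
    gap = target - count
    if gap > 0:
        pool = df[df["powerlabel"] == powerlabel]
        parts.append(pool.sample(n=gap, replace=True, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
```

Balancing the label combinations does not guarantee that the individual labels are balanced, which is consistent with the remaining imbalance noted above.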

4.4. Cuisine Classification

Since preprocessing and data analysis has been covered, creating classifiers is the next step. Training a cuisine classifier consists of the following steps:

Train test split;
Feature transformation;
Hyperparameter tuning;
Classifier evaluation.

4.4.1. Hyperparameter Tuning

For hyperparameter optimization [27], the class GridSearchCV is used. This class exhaustively generates models from a grid of parameter values, and each combination of parameters is evaluated with the best combination being kept. The evaluation for each combination depends on the specified scoring parameter. Depending on the problem at hand (classification, regression, clustering), different metrics should be used for this parameter. Some metrics are also better suited for multi-label or multi-class classification problems than others. Used parameters include: (i) cv and (ii) scoring.

The cv parameter determines the cross-validation splitting strategy. The KFold class is used as input and provides train and test indices to split the provided dataset. Each generated fold (also called set) is used for validation once, while the others (k-1) are used for training. As mentioned above, the scoring parameter is a strategy to evaluate the performance of a cross-validated model on a test set. For cuisine classification, accuracy is used. This scoring option computes the fraction of correct predictions for a sample and is equal to the Jaccard score [28] for binary and multi-class classification.
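A minimal sketch of this setup is shown below, assuming a TF-IDF feature transformation and a Logistic Regression classifier inside a pipeline. The toy texts, the parameter grid, and the number of folds are illustrative assumptions and not the values used for Table 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Twelve invented samples, alternating between two cuisines.
texts = ["soy sauce ginger garlic", "olive oil basil tomato"] * 6
labels = ["chinese", "italian"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every value of C is evaluated on each of the three folds.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=KFold(n_splits=3), scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
best_score = search.best_score_
```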

4.4.2. Machine Learning Classifier

A variety of different machine learning algorithms are used to compare different approaches. The following classifiers are used:

K-Nearest Neighbour (KNN) [29];
Logistic Regression [7];
Random Forest [30];
Decision Tree [10];
C-Support Vector Classifier (SVC) [10];
Linear Support Vector Classifier (LinearSVC) [31].
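The listed classifiers all share the Scikit-learn estimator interface, so they can be compared in one loop. The sketch below assumes a TF-IDF transformation and uses invented toy data and mostly default hyperparameters; it illustrates the comparison procedure only and says nothing about the reported scores.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

texts = ["soy sauce ginger garlic", "olive oil basil tomato"] * 6
labels = ["chinese", "italian"] * 6

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=20, random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
    "LinearSVC": LinearSVC(),
}

# Mean cross-validated accuracy per classifier.
scores = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(model, texts, labels, cv=KFold(n_splits=3), scoring="accuracy")
    scores[name] = fold_scores.mean()
```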

The evaluation of the cuisine classifiers will take place in the evaluation section later on, where both cuisine and allergen classifiers as well as the used metrics are described in more detail. For the evaluation, only the most promising models will be used. After training and evaluating the cross-validated models, the best parameters found within a custom range of parameters can be seen in Table 2. The resulting parameters are not fully optimized but should indicate a good starting point for future extensions.

4.5. Allergen Classification

The steps to build an allergen classification model are similar to the steps mentioned in cuisine classification but differ in the details and approaches. Training an allergen classifier consists of the following steps:

Up- and down-sampling;
Train test split;
Feature transformation;
Hyperparameter tuning;
Classifier evaluation.

For cuisine classification, up- and down-sampling is not needed as the evaluation indicates good results. For the Open Food Facts dataset, however, up- and down-sampling is needed, as the class imbalance as well as the number of classes negatively impact the training of classifiers. As for the train and test split step, parameters are retained from cuisine classification.

4.5.1. Hyperparameter Tuning

Similar to cuisine classification, the class GridSearchCV is used once more. Except for the scoring parameter, all others are kept the same. For allergen classification, different evaluation metrics have been chosen. In multi-class classification, the accuracy scoring value is the most common evaluation criterion [32]. In multi-label classification, however, the predictions can be fully correct, partially correct, or fully incorrect. For this reason, the accuracy scoring value is not used. Instead, a few other options are used in this article:

f1_samples
roc_auc
roc_auc_ovr_weighted

To see how model training is affected by changing the GridSearchCV scoring parameter, F1 score values are compared with ROC AUC scores when trained models are evaluated. The f1_samples as scoring value stems from the multiple options the F1 metric in Scikit-learn has (binary, micro, macro, weighted, samples). The metric is calculated for each instance and their average is computed, which can be used for multi-label classification. As for the roc_auc metric, the OneVsRestClassifier [17] supports both multi-class and multi-label classification. For multi-label classification, the roc_auc_score function is extended by averaging over all labels.
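The sample-wise F1 average can be checked on a small example. In the invented data below, the first sample has one of its two true labels predicted (F1 of 2/3) and the second is fully correct (F1 of 1), so the average is 5/6.

```python
from sklearn.metrics import f1_score

# Rows are samples, columns are labels (multi-label indicator format).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]

# average="samples" computes F1 per sample and then averages over samples.
f1_samples = f1_score(y_true, y_pred, average="samples")
```

A partially correct prediction thus still contributes to the score, which plain accuracy on whole label sets would not allow.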

4.5.2. Machine Learning Classifier

For allergen classification, the number of classifiers is reduced compared to cuisine classification. The reduction also stems from the learnings from cuisine classification, as not every classifier works well for text classification problems. The following classifiers are used:

Logistic Regression [7];
Random Forest [30];
Decision Tree [10];
Multilayer Perceptron Classifier (MLP) [33].

The evaluation of the allergen classifiers will take place in the evaluation section later on, where both cuisine and allergen classifiers as well as the used metrics are described in more detail. Different from cuisine classification, where each classifier is used and trained on its own, the One-vs-Rest [17] strategy is used in this case, which consists of fitting one classifier per class. While this strategy is intended for multi-class problems, it can also be used for multi-label classification. After training and evaluating the cross-validated models, the best parameters found within a custom range of parameters can be seen in Table 3. The resulting parameters are not fully optimized but should indicate a good starting point for future extensions.
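A minimal sketch of the One-vs-Rest strategy for multi-label data is given below, assuming a TF-IDF transformation and Logistic Regression as the base estimator. The six texts and their allergen sets are invented, and the value of `C` is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["wheat flour milk powder", "whole milk", "wheat flour",
         "roasted peanut", "wheat bread milk", "peanut oil"]
allergens = [{"gluten", "milk"}, {"milk"}, {"gluten"},
             {"peanuts"}, {"gluten", "milk"}, {"peanuts"}]

# Turn the label sets into a binary indicator matrix, one column per allergen.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(allergens)

# One binary Logistic Regression model is fitted per allergen column.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(C=10.0, max_iter=1000)),
)
model.fit(texts, Y)

prediction = model.predict(["wheat flour milk"])
predicted_sets = mlb.inverse_transform(prediction)
```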

5. Evaluation

In this section, the proposed system and the models are evaluated using different metrics for multi-label and multi-class classification. In addition, the results of the user study are compared to the trained classifier. At the end of the section, the usage of the system with its results and the shortcomings of the user study are presented.

5.1. Evaluation of Classifiers

There are various metrics that can be used to evaluate the trained classifiers. Multi-label and multi-class classification have different evaluation metrics (although some can be used in both cases), and some metrics can not be used interchangeably. To evaluate the different classifiers, different key metrics for the multi-label and multi-class problem are chosen. For both problems, GridSearchCV from Scikit-learn is used to not only find good parameters for each classifier, but also to evaluate the created models. The evaluation with GridSearchCV happens with a held back test set within the provided training samples using the KFold class. The rest of the evaluation is conducted with another test set of samples that are not present in the training set used for the GridSearchCV class. One of the metrics that can be used in both cases is the F1 score, as there are some multi-label F-measures too [34]. The F1 metric in Scikit-learn [35] offers the following options (unused options omitted):

Micro;
Macro;
Weighted.

The main difference between these options is how precision and recall are calculated across the labels and if their proportion is taken into account. While label imbalance is not accounted for with the macro option, it is preferred over the micro option for unbalanced datasets. The weighted option takes class imbalance into account. For the cuisine classification, the weighted and the macro options are important. All three options are used for allergen classification, although the micro option is not used as a primary key evaluation factor.
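The difference between the three options can be seen on a small imbalanced example with invented labels. Class 0 has four samples and an F1 of 8/9, class 1 has two samples and an F1 of 2/3, which yields a macro score of 7/9, a weighted score of 22/27, and a micro score of 5/6.

```python
from sklearn.metrics import f1_score

# Four samples of class 0, two of class 1; one class-1 sample is misclassified.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

f1_micro = f1_score(y_true, y_pred, average="micro")        # pools all decisions
f1_macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class size
```

The macro score is the lowest because the poorly predicted minority class counts as much as the majority class.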

5.2. Cuisine Classification

To evaluate the trained cuisine classifiers, four key metrics have been chosen:

F1 score macro;
F1 score weighted;
Accuracy score [35];
Cohen Kappa score [36].

The F1 macro and weighted scores have been chosen to indicate how well a classifier performs with and without taking the proportion of each label in the dataset into account. Since the F1 macro score is not weighted according to the number of true instances for each label, having a low F1 macro score shows that there might be several classes for which predictions are not very accurate. Figure 6 shows the different metrics for each trained classifier.

To take a closer look at why the F1 macro score is always lower than the F1 weighted score and accuracy, a confusion matrix for the LinearSVC classifier is created. Figure 7 shows the predicted and true class labels. The strongly coloured diagonal line shows that the classifier correctly classifies most cuisines. Some of the misclassification problems occur with the classes “British”, “Southern US”, “French”, and “Italian”. This is likely due to regional similarities as well as the preference in ingredient usage. Comparing the results for each class to the class distributions, a lack of samples seems to correspond with low scores, although the “Jamaican” class indicates otherwise, which could also be an outlier.
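A confusion matrix and the Cohen Kappa score can be computed as sketched below. The six labels are invented; with an observed agreement of 5/6 and a chance agreement of 11/18, the Kappa score is 4/7.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = ["italian"] * 4 + ["brazilian"] * 2
y_pred = ["italian"] * 4 + ["italian", "brazilian"]

# Rows are true classes, columns are predicted classes, in the given label order.
matrix = confusion_matrix(y_true, y_pred, labels=["italian", "brazilian"])

# Agreement between truth and prediction, corrected for chance agreement.
kappa = cohen_kappa_score(y_true, y_pred)
```

The off-diagonal entry shows one “brazilian” recipe predicted as “italian”, the kind of confusion that lowers the macro score of a small class.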

5.3. Allergen Classification

To evaluate the trained allergen classifiers, three key metrics have been chosen:

ROC AUC [37];
F1 score;
Accuracy score.

The initial scores for the trained allergen classifiers show promising results (Figure 8). The scores represent the trained classifiers without downsampling. While the overall weighted ROC AUC scores look good, the gap between the test and train set scores and the high F1 and ROC AUC scores for MLP hint at overfitting or another underlying problem. The subset accuracy, on the other hand, shows that not all labels for a given sample are entirely correct, which serves as counter-evidence against overfitting.
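The relation between the two metrics can be illustrated on invented scores: a ranking metric such as ROC AUC can be perfect while the subset accuracy, which requires every label of a sample to be correct, is lower. The threshold of 0.5 is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented true labels and predicted scores for two allergen labels.
Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.4, 0.8], [0.7, 0.4], [0.1, 0.3]])
Y_pred = (Y_score >= 0.5).astype(int)

# On multi-label indicator input, accuracy_score is the subset accuracy.
subset_accuracy = accuracy_score(Y_true, Y_pred)

# Per-label ROC AUC, averaged with weights proportional to label frequency.
weighted_auc = roc_auc_score(Y_true, Y_score, average="weighted")
```

Here every positive sample is ranked above every negative one for both labels, yet the third sample misses one of its two labels after thresholding.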

Even though the classifiers with up- and downsampling achieve better subset accuracy, the distance between train and test ROC AUC scores is higher compared to classifiers only trained with upsampling, and thus the models with upsampling applied are chosen. To rule out problems with TF-IDF, different values for the max_features parameter are tested. A closer look at different TF-IDF settings in Figure 9 reveals a downward trend for accuracy when fewer features are used. A good value for this parameter seems to be around 2000. Higher values also mean that more words are considered for a prediction, which would increase accuracy but also include words that are not necessarily allergens.

Analyzing some of the top features for a trained model and the “nut” category shows that while a lot of words are correct for this category, a good portion also contains words that have nothing to do with the allergen category. One of the words, “salt organic”, does not contain allergens for the category “nuts”. Below, example features for the category “nuts” are shown:

wheat flour, bicarbonate ammonium, bay, dioxide preservative,
garlic ginger, yolk sugar, honey mustard, salt organic, riboflavin,
cajun, toffee, spread, kernel soybean, vinegar lactic, manufacture,
lt, cottonseed oil, masa, advice, mollusc, nut, barley flour, tree,
present, tree nut, hazelnut, cashew, walnut, pecan, almond

The resulting features could be due to poor optimization of the trained classifier. However, it is more likely that they are due to the nature of the dataset. The samples are provided by the community, which increases the risk of errors, and the dataset is also heavily imbalanced for some classes. This means that the results in real-world applications are likely to deviate from the test results achieved earlier.

5.4. Evaluation of Proposed System

To evaluate how the proposed system and the trained classifiers perform in the real world, a study with human participants is conducted with a set of randomly selected recipes taken from Yummly (https://www.yummly.com, accessed on 27 February 2022), allrecipes (https://www.allrecipes.com, accessed on 27 February 2022) and On And Off Keto (https://onandoffketo.com, accessed on 27 February 2022). The performance of both the human participants and the system is tested with 20 randomly selected recipes. Ten candidates have been selected and are divided into two groups. The participants range in age from 21 to 43, with 7 of them being male and the rest female. Both groups are introduced to the 14 major allergens defined by the EU legislation, since some participants have not come into contact with a few of the categories yet. The first is the uninformed group, where only basic knowledge about allergens exists. The second group has time to familiarize themselves with the 14 major food allergens with examples for each category. During the study, participants are only allowed to see the labels of the 14 different categories but no ingredient examples for each category. While each participant assigns food allergen categories to the provided recipes, the time needed for each recipe is also recorded.

5.4.1. Results

The results of the study can be seen in Table 4. Some candidates have agreed to participate only if they remain anonymous; thus, the other participants’ names are also omitted. The results show that the informed participants are generally better at labelling allergens than the uninformed group. For the informed group, the average ratio of correct predictions to the total amount of allergens is at 85.60%. This places the informed group well above the system’s 72% with Logistic Regression as a trained model. When comparing the system to the lowest-performing participant with 50%, its predictions are much more accurate. As for the uninformed group, the average is at 69.20%, which is still below the system’s 72%. The most common mistake among participants was to overlook or skip over certain ingredients when reading through recipes. When asked why certain allergens were missed, some participants reported forgetting that an ingredient belongs to a certain allergen category or altogether missing it due to the number of ingredients in the recipes. When asked why they have more missed allergens than incorrectly labelled allergens, some participants expressed uncertainty and wanted to avoid labelling anything when they were unsure as to which category an allergen belongs to. Taken together, the overlooked ingredients and the feeling of uncertainty when categorizing ingredients or recipes mean that the number of missed allergens is, on average, higher than the number of incorrectly labelled allergens in recipes. In contrast, the trained Logistic Regression model performs better than the average participant when comparing incorrectly labelled samples. When comparing the total number of missed allergens, a clearer picture can be seen of why the model has some issues. The average number of missed allergens for the participants is 11.3, whereas the Logistic Regression model missed labelling 14 allergens in total.
As for the time needed, participants, on average, needed around 48 s per recipe, whereas the system usually needs 3–5 s per recipe depending on how fast the ingredients are selected.
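The three quantities discussed above (correct, missed, and incorrectly labelled allergens) can be derived from set operations. The sketch below is an assumption about how such a tally could be computed for a single recipe; the function `score_recipe` and the example labels are hypothetical, and the study's exact scoring procedure may differ.

```python
def score_recipe(true_allergens, labelled_allergens):
    """Count correct, missed, and incorrectly assigned allergen labels for one recipe."""
    true_set, labelled = set(true_allergens), set(labelled_allergens)
    return {
        "correct": len(true_set & labelled),     # present and labelled
        "missed": len(true_set - labelled),      # present but not labelled
        "incorrect": len(labelled - true_set),   # labelled but not present
    }


# Invented example: "eggs" is overlooked and "crustaceans" is assigned wrongly.
truth = {"gluten", "milk", "eggs"}
result = score_recipe(truth, {"gluten", "milk", "crustaceans"})

# Ratio of correct labels to the total number of allergens in the recipe.
ratio_correct = result["correct"] / len(truth)
```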

The results reveal that the biggest problem in labelling allergens is that some categories are missed and overlooked instead of incorrect labelling. As for the system, most problems also stem from missing allergens instead of incorrect labelling. Most participants report that the difference between “molluscs” and “crustaceans” is sometimes unclear and confusing if no prior information is given. When analyzing the results, both categories tend to be incorrectly labelled, which explains the high amount of incorrectly assigned labels for some participants. Compared to the system, distinguishing “molluscs” and “crustaceans” was no problem and did not result in incorrect labelling. As allergens can be a life or death situation for some, a hard requirement for a labelling system is that it is accurate and does not overlook any possible allergens. Incorrect labelling is less of a problem than missing important allergens in a recipe. In conclusion, the system’s performance is a little above the average for uninformed participants. However, due to the number of missing allergens and the criticality of the matter, the system should only be used as an addition instead of a replacement.

5.4.2. User Study Shortcomings

After evaluation, there are a few things that can be improved for a subsequent user study. The first issue concerns the number of participants as well as the number of recipes. More participants increase the quality of captured data, and more information about age groups, genders, and possible relations between them could be observed. Furthermore, the captured data becomes more conclusive. While 10 participants already create enough data to analyze, more people will increase the confidence in the recorded data and the results overall. Another interesting statistic could be whether younger participants are more accurate in labelling recipes with allergens than older participants and how their knowledge and vocabulary of certain ingredients influence their decisions. The second shortcoming concerns the recipes and their ingredients. As recipes have been randomly selected, most have ingredients that reveal which category they belong to, making it easier for participants to detect them. The best example is the category “eggs”, as most recipes only list “egg” as an ingredient instead of products made of eggs. Another example is the ingredient “milk”, which was also one of the ingredients that appeared in many recipes. Nevertheless, a big part of this test was to see how humans and the system perform on recipes that are randomly selected without limiting the selection to a specific category. A follow-up study could look into the system’s performance if more exotic ingredients are used that are not commonly known to participants. In addition, the follow-up study could also test a wide variety of different ingredients for each category, where recipes with more than the standard ingredients like “egg” are used. An example of such an ingredient would be “mayonnaise” or “sauce hollandaise”.

6. Conclusions

6.1. Summary

This paper served to determine what a system for allergen and style detection in recipes could look like and how well a trained classifier performs against humans. Considering how many allergens and ingredients exist, a system that detects allergens can help the user determine whether online recipes contain allergens. Furthermore, a system that can be easily integrated with existing technologies like internet browsers can increase the system usage rate and increase awareness of allergens in general. A user study was conducted to test how well the system and humans perform in labelling the 14 different EU food allergens. The study showed that participants needed around 48 s on average per recipe. The system, in comparison, only takes between three and five seconds per recipe, depending on how fast ingredients are selected. This also shows that the system can be used without any major problems and that predicting food allergens with the system is on average faster than a human. As for the performance of the trained classifier and human participants, results show that the system performs better than the average uninformed participant. Compared to informed participants, who have been briefed about the allergens and the most common ingredients that belong to them, the system performs worse. In general, both sides, system and human participants, more often overlooked or missed allergens than incorrectly labelled recipes with allergens.

Since food allergens can cause serious harm to anyone who has an allergic reaction to a certain food ingredient, overlooking or incorrectly labelling allergens can both cause fatal outcomes. Since the system performs worse than the average informed participant, using the system for a critical task like identifying allergens for someone who is affected by food allergens is not recommended. However, the system can be used as an indicator for some use cases as a pre-screening tool before inspecting the recipe closer. In conclusion, a food allergen detection system will only be viable as a replacement for human screening if the system’s performance heavily outweighs the performance of human labelling. Creating a perfect classifier will be hard and completely replacing human judgement could result in accidents, and as such, the system will be more helpful as additional help to aid people in choosing the correct recipe before the user inspects the recipe closer to make sure no allergens are present.

6.2. Challenges and Problems

The machine learning process suffered from three main problems:

Lack of data;
Data quality;
Class imbalance.

The lack of datasets from government or scientific institutions with labelling of the 14 major food allergens is a problem that is not easy to deal with. The Open Food Facts database contains labelling, but it is community maintained and as such not completely reliable. This also affects the overall data quality as some entries can not be used and have to be removed. Another factor that is important for text based classification is the structure and the ingredients. Synonyms, abbreviations, common cooking terms, and other words that have little or no semantic meaning have to be considered as well. Lastly, class imbalance is also a major problem. Some food allergen categories like “lupin” are naturally not that common in recipes or food items.

6.3. Known Limitations and Discussion

We are aware that the extreme diversity of allergens occurring in foods and food products is hard to capture. Thus, we intend to present a basis for the increasing demand for a tool helping allergic people to avoid deleterious effects from hidden allergens occurring in food products. In detail, the term “allergen” in the present work (and in the EU legislation) corresponds to foods and food products that are potentially allergenic. Strictly speaking, the term allergen refers to the different types of allergenic proteins present in a food product, some of which might not be as important in terms of allergenicity as others. Nevertheless, our system does not intend to rate allergenicity, but to give an overview of allergens within the 14 main allergen categories.

Furthermore, many food products have been transformed in such a way that native allergens occurring in these food products may be modified and exhibit a quite different allergenicity, compared to that of the crude product. Additionally, the categorization in different cuisines is not always of sufficient relevance because of the increasing use of “ready-to-use” transformed ingredients in modern cuisine, which introduce some uniformization in the types of cuisine.

All these aspects are known and not yet covered in our proposed system and thus constitute considerable limitations.

6.4. Future Extensions and Development

The area of allergen and cuisine style detection in recipes provides many opportunities. In this section, a few points are discussed.

6.4.1. Performance

As discussed earlier in Section 5, the performance of the trained classifier is worse than the average informed participant. An increase in performance to above the average level will be a huge boost and possibly make the system viable for everyday use. With higher performance, the system could be used as a pre-screening tool to help users select recipes. Another option would be to focus on a well performing category and offer services just for that category. An example would be to focus on the categories where most individuals are affected. This could improve the performance and accuracy for the remaining allergen categories.

6.4.2. Language

The focus of this study was the English language. This was also partly because of the familiarity with the English language and the available data in the dataset. Other languages like the French language could be used in the future to predict allergens in their respective language. This, however, requires more data, which the Open Food Facts dataset might not provide.

6.4.3. Datasets

The current limitation is the Open Food Facts dataset. The allergen categories are not represented by equal numbers of samples, meaning the dataset is imbalanced. More entries and diversity for each allergen class could improve the predictions of trained classifiers. A custom dataset created from online recipes with oversight from experts that control the labelling process could also increase classifiers’ performance, but requires time and effort from multiple people.

6.4.4. Machine Learning Classifier

This paper focused on traditional classifiers such as Random Forest and Logistic Regression, together with a Multi-Layer Perceptron classifier. In the future, deep learning approaches with larger neural networks could be applied to the multi-label classification problem [38] to improve the current results.

6.4.5. Regional Cooking Terms

As mentioned in Section 2, Teng et al. [12] found that there is a regional preference for applying a specific heating method. The presence of certain cooking terms could therefore help indicate the region a cuisine originates from. This, however, requires extensive research into regional cuisines and cooking terminology. The problem with this approach is that the terms might not provide enough evidence, or sufficiently conclusive evidence, of whether a cuisine is from a particular region. For this reason, the approach was not pursued further in this paper. For further development, however, these terms might prove to be valuable features for training classifiers.
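
One simple way to explore this idea would be to turn the presence of cooking terms into binary features that are added to the ingredient features. The sketch below is only an illustration under stated assumptions: the term list `COOKING_TERMS` and the function `cooking_term_features` are hypothetical, and no claim is made that these terms are indicative of any region.

```python
import re
from typing import Dict

# Hypothetical term list, chosen only for illustration.
COOKING_TERMS = ["bake", "boil", "fry", "grill", "simmer", "steam"]


def cooking_term_features(instructions: str) -> Dict[str, int]:
    """Return a 0/1 indicator per cooking term found in the instruction text.

    Matching is a case-insensitive prefix match on word boundaries, so
    "steam" also matches "steamed" and "steaming". Forms that change the
    stem, such as "baking" or "fried", are not covered by this sketch.
    """
    text = instructions.lower()
    features = {}
    for term in COOKING_TERMS:
        pattern = r"\b" + re.escape(term) + r"\w*"
        features[term] = 1 if re.search(pattern, text) else 0
    return features


example = cooking_term_features("Steam the rice, then fry the egg.")
print(example)
```

A real feature set would need a researched term list and proper stemming or lemmatization.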

6.4.6. Feedback Loop

User feedback can be valuable extra information for improving the classifier and the prototype system overall. Currently, users have no way to report inconsistencies or errors. In the future, user feedback could be combined with the scanned contents to help explain why a wrong or failed prediction was shown to the user. This additional information could be used to improve the trained classifiers and, once properly labelled, to extend the overall training dataset.

6.4.7. Integrations

The current prototype only scans websites and selected text when a keyboard shortcut is used. As an additional feature, integrations with other applications or websites are conceivable. For example, the URL of the visited website could be used to determine the recipe ingredients without the user manually selecting text. This would reduce the effort needed to find out whether a recipe contains allergens, but might also be intrusive, as a popup would be shown for every single recipe. Another possible extension is integration with cooking apps that feature link sharing. A separate app or web application could receive the shared link, analyze its contents and present the user with a prediction.

Author Contributions

Conceptualization, A.R. and M.K.; methodology, A.R.; software, A.R.; validation, A.R., M.K. and E.S.; formal analysis, A.R.; investigation, A.R.; resources, M.K.; data curation, A.R.; writing—original draft preparation, A.R.; writing—review and editing, A.R., M.K. and E.S.; visualization, A.R.; supervision, M.K.; project administration, M.K.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

Open Access Funding by the University for Continuing Education Krems, the University of Applied Sciences BFI Vienna and the University of Applied Sciences Upper Austria.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Council of European Union. Regulation (EU) No 1169/2011 of the European Parliament and of the Council of 25 October 2011 on the Provision of Food Information to Consumers, Amending Regulations (EC) No 1924/2006 and (EC) No 1925/2006 of the European Parliament and of the Council, and Repealing Commission Directive 87/250/EEC, Council Directive 90/496/EEC, Commission Directive 1999/10/EC, Directive 2000/13/EC of the European Parliament and of the Council, Commission Directives 2002/67/EC and 2008/5/EC and Commission Regulation (EC) No 608/2004. 2011. Available online: https://www.legislation.gov.uk/eur/2011/1169/contents (accessed on 2 December 2020).
2. Bruijnzeel-Koomen, C.; Ortolani, C.; Aas, K.; Bindslev-Jensen, C.; Björksten, B.; Wüthrich, B. Adverse reactions to food: Position paper of the European Academy of Allergy and Clinical Immunology. Allergy 1995, 50, 623–635.
3. Tang, M.L.K.; Mullins, R.J. Food allergy: Is prevalence increasing? Intern. Med. J. 2017, 47, 256–261.
4. Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, Chemnitz, Germany, 21–23 April 1998; Springer: Berlin/Heidelberg, Germany, 1998; pp. 137–142.
5. Rennie, J.D.M.; Rifkin, R. Improving Multiclass Text Classification with the Support Vector Machine. 2001. Available online: https://www.researchgate.net/publication/2522390_Improving_Multiclass_Text_Classification_with_the_Support_Vector_Machine (accessed on 27 February 2022).
6. Rish, I. An Empirical Study of the Naive Bayes Classifier. Available online: https://www.cc.gatech.edu/home/isbell/classes/reading/papers/Rish.pdf (accessed on 27 February 2022).
7. Kleinbaum, D.G.; Dietz, K.; Gail, M.; Klein, M.; Klein, M. Logistic Regression; Springer: Cham, Switzerland, 2002.
8. Kalajdziski, S.; Radevski, G.; Ivanoska, I.; Trivodaliev, K.; Stojkoska, B.R. Cuisine classification using recipe’s ingredients. In Proceedings of the 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 21–25 May 2018; pp. 1074–1079.
9. Yujian, L.; Bo, L. A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1091–1095.
10. Swamynathan, M. Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python, 2nd ed.; Apress: New York, NY, USA, 2019.
11. Ramos, J. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; Volume 242, pp. 29–48.
12. Teng, C.Y.; Lin, Y.R.; Adamic, L.A. Recipe recommendation using ingredient networks. In Proceedings of the WebSci ’12 4th Annual ACM Web Science Conference Association for Computing Machinery, New York, NY, USA, 22–24 June 2012; pp. 298–307.
13. Li, B.; Wang, M. Cuisine Classification from Ingredients. Available online: http://cs229.stanford.edu/proj2015/313_report.pdf (accessed on 27 February 2022).
14. Su, H.; Lin, T.W.; Li, C.T.; Shan, M.K.; Chang, J. Automatic recipe cuisine classification by ingredients. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication; Association for Computing Machinery, New York, NY, USA, 13–17 September 2014; pp. 565–570.
15. Lane, P.C.R.; Clarke, D.; Hender, P. On developing robust models for favourability analysis: Model choice, feature sets and imbalanced data. Decis. Support Syst. 2012, 53, 712–718.
16. Britto, L.; Pacífico, L.; Oliveira, E.; Ludermir, T. A cooking recipe multi-label classification approach for food restriction identification. In Proceedings of the Anais do XVII Encontro Nacional de Inteligência Artificial e Computacional, SBC, Porto Alegre, Brazil, 20–23 October 2020; pp. 246–257.
17. Bishop, C.M. Pattern Recognition and Machine Learning; Information Science and Statistics; Springer: New York, NY, USA, 2006.
18. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection; Ijcai: Montreal, QC, Canada, 1995; Volume 14, pp. 1137–1145.
19. Sechidis, K.; Tsoumakas, G.; Vlahavas, I. On the stratification of multi-label data. In Machine Learning and Knowledge Discovery in Databases; Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 145–158.
20. Alemany-Bordera, J.; Heras Barberá, S.M.; Palanca Cámara, J.; Julian Inglada, V.J. Bargaining agents based system for automatic classification of potential allergens in recipes. ADCAIJ Adv. Distrib. Comput. Artif. Intell. J. 2016, 5, 43–51.
21. U.S. Department of Agriculture. FoodData Central. 2020. Available online: https://fdc.nal.usda.gov/ (accessed on 2 December 2020).
22. Ueda, M.; Takahata, M.; Nakajima, S. User’s food preference extraction for personalized cooking recipe recommendation. In Proceedings of the Second International Conference on Semantic Personalized Information Management: Retrieval and Recommendation, Bonn, Germany, 24 October 2011; Volume 781, pp. 98–105.
23. Freyne, J.; Berkovsky, S. Intelligent food planning: Personalized recipe recommendation. In Proceedings of the 15th International Conference on Intelligent User Interfaces; Association for Computing Machinery, New York, NY, USA, 7–10 February 2010; pp. 321–324.
24. Open Food Facts Community. Open Food Facts—Food Products Database. 2020. Available online: https://world.openfoodfacts.org/data (accessed on 27 February 2022).
25. Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. 2007, 3, 1–13.
26. Manna, S. Imbalanced Multilabel Scene Classification using Keras. The Owl, 29 July 2020.
27. Feurer, M.; Hutter, F. Hyperparameter optimization. In Automated Machine Learning; Springer: Cham, Switzerland, 2019; pp. 3–33.
28. Jaccard, P. The distribution of the flora in the alpine zone. 1. New Phytol. 1912, 11, 37–50.
29. Soucy, P.; Mineau, G.W. A simple KNN algorithm for text categorization. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 647–648.
30. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
31. Albon, C. Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning, 1st ed.; O’Reilly Media: Newton, MA, USA, 2018.
32. Sorower, M.S. A Literature Survey on Algorithms for Multi-Label Learning. 2010. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.364.5612&rep=rep1&type=pdf (accessed on 27 February 2022).
33. Hinton, G.E. Connectionist learning procedures. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 1990; pp. 555–610.
34. Pillai, I.; Fumera, G.; Roli, F. Designing multi-label classifiers that maximize F measures: State of the art. Pattern Recognit. 2017, 61, 394–404.
35. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
36. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
37. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
38. Nam, J.; Kim, J.; Loza Mencía, E.; Gurevych, I.; Fürnkranz, J. Large-scale multi-label text classification—Revisiting neural networks. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 14–18 September 2014; Calders, T., Esposito, F., Hüllermeier, E., Meo, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 437–452.

Figure 1. Class distribution for the Kaggle recipe train set.

Figure 2. Top 20 ingredients for the Brazilian cuisine class.

Figure 3. Top 20 ingredients for the Chinese cuisine class.

Figure 4. Class distribution of allergens in the cleaned dataset after downsampling is applied.

Figure 5. Class distribution of allergens in the cleaned dataset after up-sampling is applied.

Figure 6. Cuisine classifier scores.

Figure 7. Confusion matrix of the LinearSVC classifier.

Figure 8. Allergen classifier scores without downsampling.

Figure 9. Logistic regression classifier scores for the milk category with varying TF-IDF feature values.

Table 1. Labelled Recipe Data Example.

Id	Cuisine	Ingredients
0	Spanish	mussel, black pepper, garlic, saffron thread, olive oil, stew tomato, arborio rice…
1	Mexican	tomato, red onion, paprika, salt, corn tortilla, cilantro, cremini, broth, pepper…
2	French	chicken broth, truffle, pimento, green pepper, olive, turkey, egg yolk, …
3	Chinese	ginger, sesame oil, pea, cooked rice, bell pepper, peanut oil, egg, garlic, …

Table 2. Cuisine classification classifier parameters after grid search.

Classifier	Parameter
KNN	n_neighbors: 75
Logistic Regression	C: 0.5
	max_iter: 1000
	multi_class: auto
	solver: lbfgs
Random Forest	class_weight: balanced
	max_depth: 75
	max_features: auto
	n_estimators: 100
Decision Tree	max_depth: 120
	max_features: auto
	min_samples_leaf: 1
SVC	C: 10
	gamma: 0.001
	kernel: rbf
LinearSVC	C: 0.2
	dual: false
	max_iter: 1100
	penalty: l1

Table 3. Allergen classification classifier parameters after grid search.

Classifier	Parameter
Logistic Regression	estimator__C: 20
	estimator__class_weight: balanced
	estimator__max_iter: 2500
	estimator__solver: saga
Random Forest	estimator__class_weight: balanced
	estimator__max_depth: 400
	estimator__n_estimators: 2000
Decision Tree	estimator__max_depth: 2500
	estimator__min_samples_leaf: 5
MLP	activation: relu
	early_stopping: True
	hidden_layer_sizes: (130,)
	learning_rate: constant
	max_iter: 300
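
The `estimator__` prefix in Table 3 matches the scikit-learn convention for addressing the parameters of an estimator wrapped by a meta-estimator. The following minimal sketch shows how such names arise. It assumes the wrapper is `OneVsRestClassifier` and that the input features are TF-IDF vectors, neither of which is confirmed by the table itself, and it uses a tiny hypothetical dataset in place of the Open Food Facts data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: ingredient text and the allergens it contains.
ingredient_texts = [
    "wheat flour, butter, sugar",
    "skimmed milk, cocoa, sugar",
    "wheat flour, milk, egg",
    "rice, salt, water",
    "peanut, salt",
    "peanut butter, wheat bread",
]
allergen_sets = [
    {"gluten", "milk"},
    {"milk"},
    {"gluten", "milk", "eggs"},
    set(),
    {"peanuts"},
    {"peanuts", "gluten"},
]

# One binary column per allergen (multi-label indicator matrix).
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(allergen_sets)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(ingredient_texts)

# Logistic Regression values from Table 3, wrapped so that one binary
# classifier is trained per allergen label (wrapper is an assumption).
base = LogisticRegression(
    C=20, class_weight="balanced", max_iter=2500, solver="saga"
)
model = OneVsRestClassifier(base)

# Parameters of the wrapped estimator are exposed as "estimator__<name>".
wrapped_params = sorted(
    name for name in model.get_params() if name.startswith("estimator__")
)
print(wrapped_params)

model.fit(X, y)
prediction = model.predict(vectorizer.transform(["milk chocolate, sugar"]))
print(binarizer.classes_, prediction)
```

In a grid search over such a wrapped model, the parameter grid would use the same prefixed keys, for example `{"estimator__C": [...]}`. The MLP rows in Table 3 carry no prefix, which would be consistent with a classifier that is used directly on the multi-label targets.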

Table 4. User study results, showing the correctly labelled, missing and incorrectly labelled allergens out of a total of 50 allergens spread across the recipes. The ratio is the number of correctly labelled allergens divided by the total number of allergens.

Participant No.	Correct	Missing	Incorrect	Avg. Seconds/Recipe	Ratio
1-informed	46	4	3	34.88	92.00%
2-informed	42	8	9	47.50	84.00%
3-informed	38	12	0	62.81	76.00%
4-informed	42	8	3	73.75	84.00%
5-informed	46	4	3	35.31	92.00%
6	25	25	6	38.63	50.00%
7	40	10	2	41.25	80.00%
8	38	12	4	58.69	76.00%
9	34	16	3	50.88	68.00%
10	36	14	4	36.56	72.00%
Logistic Regression	36	14	2	4.5	72.00%
MLP	28	22	6	4.5	56.00%
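
The group averages behind the comparison in Section 6.4.1 can be recomputed from Table 4. The sketch below is a worked calculation using only the counts in the table. Two points are assumptions: rows without the "-informed" suffix are treated as participants who were not informed, and the extra measure `share_of_labels_correct` (correct labels divided by all labels a rater gave) is not a column of the table.

```python
# Rows of Table 4 as (correct, missing, incorrect) allergen counts.
TABLE_4 = {
    "1-informed": (46, 4, 3),
    "2-informed": (42, 8, 9),
    "3-informed": (38, 12, 0),
    "4-informed": (42, 8, 3),
    "5-informed": (46, 4, 3),
    "6": (25, 25, 6),
    "7": (40, 10, 2),
    "8": (38, 12, 4),
    "9": (34, 16, 3),
    "10": (36, 14, 4),
    "Logistic Regression": (36, 14, 2),
    "MLP": (28, 22, 6),
}
CLASSIFIERS = {"Logistic Regression", "MLP"}


def ratio(row):
    """Ratio column of Table 4: correct / (correct + missing)."""
    correct, missing, _ = row
    return correct / (correct + missing)


def share_of_labels_correct(row):
    """Assumed extra measure: correct / (correct + incorrect)."""
    correct, _, incorrect = row
    return correct / (correct + incorrect)


def mean_ratio(names):
    """Average the Table 4 ratio over the given row names."""
    names = list(names)
    return sum(ratio(TABLE_4[name]) for name in names) / len(names)


informed = [n for n in TABLE_4 if n.endswith("-informed")]
# Assumption: participant rows without the suffix were not informed.
uninformed = [n for n in TABLE_4 if n not in CLASSIFIERS and n not in informed]

summary = {
    "informed": mean_ratio(informed),
    "not informed": mean_ratio(uninformed),
    "all participants": mean_ratio(informed + uninformed),
}
for group, value in summary.items():
    print(f"{group}: {value:.1%}")
for name in sorted(CLASSIFIERS):
    row = TABLE_4[name]
    print(
        f"{name}: ratio {ratio(row):.1%}, "
        f"correct share of given labels {share_of_labels_correct(row):.1%}"
    )
```

With the counts above, the informed participants average 85.6% and the remaining participants 69.2%, compared with 72.0% for Logistic Regression and 56.0% for the MLP.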

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
