1. Introduction
Food image generation refers to producing visually realistic food images from given ingredient descriptions [1]. With the advancement of AIGC (Artificial Intelligence Generated Content), notable progress has been made in food image generation [2,3,4]. However, generating high-quality food images remains a challenge. The main challenges are as follows [5]: (1) Semantic inconsistency, i.e., inconsistency between ingredient and food image data, which makes it difficult for models to precisely describe the correspondence between ingredients and images and thus degrades the precision of food image generation. (2) Insufficient visual realism: the generated images fail to express the visual information of actual food images in detail and accurately.
To ensure text–image semantic consistency, existing methods attempt to integrate text encoders and image encoders to establish cross-modal consistency, typically using LSTM (Long Short-Term Memory) networks as text encoders and CNNs (Convolutional Neural Networks) as image encoders [6,7]. However, the fixed input sequence length of LSTM may cause information loss, especially when dealing with diverse and high-dimensional data. Furthermore, LSTM relies heavily on extensive labeled data, which potentially limits its performance in few-shot scenarios. In contrast, the Transformer relies on self-attention mechanisms to capture relationships between different positions in a sequence. By attending to information throughout the entire sequence, Transformers overcome the limitations of LSTM in text encoding, which contributes to a more precise understanding of semantic context and relationships within the text [8]. However, this approach lacks a joint learning process for image and text encoding. Consequently, it fails to fully exploit the latent cross-modal information, which may result in suboptimal performance. To achieve superior outcomes, it is imperative to explore joint learning methods for image–text encoding.
Due to the scarcity of visually realistic food images, most existing research relied on GANs for food image generation. For instance, SM-GAN [9] introduced user-drawn mask images to strengthen the reliability of generated images by reinforcing distinctions between plate masks and food masks. However, this approach produced lower-resolution images and used the masks solely to regularize a retrieval model, which limited the assessment of image quality. Differing from regularized retrieval models, CookGAN [10] adopted attention mechanisms and cycle-consistency constraints to enhance image quality and control appearance. PizzaGAN [11] employed CycleGAN to distinguish the presence or absence of various ingredients, so that different stages of pizza preparation could be simulated. Nevertheless, this approach only generated pizzas following predefined steps, which limited its capability to generate diverse food images. To address this limitation, ML-CookGAN [12] represented text information at various granularities (sentence or word level), which can be transformed into ingredients of different sizes and shapes in the generated images. To enhance the quality of generated images, ChefGAN [13] utilized joint embeddings of image and ingredient features to guide image generation. Although these GAN-based methods improve the clarity of generated images to some extent, the generated images remain imperfect and lack diverse visual features.

Although researchers have conducted extensive exploration, food image generation still faces the problems of low semantic consistency and insufficient visual realism. With the emergence of diffusion models [14], food image generation methods based on diffusion models have achieved promising results in efficiently generating more diverse food images.
To address the aforementioned issues, this paper proposes a memory-learning embedded attention fusion model for food image generation, as depicted in Figure 1. The enhanced CLIP module is designed to establish tight correlations between food ingredients and images through joint embeddings of ingredients and images. To ensure the visual realism and diversity of the initial images, the Memory module is integrated with a pre-trained diffusion model to retrieve and generate initial images. Finally, an attention fusion module is applied to enhance the comprehension between ingredient and image features. The contributions of this paper can be outlined as follows:
- (1)
To address the semantic inconsistency issue, we propose the enhanced CLIP module, which jointly embeds an ingredient encoder and an image encoder. The former aims to preserve crucial semantic information by transforming sparse ingredient embeddings into compact embeddings. The main idea of the latter is to capture multi-scale feature information to enhance image representations.
- (2)
To address the insufficient visual realism issue, a Memory module is proposed in combination with a pre-trained diffusion model. This module stores the ingredient-image pairs learned by the CLIP module as an information base to guide the food image generation process.
- (3)
An attention fusion module is proposed to enhance the understanding between ingredient and image features through three attention blocks: the Cross-modal Attention block (CmA), the Memory Complementary Attention block (MCA), and the Combinational Attention block (CoA). This module can efficiently refine the feature representations of the generated images.
3. Method
As shown in Figure 2, our proposed food image generation network MLA-Diff consists of three components: the enhanced CLIP module, the Memory module, and the image generation module. In the CLIP module, MLA-Diff trains both the food ingredient encoder and the image encoder to learn ingredient-image pairs by transforming sparse ingredient embeddings into compact embeddings and by capturing multi-scale image features. In the Memory module, ingredient-image pairs are stored, and initial images are generated by a pre-trained diffusion model. In the image generation module, the attention fusion module is designed to refine image details by integrating features from different modalities.
The algorithmic flow of the proposed food image generation method is as follows:
- Step 1.
Train the CLIP module using the food images and their corresponding ingredients.
- Step 2.
Employ the trained CLIP module to create ingredient-image embedding pairs, and store these pairs in the Memory module.
- Step 3.
Generate the auxiliary image and the ingredient embedding by using the diffusion module and the ingredient encoder of the CLIP module, respectively.
- Step 4.
Query the ingredient embedding against the Memory module to find the most similar stored ingredient-image pair.
- Step 5.
For image generation: (1) Input the retrieved query image and the auxiliary image into the encoder of the image generation module to obtain the encoded image features. (2) Combine the ingredient embedding, the retrieved query ingredient embedding, and the image embeddings using the attention fusion module to derive fused ingredient and image features, denoted as CoA. (3) Feed the fused features CoA into the decoder of the image generation module to produce the final food image. A sketch of this flow is given below.
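To make the flow concrete, the following is a minimal PyTorch-style sketch of the inference procedure (Steps 3-5). The module and function names (`clip_ingredient_encoder`, `diffusion.sample`, `memory.retrieve`, `unet_encoder`, `attention_fusion`, `unet_decoder`) are hypothetical placeholders assumed for illustration; they do not correspond to a released implementation.

```python
import torch

def generate_food_image(ingredients, clip_ingredient_encoder, diffusion,
                        memory, unet_encoder, attention_fusion, unet_decoder):
    """Illustrative end-to-end inference flow of MLA-Diff (Steps 3-5)."""
    # Step 3: ingredient embedding from the trained CLIP ingredient encoder,
    # and an auxiliary image from the pre-trained diffusion model.
    ing_emb = clip_ingredient_encoder(ingredients)       # (1, d)
    aux_img = diffusion.sample(ingredients)               # (1, 3, H, W)

    # Step 4: retrieve the most similar stored ingredient-image pair
    # from the Memory module (cosine similarity, see Section 3.2).
    query_ing_emb, query_img = memory.retrieve(ing_emb)

    # Step 5: encode the query and auxiliary images, fuse the features with
    # the attention fusion module, and decode the result into the final image.
    img_feat = unet_encoder(torch.cat([query_img, aux_img], dim=1))
    fused = attention_fusion(img_feat, ing_emb, query_ing_emb)
    return unet_decoder(fused)
```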
3.1. Enhanced CLIP Module
Our enhanced CLIP module aims to join food ingredient information and image information to generate ingredient-image pairs, as mentioned in the previous section.
This module consists of an ingredient encoder and an image encoder. The former, composed of multiple MLP blocks with residual connections, primarily focuses on transforming sparse food ingredient embeddings into compact representations, as depicted in Figure 3a. Each MLP block consists of fully connected layers, sigmoid activation functions, and batch normalization. The latter comprises a Multi-scale Feature Extraction Module (MFEM), a Feature Fusion Module (FFM), and a Feature Mapping Module (FMM), as outlined in Figure 3b. The MFEM extracts multi-scale features from input images to help the model comprehend various levels of image detail. The FFM concatenates the multi-scale image features and applies CBConv to capture distinctive visual information. Finally, the FMM, which contains an MLP layer, maps the fused features into a one-dimensional space, aligning them with the compact ingredient embeddings.
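As a concrete illustration of the ingredient encoder, the following is a minimal sketch of a residual MLP block built from the stated components (fully connected layers, sigmoid activations, and batch normalization). The layer widths and the number of blocks are assumptions for illustration and are not taken from the paper.

```python
import torch.nn as nn

class MLPBlock(nn.Module):
    """One residual MLP block of the ingredient encoder (illustrative sketch)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.Sigmoid(),
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
        )

    def forward(self, x):
        # The residual connection preserves the original ingredient signal.
        return x + self.net(x)

class IngredientEncoder(nn.Module):
    """Stacks MLP blocks to map sparse ingredient vectors to compact embeddings."""
    def __init__(self, in_dim: int, embed_dim: int = 512, num_blocks: int = 3):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        self.blocks = nn.Sequential(*[MLPBlock(embed_dim) for _ in range(num_blocks)])

    def forward(self, x):
        return self.blocks(self.proj(x))
```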
The procedure for the CLIP module is outlined in Algorithm 1:
Algorithm 1. CLIP module.
1. Input: ingredients, images, batch size, and other parameters.
2. Output: ingredient feature vectors.
3. for each epoch do
4. for each batch do
5. Encode the ingredients into ingredient feature vectors with the ingredient encoder.
6. Encode the images into image feature vectors with the image encoder.
7. Generate an identity matrix of batch size as the labels.
8. Compute the fusion (similarity) matrix from the ingredient and image feature vectors.
9. Optimize the objective function by Formula (2).
10. Save the ingredient feature vectors and images in pairs.
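For readers who prefer code, below is a minimal PyTorch-style sketch of this training loop. It assumes a standard CLIP-style symmetric cross-entropy over the similarity matrix with identity labels; the paper's actual objective is given by Formula (2), so the loss shown here is only an illustrative stand-in, and the temperature `tau` is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def train_clip_epoch(loader, ingredient_encoder, image_encoder, optimizer, tau=0.07):
    """Illustrative training loop for Algorithm 1 (one epoch)."""
    for ingredients, images in loader:
        ing = F.normalize(ingredient_encoder(ingredients), dim=-1)   # step 5
        img = F.normalize(image_encoder(images), dim=-1)             # step 6
        labels = torch.arange(ing.size(0), device=ing.device)        # step 7: identity labels
        logits = ing @ img.t() / tau                                 # step 8: similarity matrix
        # step 9: symmetric contrastive loss as a stand-in for Formula (2)
        loss = (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```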
3.2. Memory Module
Drawing inspiration from the notion of prototype memory learning [27], the Memory module is introduced to store the ingredient-image pairs. During image generation, the cosine similarity metric is employed to measure the similarity between the current ingredient embedding and the ingredient embeddings stored in the Memory module. This similarity score is used to retrieve the stored image most similar to the current ingredient embedding, which then serves as an input to the image generation module.
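A minimal sketch of such a memory store with cosine-similarity lookup is shown below; the class name, method names, and tensor layout are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

class MemoryModule:
    """Stores CLIP ingredient embeddings with their paired images; retrieves by cosine similarity."""
    def __init__(self, ingredient_embeddings: torch.Tensor, images: torch.Tensor):
        # ingredient_embeddings: (N, d), images: (N, 3, H, W)
        self.raw_keys = ingredient_embeddings
        self.keys = F.normalize(ingredient_embeddings, dim=-1)
        self.images = images

    def retrieve(self, query: torch.Tensor):
        """Return the stored (ingredient embedding, image) pair most similar to the query."""
        q = F.normalize(query, dim=-1)          # (1, d)
        scores = q @ self.keys.t()              # cosine similarities, (1, N)
        idx = scores.argmax(dim=-1)             # index of the best-matching pair
        return self.raw_keys[idx], self.images[idx]
```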
3.3. Image Generation Module
In this section, an image generation module, including three attention mechanisms–Cross-modal Attention block (CmA), Memory Complementary Attention block (MCA), and Combinational Attention block (CoA)—is designed as shown in
Figure 4. The image generation module, as illustrated in Algorithm 2, aims to refine the initial image features through the attention fusion module, aligning them as closely as possible with features in the attention region. These three attention mechanisms are defined as follows.
- (1)
The CmA block is responsible for extracting interaction features between food ingredients and food images. Specifically, it integrates the current ingredient embedding, the retrieved query ingredient embedding, and the image embeddings to establish four distinct Cross-modal Attention blocks, as shown in Figure 4. The four CmA blocks are formulated as follows:
- (2)
The MCA block is constructed to adaptively learn a difference-based weight for the image features retrieved from the Memory module. We assume that the difference between the input ingredient embedding and the retrieved ingredient embedding indicates the similarity between the input ingredients and those stored in the Memory module. The formula that converts this difference into a weight is defined as follows:
Algorithm 2. Food image generation module.
1. Input: the auxiliary image, the retrieved query image, the ingredient embeddings, the retrieved query ingredient embedding, and the labels.
2. Output: the generated food image.
3. for each training iteration do
4. Fuse the auxiliary image and the retrieved query image.
5. Input the fused image into the encoder of the U-Net, and obtain the encoder feature.
6. Input the encoder feature and the ingredient embeddings into the attention block.
7. Calculate the Cross-modal Attention blocks by Formulas (3)-(6).
8. Calculate the Memory Complementary Attention by Formula (8).
9. Calculate the Combinational Attention by Formula (9).
10. Input the fused attention feature into the decoder of the U-Net.
11. Output the image generated by the decoder.
12. Calculate the loss between the generated image and the ground-truth image.
13. Backpropagate and adjust the weight parameters of the food image generation model.
As shown in the equation above, a smaller difference implies a higher similarity; that is, the generated image will be more similar to the corresponding query image. Therefore, the difference is adaptively learned via the MCA block to balance the contributions of the image features and the text features, defined as follows:
- (3)
The CoA block is designed to adaptively assign weight parameters to both the CmA and MCA outputs. To measure the contribution of each attention, the fused attention feature is formulated as a linear combination of the attention outputs, as shown in Equation (9):
The procedure for the food image generation module is outlined in Algorithm 2.
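Since Formulas (3)-(9) are presented as equations in the paper, the following is only a schematic PyTorch-style sketch of how the three attention blocks could be composed. The multi-head cross-attention form, the exponential mapping from the embedding difference to a weight, and the single learnable combination weight are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Schematic composition of CmA, MCA, and CoA (illustrative, not the exact formulas)."""
    def __init__(self, dim: int):
        super().__init__()
        self.cma = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.5))    # assumed learnable CoA weight

    def forward(self, img_feat, ing_emb, query_ing_emb):
        # img_feat: (B, L, d) U-Net encoder features; ing_emb, query_ing_emb: (B, 1, d)

        # CmA: cross-modal attention between image features and ingredient embeddings
        # (the paper defines four such blocks in Formulas (3)-(6); one is shown here).
        cma_out, _ = self.cma(query=img_feat, key=ing_emb, value=ing_emb)

        # MCA: a difference-based weight between the input and retrieved ingredient
        # embeddings (stand-in for the weight formula and Formula (8)); a smaller
        # difference yields a larger weight for the memory-complementary features.
        diff = (ing_emb - query_ing_emb).abs().mean(dim=-1, keepdim=True)
        w = torch.exp(-diff)
        mca_out = w * cma_out + (1.0 - w) * img_feat

        # CoA: linear combination of the attention outputs (stand-in for Formula (9)).
        return self.alpha * cma_out + (1.0 - self.alpha) * mca_out
```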
6. Conclusions
In this paper, we propose a novel network for food image generation from food ingredients. It first extracts features from the input ingredients and images with the CLIP module; the ingredient-image pairs learned by the CLIP module are then stored in the Memory module, while a pre-trained diffusion model is employed to generate the initial image in order to enhance the visual realism of the generated images. Finally, the attention fusion module refines the details of the generated images. Experimental results on the Mini-food dataset demonstrate the superiority of MLA-Diff compared to state-of-the-art methods, with an average decrease of 4.758 in FID, 0.733 in LPIPS, and an average increase of 2.450 in IS. The reason is that MLA-Diff captures the ingredient information more efficiently and enhances the generated images with more realistic details that are consistent with the input ingredients.
Nevertheless, the methodology did not incorporate the processing of multimodal datasets, thereby neglecting the comprehensive analysis of multimodal information, which is essential for a more profound understanding of food ingredient data. Furthermore, the method failed to account for the nuanced dietary requirements that vary according to distinct culinary styles, national origins, and geographical regions.
Our future work will primarily concentrate on two aspects. Firstly, we intend to incorporate additional information besides ingredient data, such as recipes and video frames, during network training; analyzing such multi-category information provides a better comprehension of food ingredient data in the food image generation process. Secondly, a more powerful pre-trained model will be introduced so that the network can generate food images with different styles, angles, or specific requirements for various application fields.