**3. Methodology**

The proposed system is composed of deep neural networks that analyze independent characteristics of the user from the face (age, gender, and personality). These characteristics, together with the ambient temperature, are fed into a drink recommender system, and the resulting recommendation is displayed on a screen through an augmented reality module.

Figure 1 shows three types of blocks: the green blocks represent external components used for our purposes; the red blocks represent actions carried out using communication protocols between hardware components; and the blue blocks are the modules implemented in this work.

Next, each of the modules that appear in the diagram of Figure 1 is described; in the final block, the hardware used is explained in general terms.

#### *3.1. Video Reception*

Video from outside the store is continuously sent from an IP camera to an NVIDIA Jetson device for processing. Beforehand, the network must be configured so that all the devices can communicate, as indicated in [50].

#### *3.2. Video-to-Image Capture*

Once the NVIDIA Jetson has received the video from the camera, the frames must be extracted one by one using the OpenCV library, since these frames are the basis of all subsequent processing.
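A minimal sketch of this capture stage, assuming a hypothetical RTSP address for the IP camera (the real network configuration follows [50]); the frame-sampling helper is an illustrative addition, not part of the paper:

```python
RTSP_URL = "rtsp://<camera-ip>:554/stream"   # placeholder, not the real address

def should_sample(frame_idx: int, every_n: int) -> bool:
    """Keep only every n-th frame so processing keeps up with real time."""
    return frame_idx % every_n == 0

def capture_frames(url: str = RTSP_URL, every_n: int = 2):
    import cv2                        # requires opencv-python
    cap = cv2.VideoCapture(url)       # opens the IP camera stream
    idx = 0
    while cap.isOpened():
        ok, frame = cap.read()        # frame is a BGR NumPy array
        if not ok:
            break
        if should_sample(idx, every_n):
            yield frame               # handed to the preprocessing stage
        idx += 1
    cap.release()
```

The generator form lets the preprocessing stage pull frames at its own pace instead of buffering the whole stream.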

#### *3.3. Image Preprocessing*

Four main preprocessing operations were performed on each frame obtained from the previous stage:


#### *3.4. Face Detection in the Image*

The Single Shot Detector (SSD) neural network architecture [51] has several advantages, such as the ability to detect objects at different scales and resolutions, and to do so at high speed. This fits the needs of the project well, since it requires a fast response time and the ability to detect faces at various distances from the camera. Therefore, an SSD based on a MobileNet [52] backbone is used to obtain an even lower processing time.

A new custom training is required to adapt the detector to faces only. This is achieved with the TensorFlow Object Detection API [53], which already contains the network pre-trained to classify 80 classes, together with the transfer learning technique. The datasets used for this process are Face Detection in Images [54] and the Google Facial Expression Comparison Dataset [55]. Figure 2 shows an example [55] in which the face is enclosed in a red box.

**Figure 2.** Detected face.
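The TensorFlow Object Detection API returns detections as normalized `[ymin, xmin, ymax, xmax]` boxes with confidence scores. A small NumPy sketch of the post-processing that keeps confident face boxes and scales them to pixel coordinates (the 0.5 threshold is an assumption, not a value from the paper):

```python
import numpy as np

def faces_from_detections(boxes, scores, frame_h, frame_w, min_score=0.5):
    """Filter SSD detections and convert normalized boxes to pixel coords.

    boxes  : (N, 4) array of [ymin, xmin, ymax, xmax] in [0, 1]
    scores : (N,) confidence per detection
    """
    keep = scores >= min_score
    scale = np.array([frame_h, frame_w, frame_h, frame_w], dtype=np.float32)
    return (boxes[keep] * scale).astype(int)

boxes = np.array([[0.10, 0.20, 0.50, 0.40],    # confident face
                  [0.60, 0.60, 0.70, 0.70]])   # low-confidence detection
scores = np.array([0.93, 0.31])
print(faces_from_detections(boxes, scores, frame_h=480, frame_w=640))
# one box kept: [[ 48 128 240 256]]
```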

#### *3.5. Face Extraction and Preprocessing*

Since the location of the faces in the image is known, they are cropped and preprocessed so that they can be independently analyzed and propagated through a convolutional neural network. This process is illustrated in Figure 3.

**Figure 3.** Extraction of the faces detected by the SSD from the frames captured by the IP camera, and their processing to be entered into the neural networks.
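A pure-NumPy sketch of this crop-and-preprocess step, producing the 208 × 208 grayscale, normalized input described in Section 3.8; the pipeline itself presumably uses OpenCV for resizing, so the nearest-neighbour resize here is only illustrative:

```python
import numpy as np

def preprocess_face(frame, box, size=208):
    """Crop a detected face, convert it to grayscale, resize it
    (nearest neighbour), and normalize it to [0, 1]."""
    y0, x0, y1, x1 = box
    face = frame[y0:y1, x0:x1].astype(np.float32)
    gray = face.mean(axis=2)                       # simple grayscale
    h, w = gray.shape
    rows = np.arange(size) * h // size             # nearest-neighbour
    cols = np.arange(size) * w // size             # index maps
    return gray[rows][:, cols] / 255.0

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
face = preprocess_face(frame, (48, 128, 240, 256))
print(face.shape)   # (208, 208)
```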

#### *3.6. Propagation of Each Cropped Face in Neural Networks*

Each of the faces obtained in the previous block will be propagated through two convolutional neural networks to estimate age, gender, and personality characteristics. Figures 4 and 5 show these propagations with their respective inputs and outputs.

**Figure 4.** Diagram showing the flattened input face, the age, and gender network model in TensorRT, and its two outputs: The age obtained from the regressor, and the gender vector obtained from the classifier.

**Figure 5.** Diagram showing the flattened input face, the classification model: CNN-4 [24] of the Big Five network in TensorRT, and the output vector of size 5 where each position corresponds to a personality dimension as indicated.

#### *3.7. Getting Age and Gender*

To obtain age and gender from a selfie-style image of a person's face, a multitask convolutional neural network is designed with the intention of reducing the computational resources required. We start with the layers that estimate age, which form the first part of the design. The architecture of this network is shown in Figure 6 and its hyper-parameters in Table 2.

**Figure 6.** Proposed CNN used to estimate age.

**Table 2.** Hyper-parameters used to train age in a CNN.


Afterwards, we carry out a knowledge transfer that reduces the training time required to obtain the second CNN, which is responsible for determining the person's gender. The transfer is carried out in two ways. First, the starting layer enclosed in the gray rectangle (Conv1) is frozen during this training. Second, the final weights of the remaining layers are used as initial values for this training, as indicated by the red arrows in Figure 7.

**Figure 7.** Designed transfer learning.
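The freezing scheme of Figure 7 can be illustrated with a toy NumPy example (the real networks are convolutional; the two "layers" below are placeholders): Conv1's weights are copied and excluded from updates, while the remaining weights start from the age network's values and keep training.

```python
import numpy as np

rng = np.random.default_rng(0)
age_net = {"conv1": rng.normal(size=3), "dense": rng.normal(size=3)}

# Initialize the gender branch from the trained age network (red arrows).
gender_net = {k: v.copy() for k, v in age_net.items()}
frozen = {"conv1"}                     # gray rectangle in Figure 7

def sgd_step(net, grads, lr=0.1):
    for name in net:
        if name in frozen:
            continue                   # frozen layer: no update
        net[name] -= lr * grads[name]

grads = {"conv1": np.ones(3), "dense": np.ones(3)}
sgd_step(gender_net, grads)
print(np.allclose(gender_net["conv1"], age_net["conv1"]))  # True: frozen
print(np.allclose(gender_net["dense"], age_net["dense"]))  # False: updated
```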

Now the second part of the CNN is trained using the hyper-parameters in Table 3. For this training, regularization of the network was necessary, so dropout layers were added. The final design of the CNN is shown in Figure 8.

**Table 3.** Hyper-parameters used to train the gender classification portion of the multitask CNN.


**Figure 8.** Final architecture of the multitask convolutional neural network to obtain the age and gender of the person.

#### *3.8. Obtaining the Personality (Big Five)*

The personality is estimated from the customer's face using the Big Five model [24], which measures personality through five dimensions on scales from 0 to 1:

• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism

For this step, a previously built model based on facial analysis in images is used, since images provide better results than the other types of multimedia files (audio, text, and video) [27]. The classification model selected for integration into the system is CNN-4 from the work in [24,29]. Although that work reports a better model, called FaceNet-1 [56], its size, computational requirements, and convergence time are very high, so the selected hardware devices cannot support it. The second-best model of that work was therefore chosen: its precision does not differ significantly from FaceNet-1, and in return it offers better performance and speed. Tables 4 and 5 compare the precision in Big Five personality detection among the various models of [56], evaluated on face images with the following characteristics: selfie format, grayscale, normalized, and 208 × 208 resolution.


**Table 4.** Comparison of the best results obtained from the different Big Five models developed in [56].

**Table 5.** Comparison between the results (each one is the percentage of precision of the detection of each personality) by dimension of personality among the best Big Five models in [56].


The convolutional neural network architecture of the CNN-4 classification model [56] for the detection of Big Five in the system is shown in Figure 9.

**Figure 9.** CNN-4 classification model architecture for Big Five [24].

#### *3.9. Obtaining the Ambient Temperature*

Although the ambient temperature does not influence which drink is recommended, knowing it gives the system the ability to recommend the most suitable drink modality, since the BubbleTown® catalog offers three options: Zen (hot drink), Iced (cold drink), or Frozen (frappé). To obtain the ambient temperature, the API provided by OpenWeatherMap [57] is used.
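A sketch of this step, using OpenWeatherMap's current-weather endpoint with metric units; the API key placeholder and the temperature cut-offs for each modality are assumptions, not values given in this work:

```python
import json
import urllib.request

API_KEY = "<your-openweathermap-key>"   # placeholder; a valid key is required

def current_temp_celsius(city: str, api_key: str = API_KEY) -> float:
    """Query OpenWeatherMap's current-weather endpoint (metric units)."""
    url = ("https://api.openweathermap.org/data/2.5/weather"
           f"?q={city}&units=metric&appid={api_key}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["main"]["temp"]

def modality(temp_c: float) -> str:
    """Map temperature to a drink modality; the cut-offs are illustrative."""
    if temp_c < 15:
        return "Zen"      # hot drink
    if temp_c < 25:
        return "Iced"     # cold drink
    return "Frozen"       # frappe
```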

#### *3.10. Drink Recommendation*

In very simple terms, a recommendation system is an application that filters information in order to suggest appropriate items to the user [3]. In this work, the recommendation is a drink from the BubbleTown® catalog.

To achieve the proposed task, this work uses a content-based recommender, that is, a recommendation system that examines the characteristics of the products [4] of BubbleTown® that could be of interest for the user.

The recommender works by using characteristic flavor vectors for each drink on the menu, generated with the support of the *Coffee Taster's Flavor Wheel* [58]: although there are flavor wheels specific to tea, those wheels were built considering aroma, texture, and flavor characteristics, whereas the wheel in [58] considers only flavor.

For the vectorization of the beverages, vectors are first generated for each element of the flavor wheel. The wheel is divided into flavor classes (sweet, umami, bitter, sour, and spicy); the ingredients are then listed, and each is assigned a value between 0 (zero) and 1 (one) depending on its location within its flavor class.

With the basic flavors vectorized, the vectors characterizing the ingredients of each drink are summed, and applying the "softmax" function to the resulting vector yields the characteristic flavor vector of that drink.
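This vectorization can be sketched as follows; the ingredient names and values are hypothetical stand-ins for those derived from the flavor wheel:

```python
import numpy as np

FLAVORS = ["sweet", "umami", "bitter", "sour", "spicy"]

def softmax(v):
    e = np.exp(v - v.max())   # subtract max for numerical stability
    return e / e.sum()

# Hypothetical ingredient vectors (values in [0, 1] per flavor class).
ingredients = {
    "black_tea": np.array([0.1, 0.0, 0.7, 0.2, 0.0]),
    "milk":      np.array([0.4, 0.1, 0.0, 0.0, 0.0]),
    "tapioca":   np.array([0.6, 0.0, 0.0, 0.0, 0.0]),
}

# Sum the ingredient vectors and apply softmax: characteristic flavor vector.
drink_vector = softmax(sum(ingredients.values()))
print(dict(zip(FLAVORS, drink_vector.round(3))))
```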

Figure 10 shows, in general terms, how the system's recommender works. It is important to note that the salty taste has been eliminated from this process because none of the BubbleTown® drinks have this flavor.

**Figure 10.** Diagram of the recommender operation.

• Personality factor: Once the customer's personality vector has been obtained from the Big Five network, it is multiplied by a matrix with the values from Table 6 to produce a matrix of the customer's taste preference as a function of personality. The values for each flavor are then summed, yielding a vector of size five in which each value corresponds to a flavor class (sweet, umami, bitter, sour, and spicy).

**Table 6.** Average taste preference as a function of personality in values from 0 to 1. Table obtained with data from [8].


• Gender factor: For this objective, Table 7 is used: the vector obtained after applying the personality factor is multiplied by this factor as well.

**Table 7.** Average taste preference based on gender. Table obtained with data from [8].


• Age factor: After the age has been estimated, the customer is placed in one of four age classes and the flavor preference vector is adjusted by multiplying it by the corresponding values in Table 8. To preserve the vector format, the *umami* and *spicy* flavors are included; however, in the table both have values of 0 (zero), which means that age will not influence these flavor classes.


**Table 8.** Average taste preference based on age group. Table obtained with data from [6].

• Softmax: After applying all the factors that influence taste preference, the "Softmax" function is applied to the vector in order to obtain the customer's flavor preference vector.
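The four factors above can be combined in a short NumPy sketch. All table values below are illustrative placeholders for Tables 6–8, and the zeros that Table 8 lists for *umami* and *spicy* are modeled as a neutral factor of 1, since age should leave those flavors unchanged:

```python
import numpy as np

FLAVORS = ["sweet", "umami", "bitter", "sour", "spicy"]

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

table6 = np.array([                   # personality (rows) x flavor (cols)
    [0.7, 0.2, 0.3, 0.4, 0.3],        # openness
    [0.5, 0.3, 0.2, 0.3, 0.2],        # conscientiousness
    [0.6, 0.2, 0.4, 0.5, 0.4],        # extraversion
    [0.8, 0.3, 0.2, 0.3, 0.2],        # agreeableness
    [0.4, 0.2, 0.5, 0.4, 0.3],        # neuroticism
])
gender_factor = np.array([1.0, 0.8, 0.9, 1.1, 1.0])   # stand-in for Table 7
age_factor = np.array([1.2, 1.0, 0.7, 0.9, 1.0])      # stand-in for an age row
big_five = np.array([0.6, 0.4, 0.8, 0.5, 0.3])        # CNN-4 output

taste = big_five @ table6      # weight Table 6 rows by personality, sum per flavor
preference = softmax(taste * gender_factor * age_factor)
print(dict(zip(FLAVORS, preference.round(3))))
```

The resulting vector is then compared against the characteristic flavor vectors of the catalog drinks.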


#### *3.11. Data Storage*

The system uses a database to store information such as the beverage catalog, the list of static advertising images, and the recommendations that the system makes for each person. For this, a non-relational database was chosen.

The database will store the drinks that are available for purchase, the static advertising that will be projected when there is no customer in front of the system, and, finally, each of the recommendations generated by the system during its operation. The recommendations are stored together with the recognized face's features, the estimated age, its gender classification, and the personality vector. The recommended drink will also be stored with said parameters, the ambient temperature at the time of recommendation, and the most suitable drink modality.

Since the face could be considered sensitive data, it receives special treatment before being stored: it is encrypted using the AES-256 algorithm, in such a way that only system administrators can access the image, and only in order to maintain or improve the system proposed in this work.
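A sketch of this encryption step using the `cryptography` package; the paper specifies AES-256 but not a mode of operation, so GCM is an assumption here:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# 256-bit key, to be held only by the system administrators.
key = AESGCM.generate_key(bit_length=256)

def encrypt_face(face_bytes: bytes, key: bytes) -> bytes:
    nonce = os.urandom(12)            # unique nonce per stored face
    return nonce + AESGCM(key).encrypt(nonce, face_bytes, None)

def decrypt_face(blob: bytes, key: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

blob = encrypt_face(b"<raw face pixels>", key)
assert decrypt_face(blob, key) == b"<raw face pixels>"
```

GCM also authenticates the ciphertext, so tampering with a stored face is detected at decryption time.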

#### *3.12. Search for the Drink in the Catalog*

The database contains the characteristic vectors for each drink in the BubbleTown® catalog and the valid modalities for each drink (as explained in Section 3.10). It also stores the paths of the images used to generate the augmented reality; these images are loaded into memory in order to create the augmented reality that shows the recommendation to the customer.

#### *3.13. Generation of Augmented Reality*

One of the parts that consumes the most resources is the generation of augmented reality. To create a feeling of "interaction" with the user, the output must be updated constantly and, considering that at the code level manipulating an image consists of modifying an array the same size as the image resolution, a large number of operations is required to produce a single image. These operations must be carried out approximately 60 times per second for the process to be imperceptible to the user.

Figure 11 shows, in general terms, how the module in charge of generating augmented reality works. The module is fed with the frame obtained from the camera, the faces that have been previously recognized and processed, and the advertising (images of the drinks to be recommended). The first component of the module adjusts the size of the thought balloon according to how close the face is to the camera, creating an effect of depth. The advertising is then added to the original frame: the company logo in the upper-left corner, a banner with the name of the recommended drink at the bottom, and the thought balloon with the image of the drink on the upper-left side of each detected face. To give the reader an idea, the augmented reality proposal is shown in Figure 12.

**Figure 11.** Augmented reality design.

**Figure 12.** Example of the augmented reality to be carried out.
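The balloon scaling and compositing can be sketched in pure NumPy (the module itself presumably works on OpenCV images; the sizes, the 1.5 width ratio, and the alpha value below are illustrative):

```python
import numpy as np

def scale_balloon(balloon, face_w, ratio=1.5):
    """Resize the thought-balloon image (nearest neighbour) so its width
    is proportional to the detected face width (depth effect)."""
    h, w = balloon.shape[:2]
    new_w = max(1, int(face_w * ratio))
    new_h = max(1, int(h * new_w / w))
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return balloon[rows][:, cols]

def overlay(frame, art, top, left, alpha=0.9):
    """Blend the artwork into the frame at (top, left) in place.
    Boundary clipping is omitted for brevity."""
    h, w = art.shape[:2]
    roi = frame[top:top + h, left:left + w]
    frame[top:top + h, left:left + w] = (
        alpha * art + (1 - alpha) * roi).astype(frame.dtype)
    return frame

frame = np.zeros((480, 640, 3), dtype=np.uint8)
balloon = np.full((60, 80, 3), 255, dtype=np.uint8)
balloon = scale_balloon(balloon, face_w=128)   # -> (144, 192, 3)
frame = overlay(frame, balloon, top=40, left=300)
```

The same `overlay` call would place the logo and the bottom banner at their fixed positions.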

#### *3.14. Display of the Result with Augmented Reality*

The image with the recommendation of the drink added in the form of augmented reality is displayed on the LED screen. It is important to mention that periodically, the output on the screen will be refreshed with a new frame (obtained from the camera video).

At this point, each frame is completely ready to be shown on the screen. The communication flow of the system, the type of information exchanged between components, the communication protocols used, and the way the components are connected are shown in Figure 13.

**Figure 13.** System hardware connection diagram.

All the functionality is carried out within the NVIDIA Jetson Xavier, a small but powerful computer for artificial intelligence tasks [59,60] that has an ARM64 architecture and runs a Linux-based operating system called JetPack [61].
