### *4.2. Dataset*

There are several open skin datasets available. The International Skin Imaging Collaboration (ISIC) [77] has introduced many datasets from different sources as part of their annual challenge, including the ISBI, HAM10000, BCN\_20000, and MSK datasets. The Interactive Atlas of Dermoscopy (IAD) [78] and PH2 [79] have also provided datasets of dermoscopy images. He et al. [71] collected two datasets, Skin-10 and Skin-100, as part of their research, but these have not been made publicly available. In this research, the HAM10000 (Human Against Machine with 10,000 training images) [80] dataset has been used to train the designed models. Table 5 lists the dataset characteristics, including the number of images and classes of diagnoses. The dataset has been published in the Harvard Dataverse data repository and consists of 10,015 dermatoscopic images belonging to seven different diagnostic categories of common pigmented skin lesions. The last column in the table shows examples of dermatoscopic images that belong to different diagnosis classes.

### *4.3. DL Models Service*

The DL model service is responsible for model design, training, retraining, and optimization (see Figure 8). This service may be located locally or remotely on cloud, fog, or edge devices. However, retrieving models from different layers of the network affects the response time. New models can be retrieved at regular intervals or as the service agreement and user preferences specify. TensorFlow, an open-source ML tool developed by Google, is used for model development. Algorithm 3 shows the procedure that this service follows to design a model. First, the TF model is designed and trained using the given dataset. Some pre-processing is performed on the dataset images, including image resizing and normalization. After training, the TF model is saved in a Hierarchical Data Format version 5 (H5) file, which stores the model weights and configuration so they can be restored at any time. Then, the TF model is converted to a TensorFlow Lite (TFLite) model, which is an optimized version of the TF model designed to run on mobile, embedded, and IoT devices. The TFLite model is saved in a file with the (.tflite) extension. The subsections that follow present a discussion on the design, training, evaluation, and conversion of the two models used in this paper.
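A minimal sketch of this workflow is shown below, under stated assumptions: a Keras model builder, preprocessed HAM10000 arrays, an illustrative epoch count, and illustrative file names; it is not the authors' exact training code.

```python
# Minimal sketch of the DL model service workflow (Algorithm 3); the model builder,
# data arrays, epoch count, and file names are illustrative assumptions.
import tensorflow as tf

def preprocess(images):
    # Resize to the model input shape and normalize pixel values to [0, 1].
    images = tf.image.resize(images, (299, 299))
    return tf.cast(images, tf.float32) / 255.0

def train_and_export(build_model, x_train, y_train, x_val, y_val,
                     h5_path="skin_model.h5", tflite_path="skin_model.tflite"):
    model = build_model()
    model.fit(preprocess(x_train), y_train,
              validation_data=(preprocess(x_val), y_val), epochs=20)

    # Save the trained TF model (weights + configuration) in HDF5 (H5) format.
    model.save(h5_path)

    # Convert the TF model to an optimized TFLite model for mobile/embedded/IoT devices.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    with open(tflite_path, "wb") as f:
        f.write(converter.convert())
```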

**Table 5.** HAM10000 dataset characteristics.



### 4.3.1. TensorFlow Model Design

Two models have been designed, implemented, trained, evaluated, and converted to smaller models for edge devices. The first model (A) is based on the pre-trained model Inception v3, while the second model (B) is a pure CNN model. Figure 9 shows the model (A) architecture, starting with the Inception v3 model and ending with a dense layer that has seven nodes, one for each diagnosis class. Inception v3 is a pre-trained CNN model consisting of 48 layers and trained on the ImageNet database. Multiple layers have been added to the Inception v3 model to improve its performance when trained with the dermatoscopic images, including 2D Convolution (Conv2D), 2D Maximum Pooling (MaxPooling2D), Dropout, Flatten, and Dense layers. Figure 10 shows the model (B) architecture, consisting of a series of 19 layers including 2D Convolution (Conv2D), 2D Maximum Pooling (MaxPooling2D), Dropout, Flatten, and Dense layers. The first layer, Conv2D, receives the input image of shape (299,299,3), and the last layer is a dense layer that has seven nodes, one for each diagnosis class.

**Figure 10.** Model (B) architecture.
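As a rough illustration of the transfer-learning design of model (A), the sketch below stacks a few custom layers on top of Inception v3; the layer sizes, dropout rate, and optimizer are assumptions for illustration, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model_a(num_classes=7, input_shape=(299, 299, 3)):
    # Pre-trained Inception v3 feature extractor (ImageNet weights, no top classifier).
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", input_shape=input_shape)

    # Added layers (illustrative sizes): Conv2D, MaxPooling2D, Dropout, Flatten, Dense.
    model = models.Sequential([
        base,
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # one node per diagnosis class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```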

Both models were trained using the HAM10000 dataset. The dataset was split into 60:20:20 percentages for training, validation, and testing, respectively. Model accuracy (a) was calculated for each subset of data as the percentage of true disease predictions. Model (A) had 0.96, 0.83, and 0.82 accuracy, while model (B) had 0.79, 0.78, and 0.77 accuracy for training, validation, and testing, respectively. To evaluate the accuracy of models A and B across the disease classes, heatmaps have been used to plot the confusion matrices of the test dataset predictions. Figure 11 presents the heatmaps that illustrate the accuracy of the classification results for the seven classes. The darker diagonal line in Figure 11a shows that the model A classification results for the various disease classes are more accurate than those of model B. The nv class had the highest level of accuracy on both models, and model A outperformed model B in the akiec, bcc, mel, and vasc classes.

**Figure 11.** The accuracy heatmap of different classes: (**a**) Model A; (**b**) Model B.

### 4.3.2. TensorFlow Lite (TFLite) Model

After training and validating both models, the TFLite Converter has been used to convert the saved TF models into TFLite models. The TFLite Converter generates optimized TFLite models in a serializable FlatBuffer format identified by the (.tflite) file extension. To evaluate both models, the four model versions (A, ALite, B, and BLite) were run on the training, validation, and testing datasets. Table 6 lists the characteristics of models A and B and compares the original (TensorFlow) and TFLite models in terms of memory footprint and accuracy. After conversion, both the A and B models were reduced in size by around three-fold, with no reduction in model accuracy.

**Table 6.** Comparison between original TF and TFLite Models.
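To illustrate how a converted (.tflite) model is exercised during evaluation, the sketch below runs a single preprocessed image through the Python TFLite Interpreter; the file path and the assumption of a seven-class probability output are illustrative, and on Android the equivalent Java/Kotlin Interpreter API is used instead.

```python
import numpy as np
import tensorflow as tf

def classify(tflite_path, image):
    # `image` is a preprocessed array of shape (299, 299, 3) with values in [0, 1].
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Add the batch dimension expected by the model and run inference.
    interpreter.set_tensor(input_details[0]["index"],
                           np.expand_dims(image, axis=0).astype(np.float32))
    interpreter.invoke()

    # Probabilities for the seven diagnosis classes.
    return interpreter.get_tensor(output_details[0]["index"])[0]
```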


### *4.4. Mobile Local Service*

In the mobile local service, both the diagnosis service and the diagnosis request service reside in the user device. Therefore, the user's mobile device should have the resources required to store and run the model locally. As shown in Figure 8, the TensorFlow Lite model is provided by the DL model service in the (.tflite) format. In the diagnosis request service, the user selects a skin image and chooses the mobile local service from the service catalog. The mobile local service uses the local TFLite Interpreter in the mobile device to load the model and perform the image classification task. This type of service guarantees a real-time response and preserves user privacy, as the images do not have to be sent across the network to a remote service.

### *4.5. Mobile Remote Service*

The mobile remote service is located in mobile devices and is responsible for providing classification services to nearby devices. As shown in Figure 8, this service is equipped with a TFLite Interpreter and the Android Nearby Connections API, and it downloads the model from the DL model service. The Android Nearby Connections API is used for service connection and management. It is a networking API provided by Android for peer-to-peer service and connection management with nearby devices using technologies such as Bluetooth, Wi-Fi, IP, and audio. This includes service advertising, discovery, connection, and data exchange in a real-time manner. Figure 12 shows the message exchanges between the mobile remote service and the diagnosis request service during the service provisioning process. The mobile remote service starts service advertisement by periodically broadcasting messages that include the service name and service ID. The diagnosis request service listens to broadcast messages for service discovery and, when the required service provider is found, requests a connection. This invokes the connection establishment process, which includes connection acceptance from both sides and connection result acknowledgment. When the connection establishment is successful, the user can start requesting diagnosis services by sending a skin image to the provider, who uses the TFLite Interpreter to classify the image and return the result.

**Figure 12.** Mobile remote service nearby connection workflow.

### *4.6. gRPC Service*

The gRPC service is implemented using remote procedure calls, specifically Google Remote Procedure Call (gRPC). gRPC is a framework for building platform-independent services; it provides various utilities to facilitate service implementation and deployment. Proto syntax is used to define the request and response messages that are passed between gRPC servers and clients. As shown in Figure 8, gRPC services support both TF and TFLite models for skin diagnosis. These models are provided by the DL model service. The Secure Sockets Layer (SSL) protocol is used to secure communications between the server and the client. The diagnosis request service first establishes a secure channel with the gRPC service and then sends the diagnosis request, including the skin image. When the gRPC service receives the request, it passes the image to either the TensorFlow or TFLite Interpreter to classify it. The result is then sent back as a gRPC response that includes the classification probabilities.
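A minimal server-side sketch of this flow is shown below; the proto-generated modules (diagnosis_pb2, diagnosis_pb2_grpc), the Diagnose RPC, and the inference helper are hypothetical names introduced here for illustration, not the authors' actual definitions.

```python
# Sketch of a secure gRPC diagnosis server; diagnosis_pb2 / diagnosis_pb2_grpc are
# assumed to be generated from a hypothetical diagnosis.proto, and run_tflite_inference
# stands in for the TFLite classification step (e.g., the classify() sketch above).
from concurrent import futures
import grpc
import diagnosis_pb2
import diagnosis_pb2_grpc

class DiagnosisService(diagnosis_pb2_grpc.DiagnosisServiceServicer):
    def Diagnose(self, request, context):
        # request.image carries the serialized skin image bytes sent by the client.
        probabilities = run_tflite_inference(request.image)
        return diagnosis_pb2.DiagnosisReply(probabilities=probabilities)

def serve(cert_path, key_path, port=50051):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    diagnosis_pb2_grpc.add_DiagnosisServiceServicer_to_server(DiagnosisService(), server)

    # SSL credentials secure the channel between the request service and this server.
    with open(key_path, "rb") as key, open(cert_path, "rb") as cert:
        creds = grpc.ssl_server_credentials([(key.read(), cert.read())])
    server.add_secure_port(f"[::]:{port}", creds)
    server.start()
    server.wait_for_termination()
```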

### *4.7. Containerized gRPC Service*

The containerized gRPC service is a version of the gRPC service that is packaged as a Docker container (see Figure 8). Docker containers provide an executable, lightweight, and standalone container image that encapsulates everything the gRPC service needs in order to run. This service image is deployed in Google Cloud using the Cloud Run platform. The containerized gRPC service reduces the effort of deploying the gRPC service to the cloud, especially when containers are already supported by the cloud platform, such as the Google Cloud Run platform that has been used here. Cloud Run provides a fully managed serverless platform to deploy highly scalable containerized applications. The containerized gRPC service cannot replace the gRPC service, as Docker containers do not have full support for many of the AI libraries on different processor architectures, such as armv7 and aarch64 in the Raspberry Pi and Jetson. Therefore, offering this variety of technologies and software platforms allows services to be instantiated anywhere in the cloud, fog, and edge layers.

### *4.8. Diagnosis Request Service*

The diagnosis request service has been developed using Android Studio so that it can run on Android devices. This service is responsible for image selection and communication with the various diagnosis services. Algorithm 4 shows the procedure that the diagnosis request service follows to get a skin diagnosis prediction from one of the skin image diagnosis services and present the final result.

The algorithm takes, as input, the user-selected skin image and the service type chosen from the provided service catalog. In the case of the mobile local service, the local service installed in the device is used for skin image classification directly. In the other cases, the diagnosis request service first establishes a connection with the required service. If the mobile remote service is chosen, the application listens to the nearby service broadcasts and establishes a connection with a nearby mobile device. For gRPC-based services (gRPC and containerized gRPC), the application uses gRPC stubs to communicate with the services. When the connection is ready, the diagnosis request is sent along with the skin image to be classified (diagnosed) by the chosen diagnosis service, and when the results are returned, they are presented to the user (a dispatch sketch is given below). Figure 13 shows screenshots of the user interface for the skin diagnosis application, which enables the user to request a diagnosis service. The screenshots are numbered from 1 to 5 to show the steps involved in selecting a service and obtaining a diagnosis in the application.
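The sketch below only illustrates the dispatch logic of Algorithm 4; the real service is an Android application, and the handler functions are placeholders standing in for the local TFLite Interpreter, the Nearby Connections workflow, and the gRPC stubs.

```python
# Illustrative dispatch logic of the diagnosis request service (Algorithm 4).
# All handlers are hypothetical placeholders, not the authors' implementation.
def classify_locally(image): ...
def classify_via_nearby(image): ...
def classify_via_grpc(image, containerized=False): ...

def request_diagnosis(image, service_type):
    if service_type == "MobileL":
        return classify_locally(image)          # no network communication
    if service_type == "MobileR":
        return classify_via_nearby(image)       # discover and connect to a nearby device
    if service_type == "gRPC":
        return classify_via_grpc(image)         # secure channel to the gRPC service
    if service_type == "ContainerizedgRPC":
        return classify_via_grpc(image, containerized=True)  # Cloud Run deployment
    raise ValueError(f"Unknown service type: {service_type}")
```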

**Figure 13.** Mobile application user interface for access to skin diagnosis services.


### **5. Service Evaluation and Analysis**

This section presents and discusses our experiments and results. First, the experiment settings are presented (Section 5.1). Then, each evaluation metric is discussed and evaluated, including processing time (Section 5.2), response time (Section 5.3), network time (Section 5.4), service data transfer rate (Section 5.5), and the services' energy consumption (Section 5.6) and values (Section 5.7).

### *5.1. Experimental Settings*

The experiments were conducted in a real-life environment, in a typical family home setting, to represent everyday city life. They took place over a period of several weeks. Every week, they were conducted for four consecutive days (from Saturday to Tuesday), at three different times of the day. Unfortunately, limited human and other resources made it impossible to conduct the experiments more frequently (every three hours) and on all seven days of the week. Table 7 lists the evaluation variables for which data were collected during the experiments and recorded as testing logs. The table lists the variable names, definitions, units, and an example of the collected data.

Figure 14 shows the networking setup for the experiments. All edge devices are connected to a WiFi router that provides a local connection between them and a connection to the Cloud through the Fiber and 4G networks. Two WiFi routers have been used separately for the two different experiment settings. One is the Fiber WiFi router, which is both a fiber optic modem and a WiFi router connected to the fiber optic cable provided by the Internet Service Provider (ISP). The second is a 4G WiFi router connected to the 4G cellular network via a SIM card provided by the ISP. The smartphones use the Android Nearby Connections API to create a peer-to-peer (P2P) connection between them, which uses either WiFi or Bluetooth for communication. The figure shows the Fog node connecting to the edge WiFi network through the 4G and Fiber networks. This is depicted to show how it would be connected in reality and to avoid confusion for the reader. However, the Fog device in our case is connected to the edge devices through the same two routers. This was done due to human and infrastructure resource limitations, since placing the fog node in a separate network would require a separate physical space and human support for conducting the experiments. In our case, this is an acceptable setup because, in studying fog node performance, we have focused on the computational performance of the fog node, which depends on the device compute capability and is virtually independent of the network performance.


**Table 7.** Evaluation data variables and examples.

### *5.2. Service Processing Time*

The processing time is the time that the diagnosis service needs to process an image and predict the skin disease category. It depends on both the model complexity and the device resources. The processing time was recorded at different times of the day during the week. Figures 15 and 16 show the processing times for all service types, devices, and models. Service and device specifications are listed in Table 3.

**Figure 14.** Networking setup.

Figure 15 compares the models' processing time behavior for each service type and device. The bar chart presents the average processing time, where the horizontal axis represents devices, the vertical axis represents the average processing time in seconds, and the bars represent model types. For all devices, model A's average processing time is higher than that of model B, even for the TFLite versions, which was expected considering the complexity and size of model A. The Jetson device had the highest average processing time for all models compared to the other devices, which is related to both the Jetson memory limitation and the device capability. On the Jetson, the average processing times were 49 s, 10 s, 2 s, and 0.5 s for models A, B, ALite, and BLite, respectively. The lowest average processing times were for the Fog device, with 7.7 s, 0.8 s, 0.5 s, and 0.1 s for models A, B, ALite, and BLite, respectively.

**Figure 15.** Service processing time (by device).

**Figure 16.** Service processing time (by model).

The boxplot in Figure 15 depicts the processing time distribution for all the data collected in our experiments. Boxplots show five statistical measurements: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The first quartile (Q1) is the 25th percentile, the median (Q2) is the 50th percentile, and the third quartile (Q3) is the 75th percentile of the data. These are depicted as the bottom line of the colored box, the thick line inside it, and the top line of the box, respectively. The distance between Q1 and Q3 (the height of the colored box) is called the interquartile range (IQR). The maximum and minimum values are the highest and lowest points of the vertical lines at the top and bottom of the colored boxes. They are calculated using the quartiles and the IQR: Q3 + 1.5 × IQR for the maximum and Q1 − 1.5 × IQR for the minimum. Any value greater than the maximum or less than the minimum is considered an outlier and depicted as a circle outside the boxplot. The Jetson minimum, Q1, Q2, Q3, and maximum processing times for model A were 41 s, 43 s, 48 s, 51 s, and 55 s, respectively, with some outliers over 65 s (see the Figure 15 boxplot). This shows that all recorded values for Jetson model A are greater than those of any other device or model. The distribution of model B processing times on the Jetson was much better, with 13 s as the maximum value, though there were some outliers over 25 s.
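As a small worked illustration of these boxplot statistics, the snippet below computes the quartiles, IQR, and whisker bounds for a set of hypothetical processing-time samples (not measured data).

```python
import numpy as np

# Hypothetical processing times in seconds, used only to illustrate the statistics.
times = np.array([41.0, 43.2, 44.8, 47.9, 49.5, 51.0, 54.8, 66.3])

q1, q2, q3 = np.percentile(times, [25, 50, 75])
iqr = q3 - q1
upper_whisker = q3 + 1.5 * iqr   # values above this are drawn as outlier circles
lower_whisker = q1 - 1.5 * iqr   # values below this are drawn as outlier circles

outliers = times[(times > upper_whisker) | (times < lower_whisker)]
print(f"Q1={q1:.1f}, Q2={q2:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}, outliers={outliers}")
```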

Figure 16 compares the devices' processing time behavior for each model. The bar chart presents the average processing time, where the horizontal axis represents models, the vertical axis represents the average processing time in seconds, and the bars represent the seven devices that were evaluated. The Fog had the lowest average processing times among all devices for all models, which can be related to the resources of the fog device. An HP Pavilion laptop was used here as the fog device, with an Intel® Core™ i7-8550U processor and 8 GB of memory. The Fog even outperformed the Cloud's average processing times: 9.3 s, 1.1 s, 1.15 s, and 0.18 s (Cloud) compared to 7.7 s, 0.8 s, 0.5 s, and 0.1 s (Fog) for models A, B, ALite, and BLite, respectively. It seems that the vCPU assigned by Google to the containerized gRPC service is less powerful than the Intel® Core™ i7-8550U processor in the Fog device (gRPC service). The exact physical CPU that the containerized gRPC service runs on is not specified by Google on the Google Cloud Run platform.

For models A and B, the Cloud provided the second-best average processing times (9.3 s and 1.1 s), followed by Rasp8 (30.5 s and 2.4 s) and Jetson (49.1 s and 9.9 s), respectively. The Rasp8 and Rasp4 average processing times were almost identical for the TFLite models (around 3 s for ALite and 0.5 s for BLite), though Rasp4 could not process the original TF models due to memory shortage. For the ALite model, after the Fog (0.46 s), the S9 was the fastest (0.61 s), followed by the Cloud (1.15 s), Jetson (2.48 s), Rasp8 (2.52 s), Note 4 (2.72 s), and Rasp4 (2.74 s). For the BLite model, after the Fog (0.09 s), the Cloud was the fastest (0.18 s), followed by the S9 (0.26 s), Jetson (0.52 s), Rasp8 (0.51 s), Rasp4 (0.53 s), and Note 4 (0.72 s). Although both the S9 and Note 4 are mobile devices, the S9 showed better results due to its processor capabilities (see Table 3).

Looking at the previous observations, the processing time is closely related to both the device capabilities and the model size and complexity. TFLite optimization greatly improved the processing time with no accuracy loss (in our case). In more complex models, the accuracy may drop to a level that could jeopardize the application, depending on its criticality. Devices at the fog or edge layers showed acceptable results compared to the cloud, which makes them great candidates for local processing.

### *5.3. Service Response Time*

The response time is the total time from when the request is made until the result is returned; this includes the processing time. Figures 17 and 18 show the response times of all service types, devices, and models. Service and device specifications are listed in Table 3. Figure 17 compares the models' response time behavior for each service type and device. The bar chart presents the average response time, where the horizontal axis represents devices, the vertical axis represents the average response time in seconds, and the bars represent model types. The average processing time (see Figure 15) and the average response time (Figure 17) show similar behavior. Model A's average response time is always higher than model B's for both the TF and TFLite versions. Unlike the other devices, the Cloud performed better for model B (3.36 s) than for model ALite (3.7 s), and Rasp8 had an identical average response time (2.9 s) for both models B and ALite. The boxplot in Figure 17 shows the distribution of the response times for all the data collected in the experiments; there is a difference in the Cloud performance distribution of models B and ALite. The model ALite IQR is larger than the IQR of model B, though both medians (Q2) are at around 1 s. CloudALite had an outlier above 25 s, while the highest CloudB outlier is at around 17 s. For Rasp8, models B and ALite had a similar distribution, although the maximum value of ALite is higher than that of B.

**Figure 17.** Service response time (by device).

Figure 18 compares the devices' response time behavior for each model. The bar chart presents the average response time, where the horizontal axis represents models, the vertical axis represents the average response time in seconds, and the bars represent the seven devices that were evaluated. Although the Cloud and Fog had close average processing times (see Figure 16), the difference between them is greater when it comes to response times. The difference between the Cloud and Fog average processing times is 1.6 s, 0.4 s, 0.7 s, and 0 s for models A, B, ALite, and BLite, respectively, while the difference between their average response times is 2.6 s, 2.2 s, 2.8 s, and 0.4 s for models A, B, ALite, and BLite, respectively. The Cloud's average response time is higher, as it requires more time to transfer the image across the internet. The networking time metric that is covered in the next section clearly shows the load on the network for the different services. The Jetson's average processing time greatly affects its response time, especially for model A (48.7 s), with more than an 18 s difference from the next highest response time, which is for Rasp8 (31 s). The S9 had the lowest average response times for the TFLite models, 0.6 s (ALite) and 0.3 s (BLite), while Note 4 had the highest average response times among the TFLite models (10 s and 7 s), which can be related to the Android Nearby Connections API that was used for the mobile device services.

The boxplot in Figure 18 provides a deeper look at the services' response time values, showing the distribution of all collected response times. The Note 4 boxplots for ALite and BLite show that the maximum response times were 25 s and 20 s, while the medians (Q2) were 7 s and 2 s. The BLite median is close to the Q3 (1 s), which means that 50% of the collected response times for Note 4 were below 1 s. This variation in the data means that the response time of this type of service is highly unpredictable.

These results show that local processing on mobile devices with MobileL services had the best response times, as they do not require any network communication, though only TFLite or small models can be accommodated. On the other hand, other fog and edge devices on the local network, such as the Fog and Rasp8, can accommodate more complex models and provide fast responses to local service requests.

### *5.4. Service Network Time*

The network time is calculated as the difference between the response time and the processing time, so it includes the connection initialization time, the devices' network processing, and the data transfer time (see Equation (4)). This metric shows both the load on the network for each service type and the performance of the different network connection technologies used to communicate with the diagnosis services. Two basic types of communication protocols have been used in this research, namely, nearby connections and gRPC. The mobile remote services use the Android Nearby Connections API to create a peer-to-peer connection using various technologies such as Bluetooth, Wi-Fi, IP, and audio, depending on the available connection. The other services use gRPC and SSL for communication, and the mobile local service does not require any network communication. In addition, the network time gives an indication of the contribution of data transfer to the total response time.

$$\text{NetworkTime} = \text{ResponseTime} - \text{ServiceProcessTime} \tag{4}$$

### 5.4.1. Model and Device Behavior

In this section, the network time is evaluated for various devices and models. Figures 19 and 20 show the network time values calculated from the collected response and processing time data. Figure 19 focuses on the models' behavior for each service type and device, while Figure 20 focuses on the behavior of the devices for each model.

**Figure 19.** Service network time (by device).

**Figure 20.** Service network time (by model).

The bar chart on the left-hand side of Figure 19 presents the average network time for all service types and devices in terms of model types. MobileR services had the highest average network time (around 7 s), and this can be related to the nearby connections, especially since the network time includes the time required for connection initialization. The difference in average network time between different models on the same device is very small for all devices except the Cloud. The average network time of CloudBLite was 0.6 s, while for the other models the network time was 1.4 s, 2.2 s, and 2.6 s (for A, B, and ALite, respectively). However, the network time boxplot on the right-hand side of Figure 19 shows the distribution of the data. The Q1, Q2, and Q3 of the Cloud network time for all models fall below 1 s, though there are many outliers above 9 s for models A, B, and BLite, which have affected the average value. All other devices have a similar distribution of network time but with fewer and much lower outliers, except for the mobile remote. The maximum value of MobileRALite is 23 s and of MobileRBLite is 19 s; both have a minimum of 0.5 s. The MobileRALite median is 4 s and the MobileRBLite median is 2 s. This high variation of network time on MobileR indicates that the nearby connections are more unpredictable.

Figure 20 compares the calculated network times in terms of devices for each model. The bar chart on the left side shows the average network time. Apart from MobileR, whose behavior was explained earlier, the Cloud had the highest values. This is expected, as it is the only service located across the WAN, while all other services are on the LAN. Among the devices on the LAN, the Fog had the best average network time for all models, at around 0.39 s. The Raspberry Pis (Rasp4 at 0.41 s and Rasp8 at 0.46 s) were the second best, followed by Jetson (0.72 s). The boxplot on the right side shows the distribution of these values. The Cloud had the highest outliers, followed by Jetson. All other device network times fall below 2 s, including all outliers.

To summarize, the results confirm that both the type of connection and the technique used for communication affect the networking time. Local services are always the best option if the available resources are sufficient for processing, although the network cards and other device specifications produced variations in network times among devices on the same LAN.

### 5.4.2. Behavior over Weekdays

In this section, the network time is evaluated over the whole period of the experiment to investigate the behavior of the devices. The data were collected over four days, from Saturday to Tuesday, three times a day for all models and devices. Figure 21 shows the calculated network times plotted over a time series. In the scatter plot on the left side of Figure 21, the network times are plotted as colored dots, where each color represents a different device. The vertical axis represents the network time in seconds, the horizontal axis represents the time series including days and hours, the dots represent the calculated network time for each device at a specific time, and the line shows the trend of the network time over time. The trend curve is plotted using the LOESS (Locally Estimated Scatterplot Smoothing) regression analysis method. The S9 mobile device has no network time, as it runs a mobile local service that does not require any network communication.
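The authors' plotting code is not shown; the snippet below is only a minimal sketch of how such a LOESS trend curve can be produced with statsmodels, using hypothetical (time, network time) samples.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical samples: hours since the start of the experiment and network times (s).
hours = np.array([0, 9, 15, 24, 33, 39, 48, 57, 63, 72, 81, 87], dtype=float)
net_time = np.array([1.1, 2.0, 1.8, 0.9, 2.3, 1.7, 1.5, 2.1, 1.9, 2.6, 3.0, 2.8])

# LOESS fits locally weighted regressions; `frac` sets the fraction of points used
# for each local fit and therefore controls the smoothness of the trend line.
trend = lowess(net_time, hours, frac=0.5, return_sorted=True)
# trend[:, 0] holds the sorted hours and trend[:, 1] the smoothed network times.
```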

The highest network time trend line (top line) is for Note 4, which runs a remote mobile service. The behavior of the Nearby Connections API was observed in the previous section, where it had a very high dispersion of data (see the boxplot of Note 4 in Figure 19). Similarly, it can be seen that the Note 4 data points are spread all over the graph, with a maximum of 23 s on Sunday 00:00 and a minimum of around 0.1 s on Tuesday 13:00. The Cloud has the second-worst network time (second trend line from the top). However, there are eleven values above the Note 4 trend line, ten of them ranging from 9 s to 13 s and one of 23 s on Tuesday 21:00. The other devices on the LAN show a similar trend line, except for Jetson (the purple line), which went slightly higher from Saturday until Sunday afternoon. Saturday 15:00 was the highest with 8 s, and the second highest was on Saturday 22:00 with around 4 s of network time. The Fog, Rasp8, and Rasp4 were more stable, with one point above the Cloud trend line for the Fog, at around 2 s on Sunday 18:00.

Looking at the shape of the trend lines over time, all the lines were lower on Sunday 00:00 and higher on Tuesday 22:00. The Note 4 trend line fluctuated more than the other lines; the curve rose from Saturday afternoon until Sunday night. During the daytime on Monday, the network time was lower, and then the curve started to rise again from Monday evening until the end of the period. The Cloud network time trend started with a low network time of 1 s on Saturday 00:00, then built up and stabilized at around 2 s from Saturday evening to Monday evening, before rising again from Monday night to Tuesday night, reaching 3 s.

The boxplot on the right side of Figure 21 shows the distribution of the calculated network times on different days for the different devices. Due to the space limitation in the figure, only the distribution over days has been plotted, not over specific times. The boxplot confirms the earlier observations made from the scatter plot. The large boxes of Note 4 confirm the high dispersion of its network times across the days shown in the scatter plot. Similarly, the Cloud had many outliers above the maximum values on all days, and the high outliers for Jetson on Saturday confirm the curve in the Jetson trend line.

To summarize, the results showed that the devices' network times change at different times and on different days. These changes could be related to the users' network usage trends at different times of the day and between weekends and weekdays. Further investigation is needed to find trends in network usage. Such information could be used for network and service placement planning, which could improve the QoS.

### 5.4.3. Cellular (4G) vs. Fiber Networks

In this section, a comparative study is made of the fiber-optic and cellular 4G internet connections. An experiment was conducted over three days, from Sunday 28 March 2021 to Tuesday 31 March 2021. The data were collected for both fiber and 4G at two different times of the day, and both internet connections were from the same network provider. The Cloud services are the only services that require an internet connection, as they were installed in the Google datacenter. All other services do not require an internet connection, as they were installed on the LAN.

Figure 22 shows the network time of all Cloud services (for all models) for both the fiber and 4G internet connections. The vertical axis represents the network time in seconds, and the horizontal axis represents the time series including days and hours. In the scatter plot on the left side of Figure 22, the dots represent the calculated network time for each connection at a specific time, and the line shows the trend of the network time over time. The trend curve is plotted using the LOESS regression analysis method. As expected, the fiber connection had a better network time (around 2 s) than 4G (ranging from 3 s to 10 s). The fiber connection is more stable over time, with a slight rise at the end of the period to around 2.5 s. However, there are a few (seven points in total) higher values between 9 s and 13 s. The cellular (4G) connection is less stable over time, as the trend line fluctuates with many high and low values. The lowest value was 1.5 s on Sunday 28 March 2021 at 11:00, and the highest value was 30 s on Tuesday 31 March 2021 at 00:00. It appears that there was higher demand on the cellular network from Monday night to Tuesday afternoon and lower demand from Sunday afternoon to Monday afternoon, which produced these variations.

**Figure 22.** Network time for fiber and 4G of the cloud services.

The boxplot on the right side of Figure 22 shows the distribution of the calculated network times for both the fiber and 4G internet connections over time. The boxplot confirms our earlier observations from the scatter plot. The large boxes of the 4G network on Sunday 00:00, Tuesday 00:00, and Tuesday 13:00 are aligned with the curve in the 4G trend line in the scatter plot. The fiber network was much more stable, with smaller IQRs, consistent medians, and few outliers over the whole period.

### *5.5. Service Data Transfer Rate*

The service data transfer rate metric is the rate at which data are transferred from the request service to the diagnosis service and back again. It includes the time needed for the operating system to initialize the connection, prepare the packets, and send them across the network. Mobile local services do not have a service data transfer rate, as they do not require network communications. The service data transfer rate is calculated as the total size of the transferred data divided by the network time (see Equation (5)). The RequestSize and the ResponseSize are the sizes of the request and response packets in bits. Figures 23 and 24 show the service data transfer rates calculated from the collected packet sizes and the calculated network times.

$$\text{ServiceDataTransferRate} = \frac{\text{ResponseSize} + \text{RequestSize}}{\text{NetworkTime}}\tag{5}$$

Figure 23 compares the service data transfer rates of different models for each service type and device. The bar chart presents the average service data transfer rate, where the horizontal axis represents devices, the vertical axis represents the average service data transfer rate in Kbps, and the bars represent model types. The gRPC service on the Fog device had the highest average service data transfer rate for all models: 4 Kbps for A, B, and ALite, and 5 Kbps for BLite, which is aligned with the average network times discussed earlier. The Cloud service had the lowest service data transfer rate among all models, 1.7 Kbps, 1.7 Kbps, 1.6 Kbps, and 2 Kbps for models A, B, ALite, and BLite, respectively. This was expected, as the Cloud services are the only services that require the data to be transferred across the WAN. The boxplot on the right side of Figure 23 shows the distribution of the service data transfer rates. All devices show larger boxplots than the Cloud's, which means that the service data transfer rates of all local devices vary more than the Cloud's values. Note 4 showed a very low service data transfer rate, with minimum and Q1 values of around 0.1 Kbps. In addition, the medians of both MobileRALite (0.6 Kbps) and MobileRBLite (1.5 Kbps) are lower than those of CloudALite (1.9 Kbps) and CloudBLite (2.1 Kbps).

**Figure 23.** Service data transfer rate (by device).

**Figure 24.** Service data transfer rate (by model).

Figure 24 compares the service data transfer rates of different devices for each model type. For all original TF models, the Fog had the best service data transfer rate, followed by Rasp8, Jetson, and the Cloud. For the ALite model, Rasp4 was better than the Fog by 0.06 Kbps, and they were followed by Rasp8, Jetson, Note 4, and the Cloud. For the BLite model, the Fog was the best, followed by Rasp4, Rasp8, Jetson, Note 4, and the Cloud. The boxplot on the right side of Figure 24 shows the distribution of the service data transfer rates. The medians of the original TF models show the same pattern as the average values; however, the TFLite models showed a slightly different pattern. Unlike the averages, the Note 4 medians were lower than the Cloud's, and Rasp4 and Rasp8 both had a similar median of 4 Kbps for the ALite model.

### *5.6. Service Energy Consumption*

Energy is a key factor in system efficiency, in terms of both cost and environmental sustainability. Therefore, services that consume less energy are favorable. Figure 25 shows the estimated average energy consumption per task for all service types presented in the service catalog (see Table 3). The bar chart on the left side shows the energy consumption grouped in terms of devices, while the one on the right side shows the energy consumption grouped in terms of models. MobileL had the lowest energy consumption for both ALite (0.0009 Wh) and BLite (0.0004 Wh), as no energy is used on data transfer in those services. The Cloud had the highest energy consumption for all models, 0.26 Wh, 0.03 Wh, 0.03 Wh, and 0.01 Wh for models A, B, ALite, and BLite, respectively. The BLite model consumed the least energy for all service types compared to the other models, which was expected considering the characteristics of this model. On the other hand, model A had the highest energy consumption due to its computation and memory requirements. CloudA had the highest energy consumption of 0.26 Wh, followed by FogA (0.14 Wh), JetsonA (0.14 Wh), and Rasp8A (0.04 Wh).

**Figure 25.** Service energy consumption.

### *5.7. Service Value (eValue and sValue)*

Two relative values are calculated, one for energy (eValue) and the other for speed (sValue) (see Section 3.5). These service values are used to compare the 22 different service types in terms of their accuracy, energy, and speed (response time). Only the Fiber network was used in these calculations (the same applies to the energy consumption values presented in the previous section). The service values are computed using the appropriate energy consumption parameters (see Section 3.5). For example, the Cloud eValue uses both the Fiber and Wi-Fi energy consumption values. For Bluetooth, in the figures, we used the same energy consumption as for Wi-Fi, but this could easily be replaced by precise Bluetooth energy values. Note that there is no difficulty in computing and plotting service values for the 4G network, but this would lengthen the paper and unnecessarily add to its complexity. The comparison provided for 4G versus Fiber in Section 5.4.3 only presents a comparison between network times; all other values, such as the service values, can be derived from it. This is to bring another design dimension to the reader's attention, while keeping the article complexity to a minimum.

Figure 26 shows the normalized service eValues as integers between 0 and 100 for all service types. The bar chart on the left side shows the service eValues grouped in terms of devices, while the one on the right side shows the service eValues grouped in terms of models. MobileLBLite had the highest service eValue, and CloudA had the lowest eValue, which is aligned with their energy consumption. In general, the BLite model had the highest values among the models, and model A had the lowest values. When it comes to devices, MobileL services had the best service eValues, though they can only run TFLite models. MobileL services do not require network communication, which eliminates the network data transfer energy from the energy equation (see Equation (1)), reduces their energy consumption, and increases their eValues. The Rasp8 services had the best service eValues among the services that run original TF models, and they are the second best for TFLite models after MobileL. This can be related to the energy consumption of the Raspberry Pi devices, which is the lowest among all the devices used in the experiments (see Table 3). The Cloud services had the worst eValues due to both the device and data transfer energy consumption.

**Figure 26.** Service eValue.

Figure 27 shows the normalized service sValues as integers between 0 and 100 for all service types. The bar chart on the left side shows the service sValues grouped in terms of devices, while the one on the right side shows the service sValues grouped in terms of models. MobileLBLite had the highest service sValue, and JetsonA had the lowest sValue. Among the devices running TFLite models, MobileR had the lowest sValues, and among the devices running TF models, Jetson had the lowest sValues. In general, MobileL had the best sValues, and the Fog services came in second place. Rasp8 and Rasp4 had similar sValues, and the Cloud services' sValues were better for the A and BLite models. The sValue is strongly related to the services' response times, which have been discussed extensively in Section 5.3.

To summarize, MobileL services had the highest eValues and sValues, as they use less energy and provide faster responses. The only concern with MobileL services is that they are limited in their resources and cannot accommodate large and complex models or large volumes of data. The Cloud services were much better in terms of sValues than eValues due to their high energy consumption. The Fog also performed very well in terms of sValues (it was the second best), but Rasp8 outperformed it when it came to eValues. The Jetson services had similar eValues and sValues, as their high processing time affected both energy and response time.

**Figure 27.** Service sValue.

### **6. Conclusions and Future Work**

Digital services are the fundamental building blocks of technology-driven smart cities and societies. There has been an increasing need for distributed services that provide intelligence near the fog and edge for reasons such as privacy, security, performance, and cost. The healthcare sector is not an exception; not only does it require such distributed services, but it is also driven by many other factors, including declining public health, the increase in chronic diseases, the ageing population, rising healthcare costs, and COVID-19.

In this paper, the Imtidad reference architecture is proposed, implemented, and evaluated. It provides DAIaaS over the cloud, fog, and edge using a service catalog case study containing 22 AI skin disease diagnosis services. These services belong to four service classes that are distinguished by software platforms (containerized gRPC, etc.) and are executed on a range of hardware platforms (NVIDIA Jetson nano, etc.) and four network types (Fiber, etc.). The AI models for diagnosis included two standard and two Tiny AI Deep Neural Networks to enable their execution at the edge. They were trained and tested using 10,015 real-life dermatoscopic images.

A detailed evaluation of the DAIaaS skin lesion diagnosis services was provided using several benchmarks. A DL service on a local smartphone provides the best service in terms of energy, followed by a Raspberry Pi edge device. A DL service on a local smartphone also provides the best service in terms of speed, followed by a laptop device in the fog layer. DL services in the edge layer on local smartphones are the best in terms of energy and response time (speed), as they do not require any network communication, though they can only accommodate TFLite or small models. TFLite optimization provided a great improvement in terms of processing time and compatibility with edge devices. However, it could reduce model accuracy to levels that may or may not be tolerable, depending on the criticality of the application and user preferences. Therefore, we considered the accuracy of the model in both the eValue and sValue, to provide a way for the user to choose and trade off between these factors, energy, and speed. Other devices in the fog and edge layers, such as a laptop and Raspberry Pi (8 GB), can accommodate more complex models and at the same time provide fast responses to local service requests. The DL service on a remote smartphone showed unpredictable behavior in terms of network time compared to other edge and fog services due to the Android Nearby Connections API, which is used for nearby smartphone communication. The Cloud services' processing time is close to that of the Fog services, though the response time is higher, as it requires more time to transfer the image across the internet. This would depend on the particular scenario; for example, scenarios requiring heavy computations would give the cloud much faster responses, because in those cases the processing time would be a bottleneck for low-resource fog devices. DL services in the cloud layer also depend on the type of internet connection used. Our evaluation of both the Fiber and Cellular (4G) internet connections for the Cloud services confirmed that the fiber network connection is more stable and has a lower network time than the cellular connection (4G in this case, but this may change for 5G and 6G). Obviously, while the fiber connection was shown to be more stable, it has limitations in terms of user mobility. The Cloud services' eValue and sValue are both affected by the required network communication over the WAN.

The novelty and the high impact of this research lie in the developed reference architecture, the service catalog offering a large number of services, the potential for the implementation of innovative use cases through the edge, fog, and cloud, and their evaluation on many software, hardware, and networking platforms, as well as a detailed description of the architecture and case study. To the best of the authors' knowledge, this is the first research paper in which a reference architecture for DAIaaS is proposed and implemented, and in which a healthcare application (skin lesion diagnosis) is developed and studied in detail. This work is expected to have an extensive impact on developing smart distributed service infrastructures for healthcare and other sectors.

Future research on distributed services will focus on improving the accuracy and other performance aspects of the skin disease AI model and services. While the design, implementation, and evaluation of the proposed reference architecture and DAIaaS services is detailed and diverse, human, computer, and network resource limitations impeded a higher diversity of hardware, networks, and more frequent measurements. Future lines of research will be oriented towards improving the granularity of the measurements as well as adding to the diversity of the software, hardware, and communication platforms.

Future work will also consider improving and refining the reference architecture, extending it through the development of services in other application domains and sectors, including many smart city applications that we have developed over the years, including smart cities [2,3,81], big data [8,20], improving computing algorithms [82,83], education [1], spam detection [84], accident and disaster management [85,86], autonomous vehicles and transportation [87–91], and healthcare [6,56,92,93].

AI will be an important parameter in the evolution of the 5th Generation (5G) networks and the conceptualization and design of the 6th Generation (6G) networks. Technologies such as network function virtualization (NFV), software-defined networking (SDN), 3D network architectures, and energy harvesting strategies will play important roles in delivering the promises of 5G and 6G networks. However, it is AI that is expected to be the main player in network design and operations, not only in terms of the use of AI for the optimization of network functions, but also due to the expectations that AI, being a fundamental ingredient of smart applications, will be a major workload to be supported by next-generation networks. While 5G promises us high-speed mobile internet, 6G pledges to support ubiquitous AI services through next-generation softwarization, heterogeneity, and configurability of networks [13]. The work on 6G is in its infancy and requires the community to conceptualize and develop its design, implementation, deployment, and use cases [13]. This paper is part of our broader work on distributed AI as a Service and is a timely contribution to this area of developing next-generation infrastructure, including the network infrastructure, needed to support smart societies of the future. Our earlier work [13] proposed a framework for provisioning Distributed AI as a service in IoE (Internet of Everything) and 6G environments and evaluated it using three case studies on distributed AI as service delivery in smart environments, including a smart airport and a smart district. This paper adds to the earlier work by extending another case study on developing a service catalog of distributed services.

**Author Contributions:** Conceptualization, N.J. and R.M.; methodology, N.J. and R.M.; software, N.J.; validation, N.J. and R.M.; formal analysis, N.J., R.M., J.M.C. and T.Y.; investigation, N.J., R.M., I.K., A.A., T.Y. and J.M.C.; resources, R.M., I.K. and A.A.; data curation, N.J.; writing–original draft preparation, N.J. and R.M.; writing–review and editing, R.M., I.K., A.A., T.Y. and J.M.C.; visualization, N.J.; supervision, R.M. and I.K.; project administration, R.M., I.K. and A.A.; funding acquisition, R.M., A.A. and I.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors acknowledge with thanks the technical and financial support from the Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, under Grant No. RG-10-611-38. The experiments reported in this paper were performed on the Aziz supercomputer at KAU.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The HAM10000 dataset is a public dataset available from the link provided in the article.

**Acknowledgments:** The work carried out in this paper is supported by the HPC Center at King Abdulaziz University. The training and software development work reported in this paper was carried out on the Aziz supercomputer.

**Conflicts of Interest:** The authors declare no conflict of interest.
