*Article* **Automatic Failure Recovery for Container-Based IoT Edge Applications**

**Kolade Olorunnife \*, Kevin Lee \* and Jonathan Kua \***

School of Information Technology, Deakin University, Geelong, VIC 3220, Australia

**\*** Correspondence: kolorunnife@deakin.edu.au (K.O.); kevin.lee@deakin.edu.au (K.L.); jonathan.kua@deakin.edu.au (J.K.)

**Abstract:** Recent years have seen the rapid adoption of Internet of Things (IoT) technologies, where billions of physical devices are interconnected to provide data sensing, computing and actuating capabilities. IoT-based systems have been extensively deployed across various sectors, such as smart homes, smart cities, smart transport and smart logistics. Newer paradigms such as edge computing have been developed to allow computation and data intelligence to be performed closer to IoT devices, hence reducing latency for time-sensitive tasks. However, IoT applications are increasingly being deployed in remote and difficult-to-reach areas in edge computing scenarios. These deployment locations make upgrading applications and dealing with software failures difficult. IoT applications are also increasingly being deployed as containers, which offer increased remote management ability but are more complex to configure. This paper proposes an approach for effectively managing, updating and re-configuring container-based IoT software as efficiently, scalably and reliably as possible, with minimal downtime, upon the detection of software failures. The approach is evaluated using Docker container-based IoT application deployments in an edge computing scenario.

**Keywords:** Internet of Things (IoT); edge computing; failure recovery

#### **1. Introduction**

The past decade has seen the rapid development and adoption of the Internet of Things (IoT). IoT refers to an ecosystem where billions of physical devices/objects are equipped with communication, sensing, computing and actuating capabilities [1,2]. In 2021, there were an estimated 12.3 billion active IoT endpoints, and the number of active IoT endpoints is forecast to reach 27 billion in 2025 [3]. IoT-based systems have been extensively deployed across many industries and sectors that impact our everyday lives, ranging from smart homes and smart cities to smart transport and smart logistics. Business opportunities and new markets abound in IoT, with new IoT applications and use cases constantly evolving. However, IoT applications are increasingly complex, commonly involving a large number of end devices with many sensors and actuators, and computing spread across end-node, edge and cloud locations. There has been substantial effort in designing architectures and frameworks to support good IoT application design [4,5]. An effective and efficient IoT system requires optimal architectural, communication and computational design. The heterogeneity of device hardware, software and communication protocols has made achieving the performance requirements of specific IoT services challenging [6–8]. Service requirements for IoT systems have motivated the development of new computational paradigms, such as edge and fog computing, to facilitate IoT data computation and analysis, bringing data intelligence and decision-making processes closer to IoT sensors and actuators, thus improving their performance by reducing service latency [9–12]. This is particularly important for time-sensitive tasks, such as those in autonomous vehicles, manufacturing and transport industries, where minute delays in services can have serious safety consequences [13].

**Citation:** Olorunnife, K.; Lee, K.; Kua, J. Automatic Failure Recovery for Container-Based IoT Edge Applications. *Electronics* **2021**, *10*, 3047. https://doi.org/10.3390/electronics10233047

Academic Editor: George Angelos Papadopoulos

Received: 9 November 2021 Accepted: 2 December 2021 Published: 6 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There are significant trade-off considerations when deciding on the computational paradigm for IoT systems [14]. Cloud computing offers higher reliability, availability and capabilities, while edge computing offers low-latency services. However, like all complex computer systems, edge-based IoT deployments can suffer from failures [15]. This is potentially more problematic for IoT deployments because they are often in embedded or hard-to-reach physical locations. IoT devices are usually deployed at scale and have cheaper chipsets and hardware, which make them more prone to faults. It is challenging to manage faults, especially when these devices are deployed across a large area in physically hard-to-reach environments. This issue can be addressed through fault-tolerance [16,17], which attempts to prevent errors in the first place, or through failure recovery techniques that attempt to recover from problems [18,19]. In addition, IoT applications are increasingly deploying software on cloud services with no physical access, making fault-tolerance both necessary and difficult [20]. A potential solution to this problem is to use virtualisation technologies for resource optimisation in heterogeneous service-oriented IoT applications [21]. Redundancy is another common approach which has been extensively studied, particularly for providing fail-overs in the IoT sensing, routing and control processes [22]. In addition to fault-tolerance, approaches using anomaly detection and self-healing techniques have also been explored for building more resilient IoT and Cyber-Physical Systems (CPS), addressing the various risks posed by threats at the physical, network and control layers [23].

Among the many fault-tolerance approaches studied and proposed, container-based approaches are still lacking. The aim of this paper is to investigate the feasibility of automatically detecting failures in container-based IoT edge applications. Specifically, this paper investigates whether this technique can be used in scenarios where IoT software is deployed in embedded or hard-to-reach locations, with difficult or no physical access. The proposed approach can automatically diagnose faults in IoT devices and gateways by monitoring the output of IoT applications. When faults are discovered, the proposed approach reconfigures and redeploys container-based deployments. The experimental evaluation analyses the impact of error rate, redeployment time and package size on the recovery time for the IoT application.

The contributions of the paper are: (i) an evaluation of approaches for failure recovery for IoT applications; (ii) a proposed framework for enabling failure recovery for IoT applications; and (iii) experiments to evaluate the flexibility, resilience and scalability of the proposed approach.

The structure of the paper is as follows. Section 2 provides background on edge computing and failure recovery for IoT applications. Section 3 proposes the architecture of a framework for automatic failure recovery of IoT applications. Section 4 describes the experimental setup for this paper and Section 5 presents the results of the evaluation of the framework. Section 6 concludes the paper and discusses future work.

#### **2. Background**

This section presents some background information on edge computing, IoT software management and failure recovery in IoT.

#### *2.1. Edge Computing*

With billions of IoT sensors and devices generating large amounts of data and exchanging communication and control messages across complex networks, there arises a need for a more efficient computational paradigm, rather than merely relying on cloud computing infrastructure. Edge computing was designed and proposed to address the complex challenges resulting from the large amount of data generated by IoT systems, such as resource congestion, expensive computation and long service delays, which negatively impact the performance of IoT services. Edge computing aims to be a distributed infrastructure that performs data computation and analysis closer to the sensors that collect the data and the actuators that act upon the decisions [9]. This significantly reduces service response time and is particularly important for IoT applications that require real-time or time-sensitive services [13].

Put simply, an edge device is a physical piece of hardware that acts as a bridge between two given networks. In IoT, an edge device commonly receives data from end devices, which include sensors such as temperature sensors, moisture sensors or radio-frequency identification (RFID) scanners. The data from the sensors are passed on to an edge device, which either forwards the data directly to the cloud or performs minor data processing at the edge before forwarding the results to the cloud [24].

Figure 1 illustrates a general IoT and edge-based architecture. On the left, there are various sensors connected to a single edge device, which in this case is an embedded device. The edge device in Figure 1 is connected to an actuator, which could, for example, flip on a light switch or turn on an air conditioning unit. The edge device forwards the sensor data to an internet gateway. The internet gateway passes this data on to an application server, which then handles the processing or storage of the data. The relationship between edge devices and edge gateways can be many-to-many: many edge devices can connect to a single edge gateway, or many edge devices can connect to one of many edge gateways. Processing of data can happen at any stage of this architecture [25].

**Figure 1.** Edge Computing Architecture.

The increase in computational power of embedded devices allows for complex data processing on edge devices before the data even reach the cloud service [26]. Fog computing is where the majority of the processing is done at the gateway rather than at the edge or in the cloud. For IoT applications, there are strong benefits to using edge or fog computing. One such benefit is that latency for time-sensitive IoT applications is minimised, because processing occurs at the edge rather than transferring data to and from the cloud through a gateway. Lower latency allows real-time applications to come to the forefront in the consumer, industrial and commercial space. In edge architectures, it is also possible for the end device and edge device to work together to perform edge computing in order to make informed decisions or trigger an action. Real-time operations become more feasible because no communication with a cloud service is needed. The sensor(s) embedded within the edge device can process the data as they are collected and then act upon this information with minimal impact on quality of service [9].

To understand the reliability and failure characteristics of IoT application deployments, it is useful to discuss the architecture. IoT can be thought of as an evolution of traditional sensor networks; however, there is an inherent and growing need for resources to, for example, process video. The traditional solution to this has been to use cloud computing resources. This has distinct advantages, such as access to almost unlimited cheap resources, but also distinct disadvantages, such as high latency and a lack of control. If there is a failure of communication to the cloud resources, or a failure of the cloud computing resource itself, there is often no ability to recover. Edge computing offers a compromise, with the introduction of powerful resources at the edge of the network, often in full or partial control of the application developer. In an edge computing architecture, the reliability of the IoT application can potentially be better than when using only a cloud service, as the system does not need to rely entirely on the cloud service to fully function. In particular, communication from an IoT node to an edge node is much simpler than communication to cloud computing resources, which involves many hops.

However, although edge computing reaps several benefits as previously mentioned, edge systems still suffer from a common issue, namely network instability, which plagues any system or application that requires a steady connection to the internet or to some form of secondary device. A device's network stability is correlated with its power consumption: the greater the amount of consumed power, the higher the chance of instability, possibly due to heavy computation on the edge device when processing a large volume of data. This is why segmenting and de-identifying the sensor data for privacy reasons, before sending it to the gateway for pre-processing, is a new challenge.

The need for fault-tolerant IoT systems and applications has been on a steady incline. This need arises from the possibility of intermittent long-distance network connectivity problems, malicious harming of edge devices, or harsh environments physically affecting the devices' performance [27]. As IoT systems become increasingly large, software frameworks that automatically manage the different components within an IoT application are needed [28].

#### *2.2. IoT Software Management*

Recently, software-defined networking (SDN) technologies have been considered as a dominant solution for managing IoT network architecture. Dang et al. propose incorporating SDN-based technologies with over-the-air (OTA) programming to design a systematic framework which allows heterogeneous IoT devices to be remotely reprogrammed. This framework was designed for dynamic adaptability and scalability within IoT applications [29].

Soft-WSN is a software-defined wireless sensor network architecture designed to support application-aware service provisioning for IoT systems. The proposed Soft-WSN architecture involves a software controller, which includes two management procedures: device management and network management. Device management allows users to control their devices within the network [30]. Network configuration is controlled by the network management policy, which can be modified at run time to deal with the dynamic requirements of IoT applications.

UbiFlow is a software-defined IoT system for mobility management in urban heterogeneous networks. UbiFlow adopts multiple controllers to divide software-defined networks into different partitions and achieve distributed control of IoT data flows. UbiFlow mostly pertains to scalability control, fault tolerance and load balancing. The UbiFlow controller differentiates scheduling based on per-device requirements. Thus, it can present an overall network status view and optimise the selection of access points within the heterogeneous networks to satisfy IoT flow requests, guaranteeing network performance for each IoT device [31].

#### *2.3. Fault Tolerance in IoT*

Providing fault tolerance support to Internet of Things systems is an open field, with many implementations utilising various technologies like artificial intelligence, reactive approaches and algorithmic approaches [32].

A pluggable microservices framework for fault tolerance in IoT devices can be used to separate the workflow for detecting faults. The first microservice utilises complex event processing for real-time reactive fault-tolerance detection, whereas the second microservice uses cloud-based machine learning to detect fault patterns early and is a proactive strategy for fault tolerance. The reactive fault tolerance that uses complex event processing only initiates recovery protocols upon the detection of an error. This sort of strategy is only effective for systems that have a low-latency connection to the faulty device. In contrast, the proactive strategy that uses machine learning initiates recovery protocols before errors occur, using predictive technologies. The main idea behind a proactive strategy is to temporarily disable or isolate IoT devices that would cause an error or harmfully impact the system before it occurs [33].

A mobile agent-based methodology can be utilised to build a fault-tolerant hierarchical IoT-cloud architecture that can survive the faults that occur at the edge level. In the proposed architecture, the cloud is distributed across four separate levels which are cloud, fog, mist and dew. The distribution is based on the processing power and distance from the edge IoT devices. This makes for a reliable system by redirecting the application onto an alternate server when faults occur within any level of the system [34].

Utilising a bio-inspired particle multi-swarm optimisation routing algorithm to ensure that connections between IoT devices remain stable is another feasible methodology for fault-tolerant IoT networks. The multi-swarm strategy determines the optimal directions for selecting the multipath route while exchanging messages from any position within the network [35].

In deployment scenarios where wireless technologies are used, such as in WSNs, virtualisation technologies for resource optimisation can be used to assist heterogeneous service-oriented IoT applications [21]. There are many different redundancy and fail-over strategies across the IoT stack and ecosystem; the paper in [22] provides a comprehensive survey, particularly of the IoT sensing, routing and control processes. Another paper [23] presents a comprehensive roadmap for achieving resilient IoT and CPS-based systems, with techniques encompassing anomaly detection and self-healing to combat the various risks and threats posed by internal/external agents across the physical, network and control layers.

#### *2.4. Theories, Metrics and Measurements for System Reliability*

Today's technological landscape requires high system availability and reliability. System downtime, failures or glitches can result in significant revenue loss and, more critically, compromise the safety of systems. Hence, measurement metrics for system reliability are used by companies to detect, track and manage system failures/downtime. Some commonly used metrics are Mean Time Before/Between Failure (MTBF), Mean Time to Recovery/Repair (MTTR), Mean Time to Failure (MTTF) and Mean Time to Acknowledge (MTTA). These metrics allow the monitoring and management of incidents, including tracking how often a particular failure occurs and how quickly the system can recover from such a failure. For a more detailed explanation and derivation of these metrics, we refer the readers to [36].
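As a concrete illustration of the two most common of these metrics, the sketch below computes MTBF and MTTR from a simple incident log. The record layout, function name and observation-window convention are assumptions for illustration, not taken from [36].

```javascript
// Sketch: deriving MTBF and MTTR from an incident log, assuming each incident
// records when the failure started and when service was restored (in seconds).
function reliabilityMetrics(incidents, observationSeconds) {
  // MTTR: mean time from failure to recovery across all incidents.
  const totalRepair = incidents.reduce(
    (sum, i) => sum + (i.recoveredAt - i.failedAt), 0);
  const mttr = totalRepair / incidents.length;
  // MTBF: mean operating (up) time per failure over the observation window.
  const uptime = observationSeconds - totalRepair;
  const mtbf = uptime / incidents.length;
  return { mtbf, mttr };
}
```

For example, two incidents totalling 80 s of downtime over a 1000 s window give an MTTR of 40 s and an MTBF of 460 s.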

There are many approaches presented in the literature to minimise system failures and manage incidents more effectively. These approaches span multiple industry applications, with most techniques focusing on optimising the MTBF. However, there is currently limited work in applying these techniques to IoT and edge computing. For example, Engelhardt et al. [37] investigated the MTBF for repairable systems by considering the reciprocal of the intensity function and the mean waiting time until the next failure; Kimura et al. [38] looked at MTBF from an applied software reliability perspective by analysing software reliability growth models described by a non-homogeneous Poisson process; in two separate works, Michlin et al. [39,40] performed sequential MTBF testing on two systems and compared their performance; Glynn et al. [41] proposed a technique for efficient estimation of MTBF in non-Markovian models of highly dependable systems; Zagrirnyak et al. [42] discussed the use of neuronets in reliability models of electric machines for forecasting the failure of the main structural units (also based on MTBF); Suresh et al. [43] unconventionally applied MTBF as a subjective video quality metric, which makes for an interesting evaluation of MTBF in application areas other than literal system failures.

The reliability curve is another important metric that is widely used across many applications and industries [44]. Variants of reliability curves have recently been applied to IoT, edge computing and Mobile Edge Computing (MEC) with the advent of innovations in 5G. For example, Rusdhi et al. [45] performed an extensive system reliability evaluation on several small-cell heterogeneous cellular network topologies and considered the useful redundancy region, MTTF, link importance measure and system/link uncertainties as metrics to manage failure incidents; Liu et al. [46] proposed a MEC-based framework that incorporates the reliability aspects of MEC in addition to latency and energy consumption (by formulating these requirements as a joint optimisation problem); Chen-Feng Liu et al. [47,48] proposed two (related but separate) MEC network designs that also considered the latency and reliability constraints of mission-critical applications (in addition to average queue lengths and queue delays, by using Lyapunov stochastic optimisation and extreme value theory); Han et al. [49] proposed a context-aware decentralised MEC architecture for authentication and authorisation to achieve an optimal balance between operational costs and reliability.

The link importance measure is another important aspect of measuring system reliability. Several techniques have considered the link importance measure in the context of IoT and edge computing. For example, Silva et al. [50] developed a suite of tools to measure and detect link failures in IoT networks by measuring the reliability, availability and criticality of the devices; Benson et al. [51] proposed a resilient SDN-based middleware for data exchange in IoT edge-cloud systems, which dynamically monitors IoT network data and periodically sends multi-cast time-critical alerts to ensure the availability of system resources; Qiu et al. [52] proposed a robust IoT framework based on a Greedy Model with Small World (GMSW), which determines the importance of different network nodes and communication links and allows the system to quickly recover using "small world properties" in the event of system failures; Kwon et al. [53] presented a failure prediction model using a Support Vector Machine (SVM) for iterative feature selection in Industrial IoT (IIoT) environments, which calculates the relevance between the large amounts of data generated by IIoT sensors and predicts when the system is more likely to experience downtime; and Dinh et al. [54] explored the use of Network Function Virtualisation (NFV) for efficient resource placement to manage hardware and software failures when deploying service chains in IoT Fog-Cloud networks.

#### **3. Proposed Framework for IoT Failure Recovery**

The aim of this paper is to propose a failure recovery framework for IoT applications. This framework focuses on the problem of failures in IoT applications that are deployed on end nodes and edge nodes using container deployment techniques. This is a relatively new problem, as it is only recently that end nodes and edge nodes have had the resources to use containerisation. The proposed framework assumes that failures occur in IoT deployments due to corruption or misconfiguration in the end node or edge gateway [55]. The framework passively monitors communications from the IoT node and gateway to detect potential failures. On discovery of a potential failure, the framework deploys known-good applications in containers. The general aim of the framework is to minimise downtime due to application failure.

Figure 2 presents the overall architecture of the proposed framework, which would be bolted on to an existing IoT edge-computing deployment. The deployment controller is a monitoring agent which can send action requests to any IoT gateway or device. An action request is one of two things: either a reconfiguration request, to reconfigure one or many IoT gateways or devices, or a redeployment request, in which the code-base of the IoT gateways or devices can be updated. The following describes how each device of the IoT-edge deployment interacts with the proposed framework.

**Figure 2.** Proposed framework for IoT failure recovery.

The IoT device receives action requests from the gateway. If the action request is a redeployment request, the helper process is notified and proceeds to handle the managing, rebuilding, restarting and deleting of containers and images for a seamless migration to the new version. The IoT device can also request the helper process to perform a network scan for any active gateways that are connected to the same network as the IoT device. The IoT device sends the collected data to the gateway.

The IoT gateway receives action requests from the cloud server and will route each request to its corresponding target devices, or ignore the request entirely if it is not required to execute it. The IoT gateway communicates with the helper process if the gateway is the target of a redeployment request. When the helper process is notified, it proceeds to handle the managing, rebuilding, restarting and deleting of containers and images for a seamless migration to the newer version. The IoT gateway receives the data stream and forwards it to the cloud server.

The cloud server receives action requests from the deployment controller and propagates each request to all connected gateways. The cloud server receives the data stream from the IoT gateway and forwards it to the deployment controller.

The deployment controller receives the response of an action request's completion or failure from the cloud server, which is sent from the IoT devices through the IoT gateway. The deployment controller initiates action requests, and receives and displays the data stream from the cloud server.

IoT-edge deployments are described through *YAML* configuration files. Redeployment requests must have the *build.yml* file at the root of the update folder. The update folder is located at /update in the current working directory of the deployment controller. The structure of the build file is illustrated in Figure 3.

    target:
      group: "*"
      type: iot-gateway
    actions:
      - DELETE cache.json
      - OVERWRITE Dockerfile
      - OVERWRITE device-helper.js
      - OVERWRITE gateway.js


**Figure 3.** Example build.yml.

The build file is split into two configuration sections. The target section indicates to the IoT gateways whether the build package is for a gateway or for IoT device(s). The first path within the target section, target.group, is the identifier of the device to update. Specifying the \* key, as in the example, will address all devices of target.type. The second path, target.type, specifies whether the build package should be deployed on the IoT gateway or on the IoT device.
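The targeting rule above can be sketched as a small predicate: a device applies a build package only if target.type matches its role and target.group is either the wildcard or its own identifier. The parsed-object shape and function name are illustrative assumptions, not code from the paper.

```javascript
// Sketch of the target-matching rule for a parsed build.yml target section.
function isBuildTarget(target, device) {
  if (target.type !== device.type) return false; // iot-gateway vs iot-device
  // "*" addresses every device of the matching type; otherwise match the id.
  return target.group === '*' || target.group === device.id;
}
```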

The second configuration section is the actions section, which tells either the IoT device or the gateway what actions to perform to unpack the build. There are only four action commands usable within the build file's actions configuration section: OVERWRITE (filename) will create a new file or overwrite the existing file; MKDIR (dir) will create a directory if it does not exist; DELETE (filename) will remove a file; and finally, REBUILD will rebuild the device's image. The REBUILD action will not run any subsequent actions and should always be at the bottom of the actions list.
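A minimal interpreter for these four commands could look as follows. To keep the sketch self-contained it returns the planned filesystem operations rather than performing them, and it stops at REBUILD, matching the rule above; the function and operation names are illustrative.

```javascript
// Sketch: interpreting the actions section of build.yml into planned operations.
function planActions(actions) {
  const ops = [];
  for (const line of actions) {
    const [cmd, arg] = line.split(/\s+/);
    switch (cmd) {
      case 'OVERWRITE': ops.push({ op: 'write',  path: arg }); break;
      case 'MKDIR':     ops.push({ op: 'mkdir',  path: arg }); break;
      case 'DELETE':    ops.push({ op: 'unlink', path: arg }); break;
      case 'REBUILD':   ops.push({ op: 'rebuild' }); return ops; // no further actions run
      default: throw new Error(`Unknown action: ${cmd}`);
    }
  }
  return ops;
}
```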

The build.yml refers to a small number of files, which allow the building and rebuilding of the node software. The first of these, *cache.json*, is a JSON file on the IoT gateways that stores the time data was last received from each IoT device. This allows the IoT gateway to know when data was last sent, even if the IoT gateway restarts. *Dockerfile* is the Docker configuration file that describes how the image should be built.

*device-helper.js* is a JavaScript file that executes outside the Docker container environment and is solely responsible for interacting with and modifying the behaviour of the Docker containers when requested to do so. The device-helper's primary role is to delete and rebuild images to the new specifications, attempt to start the new image and, if a failure occurs, attempt failure recovery protocols. The device-helper script is in continuous communication with the active running Docker container.

*gateway.js* is the main JavaScript file that runs within the Docker container and is responsible for keeping track of all currently connected IoT devices and automatically removing any disconnected IoT device from the list. *gateway.js* has the critical responsibility of relaying data and requests to the required devices at any given point in time.

#### **4. Experiment Setup**

To evaluate the effectiveness and robustness of the proposed framework, a series of experiments was performed with varying configurations. A testbed was set up using common hardware platforms widely used for IoT nodes and gateways. The proposed architecture, as described in Figure 2, was implemented using several Raspberry Pi single-board computers. Raspberry Pis are commonly used in IoT testing and deployment due to their versatility, and their computational capabilities are typically suitable for most IoT deployment scenarios.

The experimental testbed consists of (i) a laptop as the deployment controller, (ii) a Raspberry Pi 4 as an IoT node and (iii) a Raspberry Pi 3B+ as an edge node. The Raspberry Pi 4 Model B has a Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit SoC at 1.5 GHz with 4 GB LPDDR4-3200 SDRAM. The Raspberry Pi 3 Model B+ has a Broadcom BCM2837B0 Cortex-A53 (ARMv8) 64-bit SoC at 1.4 GHz with 1 GB LPDDR2 SDRAM. The laptop has an Intel i7-8565U at 2.00 GHz, 16 GB RAM and a GeForce GTX 1050 GPU. Figure 4 illustrates the different devices in the experimental setup.

**Figure 4.** Experiment setup with IoT devices/nodes (RPi 4B), IoT gateway (RPi 3B+), deployment controller (laptop).

There are multiple stages in the IoT redeployment process. The deployment controller first reads the *build.yml* file and sends it to the IoT gateway. Upon sending the *build.yml* file, the deployment controller opens a stream pipe to the IoT gateway and starts streaming a large data payload. While the segmented data are being received at the IoT gateway, the *build.yml* is stored in memory and the large payload for the image rebuild process is written to the Raspberry Pi's disk. After the streaming process has completed, the IoT gateway again reads the *build.yml* file and verifies whether the redeployment request must be forwarded to an IoT device or the gateway should apply the update to itself.

If the redeployment request must be forwarded to an IoT device, the IoT gateway opens a direct stream pipe to the IoT device and proceeds to send the *build.yml* first, then streams the payload that was previously saved to the IoT gateway's disk. As each segment of the stream data is received at the IoT device, the IoT device writes the file to the Raspberry Pi's disk. After the stream process is complete, the IoT device automatically incorporates the large payload as part of the build process for the new image.

The IoT device build process starts with the helper process, as shown in Figure 2. The helper process first terminates the active container, rebuilds the new Docker image with the included payload and runs the new Docker image. When the new image successfully starts, the helper deletes the previous image, records the elapsed time for building the new image and starting the new container, and forwards the results back to the IoT gateway.
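The container replacement sequence above can be sketched as the Docker CLI commands the helper process could issue, together with the elapsed-time measurement. The sketch generates the commands rather than executing them, so the ordering is easy to inspect; the container and image names are illustrative assumptions.

```javascript
// Sketch: the helper's rebuild sequence as an ordered list of Docker commands,
// plus a timer for measuring the rebuild-and-restart duration.
function rebuildPlan(containerName, imageTag, buildDir) {
  const started = Date.now();
  const commands = [
    `docker stop ${containerName}`,                           // terminate active container
    `docker build -t ${imageTag}:new ${buildDir}`,            // rebuild with new payload
    `docker run -d --name ${containerName} ${imageTag}:new`,  // start new image
    `docker image rm ${imageTag}:old`,                        // delete previous image
  ];
  return { commands, elapsedMs: () => Date.now() - started };
}
```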

#### **5. Experimental Evaluation**

This section presents an experimental evaluation of the proposed framework by investigating five distinct scenarios using the same testbed setup described in Section 4: (i) IoT device redeployment; (ii) IoT gateway redeployment; (iii) IoT device sensor fault detection; (iv) IoT device redeployment failure detection; and (v) IoT gateway redeployment failure detection.

#### *5.1. Scenario 1: IoT Device Redeployment*

In this scenario, the aim is to measure the time required for a redeployment request to be sent from the deployment controller, for the IoT device to successfully fulfil the request and for a response to be sent back to the corresponding IoT gateway. The data collected from this experiment are used as a baseline to determine how long, on average, the deployment controller should wait when no response is received from the IoT devices. Devices that exceed this threshold are considered to have possibly encountered an error, and further diagnostics on those devices may be required.
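One way to turn such baseline measurements into a wait threshold is a mean-plus-margin rule. The paper does not specify the rule used, so the following sketch assumes a mean plus `margin_factor` standard deviations, which is one common choice:

```python
from statistics import mean, stdev

def response_timeout(samples, margin_factor=2.0):
    """Derive a response-timeout threshold (seconds) from baseline
    redeployment times: the sample mean plus a safety margin of
    `margin_factor` standard deviations. Devices exceeding this
    threshold are flagged for further diagnostics."""
    if len(samples) < 2:
        return samples[0] * margin_factor  # fallback for a single sample
    return mean(samples) + margin_factor * stdev(samples)
```

For example, baseline times of 150, 152 and 148 s give a threshold of 154 s with the default margin.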

Table 1 contains the results of 3 consecutive redeployment requests.

**Table 1.** IoT device redeployment duration.


The *Test Number* column identifies each test and *Elapsed Time* is the overall duration of the redeployment actions described below.

The redeployment processes/steps across multiple IoT devices are described as follows (presented in Figure 5):


**Figure 5.** Scenario 1: Redeployment process across multiple IoT device(s).

Scenario 1 explores the average time, in seconds, for a redeployment request to be successfully fulfilled. The results can be used as a benchmark to set the time window for which the deployment controller should wait for responses from the IoT device(s). If the response time exceeds this window, the system assumes the device was unable to migrate to the new version and runs further diagnostics to determine whether the device is fully offline or requires another redeployment request.

Further exploration of Scenario 1 involved varying the size of the redeployment package in 50 MB increments while recording the time during which no data were received by the system as the IoT devices migrated to the new build.

Figure 6 displays a linear trend: a larger build package sent across the network increases the time it takes for the redeployment request to be completed. A 50 MB redeployment package takes approximately 150 s to redeploy successfully, while a 100 MB package requires 250 s on average. Each 50 MB increase in package size adds about 100 s.
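A linear trend like this can be captured with an ordinary least-squares fit and used to extrapolate redeployment times for package sizes that were not measured. The two data points below are illustrative values read off the Figure 6 description, not the paper's raw data:

```python
def fit_line(points):
    """Ordinary least-squares fit y = a + b*x for (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope (s per MB)
    a = (sy - b * sx) / n                          # intercept (s)
    return a, b

# Illustrative points from Figure 6: (package size in MB, redeploy time in s)
a, b = fit_line([(50, 150), (100, 250)])
predict = lambda size_mb: a + b * size_mb  # e.g., extrapolate to 150 MB
```

With these two points the fit yields a slope of 2 s per MB, consistent with the roughly 100 s added per 50 MB step.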

**Figure 6.** Redeployment Time vs. Redeployment Package Size.

Figure 7 displays a broadly linear trend: the larger the redeployment package, the longer the period during which no data are sent from the IoT devices to the IoT gateways. The difference between each 100 MB increase in package size is greater than 40 s, whereas the difference between 50 MB steps is comparatively small.

**Figure 7.** Data Inactivity vs. Redeployment Package Size.

Although Scenario 1 provides a benchmark for the average time it takes to successfully fulfil a redeployment request, it does not account for the case where an IoT device does successfully migrate to the new version but takes longer than the response timeout window. In that case, the deployment controller assumes something is wrong with the IoT device and resends the redeployment request, forcing the already updated device to update again. The Scenario 1 results also do not account for very large or very small build packages, which can greatly vary the average time required to complete a redeployment request.

#### *5.2. Scenario 2: IoT Gateway Redeployment*

Scenario 2 explores the average time, in seconds, for a gateway redeployment request to be successfully fulfilled. The results can be used as a benchmark to set the time window for which the deployment controller should wait for responses from the IoT gateway(s). If the response time exceeds this window, the system assumes the gateway was unable to migrate to the new version and runs further diagnostics to determine whether the gateway is fully offline or requires another redeployment request.

**Table 2.** IoT gateway redeployment times.


Table 2 contains the results of 3 consecutive gateway redeployment requests. The *Test Number* column identifies each test and *Elapsed Time* is the overall duration of the redeployment actions described below.

The redeployment processes/steps across multiple IoT gateways are described as follows (presented in Figure 8):


**Figure 8.** Scenario 2: Redeployment process across multiple IoT gateway(s).


Further exploration of Scenario 2 involved varying the size of the redeployment package in 50 MB increments while recording the time during which no data were received by the system as the IoT gateways migrated to the new build.

Figure 9 displays a clearly linear trend: a larger build package sent across the network increases the time it takes for the redeployment request to be fulfilled. A 50 MB redeployment package takes approximately 110 s to redeploy successfully, while a 100 MB package requires 185 s on average. Each 50 MB increase in package size adds approximately 80 to 90 s.

**Figure 9.** Redeployment Time vs. Redeployment Package Size.

Figure 10 displays a linear trend: a larger build package sent across the network increases the time during which no data are received from the IoT devices. The difference between each 100 MB increase in package size is not constant, and starts to increase sharply once the redeployment package exceeds 200 MB.

**Figure 10.** Data Inactivity vs. Redeployment Package Size.

Although Scenario 2 provides a benchmark for the average time it takes to successfully fulfil a redeployment request, it does not account for the case where an IoT gateway does successfully migrate to the new version but takes longer than the response timeout window. In that case, the admin panel assumes something is wrong with the IoT gateway and resends the redeployment request, forcing the already updated gateway to update again. The Scenario 2 results also do not account for very large or very small build packages, which can greatly vary the average time required to complete a redeployment request.

#### *5.3. Scenario 3: IoT Device Sensor Fault Detection*

Scenario 3 tests how long it takes a gateway to detect that an IoT device has stopped sending sensor data, and how long the IoT device takes to recover and resume sending data to the IoT gateway. In this scenario, each message sent from the IoT device had a 60 percent chance of failing, and each recovery request sent from the IoT gateway to the IoT device had a 40 percent chance of success.

Scenario 3 explores the average time, in seconds, for an IoT device to recover from a sensor fault. The results can be used as a benchmark for how long, on average, an IoT device takes to recover one of its critical functions.
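Because each recovery request succeeds independently with a fixed probability, the number of requests until recovery follows a geometric distribution. A small sketch of that reasoning (the 40 percent figure is the scenario's configured success chance):

```python
def expected_recovery_attempts(success_prob):
    """Mean number of recovery requests until one succeeds, assuming
    independent attempts with fixed success probability (geometric
    distribution, mean 1/p)."""
    if not 0 < success_prob <= 1:
        raise ValueError("success probability must be in (0, 1]")
    return 1.0 / success_prob

# Scenario 3: each recovery request succeeds with p = 0.4, so on
# average 2.5 requests are needed before the device self-heals.
```

This analytic baseline helps interpret the measured recovery times: the average recovery time is roughly the expected attempt count multiplied by the per-attempt round-trip time.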

Exploration of Scenario 3 involved varying the frequency at which data are sent from every 1 s to every 10 s, and recording the elapsed time for the IoT gateway to detect an error and request the IoT device to self-heal. The IoT gateway was configured to check for errors every 5 s.

Figure 11 shows that the time for the IoT device to recover from sensor failure increases as the interval between messages grows, with the 10 s interval having the largest spread of recovery times. The 5 s interval in Figure 11 shows the smallest spread because messages are sent to the IoT gateway every 5 s and the gateway itself was configured to check for problems every 5 s. There is a clear indication that sending data more frequently allows for a much quicker recovery time with a smaller spread.
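The interaction between message interval and check interval can be summarised with a simple worst-case bound. This is a simplified model of the detection logic, not a formula from the paper: it assumes the gateway flags a fault at the first check after one full data interval has passed without a message.

```python
def worst_case_detection_delay(data_interval, check_interval):
    """Upper bound on the time from a silent failure to its detection:
    the gateway can only notice a missing message after one full data
    interval has elapsed, plus up to one polling interval before the
    next error check runs (both in seconds)."""
    return data_interval + check_interval
```

Under this model, a 5 s data interval with a 5 s check interval bounds detection at 10 s, while a 10 s data interval raises the bound to 15 s, matching the wider spread observed at longer intervals.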

**Figure 11.** Recovery time vs. Data Frequency.

Although Scenario 3 explores the average time, in seconds, for an IoT device to recover from a sensor fault, it is not a true measurement of real-world behaviour, since no physical sensors were attached and forced to fail and recover. The sensor failures in this experiment are purely software-based, with fixed probabilities for failure and recovery. Due to these software constraints, the results can only be regarded as a theoretical time bound.

#### *5.4. Scenario 4: IoT Device Redeployment Failure Detection*

Scenario 4 explores the average time, in seconds, for a redeployment request to successfully recover after encountering failure on the IoT devices. The results can be used as a guide to the average number of attempts required until redeployment succeeds, as well as the elapsed time from failure detection to recovery.

Exploration of Scenario 4 involved varying the chance of triggering an error during a redeployment request and measuring the elapsed time and the number of attempts required for the system to recover and update the IoT device.

Redeployment time is the overall duration of the following actions: (1) the deployment controller sends a redeployment request to the IoT gateway; (2) the IoT gateway verifies which IoT devices to forward the request to; (3) the IoT device receives and completes the redeployment request; (4) the IoT device sends a success or failure message back to the IoT gateway; (5) the IoT gateway checks whether the redeployment request was successful and, if it failed, requests the IoT device to perform the redeployment again; (6) upon success, the IoT gateway passes the message back to the deployment controller, the timer is stopped and the time taken is displayed.
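The retry loop at the heart of steps (4)–(6) can be sketched as follows. This is an illustrative stand-in, not the paper's code: the injected `attempt_outcomes` sequence replaces the random fault injection used in the experiment, and the optional `max_attempts` cap is a hypothetical guard against unbounded retries.

```python
def redeploy_with_retries(attempt_outcomes, max_attempts=None):
    """Re-issue a redeployment request until it succeeds, counting
    attempts. `attempt_outcomes` yields True for a failed build and
    False for a successful one (a deterministic stand-in for random
    fault injection). `max_attempts` optionally caps the retries.
    Returns (attempts, succeeded)."""
    attempts = 0
    for failed in attempt_outcomes:
        attempts += 1
        if not failed:
            return attempts, True       # redeployment succeeded
        if max_attempts and attempts >= max_attempts:
            return attempts, False      # gave up after the cap
    return attempts, False
```

Making the outcome sequence a parameter keeps the orchestration logic deterministic and testable, while the experiment itself drew outcomes from a configured error probability.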

Figure 12 shows that the redeployment time is not greatly affected while the chance of a build error is below 30 percent. The redeployment time begins to increase substantially from a 60 percent chance of build error.
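The shape of this curve is consistent with a simple retry model: if each attempt fails independently with probability *p*, the expected number of attempts is 1/(1 − *p*), so expected total time grows hyperbolically in *p* and stays nearly flat while *p* is small. This is an interpretive sketch, not a model fitted in the paper:

```python
def expected_redeploy_time(single_attempt_time, build_error_chance):
    """Expected total redeployment time when each attempt fails
    independently with probability p: the attempt count is geometric
    with mean 1/(1 - p), so total time = t / (1 - p)."""
    if not 0 <= build_error_chance < 1:
        raise ValueError("error chance must be in [0, 1)")
    return single_attempt_time / (1.0 - build_error_chance)
```

For a 100 s attempt, a 30 percent error chance raises the expectation only to about 143 s, while 60 percent already doubles it to 250 s, mirroring the sharp increase seen in Figure 12.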

**Figure 12.** Redeployment Time vs. Build Error Chance.

Scenario 4 does not account for the case where the IoT device attempting to update becomes stuck in an infinite loop by repeatedly trying to start a program that contains errors. Currently, the experiment retries for as long as needed until the IoT device has updated, which can result in an extremely high number of attempts.

#### *5.5. Scenario 5: IoT Gateway Redeployment Failure Detection*

Scenario 5 explores the average time, in seconds, for a redeployment request to successfully recover after encountering failure on the IoT gateways. The results can be used as a guide to the average number of attempts required until redeployment succeeds, as well as the elapsed time from failure detection to recovery.

Exploration of Scenario 5 involved varying the chance of triggering an error during a redeployment request and measuring the elapsed time and the number of attempts required for the system to recover and update the IoT gateway.

Redeployment time is the overall duration of the following actions: (1) the deployment controller sends a redeployment request to the IoT gateway; (2) the IoT gateway verifies whether it is required to execute the redeployment request; (3) the IoT gateway completes the redeployment request; (4) the IoT gateway helper process (see Figure 2) sends a success or failure message back to the IoT gateway; (5) the IoT gateway checks whether the redeployment request was successful and, if it failed, requests the helper process to perform the redeployment again; (6) upon success, the IoT gateway passes the message back to the deployment controller, the timer is stopped and the time taken is displayed.

Figure 13 shows that the redeployment time is not greatly affected while the chance of a build error is below 30 percent. The redeployment time begins to increase substantially from a 40 percent chance of build error and continues to grow. The results indicate that minimising build errors on the gateway as much as possible will ensure that the quality of service for end users is not degraded.

**Figure 13.** Redeployment Time vs. Build Error Chance.

Scenario 5 does not account for the case where an IoT gateway attempting to update becomes stuck in an infinite loop by repeatedly trying to start a program that contains errors. Currently, the experiment retries for as long as needed until the IoT gateway has updated, which can result in an extremely high number of attempts.

#### **6. Conclusions and Future Work**

The focus of this paper is to add a layer of failure recovery to deployed container-based IoT edge applications. A framework was proposed that monitors deployed IoT applications and detects whether an IoT end node or IoT gateway has potentially failed. Once a potential failure has been detected, the deployment controller rebuilds the application and deploys it to containers on the affected node, with the aim of minimising downtime. This paper evaluated an implementation of this approach through a series of experiments testing different configurations for viability.

This paper argues that low-latency IoT systems achieve significantly faster fault detection and recovery, because the system can verify more frequently whether a device has failed to send a message within a specified number of seconds. It also argues that decreasing the chance of build errors is critical to ensuring that the quality of service remains consistent even when devices require newly redeployed software. The paper has demonstrated the design and implementation of a framework that automatically detects faults, attempts automatic recovery and includes functionality to automatically reconfigure and redeploy software to all or targeted IoT devices or gateways. The proposed framework can be used to evaluate the occurrence of errors within a multi-tiered system and the average theoretical recovery time when an IoT device or gateway is down due to faults.

The work in this paper can be expanded by refactoring the framework to run on low-powered devices. The primary focus here was on varying many factors to observe the changes and impacts that could be expected on a real physical system. The paper uses higher-tier technologies such as Docker, Raspberry Pis and real-time socket streams to simulate the behaviour of low-powered IoT devices and gateways and the failures that can occur, and still presents a general overview of how faults can affect any multi-tiered IoT application. The proposed framework and the simulation of a multi-tiered IoT system can be improved by rate-limiting the network connection and processing power to more closely mimic the behaviour of lower-powered devices and the behaviours that arise when faults occur within these systems.

**Author Contributions:** Conceptualization, K.O., K.L. and J.K.; methodology, K.O., K.L. and J.K.; software, K.O.; validation, K.O., K.L. and J.K.; investigation, K.O.; resources, K.L.; data curation, K.O.; writing—original draft preparation, K.O.; writing—review and editing, K.O., K.L. and J.K.; visualization, K.O.; supervision, K.L. and J.K.; project administration, K.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Linked-Object Dynamic Offloading (LODO) for the Cooperation of Data and Tasks on Edge Computing Environment**

**Svetlana Kim <sup>1</sup>, Jieun Kang <sup>2</sup> and YongIk Yoon <sup>2,\*</sup>**


**Abstract:** With the evolution of the Internet of Things (IoT), edge computing technology is used to efficiently process the rapidly increasing data from various IoT devices. Edge computing offloading reduces data processing time and bandwidth usage by processing data in real time on the device where the data are generated or on a nearby server. Previous studies have proposed offloading between IoT devices through local-edge collaboration from resource-constrained edge servers. However, they did not consider nearby edge servers in the same layer with available computing resources. Consequently, quality of service (QoS) degrades due to the restricted resources of edge computing, and execution latency increases due to congestion. Finding an optimal target server to handle offloaded tasks in a rapidly changing dynamic environment remains challenging. Therefore, a new cooperative offloading method to control edge computing resources is needed to efficiently allocate limited resources between distributed edges. This paper proposes the LODO (linked-object dynamic offloading) algorithm, which provides an ideal balance between edges by considering their ready and running states. The LODO algorithm carries out the tasks in its list in order of the correlation between data and tasks, established through linked objects. Furthermore, dynamic offloading considers the running status of all cooperative terminals and schedules task distribution accordingly. This can decrease the average delay time and average power consumption of terminals. In addition, the resource shortage problem can be settled by reducing task processing through distribution.

**Keywords:** edge computing; offloading computation; distributed collaboration; data processing; dynamic offloading; IoT; gateways

#### **1. Introduction**

Nowadays, with the rapid evolution of technology, a vast number of Internet of Things (IoT) devices, including individual units, have improved processing ability (a robust computing environment) with various embedded sensors. Due to this progress, IoT devices can perform multiple functions (such as receiving/refining data from sensors in real time, transferring, processing and storing) independently, without the computing resources of a server (external cloud) [1,2]. Edge computing has been proposed to move computing tasks from existing integrated service platforms on cloud servers towards IoT devices. Edge computing offloads the computational workload of devices to nearby computing devices, significantly reducing processing latency [3]. Existing high-latency, low-reliability cloud computing solutions struggle to support the transmission and task-processing delay requirements of time-critical cloud/IoT services. Edge computing is a centralized architecture where all nearby service requests are directed to a 'central' edge server. However, since the computing power of edge servers is not as great as that of cloud-based servers, issues such as resource limitations and computing latency between multiple competing tasks still occur [4,5].

**Citation:** Kim, S.; Kang, J.; Yoon, Y. Linked-Object Dynamic Offloading (LODO) for the Cooperation of Data and Tasks on Edge Computing Environment. *Electronics* **2021**, *10*, 2156. https://doi.org/10.3390/ electronics10172156

Academic Editors: Ka Lok Man and Kevin Lee

Received: 30 June 2021 Accepted: 30 August 2021 Published: 3 September 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Various approaches, such as Mobile Cloud Computing (MCC) and Multi-Access Edge Computing/Mobile Edge Computing (MEC), support complementary cloud computing solutions using services adjacent to the edge network to cope with this problem [6–8]. MCC offloading, which uses the unlimited resources of a cloud for some tasks, is a method to reduce the load on mobile devices. However, MCC has considerable disadvantages, such as low expandability, a long propagation distance between the cloud server and devices, excessive consumption of limited bandwidth, and privacy-protection and security problems [9]. Current MCC offloading methods cannot guarantee real-time data transfer and bounded task delay, making them unsuitable for latency-sensitive applications. On the other hand, MEC offloads tasks collaboratively by leveraging both the MEC server and the end device (such as a mobile phone). Due to the limited battery life of mobile devices and the growing number of latency-sensitive applications, offloading tasks comes with additional overhead in terms of latency and power consumption [10,11]. This problem was partially solved by dividing each task into a local task and an offload task: local tasks are processed on the end device, and offload tasks are performed on the MEC server. However, computing demand can increase in some areas, which can exacerbate network problems and even affect task execution times.

In recent years, much research has addressed the resource limitation and computing latency issues through scheduling strategies for collaborative task offloading in edge computing environments. The computing resources at the edge mainly comprise intelligent devices (note: intelligent devices, e.g., smart sensors and smartphones, can access a network, generating a considerable amount of network data, and are called edge devices/end devices) and edge servers. For example, [12–14] propose job scheduling algorithms that utilize the resources of cloud servers to handle heavily overloaded situations. The work in [15] leverages resources on edge servers by offloading all computing-intensive tasks of the edge device to the edge server. In this case, the computing resources and storage of the devices are not utilized properly and are wasted. In addition, multiple computation tasks or many devices may access the edge server at the same time; as a result, the workload increases, causing long queues and task processing delays. For limited computational resources, multi-MEC systems have been proposed with joint communication offloading methods at the ends and edges [16–19]. However, due to the dynamic environment of the computing system, it is not easy to achieve good task offloading performance. In addition, resource-rich IoT devices do not collaborate and remain underutilized, resulting in a waste of their computing resources.

Since task offloading plays a critical role in edge-based services, the edge must take full advantage of IoT communication and computational resources. To this end, the edge server needs a balanced task scheduler that decides which tasks to offload to which IoT devices according to the characteristics of the computing task and the device status. Our previous paper [20] suggested a distributed-collaboration architecture for computation offloading. The architecture forms a balanced collaboration system by individually assigning the master node, the second node and the action node to edge nodes, covering all connected and communicable areas. It balanced computing resources through edge-to-edge collaboration and minimized latency caused by relatively unbalanced overloads. Based on the architecture in [20], this paper proposes the LODO (Linked-Object Dynamic Offloading) algorithm. LODO focuses on the reasonable use of the computation resources of IoT devices to process computing tasks efficiently. The IoT devices are called edge nodes in the LODO algorithm. The LODO algorithm receives the computing tasks offloaded by the edge nodes and uses the hybrid state to offload the tasks according to the resources and states of all edge nodes. The offloading location of computing tasks is uncertain and takes various forms, as do the computing resource status and performance gaps between nodes. Considering both characteristics, the computing resources of edge nodes can be fully and evenly utilized. The main feature is that a computing task can be reasonably offloaded to different edge nodes through collaborative processing.

The main contributions of this article are summarized as follows:


The rest of this paper is organized as follows. Section 2 formulates the Dynamic Offloading Method (DOM) with hybrid states. Section 3 introduces the details of the LODO approach for collaborative offloading. The scenario and results are presented in Section 4. Section 5 presents the discussion. Finally, Section 6 concludes the paper.

#### **2. Dynamic Offloading Method (DOM) with Hybrid States**

Based on our previous paper [20], this section identifies the causes of load in data-linked and task-linked processing, which are the major processes of an edge node. An edge node comprises edge gateways, end devices and schedulers. Edge gateways can provide computing (*CPU*), storage (memory), bandwidth and other system resources for edge computing operations. To solve computation offloading efficiently, it is essential to use these computing resources reasonably, depending on the offloading configuration. For example, suppose the goal is to reduce the load of data-linked processing. In this case, memory utilization matters more than the *CPU* ratio, so the edge with more storage resources should be used. On the other hand, task-linked processing increases *CPU* usage, so an edge with free computing resources should be used. Using edge resources according to the processing status can increase edge resource utilization and reduce the average execution time of computational tasks.

The resource range that an edge node can process is defined as 80% of *CPU* and 90% of memory. We define this as the reference overload threshold. If a node exceeds this threshold range, as detected by a monitoring process, the overload is classified as data-linked or task-linked: high *CPU* usage is classified as a task-linked overload, and high memory usage as a data-linked overload (ref. Figure 1).
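The classification rule above can be expressed directly in code. This is a minimal sketch of the monitoring decision using the paper's stated thresholds (80% CPU, 90% memory); the function and label names are illustrative.

```python
CPU_THRESHOLD = 0.80   # reference overload threshold for CPU
MEM_THRESHOLD = 0.90   # reference overload threshold for memory

def classify_overload(cpu_usage, mem_usage):
    """Classify an edge node's state from monitored utilisation
    (fractions in [0, 1]): high CPU -> task-linked overload, high
    memory -> data-linked overload, both -> data & task overload."""
    cpu_over = cpu_usage > CPU_THRESHOLD
    mem_over = mem_usage > MEM_THRESHOLD
    if cpu_over and mem_over:
        return "data & task overload"
    if cpu_over:
        return "task-linked overload"
    if mem_over:
        return "data-linked overload"
    return "normal"
```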

Depending on the hybrid state of the edge node, the dynamic offloading method (DOM) defines an overload range that considers the correlation between data and tasks. The following section describes the full set of activity states (hybrid states), discussing and formulating the overload issues arising from the data and tasks of an edge node using DOM.

**Figure 1.** The resource range for data-linked and task-linked.

#### *2.1. Expression of Hybrid States on Edge Node*

As shown in Figure 2, edge computing performs collaborative offloading by monitoring the state of all edge nodes, which changes dynamically. An edge node consists of end devices, which receive, store and preprocess the sensing data, and edge gateways, which analyse and output the results according to the purpose of the domain. Since edge nodes have different resources and characteristics, real-time monitoring is a requirement in an ever-changing environment. In addition, the data and performance characteristics during analysis differ depending on the purpose of the domain, so the offloading characteristics may also differ. Representative offloading characteristics can be classified into "offloading due to data", "offloading due to task" or "offloading due to data & task", and the offloading range must be determined according to each characteristic. According to the current offloading execution state, suitable edge nodes must also be provided.

**Figure 2.** Monitoring process with hybrid states on edge node.

To use computing resources reasonably, the hybrid state of a computing task is divided into three statuses: the "activation status", the "dynamic offloading status" and the "slack space". The activation status accurately assesses and incorporates the edge node's state consumption due to the offloaded tasks. The data-receiving and computing processes are the functions from which an offloading method is derived, attributing the problem to data and/or task according to the constrained computing resource. The data offloading method applies when the edge node's memory usage exceeds 90%, and the task offloading method applies when its *CPU* usage exceeds 80%.

Dynamic offloading between edge nodes is one of the essential factors in determining the task range based on the offloading method. According to the method's attributes, the transmitting offloading state extracts the range of data-linked or task-linked offloading. The receiving offloading state selects the offloading range that can be tolerated within the slack space of the cooperative edge node, excluding 10% of memory and 20% of *CPU*. Here, the slack space is the value obtained by subtracting the activation status and the dynamic offloading status from the hybrid state. Like the hybrid state, this value describes the state of the edge node; it should retain at least 10% memory or 20% *CPU* to prevent overload.

Equation (1) computes the hybrid state of the edge node from its dynamic activity and slack space, depending on the data and task computation functions. Equation (1) indicates the total performance state of an edge node: based on data and task, the hybrid state is the sum of the current performance (activation) state of Equation (1a), the dynamic offloading status of Equation (1b), and the slack space.

$$\text{Hybrid state}\{CPU, Memory\} = \text{Activation}\{C_i, M_i\} + \text{Dynamic offloading Status}\{C_i, M_i\} + \text{Slack Space}\{C_i, M_i\}. \tag{1}$$

Equation (1a) shows the current activation state, where *C<sup>i</sup>* is *CPU* and *M<sup>i</sup>* is memory. It is defined as the sum of the edge node's data-receiving process and the computing process for data and tasks.

$$\text{Activation status}\{CPU, Memory\} = \text{Receiving Data Process}\{C_i, M_i\} + \text{Data \& Task Computation Process}\{C_i, M_i\} \tag{1a}$$

Equation (1b) defines the dynamic offloading status as the sum of the transmission offloading state caused by overloads in the edge node and the receiving offloading state for collaboration with other nodes.

$$\text{Dynamic offloading Status}\{CPU, Memory\} = \text{Transmission offloading Process}\{C_i, M_i\} + \text{Receiving offloading Process}\{C_i, M_i\} \tag{1b}$$
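Rearranging Equation (1), the slack space is what remains after subtracting the activation and dynamic offloading states from the node's total (hybrid) state. The sketch below also encodes the overload-prevention rule from Section 2.1 (retain at least 20% CPU and 10% memory); the function names and the normalised `(cpu, memory)` tuple representation are illustrative choices, not the paper's notation.

```python
def slack_space(activation, offloading, total=(1.0, 1.0)):
    """Slack space per Equation (1): the (CPU, memory) capacity left
    after subtracting the activation and dynamic-offloading states
    from the node's hybrid (total) state. All values are fractions."""
    return tuple(t - a - o for t, a, o in zip(total, activation, offloading))

def has_safe_slack(slack, min_cpu=0.20, min_mem=0.10):
    """Overload-prevention rule: a node should retain at least 20%
    CPU and 10% memory as slack."""
    cpu, mem = slack
    return cpu >= min_cpu and mem >= min_mem
```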

#### *2.2. Definition of Assigning an Offloading Range*

To assign the offloading range, the group must be divided into data-linked or task-linked. First, the offloading range based on memory is grouped by data connectivity, minimizing overlapped offloading. In this computation, let D = {*d*1, *d*2, . . . , *dn*}, where D is the set of edge gateways (devices) and *d<sup>i</sup>* is the *i*-th edge gateway (device). Equation (2) gives the energy *E* generated by moving data from edge node *d*1 to *d*2. *S*(*d*1, *d*2) is the total amount of data that must be transmitted between the nodes; hence, the group's energy consumption based on data connectivity is proportional to the amount of data. Therefore, after selecting the collaboration edge nodes, the range with minimum transmission energy is extracted from the amount of data.

$$E(d_1, d_2) = ((w \times S(d_1, d_2)) \div bandwidth) \times Power_{wifi} \tag{2}$$
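Equation (2) can be computed directly: the weighted data volume divided by bandwidth gives a transfer time, which multiplied by the Wi-Fi radio power yields the transfer energy. The sketch below is unit-agnostic, as the paper does not specify units; the function name is illustrative.

```python
def transfer_energy(data_amount, bandwidth, power_wifi, weight=1.0):
    """Equation (2): energy E(d1, d2) for moving S(d1, d2) units of
    data between two edge nodes. (weight * data) / bandwidth is the
    transfer time, multiplied by the Wi-Fi power to give energy."""
    return ((weight * data_amount) / bandwidth) * power_wifi
```

For example, 800 units of data over a bandwidth of 80 units per second at 0.5 W yields 5.0 units of energy, and the result scales linearly with the data volume, as the text notes.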

Second, the offloading range based on *CPU* minimizes the processing time of the optimal edge node by grouping tasks by task connectivity. Because *CPU* capacities vary with the available slack space, each edge node has a different processing speed. It is therefore essential to assign the offloading range carefully, because the total processing time depends on how the *CPU* is used. Equation (3) shows the difference in processing time between the existing node and the collaborating node of the task-connectivity group for offloading. Here, *to*(*i*) is the time consumed on the overloaded edge node and *tc*(*i*) is the time consumed on a collaborating node. The weighting factor *w* depends on the slack space of the collaborating node: as the slack space increases, the weighting factor (*w*) also increases, so the consumed time becomes much shorter than on the overloaded edge. Therefore, the range with the maximum weighting factor must be extracted as the offloading range when selecting the collaboration edge node.

$$t_{c(i)} = t_{o(i)} \div w \tag{3}$$

Therefore, the defined equations must balance *CPU* and memory across the data and task ranges. The aim is to extract the range that minimizes energy consumption and completion time for offloading.
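Equation (3) implies that, among candidate collaboration nodes, the one with the largest weighting factor yields the shortest collaborated execution time. A small sketch, with hypothetical node names and weights:

```python
def collaboration_time(t_overloaded: float, w: float) -> float:
    """t_c(i) = t_o(i) / w (Equation (3)): more slack space -> larger w -> faster."""
    return t_overloaded / w

# Hypothetical weighting factors for three candidate collaboration nodes.
candidates = {"edge-A": 1.2, "edge-B": 2.5, "edge-C": 1.8}

best = max(candidates, key=candidates.get)        # node with maximum w
print(best, collaboration_time(10.0, candidates[best]))  # edge-B 4.0
```

Selecting by maximum *w* and selecting by minimum *tc*(*i*) are the same choice for a fixed *to*(*i*), which is why the algorithm can rank nodes by slack space alone.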

#### **3. LODO (Linked-Object Dynamic Offloading) Algorithm**

The LODO algorithm provides two cooperative offloading options depending on the cause of the overload. In Section 2.1, the cause of overload was defined according to the state of an overloaded edge node observed through monitoring. When a memory overload occurs, the algorithm executes the data-linked offloading method described in Section 3.1. If the cause of the overload is the *CPU*, it executes the task-linked offloading option described in Section 3.2.

#### *3.1. Data-Linked Algorithm*

Data-linked offloading (Algorithm 1), which considers data correlations, improves energy efficiency by choosing, based on memory, the data redundancy, the range that minimizes transmission energy, and the collaboration node. A collaboration data group is created based on the currently executing data. Figure 3 explains how a group is created when offloading, based on a scenario in which Task1 continues to execute on the existing edge. The task-based offloading range is derived from the least frequently used of the remaining data, excluding the data used by Task1. Task priority can be set according to the existing domain, and a list expression of task offloading can be created according to the criteria for each task and the data freed by offloading it. For example, offloading the least frequent Task8 reduces the amount of data12 on the existing edge node, and additionally offloading Task7 to free up more memory space frees the amount of data8. The list of expression tasks is the input to the algorithm's data dependency groups. The output comprises the minimal offloading group, the collaboration node, and the total consumed memory.


**Figure 3.** List expression task based on data-centric offloading.

$$\text{Amount of memory in } List_j = \sum_{i=1}^{List[2]} g_{ki} \times DT_{ki} \times DS_{ki} \tag{4}$$

$$\text{Amount of } CPU \text{ in } List_j = \sum_{i=1}^{List[2]} g_{ki} \times DT_{ki} \times DS_{ki} \times CT_{ki} \tag{5}$$

Each group's memory and *CPU* usage can be calculated via Equations (4) and (5). *gki* indicates whether the *i*-th data item is used in the *k*-th task, and *DTki* is its data type. *DSki* is the amount of sensing data instantaneously accumulated, i.e., the data size. *CTki* is the number of *CPU* cycles per bit of data and is the *CPU* value used in each task. The values of Equations (4) and (5) change depending on the tasks and data included in the corresponding list. From these values, the "possible offloading list" is extracted by matching groups whose *CPU* usage does not exceed the idle space of the collaboration edge node. In other words, depending on the cause of the overload, if memory is insufficient, the data relevance list that can free the most memory is selected as the offloading range. The selected data relevance list becomes the "data dependency group", which is the input to the algorithm. The list of feasible groups is sorted by examining their execution on collaboration nodes, and offloading is processed on each node sequentially. If the slack energy of a collaboration node is insufficient, the algorithm skips to the next node. Data-linked offloading stops once the overload is resolved; otherwise, it keeps offloading.
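Equations (4) and (5) are straightforward weighted sums over the (task, data) pairs in a list. A minimal sketch, where each tuple holds the assumed fields (*g*, *DT*, *DS*, *CT*); the encoding and example numbers are illustrative:

```python
# Per-group memory and CPU usage, following Equations (4) and (5).
# Each item is (g, DT, DS, CT): usage flag, data type factor,
# data size, and CPU cycles per bit.

def group_memory(items):
    """Eq. (4): sum of g * DT * DS over the list's (task, data) pairs."""
    return sum(g * dt * ds for g, dt, ds, _ct in items)

def group_cpu(items):
    """Eq. (5): sum of g * DT * DS * CT over the list's (task, data) pairs."""
    return sum(g * dt * ds * ct for g, dt, ds, ct in items)

group = [(1, 1, 100, 2), (1, 2, 50, 3), (0, 1, 400, 1)]  # g=0 item contributes nothing
print(group_memory(group))  # 200
print(group_cpu(group))     # 500
```

Because the *g* flag zeroes out unused data, the same list encoding serves both the data-linked and task-linked variants.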

**Algorithm 1** Data-linked offloading

**Input:** Next Execution, Data Dependency Group
**Output:** Offloading Group, Collaboration Edge Node, TotalMemory
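The listing for Algorithm 1 is not reproduced here. The loop described in Section 3.1 might be sketched as follows; the data structures (group dicts, node dicts, the `memory_needed` threshold) are assumptions for illustration, not the paper's exact formulation:

```python
def data_linked_offloading(dependency_groups, collab_nodes, memory_needed):
    """Sketch of the data-linked offloading loop (after Algorithm 1).
    dependency_groups: candidate groups, pre-sorted by memory freed (Eq. (4)).
    collab_nodes: dicts whose remaining slack memory/energy we update.
    memory_needed: memory that must be freed to resolve the overload.
    """
    offloaded, freed = [], 0
    for group in dependency_groups:
        for node in collab_nodes:
            # Skip to the next node when its slack is insufficient.
            if group["memory"] <= node["slack_mem"] and \
               group["energy"] <= node["slack_energy"]:
                node["slack_mem"] -= group["memory"]
                node["slack_energy"] -= group["energy"]
                offloaded.append((group["name"], node["name"]))
                freed += group["memory"]
                break  # group placed; try the next candidate group
        if freed >= memory_needed:
            break  # overload resolved: stop offloading
    return offloaded, freed
```

A usage example: with two groups G1 (80 Kb) and G2 (40 Kb), a node with 100 Kb of slack, a second node with 50 Kb of slack, and 100 Kb needed, G1 lands on the first node and G2 falls through to the second.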


#### *3.2. Task-Linked Algorithm*

The task-linked algorithm, which considers task correlations, reduces the overall time by selecting the largest *CPU* consumer, the range that increases energy efficiency during processing, and the collaboration node (Algorithm 2). Collaboration groups are created based on the tasks to be performed and the data to be used. Figure 4 shows a dependency system consisting of eight tasks; applications are typically configured in this complex way, with some tasks using the output of another task as their input. A dependency system is a workflow of dependent tasks that interact between modules. Task2 produces data9, which is the input of Task3, Task6, Task7, and Task8. Based on task connectivity, Task3, Task6, Task7, and Task8 cannot start execution until Task2 has executed and its output is available.


**Figure 4.** Workflow and Task Connectivity.

When performing tasks, an offloading group is created considering task relevance. Groups that consider task relevance are composed of one or more tasks, as shown in Figure 5.

**Figure 5.** Offloading group of Task relevance.

To select an offloading range, it is necessary to calculate the mobility of the data generated during offloading. Figure 6 illustrates the importance of task-connectivity group-based offloading through task mobility. In Case 1, Task3 and Task4 are offloaded; in Case 2, Task3 and Task6 are offloaded. Task3 and Task4 both use the result of Task2, the output of Task3 becomes the input of Task4, and they proceed sequentially. Since Task4's output is used by Task5 and Task6 simultaneously, the total number of movements is two. In Case 2, although Task3 and Task6 share Task2's output as input, Task6 also requires the output of Task4 as an input, and each output is in turn used as an input for other tasks; thus, the total number of moves is four. Even if the same amount of data is used, the delay increases as the total number of movements increases. Hence, the available memory and *CPU* are calculated through Equations (4) and (5), the appropriate lists are extracted, and the group with the least mobility is selected as the offloading range.
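Counting moves amounts to counting the distinct task outputs that must cross the boundary between the offloaded group and the tasks left on the edge node. A sketch under an assumed edge-list encoding of the Figure 4 workflow (the `T6 -> T7` edge is an assumption standing in for "each output is also used by other tasks"):

```python
# Assumed workflow edges (producer, consumer) approximating Figure 4.
edges = [("T2", "T3"), ("T2", "T4"), ("T3", "T4"),
         ("T4", "T5"), ("T4", "T6"), ("T6", "T7")]

def moves(group, workflow_edges):
    """One move per distinct output crossing the offloading boundary."""
    return len({src for src, dst in workflow_edges
                if (src in group) != (dst in group)})

print(moves({"T3", "T4"}, edges))  # 2  (Case 1: only T2's and T4's outputs cross)
print(moves({"T3", "T6"}, edges))  # 4  (Case 2: T2, T3, T4, and T6 outputs cross)
```

Note that a single output crossing the boundary counts once even if several tasks consume it, which is why Case 1 costs two moves rather than four edge crossings.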

**Figure 6.** Task connectivity group-based offloading through task mobility.

The task connectivity list described above becomes the "task dependency group", which is the input to the algorithm. The output comprises the smallest offloading group, the collaboration node, and the overall time. The list of feasible groups is sorted by examining their execution on collaboration nodes, and offloading is processed on each node sequentially. Offloading continues until the overload problem is resolved, and if the slack energy of a collaboration node is insufficient, the algorithm skips to the next node. The total output time is the time until offloading completes for a specific application and is calculated from the number of moves and the amount of data.
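The selection step of the task-linked variant, combining the *CPU* fit from Equation (5) with the mobility criterion, might be sketched as follows; the dict fields and numbers are illustrative assumptions:

```python
def select_task_group(groups, node_idle_cpu):
    """Sketch: among groups that fit the collaboration node's idle CPU
    (per Eq. (5)), pick the one with the fewest moves (least mobility)."""
    feasible = [g for g in groups if g["cpu"] <= node_idle_cpu]
    return min(feasible, key=lambda g: g["moves"], default=None)

groups = [{"name": "G1", "cpu": 300, "moves": 2},
          {"name": "G2", "cpu": 150, "moves": 4},
          {"name": "G3", "cpu": 900, "moves": 1}]  # too big for this node

print(select_task_group(groups, node_idle_cpu=400)["name"])  # G1
```

G3 would be the least-mobile choice, but it exceeds the node's idle *CPU*, so the feasibility filter runs before the mobility comparison.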


**Algorithm 2** Task-Linked offloading


#### **4. Scenario and Results**

This section presents an experimental scenario and the performance of the proposed LODO algorithm in its data-linked and task-linked variants. The algorithm is tested on the variables and tasks used in the forest fire scenario and the relationships between them. Based on the experimental results, the evaluation covers memory usage and execution time. Table 1 lists the tasks and variables used in the forest fire response scenario. In practice, the quantity and size of a variable can be the same or different.


**Table 1.** Tasks for forest fire response and data types used.

Tasks are prioritized based on what is essential or should be done first. Task 1 is the most basic analysis in the forest fire scenario, so it should be performed on the edge node; its priority is also high because it runs in real time. Before data processing and the algorithm are applied, the offloading range list consists of the 127 subsets of all tasks except Task 1. As the number of tasks and variables increases, the selectable offloading range becomes more diverse.
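The count of 127 is simply the number of non-empty subsets of the seven remaining tasks (2^7 − 1), which can be checked directly:

```python
from itertools import combinations

# Task2..Task8 are the candidates; Task1 stays on the edge node.
tasks = [f"Task{i}" for i in range(2, 9)]

candidate_ranges = [c for r in range(1, len(tasks) + 1)
                    for c in combinations(tasks, r)]
print(len(candidate_ranges))  # 127 == 2**7 - 1 non-empty offloading ranges
```

This exponential growth in candidates is why the data-linked and task-linked lists (Tables 2 and 3) prune the search to a small number of connectivity-based groups.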

In the case of memory overload, a list extracted based on data association is used. Table 2 shows the data correlation list from which an offloading range that can quickly free memory is derived.


**Table 2.** Data-linked list.

In the case of *CPU* overload, a list extracted based on task association is used. As in the data-linked case, excluding Task 1, the connectivity between the remaining tasks forms groups that reduce mobility between edges and secure *CPU* space. Table 3 shows the task connectivity list from which an offloading range that reduces mobility while quickly freeing *CPU* space is derived. Figure 7 shows the total number of moves for the task connectivity list (red dots). In the figure, the x-axis is each group in the task connectivity list, and the y-axis is the number of moves that occur when each group is offloaded.

**Table 3.** Task-linked list.


Table 4 compares the range to which offloading is applied, within the feasible collaboration usage (70% of memory and *CPU*), against the entire range, based on the lists in Tables 2 and 3. "Offloading transmission" is the memory size calculated from the data included in the offloading group. "Offloading performance" is a value calculated from the number of *CPU* cycles per bit of data; it varies with the data used in the task and the task's function. With the data-linked algorithm groups, the average data movement is 1,574,156 Kb, which offloads more data than before applying the algorithm, but the average memory secured is 8.15%, which is faster than before applying the algorithm. When the task-linked algorithm groups are used, the offloading task group sends a smaller amount than before applying the algorithm, at an average of 4.19 GHz; free space can be secured faster, and the average number of moves of 2.63 cycles enables fast processing.

**Figure 7.** Number of movements over the entire offloading range.

**Table 4.** Data & Task algorithm performance evaluation.


#### **5. Discussion**

Spatial dependence has two meanings. The first is data dependency: the data generated by one sensor is not used in only one task or application, and can be used in multiple spaces. For example, diffusion-direction data is generated by combining wind, slope, and temperature data with circumstances, such as earthquakes, that affect different regions at time intervals. The second is task dependency. Task dependencies include the order of tasks as well as the data correlation between tasks. In refs. [21,22], under data dependency, the output of a task located in Area A can be an input in Area B.

Several tasks must be processed, such as identifying the type of forest fire, predicting how the wind changes, and identifying how ignition materials are distributed in fire areas by season, temperature, and humidity. Additionally, some cases involve data used by two or more tasks simultaneously, or a complicated, mixed order of tasks, for example, determining the diffusion range together with the location of forest fires. To identify the accurate diffusion range and location, many tasks in various areas may be linked and applied to each other. In other words, multiple tasks do not follow a single serial workflow but spread across many branches, entangled like spider webs.

However, most offloading work, such as [23], considers only serial processing and does not include the correlations and interactions of data and tasks. One or more outputs can be applied to the next task after starting four different tasks simultaneously, or a new input can change a result that has already been executed. Because the output data transferred from a server to a local device is much smaller than the input data, the time overhead of the backhaul link is often ignored [24,25]. This view considers only a series of dynamic applications, not static tasks. If one output can be the input of many tasks, an increase in the number of tasks increases both redundant data and output data; hence, the overhead from additional delay and energy consumption during transmission cannot be disregarded.

Correlations of data and tasks must be considered when dividing tasks by granularity and dependency in various circumstances. Therefore, we proposed the collaborative LODO algorithm, which determines idle space and offloads data and tasks based on spatial dependencies.

#### **6. Conclusions**

This paper considers collaborative edge computing for offloading, where an edge node can collaborate on tasks. We proposed the energy-efficient LODO algorithm, which extracts the offloading scope and the collaboration nodes to save energy and reduce execution time at the edge nodes. The formulated hybrid states of an edge node can predict overloads through monitoring and are applied in the LODO algorithm. Furthermore, by selecting an offloading range that considers data correlations and task connectivity, the LODO algorithm reduces data redundancy and delay and minimizes energy consumption during offloading. Therefore, a collaborative offloading model based on the LODO algorithm minimizes the energy of the entire edge network and executes more efficiently in a shorter time.

**Author Contributions:** Conceptualization, S.K., J.K. and Y.Y.; methodology, S.K.; formal analysis, S.K. and J.K.; investigation, J.K.; resources, S.K. and J.K.; data curation, J.K.; writing—original draft preparation, S.K. and J.K.; writing—review and editing, S.K.; visualization, J.K.; supervision, Y.Y.; project administration, S.K.; funding acquisition, Y.Y. and S.K. S.K. and J.K. contributed equally to this study as co-first authors. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2019R1I1A1A01064054). This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2018R1D1A1B07047112).

**Data Availability Statement:** The data are described within the article. The data that support the findings of this study are available from the corresponding author upon reasonable request.

**Acknowledgments:** The authors would like to thank the editor and the anonymous reviewers for their helpful comments, which improved the quality of this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

