1. Introduction
The growing popularity of machine learning (ML) solutions in recent years has led to a sharp increase in the number of industry deployments. Many ML pipelines that could previously be integrated only into environments with vast computational resources, most often cloud-based, have been further expanded to include edge devices or even the Internet of Things (IoT) [
1], i.e., the Cloud–Edge–IoT continuum. In the Cloud–Edge–IoT continuum, processing and storage tasks are performed on all levels of the network hierarchy and not just in the cloud [
2]. Such deployments often face similar challenges, many of them purely related to the infrastructure: its energy consumption, scalability, and the lack of user-friendly tools [
3]. Therefore, many deployments require similar solutions, which for ML often come in the form of general-purpose ML inference servers.
Here, an ML inference server is understood as an application that, upon receiving a request containing data intended as input for inference, uses an ML model to obtain predictions, which it returns in the form of a response. ML inference servers can function as standalone deployments without additional software infrastructure, thus making them more suitable for edge environments. As a consequence, numerous approaches to ML inference servers have been introduced by the industry.
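To make this definition concrete, the following minimal Python sketch illustrates the request–response cycle of a generic ML inference server; the class and method names are purely illustrative and do not reflect the internals of any particular server discussed in this work.

```python
import numpy as np

class MinimalInferenceServer:
    """Illustrative only: load a model once, then answer inference requests."""

    def __init__(self, model):
        self.model = model  # any object exposing a predict() method

    def handle_request(self, payload: bytes) -> bytes:
        # 1. Deserialize the request payload into model input.
        features = np.frombuffer(payload, dtype=np.float32)
        # 2. Run the model to obtain predictions.
        predictions = self.model.predict(features.reshape(1, -1))
        # 3. Serialize the predictions into the response.
        return np.asarray(predictions, dtype=np.float32).tobytes()

class DummyModel:
    def predict(self, x):
        return x.sum(axis=1)  # stand-in for a real ML model

server = MinimalInferenceServer(DummyModel())
response = server.handle_request(np.ones(4, dtype=np.float32).tobytes())
```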
In contrast, recent academic works tend to focus on developing elaborate infrastructures dedicated to model serving in the form of machine learning inference
systems [
4,
5,
6]. These systems are not limited to a single application but instead leverage existing containerization and orchestration tools to support and manage the deployment of complex architectures. They are often integrated with well-known platforms such as Docker [
7] and Kubernetes [
8]. As a result, it is common for ML inference pipelines inside these systems to be implemented as chains or graphs of networked services, in which each of the ML models, as well as preprocessing and postprocessing steps, is encapsulated as a separate container. In the scope of this work, we assume an ML inference pipeline to be a series of complex data processing steps, which typically have to be deployed alongside the ML model [9], with a particular focus on providing ML inference services.
There are many advantages of the inference system approach, especially with the growing length of ML pipelines and increasing model complexity in modern deployments [
4]. First of all, it makes the pipeline easily modifiable. Any changes to the selection or ordering of data processing steps do not necessitate the redeployment of the whole pipeline. Instead, its composition can often be altered with just a few commands provided by the integrated orchestration tools (for example, Kubernetes). Those commands are usually well-documented and known to the maintainers of the infrastructure, which lowers the barrier to adoption [
10,
11]. Only when a completely new data processing step is added does it warrant additional effort from a software developer. As such, many of the changes to the pipeline can be made very quickly. In addition, most of the data processing code has been separated into independent components by design, which promotes reusability. Already developed components can be easily repurposed into other deployments and pipelines. Because of that, this approach is inherently extensible and flexible, as it can accommodate complex pipelines involving multiple ML models as ensembles, as well as multiple preprocessing steps. Furthermore, the encapsulation of data processing steps as standalone applications simplifies the development of custom components. Instead of ensuring that new code is well-integrated within a pre-existing server (a solution that would be harder to debug and deploy), the user only has to worry about providing a well-performing component. Overall, the concept of flexible ML inference pipelines enables rapid, iterative experimentation with the deployed ML workflow. This focus on experimentation may be especially beneficial for projects involving applied ML research tested in pilot conditions, as it allows many variants of the workflow to be easily created and tested. The users can adapt the pipeline to the real-life environment by adding or removing custom preprocessing and postprocessing steps, testing multiple versions of a given module, and changing the behavior of existing transformations through parameter modification, all without the need to rebuild the application as a whole.
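As a purely illustrative sketch of this flexibility (not the API of any specific inference system), a pipeline can be modeled as an ordered list of interchangeable callables, so that steps can be added, removed, or reordered without modifying the remaining code:

```python
from typing import Callable, List
import numpy as np

Step = Callable[[np.ndarray], np.ndarray]

def window(x: np.ndarray) -> np.ndarray:
    return x[-32:]                                 # keep only the most recent samples

def normalize(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-8)

def dummy_model(x: np.ndarray) -> np.ndarray:
    return np.array([float(x.max() > 2.0)])        # stand-in for the ML model

def run_pipeline(steps: List[Step], data: np.ndarray) -> np.ndarray:
    for step in steps:                             # each step consumes the previous output
        data = step(data)
    return data

# Recomposing the pipeline amounts to editing this list:
pipeline: List[Step] = [window, normalize, dummy_model]
result = run_pipeline(pipeline, np.random.randn(128))
```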
However, the design philosophy of treating each step in the pipeline as a separate container also introduces a certain performance overhead. Even though the employment of Kubernetes orchestration in edge environments is not only possible but also increasingly popular [
12], limited resources still necessitate a different architectural approach in edge-friendly solutions. Firstly, although the encapsulation of applications in, for instance, Docker containers has only a limited influence on the overall CPU usage, it is a noticeable influence nonetheless [
13]. This influence is then multiplied with each preprocessing and postprocessing step, thus limiting the length of the pipeline. Secondly, dividing preprocessing steps that use the same modules between different containers, which run separate processes, means that the containers have to reserve RAM or GPU memory for the same modules multiple times. This factor increases the overall usage of those resources [
7]. Finally, and perhaps most importantly, this architecture necessitates repeated serialization and deserialization of data as it is passed from one processing step to the next. Including such a transformation between every pair of steps diminishes the efficiency of the overall solution. These drawbacks are less relevant for deployments spread across multiple powerful devices than for those involving a single machine with limited communication and computation capabilities. As such, providing dynamic pipelines by dividing discrete steps between separate containers fits the characteristics of cloud environments better than those of edge environments.
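The following micro-benchmark sketch illustrates the source of this overhead under simplified assumptions: an in-process pipeline passes intermediate arrays by reference, whereas a containerized pipeline must serialize and deserialize them on every hop (here approximated with pickle and without the additional cost of actual network transfer). The absolute numbers depend on data size and hardware and are not results from this paper.

```python
import pickle
import time
import numpy as np

data = np.random.rand(1_000_000).astype(np.float32)   # ~4 MB intermediate result

def step(x: np.ndarray) -> np.ndarray:
    return x * 2.0

# In-process pipeline: the array is passed by reference between steps.
t0 = time.perf_counter()
out = step(step(step(data)))
in_process = time.perf_counter() - t0

# Containerized pipeline (approximated): every hop serializes and deserializes the data.
t0 = time.perf_counter()
buf = pickle.dumps(data)
for _ in range(3):
    x = pickle.loads(buf)          # deserialize on entry to the "container"
    buf = pickle.dumps(step(x))    # serialize before handing over to the next step
cross_process = time.perf_counter() - t0

print(f"in-process: {in_process:.4f}s, with per-hop (de)serialization: {cross_process:.4f}s")
```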
In the context of our research work on testing ML models in real-life environments, there was a need for a solution that would facilitate easy experimentation with ML inference pipelines by providing the flexibility, extensibility, and reusability of ML inference systems. It should be stressed that this solution would have to be deployed in a very resource-constrained edge environment. The main challenge of this paper therefore lies in providing the aforementioned features while facing stringent environmental limitations. In this work, an approach designed to address this problem and developed within the scope of the ASSIST-IoT Horizon 2020 project [
14] is presented. Namely, the Modular Inference Server (MIS), a flexible solution for deploying ML inference pipelines in a wide variety of computational settings, is proposed along with the complementary Component Repository. The Component Repository supports managing and persisting components used to build ML inference pipelines.
The proposed system was designed with two real-life use cases in mind, which differ greatly in terms of the ML models used and the hardware requirements. Thus, a large focus was placed on obtaining an ML inference server general enough to fulfill the needs of both scenarios. Special attention was paid to ensuring compatibility with the variety of hardware platforms present in the use cases. In addition, the persistent storage of pipeline components was recognized as a useful addition, improving their overall scalability and reusability. Finally, the resulting framework still had to offer fast communication and low resource consumption or otherwise risk losing compliance with the stringent performance requirements of the use cases. As an ML inference server, the MIS achieves these goals by prioritizing a simpler architecture than that of ML inference systems and by not requiring any data transfer between server instances. As a consequence, the MIS does not support distributing its pipelines across multiple compute nodes.
In summary, the main contributions of this work consist of (1) an extensive comparison of freely available solutions for ML inference serving (including ML inference servers and ML inference systems); (2) a novel solution for integrating flexible and reusable ML inference pipelines into an ML inference server; (3) examples of two different ML inference pipelines designed for two real-life use cases and integrated into the MIS; and (4) the description and analysis of the results of experiments testing the performance of the solution in multiple scenarios motivated by real-life use cases. Our approach to integrating ML inference pipelines focuses on providing a combination of wide hardware support, lightweight communication, and scalability, which we were unable to find in existing works.
This paper is organized as follows.
Section 2 introduces the proposed approach to the problem of flexible inference pipelines in the Cloud–Edge–IoT continuum—the Modular Inference Server and the supplementary Component Repository.
Section 3 provides relevant context for experiments with the MIS in the form of use case descriptions.
Section 4 explains the setup of the experiments used to test the effectiveness of the solution.
Section 5 contains an analysis of the results obtained from conducted experiments. In
Section 6, the current state of the art is presented, with an emphasis given to the identified research gaps and how the MIS aims to address them. Here, the MIS is compared with other relevant approaches. Finally,
Section 7 discusses potential ambiguities and areas requiring future work, and
Section 8 formulates the final conclusions.
4. Experimental Setup
The setup for the experiments with the Modular Inference Server was designed to maximize the consistency of benchmark results while emulating the real deployment as much as possible. In all scenarios, the task of sending inference requests, analyzing inference results, and collecting various metrics was delegated to a separate machine: an x86-64 workstation. This separation allowed us to minimize the number of services that had to be run on the inference server, thus yielding cleaner benchmark results.
During the experiments, two different machines were used for hosting the Modular Inference Server. The first is the GWEN, the ARM64-based edge device developed as a part of the ASSIST-IoT project. The GWEN was designed to be power-efficient and thus has very limited processing capabilities. The second is an x86-64-based GPU server with ample computing resources. The detailed specifications of all machines used in the experiments can be found in
Table 1. The machines were connected via Gigabit Ethernet, with a TP-Link TL-SG108 L2 switch, as illustrated in
Figure 6.
The inference requests were made by a dedicated test driver application written in Scala and running on the client machine. The test driver uses the Apache Pekko Streams library to reactively and reliably manage streaming gRPC requests to the Modular Inference Server. It takes accurate (sub-microsecond) measurements of request and response times using the system’s monotonic clock. This allows for very precise calculation of the round-trip request–response latency in the experiments. The application was containerized for easier use and published under the Apache 2.0 license on GitHub (
https://github.com/Modular-ML-inference/benchmark-driver) (accessed on 3 April 2024) and Zenodo [
18], along with usage instructions.
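The actual test driver is implemented in Scala with Apache Pekko Streams; the Python fragment below only sketches the measurement principle it relies on, namely timestamping each request and its matching response with a monotonic clock and pairing them by request identifier. All names are illustrative.

```python
import time

sent_at = {}        # request id -> monotonic timestamp in nanoseconds
latencies_ms = []   # collected round-trip latencies

def on_request_sent(request_id: int) -> None:
    sent_at[request_id] = time.monotonic_ns()

def on_response_received(request_id: int) -> None:
    elapsed_ns = time.monotonic_ns() - sent_at.pop(request_id)
    latencies_ms.append(elapsed_ns / 1e6)

# Example usage with a stand-in for an actual gRPC round trip:
on_request_sent(1)
time.sleep(0.008)              # pretend the server took about 8 ms
on_response_received(1)
print(f"round-trip latency: {latencies_ms[0]:.2f} ms")
```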
The client machine also hosted a Prometheus 2.48.1 server to collect metrics from two sources: the Modular Inference Server instances and the Prometheus Node exporter. The MIS exposed metrics related to request processing time and the status of the Python virtual machine. On both the GWEN and the GPU server, Prometheus Node exporter 1.7.0 was installed to expose metrics about the operating system and the hardware. Additionally, Nvidia DCGM exporter 3.3.0 was installed on the GPU server to expose metrics about the GPU.
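As a rough sketch of how a Python application such as the MIS can expose request-processing-time metrics to Prometheus (the metric name below is hypothetical and does not reproduce the metrics actually exported by the MIS), the official prometheus_client library can be used as follows:

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Histogram of request processing time in seconds; the metric name is illustrative.
REQUEST_TIME = Histogram(
    "inference_request_processing_seconds",
    "Time spent processing a single inference request",
)

def process_request() -> None:
    with REQUEST_TIME.time():                      # observes the elapsed time automatically
        time.sleep(random.uniform(0.005, 0.02))    # stand-in for real inference work

if __name__ == "__main__":
    start_http_server(8000)    # Prometheus scrapes http://<host>:8000/metrics
    while True:
        process_request()
```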
In the experiments, a total of three different deployment scenarios were considered, as summarized in
Figure 7. For the GWEN edge device, due to its limited computing resources, only one instance of the MIS was deployed in Docker without a load balancer (scenario A). For the GPU server, two deployment variants were used with Kubernetes (kubeadm 1.28.2). In scenario B, one instance of the Modular Inference Server was deployed without load balancing. In scenario C, a varying number of MIS instances were deployed in Kubernetes (1, 2, and 4), with a load balancer directing gRPC requests to them. The load balancer used in the tests was Envoy 1.18.3 in a round-robin configuration.
The approach to the experiment design in this study was informed by existing works describing ML inference servers and ML inference systems [
6,
38,
39,
40,
41], both in terms of metric selection and in terms of the number and architecture of devices included in the setup. The experiments focused on scenarios with a single compute node while testing different hardware and deployment strategies. Because the MIS is an ML inference server, it does not rely on complex, multinode deployments to realize its pipelines. Instead, all stages of the pipeline are encapsulated within one compute node. The devices used in the experiments were chosen on the basis of use case requirements, thus mimicking the hardware used in real-life scenarios. At the same time, the chosen devices represent two of the most popular CPU architectures and possess different ML acceleration capabilities (GPU), thus demonstrating the portability of the MIS.
4.1. Fall Detection
For the fall detection use case, two host devices were tested: the GWEN and the x86-64 server. In both cases, only the CPU was used for inference, even though the server had a GPU installed. This is due to the very small size of the model used (203 parameters) and the need to maintain very low and consistent latency. With standard GPU accelerators, parallelization is only possible with batching or by time-sharing the GPU between multiple applications, both of which would incur additional latency. Therefore, the GPU was not used in these experiments.
The workload was simulated with real-world data collected from workers performing a range of activities on an active construction site during the trials of the ASSIST-IoT project. The dataset consists of 11 h of recorded acceleration patterns from a single accelerometer and is available publicly on GitHub (
https://github.com/Modular-ML-inference/ml-usecase) (accessed on 3 April 2024) and Zenodo [
17]. During the experiments, the benchmark driver used it to simulate the load of a configurable number of devices. Several instances of the benchmark driver could be launched simultaneously to simulate multiple clients—such a situation would occur if several GWENs only collected acceleration data, while the inference was performed on a central, more powerful machine. Each simulated device generated data for 15 min at the 2 Hz frequency dictated by the use case.
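The load-generation logic can be sketched in Python as follows (the real driver is the Scala application described earlier; the structure and names here are illustrative): each simulated device emits one accelerometer reading every 0.5 s, and a single client multiplexes many such devices into one request stream.

```python
import asyncio
import random

SEND_PERIOD_S = 0.5       # 2 Hz, as dictated by the use case
DURATION_S = 15 * 60      # each simulated device runs for 15 minutes

async def simulate_device(device_id: int, queue: asyncio.Queue) -> None:
    loop = asyncio.get_running_loop()
    end = loop.time() + DURATION_S
    while loop.time() < end:
        sample = [random.gauss(0.0, 1.0) for _ in range(3)]   # fake x, y, z acceleration
        await queue.put((device_id, sample))                  # becomes one inference request
        await asyncio.sleep(SEND_PERIOD_S)

async def drain(queue: asyncio.Queue) -> None:
    while True:
        await queue.get()   # placeholder for sending the item as a gRPC request to the MIS

async def run_client(num_devices: int) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    producers = [asyncio.create_task(simulate_device(i, queue)) for i in range(num_devices)]
    consumer = asyncio.create_task(drain(queue))
    await asyncio.gather(*producers)
    consumer.cancel()

# asyncio.run(run_client(num_devices=40))   # one client simulating 40 devices
```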
The following experiment variants were conducted. With the GWEN (setup A in
Figure 7), only one client was tested, which corresponded to the real-life scenario. This client simulated the load of 10, 20, 40, 80, or 160 devices. With the GPU server, both setups B and C were tested (without and with load balancing; see
Figure 7). In setup C, the number of MIS pods was 1, 2, or 4. In all experiments with the server, the number of clients was 1, 4, or 16, and the number of devices per client was 10, 20, 40, 80, or 160.
4.2. Scratch Detection
Due to the aforementioned privacy and licensing restrictions, the real images from vehicle scanners could not be published. However, preserving the features of the real dataset is important for the experiments, as the number of detected scratches on each image has an impact on the total amount of data transferred during the inference. This is because for each detected scratch, additional data is returned (e.g., image masks), thus yielding larger response sizes. Hence, to precisely model the use case with regard to the volume of data returned from the pipeline, the distribution of the number of detected scratches must be preserved.
In order to achieve this while keeping the experiments described in this work reproducible, the CarDD [
42] dataset was used as a substitute for the real images from vehicle scanners. First, the trained model performed inference on the real dataset, and for each processed image, the number of scratches reported by the model was recorded. This was used to estimate the probability distribution of the number of detected scratches per image (
Figure 8). Subsequently, the same model was used for inference on the CarDD dataset, and the number of detected scratches was again recorded for each image, thus forming a second probability distribution. In the next step, this distribution was adjusted to match the real dataset’s distribution by subsampling the images with a given number of detected scratches in the necessary proportions. This resulted in a subset of the CarDD dataset with the same probability distribution of detected scratches as the evaluation dataset. This ensured that, during the inference experiments performed in this paper, the distribution of response sizes from the MIS closely mimicked the one that would be obtained with the real dataset. The code needed to generate this subset, along with the list of used images from CarDD, is available on GitHub (
https://github.com/Modular-ML-inference/ml-usecase) (accessed on 3 April 2024) and Zenodo [
17].
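A simplified version of this subsampling procedure is sketched below; the actual implementation is available in the linked repository, and the variable names here are hypothetical.

```python
import random
from collections import Counter, defaultdict

def subsample_to_match(substitute: dict, reference_counts: list) -> list:
    """
    substitute: mapping image_id -> number of scratches detected on the substitute image
    reference_counts: per-image scratch counts obtained on the real (non-publishable) dataset
    Returns image ids whose scratch-count distribution matches the reference distribution.
    """
    target = Counter(reference_counts)          # target distribution from the real dataset
    total_ref = sum(target.values())

    by_count = defaultdict(list)                # group substitute images by scratch count
    for image_id, n_scratches in substitute.items():
        by_count[n_scratches].append(image_id)

    # Largest subset size that can honor every bin's proportion with the images available.
    max_size = min(
        int(len(by_count[c]) * total_ref / target[c])
        for c in target if target[c] > 0 and by_count[c]
    )

    selected = []
    for count, share in target.items():
        k = round(max_size * share / total_ref)
        selected.extend(random.sample(by_count[count], min(k, len(by_count[count]))))
    return selected
```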
The workload was simulated with the benchmark driver application, where each client corresponded to one vehicle scanner. Every 3 min, the client generated a scan consisting of 50 to 300 images, with continuous uniform distribution of image count. Each image was 1200 pixels in width and 900 pixels in height with three color channels. The images were grouped into batches of one or more images and sent over gRPC to the MIS. The Modular Inference Server was deployed on the GPU server (benchmark setups B and C; see
Figure 7), and the GPU was utilized as the inference device. In setup C, GPU time sharing was employed to allow multiple MIS instances to access the accelerator.
The following experiment variants were conducted. With both setups B and C (see
Figure 7), the number of clients was 1, 2, or 4, while the batch size (number of images per gRPC request) was 1, 4, or 16. In setup C, the number of MIS pods was 1 or 2. In all experiments, the clients generated vehicle scans in the same manner as described above.
4.3. Experiment Summary
Table 2 summarizes the deployments used in the performed experiments. The two use cases were tested in several very different configurations, with the fall detection use case using exclusively CPU-based inference, while scratch detection relied on GPU-based inference. In addition,
Table 3 summarizes the pipelines deployed to the MIS in both use cases.
The two use cases have very different deployment, pipeline, and workload characteristics. For fall detection, the requests are very small, very frequent, and arrive at a constant pace. For scratch detection, the requests are much larger and arrive in irregular bursts. The use case-specific performance requirements also differ—in fall detection, latency must be optimized, while for scratch detection, throughput is the most important aspect. Overall, the designed experiments present a diverse challenge for the MIS.
5. Results and Analysis
The following section describes the results of the experiments performed with the Modular Inference Server in the context of the two presented use cases.
5.1. Fall Detection Results
For the fall detection use case, the most important performance aspect is latency (the time from the client sending an inference request to receiving the response). Throughput is less of a concern, as long as the MIS is able to operate in near-real time, that is, without creating long queues of requests. Therefore, the following analysis focuses on end-to-end latency. Due to the large number of performed experiments, the detailed results were placed in
Appendix A, while this section focuses only on the most important results.
For the resource-constrained GWEN, only one deployment variant was investigated: one MIS container in Docker, and one client with a varying number of client devices, which served as the data sources.
Figure 9 illustrates the request–response latency distribution, i.e., the time from the client sending an individual request in a gRPC stream to receiving the corresponding response. The plot does not include the experiment with 160 client devices, as the MIS did not manage to process that many requests in real time, which yielded very high latencies (see
Appendix A). As can be seen in the figure, for 10, 20, and 40 devices, the most common latency was around 8 ms, with some requests taking less. For 40 and 80 devices, requests taking longer appeared more often. In all presented cases, the maximum latency did not exceed 62 ms, while the median ranged from 7.52 ms (40 devices) to 11.63 ms (80 devices).
Interestingly, for higher numbers of devices, the proportion of sub-8 ms requests is larger than with only 10 devices. The cause of this phenomenon has not been investigated in depth; however, it may be a consequence of saturating the streams on both the client and the server. When a stream processor has no elements to process, it is typically paused. For example, the process is removed from the CPU by the kernel scheduler, or, in Apache Pekko, the stream stage is removed from the actor thread to allow other workloads to take its place. In such a case, when a new stream element arrives later, the stream processor must be started again, which unavoidably introduces some latency. This can be circumvented if the stream processor always has more elements to process, a situation that occurs with streams of higher throughput.
Due to the nature of the gRPC protocol, the request and response streams are not synchronous. This effectively means that the client may send several messages in the request stream before it starts getting the corresponding responses from the server. In the context of this study, an inference request that was sent but not yet responded to is called an
in-flight request—it can be thought of as queued for processing.
Figure 10 presents the distribution of in-flight requests for the experiments on the GWEN. The in-flight count was measured and recorded every time a request was sent or a response was received. Hence, for 10 and 20 devices, the in-flight count was only 0 or 1, which corresponds to requests being sent and received serially in a synchronous manner. For 40 devices, situations with two in-flight requests occurred, while for 80 devices, the in-flight count reached up to 10 requests (not shown on the plot due to the very small bar size).
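The in-flight bookkeeping can be sketched as follows: the counter is incremented at every send event, decremented at every receive event, and its value is recorded at both, which is the data underlying Figure 10. This is a simplified illustration, not the driver's actual Scala code.

```python
in_flight = 0
in_flight_samples = []   # one observation per send or receive event

def record_send() -> None:
    global in_flight
    in_flight += 1                    # request sent, response not yet received
    in_flight_samples.append(in_flight)

def record_receive() -> None:
    global in_flight
    in_flight -= 1                    # the matching response has arrived
    in_flight_samples.append(in_flight)

# A fully synchronous exchange only ever yields counts of 0 and 1:
record_send(); record_receive()
# Pipelined requests push the count higher:
record_send(); record_send(); record_receive(); record_receive()
print(in_flight_samples)   # [1, 0, 1, 2, 1, 0]
```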
In experiments with the server, different deployment variants and numbers of clients were tried, thus yielding two more dimensions in the data to explore.
Figure 11 presents the relation between the number of clients, devices per client, the deployment variant, and the mean inference latency. In cases where the mean latency was very high (more than
ms), the inference was not real time due to the formation of large queues of in-flight requests. It can be observed that one-pod deployments were not able to support real-time scenarios with 16 clients and 40 devices per client, or 4 clients and 160 devices per client (a total of 640 devices). Increasing the number of MIS pods to two allowed the server to handle these two scenarios. Consequently, four pods could handle 16 clients with 160 devices each, thus raising the total number of supported devices to 2560. When comparing the one-pod deployments (with or without load balancing), it can be seen that the load balancer appeared to introduce additional latency. This is expected—the load balancer is essentially a relay that requires additional CPU time. However, for many-pod deployments, the load balancer must be included to help the system scale to a higher number of devices.
A stream saturation phenomenon similar to that found in the GWEN deployment is also noticeable in
Figure 11, when comparing the experiments with 10 and 40 devices per client. Namely, the variants with 40 devices per client sometimes have lower mean latencies, which may stem from the same root cause as with the GWEN.
The latency distributions for 40 devices per client are visualized in
Figure 12. In the first two subplots, the peaks of the distributions for variants with load balancing are visibly shifted to the right in relation to the variant without load balancing. This implies that the load balancer consistently increased the latency. The distributions for higher numbers of clients are also flatter, with a higher overall variance. This is caused by the clients having to wait for the server to finish processing a request made by a different client.
Finally, performance metrics collected during the experiments were explored.
Figure 13 visualizes how the total CPU usage and network traffic scaled with a growing workload. For the GWEN (on the left subplot) it can be seen that CPU usage increased predictably with the increasing number of devices. It should be noted here that the MIS is a Python application. Python uses the Global Interpreter Lock (GIL), which limits the number of active Python interpreter threads to one [
43]. This means that the MIS is almost entirely single-threaded—almost, because code outside the interpreter (e.g., optimized numerical routines, network code) can execute asynchronously. Therefore, for 160 devices, the GWEN CPU usage averaged 114.8%, which corresponds to one fully loaded MIS instance. For 80 devices, the average CPU usage was 95.9%.
There was a disparity between network transmit (TX) and receive (RX) usage for 10 and 20 devices on the GWEN. In the fall detection use case, the requests to the MIS were larger than its responses, and therefore the RX should have been higher. This can, however, be explained by the presence of the Prometheus metrics exporters on the GWEN. The client workstation’s Prometheus instance regularly reads the metrics from the exporters, thus adding significant TX traffic. This traffic is constant and independent of the number of devices.
For the server, with four pods and 160 devices per client, the CPU usage scaled predictably to an average of 438.7% with 16 clients. The average total network usage (RX + TX) reached 1.92 MB/s in the most challenging scenario, using only approximately 1.5% of the total bandwidth of the Gigabit connection.
5.2. Scratch Detection Results
In the scratch detection use case, the most important performance aspect is the time needed to process a complete vehicle scan. In contrast to the fall detection use case, individual request latency is not important, and therefore the focus of this analysis was different.
Figure 14 illustrates the total time needed to process a complete vehicle scan, from the first request being sent to the last response being received. The most significant difference that can be observed is the impact of batch size. The processing time dropped sharply when the batch size was increased to 4 and then to 16. This is because the GPU can perform inference on multiple images in parallel—at the cost of higher memory usage. Therefore, with larger batch sizes, the GPU is expected to be better utilized.
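The effect of batching can be sketched with a generic PyTorch model (the actual model, ML framework, and preprocessing used in the scratch detection pipeline are not reproduced here): a single forward pass processes the whole batch on the GPU, at the cost of memory that grows with the batch size.

```python
import torch
import torch.nn as nn

# Stand-in for the scratch detection model; any nn.Module would behave similarly.
model = nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

def infer(images: torch.Tensor) -> torch.Tensor:
    # images: (batch, 3, H, W); the whole batch is processed in one forward pass on the GPU
    with torch.no_grad():
        return model(images.to(device)).cpu()

single = torch.rand(1, 3, 224, 224)     # batch size 1: GPU underutilized
batched = torch.rand(16, 3, 224, 224)   # batch size 16: better utilization, more memory
_ = infer(single)
_ = infer(batched)
```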
The plots in
Figure 14 show a very high variance for all experiments, which was largely due to the varying vehicle scan size (from 50 to 300 images). This variance makes it harder to assess the impact of other variables (number of clients, deployment). Therefore,
Table 4 presents the same results normalized by the number of images in each vehicle scan, thereby effectively removing this variance. Firstly, it is noticeable that for batch sizes 4 and 16, and with one MIS pod, the deployment with load balancing was consistently faster than the one without load balancing. The variance was also lower with the load balancer, especially for the more challenging cases with more clients. This was likely caused by the additional buffering done by the load balancer, which can increase throughput and make the traffic flow more consistent. Secondly, it can be observed that the difference between processing times with one and two pods was minimal.
Moving on to performance metrics,
Figure 15 visualizes the GPU utilization and GPU memory usage in different deployments with four clients. Looking at the GPU utilization distribution (top-left subplot), it is evident that peak GPU usage was higher with larger batch sizes, as expected. For a batch size of 16, the peak utilization reached 100%, thereby making use of the full potential of the GPU. Although the peak utilization increased with batch size, the opposite was true for mean utilization (top-right subplot). Effectively, the total processing time on the GPU decreased with larger batch sizes. Together, these two observations show that larger batch sizes are doubly beneficial—by both using the hardware to the maximum and utilizing less GPU time overall—thereby leaving more time for other processes. The only outlier in these results was the higher-than-expected mean GPU usage for two pods and a batch size of four, which was most likely caused by a random occurrence.
The GPU memory usage increased predictably with the batch size, as higher batch sizes require linearly more memory. With four clients, two pods, and a batch size of 16, the GPU memory usage reached almost 14 GB, which is well below the capacity of this GPU (24 GB). Unlike the mean GPU utilization, the mean GPU memory usage was positively correlated with the batch size.
Figure 16 illustrates network usage in different deployment variants with four clients. Firstly, on the left subplot, it can be observed that the peak network receive usage was noticeably higher with a load balancer in place, thereby reaching up to 120 MB/s (the limit of the Gigabit Ethernet connection). This is likely due to the additional buffering done by the load balancer, which in turn allows the network bandwidth to be used in full. As for network transmit, the usage increased predictably with batch size, thus peaking at about 35 MB/s for a batch size of 16 and two pods with load balancing.
7. Discussion and Future Work
What follows is a discussion of the obtained results, along with an outline of future work directions. The discussion covers the functionalities of the MIS and the Component Repository, performance and deployment aspects, and the specific use cases included in the experiments.
7.1. Software Functionalities
The Modular Inference Server and the Component Repository have demonstrably covered the needs of the two tested use cases. The software was successfully applied in pilot trials involving real hardware, as well as in relevant simulated workloads. However, for future production deployments, further work on selected improvements and functionalities would be needed.
Firstly, the software currently assumes that the deployment is done in a secure private network, without the possibility of malicious activity—these assumptions were reflected in the approach to interface design used by the system. Introducing authentication, authorization, and encryption to increase the security of these components would allow for a broader range of applications. Secondly, the mechanism of modular inference pipelines could be extended to support the integration of multiple models chained together or forming an ensemble. Similarly, the pipeline module format verification in the MIS could be formalized and expanded into a semantically coherent system. Integrated into a command-line tool and coupled with the Component Repository, it would aid users in designing custom pipelines based on pre-existing components, thereby increasing overall reusability. Additionally, a broader analysis of potentially useful features adapting the MIS to scenarios beyond the ASSIST-IoT use cases in the Cloud–Edge–IoT continuum could be conducted, focusing on important aspects such as varying data location or geographical distribution [
64].
7.2. Performance and Deployment
The Modular Inference Server was tested in a range of demanding and diverse benchmarks, thus demonstrating robust performance. It could cater both to latency-sensitive workloads (fall detection) and to throughput-restricted ones (scratch detection). The software scaled both up/down (from the GWEN to the x86-64 server) and out (with multiple MIS instances per server). It also managed to run on two of the most popular CPU architectures (x86-64 and ARM64) and make efficient use of the GPU when available. As such, it has fulfilled the requirements of adaptability within the described use cases. Due to the use of Docker, the MIS can be assumed to be usable on any reasonably modern x86-64 or ARM64 platform supporting containers, which covers a significant portion of the computing continuum. The range of supported CPU architectures can be extended in the future to include, for example, RISC-V. Adapting the MIS to new architectures is feasible, given the few dependencies of the MIS (the most important are Docker and Python, which are already ported to RISC-V). However, some of the needed machine learning libraries (e.g., TensorFlow) are at the time of writing still poorly supported on RISC-V, thus constituting the largest barrier. A broader range of ML acceleration hardware should also be tested, including various Neural Processing Units (NPUs) and GPUs from other vendors. This would be especially useful for energy-efficient edge devices such as the GWEN. Additionally, further experiments, including scenarios with the load balancer managing the communication between devices of different computational abilities, should be included in future work.
As for Cloud–Edge–IoT continuum support, the MIS can run not only in the standalone mode suitable for resource-constrained edge devices but also in larger Kubernetes deployments, thus making it suitable for the cloud or for cloud-native deployments on the edge [
12]. Extending the MIS’s support to IoT devices depends largely on the device in question—some would argue, for example, that the GWEN and comparable hardware platforms (e.g., Raspberry Pi) qualify as IoT devices. As of now, the MIS supports any x86-64 or ARM64 platform that can run Docker. Using the MIS on smaller devices is currently not possible but could be investigated, for example, with the use of WebAssembly [
65]. This technology, although very promising, is still relatively young, and it is not immediately clear how feasible such an implementation would be.
The results from the conducted experiments can be used to draw general conclusions about other workloads that can be deployed with the MIS. While investigating the effect a load balancer had on deployments with one MIS instance, it was observed that although it did increase latency in the fall detection use case, it decreased the time to process a vehicle scan in scratch detection. This is a well-known trade-off of latency versus throughput—with additional buffering and optimized networking code, the load balancer can increase throughput, although at the cost of latency. Thus, when deploying workloads with the MIS, using the load balancer for one-pod deployments is recommended only if throughput is more important than latency for the use case.
Analyzing the collected performance metrics allowed us to determine the bottlenecks in both use cases. Depending on the used pipeline and the workload characteristics, the availability of different resources may be the limiting factor. For example, the fall detection use case is visibly CPU-bound, and more devices could be handled by simply increasing the number of available CPU cores and MIS instances. On the other hand, in scratch detection, full parallelization can only be achieved with batching, which is in turn restricted by the available GPU memory. Peak inference throughput also appears to be limited by network speed; therefore, employing faster networking (e.g., 10 Gigabit Ethernet) or compressing the requests may substantially decrease inference time. In summary, for a given workload, it is essential to evaluate its resource use, determine the bottlenecks, and select appropriate hardware.
7.3. Use Cases
In the fall detection use case, the GWEN could responsively handle up to 80 client devices with a single MIS instance, thus meeting the relevant requirements. The achieved latencies were well within the expected ranges, while the in-flight request count analysis shows that the MIS was able to process the requests in real time. In the server deployment, the solution scaled predictably—most likely, it would be able to handle thousands of devices on many-core CPUs. In the future, using a specialized ML accelerator could be investigated to further increase the number of supported devices per GWEN or to make it possible to use a larger ML model. The accelerator would have to be chosen carefully so as to support very-low-latency inference, which would be very hard to achieve with standard GPUs employing batching and time-sharing concurrency.
For the scratch detection use case, the MIS supported up to four clients without any issues in all deployment scenarios. Network bandwidth was clearly a bottleneck, mostly because the images were sent as raw tensors without compression. Each image encoded in such a manner takes up approximately 3 MB, a size that would be significantly decreased if, for example, JPEG encoding were used instead (see the illustrative sketch at the end of this subsection). This could be achieved either by treating the JPEG image as a flat tensor of unsigned 8-bit integers (bytes) or by implementing a custom gRPC
Service in the MIS. Higher processing throughput per server machine could likely be obtained by using Multi-Instance GPU (MIG) technology, which allows multiple applications to use the same GPU concurrently [
66]. This, however, requires using highly specialized hardware.
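As a back-of-the-envelope illustration of the compression argument above (the use of Pillow and the 90% JPEG quality are assumptions, not part of the MIS): a 1200 × 900 RGB image occupies 1200 × 900 × 3 bytes ≈ 3.24 MB when sent as raw 8-bit values, while JPEG encoding can reduce this substantially for natural images.

```python
import io

import numpy as np
from PIL import Image

# Synthetic 1200 x 900 RGB image; real scanner images would compress differently.
image = (np.random.rand(900, 1200, 3) * 255).astype(np.uint8)

raw_bytes = image.tobytes()                            # how the MIS currently receives it
print(f"raw tensor: {len(raw_bytes) / 1e6:.2f} MB")    # 3.24 MB

buffer = io.BytesIO()
Image.fromarray(image).save(buffer, format="JPEG", quality=90)
print(f"JPEG (quality 90): {buffer.getbuffer().nbytes / 1e6:.2f} MB")

# On the server side, the JPEG payload could be decoded back into a tensor:
decoded = np.asarray(Image.open(io.BytesIO(buffer.getvalue())))
assert decoded.shape == image.shape
```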