Article

A Voice-Enabled ROS2 Framework for Human–Robot Collaborative Inspection

by Apostolis Papavasileiou, Stelios Nikoladakis, Fotios Panagiotis Basamakis, Sotiris Aivaliotis, George Michalos and Sotiris Makris *
Laboratory for Manufacturing Systems and Automation, Department of Mechanical Engineering and Aeronautics, University of Patras, 26504 Patras, Greece
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4138; https://doi.org/10.3390/app14104138
Submission received: 16 April 2024 / Revised: 6 May 2024 / Accepted: 9 May 2024 / Published: 13 May 2024

Abstract

Quality inspection plays a vital role in current manufacturing practice, since the need for reliable and customized products is high on the agenda of most industries. In this scope, solutions that enhance human–robot collaboration, such as voice-based interaction, are at the forefront of industrial efforts to embrace the latest digitalization trends. Current inspection activities are often based on the manual expertise of operators, which has proven to be time-consuming. This paper presents a voice-enabled ROS2 framework for enhancing the collaboration of robots and operators in quality inspection activities. A robust ROS2-based architecture is adopted to orchestrate the process execution flow. Furthermore, a speech recognition application and a quality inspection solution are deployed and integrated into the overall system, and their effectiveness is showcased in a case study derived from the automotive industry. The benefits of this voice-enabled ROS2 framework are discussed, and the framework is proposed as an alternative way of inspecting parts in human–robot collaborative environments. To measure its added value, a multi-round testing process was carried out with different parameters for the framework's modules, showing reduced cycle time for quality inspection processes, robust human–robot interaction using voice-based techniques and accurate inspection.

1. Introduction

Over the last few years, the effort of production systems to keep up with ongoing market demands for customizable products has resulted in a direct need for more flexible and time- and cost-efficient industrial solutions [1]. In this scope, many efforts have been made to utilize frameworks [2] that are in line with Industry 4.0 standards [3], aiming to provide a direct solution to existing industrial needs. This new wave of research and innovation has led to the creation of complex and efficient cells that combine different communication and execution systems [4]. While the field is rapidly evolving, many aspects of production appear to be lagging behind and remain a subject for further optimization.

1.1. Literature Review

1.1.1. Human–Robot Collaboration (HRC) in Industrial Applications

One of the most promising areas for new research is the field of Human–Robot Collaboration (HRC). The concept of HRC has become very popular over the last few years [5]. The main idea is that robotic resources such as robotic arms, mobile robots, etc. can provide many benefits to production systems (repeatability, precision, strength) but also have significant limitations. These limitations arise in the manipulation of flexible components such as cables and clothes, which usually requires high dexterity and cognition [6]. Additionally, it is very common for assembly actions to be followed by secondary assembly steps such as the insertion of fasteners or the connection of wires with other hardware components. These secondary assembly steps require high levels of accuracy and speed. Such limitations can be addressed by utilizing the capabilities of human operators, who can collaborate with robotic resources and provide intelligence, flexibility and dexterity to the operation itself [7]. The field has already proven to be rapidly evolving: over the last years, it has managed to eliminate the previously mandatory fences around robotic manipulators [8] while establishing HRC as a standard in many aspects of manufacturing [9]. Meanwhile, research efforts are ongoing [10], with the next goal being the creation of a shared human–robot workspace where contact is not avoided but desirable [11]. Even though HRC is a promising field, its uses appear in only a small number of stages of industrial production lines. More specifically, most of its applications are connected with the assembly process, with some minimal involvement in material manipulation in the automotive industry [12].

1.1.2. Human–Robot Interaction Techniques in Industrial Environments

Several approaches towards the efficient interaction of human and robotic resources in HRC scenarios have been presented in the literature. Existing approaches are based on Augmented Reality (AR) techniques [13,14,15], smartphones [16], tablets [17,18], smartwatches [19] and voice-based interfaces [20]. In this paper, special emphasis is given to voice-based approaches, since they allow human operators to interact with the robot without stopping their actions or using their hands to operate the required interfaces. While the idea of robots communicating with humans through voice commands is not new and was already explored at the start of the 2000s, new advancements have come to the fore. In more detail, the early efforts were severely limited by the lack of reliable and efficient voice recognition software [21]. Nowadays, the latest advancements in software design have enabled more advanced solutions for controlling robots using vocal commands [22]. Voice-based techniques have become a part of robot manipulation [23], assisting operators [24] in many steps of the industrial process [25] and resulting in the creation of methods and tools for programming robot functions based on human voice [26].

1.1.3. Human–Robot Collaboration (HRC) in Quality Inspection Processes

While the benefits of HRC are apparent in assembly operations, there is room for further research in other industrial applications, such as quality inspection. Nowadays, the leading practice for quality inspection is visual examination which, in most cases, involves a human operator responsible for the overall control and assessment. Visual examination by operators raises several challenges such as limited scalability and operator-to-operator inconsistencies [27]. The shortcomings of these practices are already being noticed, with several vision-based inspection methods having been proposed over the last few years [28]. Some types of robots, such as ground robots coupled with certain vision techniques, have also been introduced to ease the burden on human personnel [29]. Recent advances in Deep Learning (DL) have brought into the spotlight a variety of algorithms for object detection, semantic segmentation or image classification which significantly enhance vision systems in terms of quality inspection capabilities [30]. Such approaches investigate the implementation of new systems by utilizing existing or newly developed CNN (Convolutional Neural Network) architectures, showing promising results [31]. While there is promising research on inspection solutions [32], most efforts still focus on minimal human–robot interaction, resulting in rigid systems that cannot take advantage of the flexibility, intelligence and dexterity that the HRC approach adds to a structure [33].

1.1.4. Controlling and Integration Approaches for Human–Robot Collaborative Operations

Many resources and efforts have been devoted to the development and implementation of controlling and integration approaches over the years. Approaches based on digital-twin architectures for monitoring task planning, robot programming [34] and even the safety control [35] side of the operations have proven to be quite successful for manufacturing. Meanwhile, some Artificial Intelligence (AI)-based approaches have been implemented successfully [36] for the task of collision checking, with some implementations even incorporating cloud components into their design [37]. Despite this wide variety, the leading design choices for the implementation of control frameworks are node-based ones and, more specifically, the Robot Operating System (ROS)- and Open Platform Communications Unified Architecture (OPC UA)-based designs. OPC UA is a service-oriented communication architecture in which complex information is represented by nodes that relate to each other by reference [38]. The same node-based design is adopted by ROS, which emphasizes the separation of functions through interfaces, allowing for more modular and flexible architectures [39]. While both tools have significant uses, they both come with drawbacks. OPC UA does not define standards on how the desired information should be shared, so it needs an information system to function. While ROS covers that vulnerability, its master-based design lacks inherent real-time capabilities and suffers from a centralized approach [40], with many packages being developed on top of the existing ROS architecture to cover these shortcomings [41].

1.2. Progress beyond the State of the Art

Despite the fact that HRC frameworks for quality inspection operations are available in the literature, the authors have identified several limitations in them, as presented in Table 1.
Summarizing the previous analysis, there is a gap regarding a modular solution for executing quality inspection operations in HRC industrial shopfloors that supports easy integration with key human–robot interaction modules. In this paper, a modular framework is presented that is based on a core orchestration module and includes several interfaces for connecting quality inspection, visualization, robot control and voice-based robot interaction modules. The framework's design and implementation focus on the automated quality inspection of parts in human–robot collaborative environments, where the human can also observe the assembly product and request new inspection tasks from the robot on demand. The novelty of the proposed solution lies in the development of an end-point-based framework consisting of different modules for the following:
  • Human–robot interaction techniques based on voice-capturing solutions for the efficient and easy collaboration of the resources during task execution;
  • Vision-based quality inspection solutions using AI techniques;
  • Robotic resource monitoring and controlling solutions for the successful execution of the required tasks;
  • Visualization of the manufacturing shopfloor layout, robotic resource motions, identified voice commands and quality inspection results.
The proposed solution is based on the ROS framework, with the integration of the ROS2 version [45] also being investigated. After evaluating the pros and cons of ROS2, the authors selected the ROS2 version, even from the design phase, for the following reasons:
  • ROS2 is based on the Data Distribution Service (DDS) [46] standard architecture, enabling the framework to have quality of service parametrization;
  • ROS2 has no single point of failure, so the system will not fail if one ROS node malfunctions (a typical situation in unstable settings such as industrial ones);
  • ROS2 delivers data reliably over lossy links, enabling the use of multiple machines connected through Wi-Fi to run the required modules of the framework;
  • ROS 1 distributions are about to be discontinued, receiving no updates after 2025 and being replaced by ROS2 distributions.
A robust quality inspection module has been developed and proposed alongside a voice-based interaction mechanism in order to facilitate accurate collaborative inspection in HRC workspaces. A brief comparison of the existing ROS2-based solutions in manufacturing is presented in Table 2, highlighting their main limitations compared to the proposed voice-enabled ROS2 framework for HRC operations.
The paper is organized as follows: Section 2 provides a description of the proposed framework and its modules, while Section 3 presents the description of the architecture and software modules used, as well as how these are implemented; Section 4 describes the validation of the proposed framework in an industrial scenario derived from the automotive industry; Section 5 analyzes experiments with real industrial equipment and presents their main outcomes and results; and, finally, Section 6 discusses the main conclusions regarding the deployment of the proposed framework as well as the future steps of this research.

2. Approach

In this paper, a voice-enabled ROS2 framework is presented that is able to accommodate the organizational and functional needs of a human–robot collaborative workstation focused on the automated quality inspection of products.
The proposed framework has been designed and implemented following an end-point architecture. In this case, the modules of the proposed framework can be treated as separate solutions to specific manufacturing problems, such as (i) the automated quality inspection of products using fixed or movable sensors for image acquisition, or (ii) human voice recognition for operators' interaction with the manufacturing system and its components (robots, machines, etc.). The end-point architecture of the proposed framework enables the connection of the developed modules to fill a specific gap in the literature, as presented in the previous section of the manuscript.
In the proposed configuration of the framework, four modules are integrated with the core “Orchestration” module of the framework for the HRC quality inspection of products, namely:
  • Voice recognition;
  • Quality inspection;
  • Robot manipulation;
  • Visualization.
Thanks to the development of these modules and their integration with the core module of the framework, different operations are controlled through voice commands, while the inspection actions are handled collaboratively by an operator and a robotic manipulator, achieving greater time efficiency and inspection accuracy while relieving the operator of unnecessary ergonomic stress. The whole process is monitored through a digital environment that depicts all the different actions. All the modules of the investigated framework are presented in the following subsections.

2.1. Orchestration

This core module of the framework incorporates the basic functionalities and utilities of ROS2. More specifically, the orchestration module covers different aspects such as the sequencing of actions, the distribution of data and the overall control inside the framework. A human operator and a robotic manipulator serve as the main resources of the monitored structure, while an inspection component and a voice recognition component work in parallel. The orchestration module must be aware of the different components' status in order to act as a safety, control and organization center for all of them.
Based on this hierarchy, it is clear that this module handles the control of each component in the proposed solution. Specifically, the orchestration handles different basic control functionalities, namely the following:
  • Operation Sequencing: Each command from the voice recognition module correlates with many different tasks that may overlap with one another; the preparation of the distinct resources that each task needs and the orchestration of the tasks into flows that do not conflict with each other are addressed here.
  • Task Assignment: Following the flows of execution created, the task assignment functionality distributes the correct commands to each component while checking and maintaining information for the status of the corresponding execution.
  • Error Handling: In case of a system error or a component malfunction, the task execution is stopped and the appropriate backup operations are initiated, ensuring safety and stability for the system.
As the core module of the proposed framework, the orchestration module consists of several interfaces towards the integration of the required modules, presented in Figure 1 with the proposed solution.

2.2. Voice Recognition

The voice recognition module detects the operator's vocal commands, matches them with predefined speech commands (start, stop, right, left, top, front, back) and maps them to robot tasks accordingly.
The main interface of this component is built with the purpose of providing an easy and descriptive way of using all the functionalities included underneath it:
  • Voice Identification: Start and stop triggering is provided in order for the user to initiate or pause the capturing of his/her voice and transmit the speech signal for identification.
  • Command Recognition: After the transmission of the speech signal, the output is provided, namely the recognized sentence, phrase or word that has been pronounced.
  • Online Filtering: An online filtering of the pronounced words has been established, allowing the user to experience a live validation of correct pronunciation; otherwise, the output area presenting the recognized words is left blank. Last but not least, an informative message is created in order to confirm that the microphone recording has been successfully enabled.
The concept of the voice recognition module includes the aforementioned functionalities, which are based on Algorithm 1.
Algorithm 1: Voice recognition wrapper
[The pseudocode of Algorithm 1 is provided as an image in the original article.]
The connection of the voice recognition module with the core module of the proposed framework is based on the initialization of software components (web sockets, browser user interface (UI), communication backend, etc.) and the microphone device from the orchestrator module’s side.
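Since Algorithm 1 is available only as an image, a minimal sketch of the wrapper concept is given below using rclpy. It subscribes to transcribed speech, filters it against the predefined command set and republishes validated commands for the orchestrator; the topic and node names are assumptions for illustration, not the actual implementation.
```python
# Illustrative sketch only: topic names and node name are assumptions, not the
# exact implementation of Algorithm 1 (which is provided as an image).
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

PERMITTED_COMMANDS = {"start", "stop", "right", "left", "top", "front", "back"}

class VoiceRecognitionWrapper(Node):
    def __init__(self):
        super().__init__('voice_recognition_wrapper')
        # Raw transcriptions arriving from the browser through rosbridge (assumed topic).
        self.create_subscription(String, '/speech/raw_text', self.on_transcription, 10)
        # Validated commands forwarded to the orchestration module (assumed topic).
        self.publisher = self.create_publisher(String, '/speech/command', 10)

    def on_transcription(self, msg: String):
        # Online filtering: keep only words that match the predefined command set.
        recognized = [w for w in msg.data.lower().split() if w in PERMITTED_COMMANDS]
        if not recognized:
            return  # nothing valid was pronounced; the UI output area stays blank
        command = String()
        command.data = recognized[-1]  # forward the most recent valid command
        self.publisher.publish(command)
        self.get_logger().info(f'Recognized command: {command.data}')

def main():
    rclpy.init()
    rclpy.spin(VoiceRecognitionWrapper())
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```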

2.3. Quality Inspection

The quality inspection module is provided through a vision-based solution and consists of an industrial camera, a processing unit and a detection algorithm that is able to automatically classify potentially mispositioned parts in real time. More specifically, a deep object detection solution is proposed that not only localizes the desired parts appearing in the camera's field of view but also classifies them into different categories based on specific visual characteristics such as shape and size. The results of the inspection are visualized via the visualization module so that the operator is able to review the product's quality aspects. The quality inspection module provides three core functionalities, which are presented in detail below.

2.3.1. Image Acquisition

In order for an inspection to occur, an image or frame is required to be fed into the object detection model. The acquisition of that image is achieved via the camera, and it is stored in the cache memory of the computer for further processing. During the acquisition procedure, the succeeding steps are followed:
  • Firstly, the connection between the processing unit and the chosen acquisition device is established; in this case, this connection is achieved via an Ethernet connection;
  • Then, before starting the quality inspection procedure, the camera parameters are determined; these parameters are the exposure time, the width and height of the image as well as the gain and gamma; the latter are used in order to adjust the overall brightness of the image so that the objects of interest are visible;
  • Finally, after all the necessary configurations have been set, the image grabbing and storing operations are triggered; for this, specific ROS2 drivers dedicated to the selected camera are utilized; while operating under ROS2, the camera is locked and cannot be accessed by other processes. A minimal acquisition sketch is given after this list.
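As an illustration of the acquisition step, the sketch below assumes the camera's ROS2 driver publishes images on a topic and caches the latest frame in memory for the detection model; the topic name and image encoding are assumptions, and the actual exposure, gain and gamma settings are configured in the vendor driver.
```python
# Minimal acquisition sketch; the image topic name and encoding are assumptions.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

class ImageAcquisition(Node):
    def __init__(self):
        super().__init__('image_acquisition')
        self.bridge = CvBridge()
        self.latest_frame = None  # cached in memory for the detection model
        self.create_subscription(Image, '/camera/color/image_raw', self.on_image, 10)

    def on_image(self, msg: Image):
        # Convert the ROS2 image message to an OpenCV BGR array and cache it.
        self.latest_frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')

def main():
    rclpy.init()
    rclpy.spin(ImageAcquisition())
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```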

2.3.2. Part Recognition

To obtain automatic vision-based defect identification, an object detection framework based on the YOLOv4 (You Only Look Once version 4) model was developed. The basic components/functionalities of the proposed object detection framework are presented below.
You Only Look Once (YOLO) models [51] are a series of convolutional neural networks that are specialized in object detection and classification tasks in images or video frames. As a result, the proposed object detection framework is extremely fast and generalizable while also offering high detection accuracy. To achieve this, the original input image is divided into an S × S grid, meaning that the image is segmented into an equal number of rows and columns. Each grid cell is responsible for detecting the center of an object, if one appears inside of it, and produces a number N of predefined anchor bounding boxes, which are basically boxes of fixed sizes used to speed up the box regression of the model. Additionally, the model produces, for each object, a confidence score that indicates the model's certainty that an object is detected and that the bounding box for that object is accurate. Hence, the confidence score for a grid cell is zero if no object appears inside of it, and equal to the Intersection over Union (IoU) otherwise. The IoU is basically the overlapping area of the predicted and the ground truth bounding boxes for an object divided by the total area that they form. Each predicted bounding box consists of five parameters:
  • Two of them represent the center of the box (x, y);
  • Two are for the width and height of it (w, h), relative to the shape of the image;
  • One is for the box’s confidence that basically represents the IoU.
Additionally, the model provides a prediction for the object's class. This is represented by a 1D vector C with length equal to the number of all the classes that exist in the dataset. Thus, the final confidence score for a detected object encodes the probability of the predicted class and is expressed as follows:
$P(\text{class}_{\text{predicted}}) \times IoU$,
Considering all the aforementioned information, the final prediction vector of each grid cell is of shape $N \times 5 + C$ and, if expressed relative to the grid-separated image, it is transformed into a matrix of shape $S \times S \times (C + N \times 5)$, as depicted in Figure 2. It is worth mentioning here that the grid is not applied to the original image but rather to the feature map of the final layer of a convolutional neural network referred to as the backbone. The aforementioned procedure is visualized in relation to the corresponding image part in Figure 2.
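Because the confidence score is tied to the IoU, the short sketch below shows how the IoU can be computed for two axis-aligned boxes given in the (center x, center y, width, height) convention used above; it is an illustrative helper, not the authors' code.
```python
# Sketch of the Intersection over Union (IoU) for two axis-aligned boxes
# given as (center_x, center_y, width, height).
def iou(box_a, box_b):
    def to_corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)

    # Overlapping area of the two boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Total area that the two boxes form together (union).
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

# Example: two partially overlapping unit boxes give an IoU of about 0.33.
print(iou((0.5, 0.5, 1.0, 1.0), (1.0, 0.5, 1.0, 1.0)))
```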

2.3.3. Model Architecture

As already mentioned, for this work, the YOLOv4 model was implemented. This model consists of three different stages, namely the following:
  • Input stage, which is basically the RGB images that are fed to the network either one by one (frames) or as a batch;
  • The backbone and the neck, which are responsible for the feature extraction and grouping;
  • The final stage of the model, which is referred to as the head of the architecture and this is where the final object detection occurs.
The aforementioned stages and features are presented below in further detail and their architecture is visualized in Figure 3.
For the backbone architecture, the model uses CSPDarknet53, which is a modified version of the Darknet-53 architecture used in YOLOv3 [52] that introduces the advantages of CSPNet [53]. Darknet-53 is composed of a series of convolutional layers, which are designed to identify patterns in images. Each layer has a specific purpose, such as edge detection or object recognition. The network, as the name suggests, consists of 53 layers and is trained on the ImageNet [54] dataset, which consists of over 14 million images. More specifically, the model combines a modified version of the YOLOv2 [55] Darknet-19 architecture, which uses successive 1 × 1 and 3 × 3 convolutions to achieve basic feature extraction, with the deep residual network ResNet [56]. Residual networks were introduced to handle the vanishing gradient problem, where the gradient of a deep neural network becomes too small, leading to poor performance. Residual connections allow the gradient to pass through layers of the network, making it easier for the network to learn. In a residual setup, the output of a layer is not only passed through the next layer but is also added to that layer's output. By applying this principle, a deep neural network can learn more complex and deeper patterns and provide overall better performance. An overview of the CSP and residual block architecture is visualized in Figure 4.
In the architecture of the proposed object detection framework, the CSPNet is applied on top of the Darknet-53 architecture, leading to CSPDarknet53. This network is composed of CSP blocks that are basically formed from partial residual blocks and partial transition layers. In each CSP block, the base layer is split into two parts: one part is passed through the residual block while the other is sent into the transition layer without any processing. The purpose of the transition layer is to prevent distinct layers from learning duplicate gradient information that emerges from the splitting of the base layer into two different parts. Additionally, the model includes a Spatial Pyramid Pooling (SPP) [57] block between the backbone and the neck for separating the most significant context features. SPP makes it possible to generate fixed-size features independent of the size of the feature maps. It does so by first pooling every feature map into one value; then, pooling occurs two more times, with the output values becoming 4 and 16, respectively. Finally, the three produced vectors are concatenated to form a fixed-size vector that serves as the input of the next layer of the model. Here, the SPP (Figure 5) is modified to retain the output spatial dimension: maximum pooling is applied with sliding kernels of size 1 × 1, 5 × 5, 9 × 9 and 13 × 13. The produced feature maps are then concatenated together as output and are sent to the neck of the model.
The deployed model uses a modified version of the Path Aggregation Network—PANet [58]—as its neck architecture to localize and classify various objects that appear in a scene. The main advantage of PANet compared to FPN (Feature Pyramid Network) [59], which was used in the previous YOLO version, is that it is able to detect more complex features and preserve the spatial information that exists in an image with fewer computations. This leads to an overall enhancement of the model’s performance. More specifically, this network aggregates features from different backbone levels and its outputs are sent to different detection heads. This is achieved by adopting two techniques, namely bottom-up path augmentation and adaptive feature pooling.
For the head of the architecture, the same process as in YOLOv3 is utilized. As already mentioned, the final layer of the network is divided into a grid. This is basically achieved by using a convolutional layer which applies 1 × 1 convolutions and produces a prediction map of the same size as the feature map before it. Every cell of this prediction map seeks an object center and predicts three bounding boxes of fixed size, referred to as anchor boxes. The box with the highest IoU score is the one responsible for enclosing the object of interest, and that score represents how accurate that bounding box is. Furthermore, the network uses the feature maps extracted from the PANet at strides 32, 16 and 8. A stride basically defines how far a kernel filter moves and, therefore, determines the output shape of a filtered image. In this case, the model makes predictions at resolutions that divide the input image by the aforementioned factors. At each scale, the model predicts three bounding boxes for every cell, which leads to thousands of bounding boxes for a single image. In order to keep only those that correspond to the detected objects, the following two conditions are used:
  • Every confidence is first compared to a threshold; if it is lower than the threshold, the corresponding bounding box is ignored;
  • If, however, there are bounding boxes that pass the threshold filtering, then the non-maximum suppression method is used, which selects a single entity out of many overlapping ones through a more dynamic thresholding operation; a minimal sketch of this filtering is given after this list.
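The sketch below illustrates the two conditions just described: confidence thresholding followed by greedy non-maximum suppression. The threshold values are illustrative only and do not correspond to the deployed model's settings.
```python
# Sketch of confidence thresholding followed by non-maximum suppression (NMS).
# Boxes are (x1, y1, x2, y2) corners; thresholds are illustrative values.
import numpy as np

def filter_detections(boxes, scores, conf_thresh=0.25, iou_thresh=0.45):
    # 1. Discard boxes whose confidence is below the threshold.
    keep_mask = scores >= conf_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]

    # 2. Greedy NMS: keep the highest-scoring box, drop heavily overlapping ones.
    order = scores.argsort()[::-1]
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the best box against all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_thresh]
    return boxes[kept], scores[kept]
```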
The integration of the quality inspection module with the orchestrator module of the proposed framework is based on the initialization of software components (web sockets and communication backend) but also the camera sensor’s connection from the orchestrator module’s side.

2.4. Robot Manipulation

The robot manipulation module is responsible for planning and executing the complex actions of the robotic manipulator. It is divided into two distinct structures, namely, the planning structure and the movement structure. These two structures are intricately connected to each other, with their communication being embedded in the ROS2 communication system. The module utilizes the DDS standard, which allows the system to configure Quality of Service (QoS) options while ensuring connectivity beyond the standard TCP (Transmission Control Protocol) protocols. This aspect allows the framework, and especially the robot manipulation module, to function under previously prohibitive conditions, utilizing QoS settings to better allocate the available resources.

2.5. Visualization

This module is responsible for handling different types of demonstration, depending on the information provided. More specifically, the following data are currently supported for visualization:
  • Inspection Results: The quality inspection module provides the detected and classified objects via ROS2 communication mechanisms; the detected parts are visualized in order for the operator to be aware of the system results and proceed with any additional corrective action that may be needed;
  • Voice commands: Different commands are recognized from the voice recognition module while the operator provides the required speech signals; these commands are visualized via a separate graphical user interface; in this way, the operator is able to understand which commands have been identified correctly or misunderstood due to false pronunciation.
  • Workstation Layout: Depending on the selected industrial scenario, different layout configurations may occur; this means that hardware materials such as tables, conveyors, fixtures, grippers, cameras, sensors and others can have different positioning inside the workstation, thus affecting the collisions and motion planning of the overall scenario; thus, the visualization module provides the opportunity to update the overall workstation and depict it to the user via the ROS2 architecture;
  • Robot Resources Location: The major resources currently considered are the different robots that may be included; depending on the selected workflow, the robot may change multiple positions and configurations, thus leading to changes in the overall execution; similarly to the workstation layout, the location of resources is visualized digitally in order for the operator to be aware of their position in real time, and thus increase his/her sense of safety.

3. System Implementation

Based on the characteristics of the proposed voice-enabled ROS2 framework for quality inspection operations, a three-layer architecture has been designed. These layers (Human Voice-Based Interaction, Digital Computation and Visualization, Robot-Side Interaction) work together in order to facilitate the necessary functions of the framework and the communication between its modules, as presented in Figure 6. The aforementioned layers incorporate different modules that are responsible for handling different functionalities. This section is focused on the implementation of these layers and their modules and the analysis of their specific use and added value within the overall architecture.

3.1. Human Voice-Based Interaction Layer

The Human Voice-Based Interaction Layer is implemented as a standalone web application, which is the gateway to the system and connects the user with the functionalities of the framework. The application is hosted on a local server and acts as a user interface, presenting the available commands to the user and providing a visual verification of the selected action. For the various functionalities of this layer, the JavaScript programming language is utilized. Finally, in order to directly comply with the ROS architectural standard, the external roslibjs library, a JavaScript library that provides ROS functionalities, is used.

3.1.1. Voice Backend Services

This layer mainly consists of two components, namely, the voice backend services and the frontend user interface. The voice backend services include the implementation of the following three separate services:
  • Voice Recognition Service: This is the core service of the Human Voice-Based Interaction Layer; it builds on the Google Chrome browser's functionality to recognize human speech through a connected microphone device using Google's speech recognition engines; more specifically, the service uses the browser and the Google Voice Recognition API [60] to accurately transcribe human speech into a text message; that message is then reconstructed into a form that can be transmitted through the ROS2 network, more specifically an std_msgs/String, a standard message type used for ROS2 communication;
  • WebSocket Service: This component of the Human Voice-Based Interaction Layer is responsible for the communication between the local network and the ROS2 network; even though the ROS2 network is built on top of the original TCP/IP protocol, exchanging data still requires following the traditional TCP/IP protocol and creating a communication port between the two networks, which are viewed as distinct by the system; once the standard TCP connection is established, data can be exchanged through the port.
  • RosBridge Service: This last component of the Human Voice-Based Interaction Layer functions as the opposing side to the WebSocket Service and is responsible for transferring the data presented to the port inside the ROS2 network; specifically, it publishes the aforementioned std_msg, provided to the port by the WebSocket Service, to a ROS2 network end-point, namely, a topic. In order to achieve these functionalities, the rosbridge_server package is used. This package is part of the rosbridge_suite collection, which provides a communication point for non-ROS-based programs to interact with the ROS network; an illustrative client-side sketch is given after this list.
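For illustration, the sketch below uses the Python rosbridge client roslibpy to show how a recognized command could be pushed through the rosbridge WebSocket onto a ROS2 topic; the actual frontend performs the equivalent calls with roslibjs in the browser, and the host, port and topic shown here are assumptions.
```python
# Illustrative rosbridge client using the Python roslibpy library (the real
# frontend uses roslibjs). Host, port and topic name are assumptions.
import roslibpy

client = roslibpy.Ros(host='localhost', port=9090)  # default rosbridge_server WebSocket port
client.run()

# Publish the recognized command as an std_msgs/String on an assumed topic.
talker = roslibpy.Topic(client, '/speech/raw_text', 'std_msgs/String')
talker.publish(roslibpy.Message({'data': 'start'}))

talker.unadvertise()
client.terminate()
```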

3.1.2. Frontend User Interface

In addition, the frontend user interface (Figure 7) includes the implementation of the following submodules:
  • Available Commands: A custom implementation based on HTML5 (Hypertext Markup Language version 5) is deployed in order to filter, in real time, the spoken commands and match them with the permitted ones; in addition, HTML canvas is utilized for visualizing the permitted commands to the operator through the corresponding user interface;
  • Validation Checkmark: Using Cascading Style Sheet (CSS) functionalities, a visual verification of the pronounced command is provided; more specifically, a custom checkmark is depicted each time an accepted speech command is recognized; this offers the opportunity to the user to validate the command itself, as well as proceed with the next desired one;
  • Execution Visual Enabler: This module incorporates different visual effects and buttons that support the execution of the speech recognition process with the operator; more specifically, using HTML5 and CSS, two buttons are created in order to start or stop the capturing of the voice; in addition, a blinking text prompt demonstrates to the user that his/her voice is being captured.

3.2. Digital Computation and Visualization Layer

The bedrock of the entire architecture is the Digital Computation and Visualization Layer, which functions as the main orchestrator of the overall framework. Its components are focused on functionalities crucial for the system such as error handling, scenario planning and task monitoring. In this layer, different components are included such as the ROS2 Orchestrator, Robot Manipulation Servers, Quality Inspection Server, Scene Monitor and Error Handling.
The following subsections provide a concise and precise description of these components, their implementation and their role within the overall architecture.

3.2.1. ROS2 Orchestrator

The ROS2 Orchestrator is responsible for managing the actual execution flow of the overall scenario, as well as the different connections inside the framework. More specifically, it consists of the following submodules:
  • Scenario Selector Service: In order to execute a voice command, many different parts of the system need to work together; to orchestrate these interactions and functionalities, a scenario service is put in place; this service is responsible for breaking down the command into sub-actions and communicating the appropriate actions to each sub-component of the framework; it is created using the C++ programming language, which is mainly used throughout this framework for its performance, efficiency and flexibility; additionally, C++ provides excellent tools for resource management and memory allocation that are crucial for streamlining the processes of the framework; in the case of the scenario service, the rclcpp library [61] is used along with some custom libraries to create a ROS2 service that is then broadcast to the ROS2 network;
  • Action Server Handler: The available voice commands are broken down into actions by the scenario service. One of the most used actions in the system is robot manipulation. While these actions may seem straightforward, in actuality, different manipulation actions require different implementations in order to be executed properly; even the basic functions of the system, like the inspection of a part, require coordination between the Robot Manipulation Servers (explained in Section 3.2.2) while also utilizing the Quality Inspection Server (explained in Section 3.2.4) based on their results; this results in the creation of many different servers, whose handling is carried out by the Action Server Handler component, which is responsible for managing these servers and appropriately using them for each action; this module is also implemented in C++; utilizing the rclcpp libraries, a node is created that implements the various action requests needed for the ROS action servers; according to the modules used, the handler compiles the appropriate messages for each server and then facilitates the action needed; a simplified sketch of this handler is given after this list.
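The sketch below is a simplified Python (rclpy) analogue of the Action Server Handler; the actual implementation is written in C++ with rclcpp. The action type MoveToNamedPose, the interface package hrc_interfaces, the server name and the command topic are hypothetical placeholders used only to show the goal-dispatching pattern.
```python
# Simplified rclpy analogue of the Action Server Handler (the real one is C++/rclcpp).
# 'hrc_interfaces', 'MoveToNamedPose' and all names below are hypothetical placeholders.
import rclpy
from rclpy.node import Node
from rclpy.action import ActionClient
from std_msgs.msg import String
from hrc_interfaces.action import MoveToNamedPose  # hypothetical interface package

# Mapping of permitted voice commands to named robot inspection poses.
COMMAND_TO_POSE = {
    'right': 'inspect_right', 'left': 'inspect_left',
    'top': 'inspect_top', 'front': 'inspect_front', 'back': 'inspect_back',
}

class ActionServerHandler(Node):
    def __init__(self):
        super().__init__('action_server_handler')
        self.move_client = ActionClient(self, MoveToNamedPose, 'move_pose_server')
        self.create_subscription(String, '/speech/command', self.on_command, 10)

    def on_command(self, msg: String):
        pose_name = COMMAND_TO_POSE.get(msg.data)
        if pose_name is None:
            return  # start/stop commands are handled by the scenario service
        goal = MoveToNamedPose.Goal()
        goal.pose_name = pose_name
        self.move_client.wait_for_server()
        # Send the goal asynchronously; its completion triggers the quality inspection server.
        self.move_client.send_goal_async(goal)

def main():
    rclpy.init()
    rclpy.spin(ActionServerHandler())
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```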

3.2.2. Robot Manipulation Servers

The robot's motion execution is based on the usage of the custom MovePose and MoveJoint servers. While these are completely distinct from each other, they are categorized based on their main functionality, which is the movement of the robot component of the framework. Their responsibilities are to monitor and facilitate communication between the robot and the system, to monitor the execution of any motion request sent to the robot, and to provide feedback to the system while handling any errors that may occur. The difference between these two servers lies in the way they approach movement requests and forward them to the robot controller. While the MoveJoint server uses joint torques to move the robot into certain positions, the MovePose server uses 3D space coordinates and quaternions to bring a robot link to a desired pose using an inverse kinematics approach. These action servers are used in completely different situations; hence, both components are necessary.

3.2.3. Scene Monitor

The system's many functionalities depend on accurate, precise calculations that consider many factors. The spatial coordinates of the robot and its relations to the environment, the limits of its joints and the external forces that affect the robot (gravity, angular movement, inertia) all need to be accounted for when manipulating the robot. In order to accommodate these needs, the system recreates the operation space digitally and performs approximate calculations in order to facilitate movement. All these parameters are taken into account with the help of a scene monitor component that uses a Unified Robot Description Format (URDF) data structure to recreate the joints and links of the robot. While movement is not directly computed in this component, its role is crucial in the planning phase of every operation. Moreover, this component has a digital visualization functionality through software that gives a complete representation of the environment to the operator in real time. Combining the data from the camera sensor, this component is used as a digital output of the system that is presented directly to the operator. The module uses the MoveIt2 ROS2 package [62] for its functionalities. The MoveIt2 package is a direct upgrade of MoveIt, modified to work within the ROS2 architecture. It provides a robotic manipulation platform for ROS2 and incorporates the latest advances in motion planning, manipulation, 3D perception, kinematics, control, and navigation. In this particular module, all these functionalities are used to create a three-dimensional representation of the working space. In combination with the RViz2 package, a ROS2 package that facilitates 3D API and interface visualization, the real-time depiction of the environment, as well as the motions and states of the robot, is presented to the operator (Figure 8).

3.2.4. Quality Inspection Server

The quality inspection server basically refers to the training of the model proposed in Section 2, as well as its integration into production in order to be utilized in collaboration with the different modules. For the implementation of the model's architecture, the PyTorch (version 2.1.0) framework was utilized. PyTorch is an optimized Deep Learning tensor library based on Python that provides tensor computation with strong GPU acceleration and automatic differentiation for creating and training neural networks. For the visualization and the collection of the data, dedicated Python (version 3.8.10) libraries such as OpenCV and PIL were used. For the integration of the model and the communication with the different modules, an action server based on the ROS2 framework was utilized.

Dataset Generation and Augmentation

For an object detection process, raw images, as well as the bounding boxes of the desired objects within these images, are the necessary information for the training procedure to begin. However, the different models implemented for this object detection task do not employ the same method of data processing, meaning that different data formats are available. For instance, one popular format used for object detection is the COCO (Common Objects in Context) [63] format, which provides labelled images with corresponding bounding boxes for various object categories. In the current implementation, the dataset consists of 1200 images, which were collected using an rc_visard65 camera by ROBOCEPTION GmbH (sourced from Munich, Germany) [64] and Python-based scripts for acquiring and storing data through the ROS2 network. The dataset includes images in which the objects are exhibited under different lighting conditions and backgrounds, helping the model generalize and avoid overfitting. Specifically, for this work, a set of six categories was established to denote whether objects on a motor, which is the component of interest for the inspection, were correctly or incorrectly positioned. The corresponding quantity of images for each class is presented in Table 3.
The task of labeling the images is accomplished manually by employing a graphical user interface (GUI) developed in Python. For each image of the dataset, there is a corresponding label file that contains all the information regarding the objects of interest. A label file is basically a text file in which every row is dedicated to a different object in the corresponding image. Each row has five different values: the first one is an integer associated with the class of the object, and the other four describe the bounding box's center coordinates (x, y) as well as its width and height; a minimal parsing sketch is given below. An indication of the corresponding labeled data using an automotive engine is visualized in Figure 9.
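The sketch below parses one such label file following the row format just described; the file path and example row are illustrative, and the assumption that coordinates are normalized to the image size follows the common YOLO convention rather than being stated explicitly in the text.
```python
# Sketch of parsing a YOLO-style label file: one row per object,
# "class_id center_x center_y width height". The path is illustrative.
def load_labels(label_path):
    objects = []
    with open(label_path, 'r') as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed rows
            class_id = int(parts[0])
            cx, cy, w, h = map(float, parts[1:])
            objects.append({'class': class_id, 'bbox': (cx, cy, w, h)})
    return objects

# Example row: "2 0.48 0.55 0.12 0.09" -> class 2, box centered at (0.48, 0.55)
# with width 0.12 and height 0.09 (coordinates commonly normalized to image size).
```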
Moreover, within this scenario, data augmentation methods were employed to apply diverse transformations to the original dataset, thereby generating extra images. These transformations are implemented through Python-based functions, and involve rotations, flips, crops, and more sophisticated techniques such as mosaic and mix-up.

Inspection Model Training

In the present work, the training process of the implemented model spans 300 epochs, employing a linear learning rate that gradually decreases. This ensures that the network can adapt its weights effectively in later epochs in order to learn new features and achieve higher accuracy in the object detection task. Additionally, to further enhance the training performance, dedicated activation and loss functions are utilized. Specifically, for this case, the mish activation function originally proposed in [65] is considered, which was basically formed in order to address the limitations of traditional activation functions by combining the non-linear properties of tanh with the saturating properties of the exponential function, resulting in an activation less prone to saturation. The corresponding formula for this activation function is defined as follows:
$f(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right)$
It is non-monotonic, meaning that the sign of its slope changes as the input value changes, which allows it to better approximate complex functions; it also has a continuous and smooth first derivative, which is important for efficient training. The visualization of the selected mish activation function can be seen in more detail in Figure 10.
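A minimal PyTorch sketch of the mish activation is given below; it uses the equivalent form x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The sample values are illustrative only.
```python
# Minimal PyTorch sketch of mish: f(x) = x * tanh(ln(1 + e^x)) = x * tanh(softplus(x)).
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.tanh(F.softplus(x))

# Example: mish keeps small negative inputs slightly negative (non-monotonic)
# while behaving almost linearly for large positive inputs.
x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(mish(x))
```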
As for the selection of the loss function, it is closely tied to the specific task that the network is performing, and different tasks require different handling. Therefore, in the presented work, two loss functions are utilized, namely, the mean squared error (MSE) and cross-entropy losses. The MSE loss is used to optimize the bounding box regression, and it basically measures the average squared difference between the predicted and the ground truth bounding box coordinates as follows:
$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$,
where,
  • n is the number of bounding boxes;
  • $y_i$ represents the predicted bounding box coordinates;
  • $\hat{y}_i$ represents the ground truth coordinates.
Cross-entropy loss is used to optimize the object classification and measures the dissimilarity between the predicted class probabilities and the encoded ground truth labels. In object detection, the predicted class probabilities are typically represented as a vector of scores, where each score corresponds to the predicted probability of a particular class. Cross-entropy loss is calculated as the negative log-likelihood of the ground truth class label, given the predicted class probabilities, as follows (a short sketch of both losses is given after the definitions below):
$CE = -\sum_{c=1}^{C} y_c \log\left(P_c\right)$,
where,
  • C is the number of classes;
  • $y_c$ is the ground truth class label;
  • $P_c$ is the predicted probability of class c.
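The sketch below combines the two loss terms described above using standard PyTorch modules: mean squared error for the box regression and cross-entropy for the classification. The tensors are illustrative placeholders, not data from the presented experiments.
```python
# Sketch of the two loss terms: MSE for bounding box regression and
# cross-entropy for classification. All tensors below are illustrative.
import torch
import torch.nn as nn

mse_loss = nn.MSELoss()            # average squared difference of box coordinates
ce_loss = nn.CrossEntropyLoss()    # negative log-likelihood of the true class (from logits)

# Predicted vs. ground truth box coordinates (x, y, w, h) for n = 2 boxes.
pred_boxes = torch.tensor([[0.50, 0.50, 0.20, 0.10], [0.30, 0.70, 0.15, 0.15]])
true_boxes = torch.tensor([[0.52, 0.48, 0.22, 0.11], [0.28, 0.72, 0.15, 0.14]])

# Predicted class scores (logits) for C = 6 classes and the true class indices.
pred_logits = torch.randn(2, 6)
true_classes = torch.tensor([1, 4])

total_loss = mse_loss(pred_boxes, true_boxes) + ce_loss(pred_logits, true_classes)
print(total_loss.item())
```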

Inspection Model Integration

For the integration of the final trained model and its real-time operation to achieve the desired performance, the ROS2 framework is utilized. Specifically, an action server has been developed that is capable of loading the trained model and calling it to obtain results when necessary. This server is tasked with receiving a pre-defined goal and deciding, based on the goal's information, whether an inspection should occur. The results are then published through the network and are available to the rest of the modules connected to it. Upon completion of the inspection procedure, a decision-making process occurs, determining whether production for the specific part should proceed or whether an operator action is required to fix any existing faulty components.
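A simplified rclpy sketch of this integration is given below: the server loads the trained detector once and runs an inference whenever a goal is accepted. The action type InspectPart, its fields, the interface package, the weight file and the dummy frame are hypothetical placeholders and do not correspond to the authors' actual interface.
```python
# Simplified sketch of the quality inspection action server. 'hrc_interfaces',
# 'InspectPart', its fields and the file paths are hypothetical placeholders.
import rclpy
from rclpy.node import Node
from rclpy.action import ActionServer
import torch
from hrc_interfaces.action import InspectPart  # hypothetical interface package

class QualityInspectionServer(Node):
    def __init__(self):
        super().__init__('quality_inspection_server')
        # Load the trained detector once at start-up (path is illustrative).
        self.model = torch.load('yolov4_engine_inspection.pt', map_location='cpu')
        self.model.eval()
        self.server = ActionServer(self, InspectPart, 'inspect_part', self.execute)

    def latest_frame(self):
        # Placeholder: in the real system the frame comes from the acquisition node.
        return torch.zeros(1, 3, 416, 416)

    def execute(self, goal_handle):
        # Decide from the goal's information whether an inspection should occur.
        if not goal_handle.request.run_inspection:
            goal_handle.abort()
            return InspectPart.Result()
        with torch.no_grad():
            detections = self.model(self.latest_frame())  # boxes, classes, confidences
        goal_handle.succeed()
        result = InspectPart.Result()
        result.detections = str(detections)  # published back through the ROS2 network
        return result

def main():
    rclpy.init()
    rclpy.spin(QualityInspectionServer())
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```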

3.3. Robot-Side Execution Layer

Under this last architectural layer, mostly low-level control aspects take place, mainly robot monitoring and control, as well as motion planning. This layer, while consisting of very few components, is designed for accuracy and fast communication between its parts, since any potential errors can be quite critical. More specifically, the ROS2 Orchestrator and the Robot Manipulation Servers receive constant feedback from the components of this layer in order to closely monitor any abnormal variation in its results and functions.

3.3.1. Robot State Monitor

The Robot State Monitor functions as the main feedback provider between the ROS2 Orchestrator and the robot-side execution. Its main functions revolve around the real-time monitoring of the state of each robot joint, a critical role for the functionality of the system. Even small delays in communication can result in malfunctions, which is why the system is designed around the feedback cycles of this component. The ROS2 Orchestrator is directly responsible for the error handling of the Robot State Monitor, leaving the component responsible only for providing the robot states and no other functions. This is accomplished using the MoveIt2 package and its functionalities. As described for the Scene Monitor, the MoveIt2 package allows for three-dimensional perception, which is accomplished by the continuous monitoring of the state of every robot joint. This functionality is used by the Robot State Monitor, which feeds this information through the MoveIt2 package in order to keep the system updated on the state of the robot. In the ROS2 architecture, a script is also used in the main operational system of the robot that allows for faster communication and higher update frequencies.

3.3.2. Motion Planning

As already mentioned, the system needs to perform precise movements in three-dimensional space. The planning of these movements is performed by the motion planning component, which is responsible for calculating a safe navigational trajectory for the robot. This component is implemented in C++ with the MoveIt2 libraries. A node is created in the ROS2 setting that is responsible for solving the inverse kinematics problem for the desired motion. The path is then constructed as joint deviations from the current position and passed to the robot controller.

3.3.3. Robot Controller

This last component is the gateway from the system to the robotic manipulator. Through the robot controller, joint angles are passed to the hardware interface of the robot, which facilitates the actual movement. While this component has defined operations, there is great variety in its functions, as the way the controller interprets the given angles can vary a lot. Under the proposed architecture, the movement needs are not that complex, and thus the JointTrajectory controller [66], a controller for executing joint-space trajectories on a group of joints, is used. Trajectories are specified as a set of waypoints to be reached at specific time instants, which the controller attempts to execute as well as the mechanism allows. Waypoints consist of positions and, optionally, velocities and accelerations.
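As an illustration of the waypoint format just described, the sketch below publishes a single-waypoint trajectory built from trajectory_msgs messages. The controller topic name is an assumption (the conventional name for the ros2_control JointTrajectory controller), and the joint names follow the usual UR naming; the joint values and timing are examples only.
```python
# Sketch of commanding a JointTrajectory controller with one waypoint.
# The topic name and the target joint values are assumptions for illustration.
import rclpy
from rclpy.node import Node
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint
from builtin_interfaces.msg import Duration

class TrajectoryCommander(Node):
    def __init__(self):
        super().__init__('trajectory_commander')
        self.pub = self.create_publisher(
            JointTrajectory, '/joint_trajectory_controller/joint_trajectory', 10)

    def send_waypoint(self):
        traj = JointTrajectory()
        traj.joint_names = ['shoulder_pan_joint', 'shoulder_lift_joint', 'elbow_joint',
                            'wrist_1_joint', 'wrist_2_joint', 'wrist_3_joint']
        point = JointTrajectoryPoint()
        point.positions = [0.0, -1.57, 1.57, -1.57, -1.57, 0.0]  # target joint angles [rad]
        point.time_from_start = Duration(sec=4)                   # reach the waypoint in 4 s
        traj.points.append(point)
        self.pub.publish(traj)

def main():
    rclpy.init()
    node = TrajectoryCommander()
    node.send_waypoint()
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```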
The presented implementation layers and components are utilized in a specific way, depending on the selected case and scenario. An actual sequence diagram representing the execution flow of the proposed framework is presented in Figure 11.

4. Case Study

The proposed voice-enabled ROS2 framework is applied in a case derived from the automotive industry. For this application, a combination of hardware and software components is proposed in order to facilitate an accurate representation of the industrial workspace. The hardware modules mainly consist of a robotic manipulator and a vehicle engine that will be used to demonstrate the real-time capabilities of the system, as well as a camera sensor necessary for data collection (Figure 12).

4.1. Hardware Components

In this case study, a UR10 collaborative robot [67] was used as the robotic manipulator. The UR10 model has a payload capacity of 10 kg and a reach of 1300 mm, while providing a movement accuracy of about 0.1 mm. These specifications make the model well suited for a wide range of applications which, in turn, contributes to its popularity throughout the automotive industry sector. In this use case, its accurate movements ensure an optimal inspection process.
For the camera device, the rc_visard65 camera sensor, manufactured by ROBOCEPTION GmbH and sourced from Munich, Germany, was selected; it is an ideal choice for a robot-mounted camera due to its small stature and minimal weight. The on-board processing that it offers allows a major part of the image processing to be performed outside the framework, alleviating a huge portion of the computational work. Furthermore, the sensor provides precise color differentiation in both natural and artificial lighting, while the software component of the camera allows for custom modifications to the lighting settings.
In addition, the microphone device used in this use case is the BlueParrott B250-XTS [68] (manufactured by BlueParrott and sourced from Patras, Greece), a mono Bluetooth headset selected for its flexibility and range. It provides a noise-cancellation feature that is crucial for voice recognition in an industrial setting, while also allowing for up to 20 h of use time. These features, combined with its Bluetooth wireless connectivity, make this headset optimal for this case.
In order to accommodate the needs of the framework, a specific computer setup was used. The system on which the framework was deployed had the following specifications:
  • Operating System: Ubuntu 20.04/apt 2.0;
  • ROS2 Distribution: Foxy;
  • ROS2 database: Mongodb 4.4;
  • ROS2 UR control package: Universal Robots ROS2 Driver.

4.2. Case Scenario

In order to better demonstrate the capabilities of the suggested framework, a step-by-step scenario that represents the real-time needs of the automotive industry was created. The investigated scenario directly correlates with the quality inspection procedures that take place within many industrial facilities of automotive factories. The selected procedure revolves around the inspection of three small parts of an assembled motor, usually performed by a human operator, namely: (a) the screws, (b) the connectors and (c) the valve pipes.
While the process of quality inspection differs according to the set-up, industrial needs, etc., a standard division of tasks was implemented to define a more streamlined procedure. More specifically, the process is broken down into five tasks that represent a specific set of actions for both the human operator and the robotic manipulator. The list of tasks is presented in Table 4.
The proposed framework is responsible for monitoring the tasks mentioned above and performing the robot manipulation, as well as the inspection process, while simultaneously providing feedback to the operator. The inspection process, therefore, can support many different types of parts with small changes to the overall functionality of the system. In this study, a set of valves, screws and connectors is used to demonstrate the functions of the proposed inspection module.

4.3. Evaluation Criteria

For this use case scenario, the evaluation focuses on the following four distinct aspects of the framework:
  • Manufacturing process contributions: The contribution to the proposed process is evaluated by monitoring the impact of the framework on the cycle time reported by the industrial partner;
  • Quality inspection efficiency: The effectiveness of the framework is evaluated by measuring the accuracy of the quality inspection module;
  • Human–robot interaction metrics: The interaction between operator and robot is evaluated using HRI (human–robot interaction) metrics, namely voice recognition accuracy, subjective ratings and workload;
  • System performance: The target is to track the robustness of the framework by measuring the performance of the process on the system it is applied to.
A total of four different sets of results were collected. First of all, data related to the cycle time of the process operation were gathered to establish whether the framework provides a faster solution for the quality inspection operation. Secondly, a set of tests was performed regarding the accuracy and robustness of the voice recognition module and its ability to accurately detect human commands under industrial settings. Thirdly, the quality inspection module was extensively tested in order to validate that the accuracy of the detection and inspection process met the end-user requirements. Lastly, the performance of the system in which the framework was implemented was monitored, aiming to ensure the robustness and reliability of the process.

4.3.1. Manufacturing Process Contributions

In the first set of tests, the cycle time metric was used to determine how effectively the framework reduces the time needed to perform a quality inspection on a motor engine. Taking into account the requirements of the end user, the overall cycle time for the inspection process is currently around 1 min. One of the core goals of the proposed framework is to decrease that time while also providing accuracy and safety to the human operator. In this case, the effectiveness of the framework is tested by measuring the time needed for the operator to finish the inspection while collaborating with the robot. The exact steps of the process are as described in the above sections and are also measured individually.
The parameters governing the robot’s speed are the maximum velocity scaling factor and the maximum acceleration scaling factor. These parameters define, through the voice-enabled ROS2 framework, the velocity and the acceleration of the robot while performing the necessary movements. The robot speed must never exceed the collaborative limits set by the relevant ISO standards, so that the safety of the operator is always ensured. An additional factor to consider is the number of planning attempts the planning module is allowed when calculating the trajectory points for the robot to follow: more attempts make the planning more robust but add time to the overall operation. These three variables are the ones most commonly varied when testing robotic cycle times and would, of course, play an even more important role in a larger-scale use case.
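These three variables correspond directly to fields of MoveIt 2's motion-planning request. The sketch below simply fills a moveit_msgs/MotionPlanRequest for one of the tested configurations; the planning group name is an assumption, and building the goal constraints and dispatching the request to the planner (e.g., through the MoveGroup action) are omitted.

```python
# Illustrative mapping of the three tested variables onto MoveIt 2's
# planning request message. The planning group name is an assumption; goal
# constraints and sending the request are intentionally left out.
from moveit_msgs.msg import MotionPlanRequest


def build_plan_request(velocity_scaling: float,
                       acceleration_scaling: float,
                       planning_attempts: int) -> MotionPlanRequest:
    req = MotionPlanRequest()
    req.group_name = 'ur_manipulator'                   # assumed planning group
    req.num_planning_attempts = planning_attempts       # robustness vs. planning time
    req.max_velocity_scaling_factor = velocity_scaling  # kept within collaborative limits
    req.max_acceleration_scaling_factor = acceleration_scaling
    return req


# Example: the configuration used in Test 2 of Table 6.
request = build_plan_request(0.2, 0.2, 4)
```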

4.3.2. Quality Inspection Efficiency

To assess the accuracy of the quality inspection module, a distinct validation dataset was employed. The mean Average Precision (mAP) served as the evaluation metric for the model, a widely used metric for object detection tasks. To compute mAP, it is essential to initially form the precision–recall curve by calculating the corresponding sub-metrics for a range of different confidence thresholds.
Usually, object detection models output a confidence score which indicates how confident the model is that a predicted bounding box contains an object of a specific class. The confidence and IoU thresholds therefore differ, since the former refers to the class prediction while the latter refers to the bounding box’s coordinates. The area under the precision–recall curve (AUC) is a single value that represents the model’s performance for a specific class and is named the Average Precision (AP). To obtain a metric that contains the desired information for all the classes, the mean Average Precision (mAP) is used, which is simply the mean of the Average Precisions calculated for each class separately.
In the presented implementation, the calculation of the mAP on the evaluation dataset takes place in parallel with the training of the model in order to provide an overview of the “learning” across different epochs. More precisely, the training procedure is interrupted after every three epochs to conduct an evaluation process. During that process, two variations of mAP are calculated. The first is the mAP with a constant IoU threshold equal to 0.5, while the second, referred to as mAP:0.5:0.95, is the mean Average Precision calculated for a range of IoU thresholds from 0.5 to 0.95, in increments of 0.05. This metric provides a more complete picture of the model’s performance in detecting objects at different levels of overlap with the ground-truth boxes. The difference between the two metrics is also visible in the figures displaying the mAP values during training.
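In equation form, with $p_c(r)$ denoting the precision–recall curve of class $c$, $C$ the number of classes and $t$ the IoU threshold, the metrics described above can be summarized as follows (standard definitions, stated here for completeness):

$AP_c = \int_{0}^{1} p_c(r)\,dr, \qquad \mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} AP_c, \qquad \mathrm{mAP}_{0.5:0.95} = \frac{1}{10}\sum_{t \in \{0.50,\,0.55,\,\dots,\,0.95\}} \mathrm{mAP}_{\mathrm{IoU}=t}$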
Also, for this use case scenario, a total of 18 tests were conducted in order to evaluate the model’s real-time operation. During these tests, the operator and the robot jointly examined the motor engine to identify any missing or incorrectly positioned components. The tests were conducted under different lighting conditions in order to examine the robustness of the proposed inspection module. Since the main purpose of the proposed system is to determine the quality of a product, the object categories considered during the conducted tests focused on defective components such as screws, connectors and valves. Nevertheless, as already mentioned, the model also underwent training to accurately identify properly assembled parts. This additional training ensures that the network can effectively distinguish the different states of the components, thus minimizing the risk of labeling a correctly assembled component as a faulty one. The confidence threshold of the model was set to 0.92, which is the value at which the system operates in real time.
To measure the illumination variance between the different tests, a modified version of the structural similarity metric (SSIM) was implemented. In general, SSIM is used as a comparison metric between two images and its value ranges from −1 to 1, with 1 indicating identical images and −1 a complete lack of structural similarity. SSIM is based on three main components: luminance, contrast and structure. In these experiments, since the robot positions when inspecting the motor are essentially the same for the three views, the structure term is omitted from the calculation. Therefore, the proposed metric only compares the luminance and the contrast of the different images, returning a value between 0 and 1. For each robot position, the images captured in the 18 different tests were compared pairwise, resulting in a total of 324 values for each inspected side, and the mean of those values was then calculated. Since the images are structurally identical, a value lower than 1 reflects a difference in illumination between the images. The final mean comparison values are 0.7 for the left view, 0.75 for the right view and 0.68 for the top view of the motor engine. The inspection module is responsible for detecting and classifying three different types of objects in their correct and incorrect states. The accuracy of the model for the different tests was evaluated and categorized according to the different types of objects, including the True Positives (TPs), False Positives (FPs) and False Negatives (FNs) for each one of them.
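Under this simplification, and assuming the standard SSIM luminance and contrast terms with the usual stabilizing constants $C_1$ and $C_2$ (the exact constant values used are not stated in the text), the illumination-comparison metric for two images $x$ and $y$ reduces to:

$l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad M(x,y) = l(x,y) \cdot c(x,y)$

where $\mu$ and $\sigma$ denote the mean intensity and the standard deviation of each image; for non-negative image data the product lies between 0 and 1.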

4.3.3. Human–Robot Interaction Metrics

Regarding the voice recognition module, a total of five operators (Table 5) volunteered to pronounce the available audio commands.
More specifically, each voice command was pronounced 100 times and the correct/incorrect recognitions were recorded. The participants were also asked to provide a subjective rating on aspects of the framework, with the results presented in Section 5. The participants were asked to rate the following:
  • Ease of use;
  • Responsiveness of framework;
  • User interface effectiveness;
  • Comfort in the interaction with the robot;
  • Trust in the framework.
In order to better evaluate the effectiveness of the HRC system, the operator workload was measured with the use of the NASA-Task Load Index (NASA-TLX) [69].
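For reference, the overall NASA-TLX score of each participant is the weighted mean of the six sub-scale ratings $r_i$, using the pairwise-comparison weights $w_i$ of Table 11, which sum to 15:

$\mathrm{TLX}_{\mathrm{overall}} = \frac{1}{15}\sum_{i=1}^{6} w_i \, r_i$

As a worked check, the ratings and weights of participant 1 in Tables 10 and 11 give (3·0 + 1·20 + 1·10 + 3·10 + 3·15 + 4·10)/15 = 145/15 ≈ 9.67, matching the reported total.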

4.3.4. System Performance

Lastly, in order to evaluate the reliability and robustness of the system, several stress tests were performed. The framework was deployed on a PC station equipped with an Intel® Core™ i7-9700 CPU @ 3.00 GHz (8 cores; Intel, sourced from Patras, Greece) and an NVIDIA GeForce RTX 3070 GPU (NVIDIA, sourced from Patras, Greece). The tests consisted of continuously iterating the process of the case study under different workloads for the CPU and the GPU, while the response time of the framework was measured. The CPU usage was monitored with the htop CLI (Command Line Interface) tool for Ubuntu and the GPU usage was monitored with the nvidia-smi CLI tool (Driver Version: 535.171.04).
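As an illustration of how such monitoring can be scripted (not the authors' exact tooling), the following sketch samples CPU and GPU utilization once per second while the stress tests run; it assumes the psutil package is installed and that the NVIDIA driver exposes the nvidia-smi CLI.

```python
# Minimal monitoring sketch (not the authors' exact tooling): samples CPU and
# GPU utilization once per second during the stress tests and writes them to
# a CSV file. Assumes psutil is installed and nvidia-smi is available.
import csv
import subprocess
import time

import psutil


def gpu_utilization_percent() -> float:
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu',
         '--format=csv,noheader,nounits'])
    return float(out.decode().strip().splitlines()[0])


with open('load_log.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'cpu_percent', 'gpu_percent'])
    for _ in range(600):  # roughly ten minutes of one-second samples
        writer.writerow([time.time(),
                         psutil.cpu_percent(interval=1.0),
                         gpu_utilization_percent()])
```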

5. Results

The results of the first set of experiments in terms of timing are analyzed in Table 6 and Figure 13.
Comparing the cycle time results of the proposed framework with the current state of production, there is a significant reduction from the existing 60 s devoted to inspection. The developed framework provides effective collaborative inspection between the human and the robot and completes the inspection tasks in 39.6 s on average, which is 20.4 s less, or a reduction of 33.4%, compared to the current state (Figure 14).
Additionally, compared to previous iterations of similar quality inspection modules [70,71], the proposed solution results in a significant reduction in the average time allocated to robot movements by 19–7 s or by 24.2–11.1% (Table 7), respectively. Finally, the developed framework expands on previous architectures [71] by providing more flexibility in the control of the human in the process of quality inspection.
The second set of experiments contains the calculation of the accuracy of the model over training on a validation dataset, as well as its performance in real-time operation. The results for the training accuracy are displayed in Figure 15. The final trained network scored 0.98 for mAP and 0.88 for mAP:0.5:0.95, meaning that the model correctly detected and classified 98% of the objects of the validation dataset at a constant IoU threshold of 0.5, and 88% of the objects when averaging over the range of IoU thresholds. As for the system’s evaluation in real time, the findings for the 18 conducted tests are summarized in Table 8.
Based on these experiments, it is observed that the proposed quality inspection module obtained the following results: (a) screws 94% TP—0.2% FP—6% FN, (b) connectors 90% TP—3% FP—8% FN, and (c) valve/pipes 76% TP—0% FP—24% FN. This is also depicted graphically in Figure 16.
To validate these findings thoroughly, an additional statistical analysis was conducted. Specifically, to assess the model’s performance across the three classes of interest and the final selected confidence threshold value (0.92), the F1 score was computed. This decision was based on the simplicity of the metric and the desire to evaluate the system’s performance at the designated confidence threshold that is used during the system’s operation. The F1 score can be calculated from the precision and recall by the following equation:
$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Using this formula, the respective F1 score values for each test were computed, as illustrated in Table 8. The mean of these scores per class indicates that the model achieved an accuracy of 0.96 for faulty screws, 0.94 for faulty connectors and 0.85 for faulty valves. The lower score for the valves compared to the other components is attributed to the similarity between the faulty and correct classes for this specific part, resulting in some errors. Nevertheless, the overall accuracy remains high, ensuring the model’s robustness in an industrial setting. To evaluate the system’s performance across all relevant classes, the F1 scores can be averaged across them, resulting in a final value of 91% correct predictions for the system according to the F1 metric. A comparison with existing methods from the field is presented in Table 9.
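The per-test values of Table 8 and the averaged figure follow directly from the detection counts. The short sketch below reproduces the computation for one test; the reported 91% is obtained by averaging the F1 scores across the three classes.

```python
# Recomputes precision, recall and F1 from the detection counts of Table 8,
# plus the average over the three classes, as used for the overall figure.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Example: the counts of Test 1 in Table 8 (TP, FP, FN per class).
counts = {'screws': (24, 0, 3), 'connectors': (8, 0, 0), 'valves': (2, 0, 1)}
f1_per_class = {name: precision_recall_f1(*c)[2] for name, c in counts.items()}
average_f1 = sum(f1_per_class.values()) / len(f1_per_class)
print(f1_per_class, round(average_f1, 2))
```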
After testing, the overall accuracy of the voice recognition module is 97%, based on the conducted experiments and the available human operators. For the remaining 3%, a more detailed analysis of the wrongly identified commands is provided in Figure 17.
The results of the subjective ratings were positive, with the operators rating the ease of use of the framework 9.3/10 on average and their feedback focusing on improvements to the user interface. Additionally, the results of the workload measurements of the operators are presented in Table 10. According to the metrics provided, the overall framework has a low-to-medium workload score of 15.33, with operators identifying the performance aspect of the framework as the most important scoring factor (Table 11).
The results of the system performance monitoring with variances in CPU and GPU loads are presented in Figure 18.
The results indicated that the system responded quickly and efficiently under strenuous conditions, with a worst-case response time of 1.51 ms observed when the CPU was under extreme load.

6. Discussion and Future Work

This paper is focused on the execution of voice-enabled quality inspection processes in human–robot collaborative environments. The proposed framework was developed based on the ROS2 concept for the realization of the communication between the quality inspection module, the robot controller and the voice recognition algorithm. Section 2 is focused on the presentation of the proposed framework’s modules, namely: (a) voice recognition, (b) quality inspection, (c) robot manipulation and (d) visualization. The implementation of the previously presented modules inside the developed framework is described in detail in Section 3, together with its main sequence diagram, presenting the connection between the software components for the proposed framework’s implementation. Finally, the integration of the developed framework in a use case from the automotive industry focused on the quality inspection of the assembled parts on a vehicle engine using a Universal Robot UR10 and a ROBOCEPTION camera is presented in Section 4.
As presented in Section 5, the developed framework for quality inspection was tested in a pre-industrial environment in terms of the following:
  • Process operation cycle time: Reduction of 33.4% compared to the current state of the inspection process;
  • Accuracy and robustness of voice recognition module: Based on the conducted experiments, the overall accuracy of the voice recognition module is 97%, and for the remaining 3%, a more detailed analysis of the wrongly identified commands is provided; the incorrect identification of audio commands was caused by mispronunciations or the local dialects of the users; in order to minimize these errors, different voice-capturing devices, as well as filters to analyze the received data, will be investigated in the future;
  • Accuracy of inspection process based on end user’s requirements: The final version of the inspection algorithm was able to correctly detect and classify 98% of the objects of the validation dataset at a constant IoU threshold of 0.5, and 88% of the objects when averaging over the range of IoU thresholds; to further increase the accuracy of the inspection module, the authors will investigate, in the future, the addition of extra lighting installed either in a static position in the robotic cell or on the robotic manipulator.
The proposed solution introduces a modular voice-enabled ROS2 framework for quality inspection, designed to operate in industrial environments. The central point of this approach is to provide human operators with control over the inspection process, minimize downtime and promote efficiency compared to existing methods. By prioritizing modularity, the framework achieves seamless integration with existing systems and provides the flexibility to introduce different technologies related to human–robot interaction and inspection. Furthermore, compared to past solutions, the deployed ROS2-based architecture enables scalability and reliability since the system’s execution and coordination will not be halted in case a node is shut down.
The future steps for the developed framework include its integration in a real industrial environment. This will enable the quality inspection of multiple vehicle engines so that the developed framework’s performance can be assessed based on the operators’ feedback. A questionnaire will be prepared by the authors and shared with the operators of the production line to collect valuable information regarding the framework’s functionality, which will be further analyzed.
In addition, the safety implications of collaborative inspection operations will be considered, since the operators are required to perform some tasks quite close to the robotic manipulator. For this reason, a certified collaborative UR10 robot was used in this study; however, further measures can be applied on this topic. More specifically, the authors are going to integrate industrial safety devices such as laser scanners or safety cameras in order to program safety zones that can be configured dynamically and either adjust the robot speed during execution or stop its movement in case of emergency.
Additional studies should focus on the integration of the voice-enabled ROS2 framework with autonomous mobile manipulators able to navigate freely on industrial shopfloors, performing inspection actions on various products. Thanks to the navigation ability of mobile robots, quality inspection processes can be executed on-the-fly while a product is being transported on conveyors to the next workstation of its assembly line. The collaboration of human operators with the overall system can be enhanced with the introduction of Augmented Reality (AR) devices such as AR glasses, smartphones and tablets. Thanks to the integration of these devices, the operators will be able to receive more information about the production line status and quality inspection results, and also interact with the system in a more intuitive way by creating new poses for the robot and triggering the quality inspection algorithm. Finally, other hardware devices for voice capturing will be tested to further enhance the functionality of the voice recognition module.
Other ROS-based modules, such as robotic gripper control or object detection, could also be connected to the proposed framework for the execution of other manufacturing operations, such as part assembly or drilling, which will be investigated as future steps for the proposed solution.

Author Contributions

Conceptualization, S.M., G.M. and A.P.; approach and methodology, A.P.; software, S.N., F.P.B. and S.A.; validation, S.N. and F.P.B.; formal analysis, A.P. and S.M.; investigation, A.P., S.N. and G.M.; resources, S.M.; data curation, F.P.B.; writing—original draft preparation, A.P., S.N., F.P.B. and S.A.; writing—review and editing, S.A., A.P. and S.M.; visualization, S.N., F.P.B. and S.A.; supervision, A.P. and S.M.; project administration, A.P., G.M. and S.M.; funding acquisition, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been supported by the European project “ODIN Open-Digital-Industrial and Networking pilot lines using modular components for scalable production” (Grant Agreement: 101017141) (http://odin-h2020.eu/ (accessed on 23 December 2023)) funded by the European Commission.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would also like to express their gratitude to PEUGEOT CITROEN AUTOMOBILES S.A. (PSA) for providing important information about the current status and the challenges of the vehicle engine assembly line.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chryssolouris, G. Manufacturing Systems: Theory and Practice, 2nd ed.; Mechanical Engineering Series; Springer: New York, NY, USA, 2006; ISBN 978-0-387-25683-2. [Google Scholar]
  2. Cañas, H.; Mula, J.; Campuzano-Bolarín, F.; Poler, R. A conceptual framework for smart production planning and control in Industry 4.0. Comput. Ind. Eng. 2022, 173, 108659. [Google Scholar] [CrossRef]
  3. Nosalska, K.; Piątek, Z.M.; Mazurek, G.; Rządca, R. Industry 4.0: Coherent definition framework with technological and organizational interdependencies. J. Manuf. Technol. Manag. 2019, 31, 837–862. [Google Scholar] [CrossRef]
  4. D’Avella, S.; Avizzano, C.A.; Tripicchio, P. ROS-Industrial based robotic cell for Industry 4.0: Eye-in-hand stereo camera and visual servoing for flexible, fast, and accurate picking and hooking in the production line. Robot. Comput.-Integr. Manuf. 2023, 80, 102453. [Google Scholar] [CrossRef]
  5. Makris, S. Cooperating Robots for Flexible Manufacturing; Springer Series in Advanced Manufacturing; Springer International Publishing: Cham, Switzerland, 2021; ISBN 978-3-030-51590-4. [Google Scholar]
  6. Michalos, G.; Makris, S.; Spiliotopoulos, J.; Misios, I.; Tsarouchi, P.; Chryssolouris, G. ROBO-PARTNER: Seamless Human-Robot Cooperation for Intelligent, Flexible and Safe Operations in the Assembly Factories of the Future. Procedia CIRP 2014, 23, 71–76. [Google Scholar] [CrossRef]
  7. Tsarouchi, P.; Makris, S.; Chryssolouris, G. Human—Robot interaction review and challenges on task planning and programming. Int. J. Comput. Integr. Manuf. 2016, 29, 916–931. [Google Scholar] [CrossRef]
  8. Chryssolouris, G.; Mourtzis, D.; International Federation of Automatic Control (Eds.) Manufacturing, Modelling, Management and Control 2004 (MIM 2004): A Proceedings Volume from the IFAC Conference, Athens, Greece, 21–22 October 2004; Elsevier for the International Federation of Automatic Control: Oxford, UK, 2005; ISBN 978-0-08-044562-5. [Google Scholar]
  9. Müller, R.; Vette, M.; Scholer, M. Robot Workmate: A Trustworthy Coworker for the Continuous Automotive Assembly Line and its Implementation. Procedia CIRP 2016, 44, 263–268. [Google Scholar] [CrossRef]
  10. Semeraro, F.; Griffiths, A.; Cangelosi, A. Human–robot collaboration and machine learning: A systematic review of recent research. Robot. Comput. -Integr. Manuf. 2023, 79, 102432. [Google Scholar] [CrossRef]
  11. Papanastasiou, S.; Kousi, N.; Karagiannis, P.; Gkournelos, C.; Papavasileiou, A.; Dimoulas, K.; Baris, K.; Koukas, S.; Michalos, G.; Makris, S. Towards seamless human robot collaboration: Integrating multimodal interaction. Int. J. Adv. Manuf. Technol. 2019, 105, 3881–3897. [Google Scholar] [CrossRef]
  12. Segura, P.; Lobato-Calleros, O.; Ramírez-Serrano, A.; Soria, I. Human-robot collaborative systems: Structural components for current manufacturing applications. Adv. Ind. Manuf. Eng. 2021, 3, 100060. [Google Scholar] [CrossRef]
  13. Aivaliotis, S.; Lotsaris, K.; Gkournelos, C.; Fourtakas, N.; Koukas, S.; Kousi, N.; Makris, S. An augmented reality software suite enabling seamless human robot interaction. Int. J. Comput. Integr. Manuf. 2023, 36, 3–29. [Google Scholar] [CrossRef]
  14. Quintero, C.P.; Li, S.; Pan, M.K.; Chan, W.P.; Machiel Van Der Loos, H.F.; Croft, E. Robot Programming Through Augmented Trajectories in Augmented Reality. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1838–1844. [Google Scholar]
  15. Kyjanek, O.; Al Bahar, B.; Vasey, L.; Wannemacher, B.; Menges, A. Implementation of an Augmented Reality AR Workflow for Human Robot Collaboration in Timber Prefabrication. In Proceedings of the 36th ISARC, Banff, AB, Canada, 21–24 May 2019. [Google Scholar]
  16. Maly, I.; Sedlacek, D.; Leitao, P. Augmented reality experiments with industrial robot in industry 4.0 environment. In Proceedings of the 2016 IEEE 14th International Conference on Industrial Informatics (INDIN), Poitiers, France, 19–21 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 176–181. [Google Scholar]
  17. Fraga-Lamas, P.; Fernandez-Carames, T.M.; Blanco-Novoa, O.; Vilar-Montesinos, M.A. A Review on Industrial Augmented Reality Systems for the Industry 4.0 Shipyard. IEEE Access 2018, 6, 13358–13375. [Google Scholar] [CrossRef]
  18. Segura, Á.; Diez, H.V.; Barandiaran, I.; Arbelaiz, A.; Álvarez, H.; Simões, B.; Posada, J.; García-Alonso, A.; Ugarte, R. Visual computing technologies to support the Operator 4.0. Comput. Ind. Eng. 2020, 139, 105550. [Google Scholar] [CrossRef]
  19. Gkournelos, C.; Karagiannis, P.; Kousi, N.; Michalos, G.; Koukas, S.; Makris, S. Application of Wearable Devices for Supporting Operators in Human-Robot Cooperative Assembly Tasks. Procedia CIRP 2018, 76, 177–182. [Google Scholar] [CrossRef]
  20. Tamantini, C.; Luzio, F.S.D.; Hromei, C.D.; Cristofori, L.; Croce, D.; Cammisa, M.; Cristofaro, A.; Marabello, M.V.; Basili, R.; Zollo, L. Integrating Physical and Cognitive Interaction Capabilities in a Robot-Aided Rehabilitation Platform. IEEE Syst. J. 2023, 17, 6516–6527. [Google Scholar] [CrossRef]
  21. Begel, A.; Graham, S.L. An Assessment of a Speech-Based Programming Environment. In Proceedings of the Visual Languages and Human-Centric Computing (VL/HCC’06), Brighton, UK, 4–8 September 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 116–120. [Google Scholar]
  22. Makris, S.; Tsarouchi, P.; Surdilovic, D.; Krüger, J. Intuitive dual arm robot programming for assembly operations. CIRP Ann. 2014, 63, 13–16. [Google Scholar] [CrossRef]
  23. Meghana, M.; Usha Kumari, C.; Sthuthi Priya, J.; Mrinal, P.; Abhinav Venkat Sai, K.; Prashanth Reddy, S.; Vikranth, K.; Santosh Kumar, T.; Kumar Panigrahy, A. Hand gesture recognition and voice controlled robot. Mater. Today Proc. 2020, 33, 4121–4123. [Google Scholar] [CrossRef]
  24. Linares-Garcia, D.A.; Roofigari-Esfahan, N.; Pratt, K.; Jeon, M. Voice-Based Intelligent Virtual Agents (VIVA) to Support Construction Worker Productivity. Autom. Constr. 2022, 143, 104554. [Google Scholar] [CrossRef]
  25. Longo, F.; Padovano, A. Voice-enabled Assistants of the Operator 4.0 in the Social Smart Factory: Prospective role and challenges for an advanced human–machine interaction. Manuf. Lett. 2020, 26, 12–16. [Google Scholar] [CrossRef]
  26. Ionescu, T.B.; Schlund, S. Programming cobots by voice: A human-centered, web-based approach. Procedia CIRP 2021, 97, 123–129. [Google Scholar] [CrossRef]
  27. Rožanec, J.M.; Zajec, P.; Trajkova, E.; Šircelj, B.; Brecelj, B.; Novalija, I.; Dam, P.; Fortuna, B.; Mladenić, D. Towards a Comprehensive Visual Quality Inspection for Industry 4.0. IFAC-Pap. 2022, 55, 690–695. [Google Scholar] [CrossRef]
  28. Spencer, B.F.; Hoskere, V.; Narazaki, Y. Advances in Computer Vision-Based Civil Infrastructure Inspection and Monitoring. Engineering 2019, 5, 199–222. [Google Scholar] [CrossRef]
  29. Hoskere, V.; Park, J.-W.; Yoon, H.; Spencer, B.F. Vision-Based Modal Survey of Civil Infrastructure Using Unmanned Aerial Vehicles. J. Struct. Eng. 2019, 145, 04019062. [Google Scholar] [CrossRef]
  30. Reichenstein, T.; Raffin, T.; Sand, C.; Franke, J. Implementation of Machine Vision based Quality Inspection in Production: An Approach for the Accelerated Execution of Case Studies. Procedia CIRP 2022, 112, 596–601. [Google Scholar] [CrossRef]
  31. He, Y.; Zhao, Y.; Han, X.; Zhou, D.; Wang, W. Functional risk-oriented health prognosis approach for intelligent manufacturing systems. Reliab. Eng. Syst. Saf. 2020, 203, 107090. [Google Scholar] [CrossRef]
  32. Papavasileiou, A.; Aivaliotis, P.; Aivaliotis, S.; Makris, S. An optical system for identifying and classifying defects of metal parts. Int. J. Comput. Integr. Manuf. 2022, 35, 326–340. [Google Scholar] [CrossRef]
  33. Al-Sabbag, Z.A.; Yeum, C.M.; Narasimhan, S. Enabling human–machine collaboration in infrastructure inspections through mixed reality. Adv. Eng. Inform. 2022, 53, 101709. [Google Scholar] [CrossRef]
  34. Ren, W.; Yang, X.; Yan, Y.; Hu, Y.; Zhang, L. A digital twin-based frame work for task planning and robot programming in HRC. Procedia CIRP 2021, 104, 370–375. [Google Scholar] [CrossRef]
  35. Li, H.; Ma, W.; Wang, H.; Liu, G.; Wen, X.; Zhang, Y.; Yang, M.; Luo, G.; Xie, G.; Sun, C. A framework and method for Human-Robot cooperative safe control based on digital twin. Adv. Eng. Inform. 2022, 53, 101701. [Google Scholar] [CrossRef]
  36. Makris, S.; Aivaliotis, P. AI-based vision system for collision detection in HRC applications. Procedia CIRP 2022, 106, 156–161. [Google Scholar] [CrossRef]
  37. Mello, R.C.; Sierra M., S.D.; Scheidegger, W.M.; Múnera, M.C.; Cifuentes, C.A.; Ribeiro, M.R.N.; Frizera-Neto, A. The PoundCloud framework for ROS-based cloud robotics: Case studies on autonomous navigation and human–robot interaction. Robot. Auton. Syst. 2022, 150, 103981. [Google Scholar] [CrossRef]
  38. Olbort, J.; Röhm, B.; Kutscher, V.; Anderl, R. Integration of Communication using OPC UA in MBSE for the Development of Cyber-Physical Systems. Procedia CIRP 2022, 109, 227–232. [Google Scholar] [CrossRef]
  39. Fennel, M.; Geyer, S.; Hanebeck, U.D. RTCF: A framework for seamless and modular real-time control with ROS. Softw. Impacts 2021, 9, 100109. [Google Scholar] [CrossRef]
  40. Macenski, S.; Foote, T.; Gerkey, B.; Lalancette, C.; Woodall, W. Robot Operating System 2: Design, Architecture, and Uses In The Wild. arXiv 2022, arXiv:2211.07752. [Google Scholar] [CrossRef]
  41. Bruyninckx, H. Open robot control software: The OROCOS project. In Proceedings of the 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No.01CH37164), Seoul, Republic of Korea, 21–29 May 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 3, pp. 2523–2528. [Google Scholar]
  42. Treinen, T.; Kolla, S.S.V.K. Augmented Reality for Quality Inspection, Assembly and Remote Assistance in Manufacturing. Procedia Comput. Sci. 2024, 232, 533–543. [Google Scholar] [CrossRef]
  43. Wu, Z.-G.; Lin, C.-Y.; Chang, H.-W.; Lin, P.T. Inline Inspection with an Industrial Robot (IIIR) for Mass-Customization Production Line. Sensors 2020, 20, 3008. [Google Scholar] [CrossRef]
  44. Land, N.; Syberfeldt, A.; Almgren, T.; Vallhagen, J. A Framework for Realizing Industrial Human-Robot Collaboration through Virtual Simulation. Procedia CIRP 2020, 93, 1194–1199. [Google Scholar] [CrossRef]
  45. ROS on DDS. Available online: https://design.ros2.org/articles/ros_on_dds.html (accessed on 29 August 2023).
  46. Pardo-Castellote, G. OMG data-distribution service: Architectural overview. In Proceedings of the 23rd International Conference on Distributed Computing Systems Workshops, 2003. Proceedings, Providence, RI, USA, 19–22 May 2003; IEEE: Piscataway, NJ, USA, 2003; pp. 200–206. [Google Scholar]
  47. Erős, E.; Dahl, M.; Bengtsson, K.; Hanna, A.; Falkman, P. A ROS2 based communication architecture for control in collaborative and intelligent automation systems. Procedia Manuf. 2019, 38, 349–357. [Google Scholar] [CrossRef]
  48. Horelican, T. Utilizability of Navigation2/ROS2 in Highly Automated and Distributed Multi-Robotic Systems for Industrial Facilities. IFAC-Pap. 2022, 55, 109–114. [Google Scholar] [CrossRef]
  49. Paul, H.; Qiu, Z.; Wang, Z.; Hirai, S.; Kawamura, S. A ROS 2 Based Robotic System to Pick-and-Place Granular Food Materials. In Proceedings of the 2022 IEEE International Conference on Robotics and Biomimetics (ROBIO), Jinghong, China, 5–9 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 99–104. [Google Scholar]
  50. Serrano-Munoz, A.; Elguea-Aguinaco, I.; Chrysostomou, D.; Bogh, S.; Arana-Arexolaleiba, N. A Scalable and Unified Multi-Control Framework for KUKA LBR iiwa Collaborative Robots. In Proceedings of the 2023 IEEE/SICE International Symposium on System Integration (SII), Atlanta, GA, USA, 17–20 January 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
  51. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640. [Google Scholar]
  52. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  53. Wang, C.-Y.; Mark Liao, H.-Y.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1571–1580. [Google Scholar]
  54. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2012; Volume 25. [Google Scholar]
  55. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242. [Google Scholar]
  56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  57. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Volume 8691, pp. 346–361. [Google Scholar]
  58. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. arXiv 2018, arXiv:1803.01534. [Google Scholar]
  59. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  60. Speech-to-Text: Automatic Speech Recognition. Available online: https://cloud.google.com/speech-to-text/ (accessed on 26 January 2024).
  61. GitHub—Ros2/Rclcpp: Rclcpp (ROS Client Library for C++). Available online: https://github.com/ros2/rclcpp (accessed on 29 August 2023).
  62. MoveIt 2 Documentation—MoveIt Documentation: Rolling Documentation. Available online: https://moveit.picknik.ai/main/index.html (accessed on 29 August 2023).
  63. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  64. Roboception—Rc_Visard 65 Color. Available online: https://roboception.com/product/rc_visard-65-color/ (accessed on 29 August 2023).
  65. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  66. Joint_Trajectory_Controller—ROS2_Control: Rolling Aug 2023 Documentation. Available online: https://control.ros.org/master/doc/ros2_controllers/joint_trajectory_controller/doc/userdoc.html (accessed on 29 August 2023).
  67. UR10e Medium-Sized, Versatile Cobot. Available online: https://www.universal-robots.com/products/ur10-robot/ (accessed on 29 August 2023).
  68. BlueParrott B250-XTS. Available online: https://www.emea.blueparrott.com/on-the-road-headsets/blueparrott-b250-xts##204426 (accessed on 29 August 2023).
  69. Sugarindra, M.; Suryoputro, M.R.; Permana, A.I. Mental workload measurement in operator control room using NASA-TLX. IOP Conf. Ser. Mater. Sci. Eng. 2017, 277, 012022. [Google Scholar] [CrossRef]
  70. Karami, H.; Darvish, K.; Mastrogiovanni, F. A Task Allocation Approach for Human-Robot Collaboration in Product Defects Inspection Scenarios. In Proceedings of the 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Naples, Italy, 31 August–4 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1127–1134. [Google Scholar]
  71. Darvish, K.; Bruno, B.; Simetti, E.; Mastrogiovanni, F.; Casalino, G. Interleaved Online Task Planning, Simulation, Task Allocation and Motion Control for Flexible Human-Robot Cooperation. In Proceedings of the 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Nanjing, China, 27–31 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 58–65. [Google Scholar]
  72. Rio-Torto, I.; Campaniço, A.T.; Pinho, P.; Filipe, V.; Teixeira, L.F. Hybrid Quality Inspection for the Automotive Industry: Replacing the Paper-Based Conformity List through Semi-Supervised Object Detection and Simulated Data. Appl. Sci. 2022, 12, 5687. [Google Scholar] [CrossRef]
  73. Im, D.; Jeong, J. R-CNN-Based Large-Scale Object-Defect Inspection System for Laser Cutting in the Automotive Industry. Processes 2021, 9, 2043. [Google Scholar] [CrossRef]
  74. Basamakis, F.P.; Bavelos, A.C.; Dimosthenopoulos, D.; Papavasileiou, A.; Makris, S. Deep object detection framework for automated quality inspection in assembly operations. Procedia CIRP 2022, 115, 166–171. [Google Scholar] [CrossRef]
Figure 1. Proposed concept of a human–robot collaborative inspection framework.
Figure 2. Visualization of the proposed object detection model’s operation.
Figure 3. Object detection model’s overall architecture.
Figure 4. CSP and residual blocks architecture.
Figure 5. Spatial Pyramid Pooling—SPP—network.
Figure 6. Voice-enabled ROS2 framework architecture.
Figure 7. Frontend user interface of the human voice-based interaction tool.
Figure 8. Digital scene including the object to be inspected and the robot manipulator.
Figure 9. Labeled data of an automotive engine part.
Figure 10. Mish activation function.
Figure 11. Sequence diagram for voice-enabled human–robot collaborative inspection.
Figure 12. Pre-industrial setup of the selected case study.
Figure 13. Execution cycle time results.
Figure 14. Cycle time comparison.
Figure 15. Object detection module evaluation metrics.
Figure 16. Conducted experiments on the quality inspection module.
Figure 17. Recognition error analysis of selected vocal commands.
Figure 18. Process response time measured with CPU and GPU workload variation.
Table 1. Existing HRC inspection frameworks and gap identification.
Existing Quality Inspection Solutions | Main Topic | Limitations Compared to Proposed Framework
[42] | Augmented Reality and Remote Assistance | No robotic manipulator to alleviate the repetitive task from the human and no in-line inspection
[43] | Inline Inspection with an Industrial Robot | Focused on a robot-centric approach and excluded the human factor and benefits of HRC
[44] | Framework for HRC through Simulation | Results are based on simulated environments and the human has minimal control over the process
Table 2. Existing ROS2-based solutions and gap identification.
Existing ROS2-Based Solutions | Main Topic | Limitations Compared to Proposed Framework
[47] | ROS2-based architecture for control in collaborative automation systems | Deprecated ROS1 nodes are still in place, interaction techniques are missing
[48] | ROS2-based multi-robotic system navigation | Focused on navigation, HRC operations not covered
[49] | ROS2-based robotic system for pick and place operations | Human–robot collaboration not tackled
[50] | Multi-control framework for KUKA collaborative robots | Focused on specific robot controller, interaction with operators and inspection operations not included
Table 3. Training and validation data.
Classes | Training Images | Validation Images
Correct Screw | 661 | 72
Faulty Screw | 743 | 82
Correct Connector | 383 | 42
Faulty Connector | 469 | 52
Correct Valve | 259 | 28
Incorrect Valve | 289 | 32
Table 4. List of tasks—automotive inspection operation.
ID | Title | Description
Task 1 | Initialization | Operator pronounces the command “start” and the robot goes to home position.
Task 2 | Inspection of motor’s right side | Operator pronounces the command “right” and the robot goes to the right side to perform inspection. Meanwhile, the operator performs inspection on the back area.
Task 3 | Inspection of motor’s top side | Operator pronounces the command “top” and the robot goes to the top area to perform inspection. Meanwhile, the operator performs corrective actions on the right area.
Task 4 | Inspection of motor’s left side | Operator pronounces the command “left” and the robot goes to the left area to perform inspection. Meanwhile, the operator performs corrective actions on the top area.
Task 5 | Finalization | All motor sides are inspected. The operator pronounces the command “stop” and the robot program is deactivated.
Table 5. Participants who tested the voice recognition module and their characteristics.
Participant | Gender | Age | Familiarity with Robotic Frameworks | Familiarity with Voice-Based Frameworks | Profession
1 | Male | 23 | Yes | No | Robotics Engineer
2 | Male | 26 | Yes | Yes | Robotics Software Engineer
3 | Male | 27 | No | Yes | Master Thesis Student
4 | Male | 20 | No | No | Researcher
5 | Female | 34 | Yes | Yes | Manager
Table 6. Execution results for the proposed framework.
Test No. | Velocity Scaling Factor | Accel. Scaling Factor | Planning Attempts | Right Task (s) | Top Task (s) | Left Task (s) | Cycle Time (s)
1 | 0.1 | 0.1 | 4 | 10 | 10 | 15 | 49
2 | 0.2 | 0.2 | 4 | 6 | 6 | 10 | 36
3 | 0.3 | 0.3 | 4 | 5 | 5 | 8 | 32
4 | 0.15 | 0.15 | 4 | 8 | 9 | 15 | 44
5 | 0.1 | 0.2 | 4 | 10 | 9 | 14 | 47
6 | 0.2 | 0.1 | 4 | 9 | 9 | 13 | 45
7 | 0.2 | 0.3 | 4 | 7 | 7 | 10 | 38
8 | 0.3 | 0.2 | 4 | 7 | 7 | 10 | 38
9 | 0.4 | 0.4 | 4 | 6 | 5 | 9 | 34
10 | 0.1 | 0.1 | 3 | 10 | 9 | 14 | 47
11 | 0.2 | 0.2 | 3 | 6 | 6 | 10 | 36
12 | 0.3 | 0.3 | 3 | 5 | 6 | 8 | 33
13 | 0.1 | 0.1 | 2 | 10 | 10 | 15 | 49
14 | 0.2 | 0.2 | 2 | 6 | 6 | 10 | 36
15 | 0.3 | 0.3 | 2 | 5 | 5 | 9 | 33
16 | 0.1 | 0.1 | 1 | 9 | 10 | 15 | 48
17 | 0.2 | 0.2 | 1 | 6 | 6 | 10 | 36
18 | 0.3 | 0.3 | 1 | 5 | 5 | 8 | 32
Table 7. Comparison of existing frameworks.
Framework | Avg. Robot Action Time (s) | Avg. Robot Action Time (%) | Human Control over the Process
[71] | 251.00 | 70.72 | NO
[70] | 203.00 | 82.00 | NO
Proposed solution | 25.00 | 63.1 | YES
Table 8. Experimental results for the proposed quality inspection module.
ID | Screws TP | FP | FN | P | R | F1 | Connectors TP | FP | FN | P | R | F1 | Valves TP | FP | FN | P | R | F1
1 | 24 | 0 | 3 | 1.0 | 0.89 | 0.94 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
2 | 24 | 1 | 2 | 0.96 | 0.92 | 0.94 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
3 | 23 | 0 | 4 | 1.0 | 0.85 | 0.92 | 5 | 1 | 2 | 0.83 | 0.71 | 0.76 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
4 | 21 | 0 | 6 | 1.0 | 0.78 | 0.87 | 6 | 1 | 1 | 0.85 | 0.85 | 0.86 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
5 | 24 | 0 | 3 | 1.0 | 0.89 | 0.94 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
6 | 25 | 0 | 2 | 1.0 | 0.92 | 0.96 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
7 | 25 | 0 | 2 | 1.0 | 0.92 | 0.96 | 6 | 0 | 2 | 1.0 | 0.75 | 0.85 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
8 | 26 | 0 | 1 | 1.0 | 0.96 | 0.98 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
9 | 26 | 0 | 1 | 1.0 | 0.96 | 0.98 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
10 | 26 | 0 | 1 | 1.0 | 0.96 | 0.98 | 7 | 0 | 1 | 1.0 | 0.88 | 0.93 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
11 | 26 | 0 | 1 | 1.0 | 0.96 | 0.98 | 6 | 0 | 2 | 1.0 | 0.75 | 0.85 | 3 | 0 | 0 | 1.0 | 1.0 | 1.0
12 | 26 | 0 | 1 | 1.0 | 0.96 | 0.98 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 3 | 0 | 0 | 1.0 | 1.0 | 1.0
13 | 26 | 0 | 1 | 1.0 | 0.96 | 0.98 | 7 | 0 | 1 | 1.0 | 0.88 | 0.93 | 3 | 0 | 0 | 1.0 | 1.0 | 1.0
14 | 27 | 0 | 0 | 1.0 | 1.0 | 1.0 | 6 | 0 | 2 | 1.0 | 0.75 | 0.85 | 3 | 0 | 0 | 1.0 | 1.0 | 1.0
15 | 27 | 0 | 0 | 1.0 | 1.0 | 1.0 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
16 | 27 | 0 | 0 | 1.0 | 1.0 | 1.0 | 7 | 1 | 0 | 0.88 | 1.0 | 0.93 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
17 | 27 | 0 | 0 | 1.0 | 1.0 | 1.0 | 7 | 1 | 0 | 0.88 | 1.0 | 0.93 | 2 | 0 | 1 | 1.0 | 0.67 | 0.79
18 | 27 | 0 | 0 | 1.0 | 1.0 | 1.0 | 8 | 0 | 0 | 1.0 | 1.0 | 1.0 | 3 | 0 | 0 | 1.0 | 1.0 | 1.0
Avg | 25.4 | 0.06 | 1.56 | 0.99 | 0.94 | 0.96 | 7.1 | 0.22 | 0.61 | 0.97 | 0.92 | 0.94 | 2.2 | 0 | 0.72 | 1.0 | 0.75 | 0.85
Table 9. Existing method comparison.
Existing Methods | Object Detection Accuracy
Hybrid Quality Inspection [72] | 82%
ResNet-101-FPN [73] | 71.8%
Deep Learning framework [74] | 96%
Proposed Framework | 98%
Table 10. NASA-TLX scoring for the conducted tests.
Participant | MD | PD | TD | OP | EF | FR | Total
1 | 0 | 20 | 10 | 10 | 15 | 10 | 9.67
2 | 10 | 20 | 10 | 30 | 10 | 0 | 15.33
3 | 10 | 20 | 15 | 10 | 0 | 0 | 10.67
4 | 15 | 25 | 30 | 10 | 30 | 20 | 21.33
5 | 25 | 10 | 20 | 10 | 30 | 30 | 19.67
Overall Score | | | | | | | 15.33
Table 11. NASA-TLX weighting for the conducted tests.
Participant | Mental Demand | Physical Demand | Temporal Demand | Performance | Effort | Frustration Levels | Total
1 | 3 | 1 | 1 | 3 | 3 | 4 | 15
2 | 1 | 1 | 0 | 5 | 4 | 3 | 15
3 | 3 | 2 | 4 | 3 | 2 | 1 | 15
4 | 3 | 3 | 2 | 2 | 2 | 3 | 15
5 | 1 | 1 | 4 | 2 | 3 | 2 | 15
Final Weight | 3 | 1 | 3 | 5 | 1 | 2 | 15