Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications

Terras, Nuno; Pereira, Filipe; Ramos Silva, António; Santos, Adriano A.; Lopes, António Mendes; Silva, António Ferreira da; Cartal, Laurentiu Adrian; Apostolescu, Tudor Catalin; Badea, Florentina; Machado, José

doi:10.3390/app15031336

Open AccessArticle

Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications

by

Nuno Terras

¹,

Filipe Pereira

^1,2,3

,

António Ramos Silva

^1,2

,

Adriano A. Santos

^2,4

,

António Mendes Lopes

^1,2

,

António Ferreira da Silva

^2,4

,

Laurentiu Adrian Cartal

⁵

,

Tudor Catalin Apostolescu

⁶,

Florentina Badea

⁷

and

José Machado

^3,8,*

¹

Faculty of Engineering, University of Porto, 4200-465 Porto, Portugal

²

INEGI-Institute of Science and Innovation in Mechanical and Industrial Engineering, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal

³

MEtRICs Research Centre, School of Engineering, University of Minho, Campus of Azurém, 4800-058 Guimarães, Portugal

⁴

CIDEM-Department of Mechanical Engineering, School of Engineering, Polytechnic of Porto, Rua Dr. António Bernardino de Almeida 431, 4249-015 Porto, Portugal

⁵

Department of Mechatronics and Precision Mechanics, Faculty of Mechanical Engineering and Mechatronics, National University of Science and Technology POLITEHNICA Bucharest, 060042 Bucharest, Romania

⁶

Faculty of Informatics, Titu Maiorescu University, Calea Văcăreşti nr.189, Sector 4, 0400511 Bucharest, Romania

⁷

National Institute of Research and Development in Mechatronics and Measurement Technique, Șos. Pantelimon, Nr. 6-8, Sector 2, 021631 Bucharest, Romania

⁸

CESTER—Research Center for Industrial Robots Simulation and Testing, Technical University of Cluj-Napoca, 400114 Cluj, Romania

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(3), 1336; https://doi.org/10.3390/app15031336

Submission received: 26 December 2024 / Revised: 13 January 2025 / Accepted: 17 January 2025 / Published: 27 January 2025

(This article belongs to the Special Issue New Challenges in Conceptual Design of Robotic and Mechatronic Systems: 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

Collaborative robotics and computer vision systems are increasingly important in automating complex industrial tasks with greater safety and productivity. This work presents an integrated vision system powered by a trained neural network and coupled with a collaborative robot for real-time sorting and quality inspection in a food product conveyor process. Multiple object detection models were trained on custom datasets using advanced augmentation techniques to optimize performance. The proposed system achieved a detection and classification accuracy of 98%, successfully processing more than 600 items with high efficiency and low computational cost. Unlike conventional solutions that rely on ROS (Robot Operating System), this implementation used a Windows-based Python framework for greater accessibility and industrial compatibility. The results demonstrated the reliability and industrial applicability of the solution, offering a scalable and accurate methodology that can be adapted to various industrial applications.

Keywords:

collaborative robots; computer vision; YOLO; deep learning; real-time object detection; industrial automation

1. Introduction

Since the beginning of the Industrial Revolution, marked by steam engines and machine tools, the evolution toward more efficient industrial equipment with better interactions with the operator has exceeded all expectations. The introduction of computers and their high processing capacity brought the first generation of robots to the industrial sector. With the integration of programmable logic controllers (PLCs) and servo controllers, these robots became fully operational on production lines [1,2,3]. However, the arrival of perception systems powered by deep learning algorithms catalyzed the current industrial revolution: increasingly adaptable and environmentally conscious robots, improving the physical capabilities of operators, increasing productivity and safety, and reducing costs. On the other hand, the convergence of robotics and deep learning (DL), especially in the field of computer vision, has propelled these systems to the forefront of industrial transformation [4]. DL-oriented models allow robots—particularly collaborative robots—to analyze visual data with exceptional precision, enabling them to perform complex tasks such as sorting, inspection, and assembly with levels of precision, efficiency, and safety that far exceeds traditional methods.

The aim of this work is to design and develop an automated chocolate packaging system. This system will employ a collaborative robot (cobot) together with a vision system, integrating algorithms to detect, identify, and accurately handle chocolate bonbons. The first phase of this work involved research into the technical background and state of the art of relevant technologies and topics in collaborative robotics and deep learning. In order to achieve this objective, the following sub-tasks were identified:

System requirements—Define functional requirements, including chocolate types, packaging speed, weight, and accuracy. Specify hardware and software, including the cobot, camera, gripper, and vision system integration.
Vision system—Select and calibrate a high-precision camera. Develop and validate algorithms for chocolate localization and ensure accuracy and speed.
Cobot integration—Program the cobot for a pick and the available place points, optimize trajectories, and avoid collisions. Configure the gripper to handle chocolates gently and adjust movements based on real-time vision data.
Testing and validation—Validate system accuracy, reliability, and speed under varying conditions, such as different lighting and conveyor speeds.

The main objective of this work is to develop an efficient and adaptable vision system integrated into a collaborative robotic environment. The system aims to perform real-time object detection, classification, and quality inspection, catering to various production scales, from small businesses to large industries. It also addresses specific requirements such as force control for handling delicate items and the high-speed processing needed in production lines, particularly in the food industry. The key innovations and contributions of this work are as follows:

1.: The use of advanced deep learning models—The system employs the YOLOv10 model, known for its high accuracy and low latency, making it ideal for real-time industrial applications.
2.: Data augmentation techniques—Techniques such as CutMix and PixelDropout were used to enhance the model’s robustness, even with limited datasets, ensuring consistent performance under varying conditions.
3.: Optimized communication pipelines—The integration of RTDE (Real-Time Data Exchange) protocols ensures seamless synchronization between the robot, vision system, and conveyor belt, enabling smooth real-time operation.
4.: Adaptable to industry-specific needs—The system is designed to address specific industrial requirements by ensuring precise force control for the delicate handling of items like chocolates, preventing damage during processing. It is capable of operating at high speeds to align with the demands of fast-paced production lines, while its scalable design allows seamless application across both small-scale and high-volume industrial environments.
5.: Improved industrial usability—The system uses a Windows-based Python framework, eliminating the need for ROS or dedicated solutions and simplifying implementation and maintenance across diverse industries.
6.: Multi-class detection and sorting—The system can identify compliant, non-compliant, and unpackaged items, improving quality assurance and reducing waste.
7.: Precise object localization and handling—Methods were developed for the accurate detection and positioning of objects, ensuring successful pick-and-place tasks.
8.: Automation in dynamic environments—The system adapts to changes in factory layouts and conditions, maintaining high-quality standards and operational safety.
9.: Cost-effectiveness and accessibility—The system leverages low-cost cameras that deliver adequate precision for industrial tasks, making it highly accessible to Small and Medium-sized Enterprises (SMEs). Additionally, its easy-to-use APIs and low maintenance costs reduce downtime and simplify integration, ensuring a practical and affordable solution for diverse industrial applications.

One limitation of the proposed system is the potential degradation of model performance over time due to evolving production conditions and datasets. While this issue is not the primary focus of this study, it represents a significant challenge for the long-term deployment of deep learning models in industrial settings. Solutions such as periodic re-training, as demonstrated in the study presented by Dong Wang et al. [5], on the surface could be explored to mitigate this issue.

In resume, this work uses all of these technologies, such as computer vision, collaborative robots, RTDE (Real-Time Data Exchange), low-cost cameras combined with AI models, advanced lighting and image processing techniques, and a gripper with force control. This solution can be scalable to the food industry or other industrial applications that need precise, real-time applications and tasks that need force control and accuracy.

Below is a diagram, Figure 1, that presents the main stages of the development of the research described in this article.

This paper is organized into five sections. Section 2 reviews the current literature on relevant topics and compares them with previous studies. Section 3 details the architectural solution, including the layout, configuration requirements, and communication between equipment. It also explains the implementation of the cobot vision system, the creation of the image dataset, and the development, training, and selection of the most suitable model for the cell. Additionally, the logic behind the cell’s cycle is described, with a state machine illustrating the system’s operation. Section 4 presents the experimental results, focusing on performance analysis and system evaluation. Finally, Section 5 summarizes the main findings, offers conclusions, and proposes future work to enhance the proposed solution.

2. Literature Review

The integration of computer vision in collaborative robotics plays a pivotal role in achieving precise object detection, localization, and handling, which are critical for tasks such as “pick and place” and other industrial operations. This chapter reviews existing technologies for collaborative robots (cobots), focusing on how RTDE (Real-Time Data Exchange) ensures seamless coordination between the cobot, vision system, and conveyor belt, enabling the efficient handling of chocolates under varying production conditions. The subsequent sections will analyze eight of the most impactful works in this field, identifying their key contributions and highlighting gaps in the current state of the art. These insights serve as the basis for the proposed advancements. This chapter also examines how technologies such as low-cost cameras, advanced lighting and image processing techniques, and grippers with force control are combined with AI models to meet the demands of modern manufacturing. These include handling fragile products, optimizing production speeds, and maintaining high accuracy and safety standards. The proposed system addresses these challenges with innovative solutions, paving the way for further advancements in collaborative robotics [6,7,8,9,10,11]. The chapter frames the most recent research on collaborative robotics, vision systems, and deep learning in the context of safe human–robot interaction, a critical requirement for the next generation of industrial automation.

2.1. Integration of Cobots with Computer Vision and Object Detection

From previous works, Adriano A. Santos et al. (2024) presents a work that explores the integration of artificial vision and image processing systems into a collaborative robotic system for “pick-and-place” tasks. The proposed solution uses a UR3e collaborative robot (Producer: Universal Robots, City: Odense, Denmark) equipped with a low-cost Intel D435i camera (Producer: Intel Corporation, City: Santa Clara, CA, USA) to identify and manipulate randomly placed objects (Figure 2). The system incorporates machine vision algorithms to consider features such as color, orientation, and contour, as well as integrating an optimized light source to improve image quality and reduce shadows [12].

The system architecture is based on Neadvance Niop software (Verison 0.9.21), which uses a drag-and-drop approach to schedule workflows. The system includes modules for deletion, object detection, data communication, and manipulation operation. The experimental results demonstrate that using a controlled light source significantly improves the detection accuracy, and vision algorithms are able to identify and classify objects with high accuracy. Based on our research, this work presents the following limitations [12]:

Dependence on controlled lighting;
Tests restricted to laboratory environments;
Need for manual calibration;
Limited robot range;
Lack of deep learning;
Customization restriction due to the use of NIOP;
Errors with objects outside the camera’s center field.

2.2. Integration of Deep-Learning Techniques with Collaborative Robots

Ruohuai Sun et al. (2024) presents a work that developed a vision system integrated into a 6-degree-of-freedom KINOVA robotic arm (Producer: Kinova Robotics, City: Boisbriand, QC, Canada) for real-time grasping tasks. Using computer vision, the system combines object detection with pose estimation, allowing the robot to identify and manipulate objects efficiently. It is designed for applications where precision and efficiency are critical using an architecture based on advanced algorithms to perform tasks such as “pick and place” [13].

The system employs the RealSense D435i RGB-D camera (Producer: Intel Corporation, City: Santa Clara, CA, USA) for image capture, and the YOLOv3 and GG-CNN networks for detection and grasping, represented in Figure 3. Images are processed in real time, with support for segmentation algorithms for object recognition. Furthermore, the project uses a limited dataset (100 NEU-COCO images) for training, with controlled lighting to minimize errors. The integration between the robot and vision is performed synchronously, allowing efficient operation under controlled test conditions [13].

Based on our research, this work presents the following limitations [13]:

The robot used (KINOVA) is more common in laboratory applications and is not optimized for large-scale industrial environments, limiting its applicability as a cobot for industrial applications.
Although the system performs well in controlled scenarios, the combination of YOLOv3 and GG-CNN may not deliver the performance required for real-time industrial applications, particularly in production lines demanding greater speed.
The system was trained with a dataset of just 100 images (NEU-COCO), which restricts its generalization capacity and adaptability to more diverse objects and conditions.
Data synchronization and robot interaction are not designed to handle multiple dynamic factors such as lighting variations, object movement, and changes in workspace layout.
The work does not cover advanced strategies for dealing with lighting variations, such as shadows or diffuse lights, which can compromise accuracy in real-world scenarios.

2.3. Collaborative Robot with Reinforcement Learning for Pick-and-Place Applications

Natanael Magno Gomes et al. (2022) [14] proposed a system for pick-and-place tasks in collaborative robotics, integrating Deep Reinforcement Learning (DRL) with pre-trained convolutional neural networks (CNNs) for object detection and grip positioning. The system is composed of a UR3e cobot, a Robotiq 2F-85 gripper, and an Intel RealSense D435i camera, working together with the ROS framework (Figure 4). DRL allows the robot to learn to identify and manipulate objects in unknown positions without supervised training. The simulation environment in Webots was used to train the model, reducing the time required for learning before being applied to real hardware. CNN models such as MobileNet and DenseNet have been used for visual feature extraction.

Based on our research, this work presents the following limitations [14]:

The CNNs used depend on input images with specific dimensions and characteristics, which can limit the applicability of the system in dynamic industrial environments.
Although the UR3e is a cobot certified for safe interaction, the system does not include detailed force control, limiting delicate tasks such as food handling.
The article uses ROS for communication, but does not detail the integration of RTDE for real-time synchronization between components.
The RealSense camera is effective and affordable, but the DRL approach significantly increases training time, making the solution less practical for rapid iterations in industrial applications.
No specific image processing or advanced lighting techniques were described to deal with variable conditions such as shadows or sudden changes in lighting.
The gripper used does not feature advanced force control, which may compromise the safe handling of fragile or delicate objects.
The system is aimed at generic handling tasks and does not address specific applications in the food industry, such as hygiene requirements or handling delicate foods.

2.4. Identifed Gaps and Proposed Solutions

Table 1 presents the identified gaps in the proposed system and state-of-the-art systems. The authors’ solution proposed in this work introduces a series of innovations aimed at the limitations or gaps identified in the various systems addressed in the state of the art.

By analyzing the main gaps in the systems referred in this chapter, we can identify the following lack of solutions:

Dependence on controlled conditions;
○
Most systems rely on controlled lighting conditions, limiting their robustness in environments with variable light and shadow.
○
Testing is often confined to laboratory settings, restricting their applicability in real industrial scenarios.
Limited datasets;
○
The use of small datasets (e.g., 100 images) reduces the generalization capacity of models, making them less adaptable to diverse objects and scenarios.
Lack of integration and synchronization;
○
Integration between computer vision, collaborative robots, and conveyors via protocols like RTDE is not fully explored, hindering real-time synchronization.
Absence of additional sensors;
○
The systems lack additional sensors, such as force sensors, limits their precision in manipulation and environmental perception.
Lack of advanced strategies;
○
There are no advanced solutions to handle lighting variations, such as diffuse lighting or shadow adaptation, which are essential for real-world applications.
Limited force control;
○
Many systems lack detailed force control, restricting their application in delicate tasks, such as handling food or fragile products.
Focus on generic tasks;
○
Most systems target generic pick-and-place tasks, without addressing the specific needs of the food industry, such as hygiene standards or handling sensitive products.
Use of non-industrial robots;
○
Some systems utilize robots (e.g., KINOVA) more suited for laboratory research, lacking robustness or certifications required for large-scale industrial applications.
High training times.
○
Approaches like Deep Reinforcement Learning (DRL) involve significant training times, hindering rapid iterations for practical applications.

The method proposed in this work aims to fill these gaps, making it a scalable solution that can be applied to a wide range of industrial applications, whatever they may be. In the following chapter, the hardware and software used in this system are presented, as well as the results obtained, demonstrating the effectiveness, industrial applicability, and innovation of the system that is presented in this article.

3. Materials and Methods

This section gives an overview of the requirements needed to create the system. The solution is presented, including the relevant components. The detailed configurations of the cobot and conveyor belt are schematized. The chapter ends with graphics and illustrations of the connections and communication protocols in the developed system.

3.1. Deep Learning

Deep learning is a branch of Machine Learning (ML) that seeks to learn high-level abstractions from data [15]. It uses Artificial Neural Networks (ANNs), which are computer representations of human neurons. These networks are made up of several layers that work together to produce optimal results, enabling the extraction of characteristics from datasets and subsequent learning [16]. The scope of this field ranges from the automotive industry and autonomous driving [17] to all manufacturing tasks, such as assembly [18], automatic sorting [19,20], quality inspection [21,22], stock management in the supply chain [23], and object detection [10,22]. In addition, its impact can also be seen in the healthcare industry [24], the agricultural sector [25], the detection of monetary fraud [26], and cybersecurity [27,28].

Despite all these benefits, it should be emphasized that since DL is a branch of ML and ML, in turn, is a branch of AI, the evolution of deep learning is due not only to the computational capacities available and the inherent cost reduction [28] but also essentially to the evolution of AI algorithms that allow computers to learn and improve automatically with experience, without being explicitly programmed [29].

There are several algorithms that can be used in ML, including K-Nearest Neighbors, which is used for both classification and regression using metrics such as Euclidean distance, cosine, or Manhattan distance, or the Naive Bayes algorithm based on Bayes’ Theorem, among others. Variations such as Gaussian, Multinomial, and Bernoulli Naive Bayes are commonly used for different types of data distributions, however the Naive Bayes algorithm will not be the most suitable for cobots due to its non-linear behavior [30].

One-Stage Detectors

Since this project requires fast inference times and low computational power, single-phase detectors will be the main focus. In the following lines, we will discuss two models that fit this criterion, RetinaNet and YOLO.

RetinaNet

RetinaNet, introduced by Lin et al. (2017) [31], is a one-stage object detection model that addresses the issue of class imbalance during training. It combines high accuracy with efficient processing, making it suitable for real-time applications.

RetinaNet uses an FPN with ResNet as its backbone, enabling the extraction of features at various scales. The introduction of the FPN improves feature maps by combining low-resolution, semantically strong features with high-resolution, spatially fine features. It improves the extraction of features at various scales, from low-resolution and semantically strong features to high-resolution and spatially fine features [31]. It improves the model’s performance in detecting large and small objects. Two sub-networks are used in the detection head: one for classification and one for regression. Using focal loss, it alleviates the problem of imbalance between classes, reducing the loss of well-classified examples by concentrating training on more difficult and poorly classified examples. It improves the performance of the OD model in cases where foreground objects are scarce compared to the large number of background objects. Its performance on the COCO dataset, combined with its compact and computationally efficient design, makes this model a suitable candidate for this work.

YOLO Versions

YOLOv4 [32] improves accuracy and speed by introducing the CSPDarknet53 backbone, Mosaic Augmentation, and Self-Adversarial Training (SAT). These features enhance performance, especially for small objects, while keeping the model efficient. YOLOv5 [33] focuses on ease of use, being the first version implemented in PyTorch. It improves speed, supports larger image sizes, and introduces optimized loss functions and augmentation techniques, making it accessible for developers. YOLOv6 [34] targets industrial applications, optimizing performance for edge devices with the Efficient Rep Backbone (ERB) and neck. It offers better scalability for detecting objects at various sizes. YOLOv7 [35] balances speed and accuracy, introducing the Extended Efficient Layer Aggregation Network (E-ELAN) for better feature reuse. It surpasses previous versions in real-time object detection benchmarks, particularly for small and medium-sized objects. YOLOv8 [36] was developed by the same team that developed YOLOv5, featuring an optimized version of CSPDarknet53 as the backbone of the model. YOLOv8 introduces changes to the architecture, such as the “C2f” module, to deal with the detection of smaller objects, and a return to a model without anchors. It also introduces Complete Intersection over Union (CIoU) for bounding box loss and Decentralized Focal Loss (DFL) for classification and support for multiple backbone architectures, allowing flexibility of choice. Augmentation techniques such as CutMix and MixUp have been incorporated. YOLOv9 [37] builds on YOLOv8 with a focus on stability and scalability, making it suitable for complex scenarios that need high accuracy in object detection. It still uses Non-Maximum Suppression (NMS) for post-processing, which leads to slower inference times compared to YOLOv10 [38]. This version works best for tasks with many objects or smaller targets. YOLOv10, however, introduces a modular design with decoupled heads to improve detection at different scales and removes the need for NMS, cutting the post-processing time. It is the fastest version so far, with inference times as low as 49.1 ms, while maintaining accuracy [38]. YOLOv10 is ideal for real-time applications that need very fast processing, like videos or embedded systems. In short, YOLOv9 focuses on stability, while YOLOv10 is designed for speed and real-time use [39].

Figure 5 shows the complete architecture of YOLOv10, with the base layers presented on the left of the image, the neck in the lower middle position, and the different decoupled head blocks at the right end of the diagram. The top right corner contains the organization of the layers within the different layer modules, such as the C2f module, Spatial Pyramid Pooling Fast (SPPF), as well as the composition of the detection head.

Since then, other new versions have been released, YOLOv9, YOLOv10, and YOLO11 [40], introducing architectural innovations with an improved CSPDarknet53 backbone and new Programmable Gradient Information (PGI) and Generalized Efficient Layer Aggregation Network (G-ELAN) components in the neck and slight improvements in average precision (mAP) for the MS COCO dataset.

3.2. Performance Metrics Used

Evaluating the performance of object detection algorithms is essential to guarantee their accuracy and reliability in real-life scenarios. There are several metrics for evaluating the effectiveness of these algorithms, each offering a unique view of their performance and providing quantitative evaluations that help test and compare detection models. The following steps explore their importance and methodologies for evaluating the effectiveness of detection systems.

Intersection over Union (IoU) is a popular metric used to assess a model’s ability to accurately distinguish between the boundaries of the object and the background. This is achieved by comparing the area of the manually labelled box in the dataset with the area of the bounding box predicted by the model. The closer this comparison is to a value of one, the more accurate and precise the model’s performance. Figure 6 shows an example of this metric, with the true bounding boxes and the predicted bounding boxes. However, many other parameters must be taken into account, such as accuracy, precision, average precision, mean average precision, recall, F1-score, and inference time [33].

Another fundamental metric for assessing the correct procession of detected objects is confusion matrices. These statistically evaluate the results obtained during the classification phase. The simplest representation in ML is a binary classification, a 2 × 2 matrix, capable of correctly predicting a positive (true positive or TP), correctly predicting a negative (true negative or TN), incorrectly predicting a positive (false positive or FP), and incorrectly predicting a negative (false negative or FN). The matrix tells you which classes perform best and worst. Refining the model makes the matrix evolve into a diagonal matrix, since there are fewer wrong predictions, strongly densifying the main diagonal of the matrix [33].

Accuracy is the overall performance parameter of the model in one class. In a multi-class problem, this metric is the ratio between the number of correct predictions and the total number of predictions made.

A c c u r a c y = \frac{T P + T N}{T P + T N + F P + F N}

Precision is the fraction of true positives in relation to the total number of positives. In a multi-class problem, it can be understood as the fraction of the number of correct predictions of a given class in relation to the sum of this value and the number of instances in which the model wrongly labelled that class in an object belonging to a different class.

P r e c i s i o n = \frac{T P}{T P + F P}

Average precision is used to quantify the precision of the detections made, and a different value will be obtained for each available class. AP_c corresponds to the sum for each class of the precision values recorded in each instance. In resume, AP measures performance by class.
Mean average precision is the benchmark metric to analyze object detection algorithms. It comprises the average accuracy of all the classes and, in itself, gives a fairly accurate description of the capabilities of a given model. mAP measures the average performance while considering all classes.
Recall is the ratio between the number of correctly predicted positives detected and the sum of the number of correctly predicted positives and the number of incorrectly predicted negatives. In a multi-class problem, it can be understood as the ratio between the number of correct predictions for a given class and the sum of the previous number and the number of times the model failed to detect an instance of that class.

R e c a l l = \frac{T P}{T P + F N}

The F1-score is an evaluation of the precision and recovery of the model of a class, which provides a metric of its balance. In a multi-class problem, this graph will consist of N curves, where N is the number of classes in the dataset. A graph that shows the different classes with similar curves, all linked together, is the main goal.
Inference time is the time it takes a trained model to produce an output when given an unseen input. Reduced inference times are essential for real-time data processing, providing near-instantaneous answers efficiently.

3.3. Image Augmentation

Image augmentation is a technique used in the field of computer vision to improve the learning task. It uses existing images to artificially create newer, slightly altered images, which increases the set of labelled training data, improving performance metrics. By reducing overfitting, the model thus improves its metrics [41,42]. Augmentation can be divided into three groups: pixel-level augmentation (apply offset and gain to all the pixels), spatial augmentation (apply geometrical transformations), and structural augmentation (changes and combines images features) [43,44]. Figure 7, Figure 8 and Figure 9 show each of these augmentations.

3.4. Collaborative Robot

A collaborative robot, also called a cobot, is generally defined as a robot designed to interact directly with humans in a defined collaborative space [44]. The term, introduced by Edward et al. in 1999 [44], has evolved over time from a single-jointed robot into configurations that are considered intelligent and capable of cohabiting with the environment and operators [45]. So, cobots are imminently safer and designed to handle regular interactions. This capability makes them suitable for human–robot collaboration (HRC), which, combined with their programmability and flexibility, makes it easy to carry out new tasks without significant downtime.

The increase in productivity and flexibility that cobots bring compared to traditional industrial-based solutions is remarkable, allowing convergence towards Industry 5.0 and the customization of products in a less monotonous, more rewarding, and challenging process [46,47,48]. Its flexibility is due to its varied areas of application, ranging from the manufacturing and service industries to the medical care sector [49,50,51]. As far as industry is concerned, the predominant areas range from assembly, pick and place, welding and quality inspection [52]. However, the greatest benefit of cobots centers on the collaborative aspect, which involves a direct interaction/proximity between the worker and the robot as they work together on shared or collaborative tasks.

From an organizational perspective, the implementation of a cobot can be seen from financial, technical, organizational, and cultural points of view. However, the introduction of cobots leads to changes in workflows, raising social concerns and concerns about adaptability to the new process, which can result in delays in the transition to a shared space between man and machine [53].

3.4.1. Reference Frames

Reference frames can be understood as coordinate systems that allow the correct definition of a position and orientation in space. Defining the reference point in which the movement is to be made is of the utmost importance, since a movement made in relation to an unplanned reference point can result in situations in which safety can be seriously jeopardized. The most important references for a cobot are the base structure—with the origin of the cobots base, and from which all the other structures are related—and the Tool Center Point (TCP) structure—which details the position and orientation of the end effector—and are crucial for programming the cobots’ movements. Figure 10 shows the two reference bases and TCP.

It is therefore on the basis of these references and with precise movements that the cobots, depending on the angles of the joints, are able to reach the specific target points in their workspace. Inverse kinematics (IK) allows cobots to calculate specific joint configurations to obtain the position and orientation of the end effector in a Cartesian coordinate system. However, despite its importance in the cobots’ behavior, we will not go into the IK algorithm used by Universal Robots, as it is a closed and proprietary source. As a result, high-level movements will also be carried out even without specific knowledge of IK based on the instructions available to the cobot. The choice of each of these conditions controls the movement in the navigation space, the smoothness of the path, and the overall lifespan of the cobot.

3.4.2. Collaborative Robot UR3e

The UR3e is a six DoF articulated collaborative robot known for its compact size and versatile capabilities, making it an ideal robotic solution for various industries (see Figure 11). As the smallest model in Universal Robots’ e-series, the UR3e offers precision and flexibility in a limited workspace. With a maximum payload of 3 kg and a reach radius of 0.5 m, this model is excellent for tasks that require complex movements, allowing collaboration with operators if necessary. Its lightness, just over 11 kg, and built-in safety features guarantee almost perfect integration into any industrial environment. It is also equipped with force sensors and a learning console for quick and intuitive programming, making it extremely easy to set up and operate. From assembly, pick-and-place, and sorting tasks to more complex tasks such as polishing and quality inspection, the manufacturers ensure that the UR3e delivers consistent performance and reliability [54].

This e-Series model includes advanced force and torque sensors on all six joints. These allow the cobot to detect and react to unexpected human or environmental contact, increasing safety during HRC or simply allowing a higher standard of safety without requiring the space needed for fixed protection.

3.4.3. OnRobot Gripper

The gripper model used for implementation in this project was the OnRobot 2FG7 [55] parallel robotic gripper. This model stands out from other possible models for its ability to manipulate small objects and to be easily implemented in small spaces. The gripper has a minimum gripping time of 46 ms, which is sufficient to meet the demands of the real-time system. Designed with cobots in mind, it also includes a force sensor at the tip of the end action device, which further improves overall safety monitoring. The sensor enables remote grip monitoring by the URScript function.

3.4.4. Encoder

The Omron E6B2-CWZ6C [56] is an incremental rotary encoder designed for the precise measurement of a motor’s angular position, speed, and direction. With 360 pulses per revolution, a compact size of 40 mm in diameter, and a 2-meter cable, it adapts well to environments with space restrictions. This encoder outputs three signals: A, B, and Z. The A and B outputs are normally used together in quadrature to determine the direction of rotation and to count the pulses accurately. The Z output gives a single pulse per revolution, which serves as an index or reference position. This can be fed directly from the cobot’s internal I/O board.

3.4.5. Cell Layout

The conveyor belt (5), see Figure 11, has the ability to vary its speed to facilitate the work of the cobot (1) during the collection phase. The conveyor was controlled by varying a pulse width modulation (PWM) waveform, which, as the cobot controller does not directly support PWM outputs, was implemented with an Arduino Mega 2560 microprocessor (Producer: Arduino, City: Turin, Italy) that modulates the PWM signal and sends it to the motor controller. To activate the cobot’s conveyor belt tracking function, the encoder (6) was installed on the conveyor belt frame, followed by integration into the cobot’s internal I/O board. The encoder was connected to the cobot according to the manufacturer’s manual using the D7-D11 digital ports, although UR suggests using the general-purpose D1+ digital ports (Figure 12).

3.4.6. Cobot Calibration

The payload calibration process is quite simple in the PolyScope Graphical User Interface (GUI), taking less than a minute. With the integration of the camera and gripper, an additional calibration of the TCP is required. The specific calibration of the TCP was based on a set of points located at the vertices of the conveyor structure, adjusting part of the joint’s spatial position and therefore the overall configuration of the global joint in each iteration.

3.4.7. Cell Communication Pipelines

Integrating the different components of the sorting cell requires different approaches to establish appropriate communication pipelines. Figure 13 shows a simple schematic of the different communication protocols used.

By analyzing Figure 13, we can see that the communications system has been structured into three different modules: the cobot, the transporter, and the camera. The script and the cobot communicate via a UR Python library called RTDE, which uses TCP/IP to allow communications of up to 500 Hz between an external computer and the cobot controller, essential for a real-time solution. Data exchange between devices was carried out via a physical Ethernet connection, linking the host PC and the cobot’s internal I/O board in an RTDE environment. Other possibilities, such as USB or Wi-Fi, were considered, but our choice fell on the cable connection. The procedure for establishing RTDE communication between the cobot and the Python state machine can be easily understood by analyzing the following flowchart (Figure 14a).

The procedure begins by configuring the host IP, the desired communication port, and the path of the configuration file. This type of file is expected by this communication pipeline, as it defines all the appropriate initialization of double and Boolean variables. Once the connection between the cobot and Python has been established, the data between the two components will be synchronized. If this synchronization fails, the program exits immediately. If not, then the state, step, and watchdog are initialized and returned as accessible objects. The state can be accessed at any time via Python, allowing access to relevant information about the cobot, such as joint values, TCP pose, or speed. The step object refers to the structure that will contain the destination points, which are then sent to the cobot controller. The watchdog variable, in turn, refers to the monitoring function available in this communication environment and is responsible for synchronizing the state machine and the PolyScope program.

The D435i camera’s connection between the controls is established via a USB3.2 cable, which, although not particularly fast, is more than capable of exchanging data at a speed that does not compromise the cobot’s performance. The connection between the camera and the PC can be seen in Figure 14b. Setting up the data flow between the camera and the PC is relatively simple. After initializing the communication, the data streams, the desired frame resolution, and the preferred color and depth formats are defined. For the current application, 8-bit RGB is used for color frames, and 16-bit grayscale is used for depth frames. This algorithm finally returns the pipeline object needed to provide real-time frames to the OD model. Following the same approach as the camera, the transporter is also controlled via a USB2.0 connection between the host PC and an Arduino Mega 2560. As with the RTDE and camera pipeline, it is necessary to configure the serial communication pipelines between the PC and the Arduino. This procedure is shown in Figure 15.

Note that this is a remote-control solution, as it uses a direct Ethernet connection. This allows the cobot to operate in the ‘Local Area Network’ configuration, which speeds up configuration times. Other possible communication protocols, such as MODBUS or PROFINET, cannot be run in the ‘Local Area Network’, as this would require an extra layer of configuration.

3.5. Vision System

3.5.1. Intel RealSense D435i

The Intel RealSense D435i is a compact and lightweight 2.5D camera capable of capturing RGB and depth information, making it ideal for applications that require precise 3D localization with real-time capabilities. It has good spatial perception and fast calibration, which makes it particularly useful for mobile applications such as drones and robots. In our project, the camera was connected to the end-effector according to the configuration shown in Figure 16a.

3.5.2. Intrinsic Calibration

The D435i camera, like other equipment, should be calibrated, especially for depth perception. Several calibration tools are available for this purpose, such as “On-Chip Calibration”, “Tare Calibration”, and “Focal Length Calibration”. The first step consists of applying the RealSense Viewer SDK 2.0, a function integrated into the “On-Chip Calibration” tool, which aims to adjust the camera’s internal settings. The second calibration concerns a manual adjustment designed to improve accuracy throughout the camera’s field of view, the “Tare Calibration” tool, while the last calibration, “Focal Length Calibration”, aims to improve the camera’s precision. While the first step is an exclusively internal action, the second two steps concern the camera’s perception, and so the use of a calibration element is essential. Intel provided a textured A4 image with markers, created specifically for the calibration. However, other ways of calibrating the D435i camera can be used, such as approach more than four bullet points. Figure 16b shows the A4 calibration sheet.

3.5.3. Depth Stream and Coordinates

Based on the RGB images, depth tests were carried out using the SDK application. These tests revealed that calculating the distance to the candy using the depth flow was difficult. The small size of the candy (around 12 mm from base to top) made it difficult to distinguish exactly between the candy and the conveyor belt in terms of depth. Despite the relatively short distance between the camera and the conveyor belt—where reliable depth data would normally be expected—the depth stream failed to provide accurate z-axis information for the position of the candies. However, the sufficiently consistent shape of the candy, regardless of its orientation (normal or lateral), allows the z-axis value to be treated as constant. As a result, only the RGB stream is needed to determine the positions of the candy’s x and y axes, eliminating the need to retrieve and process the depth stream, without affecting performance.

As far as the coordinates are concerned, the similarity of the problems is identical, since the coordinate transformation processing is carried out on the frames acquired during the process. The coordinates are transformed from the original pixel values of each frame into three-dimensional coordinates to be sent to the cobot and correctly interpreted by it. To carry out this transformation, it is necessary to consider the direct relationship between the origin of the cobot’s base frame and the origin of the camera’s frame and the mapping of pixel values to the coordinates of the cobot’s base frame and the deviation produced by the movement of the conveyor belt while a frame is being processed by the OD model. The origin of the pixel-based coordinate system (0, 0) is represented on the bridge called the base reference (Figure 17), together with the representation of the camera frame and the cobot’s base frame.

The point of origin of the image frame corresponds to the following three-dimensional coordinates related to the cobot’s base frame:

(x_pixel,y_pixel) = (0, 0) ⇒ (x_base,y_base,z_base) = (−0.315,−0.155,0.051)[m]

(1)

The mapping of the pixel-level coordinates to the cobot coordinates was achieved by relating the distance travelled by each pixel, while taking into account the initial deviation. On the other hand, the camera is positioned on the carpet at a distance of 340 mm (Z-axis in blue), which corresponds to a field area of 280 × 210 mm, see Figure 15. In this way, the relationship between the pixel values and their respective dimensions on the conveyer can be obtained directly, since the x and y coordinates have the same ratio of 0.4375 mm/pixel. Other calculations must be taken into account, such as the processing time of the OD model used. This time must take into consideration the fact that the candy is collected in motion, and so the processing time will result in an unaccounted deviation between the initial position of the candy and the final position of the candy after the image has been processed. So, considering that the processing time is 200 ms, it is assumed that this deviation in the calculation of the final coordinate would lead to an error in the x-axis of 200 mm, which is enough for the cobot to miss its target.

It is clear that this deviation can influence the collection and placement action. However, other reasons can affect the operation of the system, including the processing time of the DO model. In a system where the computing capacity is limited, the processing time can be considered constant. Although this hypothesis can be accepted as true, it was found that the processing time of a frame increased slightly over the cycle of operations.

This increase in performance time can perhaps be explained by the thermal throttling effect, since an increase in temperature reduces processor performance. Controlling the computer’s temperature, being an intrinsic function of the computer, is performed by reducing the voltage and, consequently, the clock speed. Reducing the clock speed slows calculations and increases processing time.

The variation in processing time must therefore be properly compensated in each frame. To perform this, a timing function controls the time required for the OD model to process the last image, process_time, and multiplies it by the current speed of the conveyor, conv_speed, to obtain the distance travelled by the candy that must be added to the position obtained previously.

The cobot controller expects the reception pose to be defined in meters. For this reason, the final expression for the base frame coordinates of the x and y axes will be as follows:

\{\begin{matrix} x_{b a s e} = - 0.315 + 0.4375 \times x_{p i x e l} + p r o c e s s_{t i m e} \times c o n v_{s p e e d} \\ y_{b a s e} = - 0.155 - 0.4375 \times y_{p i x e l} \end{matrix},

(2)

3.6. Developed Cell

This section covers the implementation’s functionalities, from sorting and quality inspection to prioritization and route planning algorithms. An analysis is made of the cell’s control state machine, ending with an evaluation of the implementation’s performance.

The solution implemented is based on the principle that the candy is fed onto the conveyor belt in an orderly fashion. This sorting must be carried out one at a time in order to maintain the maximum speed of the conveyor, allowing the cobot to process the unprocessed candy.

3.6.1. Sorting

The sorting cycle was divided into phases, including the DO, movement to the future position of the object to be sorted, and movement to the end point of the object class to be sorted. The DO comprises the arrival of the last image captured by the D435i, the processing of the image by the model, and the post-processing of the results without neglecting the prioritization logic and the inspection of the object’s quality. During this phase, the cobot is in the “Home” position. In this position, the OD continuously performs frame inference until a detection is made. The cobot positions the camera parallel to and centered on the conveyor belt so that the mapping of the pixel value coordinates to the cobot three-dimensional coordinates is as accurate as possible. When an object is detected, the model begins by removing the non-conforming or unpackaged candies, if any, in anticipation of handling by the cobot. This removal frees the cobot to carry out the tasks of collecting the candies, depending on the number of validations greater than 0. The non-detection of compliant candies, with validation equal to 0, places the robot in the “Home” position. On the other hand, confirmed detection, with validation greater than 0, triggers a process of counting the compliant objects and taking appropriate measures, depending on their relative position in the transport. If more than one candy is detected or if they are at the limit of the range, the conveyor speed is reduced. This action ensures that the cobot has enough time to catch the candy before it leaves the working limit area. Otherwise, it can be accelerated, optimizing the system’s efficiency.

Candy collection is prioritized according to its threshold position. Using the Shortest Remaining Path (SRP) or First-In-First-Out (FIFO) algorithms, the pixel coordinates are transformed into 3D coordinates and sent to the cobot via the RTDE interface. The cobot handles all the candies, ensuring that none are lost.

3.6.2. Non-Compliant Candy

Processing the non-conforming class posed a number of challenges, not only because of its deformation but also because of the decomposition of the unwrapped candy. Decomposition causes the candy to stick to the grippers of the cobots and remain attached to the grippers due to the candy’s lightness. On the other hand, the unpackaged class revealed an additional problem, which, in some situations, leads to candy slipping, with enough friction for the candy to get between the tweezers, and ends up being lost during the removal process. To get around these limitations, it was decided not to process them. The movement of the conveyor belt was used to select the non-conforming products. This “natural” selection was made by eliminating this class from the list of items to be processed, concentrating the OD model only on the candies to be processed and mechanically rejecting the unprocessed ones at the end of the conveyor belt. The result was a shorter, smoother and faster cycle and a clean gripper every time.

3.6.3. Prioritization

As already mentioned, prioritization can be carried out by a FIFO or SRP process. While FIFO processes the sweets in the order in which they arrive at the detection zone according to the identification of the OD model, SRP sorts them based on the total cost of the distance associated with each sweet and its end point, i.e., the packing point.

In FIFO, the highest value on the x-axis was considered for processing. The candy with the highest coordinate value on the x-axis corresponds to the one with the least time available to be picked up, as it is closest to the end of the conveyor belt. On the other hand, in the SRP, its positioning is defined according to the distance from the “Home” position to the candy, as well as its distance from the end point of its class, considering the minimum total distance. The decision on which algorithm was most suitable for this project was made by analyzing the behaviors of these two approaches. The SRP concentrates priority on the candies closest to the containers, favoring the minimum time cost associated with selection. Then, if we consider that two candies are detected simultaneously, one at the beginning and the other near the end of the detection area, it is impossible to process them because one of them has left the detection area. The FIFO algorithm was therefore chosen because it processes the candies furthest from the containers first, reducing the risk of losing sweets near the end of the detection area. The choice of FIFO ensured that no candy was lost.

3.6.4. Cobot Trajectory

UR3e allows real-time control of joint positions via URSCript’s servoj function, making it possible to optimize the trajectories between each candy’s location and the respective end points. To plan the trajectory, the Minimum Jerk algorithm was implemented, which, as the name suggests, minimizes the rate of acceleration in each planned movement. The jerk, J, can be described by the following equation:

j = \int_{0}^{t_{f}} {‖\overset{⃛}{x}‖}^{2} d t

(3)

The equation represents the squared movement accumulated over the course of the movement, where t_f is the final time. This equation aims to minimize the value of t, smoothing the trajectory as much as possible. The position, velocity, and acceleration profiles of a minimum impact trajectory can be obtained using Equation (3). Deriving Equation (3) once with respect to time gives the acceleration values for each of the Cartesian axes:

\{\begin{matrix} a_{x} (t) = x_{0} + (x_{0} - x_{f}) (- 120 τ^{3} + 180 τ^{2} - 60 τ) \\ a_{y} (t) = y_{0} + (y_{0} - y_{f}) (- 120 τ^{3} + 180 τ^{2} - 60 τ) \\ a_{z} (t) = z_{0} + (z_{0} - z_{f}) (- 120 τ^{3} + 180 τ^{2} - 60 τ) \end{matrix}

(4)

where a_x(t), a_y(t) and a_z(t) are the accelerations. The variable τ is defined as the ratio between the current time t and the final time t_f, which explains why it varies from 0 to 1 over the duration of the movement.

It should also be added that, despite all possible control of the cobot’s movements, integrating the servoj movements with the conveyor belt tracking proved to be a difficult task. This was mainly due to the frequency restrictions of the conveyor belt control function. The Servoj requires a high frequency that the conveyor belt control cannot provide, which leads to communication failures.

3.6.5. State Machine

The programming solution developed for cobot sorting and quality inspection is divided into several scripts. The state machine that controls the entire cell is executed in a Python script run on a remote PC, containing all the different logical functions and calculations needed to perform the DO, object prioritization, and conveyor speed control. This state machine was implemented using Python’s match–case structure. The main logic behind it is illustrated in Figure 18.

3.7. Candy Dataset

To evaluate correctly and extract valuable information, any OD model requires a balanced, varied, and rich dataset, covering all the nuances that might be presented to it in a real situation. Since this work is based on detecting and collecting packaged candies with different colors and equal dimensions, we chose a manufacturer that met all these requirements. All the datasets are available online [58].

3.7.1. Dataset Construction

The candies are available in four colors, red, green, yellow and blue, i.e., four categories. In addition, two more categories have been added, unpacked and non-conforming, bringing the number of classes to six. The “unpackaged” class is designed to act on any unpackaged candy, i.e., a “naked” candy. In this case, when the system identifies them as such, it must take them back to the packaging line. On the other hand, any deformation detected, either in the shape or the packaging, should classify the bonbon as non-conforming, which implies removing it to a non-conforming area. For each class, as many images as possible were taken to characterize the different shapes of the chocolates so that the model could correctly generalize the limits and boundaries between correct and imperfect chocolates.

The datasets comprise 672 original photographs, all taken with the D435i camera, which was on top of the conveyor belt to obtain aerial images, similar to the frames taken in real time by the cobot in the “Home” position. The setup was assembled and the images captured near a large window to allow light variations. The frames were taken on different days, varying the time of day, which will have an impact on the brightness present at a given frame. In addition, the frames were captured at slightly different distances from the conveyor belt, between 25 and 40 cm, to help the model to generalize the task and ensure its efficiency at various distances between the cobot’s camera and the conveyor belt. The datasets were built with 640 × 480 p images. This choice of resolution took into account the necessary compromise between the quality, time, and computing power required to correctly process each image. Using a low resolution allows the PC to process the image more quickly and has made it possible to obtain excellent performance metrics. The results, without augmentations yet applied, presented in Figure 19, show the non-uniform distributions of instances per class, consisting of 556 instances of non-compliant examples, the most numerous, 315 instances of unpacked candies, and the four available colors represented 260 instances each.

The non-conforming class had higher values, since the randomly chosen deformity could appear exponentially more than those that could be trained. To improve these results, the number of unique and quality images was increased, which caused the mAP of the non-conforming class to converge to the values of the other classes. Figure 20 shows several examples of deformed and cut candy.

Other relevant considerations were taken into account, such as the appropriate variety of candy positions throughout the image area, shown in Figure 21a, and the distribution of instances per image. The histogram shown in Figure 21b reveals the candy instance distribution, including images without candies (0 instances) and a peak of 2–3 instances per image, spanning a range from 0 to 11 instances.

3.7.2. Dataset Annotation

The annotation process was carried out on each of the captured images, except for the images that did not show any object for detection (background images). These background images are crucial for the model to fully understand the context of this task, the conveyor belt, and the steel structure, and to take their characteristics into account when performing inference. A web application called Roboflow was used to annotate the labels in the dataset.

Roboflow is a web platform for building, annotating, and analyzing computer vision datasets. It presents itself as software that allows you to pre-process and augment the dataset developed with a variety of data export formats common to OD, such as PASCAL VOC, COCO, or even YOLOv8, which allows for rapid implementation in any PyTorch or Tensorflow environment. The data we present were created using Roboflow’s free plans according to the dataset creation practices described in [59].

3.7.3. Dataset Augmentation

To fully evaluate the best solution with regard to increasing the dataset, four different versions of the dataset were developed:

Dataset 1: Vanilla—No augmentation was carried out on this dataset. This dataset acted as a ‘control group’, allowing comparisons to be made in relation to augmentation. The original images were divided into training, test and validation sets, with 471, 100, and 101 images, respectively.
Dataset 2: Roboflow Augmentation—The augmentation techniques used consisted of horizontal and vertical flip, the insertion of blur up to 2 pixels, and noise up to 0.3% (in a 640 × 480 p image, this corresponds to 922 pixels affected), as well as brightness. The Roboflow GUI processes all the information and produces an augmented result in YOLO format. A total of 1413 augmented images were obtained for training, with 200 and 201 images for testing and validation, respectively.
Dataset 3: Albumentation Augmentation v1.0—The augmentation techniques chosen for this version were flip horizontal and vertical, randomization of the brightness, contrast, and gamma values, insertion of a Gaussian blur using a 3 × 3 kernel, and the rotate function, which randomly rotates the image by a maximum of 20°. The image’s RGB values were varied with the RGBShift function. The output was configured in YOLOv8 format, maintaining the number of images used in Dataset 2.
Dataset 4: Albumentation Augmentation v2.0 (5×)—The dataset was constructed in a similar way to Dataset 3. The total number of enlargements per image increased from 3 to 5. In this iteration, the flip horizontal and vertical transformations were used, with the randomization of the brightness, contrast, and gamma values slightly reduced compared to Dataset 3. The Gaussian blur function was replaced by the AdvancedBlur function and the MultiplicativeNoise function was implemented. The RGBShift and rotate functions were changed, introducing the PixelDropout function, which effectively reduced the pixel value to 0, a maximum of 0.1% of the image, which translates into 307 pixels. We aimed to analyze the impact of varying the ratio between augmented and original images. For this purpose, we increased the number of augmented images used for training to 2355, while allocating 200 and 201 images for testing and validation, respectively. This approach ensures a balanced dataset while evaluating the effectiveness of augmented data.

In the following Figure 22, a table is presented comparing different versions of the datasets and data augmentation techniques used in the study. Each dataset version is described with the corresponding number of training, validation, and testing images, as well as the applied augmentation techniques.

4. Experimental Results and Analysis

4.1. Object Detection Model Training

Given the widespread use of YOLO models in cutting-edge applications for real-time systems, the YOLO versions were considered in this study. Once the training methodology has been detailed, an exhaustive analysis and comparison of the models follows, evaluating them in terms of accuracy and speed of inference based on the frames excluded from the training datasets. By combining different versions of the model with the different possible training datasets and the various parameters, a large sample of examples was obtained from which conclusions can be extrapolated for implementation.

4.1.1. Training Environment Implementation Augmentation

In order to properly train the models used through a customized Python script, it was necessary to create an environment that used all the parallel computing available on an NVIDIA GPU. To perform this, CUDA software, a parallel computing platform and Application Programming Interface (API), was used, taking advantage of the possibility of processing work using the computer’s GPU instead of the CPU, which allows for an exponential reduction in processing times. The integration of Anaconda Navigator has made it possible to simplify the tasks of creating a fully functional Python environment, guaranteeing full compatibility with different packages and libraries, such as PyTorch, Tensorflow, and CUDA, which significantly speeds up the training of Neural Networks (NNs). On the other hand, you can easily access the integrated development environment (IDE), such as VSCode, saving you the trouble of ensuring the compatibility of each part.

4.1.2. YOLO and RT-DETR

In terms of YOLO versions, YOLOv8 was used first, followed by YOLOv9 and YOLOv10. For YOLOv8, three scales were selected: nano, small, and medium. The heavier scales, which have deeper and deeper layers, were excluded after the initial tests due to limitations of the remote PC and the training GPU, as these models were not able to carry out the inference in a timeframe suitable for the system’s real-time control requirements nor to compensate for the increase in training time with much higher performance.

For the YOLOv9, only version C was selected for training and testing. After the first analysis, it became clear that although this version offered some metric improvements over the YOLOv8 version, the increases in training time and inference time were too significant to justify more training iterations.

With regard to YOLOv10, four different scales were trained: nano, small, medium, and large. The decision to include the large scale in this version was based on its relative position on the hypothetical parameter/speed curve. Despite being a larger model, it remained within the acceptable training time, showing a lower total time than the medium version of YOLOv8, despite being slightly larger (92.0 GFLOPs versus 78.9 GFLOPs, respectively).

As far as RT-DETR is concerned, only two scales of this model with a ResNet101 backbone were available at the time of writing for Ultralytics customized training: large and extra-large. The densest scale available proved to be very time-consuming and required a lot of resources to train, and so only the smallest version available, l, was added to the training procedure.

With regard to the model parameters, several approaches were adopted to configure the parameters in order to obtain satisfactory levels in the metrics monitored. Firstly, key parameters such as the batch size and number of seasons were evaluated. Different numbers of batches were applied to various versions of YOLOv8 and YOLOv10 to compare their performance metrics. Table 2 shows the results obtained.

The results indicate that the default batch size of 16 produced the best performance in both models. This configuration outperformed the smaller and larger batch sizes. The batch size of eight produced the lowest mAP values in all cases. Although this preliminary test was too short to be fully reliable, it allowed us to conclude that the batch size of 16 strikes a satisfactory balance, improving the model’s performance compared to other batch size configurations. The RT-DETR versions were also trained with this batch size.

On the other hand, it is well known that not all OD models can be trained using YOLO and RT-DETR. In order to be able to train the Faster-RCNN and RetinaNet models and compare the different solutions to be implemented, an open-source framework called ‘Detectron2’ was used. This framework simplifies part of the process by making various OD models available. However, for the development of the vision system, the fastest R-CNN was chosen to serve as the control group. RetinaNet was also chosen because it performs satisfactorily in theory and is possibly more efficient from a computational point of view. Training with the Detectron2 framework took place on Google Collab. Figure 23 illustrates the graph containing the loss of the highest mAP RetinaNet model.

4.2. Model Tuning

4.2.1. Hyperparameter Tuning and Training Strategies

To ensure the best model performance, a detailed study of the hyperparameters was conducted, including the learning rate, batch size, number of epochs, and dynamic adjustment strategies. These configurations were optimized based on the validation metrics and the stability of the training process.

4.2.2. Learning Rate

The initial learning rate was set to 0.001. To avoid stagnation in local minima and accelerate convergence, a dynamic scheduler was applied, reducing the learning rate by 50% after 10 consecutive epochs without improvements in the validation loss. This approach ensured that the model dynamically adjusted its learning capacity as training progressed, as evidenced by the total loss curve (Figure 23). The choice of this strategy was based on widely used practices in computer vision, which are proven to improve training stability in deep neural networks.

4.2.3. Batch Size

As shown in Table 2, three different batch sizes (8, 16, and 32) were tested on two datasets: “Vanilla” and “Roboflow”. A batch size of 16 yielded the best results in terms of accuracy (mAP%) and computational efficiency, achieving an mAP of 0.92518 with the YOLOv8m model on the "Roboflow" dataset. Smaller batch sizes (eight) resulted in training instabilities due to increased gradient variance, while larger batch sizes (thirty-two) increased the training time without significant improvements in accuracy.

4.2.4. Optimization Strategy

The Adam optimizer was chosen for training, with parameters configured as beta1 = 0.9 and beta2 = 0.999, due to its ability to adjust gradients effectively in complex tasks. The weight decay was set to 0.0001 to minimize overfitting, while the internal momentum was kept at the optimizer’s default values.

4.2.5. Epoch Configuration

The maximum number of epochs was initially set to 200. However, to avoid computational waste, the Patience parameter was used, which stops training if no significant improvement is detected after 50 consecutive epochs. This strategy reduced the training time without compromising the result quality. At the end of training, the final configuration was saved in the best.pt file, while the last.pt file allowed training to resume if necessary.

4.2.6. Total Loss Curve

Figure 23 shows the evolution of the total loss during training. A consistent decrease in loss was observed until stabilization occurred after approximately 1500 iterations. This behavior reinforces the effectiveness of the hyperparameter configurations, as well as the stability of the model during training. No signs of overfitting were observed, as confirmed by both the validation and training loss curves.

4.3. Object Detection—Model Selection

On the basis of the different inference tests carried out (YOLOv8: n, s e m; YOLOv9c; YOLOv10: n, s, m, and b; RT-DETR-l; Faster-RCNN: ResNet101; and RetinaNet: ResNet50 and ResNet101), it was possible to define the most suitable model implementation lines for the project under study. Analyzing these metrics showed that the worst results were recorded for Faster R-CNN, followed by both versions of RetinaNet. On the other hand, all the other models generally showed values above the mAP50-95 threshold of 90 percent, with a particular emphasis on the scales of the YOLOv8 and YOLOv10 models [60,61,62,63,64].

The use of image magnification has a particular impact on the different models and scales in a specific way. Although it improved OD accuracy by 1.37% for the YOLOv10b case, it improved by more than 3% for YOLOv10n, which may suggest that the image magnification techniques applied may be more effective for smaller-scale YOLO versions. For other models, image enhancement improved by a maximum of 1.3% and 1.7% for the RT-DETR and RetinaNet R101 models, respectively, with the less dense RetinaNet model improving with increasing data. Another important observation can be drawn by comparing the results obtained with Roboflow with those obtained by training a larger dataset built with Albumentation. Although it used more transformations, the models trained with a larger dataset performed worse overall than the models trained with a smaller dataset. This led to the conclusion that there was no linear relationship between the performance of a specific model and the size of the dataset used to train it.

Regarding the variance of accuracy within the model scales, as expected, the models with the smallest scales showed the lowest performance values. However, the difference in performance between the smaller versions, such as the nano version of YOLO, and the larger versions, such as medium and large, was less significant than initially anticipated. For example, with the Roboflow dataset and the YOLOv10 model version, the variation between scales ranged from 0.92046% for v10n to 0.92434% for v10m, which corresponded to an overall improvement of around 0.5 %.

YOLOv8m achieved the highest accuracy of 0.92518%; in comparison, the best accuracy for version v10 was 0.92434% with YOLOv10m trained on the Roboflow dataset, and 0.91844% when using the Albu v1.0 dataset to train YOLOv9c. It was also found that only eight versions had a mAP higher than 92%: YOLOv8-s using the Roboflow dataset, YOLOv8-m on the Roboflow and Albu. v1.0 datasets; YOLOv10n trained on the Roboflow dataset; and YOLOv10m and YOLOv10b trained on all datasets except Vanilla. Although YOLOv8m performed best in terms of accuracy, the final decision must also take in consideration the inference speed values for each version to obtain a complete understanding of the necessary trade-offs between accuracy and speed.

In Table 3, the model performance across all custom-trained models is presented.

The following figure illustrates the mean average precision (mAP) performance (50–95) for various model architectures, including YOLOv8, YOLOv9, YOLOv10, RT-DETR, Faster-RCNN, and RetinaNet, across four datasets: Vanilla, Roboflow, Albu v1.0, and Albu v2.0.

The following points summarize key observations regarding model performance and the dataset’s impact on precision metrics:

RetinaNet and Faster R-CNN—demonstrated lower accuracy with a 5% decrease in performance;
Augmentation impact—significantly improved metrics, especially in less dense scales;
Best dataset for training—the “Roboflow” dataset yielded the best results;
Top-performing model—YOLOv8-m trained on the “Roboflow” dataset achieved the highest metrics;
YOLO versions—outperformed other models, demonstrating the best overall precision.

4.4. Speed Analysis

The speed results were obtained by performing the inference on twenty individual images and monitoring the results of the last five. The images selected were the same for each version in order to minimize any variation that could lead to incorrect conclusions. In each test, the total processing time was recorded, made up of three different components: pre-processing, inference, and post-processing.

As expected, Faster R-CNN had by far the longest processing time, over 0.5 s, reflecting the idea mentioned earlier that single-phase detectors could not compete with the speed offered by single-phase models. With regard to the single-phase OD models, the various scales available in YOLOv10 proved to be significantly faster than the other models trained. The densest version of YOLOv10, with a processing time of 135.5 ms, was still inferior to any other version or scale under study, with the nano version of YOLOv8 coming closest. The fastest value, just 49.1 ms, was recorded by YOLOv10n, which represented a decrease of more than 90% in total processing time compared to the highest time recorded. The RT-DETR personalized model, the YOLOv8 small and medium scales, and the RetinaNet models performed similarly in terms of processing time, with the former being slightly faster, with an average total processing time of 173.8 ms, and a percentage reduction for Faster-RCNN of 66.7%. This substantial improvement in the YOLOv10 processing time when compared to other OD models and previous versions of YOLO was mainly due to a much faster post-processing phase, which was evident when comparing the post-processing values of models prior to YOLOv10 and those using YOLOv10. When moving from YOLOv8m to YOLOv10m, there was a total reduction in processing time of 44.0%, from 202.95 ms to 113.7 ms, which was largely due to the reduction in post-processing time from 83.6 ms to 20.2 ms, i.e., a reduction of 75.8%.

On the other hand, when comparing the smallest neural networks available in each version, YOLOv8n and YOLOv10n, there was a total reduction of 31.8% in processing time, from 179.9 ms to 122.6 ms. The same was true when comparing models with a similar computational load, such as RT-DETR-l and YOLOv10m. The decrease was from 173.8 ms to 113.7 ms, which represented an improvement of 60.1%. Similarly, when comparing YOLOv10b with the customized YOLOv9c model, the total processing time decreased from 222.4 ms to 135.5 ms, which represented a reduction of approximately 60%. As far as inference times were concerned, the biggest discrepancy was seen in the YOLOv9c results, which showed an average inference time of 140.9 ms, almost 5% more than the trained YOLOv10 model, which is more computationally expensive but less dense.

Finally, it can be seen that pre-processing techniques do not contribute significantly to the total time spent on inference, as they usually do not even account for 1% of the total processing time.

In Table 4, the model inference times, including pre-processing, inference, post-processing, and total processing time (in milliseconds), are presented for various model architectures, including YOLOv8, YOLOv9, YOLOv10, RT-DETR, Faster-RCNN, and RetinaNet. Additionally, the percentage difference per minute (∆%/Min) is provided, highlighting the efficiency of each model in real-time applications. The Δ/Min (%) column represents the percentage reduction in total processing time per minute compared to the reference model (Faster R-CNN). This metric highlights the relative speed improvement of each model, where higher values indicate faster performance. The following observations highlight the processing time efficiencies and relative performance of different object detection models, focusing on inference speed and reduction percentages:

Faster R-CNN—significantly slower compared to other models;
YOLOv10—achieved the fastest inference times among all models, primarily due to the removal of Non-Maximum Suppression (NMS);
Top-performing model in terms of speed—YOLOv10-n, with a 90.6% reduction in processing time;
Consistency of YOLOv10—demonstrated superior speed across all scales;
Similar reductions—YOLOv8, RT-DETR, and RetinaNet showed comparable reductions in processing time (60–70%) relative to Faster R-CNN.

Table 4. Inference performance of all custom trained models.

Model	Pre (ms)	Inference (ms)	Post (ms)	Total (ms)	Δ/Min (%)
YOLOv8 (v8-n)	1.0	74.0	74.4	149.4	71.3
YOLOv8 (v8-s)	1.0	93.7	78.3	173.0	66.8
YOLOv8 (v8-m)	1.25	118.1	83.6	202.95	61.1
YOLOv9 (v9-c)	1.4	140.9	80.1	222.4	57.3
YOLOv10 (v10-n)	1.0	30.9	17.2	49.1	90.6
YOLOv10 (v10-s)	1.0	58.4	18.1	85.1	85.1
YOLOv10 (v10-m)	1.2	92.8	20.2	113.7	78.2
YOLOv10 (v10-b)	1.0	115.1	19.4	135.5	74.0
RT-DETR	-	-	-	173.8	66.7
Faster-RCNN	-	-	-	521.2	0.0
RetinaNet (ResNet50)	-	-	-	175.2	66.4
RetinaNet (ResNet101)	-	-	-	192.6	63.0

4.5. Selection of the Model

From the results analysis, we can infer that YOLOv10 showed a substantial improvement in post-processing speed compared to previous versions, which is significant for minimizing any potential delays. RetinaNet, YOLOv8, and RT-DETR-l also recorded processing times of less than 200 ms. Between the possible model versions and scales, the compromise between precision and speed becomes a fundamental requirement for the success of the project. In this case, image processing time is a crucial aspect of the project, since the cobot has speed and acceleration limits. Therefore, considering the evaluations carried out on the eight models, with performance exceeding 92%, it was decided to use the YOLO 10n version. The performance metrics of the YOLOv10n model, trained with a Roboflow dataset, are shown in Figure 24.

From the analysis of the metrics shown in Figure 24, it can be concluded that the results were as expected. An initial exponential phase was followed by a longer and more stable phase, with parameters such as Weight Decay and Momentum effectively reducing the impact of each epoch on the overall configuration of the model and confirming the stability of the Patience parameter, since there was no significant increase in the values of the loss function. On the other hand, the normalized confusion matrix (Figure 25) corroborates the idea that the model is competent at identifying the different classes available, even in non-conformity. Although the normalized confusion matrix is satisfactory, it is possible to see some lighter squares around the non-compliant class, indicating the occurrence of errors. This was to be expected, since the model’s main source of error is the boundary between what is a non-compliant example and a compliant example, albeit an imperfect one.

The F1-score reinforces the idea that the selected model is a highly capable solution, showing a maximum F1-score of 0.98 at a threshold of 0.635, while also showing an F1-score of over 0.9 when close to the 0.95 mark, which illustrates the great balance between precision and recovery achieved by this OD model (Figure 26). Furthermore, the balance achieved by this model also corroborates the idea that the datasets constructed, although subject to expansion, correctly detail the possible candy configurations that could be presented in real time, allowing the model to internalize the task with a high degree of generalization.

4.6. Performance Evaluation

The performance of the implemented robotic cell was evaluated by carrying out an operational test with 100 candies: 20 of each color, including 10 examples of non-compliant candy with various degrees of deformation and 10 examples of unwrapped candy. These were placed on the conveyor belt randomly along the conveyor belt’s y-axis to ensure the cell’s ability to successfully classify the candies, regardless of their initial positions.

The results of this experiment are shown in Table 5. In the table, the “Success” column indicates the number of instances of each class that were correctly detected and treated by the cobot. The “Selection error” and “Detection error” columns represent the number of cases in which selection errors occurred, either due to incorrect positioning of the cobot arm or an error in detecting the correct class of sweets, respectively.

Overall, the test showed satisfactory results. The green class presented an error considered to be a false “green” detection caused by a shadow or poor lighting. With regard to the non-compliant class, the error was a false detection for the non-compliant class, since the less damaged candies are difficult to detect. On the other hand, the simpler instances, those of large size or with color variation, were all correctly detected by the YOLO model.

The time cost analysis for the robotic movements is summarized in Table 6 below, detailing the average distance covered and the corresponding time spent for each task. The data include movements such as transitioning from “Home” to the candy (horizontal), vertical approach to the candy, transportation to the endpoint, and the return to the “Home” position. Additional time costs for object detection (OD) and gripping are also accounted for, resulting in a total average time of 5.7 s to complete the task, covering an average distance of 91.5 cm.

Horizontal movement and transportation: Most of the time is spent on horizontal movements (“Home” to candy, candy to endpoint, and endpoint to “Home”), totaling 5.1 s out of the 5.7 s overall. This indicates that linear movements represent the main time consumption in the task.
Vertical movement and auxiliary operations: Vertical movements (candy approach) and auxiliary operations (object detection and gripping) are fast, contributing only 0.6 s to the total time. This suggests that these aspects have minimal impacts on overall performance.
Overall task efficiency: With a total time of 5.7 s and an average distance of 91.5 cm, the task can be considered efficient. However, improvements could be explored in horizontal movements to further optimize performance.
Limited impacts of OD and gripping: The time costs for object detection (OD) and gripping are negligible (0.2 s in total), indicating that these processes are already well optimized.

5. Conclusions and Future Work

This study successfully developed and tested an automated chocolate packaging system that integrates a collaborative robot (UR3e) with an advanced vision system. The results demonstrate the viability and scalability of the proposed solution for industrial applications requiring real-time precision and flexibility. The main contributions of this work are as follows:

Efficiency and accuracy in real-Time applications—The YOLOv10-nano model was employed to handle object detection tasks with high accuracy and a low inference time. The system achieved a classification accuracy of 98% and a picking accuracy of 100% during real-world testing, processing over 600 candies per hour. These results confirm the reliability and effectiveness of the solution in dynamic industrial environments.
Optimized model performance—By testing various object detection model scales and dataset configurations, the study identified a balance between accuracy and inference time, which is essential for real-time tasks. Advanced data augmentation techniques, such as CutMix and PixelDropout, enhanced the robustness of the model, ensuring consistent performance under variable conditions.
Insights into the newest deep-learning models—Practical testing of newer object detection models, including YOLOv9 and YOLOv10, provided valuable insights into their capabilities, particularly for detecting small objects. This contribution fills a gap in the current literature and establishes a benchmark for future research.
Simplified and accessible framework—The implementation of this system using a Python-based framework on Windows, instead of ROS, reduced technological barriers and made the solution more accessible for diverse industries, including Small and Medium-sized Enterprises (SMEs). The modular design simplifies integration, maintenance, and scalability.
Adaptability to industrial applications—The system’s design includes precise force control for handling delicate items like chocolates without causing damage. It operates efficiently in high-speed production lines while adapting to dynamic factory layouts and varying lighting conditions.
Cost-effectiveness and accessibility—The use of low-cost cameras with adequate precision, combined with advanced vision algorithms, ensures that the system is affordable and accessible to SMEs. Easy-to-use APIs and low maintenance requirements further enhance its practicality.
Multi-class detection and sorting—The system’s ability to detect compliant, non-compliant, and unpackaged items improves quality assurance and reduces waste, offering significant value to industries focused on production efficiency.
Dynamic environmental adaptation—The solution adapts to changes in factory layouts and operating conditions, maintaining high-quality standards and safety during operations.

Although the UR3e collaborative robot does not match the speed of Cartesian or Delta robots, it compensates with added safety features, flexibility, and a smaller workspace footprint. This makes it particularly suitable for environments where collaboration and versatility are prioritized. The successful processing of over 600 candies in testing underscores its applicability in tasks requiring both precision and adaptability.

Also, the YOLOv10-nano model showed its ability to handle OD tasks effectively, particularly in scenarios involving small objects. Its efficiency and reduced inference time make this version highly practical for real-time industrial applications, especially when computational resources are limited. The results confirmed that YOLO models perform well even when working with datasets containing small objects.

Although the UR3e cobot cannot match the processing speed of Cartesian or Delta robots, it makes up for this with added safety features, flexibility, and a smaller workspace. Despite not being designed for rapid sorting tasks, the UR3e proved its value in such applications, as demonstrated by its success in processing over 600 candies during tests.

By implementing the system using Python instead of ROS, we lowered technological barriers and provided a more accessible solution for deployment. The system processed over 600 candies per hour, achieving a classification accuracy of 98% and a picking accuracy of 100% in real-world tests. This demonstrates the system’s reliability and effectiveness in industrial automation.

While using Windows simplifies deployment and expands access to industrial solutions, it has limitations in real-time applications due to its lack of determinism. The following are some potential improvements that could be incorporated to overcome these limitations:

Explore a integration of a real-time operating system (RTOS) for critical tasks while maintaining Windows for non-critical operations to balance accessibility and precision;
Adopt advanced protocols like DDS or ZeroMQ to ensure reliable and low-latency data transmission;
Leverage FPGA or GPU-based processing to offload computationally intensive tasks and improve system responsiveness;
Fine-tune the Windows scheduler to prioritize critical tasks, mitigating the impact of non-deterministic process handling.

Although the implementation used the YOLOv10 model as its base, the study explored potential improvements to enhance the algorithm performance for specific applications. Future work for an improved algorithm may include the following:

Replacing Non-Maximum Suppression (NMS) with Soft-NMS for better handling of overlapping objects;
Incorporating attention mechanisms, such as CBAM, to improve feature selection and focus on relevant regions;
Implementing real-time online data augmentation pipelines to adapt dynamically to environmental variations;
Exploring lightweight backbones, such as MobileNet, to reduce the inference time while maintaining high accuracy;
Enhancing small object detection using Spatial Pyramid Pooling (SPP) or Path Aggregation Networks (PANet).

Future work will explore adapting the proposed system to parallel robots, such as Delta or SCARA robots, to address high-speed industrial scenarios. This transition involves the following:

Integrating high-speed robots with real-time communication protocols, such as Ethernet/IP or Profinet, to ensure seamless synchronization;
Deploying high-frame-rate cameras (>120 fps) and optimizing detection pipelines with hardware acceleration (e.g., TensorRT or FPGA) to minimize latency;
Developing predictive trajectory models to preemptively calculate object positions and enable precise pick-and-place tasks;
Enhancing the detection algorithm to handle multiple objects within a single frame, reducing cycle times and improving throughput;
Incorporating advanced lighting systems and data augmentation techniques to ensure robustness under high-speed conditions;
This adaptation will extend the scalability and applicability of the system to meet the demands of fast-paced production lines, such as those in the packaging, food, and pharmaceutical industries.

Another future direction could involve utilizing the camera’s inertial measurement unit (IMU) alongside other sensors to improve frame acquisition, enabling the system to capture multiple perspectives of the conveyor belt. Updating the detection algorithm to handle multiple objects within a single frame could further reduce the cycle time. Lastly, adding a “Hand” class to the dataset could expand the system’s capabilities to include tasks such as gesture recognition and control.

Future work also could explore implementing periodic re-training or incremental learning mechanisms to address model degradation over time. This approach provides a foundation for investigating these strategies in the context of collaborative robotic systems.

In conclusion, this study successfully demonstrated the feasibility of integrating collaborative robotics and advanced vision systems to address real-time industrial challenges. The findings and methodologies presented serve as a foundation for future research and practical implementations across a wide range of applications.

Author Contributions

Conceptualization, N.T., F.P., A.A.S., A.M.L. and A.R.S.; validation, N.T., F.P. and A.R.S.; investigation, N.T.; writing—original draft preparation, A.A.S. and F.P; writing—review and editing, N.T., F.P., A.R.S., A.F.d.S., A.M.L., L.A.C., T.C.A., F.B. and J.M.; visualization, N.T., F.P., A.R.S., A.F.d.S., L.A.C., T.C.A., F.B. and J.M.; supervision, F.P., A.R.S., A.F.d.S., A.M.L., L.A.C., T.C.A., F.B. and J.M.; project administration, N.T., F.P., A.R.S., A.F.d.S., A.M.L., L.A.C., T.C.A., F.B. and J.M.; funding acquisition, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the project New frontiers in adaptive modular robotics for patient-centered medical rehabilitation–ASKLEPIOS, funded by the European Union–NextGenerationEU and Romanian Government under the National Recovery and Resilience Plan for Romania, contract no. 760071/23.05.2023, code CF 121/15.11.2022, with the Romanian Ministry of Research, Innovation and Digitalization, within Component 9, investment I8.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request due to restrictions (e.g., privacy, legal or ethical reasons).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Santos, A.A.; Pereira, F.; Felgueiras, C. Optimization and improving of the production capacity of a flexible tyre painting cell. Int. J. Adv. Manuf. Technol. 2024. [Google Scholar] [CrossRef]
Santos, A.A.; Haladus, J.; Pereira, F.; Felgueiras, C.; Fazenda, R. Simulation Case Study for Improving Painting Tires Process Using the Fanuc Roboguide Software. In Proceedings of the Flexible Automation and Intelligent Manufacturing: Establishing Bridges for More Sustainable Manufacturing Systems, Porto, Portugal, 18–22 June 2023; Springer: Cham, Switzerland, 2023; pp. 517–524. [Google Scholar]
Pereira, F.; Magalhães, L.; Santos, A.A.; da Silva, A.F.; Antosz, K.; Machado, J. Development of an Automated Wooden Handle Packaging System with Integrated Counting Technology. Machines 2024, 12, 122. [Google Scholar] [CrossRef]
Mendez, E.; Ochoa, O.; Olivera-Guzman, D.; Soto-Herrera, V.H.; Luna-Sánchez, J.A.; Lucas-Dophe, C.; Lugo-del-Real, E.; Ayala-Garcia, I.N.; Alvarado Perez, M.; González, A. Integration of Deep Learning and Collaborative Robot for Assembly Tasks. Appl. Sci. 2024, 14, 839. [Google Scholar] [CrossRef]
Wang, D.; Han, C.; Wang, L.; Li, X.; Cai, E.; Zhang, P. Surface roughness prediction of large shaft grinding via attentional CNN-LSTM fusing multiple process signals. Int. J. Adv. Manuf. Technol. 2023, 126, 4925–4936. [Google Scholar] [CrossRef]
Borboni, A.; Reddy, K.V.V.; Elamvazuthi, I.; AL-Quraishi, M.S.; Natarajan, E.; Azhar Ali, S.S. The Expanding Role of Artificial Intelligence in Collaborative Robots for Industrial Applications: A Systematic Review of Recent Works. Machines 2023, 11, 111. [Google Scholar] [CrossRef]
Chen, Q.; Wan, L.; Pan, Y.-J. Object Recognition and Localization for Pick-and-Place Task using Difference-based Dynamic Movement Primitives. IFAC-Pap 2023, 56, 10004–10009. [Google Scholar]
Kumar, A.A.; Zaman, U.K.U.; Plapper, P. Collaborative Robots. In Handbook of Manufacturing Systems and Design, 1st ed.; Taylor & Francis: London, UK, 2023; pp. 89–106. [Google Scholar]
Colgate, J.E.; Wannasuphoprasit, W.; Peshkin, M.A. Cobots: Robots for Collaboration With Human Operators. In Proceedings of the ASME 1996 International Mechanical Engineering Congress and Exposition. Dynamic Systems and Control, Atlanta, GA, USA, 17–22 November 1996; pp. 433–439. [Google Scholar]
Magalhaes, P.; Ferreira, N. Inspection Application in an Industrial Environment with Collaborative Robots. Automation 2022, 3, 258–268. [Google Scholar] [CrossRef]
Jennes, P.; Minin, A.D. Cobots in SMEs: Implementation Processes, Challenges, and Success Factors. In Proceedings of the 2023 IEEE International Conference on Technology and Entrepreneurship (ICTE), Kaunas, Lithuania, 9–11 October 2023; pp. 80–85. [Google Scholar]
Santos, A.A.; Schreurs, C.; da Silva, A.F.; Pereira, F.; Felgueiras, C.; Lopes, A.M.; Machado, J. Integration of Artificial Vision and Image Processing into a Pick and Place Collaborative Robotic System. Intell Robot. Syst 2024, 110, 159. [Google Scholar] [CrossRef]
Sun, R.; Wu, C.; Zhao, X.; Zhao, B.; Jiang, Y. Object Recognition and Grasping for Collaborative Robots Based on Vision. Sensors 2024, 24, 195. [Google Scholar] [CrossRef]
Gomes, N.M.; Martins, F.N.; Lima, J.; Wörtche, H. Reinforcement Learning for Collaborative Robots Pick-and-Place Applications: A Case Study. Automation 2022, 3, 223–241. [Google Scholar] [CrossRef]
Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep learning for visual understanding: A review. Neurocomputing 2016, 187, 27–48. [Google Scholar] [CrossRef]
Taye, M.M. Theoretical Understanding of Convolutional Neural Network: Concepts, Architectures, Applications, Future Directions. Computation 2023, 11, 52. [Google Scholar] [CrossRef]
Phan, P.H.; Nguyen, A.Q.; Quach, L.-D.; Tran, H.N. Robust Autonomous Driving Control using Auto-Encoder and End-to-End Deep Learning under Rainy Conditions. In Proceedings of the 2023 8th International Conference on Intel-ligent Information Technology, ICIIT ’23, Da Nang, Vietnam, 24–26 February 2023; pp. 271–278. [Google Scholar]
Zamora-Hernández, M.-A.; Castro-Vargas, J.A.; Azorin-Lopez, J.; Garcia-Rodriguez, J. Deep learning-based visual control assistant for assembly in Industry 4.0. Comput. Ind. 2021, 131, 104385. [Google Scholar] [CrossRef]
Kuswantori, A.; Suesut, T.; Tangsrirat, W.; Schleining, G.; Nunak, N. Fish Detection and Classification for Automatic Sorting System with an Optimized YOLO Algorithm. Appl. Sci. 2023, 13, 3812. [Google Scholar] [CrossRef]
Trujillo, J.L.A.; Zarta, J.B.; Serrezuela, R.R. Embedded System Generating Trajectories of a Robot Manipulator of Five Degrees of Freedom (D.O.F). KnE Eng. 2018, 3, 512–522. [Google Scholar] [CrossRef]
Silveira, M.; Sun, R.; Santos, A.; Pereira, F.; da Silva, A.F.; Felgueiras, C.; Ramos, A.; Machado, J. 3D Vision Object Identification Using YOLOv8. Int. J. Mechatron. Appl. Mech. 2024, 17, 7–15. [Google Scholar]
Chen, Z.; Zou, H.; Wang, Y.; Liang, B.; Liao, Y. A vision-based robotic grasping system using deep learning for garbage sorting. In Proceedings of the 2017 36th Chinese Control Conference (CCC), IEEE, Dalian, China, 26–28 July 2017; pp. 11223–11226. [Google Scholar]
Alberto, M.; Filipe, P.; Adriano, S.; José, M.; Miguel, O. Modelling and Simulation of a Pick&Place System using Modelica Modelling Language and an Inverse Kinematics. Int. J. Mechatron. Appl. Mech. 2024, 16, 7–17. [Google Scholar]
Yun, J.P.; Shin, W.C.; Koo, G.; Kim, M.S.; Lee, C.; Lee, S.J. Automated defect inspection system for metal surfaces based on deep learning and data augmentation. J. Manuf. Syst. 2020, 55, 317–324. [Google Scholar] [CrossRef]
Xiao, Y.; Wu, J.; Lin, Z.; Zhao, X. A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. 2018, 153, 1–9. [Google Scholar] [CrossRef]
Attri, I.; Awasthi, L.K.; Sharma, T.P.; Rathee, P. A review of deep learning techniques used in agriculture. Ecol. Inform. 2023, 77, 102217. [Google Scholar] [CrossRef]
Alarfaj, F.K.; Malik, I.; Khan, H.U.; Almusallam, N. Muhammad Ramzan, and Muzamil Ahmed. Credit Card Fraud Detection Using State-of-the-Art Machine Learning and Deep Learning Algorithms. IEEE Access 2022, 10, 39700–39715. [Google Scholar] [CrossRef]
Aldhyani, T.H.H.; Alkahtani, H. Cyber Security for Detecting Distributed Denial of Service Attacks in Agriculture 4.0: Deep Learning Model. Mathematics 2023, 11, 233. [Google Scholar] [CrossRef]
Liu, H.; Zhou, L.; Zhao, J.; Wang, F.; Yang, J.; Liang, K.; Li, Z. Deep-Learning-Based Accurate Identification of Warehouse Goods for Robot Picking Operations. Sustainability 2022, 14, 7781. [Google Scholar] [CrossRef]
Pajila, P.J.B.; Sheena, B.G.; Gayathri, A.; Aswini, J.; Nalini, M.; Siva, S.R. A Comprehensive Survey on Naive Bayes Algorithm: Advantages, Limitations and Applications. In Proceedings of the 2023 4th International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 20–22 September 2023; pp. 1228–1234. [Google Scholar]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
Pereira, F.; Lopes, H.; Pinto, L.; Soares, F.; Vasconcelos, R.; Machado, J.; Carvalho, V. A Novel Deep Learning Approach for Yarn Hairiness Characterization Using an Improved YOLOv5 Algorithm. Appl. Sci. 2025, 15, 149. [Google Scholar] [CrossRef]
Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
Ju, R.-Y.; Cai, W. Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm. Sci. Rep. 2023, 13, 20077. [Google Scholar] [CrossRef]
Taesi, C.; Aggogeri, F.; Pellegrini, N. COBOT Applications—Recent Advances and Challenges. Robotics 2023, 12, 79. [Google Scholar] [CrossRef]
Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the Computer Vision—ECCV 2024. ECCV 2024, Milan, Italy, 29 September–4 October 2024; Lecture Notes in Computer, Science. Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; Volume 15089, pp. 1–21. [Google Scholar] [CrossRef]
Mao, M.; Lee, A.; Hong, M. Efficient Fabric Classification and Object Detection Using YOLOv10. Electronics 2024, 13, 3840. [Google Scholar] [CrossRef]
Banduka, N.; Tomić, K.; Živadinović, J.; Mladineo, M. Automated Dual-Side Leather Defect Detection and Classification Using YOLOv11: A Case Study in the Finished Leather Industry. Processes 2024, 12, 2892. [Google Scholar] [CrossRef]
Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
Xu, M.; Yoon, S.; Fuentes, A.; Park, D.S. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning. Pattern Recognit. 2023, 137, 109347. [Google Scholar] [CrossRef]
Monigatti, L. Cutout, Mixup, and Cutmix: Implementing Modern Image Augmentations in PyTorch. 2023. Available online: https://towardsdatascience.com/cutout-mixup-and-cutmix-implementing-modern-image-augmentations-in-pytorch-a9d7db3074ad?gi=00cd267eb9d8 (accessed on 12 January 2025).
Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 6023–6032. [Google Scholar]
Montini, E.; Daniele, F.; Agbomemewa, L.; Confalonieri, M.; Cutrona, V.; Bettoni, A.; Rocco, P.; Ferrario, A. Collaborative Robotics: A Survey From Literature and Practitioners Perspectives. J. Intell. Robot. Syst. 2024, 110, 117. [Google Scholar] [CrossRef]
Anjum, M.U.; Khan, U.S.; Qureshi, W.S.; Hamza, A.; Khan, W.A. Vision-Based Hybrid Detection For Pick And Place Application In Robotic Manipulators. In Proceedings of the 2023 International Conference on Robotics and Automation in Industry (ICRAI), Peshawar, Pakistan, 3–5 March 2023; pp. 1–5. [Google Scholar] [CrossRef]
Tarabini, M. AI Solutions for Grilled Eggplants Sorting: A Comparative Analysis of Image-Based Techniques. In Proceedings of the 2024 IEEE International Workshop on Metrology for Industry 4.0 & IoT (MetroInd4.0 & IoT), Florence, Italy, 29–31 May 2024; IEEE: New York, NY, USA, 2024; pp. 48–53. [Google Scholar]
Patil, S.; Vasu, V.; Srinadh, K.V.S. Advances and Perspectives in Collaborative Robotics: A Review of Key Technologies and Emerging Trends. Discov Mech. Eng. 2023, 2, 13. [Google Scholar] [CrossRef]
Muhammad, H.Z.; Even, F.L.; Sanfilippo, F. Exploring the Synergies between Collaborative Robotics, Digital Twins, Augmentation, and Industry 5.0 for Smart Manufacturing: A State-of-The-Art Review. Robot. Comput.-Integr. Manuf. 2024, 89, 102769. [Google Scholar]
Hameed, A.; Ordys, A.; Możaryn, J.; Sibilska-Mroziewicz, A. Control System Design and Methods for Collaborative Robots: Review. Appl. Sci. 2023, 13, 675. [Google Scholar] [CrossRef]
Mariscal, M.A.; Ortiz Barcina, S.; García Herrero, S.; López Perea, E.M. Working with collaborative robots and its influence on levels of working stress. Int. J. Comput. Integr. Manuf. 2023, 37, 900–919. [Google Scholar] [CrossRef]
Sultanov, R.; Sulaiman, S.; Li, H.; Meshcheryakov, R.; Magid, E. A Review on Collaborative Robots in Industrial and Service Sectors. In Proceedings of the 2022 International Siberian Conference on Control and Communications (SIB-CON), Tomsk, Russia, 17–19 November 2022; IEEE: New York, NY, USA, 2022; pp. 1–7. [Google Scholar] [CrossRef]
Liu, L.; Guo, F.; Zou, Z.; Duffy, V.G. Application, Development and Future Opportunities of Collaborative Robots (Cobots) in Manufacturing: A Literature Review. Int. J. Hum. Comput. Interact. 2022, 40, 915–932. [Google Scholar] [CrossRef]
Universal Robots. Universal Robots e-Series User Manual, Version 5.7, US Version. Available online: https://s3-eu-west-1.amazonaws.com/ur-support-site/68217/99454_UR3e_User_Manual_en_US.pdf (accessed on 18 November 2024).
OnRobot. 2FG7 OnRobot Gripper Datasheet. Available online: https://onrobot.com/sites/default/files/documents/Datasheet_2FG7_v1.0_EN_0.pdf (accessed on 18 November 2024).
OMRON. Encoder E6B2-CWZ6C 360P/R 2M. Available online: https://assets.omron.eu/downloads/latest/datasheet/en/q085_e6b2-c_incremental_rotary_encoder_40_mm_datasheet_en.pdf?v=8 (accessed on 18 November 2024).
IntelREALSENSE. Intel® RealSense™ Self-Calibration for D400 Series Depth Cameras-Web Documentation. Available online: https://dev.intelrealsense.com/docs/self-calibration-for-depth-cameras (accessed on 19 November 2024).
Terras, N.; Pereira, F.; Ramos, A.; Machado, J.; Santos, A.A.; Lopes, A.; Silva, A.; Cartal, L.; Apostolescu, T.; Badea, F. Candy Dataset—Faculty of Engineering—University of Porto—Department of Mechanical Engineering. Mendeley Data 2025, 1. [Google Scholar] [CrossRef]
Pereira, F.; Pinto, L.; Soares, F.; Vasconcelos, R.; Machado, J.; Carvalho, V. Online Yarn Hairiness– Loop & Protruding Fibers Dataset. Data Brief 2024, 54, 110355. [Google Scholar]
Kee, E.; Chong, J.J.; Choong, Z.J.; Lau, M. Development of Smart and Lean Pick-and-Place System Using EfficientDet-Lite for Custom Dataset. Appl. Sci. 2023, 13, 11131. [Google Scholar] [CrossRef]
Thin Jun, E.L.; Tham, M.-L.; Kwan, B.-H. A Comparative Analysis of RT-DETR and YOLOv8 for Urban Zone Aerial Object Detection. In Proceedings of the 2024 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), Shah Alam, Malaysia, 29–29 June 2024; pp. 340–345. [Google Scholar] [CrossRef]
Nguyen, N.-D.; Do, T.; Ngo, T.D.; Le, D.-D. An Evaluation of Deep Learning Methods for Small Object Detection. J. Electr. Comput. Eng. 2020, 3189691. [Google Scholar] [CrossRef]
Ramasubramanian, A.K.; Kazasidis, M.; Fay, B.; Papakostas, N. On the Evaluation of Diverse Vision Systems towards Detecting Human Pose in Collaborative Robot Applications. Sensors 2024, 24, 578. [Google Scholar] [CrossRef] [PubMed]
Patalas-Maliszewska, J.; Dudek, A.; Pajak, G.; Pajak, I. Working toward Solving Safety Issues in Human–Robot Collaboration: A Case Study for Recognising Collisions Using Machine Learning Algorithms. Electronics 2024, 13, 731. [Google Scholar] [CrossRef]

Figure 1. Diagram of the research workflow as outlined in the article.

Figure 2. Pick-and-place with vision system developed in an experimental scenario (adapted from [12]).

Figure 3. Grasping system developed by Ruohuai Sun et al. (adapted from [13]).

Figure 4. Experimental (a) and simulated (b) setups developed by Natanael Magno Gomes et al. (adapted from [14]).

Figure 5. Architecture of the YOLOv10 model (adapted from [39]).

Figure 6. Example representation of the bounding boxes used for the IoU calculation using the bonbon packaging line.

Figure 7. Pixel-level augmentation changes the individual value of each pixel in an image without altering any of the image’s geometric parameters.

Figure 8. Spatial-level augmentation. These transformations alter the geometric properties of the image without changing any particular pixel value.

Figure 9. Structural-level augmentation. The most common techniques are MixUp, CutOut, CutMix, or other cutting-edge strategies, such as mosaic augmentation (adapted from [43,44]).

Figure 10. Base and TCP references of the UR3e cobot.

Figure 11. Layout of the implemented solution. (1) Cobot; (2) RealSense D435i; (3) 2FG7 OnRobot gripper; (4) candy container; (5) conveyor belt; and (6) encoder.

Figure 12. Wiring between the encoder and the cobot’s I/O board.

Figure 13. Communication chart of the developed system.

Figure 14. Flowcharts of (a) the RTDE communication pipeline and (b) D435i communication pipeline.

Figure 15. Flowchart of the Arduino serial communication pipeline.

Figure 16. (a) Overview of the end-effector and intel D435i camera; (b) A4 marker texture provided by Intel [57].

Figure 17. Schematic of the camera’s field of view and relevant frames.

Figure 18. Diagram of the cobot’s state machine.

Figure 19. Dataset’s health check information.

Figure 20. Different examples of non-compliant examples, with a compliant example of each color for comparison.

Figure 21. Representation of candy position diversity and instance distribution in the dataset. (a) Heatmap of the dataset: yellow indicates the most frequently used areas and blue represents less utilized positions. (b) Histogram containing the candy instance distribution.

Figure 22. Different versions of the datasets and data augmentation techniques used in the study.

Figure 23. Total loss curve relative to the training of RetinaNet.

Figure 24. Training and validation losses and metrics for the selected YOLOv10n model.

Figure 25. Confusion matrix for the YOLOv10n model.

Figure 26. F1-score for the YOLOv10n model.

Table 1. Comparison between all systems mentioned above in the state of the art.

Identified Gaps in Existing Systems	Subchapter/System
Dependence on controlled lighting Restriction to laboratory environments Lack of deep learning techniques	2.1
Small dataset (100 images), limiting generalization Difficulty dealing with variable lighting conditions KINOVA robot not very suitable for industrial applications	2.2
Time-consuming training due to the use of DRL Lack of gripper force control, limiting industry/food applications Insufficient RTDE integration for real-time synchronization	2.3
CNNs depend on specific images, reducing adaptability Lack of advanced image processing techniques Restricted to generic tasks, does not meet specific needs in the food industry	2.4

Table 2. Comparison between YOLO models with different batch sizes.

YOLO Version	Dataset	Batch Number (mAP%)
YOLO Version	Dataset	8	16	32
YOLOv8m	1—Vanilla	0.90074	0.91419	0.90453
YOLOv8m	2—Roboflow	0.91499	0.92518	0.91622
YOLOv10m	1—Vanilla	0.88762	0.91291	0.90127
YOLOv10m	2—Roboflow	0.89956	0.92434	0.90897

Table 3. Model performance across all custom trained models.

Version	Vanilla mAP (50–95)	Vanilla Δ (%)	Roboflow mAP (50–95)	Roboflow Δ (%)	Albu v1.0 mAP (50–95)	Albu v1.0 Δ (%)	Albu v2.0 mAP (50–95)	Albu v2.0 Δ (%)
YOLOv8 (v8-n)	0.906	2.0	0.919	0.7	0.911	1.5	0.913	1.3
YOLOv8 (v8-s)	0.903	2.4	0.922	0.4	0.912	1.4	0.919	0.6
YOLOv8 (v8-m)	0.914	1.2	0.925	-	0.921	0.4	0.917	0.9
YOLOv9 (v9-c)	0.911	1.6	0.919	0.7	0.918	0.5	0.919	0.7
YOLOv10 (v10-n)	0.897	3.1	0.920	0.5	0.907	2.0	0.918	0.7
YOLOv10 (v10-s)	0.902	2.5	0.919	0.6	0.918	0.8	0.917	0.9
YOLOv10 (v10-m)	0.912	1.5	0.924	0.1	0.914	1.2	0.923	0.3
YOLOv10 (v10-b)	0.913	1.4	0.921	0.4	0.920	0.6	0.921	0.4
RT-DETR	0.898	1.6	0.910	1.5	0.905	1.6	0.896	2.7
Faster-RCNN	0.856	5.8	0.863	6.2	-	-	-	-
RetinaNet (ResNet50)	0.868	4.6	0.869	5.6	-	-	-	-
RetinaNet (ResNet101)	0.866	4.8	0.881	4.4	-	-	-	-

Table 5. Sorting performance comparison across different classes.

Class	Instances	Success	Selection Error	Detection Error
Blue	20	20	0	0
Green	20	19	0	1
Red	20	20	0	0
Yellow	20	20	0	0
Non-Compliance	10	9	0	1
Unpacked	10	10	0	0
Total	100 [%]	98 [%]	0 [%]	2 [%]

Table 6. Average distance and time spent for each robotic movement and task.

Movement	Avg Distance (cm)	Avg Time Spent (s)
“Home” to Candy (Horizontal)	15.1	1.1
Candy Approach (Vertical)	8.4	0.4
Candy to Endpoint	37.8	2.3
Endpoint to “Home”	30.2	1.7
OD Time Cost	-	0.1
Gripping Time Cost	-	0.1
Total	91.5	5.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Terras, N.; Pereira, F.; Ramos Silva, A.; Santos, A.A.; Lopes, A.M.; Silva, A.F.d.; Cartal, L.A.; Apostolescu, T.C.; Badea, F.; Machado, J. Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications. Appl. Sci. 2025, 15, 1336. https://doi.org/10.3390/app15031336

AMA Style

Terras N, Pereira F, Ramos Silva A, Santos AA, Lopes AM, Silva AFd, Cartal LA, Apostolescu TC, Badea F, Machado J. Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications. Applied Sciences. 2025; 15(3):1336. https://doi.org/10.3390/app15031336

Chicago/Turabian Style

Terras, Nuno, Filipe Pereira, António Ramos Silva, Adriano A. Santos, António Mendes Lopes, António Ferreira da Silva, Laurentiu Adrian Cartal, Tudor Catalin Apostolescu, Florentina Badea, and José Machado. 2025. "Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications" Applied Sciences 15, no. 3: 1336. https://doi.org/10.3390/app15031336

APA Style

Terras, N., Pereira, F., Ramos Silva, A., Santos, A. A., Lopes, A. M., Silva, A. F. d., Cartal, L. A., Apostolescu, T. C., Badea, F., & Machado, J. (2025). Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications. Applied Sciences, 15(3), 1336. https://doi.org/10.3390/app15031336

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Integration of Deep Learning Vision Systems in Collaborative Robotics for Real-Time Applications

Abstract

1. Introduction

2. Literature Review

2.1. Integration of Cobots with Computer Vision and Object Detection

2.2. Integration of Deep-Learning Techniques with Collaborative Robots

2.3. Collaborative Robot with Reinforcement Learning for Pick-and-Place Applications

2.4. Identifed Gaps and Proposed Solutions

3. Materials and Methods

3.1. Deep Learning

One-Stage Detectors

3.2. Performance Metrics Used

3.3. Image Augmentation

3.4. Collaborative Robot

3.4.1. Reference Frames

3.4.2. Collaborative Robot UR3e

3.4.3. OnRobot Gripper

3.4.4. Encoder

3.4.5. Cell Layout

3.4.6. Cobot Calibration

3.4.7. Cell Communication Pipelines

3.5. Vision System

3.5.1. Intel RealSense D435i

3.5.2. Intrinsic Calibration

3.5.3. Depth Stream and Coordinates

3.6. Developed Cell

3.6.1. Sorting

3.6.2. Non-Compliant Candy

3.6.3. Prioritization

3.6.4. Cobot Trajectory

3.6.5. State Machine

3.7. Candy Dataset

3.7.1. Dataset Construction

3.7.2. Dataset Annotation

3.7.3. Dataset Augmentation

4. Experimental Results and Analysis

4.1. Object Detection Model Training

4.1.1. Training Environment Implementation Augmentation

4.1.2. YOLO and RT-DETR

4.2. Model Tuning

4.2.1. Hyperparameter Tuning and Training Strategies

4.2.2. Learning Rate

4.2.3. Batch Size

4.2.4. Optimization Strategy

4.2.5. Epoch Configuration

4.2.6. Total Loss Curve

4.3. Object Detection—Model Selection

4.4. Speed Analysis

4.5. Selection of the Model

4.6. Performance Evaluation

5. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI