Article

Pallet Recognition with Multi-Task Learning for Automated Guided Vehicles

1 School of Industrial and Management Engineering, Korea University, 145 Anamro, Seongbuk-gu, Seoul 02841, Korea
2 School of Mathematics, Statistics, and Data Science, Sungshin Women’s University, 2 Bomun-ro 34da-gil, Seongbuk-gu, Seoul 02844, Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2021, 11(24), 11808; https://doi.org/10.3390/app112411808
Submission received: 18 November 2021 / Revised: 6 December 2021 / Accepted: 8 December 2021 / Published: 12 December 2021

Abstract

As the need for efficient warehouse logistics has increased in manufacturing systems, the use of automated guided vehicles (AGVs) has also increased to reduce travel time. AGVs are controlled by a system that uses laser sensors or floor-embedded wires to transport pallets and their loads. Because such control systems follow only predefined palletizing strategies, AGVs may fail to engage incorrectly positioned pallets. In this study, we consider a vision sensor-based method that addresses this shortcoming by recognizing a pallet’s position. We propose a multi-task deep learning architecture that simultaneously predicts distance and rotation from images obtained by a vision sensor. These predictions complement each other during learning, allowing the multi-task model to achieve accuracy that separate single-task models do not reach. The proposed model accurately predicts the rotation and displacement of the pallets, providing the information the control system needs to optimize its palletizing strategy. The superiority of the proposed model was verified in an experiment on images of stored pallets collected from a vision sensor attached to an AGV.

1. Introduction

Automated guided vehicles (AGVs) are among the most important elements of manufacturing systems. They have developed steadily since the 1950s because they provide a flexible and reliable alternative to conventional material transportation. For repetitive transportation, AGVs were previously guided by wires embedded in the floor. In recent years, laser navigation has allowed AGVs to travel more diverse and accurate routes and has been extended in logistics systems to guide automated forklifts. The advantage of AGVs is that they can increase the speed of logistical circulation through automation and increase productivity through efficient use of inventory space. Moreover, the risk of workplace accidents can be greatly reduced because work is performed along planned routes. Consequently, AGVs can be used in warehouses in which repetitive work is frequent, in those that require unmanned operation 24 h a day, and in those whose narrow spaces demand efficient use of space. In particular, AGVs play essential roles in smart factories where all facilities and devices are interconnected by wireless communication.
Pallets are used to transport objects stably, and moving them has become the main role of AGVs. An AGV moves to a specified pallet, lifts it using forks, and transports it to a specific location. All processes are performed under the assumption that the pallet is centered and squared so that the AGV can move straight forward to lift it. However, repeated lifting of pallets with different weights can change the position of the pallets. If this goes unrecognized, an accident may occur during lifting or transport; even if the AGV recognizes the problem, human assistance is still required to execute the lift accurately. If an AGV is to solve such a problem alone, it must not only recognize a wrongly positioned pallet but also adjust its approach to ensure correct lifting.
Several studies have used vision sensors with the goal of building an automated forklift system that incorporates pallet recognition and forklift control. Garibotto et al. presented a vision-based system for intelligent forklifts including pallet recognition [1]. Their intelligent forklift, ROBOLIFT, moves using vision navigation and recognizes pallets through a vision sensor, and it became the basis for developing pallet recognition technology related to computer vision. Seelinger and Yoder presented a vision-guided control method called mobile camera-space manipulation (MCSM) [2]. It enabled the movement of an autonomous forklift based on its actual current location and recognized the position of the pallet using feedback from the forklift’s vision sensor. Both studies assumed the pallet was placed on the floor and focused on the control of the forklift.
Several studies have focused on pallet recognition itself. For more accurate recognition, they used images from various sources for feature extraction. Chen et al. proposed a pallet image detection method using the hue-saturation-value (HSV) color space rather than the red-green-blue (RGB) space [3]. They used edge detection to classify pallets in photos and a coordinate system to obtain their location information. They calculated the position of pallets relative to the forklift based on pallet midpoints determined by a camera space model, which uses coordinate systems to relate the image space to the real-world space. Varga and Nedevschi suggested a pallet detection system based on the recognition of pallet edge candidates [4]. Furthermore, to be robust to variations in image brightness, they used grayscale images. Xiao et al. also proposed pallet detection and localization using a red-green-blue-depth (RGB-D) camera [5]; multiple pallets can be recognized at once by using the depth information from such a camera. Image acquisition and processing for pallet detection were studied intensively before research turned to deep learning approaches.
A few researchers have recently adopted frameworks that combine machine learning and deep learning for forklift pallet detection. Syu et al. proposed a pallet detection methodology using Haar-like feature detection and the AdaBoost algorithm [6]. They achieved 95% accuracy by applying AdaBoost to candidates selected with Haar-like detection. Li et al. proposed a pallet detection model using the single shot multibox detector (SSD) algorithm [7]. To recognize pallets in various warehouse situations, an experiment was conducted with 4620 images resembling real situations, collected from open sources, and achieved 92.7% accuracy. However, this approach was limited by its inability to detect the forklift pocket. Zaccaria et al. suggested a methodology for simultaneously detecting pallets and pallet pockets in RGB images [8]. They compared several object detection algorithms such as SSD, faster region-based convolutional neural networks (Faster R-CNN), and you only look once version four (YOLOv4) on images with multiple pallets and pallet pockets; overall, Faster R-CNN performed best. Object detection and segmentation using deep neural networks showed high accuracy in finding pallets in images. However, accurately determining whether a pallet is properly placed remained a problem.
In this study, we propose a method that uses a vision sensor to guide a forklift. The method is applied in the precise recognition step before the AGV lifts the pallet. When the AGV arrives in front of the pallet, the vision sensor takes an image, which is then used to predict the angle and distance the AGV must move before it can lift the pallet.
The main contributions of this study can be summarized as follows:
(1) To the best of our knowledge, this research is the first attempt to build a deep learning model to predict and guide the movement of an AGV with pallet images. Each image permits measurements of the angle and distance the AGV should move.
(2) We propose a multi-task learning framework for a deep learning model that predicts the angle and distance simultaneously. The model shares all layers except the task-specific classifiers, so it requires fewer parameters and less training time. The two labels (angle and distance) are related to each other, which enables better feature learning when they are learned jointly.
(3) To demonstrate the usefulness and applicability of the proposed method, we used real-world image data to compare the proposed multi-task framework with a single-task framework in terms of classification accuracy. In addition, we used images taken with the lights turned on and horizontally flipped images, an efficient way of collecting additional data that further improves detection accuracy.
The remainder of this paper is organized as follows. Section 2 describes the details of the proposed method. Section 3 presents the experimental results. Finally, Section 4 contains our concluding remarks.

2. The Proposed Method

Our multi-task learning framework for pallet recognition performs two tasks: classification of the angle class and classification of the distance class. We propose a deep learning model with multi-task learning, as shown in Figure 1. Unlike the approach that uses a separate classification model for each task, as shown in Figure 2, our proposed method uses hard parameter sharing for all layers except the classifiers. Each input image has two labels (angle and distance), and both are used simultaneously for training. In this section, we describe the proposed methodology in detail.

2.1. Labels

The goal of pallet recognition is to center the AGV in front of a pallet. Therefore, we defined two guides for the AGV, as shown in Figure 3. One is the angle guide for rotation, and the other is the distance guide for horizontal movement. For the angle labels, 11 classes were defined at intervals of three degrees, from 15 degrees west to 15 degrees east; these labels allow the AGV to rotate to a position parallel to the pallet. For the distance labels, seven classes were defined at intervals of ten centimeters, from 30 cm left to 30 cm right; these labels allow the AGV to move to a position centered in front of the pallet. Together, the two guides allow the AGV to engage a pallet correctly even if the pallet is in the wrong position.
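As a concrete illustration of this labeling scheme, the following Python sketch (ours, not from the original work) maps a measured rotation and lateral offset to the corresponding class indices; the class ranges and step sizes are those defined above, while the function names and the index convention are hypothetical.

```python
# Minimal sketch: map a measured pallet pose to the 11 angle classes
# (3-degree steps, -15 to +15 degrees) and 7 distance classes
# (10 cm steps, -30 to +30 cm) described above.

def angle_to_class(angle_deg: float) -> int:
    """Map a rotation in degrees (negative = west/left) to one of 11 classes (0-10)."""
    assert -15.0 <= angle_deg <= 15.0
    return round((angle_deg + 15.0) / 3.0)

def distance_to_class(offset_cm: float) -> int:
    """Map a lateral offset in cm (negative = left) to one of 7 classes (0-6)."""
    assert -30.0 <= offset_cm <= 30.0
    return round((offset_cm + 30.0) / 10.0)

# Example: a pallet rotated 6 degrees east and shifted 20 cm right
print(angle_to_class(6.0), distance_to_class(20.0))  # -> 7 5
```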
We defined the position of the pallet in terms of these classes and used a deep learning model to predict them from the images taken by the AGV. We conducted experiments with five popular convolutional neural network (CNN) architectures: AlexNet by Krizhevsky, Sutskever, and Hinton [9]; VGG11 by Simonyan and Zisserman [10]; and 18-layer, 50-layer, and 101-layer residual networks (ResNet-18, ResNet-50, and ResNet-101) by He, Zhang, Ren, and Sun [11]. All performed superbly in the ImageNet challenge [12] in the year of their publication. AlexNet was the first to achieve a large performance improvement by using CNNs, VGGNet showed good performance using only small filters, and ResNet used residual blocks to propagate gradients to deep layers. These models all combine convolutional layers with a fully connected classifier: the convolutional layers extract features of the image, which are then classified into several classes by the fully connected layers. AlexNet and VGG11, which do not use a global average pooling layer, require numerous parameters to connect the convolutional layers to the classifier; therefore, as the size of the extracted feature maps increases, the number of parameters increases rapidly. ResNet-18, ResNet-50, and ResNet-101, which place a global average pooling layer after the last convolutional layer, have a shallow classifier. The ResNet models are built from the same convolutional blocks, with the depth of the convolutional layers varying by model. As shown in Table 1, the ResNet models have relatively few parameters despite their deep layers, so we used three ResNet models with varying depths. A classification model that predicts each label was trained with each of these five architectures. We trained for a total of 100 epochs with a batch size of 8, using the stochastic gradient descent (SGD) optimizer with a learning rate of 0.005 and a momentum factor of 0.9.
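For reference, a hedged PyTorch sketch of one single-task baseline is shown below: a standard ResNet-18 backbone whose final layer is replaced by an 11-way angle classifier, trained with the hyperparameters stated above (SGD, learning rate 0.005, momentum 0.9, batch size 8, 100 epochs). Everything beyond those stated settings, including the data-loading details, is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ANGLE_CLASSES = 11  # -15 to +15 degrees in 3-degree steps

# Single-task baseline: ResNet-18 with its classifier replaced by an 11-way head.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_ANGLE_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

def train_one_epoch(loader):
    """One pass over a DataLoader yielding (image batch, angle label batch)."""
    model.train()
    for images, angle_labels in loader:  # batch size of 8 in the paper
        optimizer.zero_grad()
        loss = criterion(model(images), angle_labels)
        loss.backward()
        optimizer.step()
```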

2.2. Multi-Task Learning

Multi-task learning has been used successfully in machine learning and deep learning frameworks. It is commonly implemented through hard or soft parameter sharing [13,14]. In hard parameter sharing, some parameters are shared among the tasks, and the rest are task-specific. Soft parameter sharing allows all parameters to be jointly constrained, for example via a Bayesian prior or a joint dictionary [15,16]. Hard parameter sharing in particular has been studied widely and has shown good performance in computer vision and natural language processing [17,18,19,20,21,22,23]. Several studies used hard parameter sharing to predict several elements simultaneously when recognizing objects [17,18,20], and learning two tasks at once can improve recognition performance.
The two labels are defined to represent the position of the pallet. The tasks of predicting the two labels are related, so it is intuitive to assume that they share a common feature representation [24]. In the feature learning approach, a better representation can be learned from the data common to all tasks. Therefore, for better representation learning, we propose a multi-task model that shares the convolutional layers. Our proposed model has the same convolutional layers as the deep learning models in Section 2.1 but has a multi-head structure with two classifiers. Hard parameter sharing reduces the risk of overfitting and the number of parameters in the model [25]. Specifically, because ResNet has a shallow classifier, the multi-task model using the ResNet structure does not differ significantly from a single-task model in terms of the number of parameters. As shown in Table 2, multi-task models using the ResNet structures have half the parameters of the two single models combined. The loss function for the multi-task model with two tasks (angle and distance) is defined as follows:
$L_{\text{multitask}} = \lambda L_{\text{angle}} + (1 - \lambda) L_{\text{distance}},$
where $\lambda \in [0, 1]$. The hyperparameter $\lambda$ balances the two loss terms: as $\lambda$ increases, the model focuses more on the angle class, and as it decreases, the model focuses more on the distance class. Table 3 shows that the accuracy varies with $\lambda$. We set $\lambda = 0.6$, which showed the best performance. The two tasks were trained at once using this multi-task loss, and the training hyperparameters are the same as in Section 2.1.
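The following PyTorch sketch illustrates this hard-parameter-sharing design and weighted loss under our own assumptions about the implementation: one ResNet-18 backbone shared by two classification heads, with the class counts and λ = 0.6 taken from the text; the class and function names are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiTaskPalletNet(nn.Module):
    """Shared convolutional backbone with two task-specific classifiers."""
    def __init__(self, num_angle=11, num_distance=7):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                 # keep only the shared feature extractor
        self.backbone = backbone
        self.angle_head = nn.Linear(feat_dim, num_angle)        # Task 1 classifier
        self.distance_head = nn.Linear(feat_dim, num_distance)  # Task 2 classifier

    def forward(self, x):
        feats = self.backbone(x)
        return self.angle_head(feats), self.distance_head(feats)

criterion = nn.CrossEntropyLoss()
lam = 0.6  # the value reported to perform best (Table 3)

def multitask_loss(angle_logits, dist_logits, angle_y, dist_y):
    """Weighted sum of the two cross-entropy losses."""
    return lam * criterion(angle_logits, angle_y) + (1 - lam) * criterion(dist_logits, dist_y)
```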

3. Experiments

3.1. Datasets

There are two types of labels: 11 classes for angle and seven classes for distance (i.e., a total of 77 label combinations). We obtained a total of 385 images taken in five types of AGV workplaces; by moving the AGV in each place, we obtained 77 images covering all angle and distance combinations. For the test dataset, 77 images were sampled by stratified random sampling so that all 77 label combinations appear exactly once; two images with the same label, such as those shown in Figure 4a,b, were never both included in the test set. Of the remaining images, 38 were used for validation and 270 for training. Because a deep learning model generally requires a large training dataset, we increased the training data in two ways. The first was to take an additional picture with an extra light turned on in the same place; these new images have the same labels as the originals. The second was to flip the images horizontally; the labels of these images were flipped accordingly. Combining these two methods yields a total of four image types, as shown in Figure 5. To verify the effect of each, datasets were created in the four ways shown in Table 4.
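A hedged sketch of the flipping strategy is shown below: each image is mirrored horizontally and its angle and distance classes are mirrored as well (class k becomes (number of classes − 1) − k), doubling the dataset. The helper name and the class-index convention are our own assumptions.

```python
import torchvision.transforms.functional as TF

NUM_ANGLE_CLASSES, NUM_DISTANCE_CLASSES = 11, 7

def flip_sample(image, angle_class, distance_class):
    """Horizontally flip an image tensor (or PIL image) and mirror its two labels."""
    flipped_image = TF.hflip(image)
    flipped_angle = (NUM_ANGLE_CLASSES - 1) - angle_class           # e.g., 15 degrees east -> 15 degrees west
    flipped_distance = (NUM_DISTANCE_CLASSES - 1) - distance_class  # e.g., 20 cm right -> 20 cm left
    return flipped_image, flipped_angle, flipped_distance
```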

3.2. Evaluation of the Performance of Pallet Recognition

The deep learning models perform two classification tasks (angle and distance), and both predictions must be correct for the system to function in real-world situations; if even one label is wrong, the AGV cannot engage the pallet. Thus, we used the exact match ratio (subset accuracy), the percentage of samples for which all labels are classified correctly. The exact match ratio is defined as the following product of indicator functions of the two labels:
$\text{exact match ratio} = \frac{1}{n} \sum_{i=1}^{n} I\left(Y_{\text{angle}}^{(i)} = \hat{Y}_{\text{angle}}^{(i)}\right) I\left(Y_{\text{distance}}^{(i)} = \hat{Y}_{\text{distance}}^{(i)}\right),$
where $Y^{(i)}$ is the true label and $\hat{Y}^{(i)}$ is the corresponding predicted label for the $i$th sample, and $I$ is an indicator function that returns one if the condition in parentheses is true and zero otherwise. The exact match ratio is a strict measure because a sample counts as zero if any of its predicted labels is wrong.
Although our tasks are classification problems, each label can also be regarded as a continuous quantity. Therefore, we also used the mean absolute error, which measures the difference between the actual and predicted values:
$\text{mean absolute error} = \frac{1}{n} \sum_{i=1}^{n} \left| Z^{(i)} - \hat{Z}^{(i)} \right|,$
where $Z^{(i)}$ and $\hat{Z}^{(i)}$ are the true and predicted label values for the $i$th sample.
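Both measures are straightforward to compute; the short sketch below assumes the predictions and labels are available as NumPy arrays of class indices, and the conversion of class-index error to physical units (3 degrees or 10 cm per class) is our assumption about how Table 8 is reported.

```python
import numpy as np

def exact_match_ratio(angle_true, angle_pred, dist_true, dist_pred):
    """Fraction of samples for which both the angle and the distance class are correct."""
    return np.mean((angle_true == angle_pred) & (dist_true == dist_pred))

def mean_absolute_error(z_true, z_pred, step):
    """MAE of one label; step converts class-index error to physical units (e.g., 3 degrees or 10 cm)."""
    return np.mean(np.abs(z_true - z_pred)) * step
```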
We conducted experiments with multi-task and single-task models on the five deep learning architectures. As reported in Table 5, the multi-task model outperformed the single-task models in terms of exact match ratio. The task-specific accuracies in Table 6 show a different pattern. For Task 1, both models showed high accuracy, indicating that the task is easy to learn; the single-task model performed slightly better than the multi-task model, but the difference is less than 2%. For Task 2, both models were less accurate than for Task 1, and the multi-task model was more accurate. The multi-task structure thus benefits Task 2, which is relatively difficult to learn. As shown in Figure 6, the training and validation losses for Task 1 follow similar curves regardless of the model. For Task 2, however, the training losses of the multi-task model converge earlier, and its validation losses are lower. Compared with the single-task model, the gap between training and validation losses is smaller, consistent with the robustness of multi-task models to overfitting. Figure 7 shows the same pattern in terms of accuracy.
We then conducted experiments with the proposed model using additional training data. Lighted and flipped images constituted the additional data, and experiments were conducted for the four cases shown in Table 7 and Table 8 to examine the effect of each image type. In most cases, the additional data improved performance, with the largest improvement coming from the lighted images. The best performance was achieved when both types of images were used.

3.3. Soft Parameter Sharing

Additional experiments were conducted to confirm the effect of the regularization used in deep multi-task learning. Our proposed model uses hard parameter sharing, in which some parameters are shared between the tasks. In soft parameter sharing, by contrast, each task has its own model with its own parameters, and a regularization term penalizes the distance between corresponding parameters so that the models are trained to remain similar; when this distance is zero, soft parameter sharing is equivalent to hard parameter sharing. To examine the effect of this regularization, we compared two single models, soft parameter sharing, and hard parameter sharing with the ResNet architectures. For soft parameter sharing, the convolutional layers were regularized using an L2 penalty. As shown in Table 9, soft parameter sharing achieves accuracy between that of the two single models and that of hard parameter sharing. As shown in Table 10, Task 1 shows high accuracy for all methods because ample data are available. For Task 2, however, the performance of the single-task model decreases as the network becomes deeper, whereas the soft parameter sharing method maintains its performance through regularization.
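The sketch below illustrates soft parameter sharing under our own assumptions: each task keeps its own ResNet-18, and an L2 penalty on the distance between corresponding convolutional parameters keeps the two models similar; the penalty weight and helper name are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

# Two separate task-specific models (angle: 11 classes, distance: 7 classes).
angle_model = models.resnet18(weights=None)
angle_model.fc = nn.Linear(angle_model.fc.in_features, 11)
dist_model = models.resnet18(weights=None)
dist_model.fc = nn.Linear(dist_model.fc.in_features, 7)

def soft_sharing_penalty(model_a, model_b, weight=1e-3):
    """L2 distance between corresponding backbone (non-classifier) parameters."""
    penalty = 0.0
    for (name_a, p_a), (_, p_b) in zip(model_a.named_parameters(), model_b.named_parameters()):
        if not name_a.startswith("fc"):  # regularize only the convolutional layers
            penalty = penalty + torch.sum((p_a - p_b) ** 2)
    return weight * penalty

# Per-batch objective: L_angle + L_distance + soft_sharing_penalty(angle_model, dist_model)
```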

4. Conclusions

In this study, we conducted experiments with real-world datasets to predict the positions of pallets. Safely forking a pallet requires accurate estimates of the relative angle and distance between the AGV and the pallet. We proposed a deep learning model with multi-task learning that simultaneously predicts the angle and distance. We compared the performance of the proposed multi-task model with single-task models on five deep learning architectures and observed that the multi-task model maintained its performance on the easy task and outperformed the other models on the difficult task. As a result, the proposed model outperformed the other models on the pallet recognition problem, which requires high performance on both tasks. In addition to these promising results, the multi-task model has almost half the parameters of the two single models combined. We also proposed an efficient method for collecting training images and observed that performance increased when lighted and flipped images were used as additional training data. A limitation of the proposed model is that its performance can vary with the weights assigned to the tasks: because the multi-task loss is a weighted sum of the task losses, an additional process of finding an appropriate ratio is required.
There are several directions in which this research could be expanded. In the present study, the task is limited to the navigation situation of the AGV because we used images with limited ranges of angles and distances; it could be extended to predict all ranges of angles and distances in an open space. Further, other tasks could be combined, such as estimating the forward distance between the pallet and the AGV or the height of the pallet. Such extensions are expected to be useful not only for AGVs but also for various real-world autonomous driving systems. Another direction is developing task balancing for the loss function. Various methods are available, including heuristic methods, methods using uncertainty, and methods using gradient surgery. With proper balancing, better performance could be obtained with the same datasets and models.

Author Contributions

Conceptualization, C.M., I.B., Y.C. and S.K.; methodology, C.M., I.B., Y.C. and Y.K.; investigation, C.M., I.B., Y.C. and S.K.; writing—original draft preparation, C.M., I.B. and Y.C.; writing—review and editing, Y.K. and S.K.; supervision, Y.K. and S.K. All authors have read and agreed to the published version of the manuscript.

Funding

National Research Foundation of Korea: BK FOUR; National Research Foundation of Korea: NRF-2019R1A4A1024732; Institute of Information & Communications Technology Planning & Evaluation: IITP-2020-0-01749; Korea Creative Content Agency: R2019020067.

Data Availability Statement

The data presented in this study are not publicly available due to privacy and legal restrictions.

Acknowledgments

This research was supported by the Brain Korea 21 FOUR, Ministry of Science and ICT (MSIT) in Korea under the ITRC support program (IITP-2020-0-01749) supervised by the IITP, the National Research Foundation of Korea grant funded by the MSIT (NRF-2019R1A4A1024732), and the Ministry of Culture, Sports and Tourism and Korea Creative Content Agency (R2019020067).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Garibotto, G.; Masciangelo, S.; Bassino, P.; Ilic, M. Computer vision control of an intelligent forklift truck. In Proceedings of the Conference on Intelligent Transportation Systems, Boston, MA, USA, 12 November 1997; pp. 589–594. [Google Scholar]
  2. Seelinger, M.; Yoder, J.D. Automatic pallet engagement by a vision guided forklift. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, 18–22 April 2005; pp. 4068–4073. [Google Scholar]
  3. Chen, G.; Peng, R.; Wang, Z.; Zhao, W. Pallet recognition and localization method for vision guided forklift. In Proceedings of the 2012 8th International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, China, 21–23 September 2012; pp. 1–4. [Google Scholar]
  4. Varga, R.; Nedevschi, S. Robust Pallet Detection for Automated Logistics Operations. In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications—Volume 4: VISIGRAPP (4: VISAPP), Rome, Italy, 27–29 February 2016; pp. 470–477. [Google Scholar]
  5. Xiao, J.; Lu, H.; Zhang, L.; Zhang, J. Pallet recognition and localization using an RGB-D camera. Int. J. Adv. Robot. Syst. 2017, 14, 1729881417737799. [Google Scholar] [CrossRef]
  6. Syu, J.L.; Li, H.T.; Chiang, J.S.; Hsia, C.H.; Wu, P.H.; Hsieh, C.F.; Li, S.A. A computer vision assisted system for autonomous forklift vehicles in real factory environment. Multimed. Tools Appl. 2017, 76, 18387–18407. [Google Scholar] [CrossRef]
  7. Li, T.; Huang, B.; Li, C.; Huang, M. Application of convolution neural network object detection algorithm in logistics warehouse. J. Eng. 2019, 2019, 9053–9058. [Google Scholar] [CrossRef]
  8. Zaccaria, M.; Monica, R.; Aleotti, J. A Comparison of Deep Learning Models for Pallet Detection in Industrial Warehouses. In Proceedings of the 2020 IEEE 16th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 3–5 September 2020; pp. 417–422. [Google Scholar]
  9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  10. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  12. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  13. Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098. [Google Scholar]
  14. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021. [Google Scholar] [CrossRef]
  15. Bakker, B.; Heskes, T. Task clustering and gating for bayesian multitask learning. J. Mach. Learn. Res. 2003, 4, 83–99. [Google Scholar]
  16. Yang, Y.; Hospedales, T.M. Trace norm regularised deep multi-task learning. arXiv 2016, arXiv:1606.04038. [Google Scholar]
  17. Bilen, H.; Vedaldi, A. Integrated Perception with Recurrent Multi-Task Neural Networks. arXiv 2016, arXiv:1606.01735. [Google Scholar]
  18. Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003. [Google Scholar]
  19. Kokkinos, I. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6129–6138. [Google Scholar]
  20. Rudd, E.M.; Günther, M.; Boult, T.E. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 19–35. [Google Scholar]
  21. Huang, J.T.; Li, J.; Yu, D.; Deng, L.; Gong, Y. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7304–7308. [Google Scholar]
  22. Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 160–167. [Google Scholar]
  23. Dong, D.; Wu, H.; He, W.; Yu, D.; Wang, H. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 1723–1732. [Google Scholar]
  24. Standley, T.; Zamir, A.; Chen, D.; Guibas, L.; Malik, J.; Savarese, S. Which tasks should be learned together in multi-task learning? In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 13–18 July 2020; pp. 9120–9132. [Google Scholar]
  25. Baxter, J. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Mach. Learn. 1997, 28, 7–39. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed multi-task learning method for pallet recognition.
Figure 2. Overview of the single-task classification models for pallet recognition.
Figure 3. Definition of two classes: angle, distance.
Figure 4. (a,b) are correctly placed pallets in different workplaces. (c) is an example of a pallet moved 20 cm to the right. (d) is an example of a pallet moved 20 cm to the left and rotated three degrees.
Figure 5. Four types of images in training dataset.
Figure 6. Illustrations of training and validation losses for ResNet-50: (a,c,e) are Task 1’s loss plots for different training data; (b,d,f) are Task 2’s loss plots for different training data.
Figure 7. Illustrations of accuracy for ResNet-50: (a,c,e) are Task 1’s accuracy plots for different training data; (b,d,f) are Task 2’s accuracy plots for different training data.
Table 1. Comparison of the number of parameters and layers by deep learning model architectures.

Architecture | Number of Layers | CNN Parameters | Classifier Parameters | Total Parameters
AlexNet      | 8   | 2.470 M  | 54.580 M  | 57.050 M
VGG11        | 11  | 9.226 M  | 119.591 M | 128.817 M
ResNet-18    | 18  | 11.186 M | 0.005 M   | 11.191 M
ResNet-50    | 50  | 23.531 M | 0.022 M   | 23.553 M
ResNet-101   | 101 | 42.500 M | 0.022 M   | 42.522 M
Table 2. Comparison of the number of parameters between the multi-task model and two single models.

Architecture | Two Single Models | Multi-Task Model (Proposed) | Ratio (%)
AlexNet      | 114.08 M | 111.61 M | 97.83
VGG11        | 257.62 M | 248.39 M | 96.41
ResNet-18    | 22.37 M  | 11.19 M  | 50.02
ResNet-50    | 47.05 M  | 23.54 M  | 50.03
ResNet-101   | 85.04 M  | 42.54 M  | 50.02
Table 3. Comparison of the hyperparameter λ for the multi-task model (VGG11), in terms of exact match ratio. The mean exact match ratio and standard deviations (in parentheses) are reported; the best value is in bold.

λ                 | 0.1         | 0.2         | 0.3         | 0.4         | 0.5         | 0.6         | 0.7         | 0.8         | 0.9
Exact match ratio | 0.65 (0.08) | 0.68 (0.06) | 0.67 (0.06) | 0.70 (0.09) | 0.70 (0.07) | 0.79 (0.08) | 0.77 (0.07) | 0.72 (0.08) | 0.60 (0.09)
Table 4. The number of images used for the training, validation, and test datasets.

Training Dataset                    | Validation Dataset | Test Dataset
270 (Default)                       | 38 | 77
540 (Default + light on)            | 38 | 77
540 (Default + flipped)             | 38 | 77
1080 (Default + light on + flipped) | 38 | 77
Table 5. Comparisons of results using single-task models and multi-task models, in terms of exact match ratios. The best values for each architecture are in bold. The mean exact match ratio and standard deviations (in parentheses) are reported.

Method                | AlexNet     | VGG11       | ResNet-18   | ResNet-50   | ResNet-101
Single-task           | 0.58 (0.09) | 0.67 (0.08) | 0.75 (0.07) | 0.70 (0.08) | 0.68 (0.05)
Multi-task (proposed) | 0.71 (0.08) | 0.79 (0.08) | 0.81 (0.06) | 0.81 (0.07) | 0.77 (0.11)
Table 6. Comparisons of single-task models and multi-task models, in terms of task-specific accuracy. The best values for each architecture are in bold. The mean accuracy and standard deviation (in parentheses) are reported.

Method                | AlexNet Task 1 | AlexNet Task 2 | VGG11 Task 1 | VGG11 Task 2 | ResNet-18 Task 1 | ResNet-18 Task 2 | ResNet-50 Task 1 | ResNet-50 Task 2 | ResNet-101 Task 1 | ResNet-101 Task 2
Single-task           | 0.95 (0.02) | 0.61 (0.09) | 0.97 (0.02) | 0.69 (0.08) | 0.97 (0.03) | 0.78 (0.06) | 0.97 (0.02) | 0.72 (0.07) | 0.96 (0.02) | 0.71 (0.06)
Multi-task (proposed) | 0.93 (0.03) | 0.75 (0.08) | 0.95 (0.03) | 0.83 (0.06) | 0.96 (0.02) | 0.84 (0.06) | 0.96 (0.03) | 0.84 (0.06) | 0.97 (0.02) | 0.79 (0.11)
Table 7. Comparisons of results of the proposed multi-task model using the different training datasets, in terms of exact match ratio. The best values for each architecture are in bold and the best performing model among the architectures is underlined.

Training Dataset (Number of Samples) | AlexNet     | VGG11       | ResNet-18   | ResNet-50   | ResNet-101
Default (270)                        | 0.71 (0.08) | 0.79 (0.08) | 0.81 (0.06) | 0.81 (0.07) | 0.77 (0.11)
Default + light on (540)             | 0.93 (0.03) | 0.93 (0.04) | 0.94 (0.04) | 0.93 (0.04) | 0.94 (0.03)
Default + flipped (540)              | 0.72 (0.09) | 0.75 (0.06) | 0.86 (0.06) | 0.81 (0.06) | 0.81 (0.07)
Default + light on + flipped (1080)  | 0.94 (0.03) | 0.93 (0.03) | 0.95 (0.03) | 0.95 (0.03) | 0.96 (0.03)
Table 8. Comparisons of results using the different training datasets, in terms of mean absolute error (Task 1 in degrees, Task 2 in cm). The best values for each architecture are in bold and the best performing model among the architectures is underlined.

Training Dataset (Number of Samples) | Method      | ResNet-18 Task 1 | ResNet-18 Task 2 | ResNet-50 Task 1 | ResNet-50 Task 2 | ResNet-101 Task 1 | ResNet-101 Task 2
Default (270)                        | Single-task | 0.13 (0.23) | 1.50 (1.04) | 0.09 (0.12) | 2.61 (1.98) | 0.15 (0.23) | 2.44 (1.45)
Default (270)                        | Multi-task  | 0.10 (0.11) | 0.52 (0.24) | 0.12 (0.14) | 0.50 (0.39) | 0.06 (0.07) | 0.66 (0.51)
Default + light on (540)             | Single-task | 0.04 (0.09) | 1.37 (1.00) | 0.12 (0.16) | 2.28 (1.93) | 0.20 (0.43) | 2.32 (1.87)
Default + light on (540)             | Multi-task  | 0.04 (0.09) | 0.11 (0.14) | 0.08 (0.11) | 0.17 (0.21) | 0.02 (0.05) | 0.10 (0.17)
Default + flipped (540)              | Single-task | 0.38 (0.38) | 4.64 (3.43) | 0.40 (0.31) | 5.22 (1.57) | 0.51 (0.27) | 6.40 (2.97)
Default + flipped (540)              | Multi-task  | 0.37 (0.21) | 0.92 (0.40) | 0.50 (0.28) | 1.62 (0.63) | 0.45 (0.24) | 1.87 (1.10)
Default + light on + flipped (1080)  | Single-task | 0.05 (0.13) | 0.59 (0.65) | 0.02 (0.06) | 1.10 (0.81) | 0.03 (0.06) | 1.31 (1.09)
Default + light on + flipped (1080)  | Multi-task  | 0.01 (0.03) | 0.09 (0.09) | 0.02 (0.04) | 0.07 (0.10) | 0.01 (0.03) | 0.09 (0.13)
Table 9. Comparisons of results using the different methods of multi-task learning with three ResNet architectures, in terms of exact match ratio. All methods were trained on the Default + light on + flipped dataset (1080 samples). The best values for each architecture are in bold.

Method                            | ResNet-18   | ResNet-50   | ResNet-101
Single-task                       | 0.88 (0.05) | 0.80 (0.04) | 0.81 (0.06)
Soft parameter sharing            | 0.88 (0.06) | 0.85 (0.04) | 0.85 (0.04)
Hard parameter sharing (proposed) | 0.95 (0.03) | 0.95 (0.03) | 0.96 (0.03)
Table 10. Comparisons of results using the different methods of multi-task learning with three ResNet architectures, in terms of task-specific accuracy. All methods were trained on the Default + light on + flipped dataset (1080 samples). The best values for each architecture are in bold.

Method                            | ResNet-18 Task 1 | ResNet-18 Task 2 | ResNet-50 Task 1 | ResNet-50 Task 2 | ResNet-101 Task 1 | ResNet-101 Task 2
Single-task                       | 0.99 (0.02) | 0.88 (0.05) | 0.98 (0.01) | 0.81 (0.04) | 0.99 (0.01) | 0.82 (0.06)
Soft parameter sharing            | 0.99 (0.01) | 0.89 (0.05) | 0.98 (0.02) | 0.87 (0.04) | 0.99 (0.01) | 0.87 (0.04)
Hard parameter sharing (proposed) | 0.99 (0.01) | 0.95 (0.02) | 0.98 (0.01) | 0.95 (0.03) | 0.99 (0.01) | 0.96 (0.03)
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
