1. Introduction
In vision research, the recognition of target objects using Artificial Intelligence (AI) is a highly active area. In general, recognizing an object captured by an image sensor or video camera is the task of processing semantic information and detecting the distinguishing features of the target object within the image.
Traditional gradient-based object recognition methods distinguish target objects by detecting characteristic changes in the image from local information, such as the target image’s brightness, color, and texture [
1]. Previous research has explored the use of such characteristic changes, including edge detection [
2], blob detection [
3], and corner detection [
4], to improve image processing methods for object recognition. Recently, various artificial intelligence techniques based on Convolutional Neural Networks (CNNs) have been applied to recognize objects automatically in digital image processing. High-performance detection models have been implemented in various forms, from model structures such as R-CNN [
5], Fast R-CNN [
6], Faster R-CNN [
7], and RetinaNet [
8], to single-shot algorithms such as SSD [
9] and YOLO [
10].
In general, an object detection model uses a detection algorithm to determine the recognition area (object-box) that contains the target object within the detection area, and then classifies the target object inside that area. In the AI recognition process, candidate objects are separated from the background image, the location (x, y) and width and height (w, h) of each recognition area are compared with the features of the pre-trained object, and the target object is determined. When the recognition process is complete, the target object’s location (x, y) and width and height (w, h) values are retained as the feature information for recognizing a person, as shown in
Figure 1.
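As a minimal illustration of this (x, y, w, h) representation (the names below are our own and are not tied to any specific detection library), a recognition result can be modeled as a box carrying class and confidence fields:

from dataclasses import dataclass

@dataclass
class Detection:
    """One recognition area (object-box) with its classified target."""
    x: float            # top-left x of the object-box
    y: float            # top-left y of the object-box
    w: float            # box width
    h: float            # box height
    label: str          # classified target object, e.g., "person"
    confidence: float   # how closely the box matches learned features

def is_person(det: Detection, threshold: float = 0.5) -> bool:
    """Accept a detection as a person if its class and score qualify."""
    return det.label == "person" and det.confidence >= threshold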
The general AI recognition model classifies a single object as a person, as shown in
Figure 2a, or classifies an area where two or more objects overlap as a single person, as shown in
Figure 2b.
Objects can overlap or blur, especially in real-time images of people crowded close together. This is a major cause of recognition errors, because it corrupts the feature information of each object and makes accurate classification difficult. In addition, irrelevant information, such as lines around objects, surrounding items, brightness, and shadows, acts as a negative factor in distinguishing objects.
This study uses real-time player images from football videos as target objects. Within these images, groups of overlapping people are separately classified as recognizable targets. On this basis, we study structures and methods that enhance the performance of AI models in recognizing individuals within groups of similar people. In image sensing applications using AI models, detection performance can then be quantified by examining the recognition errors that occur while classifying each person in the recognition area individually.
When the target object is a person, numerous factors can cause recognition errors. A person in the recognition area has a roughly fixed aspect ratio of about 1.5, but the scale changes substantially with perspective. Moreover, the camera’s shooting angle and a person’s behavior change the features of the person object. Therefore, various object recognition methods for distinguishing object information from surrounding information have been studied.
When players overlap, the target objects can be distinguished by recognizing each player’s uniform number in images captured from multiple camera viewpoints [
11]. However, for classifying a specific target object in the overlapping area, recognition by changing the camera angle is not appropriate, as shown in
Figure 3. This is because the higher the similarity of the feature information of overlapping objects, the more often a target object cannot be recognized individually.
As a supplementary method, depth information was added to the object feature information (RGB data) collected by multiple camera viewpoints, using cameras such as Kinect or stereo settings [
12]. By implementing a lightweight single-pass convolutional neural network architecture with fused information sources, detection accuracy and location tracking performance improved compared with single-view camera images. In addition, feature extraction methods utilizing body-worn inertial measurement units (IMUs) [
13] and LIDAR sensors [
14] have frequently been studied. However, the methods mentioned so far are not suitable for real-time object detection environments, such as football games, due to the limitations of the moving speed, distance measurement range, and lighting environment. In addition, the zoom-in-out range manipulation of the camera on the football image changes the size of the target object and the recognition accuracy together. For this reason, when an AI detection model is trained with limited feature information, such as distant objects, as shown in
Figure 4a, or small objects, as shown in
Figure 4b, various object recognition errors occur.
In real-time football images, correctly recognizing each player as an individual is a valuable problem. However, recognition errors frequently occur when classifying a single person in a crowded area. In a typical error case, the same identification is assigned to a similar player as the frame changes, as shown in
Figure 5.
When target objects with similar characteristics move in front of and behind one another in the two-dimensional image space, an AI object recognition model suffers a high rate of misrecognition and non-recognition errors during real-time recognition. To improve this, we implemented a multi-class object recognition model with HSV color space conversion and compared its recognition performance with that of general AI models. A target object with a shape similar to others in the recognition area, or overlapping with them, is the main source of misrecognition and non-recognition. Therefore, by devising an HSV module and applying it to the processing structure of the general AI recognition model, we reduced the misrecognition and non-recognition of objects with similar shapes. Characteristics of groups of similar objects were then added to the HSV model as unique training classes. In this paper, the final multi-class AI model reduces the recognition errors caused by rapid changes and overlaps of similar objects.
2. Methods
2.1. Preparation of the Training Dataset
In general, image preprocessing methods are used to prepare training data that improves the learning of AI models. Image data acquired in a limited time is insufficient for model training, which increases the cost function value and reduces predictive performance. Image preprocessing methods, such as image standardization and clarification of recognition results, are used in detection models running on general-purpose, low-performance hardware to overcome the environmental limitations of image acquisition.
In this research, we apply geometric transformation methods to a limited set of images to extract the unique feature information of objects and use it as new training data for the AI models. Geometric image transformation includes simple data augmentation methods such as flipping, cropping, rotation, translation, color space conversion, and noise injection. According to
Table 1, image cropping is the most accurate of these image manipulation methods.
As shown in the evaluation results [
15], reported in terms of Top-1 and Top-5 accuracy, cropping significantly improves the performance of CNN tasks. Accuracy is also called Top-1 accuracy to distinguish it from Top-5 accuracy, both common in Convolutional Neural Network evaluation [
16].
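As a sketch of the cropping operation discussed above (a minimal implementation under our own assumptions, not the tooling used in the cited evaluation):

import numpy as np

def random_crop(image: np.ndarray, crop_h: int, crop_w: int) -> np.ndarray:
    """Randomly crop a (H, W, C) image to (crop_h, crop_w, C).

    Cropping exposes the model to shifted sub-views of the same object,
    which is the augmentation effect evaluated in Table 1.
    """
    h, w = image.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("crop size exceeds image size")
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]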
We selected an image cropping tool as the data preprocessing method to prepare an efficient training dataset. Yolo Mark [
17] is a bounding-box annotation tool that crops objects from images to extract object feature information efficiently. The experimental datasets were processed on a GEFORCE RTX 3060 D6 12G GPU at 1280 × 720 resolution, using video of a Korean K3 football game. We also set the COCO [
18] benchmark of 55.3% mean average precision (mAP50) at 30 FPS as the baseline for comparing object recognition errors across the implemented AI models.
For the proposed AI models, 10% of the 3482 football images were randomly selected as training data, and the remaining 90% were used as test data. We labeled the training data with four classes (A, B, C, D) in Yolo Mark. In this data segmentation process, in addition to the players (A, B) and the referee (C), detected based on uniform color, overlapped objects were marked as a new class (D), and unlabeled objects were re-marked.
As shown in
Figure 6a, the feature data is extracted by marking a bounding box according to the color of each player’s uniform. Then, various overlapped objects are selected, as shown in
Figure 6b, and the resulting reinforced training data is assigned to the new class.
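Yolo Mark stores each marked bounding box in the standard YOLO annotation format: one line per object, giving the class index followed by the box center and size normalized to the image dimensions. A hypothetical label file for a frame containing a class-A player (index 0), a class-B player (index 1), and one overlapped group (class D, index 3), with purely illustrative values, would read:

0 0.412 0.563 0.031 0.108
1 0.650 0.481 0.029 0.102
3 0.527 0.540 0.060 0.115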
2.2. Modification and Implementation of AI Models
In a football game, players act both as individual performers and as tactical team members, and referees act as game operators. As they play their parts, various object detection errors occur, and resolving these errors became the topic of this study. In addition, it can be seen from
Table 2 that a detection model based on the YOLO algorithm, which has the highest response speed and accuracy, is suitable for recognizing objects in real time, considering the frequent changes in player movement.
According to the Yolov3 tech report [
19], the Yolov3-320, 416, and 608 models are fast and accurate compared with other detection models. The three types of Yolov3 detection models have different performance characteristics depending on the target application environment. We selected the Yolov3-416 model as the best-performing model for this study, because speed, accuracy, and the size of the target image are the selection criteria in real-time object recognition, such as for football games. The Yolov4 and Yolov5 models were released without significant changes to their algorithms and structure; their performance differences depend on the GPU computing resources available at the time of release. In this study, we therefore focused on improving object recognition by revising the model structure and method under limited hardware resources, rather than applying the newest released AI model.
Among the various versions of YOLO-based detection models, the Yolov3-416 model structure is shown in
Figure 7.
The YOLO detection model aggregates pixels in the convolution layers to form object-specific features and makes predictions based on the loss function output at the end of the network. We changed this to detect only the person class among the 80 object classes. The general AI model’s architecture therefore consists of an algorithm that recognizes players and referees as persons.
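A minimal sketch of this single-class filtering, assuming an OpenCV DNN deployment of Yolov3-416 with the standard COCO class ordering (index 0 = person); the file names are placeholders:

import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_layers = net.getUnconnectedOutLayersNames()

def detect_persons(frame, conf_threshold=0.5):
    # Resize to the 416 x 416 network input and scale pixels to [0, 1].
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    h, w = frame.shape[:2]
    boxes = []
    for output in net.forward(out_layers):
        for det in output:
            scores = det[5:]                  # 80 COCO class scores
            class_id = int(np.argmax(scores))
            # Keep only person boxes whose combined score qualifies.
            if class_id == 0 and det[4] * scores[class_id] > conf_threshold:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append((int(cx - bw / 2), int(cy - bh / 2),
                              int(bw), int(bh)))
    return boxes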
2.2.1. Structural Modification, Yolov3-HSV Model
In an RGB image, object information is represented by three unique color values: red, green, and blue. To detect a specific object in the image, all the color values of R (0~255), G (0~255), and B (0~255) must be considered. In contrast, an HSV image represents information based on human color perception with three properties: Hue, Saturation, and Value [
21]. The range of the information for classifying the uniqueness of an object in an HSV image is H (0~360), S (0~1), and V (0~1). This color space conversion makes colors easier to classify than in RGB images, improving object recognition accuracy.
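As a sketch of this conversion step (assuming OpenCV, which rescales the nominal ranges for 8-bit images; the file name is a placeholder):

import cv2

frame_bgr = cv2.imread("frame.png")  # OpenCV loads images in BGR order
frame_hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
# For 8-bit images, OpenCV stores H (0~360) as 0~179 and
# S, V (0~1) as 0~255, so the nominal ranges above are rescaled.
h, s, v = cv2.split(frame_hsv)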
The Yolov3-HSV model recognizes players with HSV color information by masking three color types of uniforms [
22]. It is a similar-object recognition model obtained by modifying the Yolov3-416 model’s structure, as shown in
Figure 8.
We made object information within the image easy to distinguish through color mask processing, which limits the range of specific colors, as shown in
Figure 9.
By specifying the minimum and maximum color ranges of the target objects within the image only once, we checked whether each player’s H, S, and V color values fell within the range. The corresponding mask matrix element is set to 1 if the value is in the range and 0 otherwise. Through this process, three color mask matrices were created and applied as masks to the football images.
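A minimal sketch of this masking step, again assuming OpenCV’s 8-bit HSV scaling; the (min, max) bounds are illustrative assumptions, not the calibrated ranges used in the experiments:

import cv2

def uniform_masks(frame_hsv):
    # cv2.inRange marks in-range pixels as 255 (logical 1), others as 0.
    # Red wraps around the hue axis, so two hue ranges are combined.
    red = (cv2.inRange(frame_hsv, (0, 120, 70), (10, 255, 255))
           | cv2.inRange(frame_hsv, (170, 120, 70), (179, 255, 255)))
    blue = cv2.inRange(frame_hsv, (100, 120, 70), (130, 255, 255))
    # White uniforms: any hue, low saturation, high value.
    white = cv2.inRange(frame_hsv, (0, 0, 180), (179, 40, 255))
    return red, blue, white

# Each mask keeps only the pixels whose H, S, V fall in its range, e.g.:
# masked = cv2.bitwise_and(frame_bgr, frame_bgr, mask=red)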
Through the color space conversion from RGB to HSV, the players’ color information accurately represents each pixel in the image as color and intensity, as shown in
Figure 10b. The players were then divided into three classes based on uniform color. The object color information was extracted by filtering the players with red, blue, and white masks. As a result, they were classified into three colors (red: Class A, blue: Class B, white: Class C), as shown in
Figure 10c.
2.2.2. Class Augmentation, Yolov3-Augment Model
In the overlap area, various changes in the recognition and detection situation, such as front-rear relationships, the number of objects, and color contrast, occur according to the players’ movement. Consequently, AI model learning is limited in recognizing and classifying overlapping objects using only the person object, as shown in
Figure 11a. Therefore, setting the overlap area as a new single object reduced the uncertainty of object detection by grouping its numerous variables and subdividing them into an additional recognition area.
In the object class augmentation model shown in
Figure 11b, we added a recognition object class to the Yolov3-HSV model by classifying the overlapping areas of players as class D. As a result, the Yolov3-Augment model improves recognition performance among similar objects in various detection situations by supplementing the objects’ feature information through recognition class augmentation.
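In a Darknet-style training setup, such class augmentation reduces to listing the four classes and adjusting the YOLO heads accordingly (a sketch; the file and class names are placeholders):

obj.names (classes A, B, C, D, one per line):
    playerA
    playerB
    referee
    overlap

obj.data:
    classes = 4
    train = data/train.txt
    valid = data/test.txt
    names = obj.names

In addition, each [yolo] layer in the network configuration is set to classes=4, and the [convolutional] layer immediately before it to filters = (4 + 5) × 3 = 27.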
The object recognition procedure of the proposed AI model is shown in
Figure 12. The AI models evaluated the recognition results while classifying objects (person, player, and overlapped players) using their own training weights with different average losses, as shown in
Figure 13. Finally, we compared the error reduction performance of the Yolov3-Augment model, which includes error-prone similar objects as recognition categories, with that of the general AI recognition model. We evaluated the object recognition performance of the Yolov3-416, Yolov3-HSV, and Yolov3-Augment models on the same real-time football images.
2.3. Error Criteria and Evaluation Items
There is a generalized measurement methodology for evaluating recognition performance according to the class classification method and the class types that constitute a recognition model [
23]. However, we do not evaluate the generalized recognition accuracy of the classes themselves, nor does this study include a classification method according to the type of recognition algorithm. The reason is that the three AI models share the same recognition algorithm but differ in their procedures and structures for object recognition; therefore, the characteristics of the errors that occur are what matter.
In this study, we evaluate how many recognition errors of each kind occur, under the same conditions, for the three types of AI recognition models trained on object classes of similar shape according to the defined classification method.
In the problem of statistical classification, the error matrix [
24] is a classification table layout for evaluating the performance of an object recognition AI model. Unit-object recognition is divided into two stages, and the error stage can be classified as shown in
Table 3, which subdivides each error category into YES or NO according to the clarity of object recognition and classification.
We divide object recognition errors into False Positives, which are incorrectly recognized, and False Negatives, which are not recognized, within the classification category. We did not define the True Positive and True Negative categories as recognition errors, because a True Positive is an object that is correctly recognized, and a True Negative is a non-object that is correctly not recognized. The experiment includes all errors that occur in the process of recognizing objects (predicted class) and classifying unit objects (actual class) within the object recognition area (object-box).
False Positive errors are recognition errors in which the object detection model predicts the actual object as another object. They are the result of misrecognition, where overlap areas or long distances cause target objects to be predicted differently or redundantly. False Negative errors are recognition errors in which the object detection model fails to detect an object and therefore makes no prediction. They are the result of non-recognition, where target objects are not predicted in areas where object overlap or separation has occurred, or at long distances.
The performance of the object detection models was evaluated with the Precision function (1), related to object misrecognition, and the Recall function (2), related to object non-recognition. They were then evaluated comprehensively with the F1 Score function (3), the Accuracy function (4), the Error Rate function (5), and the Specificity function (6).
These are the model evaluation functions. Precision indicates how accurate the predicted class is:
Precision = TP / (TP + FP). (1)
Recall indicates how well the actual class was predicted:
Recall = TP / (TP + FN). (2)
The F1 Score is the harmonic mean of Precision and Recall:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall). (3)
Accuracy is the probability that the predicted class is correct over all data:
Accuracy = (TP + TN) / (TP + TN + FP + FN). (4)
The Error Rate is the probability that the predicted class is incorrect over all data:
Error Rate = (FP + FN) / (TP + TN + FP + FN). (5)
Specificity is also known as the True Negative Rate (TNR):
Specificity = TN / (TN + FP). (6)
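For reference, functions (1)–(6) can be computed directly from the error-matrix counts (a straightforward sketch; zero-division guards are omitted for brevity):

def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute evaluation functions (1)-(6) from error-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)                           # (1)
    recall = tp / (tp + fn)                              # (2)
    f1 = 2 * precision * recall / (precision + recall)   # (3)
    accuracy = (tp + tn) / total                         # (4)
    error_rate = (fp + fn) / total                       # (5)
    specificity = tn / (tn + fp)                         # (6)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "error_rate": error_rate,
            "specificity": specificity}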
4. Conclusions
In this study, the detection criteria were supplemented so that the main errors caused by a lack of unique features during AI-based image processing and object recognition were included in the recognition target. First, we detected target objects through structural modification of the general AI recognition model’s image processing: the model converts the RGB image into the HSV color space, extracts object features from this more accurate information, and then filters the image with a color mask. Second, we enhanced the training dataset using an object image cropper, which allowed overlapped objects to be augmented as a new class, differentiating the model from the general AI recognition model.
As a result, cases that the general AI model non-recognized or misrecognized became detection targets, because we restricted specific objects to the classification, detection, and recognition areas. Classifying overlapped objects as class D also became a strategic basis for handling changes in time and space as the target objects move, and subdivided the research area so that similar objects can be re-recognized. We therefore confirmed that an AI recognition model with structural modification and object class augmentation effectively reduces object recognition errors. In future work, we will propose a method and algorithm for tracking objects individually in areas where they overlap, further improving the effectiveness of this study.
Recognizing a player as an individual, and players as a team, in a football game is an important monitoring task for analyzing player performance. Once players are recognized without error, the proposed AI model can be extended to track players’ movement changes, analyze activity, and perform automatic statistical analysis.
In the future, in football and other field sports, the training data augmentation methods designed to reduce recognition errors by improving the uniqueness of similar objects, together with the proposed artificial intelligence models, could be used to analyze player activity and assist referees’ judgment. We also expect to extend the scope of this study to detecting a variety of target objects with minimal loss in real time (e.g., monitoring and data acquisition for traffic, animal activity, and the environment).