Article

Integrated AI System for Real-Time Sports Broadcasting: Player Behavior, Game Event Recognition, and Generative AI Commentary in Basketball Games

Department of AI and Software, Gachon University, Seongnam-si 13120, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1543; https://doi.org/10.3390/app15031543
Submission received: 20 December 2024 / Revised: 24 January 2025 / Accepted: 28 January 2025 / Published: 3 February 2025
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)

Abstract

This study presents an AI-based sports broadcasting system capable of real-time game analysis and automated commentary. The model first acquires essential background knowledge, including the court layout, game rules, team information, and player details. YOLO-based segmentation is applied to the local camera view to enhance court recognition accuracy. Player detection and ball tracking are performed with YOLO models: in each frame, the YOLO detection model detects player bounding boxes, and our proposed tracking algorithm computes the IoU against boxes from previous frames and links the best matches to trace player movement paths. Player behavior is recognized with the R(2+1)D action recognition model, covering actions such as running, dribbling, shooting, and blocking. The system demonstrates high performance, achieving an average accuracy of 97% in court calibration, 92.5% in player and object detection, and 85.04% in action recognition. Key game events are identified from positional and action data, with broadcast lines generated using GPT APIs and converted to natural audio commentary via Text-to-Speech (TTS). This system offers a comprehensive framework for automating sports broadcasting with advanced AI techniques.

1. Introduction

Recently, there has been considerable research on broadcasting sports games, in particular on using deep learning to track players' movements in real time and combining the results with TTS (Text-To-Speech) to produce AI-based broadcasts. Applications such as Naver Sports provide users with text-based information about events that occur during a game, and Fotmob lets users follow the score in real time. NCSOFT introduced an application called PAIGE, which combines baseball games with TTS so that users can easily access realistic, emotionally expressive broadcasts of the game [1].
Moreover, professional sports clubs use video analysis techniques to refine their strategies and analyze opponents, leading to better performance. Over the last few years, coaches and spectators have shown great interest in game video analysis, which plays a major role in analyzing and developing strategies. Well-funded clubs and leagues can deploy a large number of cameras to gather data during games, but leagues and clubs with less capital have difficulty collecting such data. It is therefore necessary to develop techniques that allow them to recognize, track, and analyze strategy from video at low cost.
To broadcast or analyze a sports game video with AI, the system must recognize the court, players, referees, the ball (if present), and their actions, and track them in real time. Park et al. proposed a method to detect highlights in e-sports videos using multimodal LSTM training on highlight videos based on audio and chat data [2]. The study used videos from the streaming site Twitch, chat logs and audio data, and additional highlight videos from YouTube and gaming companies to train a model that detects highlight scenes. The resulting model achieved 90.68% accuracy in detecting highlight scenes with an F1 score of 0.722.
Kang and Lee proposed a model to detect highlight scenes in e-sports videos using total-score and object-score information displayed on screen [3]. Using data provided by gaming companies, they analyzed the footage with OpenCV and a CNN and obtained an overall accuracy of 89.9%. Fu et al. proposed a model that detects highlight scenes from video frames and chat logs using a CNN, achieving an accuracy of 74.7% [4].
Furthermore, these individual techniques must be integrated into a single system. This study recognizes the court, player actions, player tracks, and game events, and generates corresponding sentences with generative AI so that an AI can broadcast a sports game. To extract features from each frame of a video and match them for panoramic image stitching, the SIFT (Scale-Invariant Feature Transform) algorithm is applied; the homography is estimated from the panoramic images and combined with the multi-view camera footage in the local view [5]. For court recognition, segmentation based on the YOLO model is applied. Player action and ball detection are performed with YOLO algorithms. YOLO integrates techniques such as dynamic anchor assignment and adaptive task prioritization to improve its architecture, enhance accuracy, and achieve faster inference speeds [6,7]. For object tracking, we propose the following algorithm: in each frame, the YOLO detection model identifies player bounding boxes; the Intersection over Union (IoU) is then calculated between the bounding boxes of the current frame and those of the previous five frames, and the boxes with the highest IoU are linked to track player movement paths. Because multiple teams appear in a game, team colors, the number of players and referees, and the court layout are stored and used as domain knowledge when recognizing players. Finally, based on the information extracted from the video, the ChatGPT-4.0 Turbo API is used to generate a sports broadcast.
The main contributions of this study lie in addressing the gaps in sports analytics by advancing beyond mere object recognition to encompass tracking and the extraction of key events for interpreting and generating commentary for sports matches. While previous YOLO-based models have significantly improved the accuracy of ball and human detection, they have not been applied to analyze matches or generate broadcast commentary. This study proposes a method to optimize sports commentary generation by fine-tuning existing models, demonstrating the feasibility of achieving high-quality results. Additionally, recognizing the unique challenges in the sports domain—such as uncontrolled environments, complex backgrounds, inconsistent lighting, and fast-moving players—this study incorporates domain knowledge specific to basketball to distinguish between teams, players, and referees, while also recognizing in-bounds and out-of-bounds events. By implementing an algorithm tailored to these challenges, including the detection of blurry ball pixels and the differentiation of participants, this study effectively bridges the gap between object recognition and comprehensive sports game analysis, enabling robust commentary generation. Finally, through this study, it is expected that by recognizing players’ actions and play information from video input alone and automatically generating sports broadcasts, the method can be applied to less popular sports that are not currently broadcast.

2. Related Works

To broadcast a sports game based on videos and AI, methods for recognizing the court and players, the referees, and the ball, along with analyzing their movements and tracking them, are required.

2.1. Playground Detection

Pei-Chih Wen et al. proposed a basketball broadcasting system based on tracking player trajectories [8]. They built an application for broadcasting basketball games that tracks players using panoramic image stitching of the video and Kanade–Lucas–Tomasi (KLT) features to estimate the video homography, achieving an overall IoU detection accuracy of 80.5%. Sha, Hobbs et al. proposed an end-to-end model that segments the court and estimates the camera pose to predict the homography, reporting an accuracy of 83.2% [9]. In parallel, models such as YOLO have been continuously studied for tracking players and for detecting players and objects separately.
Jonas Theiner et al. introduced a study on registering a football field [10]. Rather than relying on key-point features of the video, it applies calibration by using DeepLabV3 and ResNet to segment points, lines, and circles and iteratively minimizes a segment reprojection loss to find the optimal camera parameters and lens distortion coefficients.
In previous studies on recognizing sports arenas, algorithms were applied to accurately determine locations by generating panoramic images from camera footage collected at various angles and calculating the homography for each section. However, homography-based methods require sufficient information about the entire view during panoramic image generation. When such information is lacking, significant errors occur in recognizing straight court areas. Additionally, when spectators are present within the court area, it becomes challenging to accurately identify the boundaries. This necessitates either manually removing the spectators or applying straight lines to refine the boundary recognition, which poses additional challenges. In this study, we adopted a global view approach to recognize the entire arena and subsequently applied YOLO segmentation to enable detailed recognition of local views. This approach minimizes errors introduced by panoramic image generation.

2.2. Athletic Recognition and Tracking

Human action recognition has been studied in two main streams: one uses sensors and the other uses video. Ayokunle Olalekan Ige et al. surveyed studies on human activity recognition based on wearable sensors [11], noting that data from environmental sensors are comparatively cheap and accurate, and suggesting low-cost action recognition approaches such as the use of smartphone sensors. Kasteren et al. used data from environmental sensors to build an activity recognition system and proposed a model with approximately 95.6% accuracy [12].
Chen et al. and Gao et al. each studied the integration of wearable sensor and IoT data with environmental sensor data to build hybrid sensing for action recognition. Chen's paper introduced a dataset containing depth camera data, which makes multimodal sensing easier to use [13]; it studied action recognition based on hybrid sensors and expanded the dataset's usability by including RGB videos, skeleton images, and depth videos. Gao et al. proposed DanHAR, a dual-attention method that applies channel and temporal attention on a residual network [14]. DanHAR improves feature expression for sensor-based HAR: channel attention and temporal attention focus on the target object and the target activity, respectively. Experiments were conducted on open HAR datasets and a weakly labeled HAR dataset. Compared to a standard ConvNet on the WISDM, UNIMIB-SHAR, PAMAP2, OPPORTUNITY, and weakly supervised HAR datasets, DanHAR improved performance by 2.02%, 4.20%, 1.95%, 5.22%, and 5.00%, respectively.
Muhammad Attique Khan et al. developed a method for human activity recognition that fuses features based on the Histogram of Oriented Gradients with deep features [15]. The method has two stages: the first applies a saliency-based method to extract motion and geometric features, and the second computes the Chi-square distance between the extracted features and threshold-based minimum-distance features, integrating them with deep CNN and hand-crafted features to obtain the final vector. The approach was validated on the Weizmann, UCF11 (YouTube), UCF Sports, IXMAS, and UT-Interaction datasets, each labeled with specific actions. On the IXMAS dataset, which includes 14 action labels such as watch and walk, the model achieved 98.6% accuracy. On the UT-Interaction dataset, which covers complex human interactions in continuous video, it achieved 99.8%. On the UCF11 dataset, which contains sports-related actions collected from YouTube, it achieved 99.3%, and on the Weizmann dataset, whose action labels include crouch, run, and walk, it achieved 99.2% accuracy.
ZN Khan et al. proposed an attention-induced multi-head convolutional neural network to form a lightweight deep learning network that can exploit unlabeled training samples [16]. The framework contains three lightweight convolution heads, each designed as a one-dimensional CNN to extract features from sensory data. To enhance the CNN, the lightweight head model uses attention to automatically select important features while discarding unimportant ones. The model was evaluated on two open benchmark datasets, WISDM and UCI HAR, both collected from users' smartphones: it achieved an F1 score of 0.972 on WISDM and 95.3% accuracy on UCI HAR, which focuses on six specific actions.
However, sports videos require the detection and tracking not only of players but also of referees and the ball. Recently, models such as YOLO have been used to distinguish humans and objects. You Only Look Once (YOLO), first proposed by Redmon in 2016, is a real-time object detection framework that processes an entire image in a single neural network pass to predict bounding boxes and class probabilities in one evaluation. By framing object detection as a global optimization problem, YOLO minimizes background errors and combines feature extraction, bounding box prediction, and classification into a unified pipeline. This makes it highly suitable for real-time applications in various domains, and it has now been released up to version 11. In this work, we use YOLO, which integrates techniques such as dynamic anchor assignment and adaptive task prioritization to improve its architecture, enhance accuracy, and achieve faster inference speeds.
Matija Burić et al. conducted a study on detecting players and the ball in handball games using the YOLOv2 model [17]. The ball occupies only a few pixels in an image, which makes it difficult to detect, and its apparent shape can change due to motion blur. Detecting handball players is also challenging because they often occlude one another in front of the camera, and it is important to identify who is currently carrying the ball. The study used indoor and outdoor handball gameplay data together with the COCO dataset to prevent overfitting and improve generalization. The model focused on detecting the ball and players rather than the whole scene and outperformed other models by margins of 0.94% to 7.34%.
Soroush Babaee Khobdeh et al. proposed a method to detect and track basketball players by combining YOLO with a deep fuzzy LSTM network [18]. Detecting players in a basketball game is challenging because of constraints such as calibration across multiple cameras, occlusion in on-court and off-court views, and inconsistent lighting. The study applied YOLO to locate players in each frame and used an LSTM with fuzzy logic to classify each player's actions, with the fuzzy logic helping the LSTM overcome its limitations and yield more precise predictions. The proposed model was validated on the SpaceJam and Basketball-51 datasets.
Even though YOLO models have brought major advances in ball tracking, challenging issues such as occlusion and 3D tracking remain. In the sports domain in particular, the environment is uncontrolled and domain knowledge is required for a model to interpret human activities during a game. Complex backgrounds, inconsistent lighting, players moving at a fast tempo, and the need to distinguish players from referees based on prior team information all make this task difficult. Moreover, detecting the ball in sports footage is necessary but challenging because the ball often appears as only a few blurry pixels in a frame.
To tackle these challenges, our study focuses on developing a robust algorithm for detecting and tracking players, referees, and the ball in sports game videos from an uncontrolled dataset. In particular, our study proposes a method to optimize sports commentary generation by fine-tuning existing YOLO models, demonstrating the feasibility of achieving high-quality results. For object tracking, we propose the following algorithm: in each frame, the YOLO detection model identifies player bounding boxes; the Intersection over Union (IoU) is then calculated between the bounding boxes of the current frame and those of the previous five frames, and the boxes with the highest IoU are linked to track player movement paths. Additionally, this study incorporates domain knowledge specific to basketball to distinguish between teams, players, and referees while also recognizing in-bounds and out-of-bounds events. By implementing an algorithm tailored to these challenges, including the detection of blurry ball pixels and the differentiation of participants, this study effectively bridges the gap between object recognition and comprehensive sports game analysis, enabling robust commentary generation.

3. Proposed Method

3.1. Overall Architecture

As shown in Figure 1, our study focuses on developing AI-based sports broadcasting. First, the model obtains the background knowledge required for the game: the layout of the court, type of game, number of players, number of referees, player numbers, team names, team colors, and rules of the game. Second, the model derives a homography from real-time video by tracking relative positions on the court. It extracts features from frames and applies the SIFT (Scale-Invariant Feature Transform) algorithm to match frames during panoramic image stitching. SIFT is an algorithm for detecting and describing significant local features in an image, providing feature extraction that is invariant to scale and rotation; it identifies key points and generates local descriptors around each one, enabling tasks such as object matching, image stitching, and object recognition [5]. The main broadcast camera is mounted at a fixed position and rotates in place during the game. Our model transforms the court view into a bird's-eye view based on the camera calibration result: using OpenCV, it detects the court by stitching parts of the video frames into panoramic images, compares the result with the actual court dimensions to measure the error, and computes the homography to the actual court size to produce bird's-eye view images. To generate the bird's-eye view image, the left and right videos covering the entire area were used, allowing the panorama to be created from a minimal number of images. Third, the model detects and tracks players with the YOLO model. The YOLOv8-cls model is fine-tuned to extract the court domain and player positions, allowing detection of the players and the ball in real time. For object tracking, we propose the following algorithm to optimize system performance: in each frame, the YOLO detection model identifies player bounding boxes; the Intersection over Union (IoU) is then calculated between the bounding boxes of the current frame and those of the previous five frames, and the boxes with the highest IoU are linked to track player movement paths. Each player's actions, specifically stop, run, dribble, shoot, and block, are detected with the R(2+1)D action recognition model. R(2+1)D is a deep learning model developed for action recognition in videos; it factorizes the 3D convolutions of a 3D CNN into spatial (2D) and temporal (1D) components and, by incorporating information about sequentially occurring actions, enhances recognition accuracy. It was applied in this study to analyze the actions of athletes [19]. Fourth, key events are recognized from the information derived from the video. Key events contain player, team, position, and action information, and the GPT-4.0 Turbo API generates a broadcast line, which is converted to audio with TTS for natural broadcasting.
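As an illustration of this stitching and view-transformation step, the sketch below shows how SIFT matching, homography estimation, and bird's-eye warping could be implemented with OpenCV. It is a minimal sketch rather than the authors' exact pipeline: the ratio-test threshold, the court corner coordinates, and the target resolution (a 94 ft × 50 ft court at 10 px/ft) are assumptions.

```python
import cv2
import numpy as np

def stitch_pair(img_left, img_right):
    """Estimate the homography mapping img_right onto img_left from SIFT matches."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img_left, None)
    k2, d2 = sift.detectAndCompute(img_right, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d2, d1, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test
    src = np.float32([k2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H  # warp img_right with H to extend img_left into a panorama

def birds_eye(panorama, court_corners_px, width=940, height=500):
    """Warp the detected court quadrilateral (TL, TR, BR, BL) to a top-down rectangle."""
    dst = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
    H, _ = cv2.findHomography(np.float32(court_corners_px), dst)
    return cv2.warpPerspective(panorama, H, (width, height))
```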

3.2. Court Calibration

Figure 2 shows the process of court calibration. To calibrate the court, information about the entire court is required. Our study generates a bird's-eye view of the global court image as follows. First, panoramic images are generated from clips that contain the court region. Second, the image is converted to the YCrCb color space and a 16 × 16 histogram is computed. Third, the colors are thresholded around the mode color of the image to produce a binary image. The court region appears as the largest contour in the middle of the binary image. The model computes the convex hull of this region and clusters the hull points by angle in clockwise order; PCA is then applied to each cluster to fit straight lines, and the intersections of these lines give the vertices of the court. The vertex coordinates are used to calculate the homography between the result and the actual court, so a homography can be obtained from a sports clip. Clips whose court estimate is unstable are replaced with the most accurate homography among the clips. Finally, the total homography is calculated for the whole video, which yields a bird's-eye view of the court. Afterwards, to detect the partial field of the court as the camera view moves, the YOLOv8-seg model distinguishes the court region, three-point line, paint area, center circle, and area outside the court. Fine-tuning was applied for better performance: a total of 1362 images from the segmentation dataset [20] were used for training, with 1342 labels for the basketball court, 1296 for the three-point line, 776 for the paint area under the basket, and 507 for the center circle. Fine-tuning was performed for 300 epochs with a batch size of 16. Detailed court areas and their convex hulls were computed in order to determine each player's position on the court and exclude off-court players.
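The sketch below illustrates the court-masking part of this procedure (YCrCb conversion, mode-color thresholding, largest contour, convex hull), assuming a single dominant floor color. The color tolerance is a placeholder, and the clockwise angle clustering and PCA line fitting of the hull points are only indicated in a comment.

```python
import cv2
import numpy as np

def court_mask(frame_bgr, tol=20):
    """Return a binary court mask and the convex hull of the largest court-colored region."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # 16 x 16 Cr/Cb histogram -> mode (most frequent) floor color
    hist = cv2.calcHist([ycrcb], [1, 2], None, [16, 16], [0, 256, 0, 256])
    cr_bin, cb_bin = np.unravel_index(np.argmax(hist), hist.shape)
    cr0, cb0 = int(cr_bin) * 16 + 8, int(cb_bin) * 16 + 8
    lower = np.array([0, max(cr0 - tol, 0), max(cb0 - tol, 0)], dtype=np.uint8)
    upper = np.array([255, min(cr0 + tol, 255), min(cb0 + tol, 255)], dtype=np.uint8)
    binary = cv2.inRange(ycrcb, lower, upper)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    court = max(contours, key=cv2.contourArea)   # largest blob = court region
    hull = cv2.convexHull(court)
    # Next steps (not shown): cluster hull points by angle clockwise, fit a line to each
    # cluster with PCA, intersect the lines to get court vertices, then solve the homography.
    return binary, hull
```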

3.3. Player and Ball Tracking

In this work, we propose a tracking algorithm for reliable tracking. In the proposed algorithm, the get_player_positions function processes the output of the YOLOv8 object detection model to extract player and basketball center coordinates, bounding boxes, and attributes. To improve the accuracy of detecting players and the ball, a total of 14,425 images, comprising 53,320 player annotations, 13,527 referee annotations, 5591 net annotations, and 4938 ball annotations from the athlete recognition dataset [21], were used for fine-tuning over 300 epochs with a batch size of 16. Only objects with a confidence score of 0.5 or higher are considered. The basketball is linked to the nearest player by calculating the Euclidean distance between the basketball's center coordinates and the center coordinates of all detected players. The compute_iou function calculates the Intersection over Union (IoU) between bounding boxes to serve as the basis for matching objects: objects with an IoU of 0.3 or higher are considered the same entity, while objects below this threshold are treated as unmatched. The match_players function builds a cost matrix from the IoU values and employs the Hungarian algorithm to perform optimal matching between bounding boxes from consecutive frames. The track_players function updates the state of tracked objects based on the matching results, incrementing a missed-frames counter for unmatched objects; objects that remain unmatched for more than five frames are removed from tracking. Finally, the player_tracking function performs object tracking across all frames and exports the results in JSON format, including each object's ID, position, bounding box, and relationship with the basketball. Through these functions, the algorithm maintains consistent player tracking and object association throughout the video, ensuring robust detection and tracking of players and the basketball.
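A condensed sketch of this matching logic is given below. The function names follow the description above, and the 0.3 IoU threshold and Hungarian matching are taken from the text; everything else, including the [x1, y1, x2, y2] box format, is an illustrative assumption rather than the exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def compute_iou(box_a, box_b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_players(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Hungarian matching on a (1 - IoU) cost matrix between consecutive frames."""
    cost = np.ones((len(prev_boxes), len(curr_boxes)))
    for i, pb in enumerate(prev_boxes):
        for j, cb in enumerate(curr_boxes):
            cost[i, j] = 1.0 - compute_iou(pb, cb)
    rows, cols = linear_sum_assignment(cost)
    # Pairs below the IoU threshold are treated as unmatched (missed or new tracks).
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]
```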

3.4. Sports Activity Recognition

Figure 3 shows the overall process of player tracking and action recognition. To assign recognized players to teams, the model must be trained on uniform data. YOLO-cls is fine-tuned for this purpose: a total of 3738 images containing black, blue, green, purple, red, and white uniforms from the uniform color dataset [22] were used for training over 100 epochs with a batch size of 16. The model classifies the uniform color of each bounding box extracted per frame, and the color with the highest confidence is selected.
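A possible fine-tuning and inference call for this uniform-color classifier is sketched below using the Ultralytics package. The dataset directory, checkpoint name, and image size are placeholders, not the exact configuration used in the study.

```python
from ultralytics import YOLO

# Fine-tune a pretrained classification backbone on cropped uniform images
# organized in class folders (black/, blue/, green/, purple/, red/, white/).
model = YOLO("yolov8n-cls.pt")
model.train(data="uniform_colors/", epochs=100, batch=16, imgsz=224)

# Classify a cropped player bounding box and keep the most confident color.
result = model("player_crop.jpg")[0]
team_color = result.names[int(result.probs.top1)]
```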
After player recognition, the R(2+1)D model is applied to detect basketball-related actions such as block, pass, run, dribble, shoot, ball in hand, pick, no action, and walk. R(2+1)D is a convolutional model that achieves good performance with relatively few parameters; it combines a 3D CNN with a ResNet-style 2D CNN, decomposing the video into two-dimensional spatial information and one-dimensional temporal information. To fine-tune the action recognition model, the SpaceJam dataset [23] was used with 25 epochs, a learning rate of 0.001, and a batch size of eight. The SpaceJam data consist of 2000 labeled GIF clips covering actions such as block, pass, run, dribble, shoot, ball in hand, defense, pick, no action, and walk. Data augmentation was also applied, yielding a total of 49,901 training samples. Through this process, the model can extract key events combining player identity, team, position, and action for broadcasting.
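An illustrative fine-tuning setup for the R(2+1)D classifier is shown below using torchvision's r2plus1d_18. The number of classes and the hyperparameters (25 epochs, learning rate 0.001, batch size 8) follow the text; the pretrained weights, optimizer choice, and input shape are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

NUM_CLASSES = 10  # block, pass, run, dribble, shoot, ball in hand, defense, pick, no action, walk

model = r2plus1d_18(weights="KINETICS400_V1")            # Kinetics-400 pretrained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the head for basketball actions

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(clips, labels):
    """clips: (batch, 3, frames, height, width) tensor of short player-centered clips."""
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```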

3.5. Commentary Generation

This study uses the ChatGPT-4.0 Turbo API to generate commentary corresponding to each event based on the player, team, position, and action. For generation, a base message containing the player, team, position, and action information in JSON form is constructed and prompted, and the API rewrites it as natural commentary. An example of the generated commentary is shown in Figure 4. To analyze the naturalness of the generated commentary, sample outputs were compared with actual broadcast commentary, confirming that the generated sentences were of a similar quality to real broadcast commentary.
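The sketch below shows one way this step could be wired up with the OpenAI Chat Completions API: the extracted event JSON is passed in and rephrased as a single broadcast line. The model name, prompt wording, and event fields are placeholders, not the exact prompt used in the paper, and the TTS engine mentioned in the final comment is likewise only one possible option.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

event = {  # example of the extracted key-event JSON (fields are illustrative)
    "player": "No. 23", "team": "Blue", "action": "shoot",
    "position": "right three-point line",
}

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system",
         "content": "You are a live basketball commentator. Turn the event JSON "
                    "into one short, energetic broadcast sentence."},
        {"role": "user", "content": json.dumps(event)},
    ],
)
commentary = response.choices[0].message.content
print(commentary)
# The resulting line is then passed to a TTS engine for audio output,
# e.g. gTTS(commentary, lang="en").save("commentary.mp3") as one possible choice.
```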

4. Experimental Result

4.1. Implementation

The proposed system architecture is shown in Figure 5. The system for providing AI-based broadcast commentary is structured as follows. Users upload videos, which are stored in an S3 bucket called Input Video, and view the generated commentary through a web page. The system includes another S3 bucket, Model Results, for saving the outputs of model executions. Events triggered by storing or updating videos in the Input Video bucket or results in the Model Results bucket are detected by Model Execute Triggers, an AWS Lambda trigger, which initiates the execution of the next model. Model Endpoints run the R(2+1)D and YOLO models on a GPU endpoint, implemented with FastAPI and proxied through Nginx to manage access to each model's endpoint. The Model Weights component is an S3 bucket that stores the fine-tuned model weights after training; when the weights are updated, Model Weight Triggers, implemented as an AWS Lambda function, automatically update the corresponding model's .pt file locally. The system operates as follows: when a user uploads a video, it is stored in the S3 bucket, triggering a Lambda function to execute the next step in the pipeline. Once a model completes its process, the results are stored in the S3 bucket, and the process repeats. Finally, the commentary results are stored in the S3 bucket and retrieved by the frontend page for display to the user.
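A minimal sketch of the Lambda trigger pattern described above is shown below: an S3 put event on the Input Video bucket forwards the object key to the model endpoint behind Nginx. The endpoint URL and request payload are placeholders, not the system's actual interface.

```python
import json
import urllib3

http = urllib3.PoolManager()
MODEL_ENDPOINT = "http://model-endpoint.internal/yolo/track"  # placeholder FastAPI route behind Nginx

def lambda_handler(event, context):
    """Triggered by S3 object-created events; kicks off the next pipeline stage."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        http.request(
            "POST", MODEL_ENDPOINT,
            body=json.dumps({"bucket": bucket, "key": key}),
            headers={"Content-Type": "application/json"},
        )
    # Each stage writes its output to the Model Results bucket, which fires the next trigger.
    return {"statusCode": 200}
```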
Figure 6 shows a screenshot of the system provided to the user. The broadcasting service displays information about the teams playing in the current game at the top, with the player and ball detection results on the right side of the screen. The commentary generated from the detected teams, players, ball, and player positions appears at the bottom of the screen. Currently, processing a 20-second video through the entire pipeline takes approximately one minute. To generate commentary, the system requires a minimum of 16 frames per second, which is sufficient to produce results.

4.2. Experimental Result

This study calculated the IoU to analyze the accuracy of the generated bird's-eye view court image. Image labeling was performed with the LabelImg tool, covering the Left Court Paint Zone (Left Box), Right Court Paint Zone (Right Box), Full Court, Left Court 3-Point Line (Left Semicircle), Right Court 3-Point Line (Right Semicircle), Left Half Court (Left Court), Right Half Court (Right Court), and Court Center Circle (Mid Circle). From the ROI (Region of Interest) coordinates, a polygon-shaped bitmask is generated, and the intersection and union of the bitmasks are calculated, giving the IoU in Equation (1). The calibrated result of the court is shown in Figure 7, where the colored lines are the model's calibration output.
$$\mathrm{IoU} = \frac{\text{Overlapping Area}}{\text{Union Area}} \tag{1}$$
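The sketch below shows how this bitmask IoU can be computed with OpenCV and NumPy: each labeled polygon ROI is rasterized to a binary mask and compared with the predicted region. The image resolution and polygon coordinates are placeholders, not the labels used in the study.

```python
import cv2
import numpy as np

def polygon_mask(points, shape=(1080, 1920)):
    """Rasterize a polygon ROI ([[x, y], ...]) into a binary bitmask."""
    mask = np.zeros(shape, dtype=np.uint8)
    cv2.fillPoly(mask, [np.array(points, dtype=np.int32)], 1)
    return mask

def mask_iou(pred_points, gt_points, shape=(1080, 1920)):
    """IoU between a predicted court-region polygon and its ground-truth label."""
    pred, gt = polygon_mask(pred_points, shape), polygon_mask(gt_points, shape)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0
```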
At the calibration stage, the target values and the output values were compared. The IoU for recognizing the whole court was about 97%, the left court three-point line about 96%, the left court paint zone about 94%, the left half court about 95%, the right half court about 95%, the court center circle about 95%, the right court three-point line about 94%, and the right court paint zone about 94%. The average court IoU was about 95.4%. The full results are shown in Table 1, where Partial IoU Ranges denotes the court area and IoU_part its accuracy. This result outperforms the previous study of Wen et al., which reported an accuracy of 80.5% [8], and the end-to-end approach of Long Sha et al., which reported 83.2% [9]. Overall, our model improves on the previous studies by about 12.2% to 14.9%.
For the action recognition model, 10% of the dataset was randomly allocated as test data. The models trained with YOLO used the split data from each Roboflow dataset: for athlete recognition, 87%/6%/6% of the data were allocated to training, validation, and testing, respectively; for court segmentation, 87%/9%/4%; and for uniform classification, 80%/13%/7%. The results were analyzed based on these allocations. The segmentation result is shown in Figure 8. The center circle, paint zone, basketball court, and three-point line showed an average accuracy of 92.5%; in particular, the segmentation of the paint zone and three-point line reached 98%, indicating strong performance. For the background, our study applied an additional algorithm to treat the non-court area as background, so no separate accuracy was calculated for it. Overall, excluding the background, the model achieved an average court segmentation accuracy of 92.46%.
The detection and classification results for player uniforms, teams, and actions are shown in Figure 9 and Figure 10. Figure 9 shows the confusion matrix for player actions, with an average accuracy of about 85.04%. In particular, the model classifies important in-game actions such as blocking and shooting well. The classes with the lowest accuracy are 'Ball in Hand', 'Defense', and 'No Action'. For 'Ball in Hand', the low accuracy is likely because this action often overlaps with dribbling or passing, as players tend to hold the ball momentarily during these actions; accuracy could therefore be improved by combining the action recognition results with temporal information, specifically the duration of ball possession, to distinguish simple ball-holding more reliably. Figure 10 shows the confusion matrix for team classification by uniform color. While most uniforms are classified well, the purple uniform shows the lowest accuracy, at 67% and 57.14%. The low recognition rate for purple is likely due to the limited amount of purple uniform data, which could be addressed by improving the dataset through data augmentation.
The generated commentary results are shown in Figure 11. The positions and actions of all players, as well as the ball’s location in the video, are extracted in JSON format as keywords. Based on this extracted information, commentary is generated using the ChatGPT-4.0 Turbo API. Currently, the generated commentary has been analyzed subjectively. Three individuals were surveyed to assess whether the extracted sentences accurately described the scenes, and in most cases, the explanations were deemed appropriate for the given video.
In Example 1, however, the location of the shooting action is incorrectly recognized as being in the paint zone, although it is not. This issue seems to arise because the YOLO-based model recognizes individuals using bounding boxes but fails to accurately determine the player's precise standing position. To address this, future improvements should focus on stabilizing the recognition of foot positions by using the lowest point of the bounding box as the reference. In the second example, there were no errors in the scene description, but the commentary followed a similar pattern to Example 1. This appears to result from the limited input format, which makes it difficult to generate a diverse understanding of the situation. Therefore, prompt engineering should be used in the future to add more context to the descriptions, or methods such as the Chain of Thought approach should be explored to enable more varied commentary for different situations. Additionally, in future studies, we aim to complement the subjective analysis by objectively evaluating the generated commentary through comparisons with existing professional commentaries.

5. Discussion

This paper presents a study on generating sports commentary by analyzing and tracking broadcast videos with deep learning technologies and producing commentary with generative AI techniques. To generate sports broadcasts, the integration of court recognition, player behavior recognition and tracking, game event recognition, and generative AI-based commentary generation is essential, and this study presents a complete process for commentary generation through the integration of these individual technologies. Analyzing sports videos requires stable recognition and tracking technologies that account for domain knowledge about the game, complex backgrounds, occluded movements, and irregular lighting conditions. Additionally, the rapid pace of the game results in frequent movements of the camera and players, making detection challenging, and pre-existing knowledge of teams, referees, and players is necessary to differentiate between them accurately. In this study, domain knowledge, including team colors, the number of players and referees, team associations, and the court layout, is used to support player recognition. Homography is calculated using panoramic images, and multi-view camera footage is aligned to enable tracking from a bird's-eye view. Furthermore, court recognition is enhanced by applying YOLO-based segmentation to identify court areas. For player action and ball recognition, the YOLO algorithm is fine-tuned, and action recognition is performed using the R(2+1)D model. The system demonstrates high performance, achieving an average accuracy of 97% in court calibration, 92.5% in player and object detection, and 85.04% in action recognition. Key game events are identified based on positional and action data, and broadcast commentary is generated using GPT APIs and converted into natural audio commentary through Text-to-Speech (TTS). The limitations of this study lie in its focus on basketball games and rules, which constrains the scope of predictions. Future research will explore domain adaptation to apply the proposed approach to various sports. Additionally, while some examples of generated commentary were evaluated for naturalness, future work will include quantitative evaluations and detailed analyses of the naturalness of the generated commentary.

6. Conclusions and Future Works

In summary, our study presents a system for basketball game broadcasting with player and object tracking and AI-generated commentary. As the basic setup for detecting players and objects, our study uses camera calibration to generate bird's-eye view images of the court. After court calibration, players and objects are detected with YOLO models, and players' actions are classified with the R(2+1)D model. Our results show better performance than previous studies, reaching an average of 97% in court calibration, 92.5% in player and object detection, and 85.04% in action recognition. Further work includes generalizing the model to other sports and producing better AI-generated commentary by using other LLMs or additional fine-tuning.

Author Contributions

Conceptualization, S.J. and A.C.; methodology, S.J., H.K., H.P. and A.C.; software, S.J. and H.P.; validation, S.J., H.K., H.P. and A.C.; formal analysis, S.J., H.K., H.P. and A.C.; investigation, H.K. and A.C.; resources, A.C.; data curation, S.J., H.K. and H.P.; writing—original draft preparation, H.K. and A.C.; writing—review and editing, A.C.; visualization, S.J., H.K., H.P. and A.C.; supervision, A.C.; project administration, A.C.; funding acquisition, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Gachon University research fund of 2024 (GCU-202404610001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code used to implement this work will be made available by A.Y.C. upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. NCSOFT PR Center—News Detail. 2024. Available online: https://about.ncsoft.com/news/article/prosody-control-ai-20201210 (accessed on 1 December 2024).
  2. Park, G.M.; Hyun, H.I.; Kwon, H.Y. Multimodal Learning Model Based on Video–Audio–Chat Feature Fusion for Detecting E-Sports Highlights. Appl. Soft Comput. 2022, 126, 109285. [Google Scholar] [CrossRef]
  3. Kang, S.K.; Lee, J.H. An E-Sports Video Highlight Generator Using Win-Loss Probability Model. In Proceedings of the 35th Annual ACM Symposium on Applied Computing, Brno, Czech Republic, 30 March–3 April 2020; pp. 915–922. [Google Scholar]
  4. Fu, C.Y.; Lee, J.; Bansal, M.; Berg, A.C. Video Highlight Prediction Using Audience Chat Reactions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; pp. 972–978. [Google Scholar]
  5. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  7. Terven, J.; Cordova-Esparza, D. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. arXiv 2023, arXiv:2304.00501. [Google Scholar] [CrossRef]
  8. Wen, P.C.; Cheng, W.C.; Wang, Y.S.; Chu, H.K.; Tang, N.C.; Liao, H.Y.M. Court Reconstruction for Camera Calibration in Broadcast Basketball Videos. IEEE Trans. Vis. Comput. Graph. 2016, 22, 1517–1526. [Google Scholar] [CrossRef] [PubMed]
  9. Sha, L.; Hobbs, J.; Felsen, P.; Wei, X.; Lucey, P.; Ganguly, S. End-to-End Camera Calibration for Broadcast Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13627–13636. [Google Scholar]
  10. Theiner, J.; Ewerth, R. TVCALIB: Camera Calibration for Sports Field Registration in Soccer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 1166–1175. [Google Scholar]
  11. Ige, A.O.; Noor, M.H.M. A Survey on Unsupervised Learning for Wearable Sensor-Based Activity Recognition. Appl. Soft Comput. 2022, 127, 109363. [Google Scholar] [CrossRef]
  12. Van Kasteren, T.; Noulas, A.; Englebienne, G.; Kröse, B. Accurate Activity Recognition in a Home Setting. In Proceedings of the 10th International Conference on Ubiquitous Computing, Seoul, Republic of Korea, 21–24 September 2008; pp. 1–9. [Google Scholar]
  13. Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A Multimodal Dataset for Human Action Recognition Utilizing a Depth Camera and a Wearable Inertial Sensor. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Quebec, QC, Canada, 27–30 September 2015; pp. 168–172. [Google Scholar]
  14. Gao, W.; Zhang, L.; Teng, Q.; He, J.; Wu, H. DanHAR: Dual Attention Network for Multimodal Human Activity Recognition Using Wearable Sensors. Appl. Soft Comput. 2021, 111, 107728. [Google Scholar] [CrossRef]
  15. Khan, M.A.; Sharif, M.; Akram, T.; Raza, M.; Saba, T.; Rehman, A. Hand-Crafted and Deep Convolutional Neural Network Features Fusion and Selection Strategy: An Application to Intelligent Human Action Recognition. Appl. Soft Comput. 2020, 87, 105986. [Google Scholar] [CrossRef]
  16. Khan, Z.N.; Ahmad, J. Attention Induced Multi-Head Convolutional Neural Network for Human Activity Recognition. Appl. Soft Comput. 2021, 110, 107671. [Google Scholar] [CrossRef]
  17. Burić, M.; Pobar, M.; Ivašić-Kos, M. Adapting YOLO Network for Ball and Player Detection. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods, Prague, Czech Republic, 19–21 February 2019; Volume 1, pp. 845–851. [Google Scholar]
  18. Khobdeh, S.B.; Yamaghani, M.R.; Sareshkeh, S.K. Basketball Action Recognition Based on the Combination of YOLO and a Deep Fuzzy LSTM Network. J. Supercomput. 2024, 80, 3528–3553. [Google Scholar] [CrossRef]
  19. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  20. ZY-VEVVI. Court Segmentation Dataset. 2025. Available online: https://universe.roboflow.com/zy-vevvi/court-segmentation/dataset/4 (accessed on 14 January 2025).
  21. Annotations, A. AI Sports Analytics System Dataset—Training Split. 2025. Available online: https://universe.roboflow.com/asas-annotations/ai-sports-analytics-system/dataset/7/images?split=train (accessed on 14 January 2025).
  22. Test-0v7fp. NBA Uniform Color Classification Dataset. 2025. Available online: https://universe.roboflow.com/test-0v7fp/nba-uniform-color-classify/dataset/4 (accessed on 14 January 2025).
  23. Francia, S. SpaceJam—A Dataset for AI Training. 2025. Available online: https://github.com/simonefrancia/SpaceJam (accessed on 14 January 2025).
Figure 1. Overall procedure of the system.
Figure 2. Court segmentation in local view.
Figure 3. Player tracking and action recognition.
Figure 4. Prompt example for game commentary generation.
Figure 5. Overall system architecture.
Figure 6. Application screenshot.
Figure 7. Calibrated result of court.
Figure 8. Confusion matrix for court segmentation.
Figure 9. Confusion matrix for R(2+1)D action recognition.
Figure 10. Confusion matrix for YOLO results.
Figure 11. AI commentary generation result.
Table 1. Partial IoU ranges and IoU_part (mean, %).

Partial IoU Ranges     IoU_part (Mean, %)
Full Court             97.0
Left Semicircle        96.0
Left Box               94.0
Left Court             95.0
Right Court            95.0
Mid Circle             95.0
Right Semicircle       94.0
Right Box              95.0
Average                95.4
