1. Introduction
Video surveillance systems are broadly used in many countries to monitor traffic volume and to detect accidents at an early stage to prevent secondary accidents [1,2,3]. Typically, these accidents involve vehicles or pedestrians on the road, making it essential to implement early object detection systems based on CCTVs. This requirement is mandatory in most countries. Based on the detected object information, it becomes possible to develop evaluation algorithms or accident response processes for CCTV surveillance systems. For this purpose, a number of studies have been conducted in various academic fields, such as emergency transportation methods [4], road vehicle accident responses [5], railway crossing surveillance [6], and monitoring systems of construction vehicles at construction sites [7,8].
In tunnels, there is a much higher probability of road traffic accidents in entering or exiting zones because of illumination changes, which sometimes cause temporary light or dark adaptation in drivers [9,10,11,12]. Additionally, evacuation areas are often limited in tunnels, and this may lead to severe secondary accidents. To minimize accidents in such environments, it is crucial to promptly detect and respond to them via video surveillance systems. CCTVs in tunnels are normally installed at lower positions compared to those on open roads, as illustrated in Figure 1. These CCTV installations therefore produce severely distorted perspective footage, and this image distortion degrades the recognition of vehicles or humans as the relative distance between the CCTV and the identified object increases [13]. In particular, this mechanism is disadvantageous for object detection (OD) performance using computer vision techniques [14].
Figure 1 shows images captured from two different CCTVs, one on a highway and one in a tunnel, demonstrating the effect of installation height on perspective and overlap. At an open road site, as shown in Figure 1a, the CCTV can be installed at a height of at least 8 m, as there are no space constraints in the vertical direction. Consequently, the perspective effect is less pronounced, as in Figure 1a. The lane width is marked by white straight lines over a distance of about 100 m; the lane width at the far side appears to be approximately 0.5 times the width at the near side. In the tunnel CCTV image (Figure 1b), on the other hand, the lane width at the far side, about 100 m away, appears to be 0.09 times the width at the near side, indicating a much more severe perspective effect compared to Figure 1a [15].
Figure 1. Comparison of CCTV images based on installation height: (a) CCTV for open roads [16]; (b) tunnel CCTV.
In addition, there is a significant difference in the overlapping of vehicles depending on the CCTV installation height. The higher the CCTV installation, the closer the footage is to a top-view image. Therefore, in dense traffic (Figure 1a), only a portion of the vehicles overlap, allowing each vehicle to be distinguished, as seen in the actual road image in Figure 1a. In contrast, as illustrated in Figure 1b, vehicles distant from the CCTV appear heavily overlapped, which can obstruct the visibility of the vehicles behind them.
To address the issue of decreased OD performance due to perspective, Min et al. [17] conducted a study aiming to improve OD performance for detecting persons in tunnel CCTV images using high-resolution reconstruction, reporting a 2% improvement, from 88% to 90%. However, Min et al. [17] did not address any geometrical effect of the strong perspective phenomenon, which, we believe, may affect OD performance more than CCTV resolution does in a tunnel environment.
In fact, OD performance at tunnel sites has not been addressed to date, even though a traffic accident in a tunnel can cause highly risky secondary accidents and the tunnel CCTV environment is expected to yield low OD performance. Therefore, this study demonstrates the difficulties of the tunnel environment, focusing on the strong perspective effect caused by low CCTV installation. An attempt is then made to overcome it by introducing an inverse perspective transform (IPT) technique, supported by experimental evidence, aiming to achieve size uniformity between distant and close objects. Experiments were conducted on tunnel CCTV images, chosen for their higher susceptibility to the perspective effect compared to open road images. Furthermore, to exclude influencing factors other than the perspective effect in OD, a virtual tunnel environment was utilized. Various moving vehicles were generated within the virtual tunnel, and CCTV images were artificially produced using the game development software UNITY (2019.3.9f1) instead of actual tunnel site footage. Based on this, a virtual tunnel and moving-vehicle video dataset was created, and a deep learning model was built. Subsequently, the effectiveness of the introduced IPT technique was objectively assessed through a review of the appearance characteristics (AC) of vehicle objects and a training experiment with the deep learning model. A study was then conducted to quantitatively identify the improvement in OD performance resulting from the application of the IPT technique.
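To make the IPT idea concrete, the following minimal sketch applies an inverse perspective mapping with OpenCV. It is an illustration under stated assumptions rather than the calibration used in this study: the four source points are hypothetical lane-marking pixels picked from a single frame, and the output size matches the 646 × 324 pixel resolution used for the datasets in Section 4.1.

```python
# Minimal sketch of an inverse perspective transform (IPT) with OpenCV.
# The four source points are hypothetical lane-marking pixels from one frame;
# real values depend on the camera pose and must be calibrated per site.
import cv2
import numpy as np

src = np.float32([[100, 320], [540, 320],   # near lane edges (image bottom)
                  [280, 120], [360, 120]])  # far lane edges (toward the vanishing point)
dst = np.float32([[100, 320], [540, 320],   # same physical points mapped to a
                  [100,   0], [540,   0]])  # rectangle: lane width becomes constant

H = cv2.getPerspectiveTransform(src, dst)        # 3x3 homography, OI -> TI
frame = cv2.imread("tunnel_frame.png")           # one original image (OI) frame
ti = cv2.warpPerspective(frame, H, (646, 324))   # corresponding transformed image (TI)
```

Because the homography maps the trapezoidal road region onto a rectangle, a vehicle 200 m away occupies roughly the same number of pixels as one near the camera, which is the size uniformity exploited in the experiments below.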
4. Comparative Experiment on Deep Learning Model Performance
In this section, the OD performance is investigated for both the ORG and TRANS image datasets with discrete distance segments. This comparative experiment is conducted using a convolutional neural network (CNN)-based deep learning model [35,36,37,38,39].
It is important to recognize that the efficacy of a deep learning model is strongly influenced by the size of the object under consideration. Smaller objects cause difficulty during both training and inference, primarily because extracting a pertinent feature map for such small objects is challenging [40,41,42].
To investigate this aspect in the tunnel, two image datasets derived from the same multi-vehicle video are employed: ORG on OI and TRANS on TI. Clearly, inconsistent OD performance for distant moving vehicles leads to unstable detection of phenomena such as backward-travelling or stopping vehicles in the tunnel.
4.1. Preparation of Training Dataset
To train our model, the vehicle objects in the ORG and TRANS datasets were annotated. Both datasets are derived from the same multi-vehicle video and therefore contain an identical number of images and objects. Furthermore, the resolution of these image datasets was standardized to 646 × 324 pixels, which is lower than that of the original video. A comprehensive overview of the labeled data composition is presented in Table 3. As listed in Table 3, the images are divided into training and testing subsets at a ratio of 8:2. It is imperative to note that the testing dataset remains unseen throughout the whole training phase. Furthermore, the training dataset is further partitioned at a ratio of 8:2 into pre-training and validation subsets. The pre-training dataset serves the crucial purpose of preliminary training, namely the calibration of the deep learning hyperparameters and the training environment. The validation dataset, an offshoot of the training dataset, is not used for pre-training but is added back to the training dataset for the subsequent primary training.
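As a concrete illustration of these nested 8:2 splits, the sketch below uses hypothetical file names; the actual dataset sizes are those listed in Table 3.

```python
# Sketch of the nested 8:2 / 8:2 dataset splits (file names are hypothetical).
import random

random.seed(0)                                        # reproducible shuffling
images = [f"frame_{i:05d}.png" for i in range(2000)]  # placeholder image list
random.shuffle(images)

n_train = int(0.8 * len(images))                      # 8:2 training / testing
train, test = images[:n_train], images[n_train:]

n_pre = int(0.8 * len(train))                         # 8:2 pre-training / validation
pre_train, validation = train[:n_pre], train[n_pre:]

# Pre-training fits on pre_train and monitors loss on validation; the primary
# training re-joins pre_train + validation; test is never used for training.
```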
4.2. Configuration of Training Hyperparameters through Pre-Training
In this paper, Faster R-CNN [43] is adopted as the deep learning model. Faster R-CNN, a product of evolutionary progression within the R-CNN family, gained substantial prominence following its release in 2015, attributed to its ability to propose potential objects more swiftly and accurately than the pre-existing Fast R-CNN [44]. This enhanced capability stems from the integration of a region proposal network (RPN) into the architecture. It is noted that the choice of deep learning algorithm is not an important factor in this paper, because the algorithm is used only for comparative purposes in the same training environment.
The primary input to this model is an image, which is subjected to feature map extraction via CNN procedures. In the intermediate phase, the RPN generates potential bounding boxes through a combination of objectness and regression analyses. These proposed bounding boxes then undergo post-processing by non-maximum suppression (NMS) [45]. Finally, the fully connected layer (FC layer) determines the object's categorical classification and the precise coordinates of its bounding box.
Since this study considers the impact of perspective at a low CCTV height, the overlapping of many objects is highly visible. Therefore, the hyperparameters that are sensitive to overlap, namely the number of epochs and the NMS intersection-over-union (IOU) threshold, were selected for sensitivity analysis and calibration through preliminary training. An epoch is the training unit of a deep learning model: one epoch corresponds to one pass over the entire training dataset [46]. In general, the number of training epochs required for the loss function of a deep learning model to converge should be determined in advance. The NMS IOU threshold is a hyperparameter used by the NMS module, which checks the overlap ratio between the bounding boxes to be post-processed and removes them as duplicates when the threshold is exceeded. In this paper, NMS removes overlapping bounding boxes among the proposals at the RPN stage. Depending on the NMS IOU threshold, either only bounding boxes with less than 10 to 20% overlap or even bounding boxes with 80 to 90% overlap are passed to the FC layer, affecting the weight update of Faster R-CNN. The other training configurations adopted in this study are summarized in Table 4.
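To make the role of this threshold concrete, the following self-contained sketch implements greedy NMS in its standard form; it mirrors the textbook algorithm rather than MMDetection's internal implementation. Raising `iou_threshold` lets more heavily overlapping proposals survive, which matters when distant vehicles occlude one another.

```python
# Greedy non-maximum suppression: keep the highest-scoring box and drop any
# remaining box whose IOU with it exceeds the threshold.
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.6):
    order = np.argsort(scores)[::-1]     # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_threshold]
    return keep
```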
4.3. Training Environment
The Faster R-CNN training in this paper used the MMDetection code [49], and when resizing the proposals delivered to the FC layer, RoIAlign [50] was used instead of the RoI pooling of the original paper.
The training environment for the deep learning model was Ubuntu 20.04, and the training was performed on the following hardware: Intel E5-2660V3 ×2, 128 GB RAM, NVIDIA GTX 1080 ×4. For the training platform, this paper used Python 3.7, PyTorch 1.12.1, and NumPy 1.21.6. The initial number of epochs for pre-training was set to 200. The IOU threshold for true/false determination was 0.5.
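The exact configuration files are not reproduced here; the fragment below is a hypothetical MMDetection 2.x-style config override indicating where RoIAlign and the RPN-stage NMS IOU threshold calibrated in Section 4.5 would be set, with field names following the public faster_rcnn_r50_fpn baseline.

```python
# Hypothetical MMDetection 2.x-style config override (illustrative, not the
# exact files used in this study).
_base_ = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'

model = dict(
    roi_head=dict(
        bbox_roi_extractor=dict(                 # RoIAlign instead of RoI pooling
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0))),
    train_cfg=dict(                              # NMS IOU threshold on RPN proposals
        rpn_proposal=dict(nms=dict(type='nms', iou_threshold=0.6))),
    test_cfg=dict(
        rpn=dict(nms=dict(type='nms', iou_threshold=0.6))))

runner = dict(type='EpochBasedRunner', max_epochs=200)  # 200 epochs for pre-training
```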
4.4. Determination of Training Epochs
Training the deep learning model took approximately 53 min per 200 epochs for both the ORG and TRANS datasets. Throughout the training process, the values of the loss function were recorded for both the training and validation datasets at every epoch, as shown in Figure 8 and Figure 9.
Figure 8 shows the progression of the loss function for the ORG dataset. It is evident that the classification and regression losses on the training dataset converge throughout the epochs. However, the loss curves of the validation dataset, also depicted in Figure 8, reveal divergence after approximately 100 epochs in the classification loss and stagnation without noticeable decline in the regression loss. Consequently, 100 epochs is selected as the optimal number of training epochs for the ORG case. The analogous approach is applied to the TRANS dataset, as depicted in Figure 9. Similar convergence of the classification and regression losses on the training dataset is apparent throughout the epochs. Notably, the loss curves of the validation dataset follow a pattern similar to that observed in Figure 8, with divergence commencing around 50 epochs in Figure 9a and a sustained lack of discernible decrease in Figure 9b. Thus, 50 epochs is chosen as the optimal number of training epochs for the TRANS case.
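In this paper, the optimal epoch is read off the validation loss curves; the sketch below is one hypothetical way to automate this rule, where `patience` and `min_delta` are illustrative assumptions rather than values taken from the study.

```python
def select_epoch(val_loss, patience=10, min_delta=1e-4):
    """Pick the epoch at which the validation loss stops improving.

    val_loss: per-epoch validation loss values, as recorded in Figures 8 and 9.
    Returns the last epoch before `patience` consecutive non-improving epochs,
    i.e. the point where the curve starts to diverge or stagnate.
    """
    best, best_epoch, stale = float('inf'), 0, 0
    for epoch, loss in enumerate(val_loss, start=1):
        if loss < best - min_delta:   # a meaningful improvement resets the counter
            best, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch
```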
4.5. Determination of the NMS IOU Threshold
A sensitivity analysis of the NMS IOU threshold, a hyperparameter that is particularly sensitive in OD tasks involving overlapped objects, is undertaken. For this purpose, a series of deep learning models is trained on the ORG and TRANS datasets under the epochs and training conditions ascertained in the previous sections. Six IOU threshold values (0.4, 0.5, 0.6, 0.7, 0.8, and 0.9) are selected, and a deep learning model is trained with each of these thresholds. The resulting loss function dynamics and average precision (AP) values [51] are then examined for each NMS IOU threshold.
The variation of the loss function is closely linked to the NMS IOU threshold. This is graphically portrayed in Figure 10, Figure 11, Figure 12 and Figure 13, which capture the fluctuations across the epochs.
Figure 10 and Figure 11 afford a comprehensive view of the classification and regression loss trends for the ORG dataset. A notable observation is the discernible divergence of the validation dataset's classification loss curve, depicted in Figure 10b, as the NMS IOU threshold increases, peaking at a threshold value of 0.9. Similarly, the regression loss curves depicted in Figure 11b show elevated values for NMS IOU thresholds from 0.7 to 0.9, suggesting the selection of thresholds between 0.4 and 0.6 based on the loss metrics and their trends.
A similar examination is carried out for the TRANS dataset in Figure 12 and Figure 13. The training dataset shows convergence of all loss values, as depicted in Figure 12a, while the classification loss curves of the validation dataset tend to diverge at NMS IOU thresholds between 0.7 and 0.9, likewise prompting the selection of thresholds within the 0.4 to 0.6 range for the TRANS case.
Subsequently, the AP values at the final training epoch, exhibited in Figure 14, are examined for NMS IOU thresholds ranging from 0.4 to 0.6. The variation in AP values within this range is marginal, so, guided by the advantages of higher threshold values elaborated upon in the introduction, 0.6 is selected as the optimal NMS IOU threshold for the ORG case. The TRANS case aligns with the same reasoning, likewise endorsing an NMS IOU threshold of 0.6.
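For reference, the sketch below computes a single-class AP in the all-point interpolated form. Matching each detection to ground truth (deciding `is_tp` at an IOU of 0.5, per Section 4.3) is assumed to have been done upstream, and this is a generic illustration rather than the exact evaluation code of [51].

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """Single-class, all-point interpolated average precision.

    scores: detection confidences; is_tp: 1 if the detection matched a
    ground-truth box at IOU >= 0.5, else 0; n_gt: number of ground-truth objects.
    """
    order = np.argsort(scores)[::-1]                  # rank by descending confidence
    tp = np.cumsum(np.asarray(is_tp, float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, float)[order])
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # Enforce a non-increasing precision envelope, then integrate over recall.
    for i in range(len(precision) - 1, 0, -1):
        precision[i - 1] = max(precision[i - 1], precision[i])
    ap = precision[0] * recall[0]
    for i in range(1, len(recall)):
        ap += (recall[i] - recall[i - 1]) * precision[i]
    return float(ap)
```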
4.6. Main Training Method and Conditions
Primary training is executed using the previously determined epoch and NMS IOU threshold values, with the training dataset given in Table 3. Both cases, ORG and TRANS, contain the same number of images and vehicle objects. The ORG dataset exists in the OI coordinate system, and the TRANS case is generated by geometrically transforming ORG into the TI coordinate system using IPT. The performance on OI and TI is compared in every distance section, as illustrated in Figure 15. To evaluate OD performance for each distance section, we set the ROI at 200 m (Figure 15), dividing it into four sections at 50 m intervals for AP value comparison. Vehicle objects from ORG and TRANS in the test dataset of Table 3 are allocated to the corresponding sections on OI and TI based on the bounding box center point; details of the test dataset allocation are shown in Table 5. The AP value comparison is performed solely on the test dataset, adhering to the prescribed distance sections. Notably, objects in the ORG and TRANS images vary in size and shape due to IPT; thus, vehicle objects are annotated for OD separately for each case. Moreover, the reference coordinate systems of the two cases differ, causing variations in the allocated sections despite the objects being identical. Consequently, the numbers of images and vehicle objects in the prescribed sections may differ, as seen in Table 5.
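The allocation rule can be sketched as follows. The constant pixel-to-meter scale holds only in TI; both the scale (a 200 m ROI spanning the 324-pixel image height) and the orientation (camera at the bottom edge) are illustrative assumptions.

```python
# Sketch of allocating a test object to one of the four 50 m distance sections
# by its bounding-box center point (TI coordinates; assumptions noted above).
def section_of(bbox, meters_per_pixel=200 / 324, n_sections=4):
    """bbox = (x1, y1, x2, y2) in TI pixels, with y increasing toward the camera."""
    cy = (bbox[1] + bbox[3]) / 2                      # bounding-box center row
    distance = (324 - cy) * meters_per_pixel          # meters from the CCTV
    return min(int(distance // 50), n_sections - 1)   # Sections 1..4 as indices 0..3
```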
For the primary training, the epochs and NMS IOU threshold determined during pre-training are employed. The remaining training parameters are consistent with those detailed in Table 4. In this instance, case ORG uses 100 epochs, case TRANS uses 50 epochs, and both cases adopt an NMS IOU threshold of 0.6. The other training conditions align with those established during preliminary training.
4.7. Experiment Results
The results of this investigation are compared via AP values for the test dataset independently across each of the four sections, as illustrated in Figure 16, followed by a comparison of the absolute AP values. In Section 1, case ORG and case TRANS yield AP values of 0.985 and 0.956, respectively. While the AP value of case ORG is 0.029 higher, the absolute values indicate that both cases exhibit a high level of OD performance. For Section 2, case ORG and case TRANS register AP values of 0.951 and 0.917, respectively, signaling a comparable level of OD performance, with case ORG higher by 0.034; in this section, both cases can be credited with commendable OD performance. However, within Section 3, the AP values for case ORG and case TRANS stand at 0.610 and 0.795, respectively. Compared to Section 2, the disparity in OD performance becomes notably pronounced: the AP value of case TRANS decreases by 0.122, whereas that of case ORG decreases by 0.341. Moving to Section 4, the contrast between case ORG and case TRANS becomes substantially more pronounced, with AP values of 0.069 for case ORG and 0.952 for case TRANS. Compared to Section 3, the AP value of case TRANS is actually enhanced by 0.157, whereas case ORG experiences a significant drop of 0.541. Given case ORG's AP value, it is evident that detecting vehicles within the distant Section 4 is particularly challenging due to the vehicles' diminutive size.
4.8. Discussion
In summation, the evaluation of vehicle movement within the original image (OI) reveals a notable discrepancy between the AV and the actual velocity, coupled with a rapid diminution in object size as the vehicle recedes from the tunnel CCTV. This occurrence is predominantly attributed to the perspective effect, which significantly influences the training and inference phases of deep learning OD applications.
The reason for this can be seen in the change in object area and shape with distance, as shown in Figure 7b and Figure A3. In Figure 7b, the object area for a single vehicle is larger than 100 pixels² within −100 m, which can be trained and inferred by the deep learning OD model, but it shrinks rapidly beyond −100 m and becomes almost as small as a dot, as shown in Figure A3c,d. This size effect could be a major reason for the poor OD performance in the −100 to −200 m range. Nevertheless, the application of OI exhibits reasonable OD performance within a relatively proximate range of around 100 m.
However, OD performance experiences a marked decline beyond this 100 m threshold. Consequently, it can be deduced that OD efficacy, relying on the original CCTV image, is confined to a distance limit of 100 m within the tunnel environment. This limitation is attributed to the common installation practice of positioning CCTVs at lower heights due to spatial constraints.
In contrast, the utilization of TI through the IPT introduced in this study indicates that AV tends to closely match the actual velocity, regardless of the distance between the vehicle and the CCTV. This holds true even beyond distances of 100 m, extending at least up to 200 m. Furthermore, since object size remains relatively constant with distance in TI, the deep learning model trained on TI consistently demonstrates improved and stable OD performance across an ROI up to 200 m.
The reason for the consistent OD performance up to 200 m for the TI image can be found in Figure 7b and Figure A3. As shown in Figure 7b, the object area in case TRANS tends to increase, and the object height also increases slightly, as the vehicle moves from 0 m to −200 m in Figure A3. Considering that deep learning OD models are affected by object size, the experimental results for case TRANS provide good evidence that the difficulty of OD at distance in tunnel environments can be overcome, at least up to 200 m.
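As a final illustration of why TI simplifies downstream velocity analysis, the sketch below estimates AV from the bounding-box centers of one tracked vehicle in two consecutive frames; the frame rate and scale are placeholder assumptions. In TI, the constant meters-per-pixel scale makes this single product valid at any distance, whereas in OI the scale shrinks with distance and AV increasingly underestimates the true velocity.

```python
# Sketch of apparent velocity (AV) estimation from two consecutive detections
# of the same tracked vehicle; fps and the scale are placeholder assumptions.
def apparent_velocity(center_prev, center_curr, fps=30, meters_per_pixel=200 / 324):
    """Centers are (x, y) pixel coordinates; y is taken as the road axis in TI."""
    dy = center_curr[1] - center_prev[1]          # per-frame displacement in pixels
    return abs(dy) * meters_per_pixel * fps       # m/s; constant scale only in TI
```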
5. Conclusions
This paper introduces an IPT, aiming to mitigate the pronounced perspective effect inherent in tunnel CCTVs. To ascertain the viability of this technique, both OI and TI were generated from tunnel CCTV footage taken at a simulated virtual tunnel site, all within the same region of interest (ROI).
Subsequently, through a preliminary comparison of AV and vehicle size in relation to the actual position of vehicles moving at a consistent speed, TI demonstrated the capacity to mitigate the velocity and size alterations caused by the perspective effect. Specifically, while vehicles maintained a constant velocity, AV declined noticeably as they moved away from the tunnel CCTV in OI, whereas in TI, AV remained comparatively close to the constant velocity.
Expanding on this, deep learning datasets were established for both OI and TI to train the deep learning model under identical conditions. The OD performance of this model was then assessed across four distance intervals spanning 50 m to 200 m, noting that 200–250 m is a standard installation distance interval for tunnel CCTVs in South Korea.
From this comprehensive investigation, a number of key conclusions can be drawn:
- (1) When the CCTV installation distance interval in a tunnel is less than 100 m, the utilization of both OI and TI yields acceptable OD performance. Beyond that interval, however, existing tunnel CCTV-based accident detection systems predominantly reliant on OI may encounter considerable difficulties, not only in OD but also in unforeseen accident detection.
- (2) Conversely, when the spacing between tunnel CCTV installations exceeds 100 m, the utilization of TI becomes advantageous, since OI-based OD does not exhibit reliable performance beyond 100 m from the CCTV installation position. This implies that tunnel sites with CCTV installation intervals of around 200 m, adhering to Korean regulations, can expect acceptable automatic accident detection performance through TI without necessitating additional CCTV installations.
- (3) In TI, while the size of vehicle objects remains consistent across distances within tunnels, distant objects inherently experience more substantial degradation in sharpness than nearby objects. Nevertheless, OD performance analysis using the AP metric demonstrates distance-consistent performance in TI, suggesting that, compared with object size, the impact of sharpness on OD performance is negligible.
- (4) When identifying the direction and velocity of vehicle movement to detect unexpected accidents such as abrupt stops and backward travel on tunnel roads, consistent AV evaluation from CCTV images proves notably beneficial, because it assists the object tracking of driving vehicles [52]. Nevertheless, the significant discord between AV and actual velocity in OI necessitates intricate algorithmic corrections that account for vehicle distance to prevent false detections. Consequently, a TI-based tunnel CCTV accident detection system could offer a simplified accident detection implementation with more consistent and stable performance.
In future studies, the effectiveness of TI through IPT will be re-evaluated in real tunnel sites, which introduce more complex backgrounds and a wider range of AV variations, as well as potential vehicle overlapping phenomena. Additional results will be reported in due course.