Article

Route Positioning System for Campus Shuttle Bus Service Using a Single Camera

Jhonghyun An
School of Computing, Gachon University, Seongnam-si 1332, Gyeonggi-do, Republic of Korea
Electronics 2024, 13(11), 2004; https://doi.org/10.3390/electronics13112004
Submission received: 13 April 2024 / Revised: 15 May 2024 / Accepted: 17 May 2024 / Published: 21 May 2024
(This article belongs to the Special Issue Computer Vision Applications for Autonomous Vehicles)

Abstract

A route positioning system identifies the current route segment while a vehicle drives from one stop to the next, a capability commonly needed in public transportation systems such as shuttle buses that follow fixed routes. It is especially useful for smaller-scale services, where expensive localization technology and sensors may not be feasible. Moreover, in urban areas with tall buildings or in mountainous regions with dense trees, relying solely on GPS can lead to many errors. Therefore, this paper proposes a cost-effective solution that uses a single camera sensor to determine the location of small-scale transportation services on fixed routes. A single-stage detection network quickly identifies objects, which are then tracked with a simple algorithm, and the detected features are compiled into a “codebook” using the bag-of-visual-words technique. During actual trips, this pre-built codebook is compared with the landmarks the camera observes, and this comparison determines the route currently being traveled. To test the effectiveness of this approach, this paper used the route of a shuttle bus on the Gachon University campus, which resembles both a downtown area with tall buildings and a wooded mountainous area. The results showed that the shuttle bus’s route was recognized with an accuracy of 0.60. Areas with distinct features were recognized with an accuracy of 0.99, while stops with simple, nondescript structures were recognized with an accuracy of 0.29. Additionally, applying the SORT algorithm slightly improved the accuracy from 0.60 to 0.61. This demonstrates that the proposed method can effectively perform location recognition using only cameras in small shuttle buses.

1. Introduction

As advancements in autonomous driving technology (ADT) and Information and Communication Technology (ICT) accelerate, the transportation sector is changing rapidly [1]. These changes include new services across many industries enabled by autonomous technologies. The leading force behind them is the autonomous vehicle (AV) industry, which is growing fast and reshaping public perceptions. Many large companies, such as Hyundai, General Motors (GM), Honda, Naver, Alphabet, and Baidu, are competing in this industry, all investing heavily in research and development to lead this new market. Autonomous driving is divided into six levels, of which Levels 3 (conditional automation) and 4 (high automation) are seen as particularly important stages at which autonomous vehicles become practical [2].
The rise of driverless taxi and bus services shows how quickly technology is progressing in this field. Companies like GM’s Cruise and Alphabet’s Waymo are leading the way with Level 4 autonomous services in San Francisco [3]. In China, Baidu offers similar cutting-edge services, showing the huge potential of autonomous driving [4]. In addition, pilot programs in South Korea’s Sangam district show that this trend is global and that efforts to make autonomous transportation a reality are underway worldwide.
However, there are many challenges to making autonomous driving technology (ADT) widely available. One major problem is the high cost of producing and operating the sensors and software required for autonomous vehicles [5,6,7,8]. These advanced and expensive components are crucial for keeping self-driving cars safe and reliable, but their cost makes it hard to include them in small-scale services, which are important for helping the general public become accustomed to autonomous driving technology.
In response to these challenges, this paper proposes a new, low-cost solution for small-scale transportation services. A route positioning system identifies the current route while driving from one stop to the next, a capability commonly found in public transportation systems such as shuttle buses that follow fixed routes, and it is especially useful for smaller-scale services. The proposed system uses landmarks to determine where a vehicle is on its route, relying only on an ordinary camera and an object detection and tracking algorithm. It applies the bag-of-visual-words method [9,10,11] to recognize landmarks along a route that has already been mapped, making it easier to determine exactly where the vehicle is and when it needs to stop. As a result, the method does not rely on expensive sensors, yet the system still performs well.
This research has implications not only for current services but also for the future. It provides a way for smaller transportation services to adopt autonomous driving technology without excessive cost, offering a practical and affordable solution that could change how autonomous vehicles are used. By relying on cheaper technology, it could make public transportation better and more efficient and could reshape how small transportation services operate in the future.
The rest of this paper is structured as follows. In Section 2, we discuss the related works in the field. Section 3 outlines the theoretical formulation of the proposed method. Moving on to Section 4, we describe the dataset configuration and the data logging system and present the experimental results. Finally, Section 5 provides a summary of this paper.

2. Related Works

To understand autonomous driving technology (ADT) better, it is crucial to look at previous research on object detection [12,13], tracking [14,15], vehicle localization [16,17], and lane detection [18,19]. This section explores the existing literature that forms the basis of our proposed method. This paper mainly focuses on using monocular cameras, which are a more affordable option compared to the expensive sensors often used in autonomous vehicles.
The cornerstone of ADT is the accurate detection and tracking of objects in the vehicle’s vicinity. Recent advancements have been significantly propelled by deep learning algorithms. R-CNN (Regions with CNN features) [20] is a model designed for object detection that selects areas within an image where objects are likely to be found and then uses convolutional neural networks (CNNs) to extract features from each region. This model was the first in the object detection field to apply deep learning, significantly improving performance compared to previous models. While R-CNN delivers accurate results, it is criticized for its slow computational speed.
To address the speed limitations of R-CNN, the Fast R-CNN [21] and Faster R-CNN [13] models were proposed. Fast R-CNN enhances speed by extracting features across the entire image with a single CNN computation and employing ROI (Region-of-Interest) pooling to process various areas. However, it still faces slow computational speeds in generating region proposals and cannot apply deep learning in this step. Faster R-CNN evolved from Fast R-CNN by replacing the selective search algorithm used in region proposal generation with a Region Proposal Network (RPN), which predicts object locations, thus providing faster speeds and higher accuracy.
The YOLO (You Only Look Once) [22] algorithm, introduced in 2016, divides the image into a grid and simultaneously predicts bounding boxes and class probabilities for objects in each grid cell. While the Faster R-CNN model achieved 7 FPS, YOLO achieved 45 FPS, enabling real-time object detection. Currently, YOLO has been updated to version 5, YOLOv5, which reduces the model size and adjusts the depth and width multipliers, allowing for selection based on the application environment. It distributes the computational load evenly across layers, eliminating bottlenecks and enhancing both speed and accuracy.
YOLOv5 is renowned for being very fast, user-friendly, and offering almost state-of-the-art results. YOLOX [23], on the other hand, has introduced some architectural innovations compared to its predecessors. YOLOX is built on the foundation of YOLOv3, but it moves away from the anchor-based design that was used heavily in YOLOv4 and YOLOv5. Instead, it adopts an anchor-free approach that directly predicts bounding boxes without using predefined anchors. This anchor-free design allows YOLOX to handle objects of various shapes and sizes more efficiently and flexibly, often leading to reduced processing times because it needs fewer predictions.
Notably, the YOLO series has demonstrated remarkable efficacy in real-time object detection, making it a preferred choice for applications requiring high speed and accuracy. Concurrently, the SORT (Simple Online and Real-Time Tracking) algorithm has gained prominence for its ability to track objects across frames with minimal processing overhead, thereby ensuring the system’s responsiveness and reliability [24]. Following these developments, the SSD (Single Shot Multi-Box Detector) [25] model was introduced, utilizing multi-scale feature maps to detect objects of various sizes and predict bounding boxes and class probabilities from each feature map, achieving high accuracy and fast processing speeds.
Deep learning-based object tracking methods improve tracking robustness by learning complex patterns and features from large datasets. These methods offer real-time processing speeds and high accuracy, allowing for simultaneous detection and tracking through integration with various object detection algorithms. Bewley et al. introduced SORT as a novel approach to object tracking [24]. This method performs tracking using only the object detection information within the current frame. A key feature of SORT is its use of bounding box overlap techniques for real-time tracking, ensuring fast processing times while maintaining high accuracy. SORT therefore performs exceptionally well in real-time tracking scenarios, although it may have some limitations in maintaining object continuity. Henriques et al. [26] proposed a fast tracking method using the KCF (Kernelized Correlation Filter). KCF allows for stable tracking even at high frame rates but struggles with issues such as overlapping objects or occlusions in certain situations. Considering these limitations, SORT offers superior performance in terms of efficiency and robustness in real-time and online environments, making it suitable for this study. More recently, Wojke et al. proposed an enhanced SORT method integrated with a deep association metric [25]. This integrated approach is used in conjunction with deep learning-based object detection algorithms such as YOLO and aims to measure the association between objects more precisely, improving both the accuracy and speed of object tracking. Notably, the method ensures fast tracking performance while accurately determining the association between objects. Thus, this study employs the SORT method combined with deep learning-based object detection algorithms.
Visual sensor-based location recognition has garnered significant attention due to its potential to enable precise navigation without reliance on extensive sensor arrays. Mur-Artal et al. introduced ORB-SLAM, a versatile and accurate monocular SLAM (Simultaneous Localization and Mapping) system [27]. The study is important for its use of ORB features to achieve real-time performance in various environments, demonstrating the potential of visual sensors in accurately understanding and navigating spaces. Chen et al. explored the use of convolutional neural networks (CNNs) for place recognition [28]. Their method, which leverages deep learning to interpret complex urban scenes from visual data, marks an essential step forward in utilizing visual sensors for location recognition, showcasing improved robustness against environmental changes and occlusions. Lai and Fox proposed an incremental learning approach for visual navigation, addressing the challenge of dynamic environments [29]. Their method dynamically updates the visual recognition model as new data become available, demonstrating an adaptive strategy for visual sensor-based location recognition that evolves over time. Radwan et al. introduced a novel approach to visual localization that integrates RGB images with depth data for enhanced robustness in various lighting conditions [30]. The study underscores the benefits of cross-modal data integration in improving the accuracy and reliability of location recognition, particularly in environments subject to significant changes. By identifying distinct landmarks from video feeds, such systems provide a practical solution for autonomous vehicles and robots to navigate complex cityscapes, highlighting the effectiveness of visual sensors in extracting meaningful navigational cues from the environment. In addition, to use only a single camera, various techniques are sometimes employed. U2D2Net combines dehazing and noise removal into a single end-to-end trainable model, utilizing unsupervised learning techniques. This means it does not require paired training data of hazy and haze-free images, which can be scarce and costly to acquire [31]. Alternatively, by adding separate sensors, it is possible to operate in both the spatial and frequency domains, extracting relevant features from IR images. This allows intricate details and patterns to be captured, improving the recognition of complex environments from IR images [32].

3. Proposed Method

This paper proposes a process for recognizing the location of public transportation that repeatedly operates on a fixed route. Although systems such as GPS, which provide absolute location information, are well established, their data may be unreliable in areas with many tunnels, mountainous terrain, and dense urban environments with many buildings. Therefore, this paper proposes a route recognition method for shuttle buses that make repeated trips using only a single camera, such as a dashcam (black box), mounted on all vehicles. In addition, instead of using features without semantic information, such as SIFT [33], SURF [34], or ORB [35] used in existing BoVW methods, this paper proposes a semantic information-based codebook that uses the class of an object as a feature. By using this codebook and the combinations of objects detected while driving, the process can quickly identify the current position of the shuttle bus on its route. This process is illustrated in Figure 1. First, when a raw image is input, a single-stage object detection network is used to identify objects (object detection). A simple tracking algorithm then tracks these objects (object tracking). The detected objects are then compared with a previously produced visual codebook (visual codebook comparison). Through this comparison, the current driving route is finally determined.

3.1. Object Detection and Tracking

This paper used the YOLO algorithm, a single-stage detection method, for feature extraction due to its ability to balance high detection performance with rapid computation times, ensuring real-time processing. This efficiency makes it exceptionally well suited for applications requiring immediate responses, such as autonomous vehicle navigation or real-time processing. YOLO’s architecture allows it to simultaneously predict both the classes of objects present and their locations within the image.
To address the challenge of maintaining consistent identification of objects across consecutive frames, the proposed method integrates the SORT (Simple Online and Real-Time Tracking) algorithm with YOLO, as shown in Figure 2. SORT is designed to track objects as they move through a scene, assigning a unique ID to each detected feature. This ensures that an object detected in one frame can be recognized as the same object in subsequent frames, even if its position changes. The algorithm achieves this by analyzing the movement and appearance of each bounding box over time, applying a simple yet effective model to predict the object’s future location. By combining YOLO’s rapid and accurate detection capabilities with SORT’s efficient tracking, our system can continuously monitor and identify objects within a dynamic environment. This integration forms the core of our approach to real-time feature extraction and object tracking, laying the groundwork for applications that require immediate and precise recognition of features in their operational context.
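To make this detection-plus-tracking loop concrete, the following is a minimal sketch rather than the exact implementation used in this work. It assumes a YOLOv5 model loaded through PyTorch Hub and the reference SORT implementation by Bewley et al. (a Sort class whose update() method takes [x1, y1, x2, y2, score] detections and returns boxes with track IDs); the video path, model size, and confidence threshold are illustrative.

```python
# Sketch: per-frame YOLOv5 detection fed into SORT for persistent track IDs.
# The Sort class is assumed to be Bewley et al.'s reference implementation
# (sort.py); the video path, model size, and threshold are illustrative only.
import cv2
import torch
from sort import Sort  # hypothetical local copy of the reference tracker

model = torch.hub.load("ultralytics/yolov5", "yolov5n")  # lightweight detector
tracker = Sort(max_age=5, min_hits=2)                    # simple online tracker

cap = cv2.VideoCapture("shuttle_run.mp4")                # illustrative video file
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = model(rgb)
    det = results.xyxy[0].cpu().numpy()    # rows: [x1, y1, x2, y2, conf, cls]
    det = det[det[:, 4] > 0.4]             # confidence threshold (assumed)
    tracks = tracker.update(det[:, :5])    # rows: [x1, y1, x2, y2, track_id]
    for x1, y1, x2, y2, tid in tracks:
        # The same physical landmark keeps one ID across frames, so it can be
        # counted once when the segment's codebook histogram is built.
        print(f"track {int(tid)}: box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
cap.release()
```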

3.1.1. Codebook Creation

One of the key steps in our study involves extracting visual features from objects around shuttle bus routes, tracking them using the SORT algorithm, and creating a codebook using the bag-of-visual-words (BoVW) technique. In our proposed system, feature extractors such as the Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB) can be utilized to identify key points in images. However, this system is specifically applied to public transportation, such as shuttle buses, that repeatedly traverses the same spaces. Therefore, the classes of objects present near bus stops are used as the features, and combinations of patterns observable in local images near each bus stop are generated as vectors. These vectors efficiently summarize the spatial information near the bus stops. The next step involves using the extracted feature vectors to create a codebook. This paper uses the K-means clustering algorithm to group similar feature vectors. The centroids of these clusters, representing visual words, collectively form the codebook. The size of the codebook, or the number of clusters, is determined experimentally, reflecting the complexity and diversity of the environment being analyzed. Finally, for each image (or tracked object), this paper determines which visual word each feature vector belongs to by calculating the distance between the feature vector and each visual word in the codebook. Subsequently, for each image, this paper computes the frequency with which each visual word occurs, creating a histogram.
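The codebook-building step can be illustrated with the rough sketch below. It is not the exact code used in this paper: the number of classes, the way tracked objects are grouped into segments, and the codebook size are assumptions, and scikit-learn's KMeans stands in for the K-means clustering described above.

```python
# Sketch: semantic BoVW codebook from per-segment object-class counts.
# NUM_CLASSES, the segment grouping, and CODEBOOK_SIZE are assumptions.
import numpy as np
from sklearn.cluster import KMeans

NUM_CLASSES = 35          # number of landmark classes (signs, bumps, lamps, ...)
CODEBOOK_SIZE = 16        # number of visual words; chosen experimentally

def class_histogram(tracked_classes, num_classes=NUM_CLASSES):
    """Count how often each object class was observed in one route segment."""
    hist = np.zeros(num_classes)
    for c in tracked_classes:
        hist[c] += 1
    return hist

# Each entry: class IDs of objects tracked while driving through one segment.
segments = [
    [0, 5, 5, 9, 26],      # e.g., near 'Main Gate' (illustrative)
    [3, 3, 21, 22, 22],    # e.g., near 'Main Library' (illustrative)
    # ... one list per recorded traversal of each segment
]
features = np.stack([class_histogram(s) for s in segments])

# Cluster the per-segment feature vectors; centroids become the visual words.
kmeans = KMeans(n_clusters=min(CODEBOOK_SIZE, len(features)), n_init=10, random_state=0)
kmeans.fit(features)

def bovw_histogram(feature_vectors, codebook):
    """Assign feature vectors to their nearest visual words and count them."""
    words = codebook.predict(np.atleast_2d(feature_vectors))
    return np.bincount(words, minlength=codebook.n_clusters)
```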

3.1.2. Route Positioning

The next step involves utilizing Term Frequency–Inverse Document Frequency (TF-IDF) for location recognition. TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus [36]. In the context of our system, “words” correspond to the visual patterns identified near bus stops, and “documents” refer to the images or sequences of images captured from specific locations along the shuttle bus’s route.
Term Frequency (TF) assesses the frequency of a visual pattern within a particular image, indicating the significance of that pattern in representing the image’s content. Conversely, Inverse Document Frequency (IDF) measures how common or rare a pattern is across all images taken along the shuttle route. A pattern that frequently appears near many different bus stops may be less informative for distinguishing between those locations, hence its lower IDF value. The TF-IDF weight is computed as
$$ W_{x,y} = tf_{x,y} \times \log\frac{N}{df_{x}} \qquad (1) $$
In Equation (1), $W_{x,y}$ denotes the TF-IDF weight of word $x$ in document $y$, and $tf_{x,y}$ is the term frequency, indicating how often the word appears in the document. The IDF (Inverse Document Frequency) term reflects the rarity of each word across the document set: $N$ is the total number of documents, and $df_{x}$ is the number of documents in which word $x$ appears. Through TF and IDF, each codebook is converted into a unique vector.
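A minimal sketch of this weighting step, under the assumption that the BoVW histograms from the previous step are stacked row-wise (one row per reference location), is shown below; the helper name tfidf_weight and the normalization of the term-frequency counts are illustrative choices.

```python
# Sketch: TF-IDF weighting of BoVW histograms, one row per reference location.
import numpy as np

def tfidf_weight(histograms):
    """histograms: (num_locations, num_visual_words) array of raw word counts."""
    hist = np.asarray(histograms, dtype=float)
    n_docs = hist.shape[0]
    # Term frequency: normalize counts within each location's histogram.
    tf = hist / np.maximum(hist.sum(axis=1, keepdims=True), 1e-9)
    # Document frequency: in how many locations does each visual word appear?
    df = np.maximum((hist > 0).sum(axis=0), 1)
    idf = np.log(n_docs / df)                 # Equation (1): log(N / df_x)
    return tf * idf                           # W_{x,y} = tf_{x,y} * log(N / df_x)
```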
By combining TF and IDF, the proposed method can assign a weighted value to each visual pattern in an image, reflecting its prevalence within that specific image and its uniqueness across images from different locations. This allows for a more nuanced representation of each image, facilitating more accurate and efficient location recognition. In essence, TF-IDF helps highlight the most characteristic patterns relevant to identifying each bus stop’s vicinity, making it a powerful tool for spatial information analysis in the context of public transportation systems.
$$ S_C(A, B) := \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}} \qquad (2) $$
Subsequently, after transforming the data into vectors, the shuttle bus’s location is identified using (2), where A represents the pre-existing code map and B is the codebook detected under the current driving conditions. The cosine similarity between these vectorized codebooks is calculated, and the vehicle’s location is recognized by selecting the codebook whose similarity value is closest to 1.
This process is based on the idea that vectors representing similar places will align better in a space with many dimensions, resulting in a higher cosine similarity score. The closer this score is to 1, the more alike the content of the codebooks. This means that the visual patterns seen from the vehicle’s current location are very similar to those linked to a specific bus stop. This method effectively uses the spatial features stored in the codebooks to accurately pinpoint the vehicle’s location along its route.
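Building on the hypothetical tfidf_weight helper above, the matching step can be sketched as follows: each reference segment's TF-IDF vector is compared with the vector observed during the current drive using Equation (2), and the highest-scoring segment is reported. Function and variable names are illustrative.

```python
# Sketch: pick the route segment whose TF-IDF vector best matches the live one.
import numpy as np

def cosine_similarity(a, b):
    """Equation (2): cos(theta) = (a . b) / (||a|| * ||b||)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def locate(reference_vectors, live_vector, stop_names):
    """reference_vectors: (num_stops, num_words) array of TF-IDF vectors."""
    scores = [cosine_similarity(ref, live_vector) for ref in reference_vectors]
    best = int(np.argmax(scores))
    return stop_names[best], scores[best]

# Usage (illustrative): stop_names = ['Main Gate', 'Tunnel', ...]
# name, score = locate(tfidf_weight(reference_histograms), live_tfidf, stop_names)
```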

3.1.3. Route Dataset

For object detection, data were collected directly by riding the campus shuttle and filming videos from fixed positions on the vehicle. Because the shuttle repeatedly traverses the same specific route, the data needed to cover repetitive scenarios under various environmental conditions. To address this, filming was carried out continuously on the operating campus shuttle, and video data were collected to capture diversity in time of day (morning, noon, and afternoon), season, and weather. To account for seasonal effects, data were collected in spring, summer, autumn, and winter, and to reduce the impact of weather, the collection covered cloudy, clear, and rainy conditions. The videos were then broken down into frames to create training images, and these images were labeled to identify specific objects or classes within them. The resulting dataset’s statistics by season, time of day, and weather are shown in Figure 3.
Figure 4 presents a chart classifying the predefined classes, including road signs, speed bumps, and other structures, totaling 35 categories that help pinpoint locations. This paper used approximately 6700 images as training data.
This process involved the following steps:
  • Direct Labeling: This paper manually labeled the captured images to determine the location and category of each object. This crucial step helps to train the object detection model using accurate and real-world data.
  • Environmental Considerations: During the data collection, this paper accounted for factors that might affect the performance of object detection, such as changes in the landscape with the seasons, the positioning of the camera, and varying lighting conditions throughout the day. This method ensures that the model can reliably operate under a broad range of conditions.
  • Defining Classes: The dataset created in this paper included 37 different object classes, reflecting the variety of objects found in the target area. This diversity allows the model to learn a wide range of features for precise detection.
  • Dataset Size: The dataset created in this paper contained 6700 images for training and 776 images for testing. This substantial amount of data provides a strong basis for the model to learn and assess its detection abilities.

4. Experimental Results

This paper evaluated the object detection capabilities using the YOLO model. The data for this study were meticulously curated through the creation of a custom dataset specifically labeled to fit the real-world application of the Gachon University shuttle bus route. The development of this custom dataset was critically informed by the need to account for the impact of a variety of factors on object detection performance, including seasonal landscape changes, changes in camera position, and variations in lighting conditions at different times of the day.
The dataset compilation process spanned all four seasons to cover a wide range of environmental settings. In total, the custom dataset contained 37 individual classes created to reflect the diverse array of objects within this study’s operational scope, with a training set of 6700 images and a test set of 776 images, providing a practical basis for training and evaluating the object detection model. The constructed dataset was trained with YOLOv5s on an NVIDIA GeForce RTX 3080 GPU with a batch size of 4 per GPU. The optimizer was SGD with a momentum of 0.937, and all training runs used the same hyperparameters for 100 epochs.

4.1. Object Detection

In this paper, the YOLO algorithm was used as the single-stage detection method for feature extraction due to its ability to balance high detection performance with rapid computation times, ensuring real-time processing. When considering the relationship between the number of parameters and performance, the ‘nano’ version of YOLOv5 was found to be the most efficient model, as shown in Table 1: despite using relatively few parameters, it demonstrated comparably high performance. This finding is particularly relevant for the implementation of real-time object detection and tracking systems, since larger models incur a higher computational cost, which can affect the speed of real-time processing. This paper also used the mean Average Precision (mAP) metric to evaluate the performance of the object detection and tracking system. Given the nature of this research, which requires considering not only object detection but also tracking accuracy, both the mAP_0.5 and mAP_0.5:0.95 criteria were compared. Here, mAP_0.5 represents the mean precision at an Intersection over Union (IoU) threshold of 0.5, whereas mAP_0.5:0.95 averages the precision across IoU thresholds from 0.5 to 0.95 in increments of 0.05. The results comparing the various versions of YOLOv5 are presented in Table 2.
The analysis indicated a trend where larger YOLO models corresponded with higher mAP values. This suggests that models with a greater number of parameters can achieve higher precision in object detection. In other words, models incorporating more parameters are capable of learning more complex features, which contributes to the accurate identification of tracking targets. Therefore, the outcomes of this research underscore the importance of not solely focusing on the size of the model when optimizing the performance of object detection and tracking systems. It highlights the necessity of balancing the number of parameters with performance. Especially in applications requiring real-time processing, choosing efficient models like the nano version plays a crucial role in achieving a balance between performance and speed.

4.2. Route Positioning

For route positioning, this study used a bag-of-words (BoW) approach to segment the route into distinct sections, subsequently creating a codebook for each segment. These generated codebooks compiled objects detected over a predetermined time frame, facilitating the comparison of segments based on codebook similarity to ascertain the vehicle’s location. As shown in Figure 5, the route of the shuttle at Gachon University was partitioned into several sections demarcated between shuttle bus stops: ‘Main Gate’, ‘Tunnel’, ‘Education’, ‘Main Library’, ‘Student Center’, ‘AI Building’, ‘Main Library’, ‘Rotary’, and ‘Art College Building’. To evaluate the performance of the proposed method, experiments were conducted using the collected video datasets across the four seasons: spring, summer, fall, and winter. The outcomes of the location estimations amounted to 165 results, which were presented in the form of a confusion matrix.

4.2.1. Object Detection Only

In the initial experiments, objects were accurately detected with a precision of 0.5 or higher in six of the nine zones (Main Gate, Tunnel, Education, Main Library, Student Building, and Rotary), as shown in Table 3. However, in the AI Building and Art Building sections, the detection precision was lower, at approximately 0.34 and 0.29, respectively. This is because the objects that could serve as features in those stop sections are smaller than along other parts of the route, and the buildings there have simple structures that produced many false positives. These errors were judged to have a negative impact on the similarity with a codebook built only from object detection frequencies.

4.2.2. Object Tracking

To address this issue, the SORT algorithm was applied to regenerate the codebook based not on the frequency of object detection but on the continuity of object presence. With this revised codebook, the same video data were subjected to another round of experiments. The results from this re-experimentation also yielded 165 location estimations, which are depicted in a confusion matrix format in Table 4. The re-experimentation provided the same number of location estimation results as the initial tests but allowed for more accurate and reliable location estimations using the improved codebook. By applying the SORT algorithm, the system could achieve more precise object detection and tracking based on continuity rather than detection frequency, with reduced influence from the GPU. This improvement is particularly significant in enhancing object detection and tracking performance in complex environments, especially in accurately detecting smaller objects. These outcomes suggest that the proposed method can effectively address various conditions and scenarios in real-world environments. Future research will aim to validate this approach under a broader range of environments and conditions and explore ways to further optimize its performance.
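As an illustration of this change, the sketch below counts each persistent track ID once per route segment instead of counting every frame-level detection; the function name and histogram layout are assumptions, not the paper's exact data structures.

```python
# Sketch: build class histograms from persistent track IDs (continuity of
# presence) instead of raw per-frame detection counts.
import numpy as np

def histogram_from_tracks(frame_tracks, num_classes=35):
    """frame_tracks: iterable of (track_id, class_id) pairs gathered over a
    route segment; each physical object is counted once, however many frames
    it stays visible."""
    seen = {}
    for track_id, class_id in frame_tracks:
        seen[track_id] = class_id          # latest class label per track
    hist = np.zeros(num_classes)
    for class_id in seen.values():
        hist[class_id] += 1
    return hist
```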
Comparing the experimental results using the codebook with and without SORT, the ‘Education’ section increased from 0.50 to 0.63, the ‘Main Library’ section from 0.67 to 0.79, and the ‘AI Building’ section from 0.34 to 0.45. However, some areas were still difficult to recognize using only the codebook combined with features tracked by SORT. Therefore, considering the repetitive nature of shuttle bus routes, this paper also incorporated relationship information with the next stop as the bus proceeds from the current stop along the route. For example, after passing the ‘Student Building’ on the route in Figure 5, the bus must pass the ‘AI Building’ next; thus, when passing the ‘AI Building’, prior information indicating that the previous stop was the ‘Student Building’ was used to reduce misrecognition of the route. Through this, the detection performance of the second ‘Main Library’ section increased from 0.27 with simple detection alone to 0.39 when SORT and prior information about the previous stop were used together. Similarly, the ‘Art Building’ improved from 0.29 to 0.35.
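A rough sketch of how this prior-stop constraint can be layered on top of the similarity matching is given below. The stop list follows Figure 5, but the transition rule (stay at the current stop or advance to the next one on the loop) and the function name are illustrative simplifications of the approach described above.

```python
# Sketch: restrict route matching to stops reachable from the previously
# recognized stop, exploiting the fixed, repetitive shuttle route.
import numpy as np

STOPS = ["Main Gate", "Tunnel", "Education", "Main Library", "Student Building",
         "AI Building", "Main Library 2",  # second pass past the library (name illustrative)
         "Rotary", "Art Building"]

def locate_with_prior(scores, prev_idx=None):
    """scores: cosine similarities against each stop's codebook, one per stop.
    If the previous stop is known, only staying put or moving to the next stop
    on the loop is allowed, which suppresses misrecognition of similar stops."""
    scores = np.asarray(scores, dtype=float)
    if prev_idx is None:
        return int(np.argmax(scores))
    allowed = {prev_idx, (prev_idx + 1) % len(STOPS)}
    masked = np.full_like(scores, -np.inf)
    for i in allowed:
        masked[i] = scores[i]
    return int(np.argmax(masked))
```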

5. Conclusions

The results of this study show that it is possible to create an effective and affordable autonomous driving system for small-scale transportation on fixed routes using only one camera. The proposed method combines a simple camera system with straightforward detection and tracking methods, allowing routes to be recognized accurately by comparing detected landmarks with a pre-built list of route landmarks. The system was tested on a university campus shuttle bus, confirming its functionality across a route comprising nine sections over a short distance of approximately 2.5 km. This suggests that the proposed method can be expected to work, at least to some extent, on buses operating on public roads. In addition, by using the bag-of-visual-words technique, the proposed system keeps costs low while still providing a reliable way to identify routes. In conclusion, this study demonstrates a cost-effective approach to implementing autonomous driving technology in small-scale transportation environments. In future work, machine learning algorithms will be used to improve this method, making it more accurate and more versatile in handling complex paths and dynamic conditions. This could help expand the use of this affordable autonomous driving solution to various types of small-scale transport, making autonomous driving more accessible to everyone.

Funding

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2022-00165870). This work was also supported by the Gachon University research (GCU-202400470001) and by LIGNEX1.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, H.K. Current Status of Chinese Autonomous Vehicle Industry. Auto J. 2021, 43, 50–52. [Google Scholar]
  2. National Highway Traffic Safety Administration. The Road to Full Automation. Available online: https://www.nhtsa.gov/technology-innovation/automated-vehicles-safety (accessed on 12 October 2023).
  3. Waymo. Waymo Official Website. Available online: https://waymo.com/ (accessed on 12 October 2023).
  4. Apollo, B. Apollo Official Website. Available online: https://www.apollo.auto/ (accessed on 12 October 2023).
  5. Bresson, G.; Alsayed, Z.; Yu, L.; Glaser, S. Simultaneous Localization and Mapping: A Survey of Current Trends in Autonomous Driving. IEEE Trans. Intell. Veh. 2017, 2, 194–220. [Google Scholar] [CrossRef]
  6. Moon, Y.G. Trends in LiDAR Sensor Technology for Autonomous Vehicles. 2017. Available online: https://www.earticle.net/Article/A301487 (accessed on 12 October 2023).
  7. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Anguelov, D. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  8. Park, H. Implementation of lane detection of autonomous vehicles using GPS’ location information system. J. Korean Soc. Commun. Stud. 2018, 43, 1152–1162. [Google Scholar]
  9. Name, A. Image Retrieval based on Bag-of-Words model. arXiv 2013, arXiv:1304.5168. [Google Scholar]
  10. Kamath, S.S. A Bag of Visual Words Model for Medical Image Retrieval. In Proceedings of the 7th International Engineering Symposium (IES 2018), Kumamoto University, Kumamoto, Japan, 7–9 March 2018. [Google Scholar] [CrossRef]
  11. Name, A. Image Classification with Classic and Deep Learning Techniques. arXiv 2021, arXiv:2105.04895. [Google Scholar]
  12. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef]
  14. Leal-Taixé, L.; Pons-Moll, G.; Rosenhahn, B. Simple Online and Realtime Tracking with a Deep Association Metric. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 564–576. [Google Scholar]
  15. Kim, J.; Li, Z.; Cipolla, R. Multiple Object Tracking using K-Shortest Paths Optimization. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 591–605. [Google Scholar]
  16. Lucas, B.D.; Kanade, T. A Multi-Sensor Fusion System for Moving Object Detection and Tracking in Urban Driving Environments. In Proceedings of the IEEE Intelligent Vehicles Symposium, Xi’an, China, 3–5 June 2009; IEEE: New York, NY, USA, 2009. [Google Scholar]
  17. Bailey, T.; Nieto, J.; Guivant, J. Probabilistic Vehicle Localization in Urban Environments. In Proceedings of the IEEE International Conference on Robotics and Automation, Orlando, FL, USA, 15–19 May 2006; IEEE: New York, NY, USA, 2006. [Google Scholar]
  18. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.; Monfort, M.; Muller, U.; Zhang, J.; et al. End-to-End Learning of Lane Following in Urban Environments. arXiv 2016, arXiv:1604.07316. [Google Scholar]
  19. Milan, A.; Rehder, J.; Schindler, K.; Roth, S. A Fast and Accurate Unconstrained Lane Detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 1911–1918. [Google Scholar]
  20. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014. [Google Scholar]
  21. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Columbus, OH, USA, 23–28 June 2015. [Google Scholar]
  22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  23. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2001–2010. [Google Scholar]
  24. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  25. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  26. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [PubMed]
  27. Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  28. Chen, Z.; Jacobson, A.; Sünderhauf, N.; Upcroft, B.; Liu, L.; Shen, C.; Milford, M. Deep Learning Features at Scale for Visual Place Recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: New York, NY, USA, 2017; pp. 3223–3230. [Google Scholar]
  29. Lai, K.; Fox, D. Incremental Learning for Visual Navigation in Dynamic Environments. J. Field Robot. 2019, 36, 134–156. [Google Scholar]
  30. Radwan, N.; Valada, A.; Burgard, W. VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry. IEEE Robot. Autom. Lett. 2018, 3, 4407–4414. [Google Scholar] [CrossRef]
  31. Ding, B.; Zhang, R.; Xu, L.; Liu, G.; Yang, S.; Liu, Y.; Zhang, Q. U2D2Net: Unsupervised Unified Image Dehazing and Denoising Network for Single Hazy Image Enhancement. IEEE Trans. Multimed. 2024, 26, 202–217. [Google Scholar] [CrossRef]
  32. Zhang, R.; Xu, L.; Yu, Z.; Shi, Y.; Mu, C.; Xu, M. Deep-IRTarget: An Automatic Target Detector in Infrared Imagery Using Dual-Domain Feature Extraction and Allocation. IEEE Trans. Multimed. 2022, 24, 1735–1749. [Google Scholar] [CrossRef]
  33. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  34. Bay, H.; Tuytelaars, T.; Gool, L.V. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar] [CrossRef]
  35. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the International Conference on Computer Vision, Washington, DC, USA, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
  36. Spärck Jones, K. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
Figure 1. Illustration of the proposed method. First, when a raw image is input, a single-stage object detection network is used to identify objects (object detection). A simple tracking algorithm then tracks these objects (object tracking). The detected objects are then compared with a previously produced visual codebook (visual codebook comparison). Through this comparison, the current driving route is finally determined.
Figure 2. Schematic of the Simple Online and Real-Time Tracking algorithm.
Figure 3. Statistical information by season, time of day, and weather in the constructed dataset.
Figure 4. Classification chart of predefined object classes. (1) 20 km Sign (2) Arrow Sign (3) Bicycle Sign (4) Building Sign (5) Bump Sign (6) Bump (7) Campus Map (8) Campus Map Sign (9) Crosswalk Sign (10) Crosswalk (11) Diamond Marker (12) Disabled Person Sign (13) Plan Card (14) Market Sign (15) Mirror (16) No-U-turn Sign (17) No-Bicycle Sign (18) Rotation Sign (19) Red Rotation Sign (20) Slow Sign (21) Snow Removal Box (22) Station Sign (23) Shuttle Bus Station (24) Straight Marker (25) Straight-Left Marker (26) Straight-Right Marker (27) Street-Lamp (Hat-shape type) (28) Street-Lamp (M-shape type) (29) Street-Lamp (L-shape type) (30) Street-Lamp (T-shape type) (31) Turn-Left Marker (32) Turn-Left and Right Marker (33) Turn-Right Marker (34) University Logo (35) Vehicle-Breaker.
Figure 5. Sections of the shuttle’s route at Gachon University.
Table 1. Comparison of YOLO network parameters and performance.

| | YOLOv5n | YOLOv5s | YOLOX-s | YOLOv5m | YOLOX-m | YOLOv5l | YOLOX-l | YOLOv5x | YOLOX-x |
|---|---|---|---|---|---|---|---|---|---|
| Params (M) | 1.90 | 7.20 | 9.0 | 21.2 | 25.3 | 46.5 | 54.2 | 86.7 | 99.1 |
| FLOPs @640 (B) | 4.50 | 16.5 | 26.2 | 49.0 | 73.2 | 109.1 | 155.0 | 205.7 | 281.3 |
Table 2. Quantitative results for detection by YOLO’s various networks.

| Model | mAP @0.5 | mAP @0.5:0.95 | Params (M) | FLOPs @640 (B) | mAP/Params @0.5 | mAP/Params @0.5:0.95 |
|---|---|---|---|---|---|---|
| YOLOv5n | 0.9754 | 0.8153 | 1.9 | 4.5 | 0.5134 | 0.5134 |
| YOLOv5s | 0.9765 | 0.8162 | 7.2 | 16.5 | 0.1356 | 0.1356 |
| YOLOv5m | 0.9772 | 0.8170 | 21.2 | 49.0 | 0.0461 | 0.0461 |
| YOLOv5l | 0.9776 | 0.8174 | 46.5 | 109.1 | 0.0210 | 0.0210 |
| YOLOv5x | 0.9777 | 0.8175 | 86.7 | 205.7 | 0.0113 | 0.1130 |
Table 3. Detection results for route positioning without the SORT algorithm.

| Actual \ Predicted | Main Gate | Tunnel | Education | Main Library | Student Building | AI Building | Main Library | Rotary | Art Building |
|---|---|---|---|---|---|---|---|---|---|
| Main Gate | 0.83 | 0.00 | 0.00 | 0.00 | 0.02 | 0.11 | 0.00 | 0.00 | 0.00 |
| Tunnel | 0.13 | 0.72 | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 0.00 | 0.00 |
| Education | 0.04 | 0.00 | 0.50 | 0.00 | 0.00 | 0.17 | 0.00 | 0.18 | 0.18 |
| Main Library | 0.00 | 0.00 | 0.00 | 0.67 | 0.00 | 0.23 | 0.41 | 0.00 | 0.35 |
| Student Building | 0.00 | 0.00 | 0.00 | 0.13 | 0.98 | 0.00 | 0.14 | 0.00 | 0.00 |
| AI Building | 0.00 | 0.05 | 0.37 | 0.00 | 0.00 | 0.34 | 0.05 | 0.18 | 0.18 |
| Main Library | 0.00 | 0.00 | 0.00 | 0.20 | 0.00 | 0.00 | 0.27 | 0.00 | 0.00 |
| Rotary | 0.00 | 0.06 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 0.64 | 0.00 |
| Art Building | 0.00 | 0.17 | 0.00 | 0.00 | 0.00 | 0.01 | 0.13 | 0.00 | 0.29 |
Table 4. Detection results for route positioning with the SORT algorithm.

| Actual \ Predicted | Main Gate | Tunnel | Education | Main Library | Student Building | AI Building | Main Library | Rotary | Art Building |
|---|---|---|---|---|---|---|---|---|---|
| Main Gate | 0.83 | 0.00 | 0.00 | 0.00 | 0.01 | 0.14 | 0.00 | 0.00 | 0.00 |
| Tunnel | 0.13 | 0.72 | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 0.00 | 0.06 |
| Education | 0.04 | 0.00 | 0.63 | 0.00 | 0.00 | 0.07 | 0.00 | 0.18 | 0.12 |
| Main Library | 0.00 | 0.00 | 0.12 | 0.79 | 0.00 | 0.07 | 0.43 | 0.00 | 0.35 |
| Student Building | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | 0.05 | 0.04 | 0.00 | 0.00 |
| AI Building | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.45 | 0.00 | 0.18 | 0.06 |
| Main Library | 0.00 | 0.00 | 0.00 | 0.21 | 0.00 | 0.05 | 0.39 | 0.00 | 0.00 |
| Rotary | 0.00 | 0.06 | 0.13 | 0.00 | 0.00 | 0.03 | 0.01 | 0.64 | 0.06 |
| Art Building | 0.00 | 0.17 | 0.12 | 0.00 | 0.00 | 0.00 | 0.13 | 0.00 | 0.35 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
