1. Introduction
Autonomous vehicles (AVs) have gained significant popularity in recent years owing to the ongoing revolution in modern transportation systems. An autonomous vehicle is a self-driving vehicle capable of perceiving its outer environment and moving with little or no human involvement. Several renowned reports and surveys predict that by 2030, autonomous vehicles will be capable and reliable enough to replace most human driving [1,2]. In this scenario, many new methods are being proposed to facilitate autonomous vehicles' vision perception, sensing of the outer environment, safety aspects, traffic laws and regulations, accident liability, and maintenance of the surrounding map [3,4,5].
An autonomous vehicle relies on multiple sensors, complex algorithms, actuators, machine learning tools, computer vision techniques, and reliable processors to operate [6,7]. The vehicle perceives the outer environment with the help of numerous sensors and makes decisions with the assistance of computer vision [8,9]. The configuration and mechanism of each sensor vary; for example, in [10], a sideslip angle estimation algorithm for autonomous vehicles is proposed. The algorithm is based on a consensus Kalman filter that fuses measurements from a reduced inertial navigation system (R-INS), a global navigation satellite system (GNSS), and a linear vehicle-dynamics-based sideslip estimator.
Over the last few decades, advanced driver assistance systems (ADAS) have been widely adopted to avoid traffic accidents and to improve driving comfort in autonomous vehicles [11]. ADAS are safe and secure systems designed to decrease the human error rate; they assist the driver through advanced technologies to drive safely and thus improve driving performance. Several state-of-the-art methods have employed an inertial measurement unit (IMU) and a global navigation satellite system (GNSS) for vehicle localization. In [12], the authors proposed a method for estimating the sideslip angle and attitude of an automated vehicle using an IMU and GNSS; the method is designed to be robust against the effects of vehicle parameters, road friction, and low-sample-rate GNSS measurements. In [13], a method is proposed for estimating the yaw misalignment of an IMU mounted on a vehicle.
ADAS utilize a combination of multiple sensors to perceive the outer environment and then either offer useful information to the driver or take necessary actions such as applying the brakes, changing lanes, or turning left or right. These systems are very helpful for decreasing traffic congestion and smoothing traffic movement [14,15]. In the last three decades, multiple ADAS features have been proposed, including cruise control, antilock braking, auto-parking, power steering, lane centering, collision warnings, and others [16,17,18]. In autonomous vehicles, cameras are generally used as vision sensors. Vision-based ADAS utilize multiple cameras to capture images, analyze them, and take appropriate actions whenever needed.
In state-of-the-art methods, multiple advanced ADAS features have been proposed. In [19], Liu et al. proposed a framework using SVM-based trail detection to achieve trail direction estimation and tracking in a real-time environment. The vision-based framework is capable of detecting and tracking trails, as well as scene understanding, using a quadrotor UAV operator. Yang et al. [20] proposed two frameworks that show how CNNs perceive and process driving scenes by distinguishing visual regions. Gao et al. [21] proposed a 3D surround-view method for ADAS that covers the automobiles around the vehicle and helps the driver stay aware of the outer environment.
Liu et al. [22] presented a novel algorithm for detecting tassels in maize using UAV-based RGB imagery. The algorithm, named YOLOv5-tassel, is based on the YOLOv5 object detection framework and incorporates several modifications to improve its performance on tassel detection: a bidirectional feature pyramid network to effectively fuse cross-scale features, a robust attention module to extract the features of interest before each detection head, and an additional detection head to improve small-size tassel detection. Xia et al. [23] proposed a novel data acquisition and analytics platform for vehicle trajectory extraction, reconstruction, and evaluation based on connected automated vehicle (CAV) cooperative perception. The platform is designed to be holistic and capable of processing sensor data from multiple CAVs.
Wang et al. [24] proposed a lane-changing model for deciding whether to change lanes and for producing trajectories. The model analyzes the vehicle kinematics of different states, inter-vehicle distances, and comfort-level requirements. Chen et al. [25] proposed an instructor-like assistance system to avoid collision risk, in which the driver and the assistance system jointly control the vehicle. Gilbert et al. [26] proposed an efficient decision-making model that selects the maneuver with the least possible collision risk for an AV; the model combines vehicle dynamics and maneuver trajectory paths with multi-attribute decision-making techniques to produce simulation results. Gao et al. [27] proposed a new vehicle localization system that is based on vehicle chassis sensors and considers vehicle lateral velocity. The system is designed to improve the accuracy of stand-alone vehicle localization in highly dynamic driving conditions during GNSS outages.
Xia et al. [10] presented a new algorithm for estimating the sideslip angle of an autonomous vehicle. The algorithm uses consensus and vehicle kinematics/dynamics synthesis to enhance the accuracy of the estimation under normal driving conditions. It employs a velocity-based Kalman filter to estimate the errors of the reduced inertial navigation system (R-INS) and a consensus Kalman information filter to estimate the heading error. The consensus framework combines a novel heading error measurement from a linear vehicle-dynamics-based sideslip estimator with the heading error from the global navigation satellite system (GNSS) course. Liu et al. [28] proposed a novel kinematic-model-based vehicle slip angle (VSA) estimation method that fuses information from a GNSS and an IMU. The method is designed to be robust against the effects of vehicle roll and pitch, a low GNSS sampling rate, and GNSS signal delay.
Since the early days of mechanical vehicles, safety has been one of the key concerns in automotive systems. Several attempts have been made to address safety concerns by developing safe and secure systems that protect the driver as well as prevent injury to pedestrians [29,30]. One such safety aspect of an autonomous vehicle arises when the driver is preoccupied with searching for a desired location to stop. With our proposed system, the safety of an AV can be increased drastically, since the AV automatically recognizes the desired locations in its surroundings. Rather than continually searching for the desired locations, the proposed system automatically recognizes the textual cues present in the outer environment and suggests that the driver stop.
While driving on the road, an AV performs multiple operations such as lane changing, lane keeping, overtaking, and following the traffic rules. Several studies have proposed and developed methods for ADAS [31,32]. It is equally important for an autonomous vehicle to be aware of the textual cues appearing in its outer environment in order to make a decision or assist the driver, either to stop or to keep driving. Thus, the key concern of this paper is to reduce human intervention in autonomous vehicles.
In this paper, we propose a novel intelligent system, driven by the driver's instruction, for finding desired locations using textual cues present in the outer environment for advanced driver assistance systems. For this, we combine computer vision and natural language processing (NLP) techniques to perceive textual cues. Computer vision methods enable the system to interpret and perceive the visual world around the autonomous vehicle, while NLP techniques provide the system with the ability to read, recognize, and derive meaning from the textual cues appearing in front of the autonomous vehicle. The key contributions of this paper are as follows:
A novel intelligent system is proposed for AVs to find unsupervised locations.
The proposed system is capable of sensing the textual cues that appear in the outer environment for determining desired locations.
The proposed system is a novel development in the list of ADAS features of an autonomous vehicle.
With the proposed system, the driver's effort in finding the desired locations will be drastically decreased.
The remainder of the paper is organized as follows. Section 2 describes the proposed system for finding the desired locations with textual cues and their formation as keywords. Section 3 presents the experimental results that show the efficiency and accuracy of the proposed system. Section 4 concludes the proposed work and presents future directions.
2. Proposed System
In this section, we propose a novel intelligent system to find the desired locations using textual cues for an autonomous vehicle. First, the driver inputs one or more keywords to the proposed system to find the desired locations. Second, the system detects and localizes the textual cues appearing in the outer environment and forms keywords from them using detection and recognition methods. Finally, the system performs similarity learning to measure the similarity between the input keywords and the keywords localized in the outer-environment images. The schematic diagram of the proposed intelligent system is shown in Figure 1.
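To make the data flow concrete, a minimal Python sketch of the pipeline is given below. The function names (detect_and_recognize_keywords, similarity, find_desired_location), the difflib-based stand-in similarity, and the threshold value are illustrative assumptions rather than the implementation described in the following subsections.

from difflib import SequenceMatcher
from typing import List, Optional

def detect_and_recognize_keywords(frame) -> List[str]:
    # Placeholder for the detection, filtering, grouping, and recognition stages
    # described in Sections 2.1 and 2.2; a real implementation would return the
    # keywords read from the camera frame.
    return []

def similarity(query: str, keyword: str) -> float:
    # Stand-in string similarity; the learned similarity of Section 2.3 would be used instead.
    return SequenceMatcher(None, query.lower(), keyword.lower()).ratio()

def find_desired_location(frame, driver_keywords: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the first environment keyword that matches a driver keyword, if any."""
    recognized = detect_and_recognize_keywords(frame)
    for query in driver_keywords:
        for keyword in recognized:
            if similarity(query, keyword) >= threshold:
                return keyword   # the system can then suggest that the driver stop here
    return None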
2.1. Textual Cues Detection
In order to detect, localize, and form the keywords from the outer environment, we employ a text detection and localization technique. First, we use an affine transformation to deal with the global distortion appearing within an input image and to rectify the text toward a more horizontal orientation. The transformation takes an input image $U \in \mathbb{R}^{C \times H \times W}$ with channels $C$, height $H$, and width $W$ to produce an output image $V$. The affine transformation between the input image $U$ and the output image $V$ is given as:

$$\begin{pmatrix} x^{s} \\ y^{s} \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x^{t} \\ y^{t} \\ 1 \end{pmatrix}$$

where $(x^{s}, y^{s})$ are the source coordinates of the input image $U$ and $(x^{t}, y^{t})$ are the required coordinates for the output image $V$. The output image $V$ is further rectified from the input image $U$ using bilinear interpolation, given as:

$$V_{ij} = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm} \max\left(0,\, 1 - \left|x^{s}_{j} - m\right|\right) \max\left(0,\, 1 - \left|y^{s}_{i} - n\right|\right)$$

where $V_{ij}$ is the pixel value of the rectified image $V$ at the location $(i, j)$ and $U_{nm}$ is the pixel value of the input image $U$ at the location $(n, m)$.
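As an illustration of this rectification step, the following sketch applies a 2 x 3 affine transform with bilinear sampling using OpenCV; the rotation angle and image path are assumed example values, and cv2.warpAffine is used here only as one possible realization of the transformation described above.

import cv2

def rectify_text_region(image, theta):
    """Apply a 2 x 3 affine transform with bilinear interpolation to a camera frame."""
    h, w = image.shape[:2]
    # cv2.warpAffine samples the source image with bilinear interpolation
    # (cv2.INTER_LINEAR), which corresponds to the rectification step above.
    return cv2.warpAffine(image, theta, (w, h), flags=cv2.INTER_LINEAR)

if __name__ == "__main__":
    img = cv2.imread("frame.jpg")                              # example frame (path is illustrative)
    h, w = img.shape[:2]
    # Example transform: correct a small in-plane rotation of 5 degrees about the image centre.
    theta = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle=-5.0, scale=1.0)
    rectified = rectify_text_region(img, theta)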
2.1.1. Textual Candidates Detection
The textual candidate detection aims to extract the positions of textual regions in the outer environment. Since the text appearing in the outer environment generally has a distinct contrast to its relative background and a uniform color intensity, the maximally stable extremal region (MSER) technique is a suitable choice, as it is widely used and considered one of the best region detectors [33]. In order to detect the textual candidates appearing in the outer environment, we adopt the MSER approach for finding the corresponding candidates within the input image $V$. For finding the extremal regions in the input image, the intensity-difference (stability) criterion is given as:

$$q(i) = \frac{\left|Q_{i+\Delta} \setminus Q_{i-\Delta}\right|}{\left|Q_{i}\right|}$$

where $|Q_{i}|$ represents the area of the extracted extremal region, $Q_{i}$ represents the extremal region at intensity level $i$, $\Delta$ specifies the intensity increment applied to each extremal region $Q_{i}$, and $q(i)$ shows the area difference between the two regions relative to the region area. After applying the region detector, the obtained extremal regions are shown in Figure 2.
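A minimal sketch of this candidate-detection step, assuming OpenCV's MSER implementation and an example delta value, is shown below.

import cv2

def extract_mser_candidates(image_path: str, delta: int = 5):
    """Return MSER point sets and bounding boxes (x, y, w, h) for a grayscale frame."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create(delta)                 # delta is the intensity increment of the stability test
    regions, bboxes = mser.detectRegions(gray)
    return regions, bboxes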
2.1.2. Textual Candidates Filtering
The textual regions detected in the previous step using the MSER technique are further refined and rectified. First, we validate the size and the aspect ratio of each candidate using geometric properties for textual candidate filtering, given as:

$$\alpha_{1} \leq \frac{h}{w} \leq \alpha_{2}, \qquad \beta_{1} \leq h \cdot w \leq \beta_{2}$$

where $h$ and $w$ are the height and width of the aligned bounding box of the segmented region, respectively, and $\alpha_{1}$, $\alpha_{2}$, $\beta_{1}$, and $\beta_{2}$ are components to fine-tune.
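A simple sketch of this geometric check is given below; the numeric thresholds are illustrative assumptions rather than the tuned components used in our experiments.

def keep_candidate(w: int, h: int,
                   min_aspect: float = 0.1, max_aspect: float = 10.0,
                   min_area: int = 30, max_area: int = 50000) -> bool:
    """Accept a candidate box if its aspect ratio and area fall within plausible text ranges."""
    aspect = h / float(w)
    area = w * h
    return (min_aspect <= aspect <= max_aspect) and (min_area <= area <= max_area)

def filter_candidates(bboxes):
    """bboxes: iterable of (x, y, w, h) tuples produced by the MSER stage."""
    return [box for box in bboxes if keep_candidate(box[2], box[3])]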
For each candidate region of the input image $V$ of size $H \times W$, the recognizer produces a predicted character sequence together with an uncertain per-character probability sequence, where $D$ and $K$ represent the character sequence lengths. The resulting input vector is combined from the following properties. The first four properties are probability characteristics, in which the mean represents the overall confidence score and the minimum represents the least likely character. The next two properties, controlled by a constant parameter, are used to normalize the number of characters between 0 and 1. The final two properties describe the character width, calculated as per the geometric properties of the region.
The localized regions that satisfy the above properties are then processed further, and the remaining regions are discarded, as shown in Figure 3. The obtained localized regions still contain non-textual regions and may produce false results during recognition. We therefore further segment the textual regions using the stroke responses of each image pixel. The corner points are used as the edges of two strokes, and the corner points together with the stroke points establish the distortion of the strokes. For this, we follow the corner detection approach [34], which applies the following selection criteria. First, the matrix $M$ for each pixel is calculated as follows:

$$M = \sum_{u,v} w(u,v) \begin{bmatrix} I_{x}^{2}(u,v) & I_{x}(u,v)\, I_{y}(u,v) \\ I_{x}(u,v)\, I_{y}(u,v) & I_{y}^{2}(u,v) \end{bmatrix}$$

where $w(u,v)$ represents the weight at position $(u,v)$ relative to the window center, and $I_{x}(u,v)$ and $I_{y}(u,v)$ denote the gradient values of the pixel at position $(u,v)$. The eigenvalues $\lambda_{1}$ and $\lambda_{2}$ of the matrix $M$ are calculated as:

$$\lambda_{1,2} = \frac{\operatorname{tr}(M)}{2} \pm \sqrt{\left(\frac{\operatorname{tr}(M)}{2}\right)^{2} - \det(M)}$$
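The following sketch illustrates this corner measure, assuming Sobel gradients and Gaussian window weights; it is one standard way to realize the matrix M and its eigenvalues, not necessarily the exact configuration of [34].

import cv2
import numpy as np

def structure_tensor_eigenvalues(gray, ksize: int = 3, sigma: float = 1.0):
    """Per-pixel eigenvalues of the windowed gradient matrix M."""
    gray = gray.astype(np.float32)
    Ix = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=ksize)     # horizontal gradient I_x
    Iy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=ksize)     # vertical gradient I_y
    # Gaussian window weights w(u, v) applied to the second-moment entries.
    Ixx = cv2.GaussianBlur(Ix * Ix, (0, 0), sigma)
    Iyy = cv2.GaussianBlur(Iy * Iy, (0, 0), sigma)
    Ixy = cv2.GaussianBlur(Ix * Iy, (0, 0), sigma)
    trace = Ixx + Iyy
    det = Ixx * Iyy - Ixy * Ixy
    disc = np.sqrt(np.maximum(trace ** 2 / 4.0 - det, 0.0))
    lam1 = trace / 2.0 + disc
    lam2 = trace / 2.0 - disc
    return lam1, lam2        # both eigenvalues large -> corner (stroke junction or endpoint)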
To compute the turning point of the outer stroke endpoints, we follow [35]: the turning point is determined from the coordinates of the endpoints of the strokes, the coordinates of the outermost points, and the coordinates of every single point on the curve. The outer stroke points are then determined from the individual points along the x-axis and y-axis of the word image.
Given a corner point and its adjacent corner, the height and width of a moving window between them are determined, scaled by a coefficient $\mu$ that normalizes the area of the moving region among the corner points and is set between 0 and 1. Moreover, the moving area of the outer strokes for a given side length is scaled by a coefficient $\nu$ that normalizes the moving regions among the outer strokes and is also set between 0 and 1. The final filtered localized textual regions are shown in Figure 4.
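A small sketch of the moving-window construction is given below; the rule of scaling the corner separation by the coefficient mu is an assumption consistent with the description above, not the exact formulation.

def moving_window(corner_a, corner_b, mu: float = 0.5):
    """Return (width, height) of a window spanning two adjacent corners, scaled by mu in (0, 1]."""
    (xa, ya), (xb, yb) = corner_a, corner_b
    width = mu * abs(xb - xa)       # assumed rule: scale the horizontal corner separation by mu
    height = mu * abs(yb - ya)      # assumed rule: scale the vertical corner separation by mu
    return max(width, 1.0), max(height, 1.0)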
2.1.3. Keywords Grouping and Recognition
The localized textual regions from the previous steps consist of individual text characters. In order to recognize and understand the meaning of these textual regions, the individual characters must be combined into text lines. In this way, the localized textual regions represent more meaningful information about the outer environment than individual characters do. For example, a localized textual region containing the word "SCHOOL" conveys a clear meaning, whereas the individual character set {C,O,L,O,S,H} loses that meaning due to the unordered sequence of the word [36,37,38].
In order to form the ordered keywords, we employ the grouping approach [39]. The key idea is to apply a rectangle to each connected region, characterized by its center and orientation. Each associated region is considered to be a keyword candidate. The initial candidate regions are refined into keywords according to the following properties:
- (1)
Two adjacent textual candidates are associated into a new, larger candidate.
- (2)
A keyword candidate obtained by combining two candidates is accepted if it is curvilinear.
If the centers of the connected regions in a candidate can be fitted well by a kth-order polynomial, evaluated at the rotated center points and scored by the average fitting score, then the candidate keyword is determined to be curvilinear. The bounding boxes are applied to the character set of the textual regions, as shown in Figure 5.
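The curvilinear grouping criterion can be illustrated with the following sketch, which fits a kth-order polynomial to the region centers with numpy; the fit-error threshold is an assumed value.

import numpy as np

def is_curvilinear_group(centers, k: int = 2, max_rmse: float = 3.0) -> bool:
    """Accept a keyword candidate if its region centres fit a k-th order polynomial well.

    centers: list of (x, y) centres of the connected regions in one candidate.
    """
    pts = np.asarray(centers, dtype=float)
    if len(pts) <= k:                       # too few characters to judge the curve
        return True
    coeffs = np.polyfit(pts[:, 0], pts[:, 1], k)
    fitted = np.polyval(coeffs, pts[:, 0])
    rmse = float(np.sqrt(np.mean((pts[:, 1] - fitted) ** 2)))
    return rmse <= max_rmse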
The grouped keywords from the localized textual regions are further processed for recognition. The cropped word images of a given width and height containing the textual cues are recognized individually. The inputs are 2D feature maps, resulting in one response map for each individual character hypothesis.
Given the response maps $\Phi$ and the confidence score for an individual word hypothesis $w$, let $b = (b_{0}, b_{1}, \ldots, b_{K})$ represent the breakpoints amid the individual characters, where $b_{0}$ initializes the first character and $b_{K}$ ends the last character. The breakpoint hypothesis for the word confidence score is given as:

$$s(w, b \mid \Phi) = \sum_{k=0}^{K} \phi_{u}(b_{k}, w \mid \Phi) + \sum_{k=0}^{K-1} \phi_{p}(b_{k}, b_{k+1}, w \mid \Phi)$$

Each individual hypothesis word $w$ is optimized over its breakpoints, and the word having the optimal score is recognized as:

$$w^{*} = \arg\max_{w} \; \max_{b} \; s(w, b \mid \Phi)$$
The unary scores (the first term above) are determined by the following properties: the distance from outside the image boundaries, the distance from the estimated breakpoint location, the binary fraction score, the non-text class score, and the distance of the first and last breakpoints from the edge of the image. The pairwise scores (the second term) are determined by the following properties: non-text scores at character centers, character scores at midpoints amid breakpoints, eccentricity from the normalized character width, and active contributions of the left and right binary responses relative to the character scores.
The bounding boxes are applied to the recognized words in order to match the evaluated breakpoints, and the recognized bounding boxes are added to the queue of recognized words. The recognized keywords are shown in Figure 6.
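For illustration, the breakpoint optimization can be realized with a generic dynamic program over breakpoint positions, as sketched below; the unary and pairwise score functions are placeholders for the terms listed above, and the toy example simply favors evenly spaced breakpoints.

from functools import lru_cache
from typing import Callable, List, Tuple

def best_breakpoints(width: int, num_chars: int,
                     unary: Callable[[int], float],
                     pairwise: Callable[[int, int], float]) -> Tuple[float, List[int]]:
    """Return the best total score and breakpoint columns b_0..b_K for a word image."""

    @lru_cache(maxsize=None)
    def best(k: int, col: int) -> Tuple[float, Tuple[int, ...]]:
        # Breakpoint k is placed at column `col`; recurse over the position of breakpoint k + 1.
        if k == num_chars:                       # b_K closes the last character
            return unary(col), (col,)
        best_score, best_rest = float("-inf"), ()
        for nxt in range(col + 1, width):
            score, rest = best(k + 1, nxt)
            score += pairwise(col, nxt)          # pairwise term between consecutive breakpoints
            if score > best_score:
                best_score, best_rest = score, rest
        return unary(col) + best_score, (col,) + best_rest

    total, bps = max(best(0, c) for c in range(width - num_chars))
    return total, list(bps)

# Toy example: a 40-column word image with 4 characters; the pairwise term prefers
# breakpoints spaced roughly 10 columns apart.
score, bps = best_breakpoints(40, 4, unary=lambda c: 0.0,
                              pairwise=lambda a, b: -abs((b - a) - 10))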
2.2. Textual Cues Keywords
The localized textual regions are optimized with the OCR, and the formal words are recognized, thus providing a sensible meaning. In this step, we utilize the recognized formal words to establish a word model that is responsible for sequences and boundaries. Since the recognized textual cues may still be missing some characters, which may affect finding the desired locations, we employ an n-gram probabilistic language model that provides evidence for the presence of the actual cues [40].
An n-gram model is generally used to predict the probability of a given n-gram in any contiguous sequence of words; a better n-gram model predicts the next word in a sentence. For example, given the word 'park', if the first recognized trigram is 'par' and the second recognized trigram is 'ark', then their overlapping characters 'ar' suggest that the correctly recognized word is likely to be 'park'.
Given a word $w$ of length $L$ as a sequence of characters $w = (c_{1}, c_{2}, \ldots, c_{L})$, where each $c_{i}$ denotes the character at the $i$th position in word $w$, drawn from 26 letters and 10 digits, each recognized word has a varying length $L$ that can only be determined at run time. Therefore, the number of characters in a single word is fixed to 22 by means of a null character and a maximum-length class, which is given as:

$$c_{i} \in \{a, \ldots, z\} \cup \{0, \ldots, 9\} \cup \{\varnothing\}, \qquad L \leq 22$$

where $\varnothing$ denotes the null character.
For two strings $s$ and $w$, $s \sqsubseteq w$ denotes that $s$ is a substring of the word $w$. An $n$-gram of $w$ is a substring $s \sqsubseteq w$ having the length $n$. The dictionary of all grams of word $w$ of length at most $N$ is given as:

$$G_{N}(w) = \{\, s : s \sqsubseteq w,\; |s| \leq N \,\}$$

As an example, with $N = 3$, the dictionary for the word 'cafe' is $G_{3}(\text{cafe}) = \{\text{c}, \text{a}, \text{f}, \text{e}, \text{ca}, \text{af}, \text{fe}, \text{caf}, \text{afe}\}$.
Given the recognized $i$th n-gram $g_{i}$ and its corresponding confidence score $s(g_{i})$, in order to determine the sequence of n-grams with the most confident prediction for the entire sequence of the recognized word, the objective function can be given as:

$$S(w) = \max_{g_{1} g_{2} \cdots g_{m} = w} \; \sum_{i=1}^{m} s(g_{i})$$

where $g_{1} g_{2} \cdots g_{m} = w$ denotes that the concatenation of the n-grams $g_{i}$ reproduces the word $w$. Here, $S(w)$ is used to achieve the optimal n-gram separation of the given word, and each n-gram word image is recursively recognized.
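The n-gram dictionary and a simple word-scoring variant can be sketched as follows; the confidence values in the example are toy numbers, and the summed-confidence scoring is one plausible instantiation of the objective rather than the exact model output.

def ngram_dictionary(word: str, max_n: int = 3):
    """All contiguous substrings (n-grams) of `word` up to length max_n."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

def word_score(word: str, gram_confidence, max_n: int = 3) -> float:
    """Sum of confidences of the n-grams present in the word (one simple scoring choice)."""
    return sum(gram_confidence.get(g, 0.0) for g in ngram_dictionary(word, max_n))

if __name__ == "__main__":
    print(sorted(ngram_dictionary("cafe")))   # the 'cafe' dictionary from the example above
    conf = {"par": 0.9, "ark": 0.8, "ar": 0.7}
    print(word_score("park", conf))           # 2.4: overlapping trigrams support the word 'park'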
2.3. Similarity Learning
Similarity learning finds and matches images similar to the user-input keywords [41,42,43]. The proposed intelligent system matches the user-input keywords with the textual cues of the outer environment. For this, we create feature vectors of the user-input keywords and of the recognized textual cues from the outer-environment images.
Given the input keywords, each word $w$ is treated as a sequence of characters $(c_{1}, c_{2}, \ldots, c_{L})$, where $L$ denotes the total number of characters in word $w$ and $e(c_{i})$ is considered the optimal representation of the $i$th character of word $w$. Each sequence is interpolated and concatenated into a fixed-length feature, and all the features are denoted as the output features $F$.
The recognized textual cue proposals and the input keywords are formed, and the similarity is computed as a similarity matrix $S$ between the input keywords and the recognized textual cues. The score $S_{ij}$ between the two feature vectors $f_{i}$ and $g_{j}$ is given as:

$$S_{ij} = \frac{\operatorname{vec}(f_{i})^{\top}\, \operatorname{vec}(g_{j})}{\left\| \operatorname{vec}(f_{i}) \right\| \, \left\| \operatorname{vec}(g_{j}) \right\|}$$

where $\operatorname{vec}(\cdot)$ represents the operator that converts a 2D matrix into a 1D vector. The required similarity matrix $S$ is supervised by the target similarity matrix $\hat{S}$. The target similarity $\hat{S}_{ij}$ is computed from the Levenshtein distance between the corresponding textual pairs $(p_{i}, q_{j})$ and is given as:

$$\hat{S}_{ij} = 1 - \frac{\operatorname{lev}(p_{i}, q_{j})}{\max\left(|p_{i}|, |q_{j}|\right)}$$
Meanwhile, during implementation, for ranking purposes, the similarity between the input keywords and the recognized textual words is taken as the maximum value of $S_{ij}$ over all recognized proposals.
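A minimal sketch of the target similarity computation and the ranking rule is shown below, assuming the normalized Levenshtein form given above; the example strings are illustrative.

import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def target_similarity(queries, proposals) -> np.ndarray:
    """Normalised Levenshtein similarity between every query keyword and recognised proposal."""
    S = np.zeros((len(queries), len(proposals)))
    for i, q in enumerate(queries):
        for j, p in enumerate(proposals):
            S[i, j] = 1.0 - levenshtein(q.lower(), p.lower()) / max(len(q), len(p), 1)
    return S

if __name__ == "__main__":
    S = target_similarity(["school"], ["SCHOOL", "SCH00L", "HOSPITAL"])
    print(S, S.max(axis=1))   # ranking uses the maximum similarity per input keyword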