Article

A New Principle toward Robust Matching in Human-like Stereovision

School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore
* Author to whom correspondence should be addressed.
Biomimetics 2023, 8(3), 285; https://doi.org/10.3390/biomimetics8030285
Submission received: 19 June 2023 / Revised: 27 June 2023 / Accepted: 29 June 2023 / Published: 2 July 2023
(This article belongs to the Special Issue Artificial Intelligence for Autonomous Robots 2023)

Abstract

Visual signals are the most important source of information for robots, vehicles, or machines seeking human-like intelligence. Human beings depend heavily on binocular vision to understand the dynamically changing world. Similarly, intelligent robots or machines must also have the innate capability of perceiving knowledge from visual signals. To this day, one of the biggest challenges faced by intelligent robots or machines is matching in stereovision. In this paper, we present the details of a new principle toward achieving a robust matching solution, which leverages the use and integration of a top-down image sampling strategy, hybrid feature extraction, and the Restricted Coulomb Energy (RCE) neural network for incremental learning (i.e., cognition) as well as robust match-making (i.e., recognition). A preliminary version of the proposed solution has been implemented and tested with data from the Maritime RobotX Challenge. The contribution of this paper is to attract more research interest and effort toward this new direction, which may eventually lead to the development of the robust solutions expected by future stereovision systems in intelligent robots, vehicles, and machines.

1. Introduction

We are living inside an ocean of signals. Among all these signals, visual signals are of the utmost importance, because without them human beings would not be able to undertake many activities in the physical world. Similarly, visual signals are extremely important to today’s autonomous robots, vehicles, and machines [1]. Hence, research on enabling robots, vehicles, and machines to gain human-like intelligence from visual signals should never be undermined or overshadowed by the development of alternative sensors such as radar [2] and LiDAR [3].
In this paper, we present a new principle which addresses the most difficult problem in stereovision: achieving stereo matching that is as robust as possible [4]. The motivation behind our research comes from projects dedicated to the development of intelligent humanoid robots [5] as well as autonomous surface vehicles [6], as shown in Figure 1.
For both platforms in Figure 1, intelligence and autonomy depend greatly on the outer loop consisting of perception, planning, and control. Among all possible modalities of perception, visual perception is a very important one. In particular, the goal of achieving human-like visual perception must start with the use of binocular vision, or stereovision [7]. Hence, research on human-like stereovision should be a non-negligible topic in both artificial intelligence and robotics. The purpose of this paper is therefore to present a new principle which advances the current state of the art in developing human-like stereovision for autonomous robots, vehicles, and machines [8].
The remainder of the paper is organized as follows: Section 2 outlines the biggest challenge faced by stereovision. Section 3 briefly discusses similar works dedicated to stereo matching. Section 4 describes the proposed new principle for addressing the stereo matching problem. Sections 5–10 present the details of the key steps inside the proposed principle. Section 11 shows some preliminary results. The conclusions are given in Section 12.

2. Problem Statement

Stereovision enables human beings to classify, identify, and localize entities. Clearly, stereovision is a very powerful system which provides answers to the following questions:
  • What are the classes of perceived entities?
  • What are the identities of perceived entities?
  • Where are the locations of perceived entities inside related images?
  • Where are the locations of perceived entities inside scenes?
A complete solution, which could fully answer the above questions, actually depends on the availability of the working principles underlying image sampling (NOTE: a largely overlooked sub-topic), entity cognition (NOTE: vaguely named deep learning in the literature [9]), entity recognition (NOTE: vaguely named object detection in the literature [10]), and entity matching [11], as shown in Figure 2.
As illustrated in Figure 2, when an entity is placed at location Q in a scene, its stereo images appear at location a in the left image and location b in the right image, respectively. The stereo matching problem can then be stated as follows:
Given the location of an entity seen by the left camera, how can the location of the same entity seen by the right camera be determined? [12]
The above problem implies the following two challenges:
  • How to determine the presence of an entity in the left camera’s image plane?
  • How to find the match in the right camera’s image plane if an entity has been detected in the left camera’s image plane?
It is important to further pay attention to the root causes behind these two challenges. The major root causes include [13]:
  • Variations of entities in size
  • Variations of entities in orientation
  • Variations of entities’ images due to lighting conditions
  • Variations of entities’ images due to occlusions as shown in Figure 3a
  • Variations of entities’ images due to image sampling process as shown in Figure 3b
In practice, we can cope with the issues raised by variations in size and orientation if we can afford enough computational power to process images at multiple scales and multiple rotations. We can also cope with variations in lighting conditions by using cameras with built-in automatic illumination compensation and/or automatic contrast equalization. Hence, our remaining effort should focus on the issue raised by the occurrence of partial views in stereovision.

3. Similar Works on Stereo Matching

Stereo matching is an old problem in computer vision, and a tremendous amount of work in the literature has been dedicated to solving it. For example, there are:
  • Methods that attempt to match points within a pair of stereo images [14].
  • Methods that attempt to match edges or contours within a pair of stereo images [15].
  • Methods that attempt to match line segments within a pair of stereo images [16].
  • Methods that attempt to match curves within a pair of stereo images [17].
  • Methods that attempt to match regions within a pair of stereo images [18].
  • Methods that attempt to match objects within a pair of stereo images [19].
The principle proposed in this paper falls into the category of matching entities within a pair of stereo images. Here, an entity may broadly refer to an object, a person, an animal, a building, a machine, etc. In the literature, the existing solutions in this category focus on the use of deep convolution for feature extraction, followed by an artificial neural network for tuning and prediction. Such methods depend solely on bottom-up optimization (e.g., the back-propagation algorithm) and on features extracted in the time domain.
In contrast, our proposed principle advocates a top-down design process in which we promote the use of hybrid features (i.e., features from both the time domain and the frequency domain) as well as the use of an improved version of the RCE neural network [20]. The RCE neural network [21,22,23], which was developed in the 1970s by a research team led by a laureate [23] of the 1972 Nobel Prize in Physics, is fundamentally different from the artificial neural network. To the best of our knowledge, there is so far no better way of designing human-like cognition and recognition than the use of the RCE neural network or its improved versions [20,21,22,23].
It is worth acknowledging that, despite the huge amount of research dedicated to stereovision, the achieved results are still far behind the performance of human stereovision. Clearly, we should not stop investigating better principles of, or solutions to, human-like stereovision.

4. The Outline of Proposed Principle

Human vision is attention-driven in a top-down manner. Attention can be triggered by the occurrences of reference entities such as persons, animals, objects, machines, geometries (e.g., lines, curves, surfaces, volumes), photometry (e.g., chrominance and luminance), textures, etc. Such reference entities can be learnt incrementally in real time by a cognition process, whereas the occurrences of familiar reference entities are the responses of an internal recognition process.
Inspired by the innate processes of human vision, we propose a new principle which imitates the attention-driven behavior of human vision. The main idea of the proposed new principle is outlined in Figure 4.
Without loss of generality, we assume that attention is to be triggered from the video stream of the left camera. The key steps involved in the proposed new principle include:
  • Image acquisition by both cameras.
  • Image sampling on video stream from left camera.
  • Hybrid feature extraction for each image sample.
  • Cognition of image samples if they correspond to the training data of reference entities inside training images.
  • Recognition of image samples if they correspond to the possible occurrences of reference entities inside real-time images.
  • Forward/inverse processes of template matching, which work together to find the occurrence of a matched candidate in the right image if a recognized entity is present in the left image.
In the subsequent sections, we will describe the details of key steps 2 to 6.

5. Top-Down Strategy of Doing Image Sampling

An image may contain many entities of interest. One of the biggest challenges faced by image understanding or image segmentation/grouping is to divide an image into a matrix of image samples, each of which contains the occurrence or appearance of just a single entity. In theory and in practice, there is no solution which can generally guarantee the results expected by the subsequent visual processes in stereovision.
In addition, the problem of finding better ways to do image sampling has not received enough attention in the research community. One major reason is that many people believe it is good enough to scan an input image with a sub-window so as to obtain all the possible image samples. However, this way of doing image sampling has serious drawbacks:
  • It is difficult to determine, or to justify, the size of the sub-window used to scan an input image. If the size of the sub-window is allowed to change dynamically, then the next question is how to perform such dynamic adjustment of sizes.
  • The number of obtained image samples is independent of the content of the input image. For example, an input image may contain a single entity; in this case, the scanning method will still produce many image samples, which become the input to the subsequent visual processes of classification, identification, grouping, etc. Obviously, irrelevant image samples may cause trouble for these recognition processes.
In this paper, we advocate a top-down strategy which iteratively divides an input image into a list of sets containing linearly growing numbers of image samples of different sizes. If we denote by $S_k$ a set which contains $k$ image samples, one way to obtain $S_k$ is to uniformly divide the input image into a matrix of $d_v \times d_h$ samples, in which $d_v \times d_h = k$. For example, we iteratively divide an input image into:
  • $S_k$ with one sample: then $k = 1$ and $d_v \times d_h \in \{1 \times 1\}$.
  • $S_k$ with two samples: then $k = 2$ and $d_v \times d_h \in \{1 \times 2, 2 \times 1\}$.
  • $S_k$ with three samples: then $k = 3$ and $d_v \times d_h \in \{1 \times 3, 3 \times 1\}$.
  • $S_k$ with four samples: then $k = 4$ and $d_v \times d_h \in \{1 \times 4, 4 \times 1, 2 \times 2\}$.
  • and so on.
This top-down strategy of image sampling is suitable for both parallel and sequential implementation, as the sketch below illustrates.
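As an illustration of this strategy, the following minimal Python sketch (our own illustration, not the authors’ released implementation; the helper names `grid_layouts` and `sample_image` are hypothetical) enumerates the grid layouts $d_v \times d_h$ for a given $k$ and cuts an image into the corresponding samples:

```python
import numpy as np

def grid_layouts(k: int):
    """All (d_v, d_h) with d_v * d_h == k, e.g., k = 4 -> (1, 4), (2, 2), (4, 1)."""
    return [(d, k // d) for d in range(1, k + 1) if k % d == 0]

def sample_image(image: np.ndarray, d_v: int, d_h: int):
    """Uniformly divide an H x W x 3 image into a d_v x d_h matrix of samples."""
    h, w = image.shape[:2]
    rows = np.array_split(np.arange(h), d_v)
    cols = np.array_split(np.arange(w), d_h)
    return [image[r[0]:r[-1] + 1, c[0]:c[-1] + 1] for r in rows for c in cols]

# Example: enumerate the sets S_1 ... S_4 of a dummy input image.
image = np.zeros((480, 640, 3), dtype=np.uint8)
for k in range(1, 5):
    for d_v, d_h in grid_layouts(k):
        samples = sample_image(image, d_v, d_h)  # k samples per layout
```

Because each $k$ is processed independently, the sets $S_1, S_2, S_3, \ldots$ can be produced either by parallel workers or by one sequential loop.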
Before sending the image samples to the next visual process of feature extraction, it is necessary to normalize their sizes so that they become comparable. In practice, it is trivial to scale an image sample up or down to any chosen standard size. By now, we can represent image sample set $S_k$ as follows:
$$S_k = \left\{ \left[ I_{j,r}(u,v),\ I_{j,g}(u,v),\ I_{j,b}(u,v) \right],\ u \in [0, U-1],\ v \in [0, V-1],\ j \in [1, k] \right\} \tag{1}$$
where $(r, g, b)$ are the three primary color components at index coordinates $(u, v)$ inside set $S_k$’s $j$-th image sample $I_j(u,v)$, which has size $V \times U$. Hence, by default, each image acquisition module in stereovision outputs color images, each of which is represented by a set of three matrices $[I_{j,r}(u,v), I_{j,g}(u,v), I_{j,b}(u,v)]$ as in Equation (1).

6. Feature Extraction from Sample Image in Time-Domain

Mathematically speaking, periodicity in space is equivalent to periodicity in time. Hence, without loss of generality, we treat the spatial axes of an image or image sample as time axes. In this way, we can focus our discussion on how to extract features in the time domain as well as in the frequency domain.
Feature extraction in the time domain has been extensively investigated by the image processing and computer vision research communities. In general, the basic operations include the computations of $n$-th order derivatives, where $n$ can be equal to 0, 1, 2, 3, or any larger integer. Here, the zero-order derivatives refer to the results obtained by image smoothing for noise reduction (e.g., using Gaussian filters).
In the literature, there are also many advanced studies which explore the use of Laplacian filters, Gabor filters, wavelet filters, the Moravec corner filter, the Harris–Stephens corner filter, the Shi–Tomasi corner filter, etc. Hence, feature extraction in the time domain is a very rich topic.
From the results shown in Figure 5, it is clear that higher-order derivatives do not provide significant extra information. The data of the zero-order and first-order derivatives should be good enough for extracting meaningful features in the time domain.
In practice, the zero-order derivatives can be obtained by convolving set $S_k$’s $j$-th image sample $I_j(u,v)$ with a discrete Gaussian filter such as:
$$G(u,v) = \frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}, \quad u \in [0, 2],\ v \in [0, 2] \tag{2}$$
If we represent the results of the zero-order derivatives (i.e., one matrix of zero-order derivatives per color component) as follows:
$$S_{k,0} = \left\{ \left[ I^0_{j,r}(u,v),\ I^0_{j,g}(u,v),\ I^0_{j,b}(u,v) \right],\ u \in [0, U-1],\ v \in [0, V-1],\ j \in [1, k] \right\} \tag{3}$$
then the first-order derivatives can be obtained by convolving each matrix in $S_{k,0}$ with the following two Sobel filters:
$$C_h(u,v) = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}, \quad u \in [0, 2],\ v \in [0, 2] \tag{4}$$
and
$$C_v(u,v) = \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix}, \quad u \in [0, 2],\ v \in [0, 2] \tag{5}$$
Clearly, convolution with the filter in Equation (4) yields the horizontal components of the first-order derivatives, while convolution with the filter in Equation (5) yields the vertical components. The $L_2$ norms computed from these two components then give the first-order derivatives of the image samples in $S_{k,0}$, which can be represented by:
$$S_{k,1} = \left\{ \left[ I^1_{j,r}(u,v),\ I^1_{j,g}(u,v),\ I^1_{j,b}(u,v) \right],\ u \in [0, U-1],\ v \in [0, V-1],\ j \in [1, k] \right\} \tag{6}$$
Therefore, image sample $j$ in set $S_k$ actually has six image matrices: $[I^0_{j,r}(u,v), I^0_{j,g}(u,v), I^0_{j,b}(u,v)]$ in Equation (3) and $[I^1_{j,r}(u,v), I^1_{j,g}(u,v), I^1_{j,b}(u,v)]$ in Equation (6).
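As a concrete reading of Equations (2)–(6), the following sketch computes the six matrices of one RGB image sample using SciPy’s `convolve`; this is our own minimal rendering of the pipeline, not the authors’ implementation, and the helper name `derivative_matrices` is hypothetical:

```python
import numpy as np
from scipy.ndimage import convolve

G = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0  # Gaussian, Equation (2)
C_h = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])    # Sobel, Equation (4)
C_v = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])    # Sobel, Equation (5)

def derivative_matrices(sample: np.ndarray):
    """Return the six matrices of one RGB sample: three zero-order
    (smoothed) matrices and three first-order (gradient magnitude) matrices."""
    zero_order, first_order = [], []
    for c in range(3):  # r, g, b color components
        i0 = convolve(sample[:, :, c].astype(float), G)  # Equation (3)
        dh = convolve(i0, C_h)  # horizontal component
        dv = convolve(i0, C_v)  # vertical component
        zero_order.append(i0)
        first_order.append(np.hypot(dh, dv))  # L2 norm, Equation (6)
    return zero_order, first_order
```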
The next question is then how to determine a feature vector $F_j$ which meaningfully represents image sample $j$ in set $S_k$. A simple answer is to convert the six image matrices of a sample into their vector representations (i.e., each 2D matrix is re-arranged as a 1D vector); concatenating these six image vectors then yields feature vector $F_j$. The advantage of this method is its simplicity. However, the noticeable drawback is the large dimension of feature vector $F_j$. We may therefore ask whether there is a better way of determining feature vector $F_j$ from the image matrices.
So far, there is no theoretical answer to this question. A practical way is to design workable solutions suitable for the applications at hand. In this way, a library of workable solutions may empower autonomous robots, vehicles, or machines to adapt their behaviors to real-time situations or applications. Clearly, this topic still offers opportunities for further research.
Here, we propose a simple and practical way of determining feature vector $F_j$ from the image matrices. The idea is to compute statistics from a set of image matrices. Interestingly, the two obvious types of statistics are mean values and standard deviations.
For example, if $\{I(u,v), u \in [0, U-1], v \in [0, V-1]\}$ is an image matrix of single values, such as red components, green components, blue components, or their individual first-order derivatives, each value in the matrix can be considered a kind of measurement of approximate electromagnetic energy. Therefore, we can compute the following four meaningful statistics from $\{I(u,v)\}$:
  • The mean value $I_a$ of approximate electromagnetic energy:
$$I_a = \frac{1}{UV} \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v) \tag{7}$$
  • The square root $\sigma_I$ of the variance of approximate electromagnetic energy:
$$\sigma_I = \sqrt{ \frac{1}{UV} \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} \left( I(u,v) - I_a \right)^2 } \tag{8}$$
  • The horizontal distribution $\sigma_u$ of approximate electromagnetic energy:
$$\sigma_u = \sqrt{ \frac{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v)\, (u - u_c)^2 }{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v) } } \tag{9}$$
    with:
$$u_c = \frac{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v)\, u }{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v) } \tag{10}$$
    and
$$v_c = \frac{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v)\, v }{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v) } \tag{11}$$
  • The vertical distribution $\sigma_v$ of approximate electromagnetic energy:
$$\sigma_v = \sqrt{ \frac{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v)\, (v - v_c)^2 }{ \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v) } } \tag{12}$$
As a result, any image matrix $\{I(u,v), u \in [0, U-1], v \in [0, V-1]\}$ can be represented by the feature vector $F$ as follows:
$$F = \left[ I_a,\ \sigma_I,\ \sigma_u,\ \sigma_v \right] \tag{13}$$
In the time domain, image sample $j$ in set $S_k$ has six image matrices, so its feature vector $F_j$ contains 24 feature values.
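Under the assumption that an image matrix is a NumPy array, the four statistics of Equations (7)–(12) and the per-matrix feature vector of Equation (13) could be computed as follows (the helper name `matrix_features` is our own):

```python
import numpy as np

def matrix_features(I: np.ndarray) -> np.ndarray:
    """Four statistics of one image matrix, Equations (7)-(12),
    returned as the feature vector F of Equation (13)."""
    V, U = I.shape
    I = I.astype(float)
    I_a = I.mean()                                         # Equation (7)
    sigma_I = I.std()                                      # Equation (8)
    u = np.arange(U)[None, :]  # column (horizontal) indices
    v = np.arange(V)[:, None]  # row (vertical) indices
    total = I.sum()
    u_c = (I * u).sum() / total                            # Equation (10)
    v_c = (I * v).sum() / total                            # Equation (11)
    sigma_u = np.sqrt((I * (u - u_c) ** 2).sum() / total)  # Equation (9)
    sigma_v = np.sqrt((I * (v - v_c) ** 2).sum() / total)  # Equation (12)
    return np.array([I_a, sigma_I, sigma_u, sigma_v])      # Equation (13)
```

Concatenating the four statistics of all six matrices of a sample then gives the 24-dimensional time-domain feature vector $F_j$.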

7. Feature Extraction from Sample Image in Frequency-Domain

In mathematics, a very important discovery was the Fourier transform, which tells us that any signal is the (finite or infinite) sum of sine functions. In engineering, one of the greatest inventors was Nikola Tesla, who told us that the secrets of the universe can be understood by simply thinking in terms of energy, vibration, and frequency. Such a statement explicitly advises us to look for a feature space and feature vectors in the frequency domain if we would like to understand the secret of machine intelligence.
Given an image matrix $\{I(u,v), u \in [0, U-1], v \in [0, V-1]\}$, it can be represented by, or decomposed into, its Fourier series in terms of the complex exponentials $e^{\pm ix}$ (in which $i = \sqrt{-1}$), computed as follows:
$$I(u,v) = \frac{1}{UV} \sum_{\omega_v=0}^{V-1} \sum_{\omega_u=0}^{U-1} \hat{I}(\omega_u, \omega_v)\, e^{i \frac{2\pi \omega_u u}{U}}\, e^{i \frac{2\pi \omega_v v}{V}} \tag{14}$$
with $0 \le u \le U-1$, $0 \le v \le V-1$, and:
$$\hat{I}(\omega_u, \omega_v) = \sum_{v=0}^{V-1} \sum_{u=0}^{U-1} I(u,v)\, e^{-i \frac{2\pi \omega_u u}{U}}\, e^{-i \frac{2\pi \omega_v v}{V}} \tag{15}$$
in which $0 \le \omega_u \le U-1$ and $0 \le \omega_v \le V-1$. With continuous signals or data, Equation (14) becomes the inverse Fourier transform while Equation (15) becomes the forward Fourier transform.
It is interesting to note that each value $\hat{I}(\omega_u, \omega_v)$ in Equation (15) is a complex number or, more precisely, a vector. Mathematically speaking, a vector indicates a position in a space. Hence, the Fourier coefficient vectors (or complex numbers) stored inside the complex matrix $\{\hat{I}(\omega_u, \omega_v), \omega_u \in [0, U-1], \omega_v \in [0, V-1]\}$ nicely define a feature space, which could be called the Fourier feature space.
In mathematics, the complex matrix $\{\hat{I}(\omega_u, \omega_v)\}$ can be split into two ordinary matrices $\{A(\omega_u, \omega_v)\}$ and $\{B(\omega_u, \omega_v)\}$, where $A(\omega_u, \omega_v)$ is the real part of the complex number (or vector) $\hat{I}(\omega_u, \omega_v)$ and $B(\omega_u, \omega_v)$ is its imaginary part. Both matrices $A$ and $B$ are Fourier coefficient matrices.
Therefore, in the frequency domain, a straightforward way of determining the feature vector $F$ which characterizes the image matrix $\{I(u,v)\}$ taken from image sample $j$ in set $S_k$ is to re-arrange the corresponding Fourier coefficient matrix $\{A(\omega_u, \omega_v)\}$ or $\{B(\omega_u, \omega_v)\}$ into a vector.
Alternatively, we can use equations identical to Equations (7)–(12) to compute the mean frequencies $(\omega_{u,c}, \omega_{v,c})$ and their standard deviations $(\sigma_{\omega,u}, \sigma_{\omega,v})$ from each Fourier coefficient matrix $\{A(\omega_u, \omega_v)\}$ or $\{B(\omega_u, \omega_v)\}$. In this way, the frequency-domain feature vector corresponding to each Fourier coefficient matrix could be:
$$F_j = \left[ \omega_{u,c},\ \omega_{v,c},\ \sigma_{\omega,u},\ \sigma_{\omega,v} \right] \tag{16}$$
In the time domain, each image sample $j$ in set $S_k$ has three color component images, and each color component image yields two Fourier coefficient matrices. In total, there are therefore six Fourier coefficient matrices for any given image sample $j$ in set $S_k$. As a result, in the frequency domain, the feature vector $F_j$ of image sample $j$ in set $S_k$ also contains 24 feature values.
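Reusing the hypothetical `matrix_features` helper from the previous sketch, the 24 frequency-domain values could be assembled as follows; again, this is an illustrative sketch of Equations (15) and (16), not the authors’ implementation:

```python
import numpy as np

def frequency_features(sample: np.ndarray) -> np.ndarray:
    """24-value frequency-domain feature vector of one RGB sample:
    four statistics per Fourier coefficient matrix, two matrices
    (real part A and imaginary part B) per color component."""
    features = []
    for c in range(3):  # r, g, b color components
        coeffs = np.fft.fft2(sample[:, :, c].astype(float))  # Equation (15)
        A, B = coeffs.real, coeffs.imag  # Fourier coefficient matrices
        features.append(matrix_features(A))  # statistics as in Equation (16)
        features.append(matrix_features(B))
    return np.concatenate(features)  # 6 matrices x 4 statistics = 24 values
```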

8. Cognition Process Using RCE Neural Network

Today, many researchers still believe that our mind arises from our brain. This opinion leads many people, especially young researchers, to believe that the blueprint of the mind is part of the blueprint of the brain. However, for those who are familiar with microprocessors and operating systems, it is clear that the blueprints of operating systems are not part of the blueprints of microprocessors.
Here, we advocate the position that mind is mind while brain is brain. Most importantly, the basic functions of the brain are to support the memorizations and computations intended by the mind. With this in mind, future research in artificial or machine intelligence should focus on the physical principles behind the design of human-like minds which can transform signals into the cognitive states of knowing the conceptual meanings behind those signals.
In the previous sections, we discussed the details of feature extraction. The results are lists of feature vectors in the time domain, the frequency domain, or both. The next question is then how to learn the conceptual meaning behind a set of feature vectors corresponding to the same class, or the same identity, of sample images. The good news is that the RCE neural network developed in the 1970s provides, so far, the best available answer.
As shown in Figure 6, both cognition and recognition can be implemented with the RCE neural network, which consists of three layers. There is a single vector at the input layer and a single vector at the output layer. Inside the middle layer, however, there is a dynamically growing number of nodes, each of which memorizes the feature vectors from a set of sample images provided by a training session of cognition.
Clearly, the RCE neural network is fundamentally different from the so-called artificial neural network, which is simply a graphical representation of a system of equations whose coefficients are tuned in some simple or deep manner (e.g., by the back-propagation method).
Refer to Figure 6. With training session $i$’s feature vectors, we can easily compute the mean vector, as well as the standard deviation of the distances from the training session’s feature vectors to their mean vector.
For example, if training session $i$ has $k_i$ sample images which form the following set:
$$S_{k_i} = \left\{ \left[ I_{j,r}(u,v),\ I_{j,g}(u,v),\ I_{j,b}(u,v) \right],\ u \in [0, U-1],\ v \in [0, V-1],\ j \in [1, k_i] \right\} \tag{17}$$
then training session $i$’s set of feature vectors computed by the feature extraction module can be denoted by $\{F_{i,j}, j \in [1, k_i]\}$, where $j$ is the index of image sample $j$ in set $S_{k_i}$. Subsequently, the mean vector of $\{F_{i,j}, j \in [1, k_i]\}$ is calculated by:
$$F_{i,a} = \frac{1}{k_i} \sum_{j=1}^{k_i} F_{i,j} \tag{18}$$
and the standard deviation of the distances from $\{F_{i,j}, j \in [1, k_i]\}$ to the mean vector is computed by:
$$\sigma_i = \sqrt{ \frac{1}{k_i} \sum_{j=1}^{k_i} \left( F_{i,j} - F_{i,a} \right)^T \left( F_{i,j} - F_{i,a} \right) } \tag{19}$$
By now, we can explain the physical meaning of node $i$ (i.e., the outcome of training session $i$) in the RCE neural network: it is simply the representation of a hyper-sphere [21] with its center at $F_{i,a}$ and its radius equal to $3\sigma_i$. Since $i$ can grow dynamically, the RCE neural network naturally supports the process of incremental learning as well as the process of deep learning widely discussed in the literature.
As mentioned above, the deep tuning of parameters inside a complex artificial neural network, which is a graphical representation of a system of equations, has nothing to do with deep learning; the true nature of deep learning is outlined in Figure 6.
In summary, a training session for cognizing entity $n$ consists of supplying a set of entity $n$’s image samples as in Equation (17) together with entity $n$’s conceptual meaning $L_n$, which is a label or a word in a natural language such as English.
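A minimal sketch of such a training session, assuming feature vectors are NumPy arrays (the names `RCENode` and `train_session` are our own illustrative choices, not the authors’ code):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class RCENode:
    """One middle-layer node: a hyper-sphere with center F_a and radius 3*sigma."""
    label: str          # conceptual meaning L_n of the cognized entity
    center: np.ndarray  # mean vector F_{i,a}, Equation (18)
    sigma: float        # standard deviation of distances, Equation (19)

def train_session(label: str, feature_vectors: list) -> RCENode:
    """Cognize one entity from the feature vectors of its k_i image samples."""
    F = np.stack(feature_vectors)  # k_i x d matrix of feature vectors
    center = F.mean(axis=0)        # Equation (18)
    sigma = float(np.sqrt(((F - center) ** 2).sum(axis=1).mean()))  # Equation (19)
    return RCENode(label, center, sigma)
```

Each call to `train_session` adds one node to the middle layer, which is how the network grows incrementally.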

9. Recognition Process Using Possibility Function

Refer to Figure 6 again. With an RCE neural network trained by a cognition process for each entity of interest (e.g., entity $n$), the output layer is primarily for the purpose of executing the recognition process when the feature vector computed from an arbitrary image sample is given to the input layer.
In the literature, many researchers believe that recognition is a process of determining the chances of occurrences. As a result, probability functions are widely used inside recognition modules.
Here, we advocate the position that recognition is a process of evaluating beliefs about the identity and category of an arbitrary image sample at the input. This position is in line with the fact that our mind consists of many sub-systems of beliefs. Hence, the function for estimating the degrees of belief should be a possibility function such as:
$$p_{j,n}(i) = e^{ -\frac{1}{2\sigma_i^2} \left( F_{0,j} - F_{i,a} \right)^T \left( F_{0,j} - F_{i,a} \right) } \tag{20}$$
where $(F_{i,a}, \sigma_i)$ is the parameter vector of the hyper-sphere obtained from training session $i$, $F_{0,j}$ is the feature vector computed from image sample $j$ during the recognition process, and $p_{j,n}(i)$ is the possibility for image sample $j$ to belong to learnt entity $n$ according to training session $i$’s parameter vector.
Since the RCE neural network intrinsically supports incremental learning as well as deep learning, the single node in the output layer must include a soft-max function such as:
$$P_{j,n} = \max_i \left\{ p_{min},\ p_{j,n}(i) \right\} \tag{21}$$
where $p_{min}$ is the minimum acceptable possibility value (e.g., 0.5). In practice, if $P_{j,n} = p_{min}$, the interpretation is as follows: input $F_{0,j}$ does not support the belief that image sample $j$ belongs to learnt entity $n$. Otherwise, if $P_{j,n} > p_{min}$, the output of recognition is $(L_n, P_{j,n})$, in which $L_n$ is the conceptual meaning of image sample $j$.
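Building on the hypothetical `RCENode` sketch above, Equations (20) and (21) could be evaluated as follows; all nodes in the list are assumed to belong to the same learnt entity $n$:

```python
import numpy as np

def recognize(nodes: list, F0: np.ndarray, p_min: float = 0.5):
    """Evaluate Equations (20) and (21) over entity n's nodes; return
    (L_n, P_{j,n}) if the belief is supported, otherwise None."""
    best = p_min
    for node in nodes:
        diff = F0 - node.center
        p = np.exp(-float(diff @ diff) / (2.0 * node.sigma ** 2))  # Equation (20)
        best = max(best, p)  # Equation (21)
    if best > p_min:
        return nodes[0].label, best  # all nodes share entity n's label L_n
    return None  # input does not support the belief
```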

10. Forward/Inverse Processes of Template Matching

In the previous sections, we discussed the key details of the modules of image sampling, hybrid feature extraction, cognition, and recognition. In a human-like stereovision system, these modules produce the recognized entities inside the left camera’s image plane, as illustrated in Figure 4. The next question is then how to determine the match in the right camera if a recognized entity in the left camera is given. This question describes the famous problem of stereo matching faced by today’s stereovision systems.
In the literature, stereo matching is a widely investigated problem. So far, no solution achieves performance close to, or as good as, that of the human stereovision system. Hence, better solutions with improved performance are still expected from future research in this area.
In this paper, we present a new strategy which copes with the problem of stereo matching in a better way. This new strategy consists of the interplay between forward template matching and inverse template matching.
In stereovision, the only geometrical constraint is the so-called epipolar line, which indicates the possible locations of a match (e.g., at location b in Figure 7) in the right image plane if a location in the left image plane is given (e.g., location a in Figure 7).
As shown in Figure 7, if recognized sample j at location a is given in the left image plane, the forward process of template matching consists of the following steps:
  • Determine the equation of the epipolar line from both the stereovision system’s calibration parameters (NOTE: such knowledge can be found in any textbook of computer vision) and location a’s coordinates.
  • Scan the epipolar line location by location.
  • Take image sample e at the currently scanned location e.
  • Compute the feature vector of image sample e.
  • Compute the cosine distance between image sample j’s feature vector and image sample e’s feature vector.
  • Repeat the scanning until it is completed.
  • Choose the image sample which minimizes the cosine distance as the candidate of matched sample j.
  • Use the cosine distance between recognized sample j and the chosen candidate to compute the possibility value of the match (i.e., using Equation (23)).
  • Accept matched sample j if the possibility value of the match is greater than a chosen threshold value (e.g., 0.5).
In the above process, if $F_j$ is the feature vector of image sample $j$ while $F_e$ is the feature vector of image sample $e$, the cosine distance $d_{j,e}$ between these two vectors is simply calculated from their inner product:
$$d_{j,e} = \frac{ F_j^T F_e }{ \| F_j \| \, \| F_e \| } \tag{22}$$
and the corresponding possibility value is calculated as follows:
$$P_{j,e} = e^{ -\frac{ d_{j,e}^2 }{ 2\sigma_0^2 } } \tag{23}$$
where $\sigma_0$ is a default value of the standard deviation, which can be self-determined by robots, vehicles, or machines during a training session of the cognition process.
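Taken literally, Equations (22) and (23) amount to the following two helpers (our own sketch, not the authors’ code; the default $\sigma_0 = 10$ is the value used later in Section 11):

```python
import numpy as np

def cosine_distance(F_j: np.ndarray, F_e: np.ndarray) -> float:
    """Equation (22): normalized inner product of two feature vectors."""
    return float(F_j @ F_e) / float(np.linalg.norm(F_j) * np.linalg.norm(F_e))

def match_possibility(F_j: np.ndarray, F_e: np.ndarray, sigma_0: float = 10.0) -> float:
    """Equation (23): possibility value of a candidate match."""
    d = cosine_distance(F_j, F_e)
    return float(np.exp(-d ** 2 / (2.0 * sigma_0 ** 2)))
```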
According to the illustration shown in Figure 3, the forward process of template matching works only if there is no partial view due to either occlusion or image sampling. If matched sample $j$ in the right image contains a partial view of recognized sample $j$ in the left image, the inverse process of template matching performs better than its forward counterpart.
As shown in Figure 8, if recognized sample j at location a is given in the left image plane, the inverse process of template matching consists of the following steps:
  • Determine the equation of the epipolar line from both the stereovision system’s calibration parameters and location a’s coordinates.
  • Scan the epipolar line location by location.
  • Take image sample e at the currently scanned location e.
  • Divide image sample e into a matrix of sub-samples $\{e_i, i = 1, 2, 3, \ldots\}$.
  • Use each sub-sample in $\{e_i, i = 1, 2, 3, \ldots\}$ as a template and do forward template matching with recognized sample j.
  • Compute the mean of all the possibility values which measure the match between the sub-samples in $\{e_i, i = 1, 2, 3, \ldots\}$ and recognized sample j. This mean value represents the possibility for image sample e in the right image to match recognized sample j in the left image.
  • Repeat the scanning until it is completed.
  • Choose the image sample which maximizes the possibility value of the match (i.e., calculated by Equation (23)) as the candidate of matched sample j.
  • Accept the match if the possibility value of the match is greater than a chosen threshold value (e.g., 0.5).
In practice, we can run the forward and inverse processes of template matching in parallel. In this way, a better match decision in the right image can be made when recognized sample $j$ in the left image is given.
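Reusing the hypothetical `sample_image` and `match_possibility` helpers from the earlier sketches, the score of one candidate location in the inverse process could look like this, with the feature extractor passed in as a callable:

```python
import numpy as np

def inverse_match_possibility(sample_e: np.ndarray, F_j: np.ndarray,
                              extract, grid=(2, 2), sigma_0: float = 10.0) -> float:
    """Inverse template matching at one scanned location (Figure 8):
    divide sample e into a grid of sub-samples, score each sub-sample
    against recognized sample j's feature vector F_j with Equation (23),
    and return the mean score as sample e's match possibility."""
    d_v, d_h = grid
    scores = [match_possibility(F_j, extract(sub), sigma_0)
              for sub in sample_image(sample_e, d_v, d_h)]
    return float(np.mean(scores))
```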

11. Implementation and Results

The proposed new principle has been implemented in Python. Preliminary tests have been conducted with image data from the public domain. In particular, we use image data posted to the public domain by the Maritime RobotX Challenge (www.robotx.org (accessed on 15 March 2023)). Figure 9 shows two typical examples of scenes constructed by the Maritime RobotX Challenge.
The tasks to be undertaken by an autonomous surface vehicle include the stereovision-guided delivery of objects, stereovision-guided parking into the docking bay, etc. In the following subsections, we share some of our experimental results.

11.1. Results of Top-Down Sampling Strategy of Input Images

Our proposed top-down sampling strategy divides an input image (e.g., from the left camera) into a list of sets $S_k$ containing increasing numbers of image samples (i.e., k = 1, 2, 3, …). Figure 10 shows an example of the results (i.e., k = 28) of the proposed top-down sampling strategy on an input image. At this level of sampling, a red floating post clearly appears inside one of the 28 samples.

11.2. Examples of Training Data for Cognition (i.e., Learning)

The proposed new principle involves the use of cognition and recognition modules. The cognition module must be trained with training data of reference entities. Without loss of generality, we simply use a set of 10 samples to train the cognition module dedicated to each entity of interest. Remarkably, the proposed solution achieves successful results with only 10 training samples per entity of interest.
Figure 11 shows the scenario of autonomous parking into a docking bay by an autonomous surface vehicle. In this task, the mental capabilities of the autonomous surface vehicle include (a) cognition of triangle, cross, and circle, and (b) recognition of triangle, cross, and circle. Hence, for the training of the cognition module dedicated to each entity among triangle, cross, and circle, we simply take the ten samples shown in Figure 11.

11.3. Results of Feature Extraction in Time Domain

For each sample image in Figure 10, we calculate its feature vector in the time domain. Here, we share the feature vectors computed from the ten image samples of the triangle in Figure 11. These results are shown in Figure 12, which also gives the mean vector and its standard deviation.

11.4. Results of Feature Extraction in Frequency Domain

For each sample image in Figure 10, we calculate its feature vector in the frequency domain. Similarly, we share the feature vectors computed from the ten image samples of the triangle in Figure 11. These results are shown in Figure 13, which also gives the mean vector and its standard deviation.

11.5. Results of Cognition

The mean vector and its standard deviation, obtained from each training session for an entity of interest, are stored inside a node at the middle layer of the RCE neural network allocated to that entity’s cognition module. If there are N entities of interest in total, there will be a set of N RCE neural networks supporting N pairs of cognizers and recognizers {(Cognizer n, Recognizer n), n = 1, 2, 3, …, N} (i.e., n refers to the n-th entity). As illustrated in Figure 14, the total number N of entities can grow incrementally without bound.

11.6. Results of Recognition

With the total number N of pairs of cognizers and recognizers {(Cognizer n, Recognizer n), n = 1, 2, 3, …, N} in place, an autonomous surface vehicle or robot is ready to recognize familiar (i.e., learnt) entities inside images from the left camera.
Figure 15 shows two examples of recognition results in the time domain. Each example contains seven image samples as input; three of them are entirely outside the class dedicated to the pair of cognizer and recognizer. We can see that the recognition module performs quite successfully in recognizing the correct entries. Note that the feature vectors of the image samples at the input are all in the time domain.
With the same image samples at the input, Figure 16 shows the recognition results in the frequency domain. We can see that the recognition module also performs quite successfully in recognizing the correct entries. Note that the feature vectors of the image samples at the input are all in the frequency domain.
One may ask whether the proposed new principle works well with other, more complex entities. In response to such doubt, we give two more examples of results which make use of feature vectors in the frequency domain to do cognition and recognition. For each reference entity (e.g., car and dog), the cognition module is trained with ten image samples while the recognition process is tested with six image samples as input.
In Figure 17, we show the experimental results of cognizing and recognizing cars in frequency domain. The results are judged to be very good.
In Figure 18, we show the experimental results of cognizing and recognizing dogs in frequency domain. The results are also judged to be very good.

11.7. Results of Stereo Matching

Mathematically, a pair of images is sufficient to validate a stereo matching algorithm. In practice, a pair of images can come from a binocular vision system, normally called a stereovision system. Alternatively, a pair of images can come from a mobile monocular vision system. Since we use an image dataset from the public domain, it is easier for us to take two images from an image sequence captured by a mobile camera.
Here, we share one example of results in Figure 19, Figure 20 and Figure 21. In Figure 19, we let the stereovision system undergo the cognition process, in which ten sample images of a floating post are used to train the RCE neural network inside the cognizer allocated to learning the red floating post. After the training of the cognizer’s RCE neural network, the stereovision system is ready to enter the recognition process, which takes any set of new images as input.
In Figure 19, seven image samples are selected for testing the validity of the trained RCE neural network. The possibility values show good outcomes from the recognition process.
Now, we can start the process of stereo matching. As shown in Figure 20, the first step is image sampling. When the left image is sampled into a matrix of 4 × 7 image samples, the occurrence of a red floating post can be recognized. Note that the image sample of this recognized occurrence is named image sample 1a by our testing program.
Subsequently, in the right image, we determine a line (i.e., equivalent to the epipolar line) which guides the search for the best match candidate.
For the purpose of illustration, we take three image samples to compute the stereo matching results. Among these three image samples, image sample 1b is the best match. In general, the best match is the one which maximizes the possibility value (i.e., computed by Equation (23) with $\sigma_0 = 10$) between image sample 1a in the left image and all the possible image samples in the right image. Figure 21 shows the possibility values computed for three pairs of possible matches: (1a, 1b), (1a, 1c), and (1a, 1d). Clearly, pair (1a, 1b) stands out as the best match.

12. Conclusions

In this paper, we have described the details of the key steps of a proposed new principle which aims at achieving robust stereo matching in human-like stereovision. The main idea is to undertake stereo matching at a cognitive level. The significant contributions of this paper include the following. First, the introduction of a top-down sampling strategy lightens the burden of the subsequent processes in stereovision, because it provides better image samples, which in turn diminish the chance of errors committed by those subsequent processes. Secondly, we advocate feature extraction in both the time domain and the frequency domain, so that the key characteristics of a visual entity can be preserved as much as possible. In particular, we have highlighted the importance of Fourier series and Fourier coefficients in the process of extracting visual features from images. Thirdly, we have shown the important difference between the artificial neural network and the RCE neural network. Most importantly, we have introduced the possibility function to improve the RCE neural network so as to make it better support the process of cognition (including deep learning) as well as the process of recognition. Fourthly, we have introduced the inverse strategy of template matching, which is a better solution for coping with the problem of partial views due to occlusions or mis-aligned sampling of images. Finally, the key steps in the proposed new principle have been validated by experiments with real image data in the context of the Maritime RobotX Challenge. The obtained results are very encouraging, and we hope that more results and progress will emerge in this new direction of research.
Last but not least, it is important to highlight the advantages and limitations of the proposed new principle. The notable advantages include: (a) fast responses which are independent of the left camera’s resolution, thanks to the top-down sampling strategy; (b) fast attention which focuses on entities of interest in decreasing order of entity size; (c) computational efficiency, because the cognition process only runs on images from one camera; and (d) meaningful outputs in terms of what, who, and where at the same time. The notable limitation of the proposed new principle is the prerequisite of training (i.e., cognition), which makes the stereovision system respond only to familiar entities. Future works include at least: (a) testing robustness in the presence of partial views due to occlusions or mis-aligned image samples; (b) exploring more ways of extracting hybrid features; (c) collecting more training data for more entities of interest; (d) extending the work to stereovision for humanoid robots; and (e) venturing into the cognition and recognition of sub-entities.

Author Contributions

Conceptualization, M.X.; methodology, M.X.; software, T.L. and Y.F.; validation, T.L. and Y.F.; formal analysis, T.L. and Y.F.; investigation, T.L. and Y.F.; resources, T.L. and Y.F.; data curation, T.L. and Y.F.; writing—original draft preparation, M.X.; writing—review and editing, M.X.; visualization, M.X.; supervision, M.X.; project administration, M.X.; funding acquisition, M.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Future Systems and Technology Directorate, Ministry of Defence, Singapore, grant number PA9022201473.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All data related to this paper will be made available upon request. The access to the data is subject to the data protection policy of Nanyang Technological University, Singapore.

Acknowledgments

We would like to acknowledge the financial support from the Future Systems and Technology Directorate, Ministry of Defence, Singapore, to NTU’s RobotX Challenge team.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xie, M.; Hu, Z.C.; Chen, H. New Foundation of Artificial Intelligence; World Scientific: Singapore, 2021. [Google Scholar]
  2. Cassinelli, A.; Reynolds, C.; Ishikawa, M. Augmenting Spatial Awareness with Haptic Radar. In Proceedings of the 2006 10th IEEE International Symposium on Wearable Computers, Montreux, Switzerland, 11–14 October 2006; pp. 61–64. [Google Scholar]
  3. Li, Y.; Ibanez-Guzman, J. Lidar for Autonomous Driving: The Principles, Challenges, and Trends for Automotive Lidar and Perception Systems. IEEE Signal Process. Mag. 2020, 37, 50–61. [Google Scholar] [CrossRef]
  4. Rashidi, A.; Fathi, H.; Brilakis, I. Innovative Stereo Vision-Based Approach to Generate Dense Depth Map of Transportation Infrastructure. Transp. Res. Rec. 2011, 2215, 93–99. [Google Scholar] [CrossRef]
  5. Xie, M. Key Steps Toward Development of Humanoid Robots. In Proceedings of the 25th International Conference on Climbing and Walking Robots, Robotics in Natural Settings, Lecture Notes in Networks and Systems, Ponta Delgada, Portugal, 12–14 September 2022; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  6. Xie, M.; Velamala, S. Maritime Autonomous Vessels: A Review of RobotX Challenge’s Works. J. Technol. Soc. Sci. 2018, 2, 7–14. [Google Scholar]
  7. Gordon, I.E. Theories of Visual Perception, 3rd ed.; Psychology Press: London, UK, 2004. [Google Scholar]
  8. Bekey, G.A. Autonomous Robots: From Biological Inspiration to Implementation and Control; The MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
  9. Roberts, D.A.; Yaida, S. The Principles of Deep Learning Theory; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
  10. Wu, X.W.; Sahoo, D.; Hoi, S.C.H. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64. [Google Scholar] [CrossRef]
  11. Rogister, P.; Benosman, R.; Ieng, S.H.; Lichtsteiner, P.; Delbruck, T. Asynchronous Event-Based Binocular Stereo Matching. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 347–353. [Google Scholar] [CrossRef] [PubMed]
  12. Yang, G.S.; Manela, J.; Happold, M.; Ramanan, D. Hierarchical Deep Stereo Matching on High-Resolution Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5515–5524. [Google Scholar]
  13. Bleyer, M.; Breiteneder, C. Stereo Matching: State-of-the-Art and Research Challenges. In Advances in Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2013; pp. 143–179. [Google Scholar]
  14. Chang, J.R.; Chen, Y.S. Pyramid Stereo Matching Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar]
  15. Xie, M. A Cooperative Strategy for The Matching of Multi-level Edge Primitives. Image Vis. Comput. 1995, 13, 89–99. [Google Scholar] [CrossRef]
  16. Medioni, G.; Nevatia, R. Segment-based stereo matching. Comput. Vis. Graph. Image Process. 1985, 31, 2–18. [Google Scholar] [CrossRef]
  17. Zhang, Y.N.; Gerbrands, J.J. Method for matching general stereo planar curves. Image Vis. Comput. 1995, 13, 645–655. [Google Scholar] [CrossRef]
  18. Wang, Z.F.; Zheng, Z.G. A region based stereo matching algorithm using cooperative optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  19. Li, L.; Fang, M.; Yin, Y.; Lian, J.; Wang, Z. A Traffic Scene Object Detection Method Combining Deep Learning and Stereo Vision Algorithm. In Proceedings of the IEEE International Conference on Real-time Computing and Robotics (RCAR), Xining, China, 15–19 July 2021; pp. 1134–1138. [Google Scholar]
  20. Yin, X.M.; Guo, D.; Xie, M. Hand image segmentation using color and RCE neural network. Robot. Auton. Syst. 2001, 34, 235–250. [Google Scholar] [CrossRef]
  21. Cooper, P.W. The hypersphere in pattern recognition. Inf. Control. 1962, 5, 324–346. [Google Scholar] [CrossRef]
  22. Morgan, D.P.; Scofield, C.L. ANN Keyword Recognition. In Neural Networks and Speech Processing; The Springer International Series in Engineering and Computer Science; Springer: Berlin/Heidelberg, Germany, 1991; Volume 130. [Google Scholar]
  23. Cooper, L.N. How We Remember: Toward an Understanding of Brain and Neural Systems; World Scientific: Singapore, 1995. [Google Scholar]
Figure 1. Research Framework Underlying the Development of Intelligent Humanoid Robots and Autonomous Surface Vehicles.
Figure 2. Illustration of Stereo Matching. “*” indicates the location of entity in 3D scene as well as its locations in both left image plane and right image plane.
Figure 3. Illustration of (a) partial view caused by occlusion in which left camera sees partial view of object 3 while right camera sees partial view of object 1, and (b) partial view caused by image sampling in which the three nearest vehicles partially appear inside samples at row 2 and row 3.
Figure 4. Outline of Proposed New Principle Toward Achieving Robust Matching in Human-like Stereovision.
Figure 5. Examples of results from the computations of zero-order, first-order, second-order, third-order, and fourth-order derivatives.
Figure 6. Structure of RCE Neural Network.
Figure 7. Illustration of Forward Template Matching in Stereovision.
Figure 8. Illustration of Inverse Template Matching in Stereovision.
Figure 9. Typical scenes constructed by maritime RobotX challenge.
Figure 10. One Result of Top-down Sampling Strategy of Input Image.
Figure 11. Ten Sample Images for Training Cognition Module Dedicated to Each Entity Among Triangle, Cross and Circle.
Figure 12. The Values of Ten Feature Vectors Computed from Ten Sample Images of Triangle in Time Domain. The statistics computed from these 10 feature vectors are highlighted in red color.
Figure 13. The Values of Ten Feature Vectors Computed from Ten Sample Images of Triangle in Frequency Domain. The statistics computed from these 10 feature vectors are highlighted in red color.
Figure 14. Results of Cognition in the Form of total number N of Cognizers (i.e., 1, 2, 3, …, N).
Figure 15. Two Examples of Results of Recognition Using Feature Vectors in Time Domain.
Figure 16. Two Examples of Results of Recognition Using Feature Vectors in Frequency Domain.
Figure 17. Example of Cognizing and Recognizing Cars in Frequency Domain.
Figure 18. Example of Cognizing and Recognizing Dogs in Frequency Domain.
Figure 19. Results of Cognizing and Recognizing Red Floating Posts in Frequency Domain.
Figure 20. Results of Testing Recognition with Seven Image Samples, after Doing Cognition with Ten Image Samples Which Have Certain Level of Intended Variations for the Purpose of Appreciating Robustness. The image sample recognized in left image is denoted “1a” while three matching candidates of image samples in right image are denoted “1c”, “1b”, and “1d”, respectively.
Figure 21. Results of Stereo Matching Among Three Possible Pairs of Matches Which Are: pair(1a, 1b), pair(1a, 1c), and pair(1a, 1d).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
