1. Introduction
As the primary mechanism for gathering external information and the sole means for observing motion behaviors [
1], the visual system of mammals has undergone numerous evolutionary adaptations to meet complex survival challenges [
2]. One significant adaptation is the frontalization of the eyes, a key morphological evolution that involves the migration of the eyes from the lateral sides of the head to a more central position at the front. This evolutionary process has profoundly influenced the functionality and distribution of various types of neurons within the visual neural system, especially affecting the disparity-sensitive neurons that are crucial for depth perception. As a result of eye frontalization, the area of binocular field overlap has expanded, enabling a greater number of disparity-sensitive neurons to contribute to the processing of visual information, particularly in discerning object depth [
3]. This enhanced capability allows organisms to interpret complex spatial information more effectively, culminating in the development of an advanced capability known as “stereoscopic vision” [
4].
Stereoscopic vision can be broadly defined as the capacity to discern the three-dimensional structure of visual scenes by comparing retinal images from different lines of sight in each eye, focused on the same spatial point. For organisms inhabiting three-dimensional environments, it enhances their ability to perceive and interact with their surroundings by integrating with other functions within the visual system. These functions encompass not only the perception of an object’s texture, contours, and color [
5], but also the detection of object motion [
6]. Among these, the capability to discern motion direction within three-dimensional spaces is fundamental to the survival strategies of organisms and has been extensively investigated in the field of computer vision. In current research, mainstream approaches fall into two categories: the application of deep learning [
7] and the imitation of physiological structures [
8]. Among deep learning approaches, models based on convolutional neural network (CNN) architectures have been overwhelmingly dominant. The earliest instance of a CNN was the LeNet architecture, introduced by Yann LeCun and colleagues in 1998, primarily designed for the recognition of handwritten digits [
9]. However, over the subsequent decade, the advancement of CNNs was significantly hindered by the limitations in computational resources and the scarcity of large-scale datasets. It was not until the widespread adoption of GPU computing in the field of deep learning, the availability of large-scale datasets such as ImageNet, and the optimization of algorithms and training techniques that the potential of CNNs was fully realized. The introduction of AlexNet in 2012 marked a major breakthrough [
10], significantly improving the accuracy of image recognition and ushering in a new era of rapid progress in deep learning. Leveraging the extensibility of convolutional operations [
11], CNNs have incorporated the application of 3D convolutional layers [
12]. This enables the models to capture spatiotemporal features in three-dimensional space through their weight-sharing [
13] and feature-learning [
14] mechanisms, facilitating their application to 3D image analysis.
Throughout the development of CNNs in the field of 3D image motion detection, numerous representative models have emerged. For instance, VoxNet, introduced in 2015, is a CNN architecture specifically designed for processing and recognizing 3D point cloud data. This model effectively extracts spatial features from both point cloud and voxel grid representations, enabling real-time classification tasks and underscoring the potential of CNNs in advancing 3D image recognition [
15]. However, to address the computational cost and quantization errors inherent in the voxelization process, Charles R. Qi et al. introduced PointNet in 2017. By incorporating a symmetric function to handle unordered point sets, PointNet eliminates much of the overhead and accuracy loss associated with voxelization, demonstrating superior efficiency and robustness across various 3D image tasks, including object classification, part segmentation, and scene segmentation [
16].
In practical applications, 3D CNNs have also made notable progress in several fields. Specifically, 3D ResNet, benefiting from the introduction of residual blocks and the increased depth of the network architecture, has demonstrated outstanding performance in the classification of brain magnetic resonance imaging (MRI) data, particularly in tasks involving the classification of Alzheimer’s disease (AD), mild cognitive impairment (MCI), and normal control (NC) groups [
17]. Building upon this success, the enhanced 3D EfficientNet model, introduced in 2022, incorporates 3D mobile inverted bottleneck convolution (MBConv) layers alongside squeeze-and-excitation (SE) mechanisms to effectively capture multi-scale features from 3D MRI scans. These architectural improvements significantly enhance the model’s ability to perform the early detection and prediction of AD, demonstrating superior performance compared to previous approaches [
18]. Furthermore, the newly proposed ResNet-101 model employs transfer learning techniques, leveraging pre-trained weights to improve detection performance in previously unseen scenarios. This approach enables efficient detection in complex environments and under varying speeds, further highlighting the robustness of CNNs in 3D motion detection tasks [
19].
Despite the significant advancements in deep learning models for 3D motion detection, several challenges persist. These include high training costs, both in terms of data and time [
20], difficulties in adapting to complex scenarios, and the lack of biological interpretability [
21]. While CNNs were originally inspired by the human visual system, their growing complexity has made it increasingly difficult to interpret their internal decision-making processes. This “black box” nature is a common issue across many learning algorithms [
22], prompting researchers to explore alternative approaches. Consequently, there is a growing interest in revisiting bio-inspired models to improve both interpretability and efficiency, with a renewed focus on the foundational principles that originally guided the development of artificial intelligence.
Bio-inspired models trace their origins to the early 1940s, when Warren S. McCulloch and Walter Pitts introduced the idea of constructing computational models based on the principles underlying biological neuron activity [
23]. This groundbreaking work laid the theoretical foundation for the core concepts of bionics, marking a significant milestone in the development of biologically motivated computational frameworks. Neuromorphic computing represents an advanced extension of bio-inspired research, aiming to design computational systems characterized by low power consumption, parallel processing capabilities, and adaptive learning abilities by emulating the structure of biological neurons and synapses in the brain [
24]. This approach is considered capable of addressing the interpretability challenges inherent in modern deep learning models. There are numerous practical applications in this field, such as the dendritic neuron model (DNM), introduced in 2014, which simulates the nonlinear interactions between dendritic neurons [
25]. This model is primarily applied to simulate neurons in the retina and the primary visual cortex, where it has proven effective in capturing complex synaptic computations and enhancing our understanding of neural processing in visual systems [
26]. Building upon this foundation, the dendristor model introduced in 2024 further advances neuromorphic computing by leveraging dendritic computations through multi-gate silicon nanowire transistors to perform complex dendritic processes [
27]. The model exhibited exceptional performance in visual motion perception tasks, highlighting the potential of bio-inspired models as a promising alternative for enhancing both the transparency and efficiency of modern artificial intelligence systems.
The biological foundation of most current bio-inspired models for motion direction detection originates from motion-direction-selective neurons found in the retina [
28,
29]. This concept was initially introduced by Hubel and Wiesel in 1959, when they discovered that neurons in the striate cortex of cats exhibit directional selectivity, responding preferentially to motion stimuli in specific directions [
30]. However, most of these models focus solely on monocular vision, neglecting the role of stereoscopic vision. This lack of depth perception inherently limits their applicability to two-dimensional images. To address this gap in the current research, this paper introduces a bio-inspired model designed to simulate the motion direction detection function in biological stereoscopic vision, referred to as the Stereoscopic Direction Detection Model (SDDM). The model’s overall architecture is divided into two components, corresponding to the left and right eyes in the formation of binocular disparity [
31]. While both components share a similar structure, the horizontal positional difference results in each eye receiving slightly different visual information [
32]. The individual monocular model consists of two distinct layers designed to replicate the structural and functional characteristics of the retina and the primary visual cortex. Specifically, we employed computational formulas to model the functions of specific cells and neurons involved in detecting motion direction within 3D images, emulating key components of the biological visual system. These include photoreceptor cells, bipolar cells, horizontal cells, and ganglion cells within the retinal layer, as well as complex cells and binocular-disparity-selective neurons located in the primary visual cortex [
33]. Furthermore, we formulated assumptions and constructed simulations based on the biological characteristics of synaptic connections between these cells [
34]. The detailed mechanisms underlying these processes will be fully explained in the “Mechanism and Method” section of this paper.
For a high-performing model, both interpretability and robust performance are essential requirements. A model must not only offer transparency in its underlying mechanisms but also deliver strong, reliable results across various tasks. To evaluate this, a series of comprehensive performance evaluations and comparative experiments was conducted, the results of which are detailed in the “Experiments and Analysis” section. In the performance evaluation, experiments were designed for the SDDM to detect motion direction in a 3D binary environment, involving objects of varying sizes, random shapes, and random positions. These experiments were designed to evaluate the model’s consistent performance across varying conditions, highlighting its robustness and adaptability across diverse scenarios. In the comparative experiments, EfficientNet and ResNet, which are currently regarded as the most effective CNN architectures for traditional 3D image tasks, were selected as baseline models for comparison. To rigorously assess their performance compared to the proposed model, the dataset used in these experiments was further augmented with four different types of noise, in addition to the standard performance evaluation criteria. The results demonstrate that the proposed SDDM not only offers significant advantages in transparency and interpretability but also exhibits superior robustness compared to EfficientNet and ResNet under variable conditions.
In summary, the SDDM presented in this paper is a bio-inspired model that mimics the physiological processes underlying mammalian stereoscopic vision, specifically designed to emulate the function of object motion direction detection in 3D images. Its primary innovation within the domain of traditional 3D image recognition is its biologically driven approach to modeling, which analyzes and replicates structures of the biological visual system. This approach circumvents the inherent “black box” limitations typically associated with deep learning models, thereby offering a transparent, interpretable mechanism grounded in biological principles. Within the realm of bio-inspired models, the SDDM addresses the gap in stereoscopic vision modeling and extends its application from 2D to 3D imagery, aligning more closely with the characteristics of the human visual system. Moreover, the potential of the SDDM illustrates that, beyond merely deepening the layers of a model, leveraging a bio-inspired research approach offers a promising avenue for developing models that not only exhibit strong performance but also align closely with biological plausibility.
2. Mechanism and Method
The Stereoscopic Direction Detection Model (SDDM) in this paper is a bio-inspired model that aims to simulate the function of stereoscopic vision, especially the motion direction detection mechanism in traditional 3D images. As depicted in
Figure 1, the architecture of the SDDM can be divided into two distinct components, each designed to emulate the physiological structures of the retinal layer and the primary visual cortex, respectively.
The retinal layer component models how light signals from external 3D images are captured by the retina and converted into 2D electrical signals, and how these signals are transformed into local motion signals through the interactions of photoreceptor cells, bipolar cells, horizontal cells, and ganglion cells prior to transmission to the subsequent layer. The primary visual cortex component proposes a mechanism, based on the physiological characteristics of complex cells and disparity-detecting neurons, that may play a crucial role in detecting the motion direction of objects in 3D images. This mechanism outlines how these neurons receive local motion signals from the retinal layer and process them into global motion signals through analytical and statistical computations. In the following subsections, we provide a detailed explanation of the structure and operational logic of the SDDM from a bio-inspired perspective, including an in-depth discussion of the specific functions of each model component and the characteristics of their corresponding physiological structures.
2.1. Photoreception
In biological systems, optical imaging within the visual system can be described as the following process: light, reflected from external objects, enters the eye and undergoes refraction through various ocular structures, including the cornea and lens, before finally being projected as an inverted two-dimensional image onto the retina [
35]. During this process, the photoreceptor cells in each eye capture two slightly different two-dimensional projections of the external scene, as each eye observes from a slightly different angle. The brain then integrates these two inputs to perceive depth and the spatial position of objects in three-dimensional space, thereby forming stereoscopic vision [
36]. To emulate this mechanism, it is necessary to first establish a transformation model that describes the underlying optical processes. This model will demonstrate how an input 3D image is converted into two slightly differing 2D images, each projected onto the retinas of the left and right eyes.
To introduce this mechanism more specifically, consider a simple 3 × 3 × 3 transparent cube as an example. As illustrated in
Figure 2, the inner layer is labeled B1-9, the middle layer as M1-9, and the outer layer as F1-9, represented in green, yellow, and red, respectively. For the left eye, when directly viewing this cube, the linear propagation of light reflected from the external object results in a projection of the left, outer, and partially inner sides of the cube onto the retina, forming a matrix. In this case, the outer parts of the object obscure the inner parts, making it difficult to perceive depth in certain regions. Consequently, the projection matrix is arranged as follows: B1, B4, B7 (first column); M1, M4, M7 (second column); F1, F4, F7 (third column, which occludes B2, B5, and B8, sharing the same column); M2, M5, M8 (fourth column); F2, F5, F8 (fifth column, which occludes B3, B6, and B9, also sharing the same column); M3, M6, M9 (sixth column); and F3, F6, F9 (seventh column). This creates the preliminary projection-L as illustrated in
Figure 2.
However, due to the lensing effects of the cornea and lens, the projection is inverted according to optical principles [
37], resulting in a completely flipped projection, as shown by the final projection-L in
Figure 2. According to the same principles, the right eye can generate the final projection-R. This demonstrates the corresponding relationship between the external 3D object and the two slightly different 2D projections formed on the retinas of the left and right eyes.
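As a rough illustration of the mapping just described, the following sketch builds preliminary projection-L for the 3 × 3 × 3 cube and applies the optical inversion. The row-major labeling of each 3 × 3 layer and the column ordering are taken from the description above; this is an illustrative reconstruction, not the paper's implementation.

```python
import numpy as np

def layer(prefix):
    # Build one 3 x 3 depth layer labeled row-major, e.g. B1..B9.
    return np.array([[f"{prefix}{3 * r + c + 1}" for c in range(3)]
                     for r in range(3)])

B, M, F = layer("B"), layer("M"), layer("F")  # inner, middle, outer

# Preliminary projection-L: seven columns in the order described above,
# with the F columns occluding the B columns behind them (3rd and 5th).
prelim_L = np.stack(
    [B[:, 0], M[:, 0], F[:, 0], M[:, 1], F[:, 1], M[:, 2], F[:, 2]],
    axis=1)

# Lens inversion flips the image in both axes (final projection-L).
final_L = np.flip(prelim_L)

print(prelim_L[0].tolist())  # ['B1', 'M1', 'F1', 'M2', 'F2', 'M3', 'F3']
print(final_L[0].tolist())   # ['F9', 'M9', 'F8', 'M8', 'F7', 'M7', 'B7']
```

The same construction, mirrored horizontally, yields projection-R for the right eye.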
2.2. Photoreceptor Cells
According to the biological consensus, each photoreceptor cell in the retinal layer is directly associated with a specific region on the retina. When this region is stimulated by incoming light signals, the corresponding photoreceptor cell generates a response. This region is commonly referred to as the “receptive field” [
38]. Photoreceptor cells are unique within the retinal layer in that they are the only cells capable of directly interacting with light signals. Their primary function is to transduce the light signals received from their corresponding receptive fields into electrical signals via photoelectric conversion and transmit these signals to other cells in the retinal layer. The propagation of electrical signals generated by light stimuli in these cells can be approximated by the following equation:
$V(t) = V_{\text{rest}} + \Delta V(t)$

In this equation, $V(t)$ represents the time-varying membrane potential, $V_{\text{rest}}$ denotes the resting potential, and $\Delta V(t)$ corresponds to the potential change (hyperpolarization) induced by light stimulation.
To emulate the function of photoreceptor cells within the framework of the SDDM through computational programming, the initial task is to establish a formal definition of the receptive field associated with each photoreceptor cell. As illustrated in
Figure 3, the receptive field corresponding to each photoreceptor in the SDDM is defined as a 1 × 1 pixel. This pixel represents the smallest unit within the retinal receptive field, commonly referred to as the “local receptive field”, and the position of each pixel can be represented by (x, y).
From a mathematical perspective, the process of photoelectric conversion can be conceptualized as detecting the grayscale value in the relevant region of the receptive field at a given time point and then transmitting this information to the subsequent layer of cells. This function can be expressed by the following equation:
$P(x, y, t) = I(x, y, t)$

In this formula, $I(x, y, t)$ represents the grayscale value of the receptive field at time $t$, with horizontal coordinate $x$ and vertical coordinate $y$, and $P(x, y, t)$ denotes the signal transmitted to the subsequent layer of cells.
2.3. Horizontal Cells
From a biological perspective, horizontal cells play a critical role in modulating the signal transmission between photoreceptors and bipolar cells. To replicate their function within a bio-inspired framework, a concise summary of their physiological characteristics, focusing on their location, structure, and functional roles, is presented as follows:
- (1)
Horizontal cells (HCs) are laterally located in the outer plexiform layer of the retina, where they form direct synaptic connections with photoreceptor cells through an extensive synaptic network. Their lateral distribution allows their synaptic processes to cover a broad area of the retina, enabling them to integrate information from multiple photoreceptor cells.
- (2)
The primary function of horizontal cells is lateral regulation, achieved by inhibiting the activity of neighboring cells to influence their output signals [
39].
- (3)
In the mechanism of object motion direction detection, horizontal cells contribute to enhancing contrast and spatial resolution, thereby strengthening edge detection capabilities [
40].
To reproduce the aforementioned functions and characteristics within the SDDM framework, the horizontal cell structure has been developed as depicted in Figure 4. From a mathematical perspective, the lateral modulation function of horizontal cells can be conceptualized as a process of calculating the absolute difference in grayscale value between neighboring regions. Specifically, two neighboring horizontal cells receive grayscale values from their corresponding photoreceptors at two consecutive time points: $t$ (prior to object movement) and $t + \Delta t$ (post-object movement). The absolute difference between these values is computed, and if the result exceeds a set threshold $L$ (set to 0 to reflect the sensitivity of the visual system), it indicates that the post-movement receptive field is not the location of the object’s movement, leading to an inhibitory effect on subsequent neuronal activity. Conversely, if the difference is below the threshold, it suggests that the post-movement receptive field may be the site of the object’s movement, and no inhibitory effect is applied. This functionality can be represented by the following equation:

$H(x, y) = \begin{cases} 1, & \text{if } \left| I(x', y', t) - I(x, y, t + \Delta t) \right| > L \\ 0, & \text{otherwise} \end{cases}$

In this equation, $(x', y')$ denotes the neighboring receptive field compared against $(x, y)$; an output of 1 indicates that the horizontal cell has executed its inhibitory function, whereas an output of 0 indicates that the horizontal cell has remained inactive. The connections between horizontal cells and other cells are primarily inhibitory synaptic connections. Activation of these inhibitory synapses typically leads to the opening of chloride or potassium ion channels, resulting in hyperpolarization of the postsynaptic membrane potential and thereby reducing the likelihood of neuronal activation. Functionally, inhibitory synaptic connections can be likened to NOT gates in logic circuits: even if a downstream cell receives excitatory input (logic 1) from other sources, the activation of an inhibitory synapse from a horizontal cell will prevent the generation of an action potential through hyperpolarization, resulting in an inhibitory effect (logic 0).
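The horizontal-cell comparison can be sketched as follows. The pairing of a neighboring receptive field at time $t$ with the center field at $t + \Delta t$, and the function name, are modeling assumptions consistent with the description above, not the paper's exact implementation.

```python
import numpy as np

L = 0  # sensitivity threshold, set to 0 as in the text

def horizontal_cell(frame_t, frame_t_dt, x, y, nx, ny):
    """Return 1 (inhibitory output) when the grayscale value of the
    neighboring receptive field (nx, ny) before movement differs from
    that of field (x, y) after movement by more than L, else 0."""
    diff = abs(int(frame_t[ny, nx]) - int(frame_t_dt[y, x]))
    return 1 if diff > L else 0

# Usage: a mismatched pair triggers inhibition; a matched pair does not.
before = np.array([[10, 10], [10, 10]], dtype=np.uint8)
after  = np.array([[10, 200], [10, 10]], dtype=np.uint8)
print(horizontal_cell(before, after, 1, 0, 0, 0))  # mismatch -> 1
print(horizontal_cell(before, after, 0, 1, 1, 1))  # match -> 0
```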
2.4. Bipolar Cells
To align with the design principles of bio-inspired models, the bipolar cell component in the SDDM framework should adhere to the following structural and functional characteristics:
- (1)
Bipolar cells (BCs) are classified into three types based on their responses to changes in light intensity: ON, OFF, and ON–OFF [
41]. In the SDDM model, ON–OFF bipolar cells are employed, characterized by their ability to respond to both increases and decreases in light intensity.
- (2)
ON–OFF bipolar cells establish direct synaptic connections with photoreceptor cells, enabling them to respond to changes in light intensity by receiving instantaneous light signals. Specifically, when light intensity fluctuates, the concentration of glutamate released by photoreceptors simultaneously changes. ON–OFF bipolar cells react to these fluctuations by either initiating or inhibiting depolarization, thereby transmitting distinct response signals accordingly [
42].
Based on these characteristics, it can be concluded that the primary function of ON–OFF bipolar cells is to respond to instantaneous changes in light intensity. However, in the context of motion detection function within the visual system, the concept of “instantaneous” is relative. Even if changes in light intensity occur instantaneously, the bipolar cells and the entire visual pathway are constrained by a temporal response window [
43]. This window, denoted as $\Delta t$, is defined as 13 milliseconds, which is currently regarded as the minimal time interval required to process a single frame of visual information [44]. Thus, the ON–OFF bipolar cell component, as illustrated in Figure 5, has been developed.

Mathematically, the mechanism of the instantaneous light intensity response of ON–OFF bipolar cells can be simulated by comparing the electrical signals from the same receptive field at two time points separated by $\Delta t$. Specifically, the ON–OFF bipolar cells in the SDDM receive electrical signals (grayscale values of the receptive field) from the corresponding photoreceptor cells at times $t$ and $t + \Delta t$, which reflects their synaptic connections with photoreceptors at the physiological level. The difference between the received values is then computed, representing the process by which the cells respond to changes in light intensity. The resulting difference is converted to its absolute value and compared with the set threshold $L$. If the result exceeds the threshold $L$, it indicates that a change in light intensity has occurred in the corresponding receptive field, thereby activating the bipolar cell and producing an output of 1. Conversely, if the result is less than or equal to the threshold $L$, it signifies no significant change in light intensity, resulting in an output of 0. The whole process can be represented by the following equation:

$B(x, y) = \begin{cases} 1, & \text{if } \left| I(x, y, t + \Delta t) - I(x, y, t) \right| > L \\ 0, & \text{otherwise} \end{cases}$
In this equation, result 1 represents the activation of the bipolar cell and the generation of a response while result 0 indicates that the bipolar cell is not activated. Based on biological research, the connections between bipolar cells and other cells are excitatory synaptic connections. When these synapses are activated, sodium ion channels on the postsynaptic membrane open, allowing sodium ions to flow in, causing depolarization of the postsynaptic neuron’s membrane potential. This depolarization pushes the neuron toward the action potential threshold, ultimately leading to the transmission of a nerve impulse. In the SDDM, this means that when the excitatory signal (logic 1) from the bipolar cell is received by other cells, it will remain as an excitatory signal (logic 1) without being disrupted by the synaptic connection.
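A minimal sketch of the ON–OFF bipolar-cell response described above, assuming frames are 2D grayscale arrays, threshold $L = 0$, and two frames separated by the response window $\Delta t$:

```python
import numpy as np

L = 0  # sensitivity threshold, set to 0 as in the text

def bipolar_cell(frame_t, frame_t_dt, x, y):
    """Output 1 when the light intensity of receptive field (x, y)
    changes between t and t + dt (dt ~ 13 ms), else 0."""
    return 1 if abs(int(frame_t_dt[y, x]) - int(frame_t[y, x])) > L else 0

# Usage: a pixel that switches from dark to bright activates the cell;
# an unchanged pixel leaves it silent.
a = np.zeros((3, 3), dtype=np.uint8)
b = a.copy()
b[1, 1] = 255
print(bipolar_cell(a, b, 1, 1))  # -> 1
print(bipolar_cell(a, b, 0, 0))  # -> 0
```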
2.5. Ganglion Cells
The ganglion cells referred to here specifically denote the direction-selective ganglion cells (DSGCs) located in the retinal layer. These specialized ganglion cells are responsible for detecting the direction of object movement and exhibit the following biological characteristics:
- (1)
The primary characteristic of direction-selective ganglion cells is their sensitivity to motion in a specific direction. Each type of DSGC has a specific preferred direction and elicits an excitatory response to motion aligned with that direction. This direction selectivity is mediated by the spatial asymmetry of their receptive fields, along with the integration of inhibitory inputs within the neural circuitry [
45,
46].
- (2)
DSGCs primarily receive inputs from bipolar cells and horizontal cells. These inputs are integrated through a complex combination of excitatory and inhibitory signals to encode motion direction function. The bipolar cells primarily transmit changes in light intensity from photoreceptors, providing excitatory input to DSGCs, while the horizontal cells primarily deliver inhibitory input, contributing to the selective suppression of non-preferred directions [
47].
- (3)
Nonlinear synaptic interactions exist within the dendrites of retinal ganglion cells, particularly when an inhibitory synapse is positioned along the transmission path from an excitatory synapse to the soma. In such cases, the inhibitory synapse exerts a significant shunting inhibition on the excitatory input. These nonlinear interactions form the foundational mechanism underlying direction selectivity in the visual system [
48].
In SDDM, DSGCs are categorized into 26 types, each corresponding to 1 of the 26 fundamental directions of motion in a three-dimensional context (up, up-right, right, down-right, down, down-left, left, up-left; forward, forward-up, forward-up-right, forward-right, forward-down-right, forward-down, forward-down-left, forward-left, forward-up-left; backward, backward-up, backward-up-right, backward-right, backward-down-right, backward-down, backward-down-left, backward-left, backward-up-left). These DSGCs share a fundamentally similar structure, with variations in their preferred directional selectivity, and they exhibit active responses exclusively to motion in a specific direction. DSGCs do not directly acquire information from the retina; instead, they receive preliminarily processed visual inputs through intermediary cells such as bipolar and horizontal cells, which connect to their receptive fields [
49]. This structural configuration consequently leads to multiple DSGCs potentially responding to stimuli from the same visual space area, thus sharing the same receptive field. In this paper, a simplified logical model is employed to simulate this complex physiological structure, as shown in
Figure 6.
Using the retinal layer of the left eye as an example, the receptive field of each DSGC consists of one central pixel surrounded by 20 surrounding pixels, with activations corresponding to 26 fundamental motion directions, as illustrated in
Figure 7. The subscripts in the figure represent the three depth layers in three-dimensional space, while the arrows indicate the eight fundamental directions of motion in each layer; the remaining two symbols denote forward-only and backward-only motion, respectively. In this model, the occluded regions due to light imaging are defined as shared receptive fields, such as the third and fifth columns in the matrix. Consequently, light intensity variations in these regions lead to the simultaneous activation of multiple DSGCs.
As mentioned in the “Photoreception” section, the left and right eyes will obtain distinct projected images due to different viewing angles. This difference, known as parallax, is crucial for depth perception in three-dimensional space [
50]. Due to parallax, it is plausible to hypothesize that although the structures and functions of DSGCs in both eyes are nearly identical, certain differences may exist in order to effectively process three-dimensional visual information, accommodating inputs from two perspectives. In this model, these adjustments are represented by subtle differences in the receptive fields of DSGCs in each eye. As illustrated in
Figure 8, the structure of DSGCs in the right retina aligns closely with that of the left retina, although differences appear in the activation regions corresponding to specific motion directions.
In the SDDM, the nonlinear synaptic interactions within the dendrites of retinal ganglion cells are emulated using conditional logical operations analogous to those in electronic circuits. Specifically, a DSGC can be activated only when the excitatory synapses connected to bipolar cells are stimulated, while the inhibitory synapses associated with horizontal cells can veto the DSGC output through lateral inhibition. This leads to an output condition analogous to the NOT-AND logic observed in electronic circuits, which can be mathematically represented as follows:

$G(x, y) = B(x, y) \land \lnot H(x, y)$

where $B(x, y)$ and $H(x, y)$ are the outputs of the bipolar and horizontal cells associated with receptive field $(x, y)$, and $G(x, y) = 1$ denotes activation of the corresponding DSGC.
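The veto behavior described above can be sketched as a simple gate; the function name is illustrative:

```python
def dsgc_output(bipolar: int, horizontal: int) -> int:
    """0/1 inputs; a DSGC fires only when its excitatory bipolar input
    is active AND its inhibitory horizontal input is silent (NOT-AND)."""
    return 1 if (bipolar == 1 and horizontal == 0) else 0

# Truth table of the gating behavior.
for b_in in (0, 1):
    for h_in in (0, 1):
        print(b_in, h_in, dsgc_output(b_in, h_in))
```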
For DSGCs, each ganglion cell responds exclusively to light stimuli within its receptive field. To replicate this biological function in the SDDM, it is assumed that, during the photoreception phase, a three-dimensional object is not projected onto the retina as a whole but rather in discrete units of
voxels. As shown in
Figure 9, the stride between each voxel unit is set to one minimal unit, producing an effect akin to the extraction of local motion information through three-dimensional convolution with a stride of 1.
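The local segmentation step can be sketched as a sliding-window generator over the 3D image; the 3 × 3 × 3 window size is an assumption for illustration, while the stride of 1 follows the description above.

```python
import numpy as np

def local_units(volume, k=3, stride=1):
    """Yield every k x k x k sub-volume of `volume` with the given
    stride, analogous to 3D convolution patch extraction."""
    D, H, W = volume.shape
    for z in range(0, D - k + 1, stride):
        for y in range(0, H - k + 1, stride):
            for x in range(0, W - k + 1, stride):
                yield (z, y, x), volume[z:z + k, y:y + k, x:x + k]

# Usage: a 4x4x4 volume yields 2*2*2 = 8 overlapping 3x3x3 units.
vol = np.arange(4 * 4 * 4).reshape(4, 4, 4)
units = list(local_units(vol))
print(len(units))           # -> 8
print(units[0][1].shape)    # -> (3, 3, 3)
```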
In the SDDM, the retinal layer processes directional information of object movement as follows: the three-dimensional image is segmented locally into voxel units, each of which is projected onto the retina as a two-dimensional pattern corresponding to the local receptive fields of DSGCs. The light signals received by these receptive fields are converted into electrical signals by photoreceptors and transmitted to horizontal cells and bipolar cells. These signals are then conveyed to DSGCs via inhibitory and excitatory synapses, respectively, determining DSGC activation based on the resulting input.
This process can be represented by the following formulas:
$$N^{L}_{d} = \sum_{p \in P} \mathbb{1}^{L}_{d}(p), \qquad N^{R}_{d} = \sum_{p \in P} \mathbb{1}^{R}_{d}(p), \qquad d = 1, \dots, 26.$$
In these formulas, $N^{L}$ and $N^{R}$ are 26-element arrays, where each value represents the activation count of the DSGCs tuned to the corresponding direction in the left or right retina. $\mathbb{1}_{d}(p)$ is an indicator function that outputs 1 when the scan reaches position $p$ and the corresponding DSGCs are activated; otherwise, it outputs 0. $P$ denotes the set of scanned positions.
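The per-direction tally can be sketched as follows (the data layout, an iterable of position/direction pairs, is our illustrative assumption):

```python
import numpy as np

N_DIRECTIONS = 26  # the 26 neighbouring directions of a cell in a 3-D grid

def tally_activations(activations) -> np.ndarray:
    """Count DSGC activations per direction for one retina.

    `activations` is an iterable of (position, direction_index) pairs,
    one per activated DSGC; the result is the 26-element count array
    described in the text (data layout is illustrative, not the paper's)."""
    counts = np.zeros(N_DIRECTIONS, dtype=int)
    for _pos, d in activations:
        counts[d] += 1
    return counts

left = tally_activations([((0, 0, 0), 7), ((1, 0, 0), 7), ((0, 1, 0), 3)])
print(left[7], left[3])  # 2 1
```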
2.6. Primary Visual Cortex
Local motion information originating from the retinal layer is integrated by the lateral geniculate nucleus before being conveyed to layer 4
of the primary visual cortex. Within this cortical area, a diverse array of neuronal functions is present, collaboratively engaged in processing the incoming information from upstream stages. Within this diverse array of neurons, those potentially involved in detecting motion direction in three-dimensional images are the simple neurons that are more sensitive to specific disparities and directions, along with the complex neurons that can extract correct disparity information. These neurons are often located in the early stages of the visual pathway and are capable of achieving correct binocular matching through disparity information, thus forming the rudiments of global perception in stereoscopic vision [
51,
52]. The primary function of direction-selective neurons is to produce a preferential response to motion in a specific direction by integrating inputs from various synapses with different weights. Here, this function is simplified into a statistical and computational approach: the direction corresponding to the most frequently activated DSGC signals will be inferred as the global motion direction.
However, as illustrated in
Figure 10, in the SDDM, the overlapping receptive fields of DSGCs in the retinal layer mean that light intensity variations in specific regions can simultaneously activate multiple DSGCs. The resulting electrical signals are then transmitted to the primary visual cortex; however, direction-selective neurons alone are insufficient to accurately discern certain motion directions in this situation. This limitation mirrors the biological phenomenon wherein monocular vision is unable to fully achieve stereoscopic perception.
In biological terms, various neurons exhibit distinct functional specializations. For effective motion direction detection in a stereoscopic environment, it is essential to utilize neurons that possess binocular disparity detection capabilities. Neurons primarily processing fundamental visual information, previously discussed as direction detection neurons, are categorized as simple cells. In contrast, neurons that manage disparity processing are classified as complex cells. Although the precise neural mechanisms underlying stereoscopic vision have not yet been fully elucidated, it is possible to deduce several essential attributes that complex cells must exhibit, particularly sensitivity and selectivity [
53]. Beyond the inherent qualities of complex cells, additional elements warrant closer examination. Studies utilizing reverse correlation with random-dot stereograms have indicated that individual disparity-detecting neurons alone are insufficient for the perception of stereoscopic vision. Consequently, an aggregation of outputs from multiple such neurons is required to effectively address the matching challenge [
54]. Therefore, in the design of the SDDM, particular emphasis should be placed on the interconnections among disparity-detecting neurons. As shown in
Figure 10, the physical distance between the two eyes results in slight differences in object projections onto the retinas, as well as variations in the receptive fields of the corresponding DSGCs. In this way, the concept of “binocular disparity” in the biological visual system is represented in the SDDM. In this context, complex cells (disparity-detecting neurons) can demonstrate an ability to process binocular disparity that goes beyond simple spatial location responses. This is achieved by integrating information from a wide area to enhance sensitivity and accuracy in detecting changes in depth, which corresponds to the “sensitivity” characteristic mentioned above. In the SDDM, this capability is replicated through the integration and summation of motion direction information from both eyes. For example, as depicted in
Figure 10, in the left retinal layer, receptive fields F8 (front-down) and B9 (back-down-right) overlap. Light stimuli in this area will simultaneously activate the DSGCs corresponding to both directions. However, due to the presence of binocular disparity, the DSGCs corresponding to these directions on the right retina do not share the same receptive field. If the actual direction of motion is back-down-right, then, after integration by the disparity-detecting neurons, the direction-selective neuron corresponding to F8 will be activated once, while the one corresponding to B9 will be activated twice, ultimately enabling successful detection of back-down-right motion. This mechanism can be represented by the following formula:
$$S_{d} = N^{L}_{d} + N^{R}_{d}, \qquad \hat{d} = \arg\max_{d} S_{d},$$
where $N^{L}_{d}$ and $N^{R}_{d}$ are the activation counts for direction $d$ in the left and right retinas, $S_{d}$ is the integrated count, and $\hat{d}$ is the inferred global motion direction.
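The F8/B9 worked example can be reproduced numerically as follows (the array indices chosen for F8 and B9 are illustrative assumptions):

```python
import numpy as np

F8, B9 = 8, 9  # illustrative indices for "front-down" and "back-down-right"

# Monocular counts: in the left retina the overlapping receptive field
# activates the DSGCs of both directions, so the two directions tie.
left = np.zeros(26, dtype=int)
left[F8] += 1
left[B9] += 1

# Due to binocular disparity, the right retina's receptive fields differ,
# and only the true direction (back-down-right) is activated there.
right = np.zeros(26, dtype=int)
right[B9] += 1

# Disparity-detecting neurons integrate (sum) the counts from both eyes.
combined = left + right          # F8 -> 1, B9 -> 2
print(int(np.argmax(combined)))  # 9: the tie is resolved to back-down-right
```

The monocular tie between F8 and B9 is broken only after summation across the two eyes, mirroring the claim that monocular vision alone cannot achieve stereoscopic perception.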
4. Conclusions and Discussion
In this paper, we draw inspiration from the key biological element of stereovision, known as “disparity,” along with the established physiological structures and corresponding functions within the retinal layer and primary visual cortex. Based on these insights, we propose a biologically inspired three-dimensional motion direction detection model, termed the Stereoscopic Direction Detection Mechanism (SDDM). This model aims to address the challenges of insufficient biological interpretability—the so-called “black box” problem—commonly associated with traditional deep learning models, as well as the inadequate explanation of binocular functions provided by existing biologically inspired models.
The SDDM leverages straightforward mathematical calculations and logical relationships to replicate the functionalities of cells involved in stereoscopic motion direction detection in both the retinal layer and primary visual cortex. This includes photoreceptor cells, horizontal cells, bipolar cells, and ganglion cells in the retina, as well as motion direction detection neurons and disparity detection neurons in the primary visual cortex. To encapsulate the concept of disparity, the model is structurally divided into left and right eye components. From the photoreception phase, the information received by the retinal layers of each eye exhibits slight variations due to the interocular distance. These disparities influence the subsequent processing of motion direction information until the disparity detection neurons integrate the information from both eyes to determine the final global motion direction.
To verify the reliability and robustness of the model, several datasets containing objects of various shapes and sizes were designed for the experiments. Additionally, we augmented these datasets with up to four distinct types of noise and conducted comparative analyses with two advanced convolutional neural network models, EfficientNetB0 and ResNet34. The results indicate that the SDDM, owing to its strong biological interpretability, effectively addresses the “black box” problem inherent in the learning processes of contemporary deep learning models, thereby enhancing model transparency. Its solid performance across various scenarios also underscores its potential as a novel approach to explaining binocular functionality in stereoscopic vision. Moreover, the model’s low parameter count and computational demand substantially reduce both time and hardware costs, facilitating its deployment in environments with limited computational resources.
Although the current functions of the SDDM may seem somewhat limited, the approach of building low-cost, high-performance models with strong biological interpretability, guided by biologically heuristic principles, is not confined to any specific domain. Essentially, the human brain can also be understood as a composite system of functional modules, each primarily responsible for specific tasks. Further research into the SDDM has the potential to integrate various domain-specific biologically inspired models into a single cohesive system. Such a method can enhance the interpretability of neural network algorithms from a biological perspective, thereby charting new directions and methodologies for the development of artificial intelligence.