1. Introduction
As the primary mechanism for gathering external information and the sole means for observing motion behaviors [
1], the visual system of mammals has undergone numerous evolutionary adaptations to meet complex survival challenges [
2]. One significant adaptation is the frontalization of the eyes, a key morphological evolution that involves the migration of the eyes from the lateral sides of the head to a more central position at the front. This evolutionary process has profoundly influenced the functionality and distribution of various types of neurons within the visual neural system, especially affecting the disparity-sensitive neurons that are crucial for depth perception. As a result of eye frontalization, the area of binocular field overlap has expanded, enabling a greater number of disparity-sensitive neurons to contribute to the processing of visual information, particularly in discerning object depth [
3]. This enhanced capability allows organisms to interpret complex spatial information more effectively, culminating in the development of an advanced capability known as “stereoscopic vision” [
4].
Stereoscopic vision can be broadly defined as the capacity to discern the three-dimensional structure of visual scenes by comparing retinal images from different lines of sight in each eye, focused on the same spatial point. For organisms inhabiting three-dimensional environments, it enhances their ability to perceive and interact with their surroundings by integrating with other functions within the visual system. These functions encompass not only the perception of an object’s texture, contours, and color [
5], but also the detection of object motion [
6]. Among these, the capability to discern motion direction within three-dimensional spaces is fundamental to the survival strategies of organisms and has been extensively investigated in the field of computer vision. In current research, mainstream approaches fall into two categories: the application of deep learning [
7] and the imitation of physiological structures [
8]. Among deep learning approaches, models based on convolutional neural network (CNN) architectures have been overwhelmingly dominant. The earliest instance of a CNN was the LeNet architecture, introduced by Yann LeCun and colleagues in 1998, primarily designed for the recognition of handwritten digits [
9]. However, over the subsequent decade, the advancement of CNNs was significantly hindered by the limitations in computational resources and the scarcity of large-scale datasets. It was not until the widespread adoption of GPU computing in the field of deep learning, the availability of large-scale datasets such as ImageNet, and the optimization of algorithms and training techniques that the potential of CNNs was fully realized. The introduction of AlexNet in 2012 marked a major breakthrough [
10], significantly improving the accuracy of image recognition and ushering in a new era of rapid progress in deep learning. Leveraging the extensibility of convolutional operations [
11], CNNs have incorporated the application of 3D convolutional layers [
12]. This enables the models to capture spatiotemporal features in three-dimensional space through their weight-sharing [
13] and feature-learning [
14] mechanisms, facilitating their application to 3D image analysis.
Throughout the development of CNNs in the field of 3D image motion detection, numerous representative models have emerged. For instance, VoxNet, introduced in 2015, is a CNN architecture specifically designed for processing and recognizing 3D point cloud data. This model effectively extracts spatial features from both point cloud and voxel grid representations, enabling real-time classification tasks and underscoring the potential of CNNs in advancing 3D image recognition [
15]. However, to address the computational cost and quantization errors inherent in the voxelization process, Charles R. Qi et al. introduced PointNet in 2017. By incorporating a symmetric function to handle unordered point sets, PointNet eliminates much of the overhead and accuracy loss associated with voxelization, demonstrating superior efficiency and robustness across various 3D image tasks, including object classification, part segmentation, and scene segmentation [
16].
In practical applications, 3D CNNs have also made notable progress in several fields. Specifically, 3D ResNet, benefiting from the introduction of residual blocks and the increased depth of the network architecture, has demonstrated outstanding performance in the classification of brain magnetic resonance imaging (MRI) data, particularly in tasks involving the classification of Alzheimer’s disease (AD), mild cognitive impairment (MCI), and normal control (NC) groups [
17]. Building upon this success, the enhanced 3D EfficientNet model, introduced in 2022, incorporates 3D mobile inverted bottleneck convolution (MBConv) layers alongside squeeze-and-excitation (SE) mechanisms to effectively capture multi-scale features from 3D MRI scans. These architectural improvements significantly enhance the model’s ability to perform the early detection and prediction of AD, demonstrating superior performance compared to previous approaches [
18]. Furthermore, the newly proposed ResNet-101 model employs transfer learning techniques, leveraging pre-trained weights to improve detection performance in previously unseen scenarios. This approach enables efficient detection in complex environments and under varying speeds, further highlighting the robustness of CNNs in 3D motion detection tasks [
19].
Despite the significant advancements in deep learning models for 3D motion detection, several challenges persist. These include high training costs, both in terms of data and time [
20], difficulties in adapting to complex scenarios, and the lack of biological interpretability [
21]. While CNNs were originally inspired by the human visual system, their growing complexity has made it increasingly difficult to interpret their internal decision-making processes. This “black box” nature is a common issue across many learning algorithms [
22], prompting researchers to explore alternative approaches. Consequently, there is a growing interest in revisiting bio-inspired models to improve both interpretability and efficiency, with a renewed focus on the foundational principles that originally guided the development of artificial intelligence.
Bio-inspired models trace their origins to the early 1940s, when Warren S. McCulloch and Walter Pitts introduced the idea of constructing computational models based on the principles underlying biological neuron activity [
23]. This groundbreaking work laid the theoretical foundation for the core concepts of bionics, marking a significant milestone in the development of biologically motivated computational frameworks. Neuromorphic computing represents an advanced extension of bio-inspired research, aiming to design computational systems characterized by low power consumption, parallel processing capabilities, and adaptive learning abilities by emulating the structure of biological neurons and synapses in the brain [
24]. This approach is considered capable of addressing the interpretability challenges inherent in modern deep learning models. There are numerous practical applications in this field, such as the dendritic neuron model (DNM), introduced in 2014, which simulates the nonlinear interactions between dendritic neurons [
25]. This model is primarily applied to simulate neurons in the retina and the primary visual cortex, where it has proven effective in capturing complex synaptic computations and enhancing our understanding of neural processing in visual systems [
26]. Building upon this foundation, the dendristor model introduced in 2024 further advances neuromorphic computing by leveraging dendritic computations through multi-gate silicon nanowire transistors to perform complex dendritic processes [
27]. The model exhibited exceptional performance in visual motion perception tasks, highlighting the potential of bio-inspired models as a promising alternative for enhancing both the transparency and efficiency of modern artificial intelligence systems.
The biological foundation of most current bio-inspired models for motion direction detection originates from motion-direction-selective neurons found in the retina [
28,
29]. This concept was initially introduced by Hubel and Wiesel in 1959, when they discovered that neurons in the striate cortex of cats exhibit directional selectivity, responding preferentially to motion stimuli in specific directions [
30]. However, most of these models focus solely on monocular vision, neglecting the role of stereoscopic vision. This lack of depth perception inherently limits their applicability to two-dimensional images. To address this gap in the current research, this paper introduces a bio-inspired model designed to simulate the motion direction detection function in biological stereoscopic vision, referred to as the Stereoscopic Direction Detection Model (SDDM). The model’s overall architecture is divided into two components, corresponding to the left and right eyes in the formation of binocular disparity [
31]. While both components share a similar structure, the horizontal positional difference results in each eye receiving slightly different visual information [
32]. The individual monocular model consists of two distinct layers designed to replicate the structural and functional characteristics of the retina and the primary visual cortex. Specifically, we employed computational formulas to model the functions of specific cells and neurons involved in detecting motion direction within 3D images, emulating key components of the biological visual system. These include photoreceptor cells, bipolar cells, horizontal cells, and ganglion cells within the retinal layer, as well as complex cells and binocular-disparity-selective neurons located in the primary visual cortex [
33]. Furthermore, we formulated assumptions and constructed simulations based on the biological characteristics of synaptic connections between these cells [
34]. The detailed mechanisms underlying these processes will be fully explained in the “Mechanism and Method” section of this paper.
For a high-performing model, both interpretability and robust performance are essential requirements. A model must not only offer transparency in its underlying mechanisms but also deliver strong, reliable results across various tasks. To evaluate this, a series of comprehensive performance evaluations and comparative experiments was conducted, the results of which are detailed in the “Experiments and Analysis” section. In the performance evaluation, experiments were designed for the SDDM to detect motion direction in a 3D binary environment, involving objects of varying sizes, random shapes, and random positions. These experiments were designed to evaluate the model’s consistent performance across varying conditions, highlighting its robustness and adaptability across diverse scenarios. In the comparative experiments, EfficientNet and ResNet, which are currently regarded as the most effective CNN architectures for traditional 3D image tasks, were selected as baseline models for comparison. To rigorously assess their performance compared to the proposed model, the dataset used in these experiments was further augmented with four different types of noise, in addition to the standard performance evaluation criteria. The results demonstrate that the proposed SDDM not only offers significant advantages in transparency and interpretability but also exhibits superior robustness compared to EfficientNet and ResNet under variable conditions.
In summary, the SDDM presented in this paper is a bio-inspired model that mimics the physiological processes underlying mammalian stereoscopic vision, specifically designed to emulate the function of object motion direction detection in 3D images. Its primary innovation within the domain of traditional 3D image recognition is its biologically driven approach to modeling, which analyzes and replicates structures of the biological visual system. This approach circumvents the inherent “black box” limitations typically associated with deep learning models, thereby offering a transparent, interpretable mechanism grounded in biological principles. Within the realm of bio-inspired models, the SDDM addresses the gap in stereoscopic vision modeling and extends its application from 2D to 3D imagery, aligning more closely with the characteristics of the human visual system. Moreover, the potential of the SDDM illustrates that, beyond merely deepening the layers of a model, leveraging a bio-inspired research approach offers a promising avenue for developing models that not only exhibit strong performance but also align closely with biological plausibility.
2. Mechanism and Method
The Stereoscopic Direction Detection Model (SDDM) in this paper is a bio-inspired model that aims to simulate the function of stereoscopic vision, especially the motion direction detection mechanism in traditional 3D images. As depicted in
Figure 1, the architecture of the SDDM can be divided into two distinct components, each designed to emulate the physiological structures of the retinal layer and the primary visual cortex, respectively.
The retinal layer component models how light signals from external 3D images are captured by the retina and converted into 2D electrical signals, and how these signals are transformed into local motion signals through the interactions of photoreceptor cells, bipolar cells, horizontal cells, and ganglion cells prior to transmission to the subsequent layer. The primary visual cortex component proposes a mechanism, based on the physiological characteristics of complex cells and disparity-detecting neurons, that may play a crucial role in detecting the motion direction of objects in 3D images. This mechanism outlines how these neurons receive local motion signals from the retinal layer and process them into global motion signals through analytical and statistical computations. In the following subsections, we provide a detailed explanation of the structure and operational logic of the SDDM from a bio-inspired perspective, including an in-depth discussion of the specific functions of each model component and the characteristics of their corresponding physiological structures.
2.1. Photoreception
In biological systems, optical imaging within the visual system can be described as the following process: light, reflected from external objects, enters the eye and undergoes refraction through various ocular structures, including the cornea and lens, before finally being projected as an inverted two-dimensional image onto the retina [
35]. During this process, the photoreceptor cells in each eye capture two slightly different two-dimensional projections of the external scene, as each eye observes from a slightly different angle. The brain then integrates these two inputs to perceive depth and the spatial position of objects in three-dimensional space, thereby forming stereoscopic vision [
36]. To emulate this mechanism, it is necessary to first establish a transformation model that describes the underlying optical processes. This model will demonstrate how an input 3D image is converted into two slightly differing 2D images, each projected onto the retinas of the left and right eyes.
To introduce this mechanism more specifically, consider a simple 3 × 3 × 3 transparent cube as an example. As illustrated in
Figure 2, the inner layer is labeled B1-9, the middle layer as M1-9, and the outer layer as F1-9, represented in green, yellow, and red, respectively. For the left eye, when directly viewing this cube, the linear propagation of light reflected from the external object results in a projection of the left, outer, and partially inner sides of the cube onto the retina, forming a matrix. In this case, the outer parts of the object obscure the inner parts, making it difficult to perceive depth in certain regions. Consequently, the projection matrix is arranged as follows: B1, B4, B7 (first column); M1, M4, M7 (second column); F1, F4, F7 (third column, which occludes B2, B5, and B8, sharing the same column); M2, M5, M8 (fourth column); F2, F5, F8 (fifth column, which occludes B3, B6, and B9, also sharing the same column); M3, M6, M9 (sixth column); and F3, F6, F9 (seventh column). This creates the preliminary projection-L as illustrated in
Figure 2.
However, due to the lensing effects of the cornea and lens, the projection is inverted according to optical principles [
37], resulting in a completely flipped projection, as shown by the final projection-L in
Figure 2. According to the same principles, the right eye can generate the final projection-R. This demonstrates the corresponding relationship between the external 3D object and the two slightly different 2D projections formed on the retinas of the left and right eyes.
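As a rough illustration of the mapping just described, the following sketch builds preliminary projection-L for the 3 × 3 × 3 cube and applies the optical inversion. The row-major labeling of each 3 × 3 layer and the column ordering are taken from the description above; this is an illustrative reconstruction, not the paper's implementation.

```python
import numpy as np

def layer(prefix):
    # Build one 3 x 3 depth layer labeled row-major, e.g. B1..B9.
    return np.array([[f"{prefix}{3 * r + c + 1}" for c in range(3)]
                     for r in range(3)])

B, M, F = layer("B"), layer("M"), layer("F")  # inner, middle, outer

# Preliminary projection-L: seven columns in the order described above,
# with the F columns occluding the B columns behind them (3rd and 5th).
prelim_L = np.stack(
    [B[:, 0], M[:, 0], F[:, 0], M[:, 1], F[:, 1], M[:, 2], F[:, 2]],
    axis=1)

# Lens inversion flips the image in both axes (final projection-L).
final_L = np.flip(prelim_L)

print(prelim_L[0].tolist())  # ['B1', 'M1', 'F1', 'M2', 'F2', 'M3', 'F3']
print(final_L[0].tolist())   # ['F9', 'M9', 'F8', 'M8', 'F7', 'M7', 'B7']
```

The same construction, mirrored horizontally, yields projection-R for the right eye.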
2.2. Photoreceptor Cells
According to the biological consensus, each photoreceptor cell in the retinal layer is directly associated with a specific region on the retina. When this region is stimulated by incoming light signals, the corresponding photoreceptor cell generates a response. This region is commonly referred to as the “receptive field” [
38]. Photoreceptor cells are unique within the retinal layer in that they are the only cells capable of directly interacting with light signals. Their primary function is to transduce the light signals received from their corresponding receptive fields into electrical signals via photoelectric conversion and transmit these signals to other cells in the retinal layer. The propagation of electrical signals generated by light stimuli in these cells can be approximated by the following equation:
$V(t) = V_{\text{rest}} + \Delta V(t)$

In this equation, $V(t)$ represents the time-varying membrane potential, $V_{\text{rest}}$ denotes the resting potential, and $\Delta V(t)$ corresponds to the potential change (hyperpolarization) induced by light stimulation.
To emulate the function of photoreceptor cells within the framework of the SDDM through computational programming, the initial task is to establish a formal definition of the receptive field associated with each photoreceptor cell. As illustrated in
Figure 3, the receptive field corresponding to each photoreceptor in the SDDM is defined as a 1 × 1 pixel. This pixel represents the smallest unit within the retinal receptive field, commonly referred to as the “local receptive field”, and the position of each pixel can be represented by (x, y).
From a mathematical perspective, the process of photoelectric conversion can be conceptualized as detecting the grayscale value in the relevant region of the receptive field at a given time point and then transmitting this information to the subsequent layer of cells. This function can be expressed by the following equation:
$P(x, y, t) = I(x, y, t)$

In this formula, $I(x, y, t)$ represents the grayscale value of the receptive field at time $t$, with horizontal coordinate $x$ and vertical coordinate $y$, and $P(x, y, t)$ denotes the signal transmitted to the subsequent layer of cells.
2.3. Horizontal Cells
From a biological perspective, horizontal cells play a critical role in modulating the signal transmission between photoreceptors and bipolar cells. To replicate their function within a bio-inspired framework, a concise summary of their physiological characteristics, focusing on their location, structure, and functional roles, is presented as follows:
- (1)
Horizontal cells (HCs) are laterally located in the outer plexiform layer of the retina, where they form direct synaptic connections with photoreceptor cells through an extensive synaptic network. Their lateral distribution allows their synaptic processes to cover a broad area of the retina, enabling them to integrate information from multiple photoreceptor cells.
- (2)
The primary function of horizontal cells is lateral regulation, achieved by inhibiting the activity of neighboring cells to influence their output signals [
39].
- (3)
In the mechanism of object motion direction detection, horizontal cells contribute to enhancing contrast and spatial resolution, thereby strengthening edge detection capabilities [
40].
To reproduce the aforementioned functions and characteristics within the SDDM framework, the horizontal cell structure has been developed as depicted in Figure 4. From a mathematical perspective, the lateral modulation function of horizontal cells can be conceptualized as a process of calculating the absolute difference in grayscale value between neighboring regions. Specifically, two neighboring horizontal cells receive grayscale values from their corresponding photoreceptors at two consecutive time points: $t$ (prior to object movement) and $t + \Delta t$ (post-object movement). The absolute difference between these values is computed, and if the result exceeds a set threshold $L$ (set to 0 to reflect the sensitivity of the visual system), it indicates that the post-movement receptive field is not the location of the object’s movement, leading to an inhibitory effect on subsequent neuronal activity. Conversely, if the difference is below the threshold, it suggests that the post-movement receptive field may be the site of the object’s movement, and no inhibitory effect is applied. This functionality can be represented by the following equation:

$H(x, y) = \begin{cases} 1, & \text{if } \left| I(x', y', t) - I(x, y, t + \Delta t) \right| > L \\ 0, & \text{otherwise} \end{cases}$

In this equation, $(x', y')$ denotes the neighboring receptive field compared against $(x, y)$; an output of 1 indicates that the horizontal cell has executed its inhibitory function, whereas an output of 0 indicates that the horizontal cell has remained inactive. The connections between horizontal cells and other cells are primarily inhibitory synaptic connections. Activation of these inhibitory synapses typically leads to the opening of chloride or potassium ion channels, resulting in hyperpolarization of the postsynaptic membrane potential and thereby reducing the likelihood of neuronal activation. Functionally, inhibitory synaptic connections can be likened to NOT gates in logic circuits: even if a downstream cell receives excitatory input (logic 1) from other sources, the activation of an inhibitory synapse from a horizontal cell will prevent the generation of an action potential through hyperpolarization, resulting in an inhibitory effect (logic 0).
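The horizontal-cell comparison can be sketched as follows. The pairing of a neighboring receptive field at time $t$ with the center field at $t + \Delta t$, and the function name, are modeling assumptions consistent with the description above, not the paper's exact implementation.

```python
import numpy as np

L = 0  # sensitivity threshold, set to 0 as in the text

def horizontal_cell(frame_t, frame_t_dt, x, y, nx, ny):
    """Return 1 (inhibitory output) when the grayscale value of the
    neighboring receptive field (nx, ny) before movement differs from
    that of field (x, y) after movement by more than L, else 0."""
    diff = abs(int(frame_t[ny, nx]) - int(frame_t_dt[y, x]))
    return 1 if diff > L else 0

# Usage: a mismatched pair triggers inhibition; a matched pair does not.
before = np.array([[10, 10], [10, 10]], dtype=np.uint8)
after  = np.array([[10, 200], [10, 10]], dtype=np.uint8)
print(horizontal_cell(before, after, 1, 0, 0, 0))  # mismatch -> 1
print(horizontal_cell(before, after, 0, 1, 1, 1))  # match -> 0
```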
2.4. Bipolar Cells
To align with the design principles of bio-inspired models, the bipolar cell component in the SDDM framework should adhere to the following structural and functional characteristics:
- (1)
Bipolar cells (BCs) are classified into three types based on their responses to changes in light intensity: ON, OFF, and ON–OFF [
41]. In the SDDM model, ON–OFF bipolar cells are employed, characterized by their ability to respond to both increases and decreases in light intensity.
- (2)
ON–OFF bipolar cells establish direct synaptic connections with photoreceptor cells, enabling them to respond to changes in light intensity by receiving instantaneous light signals. Specifically, when light intensity fluctuates, the concentration of glutamate released by photoreceptors simultaneously changes. ON–OFF bipolar cells react to these fluctuations by either initiating or inhibiting depolarization, thereby transmitting distinct response signals accordingly [
42].
Based on these characteristics, it can be concluded that the primary function of ON–OFF bipolar cells is to respond to instantaneous changes in light intensity. However, in the context of motion detection function within the visual system, the concept of “instantaneous” is relative. Even if changes in light intensity occur instantaneously, the bipolar cells and the entire visual pathway are constrained by a temporal response window [
43]. This window, denoted as $\Delta t$, is defined as 13 milliseconds, which is currently regarded as the minimal time interval required to process a single frame of visual information [44]. Thus, the ON–OFF bipolar cell component, as illustrated in Figure 5, has been developed.

Mathematically, the mechanism of the instantaneous light intensity response of ON–OFF bipolar cells can be simulated by comparing the electrical signals from the same receptive field at two time points separated by $\Delta t$. Specifically, the ON–OFF bipolar cells in the SDDM receive electrical signals (grayscale values of the receptive field) from the corresponding photoreceptor cells at times $t$ and $t + \Delta t$, which reflects their synaptic connections with photoreceptors at the physiological level. The difference between the received values is then computed, representing the process by which the cells respond to changes in light intensity. The resulting difference is converted to its absolute value and compared with the set threshold $L$. If the result exceeds the threshold $L$, it indicates that a change in light intensity has occurred in the corresponding receptive field, thereby activating the bipolar cell and producing an output of 1. Conversely, if the result is less than or equal to the threshold $L$, it signifies no significant change in light intensity, resulting in an output of 0. The whole process can be represented by the following equation:

$B(x, y) = \begin{cases} 1, & \text{if } \left| I(x, y, t + \Delta t) - I(x, y, t) \right| > L \\ 0, & \text{otherwise} \end{cases}$
In this equation, result 1 represents the activation of the bipolar cell and the generation of a response while result 0 indicates that the bipolar cell is not activated. Based on biological research, the connections between bipolar cells and other cells are excitatory synaptic connections. When these synapses are activated, sodium ion channels on the postsynaptic membrane open, allowing sodium ions to flow in, causing depolarization of the postsynaptic neuron’s membrane potential. This depolarization pushes the neuron toward the action potential threshold, ultimately leading to the transmission of a nerve impulse. In the SDDM, this means that when the excitatory signal (logic 1) from the bipolar cell is received by other cells, it will remain as an excitatory signal (logic 1) without being disrupted by the synaptic connection.
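A minimal sketch of the ON–OFF bipolar-cell response described above, assuming frames are 2D grayscale arrays, threshold $L = 0$, and two frames separated by the response window $\Delta t$:

```python
import numpy as np

L = 0  # sensitivity threshold, set to 0 as in the text

def bipolar_cell(frame_t, frame_t_dt, x, y):
    """Output 1 when the light intensity of receptive field (x, y)
    changes between t and t + dt (dt ~ 13 ms), else 0."""
    return 1 if abs(int(frame_t_dt[y, x]) - int(frame_t[y, x])) > L else 0

# Usage: a pixel that switches from dark to bright activates the cell;
# an unchanged pixel leaves it silent.
a = np.zeros((3, 3), dtype=np.uint8)
b = a.copy()
b[1, 1] = 255
print(bipolar_cell(a, b, 1, 1))  # -> 1
print(bipolar_cell(a, b, 0, 0))  # -> 0
```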
2.5. Ganglion Cells
The ganglion cells referred to here specifically denote the direction-selective ganglion cells (DSGCs) located in the retinal layer. These specialized ganglion cells are responsible for detecting the direction of object movement and exhibit the following biological characteristics:
- (1)
The primary characteristic of direction-selective ganglion cells is their sensitivity to motion in a specific direction. Each type of DSGC has a specific preferred direction and elicits an excitatory response to motion aligned with that direction. This direction selectivity is mediated by the spatial asymmetry of their receptive fields, along with the integration of inhibitory inputs within the neural circuitry [
45,
46].
- (2)
DSGCs primarily receive inputs from bipolar cells and horizontal cells. These inputs are integrated through a complex combination of excitatory and inhibitory signals to encode motion direction function. The bipolar cells primarily transmit changes in light intensity from photoreceptors, providing excitatory input to DSGCs, while the horizontal cells primarily deliver inhibitory input, contributing to the selective suppression of non-preferred directions [
47].
- (3)
Nonlinear synaptic interactions exist within the dendrites of retinal ganglion cells, particularly when an inhibitory synapse is positioned along the transmission path from an excitatory synapse to the soma. In such cases, the inhibitory synapse exerts a significant shunting inhibition on the excitatory input. These nonlinear interactions form the foundational mechanism underlying direction selectivity in the visual system [
48].
In SDDM, DSGCs are categorized into 26 types, each corresponding to 1 of the 26 fundamental directions of motion in a three-dimensional context (up, up-right, right, down-right, down, down-left, left, up-left; forward, forward-up, forward-up-right, forward-right, forward-down-right, forward-down, forward-down-left, forward-left, forward-up-left; backward, backward-up, backward-up-right, backward-right, backward-down-right, backward-down, backward-down-left, backward-left, backward-up-left). These DSGCs share a fundamentally similar structure, with variations in their preferred directional selectivity, and they exhibit active responses exclusively to motion in a specific direction. DSGCs do not directly acquire information from the retina; instead, they receive preliminarily processed visual inputs through intermediary cells such as bipolar and horizontal cells, which connect to their receptive fields [
49]. This structural configuration consequently leads to multiple DSGCs potentially responding to stimuli from the same visual space area, thus sharing the same receptive field. In this paper, a simplified logical model is employed to simulate this complex physiological structure, as shown in
Figure 6.
Using the retinal layer of the left eye as an example, the receptive field of each DSGC consists of one central pixel surrounded by 20 surrounding pixels, with activations corresponding to 26 fundamental motion directions, as illustrated in
Figure 7. The subscripts in the figure represent the three depth layers in three-dimensional space, while the arrows indicate the eight fundamental directions of motion in each layer; the remaining two symbols denote forward-only and backward-only motion, respectively. In this model, the occluded regions due to light imaging are defined as shared receptive fields, such as the third and fifth columns in the matrix. Consequently, light intensity variations in these regions lead to the simultaneous activation of multiple DSGCs.
As mentioned in the “Photoreception” section, the left and right eyes will obtain distinct projected images due to different viewing angles. This difference, known as parallax, is crucial for depth perception in three-dimensional space [
50]. Due to parallax, it is plausible to hypothesize that although the structures and functions of DSGCs in both eyes are nearly identical, certain differences may exist in order to effectively process three-dimensional visual information, accommodating inputs from two perspectives. In this model, these adjustments are represented by subtle differences in the receptive fields of DSGCs in each eye. As illustrated in
Figure 8, the structure of DSGCs in the right retina aligns closely with that of the left retina, although differences appear in the activation regions corresponding to specific motion directions.
In the SDDM, the nonlinear synaptic interactions within the dendrites of retinal ganglion cells are emulated using conditional logical operations analogous to those in electronic circuits. Specifically, a DSGC can be activated only when the excitatory synapses connected to bipolar cells are stimulated, while the inhibitory synapses associated with horizontal cells can veto the DSGC output through lateral inhibition. This leads to an output condition analogous to the NOT-AND logic observed in electronic circuits, which can be mathematically represented as follows:

$G(x, y) = B(x, y) \land \lnot H(x, y)$

where $B(x, y)$ and $H(x, y)$ are the outputs of the bipolar and horizontal cells associated with receptive field $(x, y)$, and $G(x, y) = 1$ denotes activation of the corresponding DSGC.
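The veto behavior described above can be sketched as a simple gate; the function name is illustrative:

```python
def dsgc_output(bipolar: int, horizontal: int) -> int:
    """0/1 inputs; a DSGC fires only when its excitatory bipolar input
    is active AND its inhibitory horizontal input is silent (NOT-AND)."""
    return 1 if (bipolar == 1 and horizontal == 0) else 0

# Truth table of the gating behavior.
for b_in in (0, 1):
    for h_in in (0, 1):
        print(b_in, h_in, dsgc_output(b_in, h_in))
```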
For DSGCs, each ganglion cell responds exclusively to light stimuli within its receptive field. To replicate this biological function in the SDDM, it is assumed that, during the photoreception phase, a three-dimensional object is not projected onto the retina as a whole but rather in discrete units of
voxels. As shown in
Figure 9, the stride between each voxel unit is set to one minimal unit, producing an effect akin to the extraction of local motion information through three-dimensional convolution with a stride of 1.
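The local segmentation step can be sketched as a sliding-window generator over the 3D image; the 3 × 3 × 3 window size is an assumption for illustration, while the stride of 1 follows the description above.

```python
import numpy as np

def local_units(volume, k=3, stride=1):
    """Yield every k x k x k sub-volume of `volume` with the given
    stride, analogous to 3D convolution patch extraction."""
    D, H, W = volume.shape
    for z in range(0, D - k + 1, stride):
        for y in range(0, H - k + 1, stride):
            for x in range(0, W - k + 1, stride):
                yield (z, y, x), volume[z:z + k, y:y + k, x:x + k]

# Usage: a 4x4x4 volume yields 2*2*2 = 8 overlapping 3x3x3 units.
vol = np.arange(4 * 4 * 4).reshape(4, 4, 4)
units = list(local_units(vol))
print(len(units))           # -> 8
print(units[0][1].shape)    # -> (3, 3, 3)
```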
In the SDDM, the retinal layer processes directional information of object movement as follows: the three-dimensional image is segmented locally into voxel units, each of which is projected onto the retina as a two-dimensional pattern corresponding to the local receptive fields of DSGCs. The light signals received by these receptive fields are converted into electrical signals by photoreceptors and transmitted to horizontal cells and bipolar cells. These signals are then conveyed to DSGCs via inhibitory and excitatory synapses, respectively, determining DSGC activation based on the resulting input.
This process can be represented by the following formulas:
$$N^{L}_{d} = \sum_{p \in P} \mathbb{1}^{L}_{d}(p), \qquad N^{R}_{d} = \sum_{p \in P} \mathbb{1}^{R}_{d}(p), \qquad d = 1, \dots, 26.$$
In these formulas, $N^{L}$ and $N^{R}$ are 26-element arrays, where each value represents the activation count of the DSGCs tuned to the corresponding direction in the left or right retina. $\mathbb{1}_{d}(p)$ is an indicator function that outputs 1 when the scan reaches position $p$ and the corresponding DSGCs are activated; otherwise, it outputs 0. $P$ denotes the set of scanned positions.
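The per-direction tally can be sketched as follows (the data layout, an iterable of position/direction pairs, is our illustrative assumption):

```python
import numpy as np

N_DIRECTIONS = 26  # the 26 neighbouring directions of a cell in a 3-D grid

def tally_activations(activations) -> np.ndarray:
    """Count DSGC activations per direction for one retina.

    `activations` is an iterable of (position, direction_index) pairs,
    one per activated DSGC; the result is the 26-element count array
    described in the text (data layout is illustrative, not the paper's)."""
    counts = np.zeros(N_DIRECTIONS, dtype=int)
    for _pos, d in activations:
        counts[d] += 1
    return counts

left = tally_activations([((0, 0, 0), 7), ((1, 0, 0), 7), ((0, 1, 0), 3)])
print(left[7], left[3])  # 2 1
```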
2.6. Primary Visual Cortex
Local motion information originating from the retinal layer is integrated by the lateral geniculate nucleus before being conveyed to layer 4
of the primary visual cortex. Within this cortical area, a diverse array of neuronal functions is present, collaboratively engaged in processing the incoming information from upstream stages. Within this diverse array of neurons, those potentially involved in detecting motion direction in three-dimensional images are the simple neurons that are more sensitive to specific disparities and directions, along with the complex neurons that can extract correct disparity information. These neurons are often located in the early stages of the visual pathway and are capable of achieving correct binocular matching through disparity information, thus forming the rudiments of global perception in stereoscopic vision [
51,
52]. The primary function of direction-selective neurons is to produce a preferential response to motion in a specific direction by integrating inputs from various synapses with different weights. Here, this function is simplified into a statistical and computational approach: the direction corresponding to the most frequently activated DSGC signals will be inferred as the global motion direction.
However, as illustrated in
Figure 10, in the SDDM, the overlapping receptive fields of DSGCs in the retinal layer mean that light intensity variations in specific regions can simultaneously activate multiple DSGCs. The resulting electrical signals are then transmitted to the primary visual cortex; however, direction-selective neurons alone are insufficient to accurately discern certain motion directions in this situation. This limitation mirrors the biological phenomenon wherein monocular vision is unable to fully achieve stereoscopic perception.
In biological terms, various neurons exhibit distinct functional specializations. For effective motion direction detection in a stereoscopic environment, it is essential to utilize neurons that possess binocular disparity detection capabilities. Neurons primarily processing fundamental visual information, previously discussed as direction detection neurons, are categorized as simple cells. In contrast, neurons that manage disparity processing are classified as complex cells. Although the precise neural mechanisms underlying stereoscopic vision have not yet been fully elucidated, it is possible to deduce several essential attributes that complex cells must exhibit, particularly sensitivity and selectivity [
53]. Beyond the inherent qualities of complex cells, additional elements warrant closer examination. Studies utilizing reverse correlation with random-dot stereograms have indicated that individual disparity-detecting neurons alone are insufficient for the perception of stereoscopic vision. Consequently, an aggregation of outputs from multiple such neurons is required to effectively address the matching challenge [
54]. Therefore, in the design of the SDDM, particular emphasis should be placed on the interconnections among disparity-detecting neurons. As shown in
Figure 10, the physical distance between the two eyes results in slight differences in object projections onto the retinas, as well as variations in the receptive fields of the corresponding DSGCs. In this way, the concept of “binocular disparity” in the biological visual system is represented in the SDDM. In this context, complex cells (disparity-detecting neurons) can demonstrate an ability to process binocular disparity that goes beyond simple spatial location responses. This is achieved by integrating information from a wide area to enhance sensitivity and accuracy in detecting changes in depth, which corresponds to the “sensitivity” characteristic mentioned above. In the SDDM, this capability is replicated through the integration and summation of motion direction information from both eyes. For example, as depicted in
Figure 10, in the left retinal layer, receptive fields F8 (front-down) and B9 (back-down-right) overlap. Light stimuli in this area will simultaneously activate the DSGCs corresponding to both directions. However, due to the presence of binocular disparity, the DSGCs corresponding to these directions on the right retina do not share the same receptive field. If the actual direction of motion is back-down-right, then, after integration by the disparity-detecting neurons, the direction-selective neuron corresponding to F8 will be activated once, while the one corresponding to B9 will be activated twice, ultimately enabling successful detection of back-down-right motion. This mechanism can be represented by the following formula:
$$S_{d} = N^{L}_{d} + N^{R}_{d}, \qquad \hat{d} = \arg\max_{d} S_{d},$$
where $N^{L}_{d}$ and $N^{R}_{d}$ are the activation counts for direction $d$ in the left and right retinas, $S_{d}$ is the integrated count, and $\hat{d}$ is the inferred global motion direction.
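The F8/B9 worked example can be reproduced numerically as follows (the array indices chosen for F8 and B9 are illustrative assumptions):

```python
import numpy as np

F8, B9 = 8, 9  # illustrative indices for "front-down" and "back-down-right"

# Monocular counts: in the left retina the overlapping receptive field
# activates the DSGCs of both directions, so the two directions tie.
left = np.zeros(26, dtype=int)
left[F8] += 1
left[B9] += 1

# Due to binocular disparity, the right retina's receptive fields differ,
# and only the true direction (back-down-right) is activated there.
right = np.zeros(26, dtype=int)
right[B9] += 1

# Disparity-detecting neurons integrate (sum) the counts from both eyes.
combined = left + right          # F8 -> 1, B9 -> 2
print(int(np.argmax(combined)))  # 9: the tie is resolved to back-down-right
```

The monocular tie between F8 and B9 is broken only after summation across the two eyes, mirroring the claim that monocular vision alone cannot achieve stereoscopic perception.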
4. Conclusions and Discussion
In this paper, we draw inspiration from the key biological element of stereovision, known as “disparity,” along with the established physiological structures and corresponding functions within the retinal layer and primary visual cortex. Based on these insights, we propose a biologically inspired three-dimensional motion direction detection model, termed the Stereoscopic Direction Detection Mechanism (SDDM). This model aims to address the challenges of insufficient biological interpretability—the so-called “black box” problem—commonly associated with traditional deep learning models, as well as the inadequate explanation of binocular functions provided by existing biologically inspired models.
The SDDM leverages straightforward mathematical calculations and logical relationships to replicate the functionalities of cells involved in stereoscopic motion direction detection in both the retinal layer and primary visual cortex. This includes photoreceptor cells, horizontal cells, bipolar cells, and ganglion cells in the retina, as well as motion direction detection neurons and disparity detection neurons in the primary visual cortex. To encapsulate the concept of disparity, the model is structurally divided into left and right eye components. From the photoreception phase, the information received by the retinal layers of each eye exhibits slight variations due to the interocular distance. These disparities influence the subsequent processing of motion direction information until the disparity detection neurons integrate the information from both eyes to determine the final global motion direction.
To verify the reliability and robustness of the model, several datasets containing objects of various shapes and sizes were designed for the experiments. Additionally, we augmented these datasets with up to four distinct types of noise and conducted comparative analyses with two advanced convolutional neural network models, EfficientNetB0 and ResNet34. The results indicate that the SDDM, owing to its strong biological interpretability, effectively addresses the “black box” problem inherent in the learning processes of contemporary deep learning models, thereby enhancing model transparency. Its solid performance across various scenarios also underscores its potential as a novel approach to explaining binocular functionality in stereoscopic vision. Moreover, the model’s low parameter count and computational demand substantially reduce both time and hardware costs, facilitating its deployment in environments with limited computational resources.
Although the current functions of the SDDM may seem somewhat limited, the approach of building low-cost, high-performance models with strong biological interpretability, guided by biologically heuristic principles, is not confined to any specific domain. Essentially, the human brain can also be understood as a composite system of functional modules, each primarily responsible for specific tasks. Further research into the SDDM has the potential to integrate various domain-specific biologically inspired models into a single cohesive system. Such a method can enhance the interpretability of neural network algorithms from a biological perspective, thereby charting new directions and methodologies for the development of artificial intelligence.