Article

Geometry-Constrained Learning-Based Visual Servoing with Projective Homography-Derived Error Vector

by
Yueyuan Zhang
,
Arpan Ghosh
,
Yechan An
,
Kyeongjin Joo
,
SangMin Kim
and
Taeyong Kuc
*
Department of Electrical and Computer Engineering, College of Information and Communication Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(8), 2514; https://doi.org/10.3390/s25082514
Submission received: 16 March 2025 / Revised: 7 April 2025 / Accepted: 15 April 2025 / Published: 16 April 2025
(This article belongs to the Section Intelligent Sensors)

Abstract

We propose a novel geometry-constrained learning-based method for camera-in-hand visual servoing systems that eliminates the need for camera intrinsic parameters, depth information, and the robot’s kinematic model. Our method uses a cerebellar model articulation controller (CMAC) to execute online Jacobian estimation within the control framework. Specifically, we introduce a fixed-dimension, uniform-magnitude error function based on the projective homography matrix. The fixed-dimension error function ensures a constant Jacobian size regardless of the number of feature points, thereby reducing computational complexity. By not relying on individual feature points, the approach maintains robustness even when some features are occluded. The uniform magnitude of the error vector elements simplifies neural network input normalization, thereby enhancing online training efficiency. Furthermore, we incorporate geometric constraints between feature points (such as collinearity preservation) into the network update process, ensuring that model predictions conform to the fundamental principles of projective geometry and eliminating physically impossible control outputs. Experimental and simulation results demonstrate that our approach achieves superior robustness and faster learning rates compared to other model-free image-based visual servoing methods.

1. Introduction

Visual servoing uses visual feedback to refine robot motion strategies, thereby significantly enhancing adaptability and intelligence. The visual system can be mounted on the robot’s end effector (eye-in-hand configuration) or positioned near the robot (eye-to-hand configuration) [1]. This study focuses on the eye-in-hand configuration.
Visual servoing is generally divided into two main approaches: position-based visual servoing (PBVS) and image-based visual servoing (IBVS) [2,3,4]. PBVS operates in Cartesian space, enabling global convergence by directly associating image features with the pose of the target in 3D space. However, PBVS is sensitive to camera calibration errors, inaccuracies in the kinematic model, and noise; it often requires precise calibration, which makes it challenging for non-experts. In contrast, IBVS defines its error signal in the image space, making it less sensitive to calibration errors, which primarily affect convergence speed rather than accuracy. Nevertheless, IBVS is prone to local minima, and the appearance or disappearance of feature points during the control process can significantly disrupt the continuity of the control law [5]. An alternative approach is homography-based visual servoing (HBVS), which combines the advantages of PBVS and IBVS. HBVS is resilient to partial occlusion of features and does not require prior information for 3D reconstruction. Studies [6,7,8,9] have decomposed the homography matrix to obtain the translation vector t and rotation matrix R, converting the image information into motion in Cartesian space. However, decomposing the homography matrix during the control process increases the computational load and degrades the real-time performance of the system. To circumvent this issue, one study [10] proposed constructing a cost function directly from the elements of the homography matrix; however, this method often requires vanishing-point detection. To address these challenges, projective homography-based uncalibrated visual servoing (PHUVS) was introduced [11]. PHUVS establishes the task function by estimating the projective homography using only image information within a certain range, eliminating the need for vanishing-point detection. Moreover, it produces Jacobian matrices of fixed dimensions, thereby reducing the computational complexity of matrix inversion.
Visual servoing methods typically rely on the Jacobian matrix, which describes the mapping between visual information and manipulator joint angular displacements. It is a local linear approximation of the nonlinear and highly coupled relationship between the visual and motion spaces, making it crucial for control algorithms. Traditional methods usually require accurate system kinematic models or precise estimates of the intrinsic camera parameters, which significantly limits their applicability. To address the challenges of uncalibrated visual servoing, adaptive controllers have been proposed to estimate unknown parameters online [12,13,14]. However, these methods often require linear parameterization of the system model, which is complicated for nonlinear systems with multiple degrees of freedom. Consequently, model-free learning methods have been introduced to compensate for these uncertainties without the need for linear parameterization. These methods treat the estimation of the Jacobian matrix as a dynamic parameter identification problem, employing recursive estimation techniques such as weighted recursive least squares, Broyden’s method, and dynamic Gauss–Newton algorithms [15,16]. Neural networks are renowned for their powerful nonlinear function approximation capabilities and have been widely applied in robotic systems [17,18,19,20,21,22,23]. However, their application in visual servoing systems is relatively limited. In [17], a method based on perceptron neural networks was proposed to learn the inverse mapping of an unknown interaction matrix; however, it required a large amount of data for offline training. Data-driven online mapping estimation methods have been designed in [18,19]. In addition, ref. [20] proposed a data-driven IBVS method that addresses both target tracking and the physical constraints of the robot, such as joint velocity limits and field-of-view (FOV) constraints. The study in [21] presented a robotic control system that uses a CMAC network with a Takagi–Sugeno fuzzy framework, capable of learning joint velocities online directly from image feature errors. Recent works have also explored deep learning-based visual servoing frameworks with improved robustness and noise tolerance. For instance, ref. [22] employs deep networks for 3D visual servoing without requiring precise calibration, while [23] introduces a noise-tolerant Jacobian estimation method using neural networks under pixel-level disturbances.
Traditional visual servoing systems often rely on handcrafted features such as SIFT or ORB, which can be sensitive to noise, lighting changes, and viewpoint variations. To improve robustness, learning-based methods like SuperPoint [24] and LightGlue [25] have been developed. These methods offer improved feature repeatability and matching accuracy under challenging conditions, making them promising complements to learning-based visual servoing frameworks.
Current visual servoing algorithms face two main challenges: (1) Existing model-free visual servoing approaches are predominantly based on IBVS, where the loss of feature points can interrupt the control process. In practice, due to lighting changes and partial occlusion of feature points, it is difficult to ensure that every feature point is detected and accurately matched throughout control. Moreover, IBVS is primarily designed for basic visual primitives, such as distinct points or edges; for objects with sparse or low-texture surfaces, feature matching errors become significantly higher, further degrading the reliability of the control process. Furthermore, the size of the image Jacobian matrix is proportional to the number of feature points, resulting in high computational costs when calculating the Jacobian pseudoinverse. When feature points are used as the inputs or outputs of a neural network, an increase in their number significantly expands the network’s dimensionality, thereby affecting training efficiency and computational complexity. This not only increases the number of network parameters but may also lead to optimization issues such as vanishing or exploding gradients, ultimately impacting convergence speed and stability. (2) Current model-free visual servoing methods typically use neural networks or recursive algorithms to learn the image Jacobian or to map visual features to control commands. However, these approaches often ignore the geometric constraints among visual features, resulting in parameter updates that violate physical laws during training. Without explicitly incorporating these constraints, robots may execute unreasonable motions.
We develop a geometry-constrained learning-based control strategy based on PHUVS. The main contributions of this paper include:
  • Learning-Based Control Strategy: A novel control method is proposed for PHUVS, where the CMAC neural network is employed to estimate the Jacobian matrix online. This eliminates the need for traditional calibration and kinematic modeling.
  • Fixed-Dimension Visual Error Function via Homography: A new visual error function based on the projective homography matrix is designed. It maintains a fixed dimension and uniform magnitude across all components, thereby improving learning efficiency, reducing computational complexity, and offering robustness against feature point occlusion and detection errors.
  • Incorporation of Geometric Constraints in Learning: Fundamental geometric relationships among visual features (e.g., collinearity) are embedded into the CMAC network’s learning process, enhancing both accuracy and convergence speed.
The remainder of this paper is organized as follows: Section 2 provides an overview of the camera/robot model and the projective homography matrix, while clearly defining the control problem. Section 3 details the developed controller. Section 4 presents the stability analysis of the proposed control system, providing theoretical guarantees for the controller’s performance. In Section 5, we analyze the simulation and experimental results to demonstrate the efficacy of the proposed methodology. Section 6 presents the conclusions and highlights the contributions of the study.

2. Preliminary Knowledge

2.1. Kinematics Model

As shown in Figure 1, $\{F_c\}$ and $\{F_{cd}\}$ represent the current and desired camera frames, respectively. Let $n_d$ denote the normal vector of the plane $\pi$ expressed in $\{F_{cd}\}$, with $\|n_d\| = 1/d$, where $d$ is the distance between the plane and the center of projection at the reference pose. $R$ denotes the orientation of $\{F_{cd}\}$ with respect to $\{F_c\}$, and $t$ denotes the translation vector from $\{F_c\}$ to $\{F_{cd}\}$ expressed in the coordinate frame $\{F_c\}$.
Let $p_i$ be a point of the target in 3D space. Its Euclidean coordinates expressed in $\{F_c\}$ and $\{F_{cd}\}$, denoted $p_i^c$ and $p_i^{cd}$, satisfy:
$p_i^c = H p_i^{cd},$
$H = R + t n_d^T,$
where H is the Euclidean homography matrix. This matrix describes how the 3D points lying on a planar surface are mapped between two Euclidean frames.
The projections of the 3D point into the two image planes are $[\varepsilon_{di}^T \; 1]^T = \frac{K p_i^{cd}}{Z_{di}}$ and $[\varepsilon_i^T \; 1]^T = \frac{K p_i^c}{Z_i}$, where $K$ is the camera intrinsic parameter matrix.
Subsequently, the relationship between ε i and ε di will be:
$\begin{bmatrix} \varepsilon_i \\ 1 \end{bmatrix} = \frac{Z_{di}}{Z_i} G \begin{bmatrix} \varepsilon_{di} \\ 1 \end{bmatrix},$
$G = K H K^{-1},$
where $G$ is the projective homography matrix. In the absence of depth information, we can compute the matrix $G$ only up to a scale factor, using four or more non-collinear matched points $(\varepsilon_i, \varepsilon_{di})$, $i \in \{1, 2, 3, 4\}$: $\bar{G} = \beta G$.
$\bar{G}$ is an estimate of $G$, and the parameter $\beta \in \mathbb{R}$ is an arbitrary positive scale factor. It is crucial to note that this scale factor is independent of the depth ratio $\frac{Z_{di}}{Z_i}$. Typically, the determinant of $\bar{G}$ is normalized to 1 ($\det \bar{G} = 1$) to eliminate the ambiguity introduced by the scale factor. In this case, the matrix $\bar{G}$ can be written as $\bar{G} = \frac{G}{(\det G)^{1/3}}$. The relationship between the positions of feature points in the two images can then be rewritten as follows:
$\begin{bmatrix} \varepsilon_i \\ 1 \end{bmatrix} = \gamma_i \bar{G} \begin{bmatrix} \varepsilon_{di} \\ 1 \end{bmatrix},$
$\gamma_i = \frac{1}{\bar{g}_{3,:} [\varepsilon_{di}^T \; 1]^T},$
where $\bar{g}_{3,:}$ is the third row of the matrix $\bar{G}$. The subscript $i$ denotes the index of the feature point. In the subsequent text, we use the subscript $c$ to denote the object center, where $\varepsilon_{dc}$ and $\varepsilon_c$ represent its desired and current positions in image space, respectively. The values of $\bar{G}$ and $\gamma_i$ can be estimated from $m$ ($m > 4$) pairs of non-collinear matched feature points. When $\bar{G}$ tends to the identity matrix $I_{3 \times 3}$, the two images are perfectly aligned.
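For illustration, the following minimal Python sketch estimates $\bar{G}$ with OpenCV’s RANSAC-based homography routine and removes the scale ambiguity by normalizing its determinant to one, as in Equations (5) and (6). The function and variable names are ours, and the 3-pixel reprojection threshold is an assumed value rather than one taken from the paper.

```python
import cv2
import numpy as np

def estimate_projective_homography(eps_d, eps):
    """Estimate the normalized projective homography G_bar from matched points.

    eps_d, eps : (P, 2) arrays of desired and current pixel coordinates
                 (P >= 4, points not all collinear).
    Returns G_bar with det(G_bar) = 1 and the per-point scale factors gamma_i.
    """
    # RANSAC filters mismatched pairs; G maps desired pixels to current pixels
    # only up to an arbitrary scale factor beta.
    G, inlier_mask = cv2.findHomography(eps_d, eps, cv2.RANSAC, 3.0)

    # Remove the scale ambiguity by normalizing the determinant to 1.
    G_bar = G / np.cbrt(np.linalg.det(G))

    # gamma_i = 1 / (g_bar_{3,:} [eps_di^T 1]^T)  (Equation (6)).
    eps_d_h = np.hstack([eps_d, np.ones((eps_d.shape[0], 1))])  # homogeneous
    gamma = 1.0 / (eps_d_h @ G_bar[2, :])
    return G_bar, gamma, inlier_mask
```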
In a camera-in-hand visual servoing setup, the camera is mounted on the robot’s end-effector. By solving the kinematics, the position vector $r_c^b$ of the camera in the base frame is a function of the joint variables:
$r_c^b = f(q),$
where $q \in \mathbb{R}^{n_q}$ is the joint angle vector of the manipulator. The spatial velocity of the end-effector can be obtained as:
$v_c^b = J_r(q) \dot{q}, \quad J_r \in \mathbb{R}^{6 \times n_q},$
where $n_q$ is the number of robot joints, $J_r(q)$ is the robot Jacobian, $\dot{q}$ is the vector of joint velocities, and $v_c^b = [v_x, v_y, v_z, \omega_x, \omega_y, \omega_z]^T$ is the end-effector spatial velocity, which includes the linear velocity $v$ and angular velocity $\omega$.

2.2. Problem Description and Motivation

Equation (8) describes the velocity mapping from the joint space to the camera motion in Cartesian space. To obtain the mapping from the joint space to the image space, we must construct an image Jacobian, which relates the camera’s velocities to the resulting changes in the image space.
In IBVS methods [12,13,14,15,16,17,18,19,20,21], the image error $\tilde{\varepsilon}_i = \varepsilon_{di} - \varepsilon_i$ is directly used as the feedback signal. Its derivative and its mapping to the joint space can be expressed as follows:
$\dot{\tilde{\varepsilon}} = J_c^{IBVS} J_r \dot{q} = J^{IBVS} \dot{q}, \quad J^{IBVS} \in \mathbb{R}^{2P \times n_q},$
where $\dot{\tilde{\varepsilon}} = [\dot{\tilde{\varepsilon}}_1^T, \dot{\tilde{\varepsilon}}_2^T, \ldots, \dot{\tilde{\varepsilon}}_P^T]^T \in \mathbb{R}^{2P \times 1}$, and $P$ and $n_q$ represent the numbers of feature points and robot joints, respectively. The image Jacobian matrix contains information on the feature point positions and image depths: $J_c^{IBVS} = [J_{c,1}^{IBVS}(\varepsilon_1, Z_1); J_{c,2}^{IBVS}(\varepsilon_2, Z_2); \ldots; J_{c,P}^{IBVS}(\varepsilon_P, Z_P)]$.
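For reference, the classical point-feature interaction matrix used in such IBVS schemes has the well-known form sketched below. This is background material rather than part of the proposed controller, and it assumes the focal length f and the point depth Z are known, which is exactly the requirement the later sections avoid.

```python
import numpy as np

def ibvs_interaction_matrix(u, v, Z, f):
    """Classical 2x6 interaction matrix of a point feature (u, v) at depth Z.

    Maps the camera spatial velocity [vx, vy, vz, wx, wy, wz] to the image
    velocity [u_dot, v_dot]; stacking P of these gives the 2P x 6 camera
    Jacobian J_c^IBVS used in Equation (9).
    """
    return np.array([
        [-f / Z, 0.0,    u / Z, u * v / f,          -(f**2 + u**2) / f,  v],
        [0.0,    -f / Z, v / Z, (f**2 + v**2) / f,  -u * v / f,         -u],
    ])

# Stacking P such blocks yields a 2P x 6 image Jacobian,
# whose size (and pseudoinverse cost) grows with the number of features P.
```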
Although depth-independent Jacobian matrices have been developed in some studies, several common challenges persist. First, the dimension of the Jacobian is proportional to the number of feature points, implying that as the number of feature points increases, the computational complexity also increases. Second, the task function typically minimizes the position errors of all feature points. When some feature points are occluded, they become unavailable. Changes in the number of feature points can lead to discontinuities in the task function, resulting in discontinuities in the control law.
In study [11], a task error based on the PHUVS was designed as follows:
$e_{\bar{g}} = \mathrm{reshape}(I - \bar{G},\ 9,\ 1)$
The expression linking the derivative of $e_{\bar{g}}$ with the camera velocity is given as:
$\dot{e}_{\bar{g}} = J_c^{PHUVS}(e_{\bar{g}})\, v_c^b,$
where the image Jacobian $J_c^{PHUVS}(e_{\bar{g}}) \in \mathbb{R}^{9 \times 6}$. Combining this with Equation (8), the relationship between $\dot{e}_{\bar{g}}$ and $\dot{q}$ can be determined as:
$\dot{e}_{\bar{g}} = J^{PHUVS}(e_{\bar{g}}, q)\, \dot{q}, \quad J^{PHUVS}(e_{\bar{g}}, q) \in \mathbb{R}^{9 \times n_q},$
where $J^{PHUVS}(e_{\bar{g}}, q) = J_c^{PHUVS}(e_{\bar{g}}) \cdot J_r(q)$ denotes the system Jacobian matrix.
The error function e g ¯ in Equation (10) consists of nine elements, each representing a different type of geometric transformation, such as rotation, scaling, perspective, and translation. In conventional neural networks, normalization is challenging due to the uncertain range of each element. Similarly, in the CMAC network, the input quantization process becomes difficult as the elements vary significantly in their numerical scales and physical meanings, complicating the discretization into appropriate regions.
To address the above issues, we first construct a fixed-dimension, uniform-magnitude visual error function. It is important to note that our error function differs from the one presented in reference [11] and defined in Equation (10): it consists of only 10 elements, each with a consistent physical meaning (pixels) and uniform magnitude, which simplifies the input quantization process for the neural network.
Despite the improved error function, a key limitation remains: existing model-free learning approaches update network weights without explicitly enforcing geometric constraints. This omission can lead to physically inconsistent mappings, such as collinear points deviating from collinearity after projection. To address this, we integrate geometric constraints into the weight update process. Specifically, we incorporate regularization terms into the loss function to preserve geometric consistency and employ projected gradient descent to ensure physically meaningful updates.

3. Controller Development

3.1. Novel Fixed-Dimension, Uniform-Magnitude Task Error Function

Assume that the camera captures an image at a reference position. The objective is to adjust the pose of the robot to ensure the current image aligns with the reference image, effectively bringing the camera back to its reference position.
This process involves the detecting and matching of image feature points. In Figure 2, the blue points represent the matched feature point pairs: ε j , ε dj , j = 1 , 2 P , which are used to estimate the projective homography matrix G ¯ . To ensure robust estimation of this transformation, we use enough feature points across both images. Having sufficient points provides stability in the homography estimation, as it reduces the sensitivity to noise or small mismatches at individual points. Robust estimation methods, such as random sample consensus (RANSAC), help filter out outliers and produce an accurate homography matrix G ¯ .
To achieve a fixed-dimensional Jacobian, we introduce five fixed, non-collinear virtual reference feature points around the object center. The expected position of each virtual feature point in the reference image is defined as:
$\begin{bmatrix} \varepsilon_{d1}^{(aux)} \\ \varepsilon_{d2}^{(aux)} \\ \varepsilon_{d3}^{(aux)} \\ \varepsilon_{d4}^{(aux)} \\ \varepsilon_{d5}^{(aux)} \end{bmatrix} = \begin{bmatrix} \varepsilon_{dc} + [\Delta u \;\; \Delta v]^T \\ \varepsilon_{dc} + [-\Delta u \;\; \Delta v]^T \\ \varepsilon_{dc} + [\Delta u \;\; -\Delta v]^T \\ \varepsilon_{dc} + [-\Delta u \;\; -\Delta v]^T \\ \varepsilon_{dc} \end{bmatrix}$
where $\varepsilon_{dc}$ represents the center of the object in the reference image, and $\Delta u$ and $\Delta v$ are the displacements of the feature points in the $u$ and $v$ directions, respectively.
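A minimal sketch of Equation (13) is given below: the five virtual reference points are the four corners of a rectangle of half-widths Δu, Δv around the object center, plus the center itself. The exact sign pattern of the offsets is our assumption, chosen so that the collinear triples used in Section 3.2 hold by construction.

```python
import numpy as np

def virtual_reference_points(eps_dc, du, dv):
    """Five fixed virtual feature points around the object center eps_dc (2,).

    Points 1 and 4 lie on one diagonal through the center, points 2 and 3
    on the other, and point 5 is the center itself, so the collinear triples
    {1, 4, 5} and {2, 3, 5} of Section 3.2 hold by construction.
    """
    eps_dc = np.asarray(eps_dc, dtype=float)
    offsets = np.array([[ du,  dv],    # point 1
                        [-du,  dv],    # point 2
                        [ du, -dv],    # point 3
                        [-du, -dv],    # point 4
                        [0.0, 0.0]])   # point 5 (center)
    return eps_dc + offsets            # shape (5, 2)
```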
We define an error matrix as follows:
$E^i = I - \gamma_i^{aux} \bar{G},$
$\gamma_i^{aux} = \frac{1}{\bar{g}_{3,:} [(\varepsilon_{di}^{aux})^T \; 1]^T}$
The actual position of the i-th virtual feature point in the current image is denoted as ε i aux . Based on Equations (5), (6), (13) and (14), the position error can be rewritten as:
$\tilde{\varepsilon}_i^{aux} = \varepsilon_{di}^{aux} - \varepsilon_i^{aux} = E_{1:2,:}^i [(\varepsilon_{di}^{aux})^T \; 1]^T = e_i,$
where $E_{1:2,:}^i$ represents the first two rows of the matrix $E^i$.
Furthermore, e i is our designed error task function for the i-th feature. If the estimation G ¯ is precise, e i is equivalent to the feature point position error ε ˜ i aux .
We stack the error vectors of the five virtual feature points into a single task error vector:
$e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_5 \end{bmatrix} = \begin{bmatrix} E_{1:2,:}^1 [(\varepsilon_{d1}^{aux})^T \; 1]^T \\ E_{1:2,:}^2 [(\varepsilon_{d2}^{aux})^T \; 1]^T \\ \vdots \\ E_{1:2,:}^5 [(\varepsilon_{d5}^{aux})^T \; 1]^T \end{bmatrix}, \quad e \in \mathbb{R}^{10 \times 1}.$
Theorem 1.
The task error function e = 0 if and only if R = I and t = 0 , which is proven in Appendix A.
As mentioned earlier, in traditional IBVS it is essential to accurately track each image feature point throughout the entire visual servoing process. This requirement can be stringent and poses challenges, particularly in dynamic environments where feature points may become obscured. In contrast, our PHUVS-based control system can compute the projective homography matrix using a local set of feature points. RANSAC is employed to filter outliers and enhance the robustness of the system to noise. Furthermore, even if the $i$-th virtual feature point is occluded, we can still determine the scale factor $\gamma_i^{aux}$ and its corresponding error matrix based on Equations (13) and (14). This implies that the loss of certain feature points does not affect the computation of the task function.
If the projective homography matrix is estimated precisely, the vector error e i is equivalent to the position error ε ˜ i aux for the i-th auxiliary feature point. Each element of e i is expressed in the same unit (pixels), ensuring uniformity in magnitude across all components. In the subsequent sections, we use e i as an input signal for the neural network. This consistency simplifies the input quantization process, thus enhancing the effectiveness of the learning process.
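The task error of Equations (14)–(16) can be assembled directly from $\bar{G}$ and the desired virtual points, without detecting the virtual points in the current image. A minimal Python sketch is shown below; the variable names are ours.

```python
import numpy as np

def task_error(G_bar, eps_d_aux):
    """10-D task error e of Equation (16).

    G_bar     : (3, 3) normalized projective homography.
    eps_d_aux : (5, 2) desired virtual feature points in the reference image.
    """
    e = np.zeros(10)
    for i, eps_d in enumerate(eps_d_aux):
        p = np.array([eps_d[0], eps_d[1], 1.0])   # homogeneous desired point
        gamma_i = 1.0 / (G_bar[2, :] @ p)         # scale factor, Equation (14)
        E_i = np.eye(3) - gamma_i * G_bar         # error matrix, Equation (14)
        e[2 * i:2 * i + 2] = E_i[:2, :] @ p       # per-point error, Eqs. (15)-(16)
    return e  # equals the pixel errors of the virtual points if G_bar is exact
```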
We define the system state as:
$x = [e^T \;\; q^T]^T$
The mapping between the derivative of the new task function and the robot joint velocities is defined as:
$\dot{e} = J(x) \dot{q},$
where J is the Jacobian matrix of the entire system.

3.2. Collinearity Constraint

In visual servoing systems, utilizing geometric relationships between feature points enhances control performance. As shown in Figure 3, we consider two key collinearity relationships: points ε d 1 aux , ε d 4 aux and ε d 5 aux form one collinear set, while ε d 2 aux , ε d 3 aux and ε d 5 aux form another. These collinearity properties remain invariant under projective transformations, ensuring that corresponding points in the current configuration ( ε 1 aux , ε 4 aux , ε 5 aux ) and ( ε 2 aux , ε 3 aux , ε 5 aux ) maintain the same geometric relationships. The arrows in the figure represent the velocity vectors of the feature points during the servoing process. Based on these collinearity relationships, we derive velocity constraints that preserve the geometric properties throughout the motion.
For collinear points, the following relationships hold:
$\varepsilon_5^{aux} = \alpha_1 \varepsilon_1^{aux} + (1 - \alpha_1) \varepsilon_4^{aux},$
$\varepsilon_5^{aux} = \alpha_2 \varepsilon_2^{aux} + (1 - \alpha_2) \varepsilon_3^{aux},$
where the parameters $\alpha_1$ and $\alpha_2$ can be calculated as:
$\alpha_1 = \frac{(\varepsilon_5^{aux} - \varepsilon_4^{aux})^T (\varepsilon_1^{aux} - \varepsilon_4^{aux})}{\|\varepsilon_1^{aux} - \varepsilon_4^{aux}\|^2},$
$\alpha_2 = \frac{(\varepsilon_5^{aux} - \varepsilon_3^{aux})^T (\varepsilon_2^{aux} - \varepsilon_3^{aux})}{\|\varepsilon_2^{aux} - \varepsilon_3^{aux}\|^2}.$
Differentiating Equations (19) and (20) with respect to time, we obtain:
$\dot{\varepsilon}_5^{aux} = \alpha_1 \dot{\varepsilon}_1^{aux} + (1 - \alpha_1) \dot{\varepsilon}_4^{aux} + \dot{\alpha}_1 (\varepsilon_1^{aux} - \varepsilon_4^{aux}),$
$\dot{\varepsilon}_5^{aux} = \alpha_2 \dot{\varepsilon}_2^{aux} + (1 - \alpha_2) \dot{\varepsilon}_3^{aux} + \dot{\alpha}_2 (\varepsilon_2^{aux} - \varepsilon_3^{aux}).$
This velocity relationship indicates that the velocity of point 5, ε ˙ 5 aux , is a weighted combination of ( ε ˙ 1 aux , ε ˙ 4 aux ) and ( ε ˙ 2 aux , ε ˙ 3 aux ).
For a static target object, $\dot{\varepsilon}_{d1}^{aux} = \dot{\varepsilon}_{d2}^{aux} = \dot{\varepsilon}_{d3}^{aux} = \dot{\varepsilon}_{d4}^{aux} = \dot{\varepsilon}_{d5}^{aux} = 0$. Using Equation (15), Equations (23) and (24) can be rewritten as:
$\dot{e}_5 = -\alpha_1 \dot{e}_1 - (1 - \alpha_1) \dot{e}_4 + \dot{\alpha}_1 (\varepsilon_1^{aux} - \varepsilon_4^{aux}),$
$\dot{e}_5 = -\alpha_2 \dot{e}_2 - (1 - \alpha_2) \dot{e}_3 + \dot{\alpha}_2 (\varepsilon_2^{aux} - \varepsilon_3^{aux}),$
where $\dot{e}_i$, $i = 1, 2, \ldots, 5$, denotes the time derivative of the error of the $i$-th virtual feature point defined in Equation (15). To eliminate the influence of $\dot{\alpha}_1$ and $\dot{\alpha}_2$ in the last terms, we take the two-dimensional cross product of both sides with $(\varepsilon_1^{aux} - \varepsilon_4^{aux}) \in \mathbb{R}^{2 \times 1}$ and $(\varepsilon_2^{aux} - \varepsilon_3^{aux}) \in \mathbb{R}^{2 \times 1}$, respectively, yielding:
$(\varepsilon_1^{aux} - \varepsilon_4^{aux}) \times (\dot{e}_5 + \alpha_1 \dot{e}_1 + (1 - \alpha_1) \dot{e}_4) = 0,$
$(\varepsilon_2^{aux} - \varepsilon_3^{aux}) \times (\dot{e}_5 + \alpha_2 \dot{e}_2 + (1 - \alpha_2) \dot{e}_3) = 0.$
These velocity constraints are incorporated into our learning algorithm, ensuring that the neural network updates respect the underlying geometric principles of projective transformations.
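To make the constraint explicit, the sketch below evaluates the scalar 2D cross-product residuals of Equations (27) and (28) for a predicted error increment; with exact increments both residuals vanish. The function and variable names are ours.

```python
import numpy as np

def cross2(a, b):
    """Scalar 2D cross product a_x * b_y - a_y * b_x."""
    return a[0] * b[1] - a[1] * b[0]

def collinearity_residuals(eps_aux, de):
    """Residuals h1, h2 of the collinearity constraints.

    eps_aux : (5, 2) current virtual feature points (rows 0..4 = points 1..5).
    de      : (5, 2) predicted per-point error increments.
    """
    # Interpolation weights alpha_1, alpha_2 (Equations (21) and (22)).
    d14 = eps_aux[0] - eps_aux[3]
    d23 = eps_aux[1] - eps_aux[2]
    a1 = (eps_aux[4] - eps_aux[3]) @ d14 / (d14 @ d14)
    a2 = (eps_aux[4] - eps_aux[2]) @ d23 / (d23 @ d23)

    # Constraint residuals (Equations (27) and (28); cf. (39) and (40)).
    h1 = cross2(d14, de[4] + a1 * de[0] + (1 - a1) * de[3])
    h2 = cross2(d23, de[4] + a2 * de[1] + (1 - a2) * de[2])
    return h1, h2
```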

3.3. Model-Free Jacobian Learning

Traditional methods typically rely on precise mathematical models; however, in practical applications, accurate modeling is often difficult to achieve due to system parameter uncertainties and environmental disturbances. This section proposes the use of a CMAC neural network to learn and approximate the system’s Jacobian matrix, enabling more robust and adaptive control strategies. We first introduce the CMAC network structure and then explain the weight learning process. Notably, our weight update approach incorporates geometric constraints of the visual features to accelerate the learning process.

3.3.1. CMAC Model

The CMAC network is divided into five layers: input layer, association memory space, receptive-field space, weight memory space and output layer, as shown in Figure 4.
  • Input space:
    In this research, we use the state vector x as the input and employ a neural network to learn the unknown Jacobian matrix J .
    The neural network input dimension is 16 × 1 ( e R 10 × 1 and the robot has six joints: q R 6 × 1 ). The input vector needs to be quantized into discrete regions according to its corresponding range:
    $I_i = \left\lfloor \frac{x_i - x_i^{min}}{x_i^{max} - x_i^{min}} \cdot I_m \right\rfloor,$
    where $I_m$ represents the maximum index value after quantization, and $\lfloor \cdot \rfloor$ denotes the floor function, ensuring that $I_i$ is an integer. $x_i^{min}$ and $x_i^{max}$ represent the minimum and maximum values of the $i$-th element of the state vector $x$. Figure 5 illustrates a CMAC neural network with two-dimensional input, where each input is discretized into 10 regions ($I_m = 10$).
  • Association memory space:
    Each input dimension is uniformly partitioned into several discrete regions. Each complete block includes three adjacent regions. In Figure 5, input x 1 and x 2 have their regions grouped into blocks (a, b, c, d) and (A, B, C, D), respectively.
    The network creates multiple layers by shifting the block boundaries. For example, layer 2 contains blocks d, e, f for $x_1$ and blocks D, E, F for $x_2$, formed by shifting the original blocks by one region. Layer 3 follows the same pattern with an additional shift. Each block is represented by a Gaussian membership function:
    $\psi_{ik} = \exp\left(-\frac{(I_i - m_{ik})^2}{\sigma_{ik}^2}\right), \quad k \in \{1, \ldots, n_b\},$
    where k denotes the k-th block and n b represents the total number of blocks for input dimension I i . The parameters m i k and σ i k represent the center and width of the Gaussian function for the k-th block, respectively.
  • Receptive-field space:
    The receptive field refers to the region in the input space that can activate specific neurons or memory cells in the network. A key feature of the CMAC network is the use of overlapping receptive fields. This means that a single input can simultaneously activate multiple memory cells, creating a distributed representation. The degree of overlap affects the network’s smoothness and generalization capability; highly overlapping receptive fields produce smoother function approximations, while less overlap produces more localized responses. As shown in Figure 5, point p in space falls within blocks aB, fD, and Ee, activating the corresponding memory cells. The multidimensional receptive-field function is defined as
    $b_k = \prod_{i=1}^{n} \psi_{ik} = \exp\left(-\sum_{i=1}^{n} \frac{(I_i - m_{ik})^2}{\sigma_{ik}^2}\right), \quad k = 1, 2, \ldots, n_l,$
    where n l represents the number of layers in the receptive field space.
  • Weight memory space and output layer:
    Each receptive field space is connected to the weighted memory W. The i-th output of the neural network is:
    $y_i = \sum_{k=1}^{n_l} w_{ik} b_k = w_i^T b,$
    where $w_i = [w_{i1} \; w_{i2} \; \cdots \; w_{in_l}]^T$ and $b = [b_1 \; b_2 \; \cdots \; b_{n_l}]^T$. A compact sketch of this forward pass is given after this list.
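The sketch below implements the forward pass through the layers just described (quantization, Gaussian receptive fields, weighted output, Equations (29)–(32)). The block layout and hyperparameters are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

class CMAC:
    """Minimal CMAC: quantize the input, evaluate n_l Gaussian receptive
    fields, and return the weighted sum of Equation (32) for each output."""

    def __init__(self, x_min, x_max, centers, widths, n_out, I_m=10, seed=0):
        self.x_min, self.x_max, self.I_m = np.asarray(x_min), np.asarray(x_max), I_m
        self.centers = np.asarray(centers)   # (n_l, n_in) receptive-field centers
        self.widths = np.asarray(widths)     # (n_l, n_in) receptive-field widths
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_out, self.centers.shape[0]))

    def quantize(self, x):
        # Equation (29): map each input onto integer indices 0..I_m.
        ratio = (x - self.x_min) / (self.x_max - self.x_min)
        return np.floor(np.clip(ratio, 0.0, 1.0) * self.I_m)

    def receptive_fields(self, x):
        # Equations (30)-(31): product of per-dimension Gaussian memberships.
        I = self.quantize(x)
        return np.exp(-np.sum((I - self.centers) ** 2 / self.widths ** 2, axis=1))

    def forward(self, x):
        # Equation (32): y_i = w_i^T b.
        b = self.receptive_fields(x)
        return self.W @ b, b
```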

3.3.2. Weight Update of Neural Network

The discrete form of Equation (18) can be written as:
$\Delta e(t_i) = J(t_i) \Delta q(t_i),$
where the subscript i represents the i-th time step. Δ q R n q × 1 represents the joint increment over the sampling interval. Next, the state vector x is used as input to approximate the Jacobian matrix using the CMAC network.
$J = \begin{bmatrix} w_{1,1}^T b(x(t_i)) & w_{1,2}^T b(x(t_i)) & \cdots & w_{1,n_q}^T b(x(t_i)) \\ w_{2,1}^T b(x(t_i)) & w_{2,2}^T b(x(t_i)) & \cdots & w_{2,n_q}^T b(x(t_i)) \\ \vdots & \vdots & \ddots & \vdots \\ w_{10,1}^T b(x(t_i)) & w_{10,2}^T b(x(t_i)) & \cdots & w_{10,n_q}^T b(x(t_i)) \end{bmatrix} + \sigma_J,$
where the $w_{k,m} \in \mathbb{R}^{n_l \times 1}$ are constant ideal network weight vectors used to approximate the element $j_{km}$ in the $k$-th row and $m$-th column of the Jacobian matrix, and $b$ is the vector of receptive-field functions. The approximation error of the neural network is denoted by $\sigma_J$.
It is important to note that the estimated incremental task error
$\Delta e = [\Delta e_1^T \;\; \Delta e_2^T \;\; \cdots \;\; \Delta e_5^T]^T$
consists of five 2D vectors, where each Δ e i R 2 × 1 . To distinguish between these components, we use superscript notation to denote the k-th element of Δ e : Δ e k . Thus, we have:
$\Delta e^k(t_i) = \sum_{m=1}^{n_q} j_{km}(t_i) \Delta q_m(t_i) + \sigma_k = \Phi^T w_k + \sigma_k,$
where $w_k = [w_{k,1}^T \; w_{k,2}^T \; \cdots \; w_{k,n_q}^T]^T$, $\sigma_k$ is the approximation error of the neural network, $\Phi^T = [b^T(x(t_i)) \Delta q_1 \;\; b^T(x(t_i)) \Delta q_2 \;\; \cdots \;\; b^T(x(t_i)) \Delta q_{n_q}]$, and $\Delta q_m$ represents the $m$-th element of $\Delta q$.
Δ e is further expressed as shown below:
$\Delta e(t_i) = \sum_{m=1}^{n_q} j_m(t_i) \Delta q_m(t_i) = \Omega^T w + \sigma,$
where $j_m$ represents the $m$-th column of $J$, $\sigma = [\sigma_1 \; \sigma_2 \; \cdots \; \sigma_{10}]^T$ is the approximation error term introduced by the neural network model, $\Omega^T = \begin{bmatrix} \Phi^T & 0 & \cdots & 0 \\ 0 & \Phi^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi^T \end{bmatrix}$, and $w = [w_1^T \; w_2^T \; \cdots \; w_{10}^T]^T$.
Assumption 1.
The approximation error σ is assumed to be upper bounded:
$\|\sigma(t)\| \le \sigma_M,$
where σ M is a finite positive constant. In addition, under a properly designed neural network architecture, σ can be made sufficiently small.
However, $w$ is unknown. Using the estimated weights, the element in the $k$-th row and $m$-th column of the estimated Jacobian matrix is given by $\hat{j}_{km}(t_i) = \hat{w}_{k,m}^T(t_i)\, b(x(t_i))$, where $\hat{w}_{k,m}$ is the estimated value of the ideal weight $w_{k,m}$.
The estimated value of Δ e k is:
$\Delta \hat{e}^k(t_i) = \Phi^T \hat{w}_k.$
The estimated incremental change in task error is:
$\Delta \hat{e}(t_i) = \hat{J}(t_i) \Delta q(t_i) = \Omega^T \hat{w}.$
To incorporate geometric constraints into the learning process, we leverage the velocity relationships between collinear feature points defined in Equations (27) and (28). The Jacobian matrix estimated by our neural network must satisfy the following constraints:
$h_1 = f_1(\Delta \hat{e}) = (\varepsilon_1^{aux} - \varepsilon_4^{aux}) \times (\Delta \hat{e}_5 + \alpha_1 \Delta \hat{e}_1 + (1 - \alpha_1) \Delta \hat{e}_4),$
$h_2 = f_2(\Delta \hat{e}) = (\varepsilon_2^{aux} - \varepsilon_3^{aux}) \times (\Delta \hat{e}_5 + \alpha_2 \Delta \hat{e}_2 + (1 - \alpha_2) \Delta \hat{e}_3).$
We then define a loss function that incorporates both prediction accuracy and geometric constraints:
$\min L = \min \left[ \frac{1}{2} (\Delta e - \Delta \hat{e})^T (\Delta e - \Delta \hat{e}) + \frac{\lambda}{2} \sum_{j=1}^{2} h_j^2 \right].$
Then, the weights of the neural network are updated by:
$\hat{w}(t_{i+1}) = \hat{w}(t_i) - \beta \frac{\partial L}{\partial \hat{w}} = \hat{w}(t_i) + \beta \left[ \Omega(t_i) \Delta \tilde{e}(t_i) - \lambda \sum_{j=1}^{2} h_j \frac{\partial h_j}{\partial \hat{w}} \right],$
where $\beta$ represents the learning rate of the neural network weight update, and $\Delta \tilde{e} = \Delta e - \Delta \hat{e}$ is the estimation error.
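The following self-contained sketch puts Equations (33)–(42) together as one training step: the CMAC activation b(x) is scaled by each joint increment to form Φ and Ω, the prediction error and the collinearity residuals enter the loss of Equation (41), and the gradient step of Equation (42) is applied. The weight layout (rows w_1 … w_10 stacked into one vector) and the gains β, λ are illustrative assumptions.

```python
import numpy as np

def constrained_cmac_update(w_hat, b, dq, de, eps_aux, beta=1e-3, lam=1e-3):
    """One geometry-constrained weight update (Equations (33)-(42)).

    w_hat   : (10 * n_q * n_l,) stacked CMAC weights [w_1; ...; w_10].
    b       : (n_l,) receptive-field activations b(x(t_i)).
    dq      : (n_q,) joint increment over the sampling interval.
    de      : (10,) measured task-error increment.
    eps_aux : (5, 2) current virtual feature points.
    """
    Phi = np.concatenate([b * dq_m for dq_m in dq])   # (n_q * n_l,)
    Omega_T = np.kron(np.eye(10), Phi)                # block-diagonal Omega^T

    de_hat = Omega_T @ w_hat                          # Equation (38)
    de_tilde = de - de_hat                            # prediction error

    # Collinearity residuals h_j and their gradients w.r.t. de_hat.
    d14 = eps_aux[0] - eps_aux[3]
    d23 = eps_aux[1] - eps_aux[2]
    a1 = (eps_aux[4] - eps_aux[3]) @ d14 / (d14 @ d14)
    a2 = (eps_aux[4] - eps_aux[2]) @ d23 / (d23 @ d23)
    cross = lambda d, a: d[0] * a[1] - d[1] * a[0]
    pt = lambda k: de_hat[2 * k:2 * k + 2]
    h1 = cross(d14, pt(4) + a1 * pt(0) + (1 - a1) * pt(3))
    h2 = cross(d23, pt(4) + a2 * pt(1) + (1 - a2) * pt(2))

    g1, g2 = np.zeros(10), np.zeros(10)
    for k, c in ((0, a1), (3, 1 - a1), (4, 1.0)):
        g1[2 * k:2 * k + 2] = c * np.array([-d14[1], d14[0]])
    for k, c in ((1, a2), (2, 1 - a2), (4, 1.0)):
        g2[2 * k:2 * k + 2] = c * np.array([-d23[1], d23[0]])

    # dh_j/dw_hat = Omega * (dh_j/dde_hat); gradient step of Equation (42).
    grad_constraint = lam * (h1 * (Omega_T.T @ g1) + h2 * (Omega_T.T @ g2))
    w_hat = w_hat + beta * (Omega_T.T @ de_tilde - grad_constraint)
    return w_hat, de_hat
```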

3.4. Controller Design

The control law of the robot is defined as:
$u = -\eta \hat{J}^+(x, \hat{w})\, e,$
where η is a positive value.
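In code, the control law of Equation (43) reduces to a pseudoinverse of the estimated Jacobian applied to the task error; the sign convention follows the error propagation in Section 4.3, and the small damping term below is our own addition for numerical safety near singular configurations, not something prescribed by the paper.

```python
import numpy as np

def control_law(J_hat, e, eta=0.5, damping=1e-6):
    """Joint-velocity command u = -eta * J_hat^+ e (Equation (43)).

    J_hat : (10, n_q) Jacobian estimated by the CMAC network.
    e     : (10,) task error of Equation (16).
    """
    # Damped least-squares pseudoinverse: (J^T J + mu * I)^-1 J^T.
    n_q = J_hat.shape[1]
    J_pinv = np.linalg.solve(J_hat.T @ J_hat + damping * np.eye(n_q), J_hat.T)
    return -eta * J_pinv @ e
```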
Figure 6 illustrates the control flow diagram. The process begins with image acquisition and feature point detection. Matched feature pairs ε i , ε d i , i = 1 , 2 , , P are utilized to estimate the projective homography matrix G ¯ using the RANSAC algorithm. Based on G ¯ , the task error vector e and a set of virtual feature points ε i ( aux ) are computed according to Equations (12)–(16).
Next, the estimated system state vector, which includes the task error e and the joint position q , is input to the CMAC neural network. The network predicts the task error increment Δ e ^ and estimates the Jacobian matrix. Importantly, the geometry constraint module is activated after this estimation phase. It incorporates Δ e ^ and the virtual feature points to compute constraint terms (see Equations (39) and (40)), which are then embedded into the neural network’s weight update rule (Equation (42)) to preserve geometric consistency, such as point collinearity under projective transformations.
Finally, the joint controller utilizes the estimated Jacobian and task error to compute the control command u , as described in Equation (43), which is then sent to the robot actuators. The entire procedure operates in a closed-loop fashion, with the visual feedback continually guiding subsequent control iterations.

4. Stability Analysis

In this section, we establish the stability of the proposed geometry-constrained learning-based visual servoing controller.

4.1. Geometric Constraint Relationships

Let us define the weight estimation error as $\tilde{w} = w - \hat{w}$, where $w$ represents the ideal weights. According to Equations (35) and (37), we have:
$\Delta \tilde{e}^k = \Delta e^k - \Delta \hat{e}^k = \Phi^T \tilde{w}_k + \sigma_k.$
Similarly, from Equations (36) and (38), we can derive:
$\Delta \tilde{e} = \Delta e - \Delta \hat{e} = \Omega^T \tilde{w} + \sigma.$
For the actual error increment Δ e , the geometric constraints must be satisfied, i.e., h i ( Δ e ) = 0 . We establish the relationship between these constraint equations and the estimation error of the task error increment:
$h_1(\Delta \hat{e}) - h_1(\Delta e) = h_1 = (\varepsilon_1^{(aux)} - \varepsilon_4^{(aux)}) \times \left( \Delta e_5 + \alpha_1 \Delta e_1 + (1 - \alpha_1) \Delta e_4 \right) - (\varepsilon_1^{(aux)} - \varepsilon_4^{(aux)}) \times \left( \Delta \hat{e}_5 + \alpha_1 \Delta \hat{e}_1 + (1 - \alpha_1) \Delta \hat{e}_4 \right) = (\varepsilon_1^{(aux)} - \varepsilon_4^{(aux)}) \times \left( \Delta \tilde{e}_5 + \alpha_1 \Delta \tilde{e}_1 + (1 - \alpha_1) \Delta \tilde{e}_4 \right)$
Similarly, for the second constraint:
$h_2 = (\varepsilon_2^{(aux)} - \varepsilon_3^{(aux)}) \times \left( \Delta \tilde{e}_5 + \alpha_2 \Delta \tilde{e}_2 + (1 - \alpha_2) \Delta \tilde{e}_3 \right)$
It should be noted that $\Delta \tilde{e}_i = [\Delta \tilde{e}^{2i-1} \;\; \Delta \tilde{e}^{2i}]^T$ is a two-dimensional vector containing the $(2i-1)$-th and $2i$-th elements of $\Delta \tilde{e}$. The constraints $h_1$ and $h_2$ can be reformulated as:
$\begin{aligned} h_1 &= \mu_u \left( \Delta \tilde{e}^9 + \alpha_1 \Delta \tilde{e}^1 + (1 - \alpha_1) \Delta \tilde{e}^7 \right) - \mu_v \left( \Delta \tilde{e}^{10} + \alpha_1 \Delta \tilde{e}^2 + (1 - \alpha_1) \Delta \tilde{e}^8 \right) \\ &= \mu_u \left( \Phi^T \tilde{w}_9 + \sigma_9 + \alpha_1 \Phi^T \tilde{w}_1 + \alpha_1 \sigma_1 + (1 - \alpha_1) \Phi^T \tilde{w}_7 + (1 - \alpha_1) \sigma_7 \right) - \mu_v \left( \Phi^T \tilde{w}_{10} + \sigma_{10} + \alpha_1 \Phi^T \tilde{w}_2 + \alpha_1 \sigma_2 + (1 - \alpha_1) \Phi^T \tilde{w}_8 + (1 - \alpha_1) \sigma_8 \right) \\ &= n_1^T \tilde{w} + \mu_u \left( \sigma_9 + \alpha_1 \sigma_1 + (1 - \alpha_1) \sigma_7 \right) - \mu_v \left( \sigma_{10} + \alpha_1 \sigma_2 + (1 - \alpha_1) \sigma_8 \right) = n_1^T \tilde{w} + \sigma_{h_1} \end{aligned}$
$\begin{aligned} h_2 &= \kappa_u \left( \Delta \tilde{e}^9 + \alpha_2 \Delta \tilde{e}^3 + (1 - \alpha_2) \Delta \tilde{e}^5 \right) - \kappa_v \left( \Delta \tilde{e}^{10} + \alpha_2 \Delta \tilde{e}^4 + (1 - \alpha_2) \Delta \tilde{e}^6 \right) \\ &= \kappa_u \left( \Phi^T \tilde{w}_9 + \sigma_9 + \alpha_2 \Phi^T \tilde{w}_3 + \alpha_2 \sigma_3 + (1 - \alpha_2) \Phi^T \tilde{w}_5 + (1 - \alpha_2) \sigma_5 \right) - \kappa_v \left( \Phi^T \tilde{w}_{10} + \sigma_{10} + \alpha_2 \Phi^T \tilde{w}_4 + \alpha_2 \sigma_4 + (1 - \alpha_2) \Phi^T \tilde{w}_6 + (1 - \alpha_2) \sigma_6 \right) \\ &= n_2^T \tilde{w} + \kappa_u \left( \sigma_9 + \alpha_2 \sigma_3 + (1 - \alpha_2) \sigma_5 \right) - \kappa_v \left( \sigma_{10} + \alpha_2 \sigma_4 + (1 - \alpha_2) \sigma_6 \right) = n_2^T \tilde{w} + \sigma_{h_2} \end{aligned}$
where $\sigma_{h_1}$ and $\sigma_{h_2}$ represent the combined approximation errors in the constraint equations, $\mu_u$ and $\mu_v$ are the first and second elements of $(\varepsilon_1^{(aux)} - \varepsilon_4^{(aux)})$, respectively, and $\kappa_u$ and $\kappa_v$ are the first and second elements of $(\varepsilon_2^{(aux)} - \varepsilon_3^{(aux)})$.
The vectors n 1 and n 2 are defined as:
$n_1^T = \left[ \mu_u \alpha_1 \Phi^T \;\; -\mu_v \alpha_1 \Phi^T \;\; 0 \;\; 0 \;\; 0 \;\; 0 \;\; \mu_u (1 - \alpha_1) \Phi^T \;\; -\mu_v (1 - \alpha_1) \Phi^T \;\; \mu_u \Phi^T \;\; -\mu_v \Phi^T \right],$
$n_2^T = \left[ 0 \;\; 0 \;\; \kappa_u \alpha_2 \Phi^T \;\; -\kappa_v \alpha_2 \Phi^T \;\; \kappa_u (1 - \alpha_2) \Phi^T \;\; -\kappa_v (1 - \alpha_2) \Phi^T \;\; 0 \;\; 0 \;\; \kappa_u \Phi^T \;\; -\kappa_v \Phi^T \right].$
The geometric constraints contribute to the neural network weight update rule as shown in Equation (42). The second term in this equation can be reformulated as:
$\sum_{j=1}^{2} h_j \frac{\partial h_j}{\partial \hat{w}} = h_1 \frac{\partial h_1}{\partial \tilde{w}} \frac{\partial \tilde{w}}{\partial \hat{w}} + h_2 \frac{\partial h_2}{\partial \tilde{w}} \frac{\partial \tilde{w}}{\partial \hat{w}} = (n_1^T \tilde{w} + \sigma_{h_1})(-n_1) + (n_2^T \tilde{w} + \sigma_{h_2})(-n_2) = -n_1 n_1^T \tilde{w} - \sigma_{h_1} n_1 - n_2 n_2^T \tilde{w} - \sigma_{h_2} n_2 = -N \tilde{w} - \sigma_N,$
where N = n 1 n 1 T + n 2 n 2 T and σ N = σ h 1 n 1 + σ h 2 n 2 represent the combined approximation error effect.
The weight error dynamics can thus be expressed as:
$\begin{aligned} \tilde{w}(t_{i+1}) &= \tilde{w}(t_i) - \beta \left[ \Omega(t_i) \Delta \tilde{e}(t_i) - \lambda \sum_{j=1}^{2} h_j \frac{\partial h_j}{\partial \hat{w}} \right] = \tilde{w}(t_i) - \beta \left[ \Omega(t_i)(\Omega^T \tilde{w} + \sigma) + \lambda N \tilde{w} + \lambda \sigma_N \right] \\ &= \tilde{w}(t_i) - \beta \Omega(t_i) \Omega^T \tilde{w} - \beta \Omega(t_i) \sigma - \beta \lambda N \tilde{w} - \beta \lambda \sigma_N = (I - \beta M - \beta \lambda N) \tilde{w}(t_i) - \beta \Omega(t_i) \sigma - \beta \lambda \sigma_N, \end{aligned}$
where M = Ω ( t i ) Ω T .

4.2. Lyapunov Stability Analysis

To analyze the stability of the learning process, we define a positive definite Lyapunov function candidate:
$V(t_i) = \frac{1}{2} \tilde{w}^T(t_i) \tilde{w}(t_i).$
The change in this Lyapunov function between consecutive time steps is:
$\Delta V = V(t_{i+1}) - V(t_i) = \frac{1}{2} \tilde{w}^T(t_{i+1}) \tilde{w}(t_{i+1}) - \frac{1}{2} \tilde{w}^T(t_i) \tilde{w}(t_i).$
Substituting the weight error dynamics:
$\Delta V = \frac{1}{2} \left[ (I - \beta M - \beta \lambda N) \tilde{w}(t_i) - \beta \Omega(t_i) \sigma - \beta \lambda \sigma_N \right]^T \left[ (I - \beta M - \beta \lambda N) \tilde{w}(t_i) - \beta \Omega(t_i) \sigma - \beta \lambda \sigma_N \right] - \frac{1}{2} \tilde{w}^T(t_i) \tilde{w}(t_i),$
where M = Ω ( t i ) Ω T ( t i ) and N = n 1 n 1 T + n 2 n 2 T are both positive semi-definite matrices.
Equation (55) can be simplified as:
$\begin{aligned} \Delta V &= -\beta \tilde{w}^T(t_i) (M + \lambda N) \tilde{w} - \beta \tilde{w}^T(t_i) \left( \Omega(t_i) \sigma + \lambda \sigma_N \right) + \frac{1}{2} \beta^2 \tilde{w}^T(t_i) (M + \lambda N)^2 \tilde{w} + \beta^2 \tilde{w}^T(t_i) (M + \lambda N) \left( \Omega(t_i) \sigma + \lambda \sigma_N \right) + \frac{1}{2} \beta^2 \left\| \Omega(t_i) \sigma + \lambda \sigma_N \right\|^2 \\ &= -\beta \tilde{w}^T(t_i) \left[ (M + \lambda N) - \frac{\beta}{2} (M + \lambda N)^2 \right] \tilde{w} - \beta \tilde{w}^T(t_i) \left[ I - \beta (M + \lambda N) \right] \left( \Omega(t_i) \sigma + \lambda \sigma_N \right) + \frac{1}{2} \beta^2 \left\| \Omega(t_i) \sigma + \lambda \sigma_N \right\|^2. \end{aligned}$
Let us define $P = (M + \lambda N) - \frac{\beta}{2} (M + \lambda N)^2$ and $Q = I - \beta (M + \lambda N)$. Then:
$\Delta V = -\beta \tilde{w}^T(t_i) P \tilde{w} - \beta \tilde{w}^T(t_i) Q \left( \Omega(t_i) \sigma + \lambda \sigma_N \right) + \frac{1}{2} \beta^2 \left\| \Omega(t_i) \sigma + \lambda \sigma_N \right\|^2.$
Based on Assumption 1, and noting that both the joint increment Δ q and the activation of the receptive field b ( x ) are bounded, it follows that Ω ( t i ) is also bounded. Consequently, there exists a small positive constant ϵ > 0 such that:
$\left\| \Omega(t_i) \sigma + \lambda \sigma_N \right\| \le \epsilon, \quad \forall t_i.$
To rigorously ensure stability, we assume the following Persistent Excitation (PE) condition:
Assumption 2 (Persistent Excitation).
There exist a positive constant $\gamma_{\min} > 0$ and a finite time interval $T > 0$ such that, for all $t$:
$\frac{1}{T} \int_t^{t+T} \Omega(\tau) \Omega^T(\tau)\, d\tau \ge \gamma_{\min} I.$
Under this assumption, define the average matrix as:
$M_{avg} = \frac{1}{T} \int_t^{t+T} M(\tau)\, d\tau \ge \gamma_{\min} I.$
Hence, the averaged matrix M a v g + λ N is strictly positive definite, ensuring:
$x^T (M_{avg} + \lambda N) x \ge \gamma_{\min} \| x \|^2, \quad \forall x \ne 0.$
Selecting the learning rate β to be sufficiently small:
$0 < \beta < \frac{2}{\lambda_{\max}(M_{avg} + \lambda N)},$
the Lyapunov function increment becomes:
$\Delta V \le -\beta \gamma_{\min} \| \tilde{w}(t_i) \|^2 + \beta \| \tilde{w}(t_i) \| \epsilon + \frac{1}{2} \beta^2 \epsilon^2.$
Therefore, the weight estimation error w ˜ ( t i ) is uniformly ultimately bounded, and the system error converges to a small neighborhood of the origin, ensuring robust convergence and stability under the persistent excitation condition.

4.3. Convergence to Desired Pose

In practice, the estimated Jacobian J ^ differs from the true Jacobian J due to learning or modeling errors. We define the Jacobian estimation error as:
$\tilde{J}(t_k) = J(t_k) - \hat{J}(t_k).$
Assume the control update is given by $u(t_k) = -\eta \hat{J}^+(t_k) e(t_k)$. Then, the error propagation becomes:
$e(t_{k+1}) = e(t_k) + J(t_k) u(t_k) = e(t_k) - \eta J(t_k) \hat{J}^+(t_k) e(t_k) = \left[ I - \eta S(t_k) \right] e(t_k),$
where we define the matrix $S = J \hat{J}^+ = (\hat{J} + \tilde{J}) \hat{J}^+ = I + \delta_s$, with $\delta_s = \tilde{J} \hat{J}^+$.
Given that the neural network estimation error has been proven to converge, and with an appropriate neural network structure design along with sufficient pre-training data to obtain quality initial weights, it is reasonable to assume that δ s is bounded. There exists a constant σ Δ s such that:
$\| \delta_s \|_2 \le \sigma_{\Delta s}.$
Define the Lyapunov function:
$V(t_k) = \frac{1}{2} \| e(t_k) \|^2.$
The difference is:
$V(t_{k+1}) - V(t_k) = \frac{1}{2} \| e(t_{k+1}) \|^2 - \frac{1}{2} \| e(t_k) \|^2 = \frac{1}{2} \left\| \left[ I - \eta S(t_k) \right] e(t_k) \right\|^2 - \frac{1}{2} \| e(t_k) \|^2 = -\eta e^T(t_k) S(t_k) e(t_k) + \frac{\eta^2}{2} \| S(t_k) e(t_k) \|^2.$
To ensure convergence (i.e., $V(t_{k+1}) - V(t_k) < 0$), we require that $\eta e^T S e > \frac{\eta^2}{2} \| S e \|^2$. This inequality holds for all $e \ne 0$ if the matrix $D = S + \frac{\eta}{2} S^T S$ is positive definite. We now derive a conservative bound on $\eta$ to guarantee the positive definiteness of $D$.
For any unit vector $x \in \mathbb{R}^n$, we have:
$x^T S x = x^T (I + \delta_s) x = 1 + x^T \delta_s x \ge 1 - \sigma_{\Delta s},$
$\lambda_{\min}(S^T S) = \lambda_{\min}^2(S) \ge (1 - \sigma_{\Delta s})^2.$
Hence:
$x^T D x = x^T S x + \frac{\eta}{2} x^T S^T S x \ge (1 - \sigma_{\Delta s}) + \frac{\eta}{2} (1 - \sigma_{\Delta s})^2.$
To ensure $x^T D x > 0$, we require:
$(1 - \sigma_{\Delta s}) + \frac{\eta}{2} (1 - \sigma_{\Delta s})^2 > 0 \;\; \Longleftrightarrow \;\; \eta > \frac{-2 (1 - \sigma_{\Delta s})}{(1 - \sigma_{\Delta s})^2}.$
When the Jacobian estimation error is relatively small ( σ Δ s < 1 ), the system can maintain stability across a wide range of control gains. When the error becomes too large ( σ Δ s > 1 ), a larger control gain is required to maintain stability, but this may lead to other issues (such as excessive oscillation).
Combined with Theorem 1, which shows that e = 0 if and only if R = I and t = 0 , it follows that the camera pose converges to the desired pose. The proposed controller thus ensures robust convergence despite approximation errors in the neural network.

5. Simulation and Experiment Results

5.1. Pre-Training Process

To ensure efficient initialization and promote global convergence during the visual servoing task, a pre-training phase is conducted for the neural network. In this phase, the robot performs small randomized motions around its initial configuration. During these exploratory movements, the system collects training data consisting of system states x , the corresponding joint displacements, and the incremental task errors.
The collected system state data are denoted as $X = \{x(t_0), x(t_1), \ldots, x(t_{n_s})\}$, where each sample may activate a distinct subset of local receptive fields within the neural network. The union of all receptive fields activated by the sample set $X$ is represented by $b_s(X) \in \mathbb{R}^{n_l \times 1}$, where $n_l$ denotes the total number of receptive fields. Each entry in $b_s(X)$ indicates whether the corresponding receptive field is activated by any sample in the dataset.
The activation matrix A ( X ) is defined as:
$A(X) = \begin{bmatrix} b_s^T(x(t_0)) \Delta q_1(t_0) & \cdots & b_s^T(x(t_0)) \Delta q_{n_q}(t_0) \\ b_s^T(x(t_1)) \Delta q_1(t_1) & \cdots & b_s^T(x(t_1)) \Delta q_{n_q}(t_1) \\ \vdots & \ddots & \vdots \\ b_s^T(x(t_{n_s})) \Delta q_1(t_{n_s}) & \cdots & b_s^T(x(t_{n_s})) \Delta q_{n_q}(t_{n_s}) \end{bmatrix}, \quad A \in \mathbb{R}^{n_s \times (n_q n_l)},$
where each row of A corresponds to a training sample, and each column corresponds to the influence of a specific receptive field on a particular joint.
The k-th element of incremental task error Δ e is expressed as:
$\Delta e_{k,\mathrm{sample}} = A(X)\, w_k,$
where w k R n q n l × 1 denotes the weight vector of the neural network associated with the k-th row of the estimated Jacobian matrix. The vector Δ e k , sample R n s × 1 contains the incremental task errors for all samples in the dataset, and is defined as:
$\Delta e_{k,\mathrm{sample}} = \left[ \Delta e^k(t_0), \Delta e^k(t_1), \ldots, \Delta e^k(t_{n_s}) \right]^T.$
The weights of the neural network are initially estimated by solving the linear regression problem:
$w_k = (A^T A)^{-1} A^T \Delta e_{k,\mathrm{sample}}, \quad k = 1, 2, \ldots, 10.$
This pre-training strategy effectively initializes the network weights, thus increasing the chances of achieving global convergence.
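A minimal sketch of this pre-training step is given below: the activation matrix A(X) is assembled from the recorded activations and joint increments, and each weight block w_k is initialized by least squares as described above. Using a pseudoinverse (rather than inverting A^T A directly) is our own implementation choice, since A^T A may be rank-deficient when only a few receptive fields are excited during the exploratory motion.

```python
import numpy as np

def pretrain_weights(B, dQ, dE):
    """Least-squares initialization of the CMAC weights from exploratory motion.

    B  : (n_s, n_l)  receptive-field activations b_s(x(t_j)) for each sample.
    dQ : (n_s, n_q)  joint increments recorded during the random motions.
    dE : (n_s, 10)   measured task-error increments.
    Returns W0 of shape (10, n_q * n_l), one row per task-error component.
    """
    n_s, n_l = B.shape
    n_q = dQ.shape[1]
    # Row j of A is [b_s(x_j)^T dq_1(t_j), ..., b_s(x_j)^T dq_nq(t_j)].
    A = (dQ[:, :, None] * B[:, None, :]).reshape(n_s, n_q * n_l)
    # w_k = pinv(A) @ delta_e_{k,sample} for k = 1..10.
    A_pinv = np.linalg.pinv(A)
    return (A_pinv @ dE).T          # (10, n_q * n_l)
```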

5.2. Simulations

5.2.1. Performance Analysis

To validate our algorithm, we conducted simulations using a 6-DOF Universal Robot with a camera mounted on its end effector. Recognizing that unsuitable initial weights in the neural network could cause the control error to settle on a local minimum, we implemented pre-training on the neural network.
The initial and final camera positions, as well as the camera’s target position, are shown in Figure 7a. It can be observed that the final camera position coincides with the reference position. Figure 7b displays the motion trajectories of 25 feature points, with “*” denoting their initial positions and hollow circles “O” indicating their target positions. Solid dots “.” represent the trajectories of the feature points, illustrating their gradual convergence towards the reference positions. Furthermore, in Figure 8a, the defined error vector e , as described in Equation (16), also converges to zero. This convergence indicates that as e converges, the image feature error also converges, and the camera’s pose aligns with the reference pose (R = I, t = 0).
λ is a critical parameter controlling the strength of geometric constraints in our model. As shown in Figure 8a,c, selecting an appropriate value of λ significantly influences both the convergence rate and stability of the neural network training process. Figure 8b illustrates the trajectory of feature point centers in image space, while Figure 8c demonstrates the evolution of feature point position errors over time.
When λ = 0 , the network exhibits oscillatory behavior with slower convergence, as evidenced by the fluctuations in Root Mean Square Error (RMSE) shown in Figure 8c. Introducing moderate constraint penalties ( λ = 0.001 to λ = 0.003 ) effectively dampens these oscillations and accelerates convergence, leading to smoother trajectories in image space as depicted in Figure 8b. However, we observe that when λ becomes too large ( λ = 0.006 ), the system exhibits different limitations. While initial error reduction is rapid, when feature point errors become small, the convergence rate significantly decreases, and the system struggles to minimize the residual errors as shown in the latter part of Figure 8c. This trade-off suggests that intermediate values (around λ = 0.003 ) provide the optimal balance between fast convergence, stability, and final accuracy.

5.2.2. Comparisons with Other Studies

Currently, model-free visual servoing systems can be categorized into two approaches. The first category encompasses neural network-based methods [17,18,19,21], which typically use feature point positions as inputs/outputs, resulting in high-dimensional networks when processing numerous feature points. The second category consists of numerical iteration-based methods, such as Broyden’s method [15,16,20], which avoid explicit modeling by iteratively updating the Jacobian matrix. To validate the effectiveness of our proposed method, we conducted a series of simulations in MATLAB R2024a (MathWorks, Natick, MA, USA), comparing our geometry-constrained learning-based visual servoing framework with two representative model-free approaches: Broyden’s update method [20] and a neural network-based visual servoing method [21]. To ensure a fair comparison, all methods were independently tuned to achieve their best performance under the same experimental conditions. In addition, for the neural network-based method [21], we implemented a structure similar to ours, using a CMAC network with four overlapping associative memory layers. Each input dimension was discretized into nine regions, with every layer internally organizing the inputs into three blocks per dimension.
In the simulation environment, we constructed scenarios with 100 feature points and evaluated performance under both noise-free and noisy image conditions. The evaluation metrics included the RMSE of feature positions and the convergence trajectory of the object center in image space.
As shown in Figure 9b, under noise-free conditions, all three methods eventually achieved convergence, but with significant differences in their convergence characteristics. Both our proposed method and Broyden’s method [20] demonstrated superior dynamic performance, rapidly reducing RMSE to near zero within approximately 6 s. Among these, our method achieved the fastest convergence rate. In contrast, the neural network-based method [21], while ultimately converging, exhibited notable fluctuations and required approximately 35 s to fully stabilize.
When image noise was introduced, the differences in robustness among the methods became more distinct. In this test, we applied random noise with a maximum amplitude of 20 pixels to 10 feature points to simulate measurement uncertainties in real-world environments. Under these conditions, our method maintained its rapid convergence characteristics and minimal steady-state error, while the comparative methods showed significant performance degradation, as shown in Figure 9d. Specifically, the neural network-based method [21] eventually converged but required a longer convergence time. Broyden’s method [20] demonstrated noticeable performance degradation, with its RMSE stabilizing at approximately 100 pixels, indicating high sensitivity to noise.
Figure 9a,c further visualize the object center trajectories in image space under noisy and noise-free conditions, respectively. The results clearly indicate that, in both scenarios, our method produces smoother trajectories and successfully converges to the target position even in the presence of noise.
Through these two simulation comparisons, we can conclude that our proposed algorithm outperforms existing methods in terms of convergence speed, stability, and noise robustness. This performance improvement can be attributed to three key innovations:
(1)
Using a fixed-dimension task space error function as the neural network input effectively reduces network complexity. This smaller input dimension results in fewer learnable parameters, making the learning process more efficient. In contrast, other neural networks use feature point positions as inputs, where the input dimension increases proportionally with the number of feature points. As a result, when the number of feature points is large, these methods require learning an excessive number of parameters, significantly increasing computational complexity.
(2)
Incorporating geometric constraints of feature points to assist network learning, which not only accelerates the learning process, but also ensures the physical feasibility of control outputs. In comparison, although Broyden’s method [20] avoids the problem of high-dimensional neural networks, its stability and convergence performance when processing image noise remain significantly inadequate, limiting its effectiveness in practical applications.
(3)
Leveraging the Projective Homography matrix to enhance noise robustness. Unlike methods that rely on individual feature points, homography-based estimation effectively filters out the influence of measurement noise and ensures a more stable and reliable Jacobian estimation.

5.2.3. Computational Complexity Analysis

To evaluate the computational efficiency of the proposed method, we compare it with two representative data-driven image-based visual servoing (IBVS) approaches: Broyden’s update method [20] and a fuzzy CMAC-based neural network method [21].
The method in [21] builds a fuzzy CMAC controller for each joint and uses the image feature error vector, which has a dimension of 2 P , as input. As a result, both the complexity of the inference and the training process grow linearly with the number of joints n q and the number of feature points P, leading to a complexity per step of O ( n q · P ) .
The method in [20] updates the Jacobian matrix and a dynamic projection matrix during each iteration. Although it avoids direct matrix inversion, the update process still involves matrix multiplications and control computation. This results in a per-step complexity of O ( P · n q + n q 2 ) , which increases significantly when P becomes large.
In contrast, the proposed method uses a compact input vector composed of homography-based task error and joint states, with a fixed dimension that does not depend on the number of feature points. The inference complexity per step is only O ( n q ) . This design reduces computational cost and improves real-time performance in visual tasks with many image features.

5.3. Experiments

As shown in Figure 10, the experimental setup consists of a 6-DOF UR5 collaborative robot with an Intel RealSense D435i RGB-D camera (30 FPS) rigidly mounted on its end-effector. The visual servoing system was implemented using the Robot Operating System (ROS). A visual processing node (20 Hz) performed feature extraction, matching, and control computation. A control interface node (125 Hz) converted the control outputs into joint velocity commands. Communication between nodes was handled via ROS topics over TCP/IP. Computation was performed on a workstation with an Intel i7-9700K CPU, 32 GB RAM, and an NVIDIA RTX 4080Ti GPU.
To validate the effectiveness of our algorithm, we conducted two comparative experiments using different geometric constraint parameters: λ . Both experiments tested robustness by introducing occlusions of partial feature points during robot motion.
In both experiments, the robot was initially programmed to follow a random trajectory near its starting position to collect pre-training data. Feature matching and homography estimation were performed using the LightGlue framework combined with RANSAC, ensuring robust performance under varying lighting conditions and viewpoint changes while effectively filtering out outliers.
In the first experiment ( λ = 0 ), Figure 11a shows the progress of the images captured during online learning compared to the reference images. The sequence displays images at the initial moment and at 1.18 s, 7.08 s, and 28.05 s, respectively. Figure 11c tracks the evolution of the task error e, revealing an initial error increase during the first 1 s due to insufficient neural network learning, followed by eventual convergence as learning progressed.
In the second experiment ( λ = 0.001 ), the system exhibited enhanced performance characteristics, as illustrated in Figure 11b,d. These figures demonstrate accelerated convergence toward the reference image compared to the unconstrained approach. Figure 12 presents the comparison of task error RMSE between both experimental conditions, revealing the advantages of incorporating geometric constraints. The constrained implementation (red line) achieves error reduction at a substantially higher rate.
Throughout both experiments, we deliberately occluded certain feature points at various times, yet the error consistently converged. This demonstrates our algorithm’s robustness in maintaining accuracy even under challenging conditions with varying numbers of visible feature points.
The experimental results confirm that incorporating geometric constraints through our proposed ( λ ) parameter significantly accelerates learning.

6. Conclusions

In this paper, we introduce a geometry-constrained learning-based controller for visual servoing systems based on projective homography. Our approach utilizes a neural network to learn the system’s Jacobian matrix, thereby eliminating the need for precise camera calibration parameters and detailed robot kinematic models. Unlike other neural network approaches, our method utilizes a newly defined error vector related to projective homography as input, ensuring a constant input size irrespective of the number of image feature points. We also demonstrated in Appendix A that the defined error vector e = 0 is a sufficient and necessary condition to achieve R = I and t = 0, which signifies the alignment of the camera with the target. Furthermore, we incorporated geometric constraints between feature points in the network update process. By ensuring that model predictions conform to the fundamental principles of projective geometry, we significantly improved learning efficiency. Through simulations and experiments, we validate that our approach achieves superior performance compared to other model-free visual servoing methods, exhibiting faster convergence rates, enhanced robustness to image noise and partial occlusions.
Looking forward, while the proposed framework is robust and calibration-free, it still assumes sufficient exploration and network convergence during deployment. In practice, especially in unfamiliar or dynamic environments, learning may be incomplete, potentially leading to unsafe control actions. As a future direction, we plan to embed safety constraints into the control law, such as limits on joint velocities, workspace boundaries, and proximity to humans or delicate objects. These constraint-aware mechanisms will help ensure safe robot behavior even under imperfect Jacobian estimation, improving system reliability and enabling deployment in safety-critical applications.

Author Contributions

Y.Z. led the research, including algorithm development, experimental design, and manuscript drafting. A.G. assisted in implementing the methodology and conducting experiments. Y.A. contributed to data processing and analysis. K.J. and S.K. provided technical guidance and helped refine the manuscript. T.K. supervised the research, offering critical insights and oversight throughout the study. All authors reviewed and approved the final version of the manuscript.

Funding

This research was supported by Imdang Scholarship & Cultural Foundation (S20241848000).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

The authors are thankful to the anonymous reviewers whose comments helped us to improve the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Necessity: $e = 0 \Rightarrow$ “$R = I$ and $t = 0$”. If $e = 0$, we have $E_{1:2,:}^i [\varepsilon_{di}^T \; 1]^T = \varepsilon_{di} - \varepsilon_i = 0$. This implies that the feature points in the reference and current images are perfectly aligned. For a given set of non-collinear points $\varepsilon_i$ and $\varepsilon_{di}$ ($i = 1, 2, \ldots, 5$), their correspondence uniquely determines the projective homography matrix. Therefore, $\bar{G} = I$ and $\gamma_i = 1$.
Because $\bar{G} = \frac{G}{(\det G)^{1/3}}$, we can derive $G = (\det G)^{1/3} I$, which implies that $G$ is a diagonal matrix with all diagonal elements equal to $(\det G)^{1/3}$. According to Equation (4), we obtain:
$H = K^{-1} G K = (\det G)^{1/3} I.$
Multiplying $H$ by the unit vectors $i_x = [1\ 0\ 0]^T$, $i_y = [0\ 1\ 0]^T$, and $i_z = [0\ 0\ 1]^T$ results in the following equations:
$H i_x = R i_x + t n_{d1} = (\det G)^{1/3} i_x,$
$H i_y = R i_y + t n_{d2} = (\det G)^{1/3} i_y,$
$H i_z = R i_z + t n_{d3} = (\det G)^{1/3} i_z,$
where n d i represents i-th element of the vector n d .
The combinations $(\mathrm{A2})\, n_{d2} - (\mathrm{A3})\, n_{d1}$, $(\mathrm{A3})\, n_{d3} - (\mathrm{A4})\, n_{d2}$, and $(\mathrm{A4})\, n_{d1} - (\mathrm{A2})\, n_{d3}$ cancel the translation terms and can be simplified as
$$R\, v_i = \det(G)^{1/3}\, v_i, \qquad i = 1, 2, 3, \tag{A5}$$
where $v_1 = i_x n_{d2} - i_y n_{d1}$, $v_2 = i_y n_{d3} - i_z n_{d2}$, and $v_3 = i_z n_{d1} - i_x n_{d3}$.
Because $R$ rotates the vector $v_i$ without changing its length, $\|R v_i\| = \|v_i\|$, and Equation (A5) therefore implies
$$\det(G)^{1/3} = 1. \tag{A6}$$
We assume that $n_d \neq 0$, so at least two of the vectors $v_i$ are linearly independent. From Equations (A5) and (A6), $R$ leaves these vectors unchanged, and since the only rotation that fixes two linearly independent vectors is the identity, we conclude that
$$R = I. \tag{A7}$$
Substituting Equations (A6) and (A7) into Equation (2) yields
$$t = 0. \tag{A8}$$
Sufficiency: "$R = I$ and $t = 0$" $\Rightarrow e = 0$.
When $R = I$ and $t = 0$, the desired camera frame $\{F_{cd}\}$ coincides with the current camera frame $\{F_c\}$. Consequently, the depth ratio $Z_{di}/Z_i$ equals 1 and $\gamma_i = 1$. In addition, $\bar{G} = \dfrac{K K^{-1}}{\det(K K^{-1})^{1/3}} = I$. Therefore, $e = 0$ follows from Equations (13)–(16).
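As an informal numerical illustration of the equivalence proven above (not part of the original derivation), the short script below builds Euclidean homographies of the form $H = R + t\,n_d^{\top}$ for a few poses, forms $G = K H K^{-1}$ with an arbitrary intrinsic matrix, and checks that the normalized matrix $\bar{G}$ deviates from the identity exactly when $R \neq I$ or $t \neq 0$; the specific matrices and the Frobenius-norm error measure are illustrative assumptions.

```python
import numpy as np

def rot_x(theta):
    """Rotation about the x-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def error_norm(R, t, n_d, K):
    """Frobenius norm of G_bar - I for the pose (R, t).

    H = R + t n_d^T is the Euclidean homography (n_d: plane normal scaled by
    the inverse plane distance), G = K H K^{-1} its projective counterpart,
    and G_bar = G / det(G)^(1/3) its normalization.
    """
    H = R + np.outer(t, n_d)
    G = K @ H @ np.linalg.inv(K)
    G_bar = G / np.cbrt(np.linalg.det(G))
    return np.linalg.norm(G_bar - np.eye(3))

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])      # arbitrary intrinsic matrix (assumed)
n_d = np.array([0.0, 0.0, 1.0])      # assumed scaled plane normal

aligned = error_norm(np.eye(3), np.zeros(3), n_d, K)
rotated = error_norm(rot_x(0.2), np.zeros(3), n_d, K)
shifted = error_norm(np.eye(3), np.array([0.05, 0.0, 0.0]), n_d, K)

print(f"R=I,  t=0  -> error {aligned:.2e}")   # ~0, consistent with sufficiency
print(f"R!=I, t=0  -> error {rotated:.2e}")   # > 0
print(f"R=I,  t!=0 -> error {shifted:.2e}")   # > 0
```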

References

1. Thomas, J.; Chaumette, F. Positioning in Congested Space by Combining Vision-based and Proximity-based Control. IEEE Robot. Autom. Lett. 2024, 9, 8362–8369.
2. Leite, G.R.; Araújo, Í.B.Q.d.; Martins, A.d.M. Regularized Maximum Correntropy Criterion Kalman Filter for Uncalibrated Visual Servoing in the Presence of Non-Gaussian Feature Tracking Noise. Sensors 2023, 23, 8518.
3. Gans, N.R.; Hutchinson, S.A. Stable visual servoing through hybrid switched-system control. IEEE Trans. Robot. 2007, 23, 530–540.
4. AlBeladi, A.; Ripperger, E.; Hutchinson, S.; Krishnan, G. Hybrid eye-in-hand/eye-to-hand image based visual servoing for soft continuum arms. IEEE Robot. Autom. Lett. 2022, 7, 11298–11305.
5. García-Aracil, N.; Malis, E.; Aracil-Santonja, R.; Pérez-Vidal, C. Continuous visual servoing despite the changes of visibility in image features. IEEE Trans. Robot. 2005, 21, 1214–1220.
6. Fang, Y.; Dixon, W.E.; Dawson, D.M.; Chawda, P. Homography-based visual servo regulation of mobile robots. IEEE Trans. Syst. Man, Cybern. Part B (Cybernetics) 2005, 35, 1041–1050.
7. Hu, G.; MacKunis, W.; Gans, N.; Dixon, W.E.; Chen, J.; Behal, A.; Dawson, D. Homography-based visual servo control with imperfect camera calibration. IEEE Trans. Autom. Control 2009, 54, 1318–1324.
8. Chen, J.; Dawson, D.M.; Dixon, W.E.; Behal, A. Adaptive homography-based visual servo tracking for a fixed camera configuration with a camera-in-hand extension. IEEE Trans. Control Syst. Technol. 2005, 13, 814–825.
9. Lai, B.; Li, Z.; Li, W.; Yang, C.; Pan, Y. Homography-based visual servoing of eye-in-hand robots with exact depth estimation. IEEE Trans. Ind. Electron. 2023, 71, 3832–3841.
10. Lei, X.; Fu, Z.; Spyrakos-Papastavridis, E.; Pan, J.; Li, M.; Chen, X. IHUVS: Infinite Homography-Based Uncalibrated Methodology for Robotic Visual Servoing. IEEE Trans. Ind. Electron. 2023, 71, 3822–3831.
11. Gong, Z.; Tao, B.; Yang, H.; Yin, Z.; Ding, H. An uncalibrated visual servo method based on projective homography. IEEE Trans. Autom. Sci. Eng. 2017, 15, 806–817.
12. Liu, C.; Ye, C.; Shi, H.; Lin, W. Discrete-Time Visual Servoing Control with Adaptive Image Feature Prediction Based on Manipulator Dynamics. Sensors 2024, 24, 4626.
13. Aghili, F. Fault-tolerant and adaptive visual servoing for capturing moving objects. IEEE/ASME Trans. Mechatron. 2021, 27, 1773–1783.
14. Xu, F.; Zhang, Y.; Sun, J.; Wang, H. Adaptive visual servoing shape control of a soft robot manipulator using bezier curve features. IEEE/ASME Trans. Mechatron. 2022, 28, 945–955.
15. Piepmeier, J.A.; McMurray, G.V.; Lipkin, H. A dynamic quasi-Newton method for uncalibrated visual servoing. In Proceedings of the 1999 IEEE International Conference on Robotics and Automation (cat. no. 99CH36288C), Detroit, MI, USA, 10–15 May 1999; Volume 2, pp. 1595–1600.
16. Piepmeier, J.A.; Lipkin, H. Uncalibrated eye-in-hand visual servoing. Int. J. Robot. Res. 2003, 22, 805–819.
17. Tokuda, F.; Arai, S.; Kosuge, K. Convolutional neural network-based visual servoing for eye-to-hand manipulator. IEEE Access 2021, 9, 91820–91835.
18. Gao, J.; Proctor, A.; Bradley, C. Adaptive neural network visual servo control for dynamic positioning of underwater vehicles. Neurocomputing 2015, 167, 604–613.
19. Tan, N.; Yu, P.; Zheng, W. Uncalibrated and unmodeled image-based visual servoing of robot manipulators using zeroing neural networks. IEEE Trans. Cybern. 2022, 54, 2446–2459.
20. Xie, Z.; Zheng, Y.; Jin, L. A data-driven image-based visual servoing scheme for redundant manipulators with unknown structure and singularity solution. IEEE Trans. Syst. Man, Cybern. Syst. 2024, 54, 6230–6241.
21. Hwang, M.; Chen, Y.J.; Ju, M.Y.; Jiang, W.C. A fuzzy CMAC learning approach to image based visual servoing system. Inf. Sci. 2021, 576, 187–203.
22. Al-Shanoon, A.; Lang, H. Robotic manipulation based on 3-D visual servoing and deep neural networks. Robot. Auton. Syst. 2022, 152, 104041.
23. Hay, O.A.; Chehadeh, M.; Ayyad, A.; Wahbah, M.; Humais, M.A.; Boiko, I.; Seneviratne, L.; Zweiri, Y. Noise-tolerant identification and tuning approach using deep neural networks for visual servoing applications. IEEE Trans. Robot. 2023, 39, 2276–2288.
24. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236.
25. Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17627–17638.
Figure 1. Camera model.
Figure 2. Image features.
Figure 3. Projective transformation of feature points with collinearity constraints.
Figure 4. The structure of CMAC network.
Figure 5. Receptive field organization.
Figure 6. Control flow diagram.
Figure 7. (a) Camera positions. (b) Feature trajectories.
Figure 8. (a) Task error. (b) Object center. (c) RMSE.
Figure 9. (a) Position of object center without image noise. (b) RMSE of feature positions without image noise. (c) Position of object center with image noise. (d) RMSE of feature positions with image noise [20,21].
Figure 10. Receptive field organization.
Figure 11. (a) Experiment 1. (b) Experiment 2. (c) Task error of experiment 1. (d) Task error of experiment 2.
Figure 12. RMSE of feature positions for two experiments.
