1. Introduction
Visual servoing uses visual feedback to refine robot motion strategies, significantly enhancing adaptability and intelligence. The visual system can be mounted on the robot's end effector (eye-in-hand configuration) or positioned near the robot (eye-to-hand configuration) [1]. This study focuses on the eye-in-hand configuration.
Visual servoing is generally divided into two main approaches: position-based visual servoing (PBVS) and image-based visual servoing (IBVS) [2,3,4]. PBVS operates in Cartesian space, enabling global convergence by directly associating image features with the pose of the target in 3D space. However, PBVS is sensitive to camera calibration errors, inaccuracies in the kinematic model, and noise; it often requires precise calibration, which is challenging for non-experts. In contrast, IBVS defines its error signal in the image space, making it less sensitive to calibration errors, which primarily affect convergence speed rather than accuracy. Nevertheless, IBVS is prone to local minima, and the appearance or disappearance of feature points during the control process can significantly disrupt the continuity of the control law [5]. An alternative approach is homography-based visual servoing (HBVS), which combines the advantages of PBVS and IBVS. HBVS is resilient to partial occlusion of features and does not require prior information for 3D reconstruction. Several studies [6,7,8,9] have decomposed the homography matrix to obtain the translation vector t and rotation matrix R, converting image information into motion in Cartesian space. However, decomposing the homography matrix during the control process increases computational demands and degrades the real-time performance of the system. To circumvent this issue, one study [10] proposed constructing a cost function directly from the elements of the homography matrix; however, this method often requires vanishing-point detection. To address these challenges, projective homography-based uncalibrated visual servoing (PHUVS) was introduced [11]. PHUVS establishes the task function by estimating the projective homography using only image information within a certain range, eliminating the need for vanishing-point detection. Moreover, it produces Jacobian matrices of fixed dimensions, thereby reducing the computational complexity of matrix inversion.
Visual servoing methods typically rely on the Jacobian matrix, which describes the mapping between visual information and manipulator joint angular displacements. It is a local linear approximation of the nonlinear, highly coupled relationship between the visual and motion spaces, making it crucial for control algorithms. Traditional methods usually require accurate kinematic models of the system or precise estimates of the intrinsic camera parameters, which significantly limits their applicability. To address the challenges of uncalibrated visual servoing, adaptive controllers have been proposed to estimate unknown parameters online [12,13,14]. However, these methods often require linear parameterization of the system model, which is complicated for nonlinear systems with multiple degrees of freedom. Consequently, model-free learning methods have been introduced to compensate for these uncertainties without the need for linear parameterization. These methods treat the estimation of the Jacobian matrix as a dynamic parameter identification problem and employ recursive estimation techniques such as weighted recursive least squares, Broyden's method, and dynamic Gauss–Newton algorithms [15,16]. Neural networks are renowned for their powerful nonlinear function approximation capabilities and have been widely applied in robotic systems [17,18,19,20,21,22,23]; however, their application in visual servoing systems is relatively limited. In [17], a method based on perceptron neural networks was proposed to learn the inverse mapping of an unknown interaction matrix, but it required a large amount of data for offline training. Data-driven online mapping estimation methods have also been designed [18,19]. In addition, ref. [20] proposed a data-driven IBVS method that addresses both target tracking and the physical constraints of the robot, such as joint velocity limits and field-of-view (FOV) constraints. The study in [21] presented a robotic control system that uses a cerebellar model articulation controller (CMAC) network within a Takagi–Sugeno fuzzy framework, capable of learning joint velocities online directly from image feature errors. Recent works have also explored deep learning-based visual servoing frameworks with improved robustness and noise tolerance. For instance, ref. [22] employs deep networks for 3D visual servoing without requiring precise calibration, while [23] introduces a noise-tolerant Jacobian estimation method using neural networks under pixel-level disturbances.
Traditional visual servoing systems often rely on handcrafted features such as SIFT or ORB, which can be sensitive to noise, lighting changes, and viewpoint variations. To improve robustness, learning-based methods such as SuperPoint [24] and LightGlue [25] have been developed. These methods offer improved feature repeatability and matching accuracy under challenging conditions, making them promising complements to learning-based visual servoing frameworks.
Current visual servoing algorithms face two main challenges. (1) Existing model-free visual servoing approaches are predominantly based on IBVS, in which the loss of feature points can interrupt the control process. Owing to changes in lighting and partial occlusion, however, it is difficult to ensure that every feature point is detected and accurately matched during control. IBVS is therefore primarily designed for basic visual primitives, such as distinct points or edges; for objects with sparse or low-texture surfaces, feature matching errors become significantly higher, further degrading the reliability of the control process. Furthermore, the size of the image Jacobian matrix is proportional to the number of feature points, resulting in high computational costs when calculating the Jacobian pseudoinverse. When feature points are used as the inputs or outputs of a neural network, an increase in their number significantly expands the network's dimensionality, affecting training efficiency and computational complexity. This not only increases the number of network parameters but may also lead to optimization issues such as vanishing or exploding gradients, ultimately impacting convergence speed and stability. (2) Current model-free visual servoing methods typically use neural networks or recursive algorithms to learn the image Jacobian or to map visual features to control commands. However, these approaches often ignore the geometric constraints among visual features, so parameter updates may violate the underlying geometry during training. Without explicitly incorporating these constraints, robots may execute unreasonable motions.
We develop a geometry-constrained learning-based control strategy based on PHUVS. The main contributions of this paper include:
Learning-Based Control Strategy: A novel control method is proposed for PHUVS, where the CMAC neural network is employed to estimate the Jacobian matrix online. This eliminates the need for traditional calibration and kinematic modeling.
Fixed-Dimension Visual Error Function via Homography: A new visual error function based on the projective homography matrix is designed. It maintains a fixed dimension and uniform magnitude across all components, thereby improving learning efficiency, reducing computational complexity, and offering robustness against feature point occlusion and detection errors.
Incorporation of Geometric Constraints in Learning: Fundamental geometric relationships among visual features (e.g., collinearity) are embedded into the CMAC network’s learning process, enhancing both accuracy and convergence speed.
The remainder of this paper is organized as follows:
Section 2 provides an overview of the camera/robot model and the projective homography matrix, while clearly defining the control problem.
Section 3 details the developed controller.
Section 4 presents the stability analysis of the proposed control system, providing theoretical guarantees for the controller's performance. In Section 5, we analyze the simulation and experimental results to demonstrate the efficacy of the proposed methodology.
Section 6 presents the conclusions and highlights the contributions of the study.
3. Controller Development
3.1. Novel Fixed-Dimension, Uniform-Magnitude Task Error Function
Assume that the camera captures an image at a reference position. The objective is to adjust the pose of the robot to ensure the current image aligns with the reference image, effectively bringing the camera back to its reference position.
This process involves the detection and matching of image feature points. In Figure 2, the blue points represent the matched feature point pairs, which are used to estimate the projective homography matrix. To ensure robust estimation of this transformation, we use a sufficient number of feature points across both images. Having sufficient points stabilizes the homography estimation, as it reduces sensitivity to noise or small mismatches at individual points. Robust estimation methods, such as random sample consensus (RANSAC), help filter out outliers and produce an accurate homography matrix.
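For illustration, a minimal sketch of this estimation step using OpenCV is shown below; the function name and point arrays are illustrative and do not refer to the implementation used in this work.

```python
import cv2
import numpy as np

def estimate_projective_homography(pts_current, pts_reference, ransac_thresh=3.0):
    """Estimate the projective homography between matched feature points,
    using RANSAC to reject outlier matches.

    pts_current, pts_reference: (N, 2) arrays of matched pixel coordinates.
    Returns the 3x3 homography matrix and a boolean inlier mask.
    """
    pts_current = np.asarray(pts_current, dtype=np.float64)
    pts_reference = np.asarray(pts_reference, dtype=np.float64)
    G, mask = cv2.findHomography(pts_current, pts_reference, cv2.RANSAC, ransac_thresh)
    if G is None:
        raise ValueError("Homography estimation failed; check the matches.")
    return G, mask.ravel().astype(bool)
```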
To achieve a fixed-dimensional Jacobian, we introduce five fixed, non-collinear virtual reference feature points around the object center. The expected position of each virtual feature point in the reference image is defined as:
where the first symbol represents the center of the object in the reference image, and the latter two are the displacements of the feature points in the u and v directions, respectively.
We define an error matrix as follows:
The actual position of the i-th virtual feature point in the current image is computed from the estimated homography. Based on Equations (5), (6), (13), and (14), the position error can be rewritten as:
where the coefficient matrix is formed by the first two rows of the corresponding full matrix.
Furthermore, this quantity is our designed task error function for the i-th feature. If the homography estimation is precise, it is equivalent to the feature point position error.
We combine the error vectors for the five virtual feature points into a single error task vector:
Theorem 1. The task error function e = 0 if and only if R = I and t = 0, which is proven in Appendix A.
As mentioned earlier, traditional IBVS must accurately track each image feature point throughout the entire visual servoing process. This requirement is stringent and poses challenges, particularly in dynamic environments where feature points may become occluded. In contrast, our PHUVS-based control system can compute the projective homography matrix using a local set of feature points, with RANSAC employed to filter outliers and enhance robustness to noise. Furthermore, even if the i-th virtual feature point is occluded, we can still determine the scale factor and its corresponding error matrix based on Equations (13) and (14). This implies that the loss of certain feature points does not affect the computation of the task function.
If the projective homography matrix is estimated precisely, the error vector is equivalent to the position error of the i-th auxiliary feature point. Each element of e is expressed in the same unit (pixels), ensuring uniformity in magnitude across all components. In the subsequent sections, we use e as an input signal for the neural network. This consistency simplifies the input quantization process, thus enhancing the effectiveness of the learning process.
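To make the construction concrete, the sketch below builds five virtual reference points around the object center and stacks the corresponding pixel-space errors into a fixed-dimension task error; the point layout, the direction of the homography mapping, and the scaling are assumptions, since the exact expressions are given by Equations (12)–(16).

```python
import numpy as np

def virtual_reference_points(center_uv, du, dv):
    """Five fixed, non-collinear virtual points in the reference image:
    four offsets around the object center plus the center itself (assumed layout)."""
    u0, v0 = center_uv
    return np.array([
        [u0 - du, v0 - dv],
        [u0 + du, v0 - dv],
        [u0 + du, v0 + dv],
        [u0 - du, v0 + dv],
        [u0,      v0     ],
    ], dtype=np.float64)

def task_error(G, ref_pts):
    """Fixed-dimension task error (10x1): for each virtual reference point, the
    pixel discrepancy between its homography-mapped position and its reference
    position. The mapping direction of G is an assumption."""
    errors, cur_pts = [], []
    for p_ref in ref_pts:
        q = G @ np.append(p_ref, 1.0)     # map through the estimated homography
        p_cur = q[:2] / q[2]              # perspective normalization
        cur_pts.append(p_cur)
        errors.append(p_cur - p_ref)      # error expressed in pixels
    return np.concatenate(errors), np.array(cur_pts)
```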
We define the system state as:
The mapping between the derivative of the new task function and the robot joint velocities is defined as:
where the coefficient matrix is the Jacobian matrix of the entire system.
3.2. Collinearity Constraint
In visual servoing systems, utilizing geometric relationships between feature points enhances control performance. As shown in Figure 3, we consider two key collinearity relationships among the virtual feature points: one triplet of points is collinear, and a second triplet is collinear. These collinearity properties remain invariant under projective transformations, ensuring that the corresponding point triplets in the current configuration maintain the same geometric relationships. The arrows in the figure represent the velocity vectors of the feature points during the servoing process. Based on these collinearity relationships, we derive velocity constraints that preserve the geometric properties throughout the motion.
For collinear points, the following relationships hold:
where the interpolation parameters can be calculated as:
Differentiating Equations (19) and (20) with respect to time, we obtain:
This velocity relationship indicates that the velocity of point 5 is a weighted combination of the velocities of the other two points in each collinear set.
For a static target object, the reference feature positions are constant. In terms of Equation (15), Equations (23) and (24) can be written as:
where the corresponding quantities are defined for i = 1, 2, …, 5. To eliminate the influence of the residual terms, we multiply both sides of the equations by suitable factors, yielding:
These velocity constraints are incorporated into our learning algorithm, ensuring that the neural network updates respect the underlying geometric principles of projective transformations.
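As a concrete illustration, the sketch below computes the interpolation weight of a point lying on the segment between two others and evaluates the corresponding velocity-constraint residual; treating the weight as constant over a sampling interval is a simplification of the exact relations in Equations (19)–(28), and the helper names are illustrative.

```python
import numpy as np

def collinear_weight(p_a, p_b, p_c):
    """Weight alpha such that p_c ≈ alpha * p_a + (1 - alpha) * p_b for three
    (approximately) collinear 2D points, via projection onto the segment direction."""
    d = p_a - p_b
    return float(np.dot(p_c - p_b, d) / np.dot(d, d))

def velocity_constraint_residual(v_a, v_b, v_c, alpha):
    """Residual of the differentiated collinearity relation: it vanishes when the
    velocity of the dependent point equals the weighted combination of the other two."""
    return v_c - (alpha * v_a + (1.0 - alpha) * v_b)
```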
3.3. Model-Free Jacobian Learning
Traditional methods typically rely on precise mathematical models; however, in practical applications, accurate modeling is often difficult to achieve due to system parameter uncertainties and environmental disturbances. This section proposes the use of a CMAC neural network to learn and approximate the system's Jacobian matrix, enabling a more robust and adaptive control strategy. The section first introduces the CMAC network structure and then explains the weight learning process. Notably, our weight update approach incorporates geometric constraints of visual features to accelerate the learning process.
3.3.1. CMAC Model
The CMAC network is divided into five layers: input layer, association memory space, receptive-field space, weight memory space, and output layer, as shown in Figure 4.
Input space:
In this research, we use the system state vector, which contains the task error and the joint positions, as the input and employ a neural network to learn the unknown Jacobian matrix. The input dimension is therefore fixed (the robot has six joints). The input vector needs to be quantized into discrete regions according to its corresponding range:
where the maximum index value after quantization and the floor function appear, the latter ensuring that each quantized index is an integer, and the minimum and maximum values of the i-th element of the state vector define its quantization range. Figure 5 illustrates a CMAC neural network with a two-dimensional input, where each input is discretized into 10 regions.
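A minimal sketch of this quantization step, assuming a uniform partition of each input range (the names and the clipping behavior are illustrative):

```python
import numpy as np

def quantize_input(x, x_min, x_max, n_regions):
    """Map each element of the state vector x to an integer region index in
    [0, n_regions - 1], assuming a uniform partition of [x_min, x_max]."""
    x = np.asarray(x, dtype=np.float64)
    scaled = (x - x_min) / (x_max - x_min)           # normalize to [0, 1]
    idx = np.floor(scaled * n_regions).astype(int)   # floor to an integer index
    return np.clip(idx, 0, n_regions - 1)            # keep indices inside the range
```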
Association memory space:
Each input dimension is uniformly partitioned into several discrete regions, and each complete block includes three adjacent regions. In Figure 5, the two inputs have their regions grouped into blocks (a, b, c, d) and (A, B, C, D), respectively. The network creates multiple layers by shifting the block boundaries. For example, layer 2 contains blocks D, E, F for one input and blocks d, e, f for the other, formed by shifting the original blocks by one region; layer 3 follows the same pattern with additional shifts. Each block is represented by a Gaussian membership function:
where k denotes the k-th block, the total number of blocks for each input dimension is fixed, and the remaining parameters represent the center and width of the Gaussian function for the k-th block, respectively.
Receptive-field space:
The receptive field refers to the region in the input space that can activate specific neurons or memory cells in the network. A key feature of the CMAC network is the use of overlapping receptive fields: a single input can simultaneously activate multiple memory cells, creating a distributed representation. The degree of overlap affects the network's smoothness and generalization capability; highly overlapping receptive fields produce smoother function approximations, while less overlap produces more localized responses. As shown in Figure 5, point p in the input space falls within blocks aB, fD, and Ee, activating the corresponding memory cells. The multidimensional receptive-field function is defined as:
where the parameter represents the number of layers in the receptive-field space.
Weight memory space and output layer:
Each receptive field is connected to the weight memory W. The i-th output of the neural network is a weighted sum of the receptive-field activations:
where the corresponding weight vector and the vector of receptive-field functions have matching dimensions.
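The sketch below summarizes one way to evaluate the CMAC forward pass described above: Gaussian block memberships per input dimension, multidimensional receptive fields formed as products across dimensions (one per layer), and outputs as weighted sums. The product composition and the precomputed block centers/widths are assumptions consistent with common fuzzy-CMAC formulations; the block-selection bookkeeping is omitted.

```python
import numpy as np

def gaussian_membership(x, center, width):
    """Membership of input value(s) x in a block with the given center and width."""
    return np.exp(-((x - center) ** 2) / (width ** 2))

def receptive_field_activations(x, block_centers, block_widths):
    """Receptive-field values, one per layer.

    x:             input vector, shape (n_inputs,)
    block_centers: shape (n_layers, n_inputs), centers of the blocks activated
                   by x in each layer (block selection itself is omitted here)
    block_widths:  same shape, widths of those blocks
    Each layer's value is the product of the per-dimension memberships.
    """
    g = gaussian_membership(x[None, :], block_centers, block_widths)
    return np.prod(g, axis=1)                # shape (n_layers,)

def cmac_output(x, block_centers, block_widths, W):
    """Outputs as weighted sums of the receptive-field activations;
    W has shape (n_outputs, n_layers)."""
    gamma = receptive_field_activations(x, block_centers, block_widths)
    return W @ gamma, gamma
```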
3.3.2. Weight Update of Neural Network
The discrete form of Equation (18) can be written as:
where the subscript i represents the i-th time step and the joint increment is taken over the sampling interval. Next, the state vector is used as the input to approximate the Jacobian matrix with the CMAC network.
where the ideal network weights are constant matrices, each used to approximate the element in the i-th row and j-th column of the Jacobian matrix, and the vector of receptive-field functions serves as the network activation. The approximation error of the neural network is denoted as σ.
It is important to note that the estimated incremental task error consists of five 2D vectors. To distinguish between these components, we use superscript notation to denote the k-th element of the task error increment. Thus, we have:
where the first factor collects the corresponding network weights, the second term is the approximation error of the neural network, and the superscript m denotes the m-th element of the joint increment.
This can be further expressed as shown below:
where the m-th column of the corresponding weight matrix appears, and the residual term is the approximation error introduced by the neural network model.
Assumption 1. The approximation error σ is assumed to be upper bounded by a finite positive constant. In addition, under a properly designed neural network architecture, σ can be made sufficiently small; however, the bound itself is unknown.
Using the estimated weights, the element in the k-th row and m-th column of the estimated Jacobian matrix can be given by:
where the estimated weights take the place of the desired ideal weights.
The full estimated Jacobian matrix is then:
The estimated incremental change in task error is:
To incorporate geometric constraints into the learning process, we leverage the velocity relationships between collinear feature points defined in Equations (27) and (28). The Jacobian matrix estimated by our neural network must satisfy the following constraints:
We then define a loss function that incorporates both prediction accuracy and geometric constraints:
Then, the weights of the neural network are updated by:
where β represents the learning rate parameter for the neural network weight update process, and the remaining term is the estimation error.
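The following sketch shows a gradient-style version of this update, combining the prediction error with a collinearity-constraint residual; the exact constraint terms are those of Equations (39)–(42), so this is a schematic under simplified notation, and all names are illustrative.

```python
import numpy as np

def update_weights(W, gamma, dq, de_meas, constraint_residual, beta=0.05, lam=0.1):
    """One weight-update step for the Jacobian-learning CMAC.

    W:        (n_e, n_q, n_layers) weights; W[k, m] approximates Jacobian entry (k, m)
    gamma:    (n_layers,) receptive-field activations at the current state
    dq:       (n_q,) joint increment over the sampling interval
    de_meas:  (n_e,) measured task-error increment
    constraint_residual: (n_e,) collinearity-constraint residual mapped to
              task-error coordinates (schematic)
    beta:     learning rate; lam: constraint weight
    """
    J_hat = W @ gamma                      # (n_e, n_q) estimated Jacobian
    de_hat = J_hat @ dq                    # predicted task-error increment
    pred_err = de_meas - de_hat
    # gradient of 0.5*||pred_err||^2 w.r.t. W[k, m, :] is -pred_err[k] * dq[m] * gamma
    grad_pred = -np.einsum('k,m,l->kml', pred_err, dq, gamma)
    # schematic constraint term pushing the prediction toward the collinearity relation
    grad_con = np.einsum('k,m,l->kml', constraint_residual, dq, gamma)
    W -= beta * (grad_pred + lam * grad_con)
    return W, pred_err
```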
3.4. Controller Design
The control law of a robot is defined as:
where the control gain is a positive value.
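A minimal sketch of the resulting joint update, assuming a damped pseudoinverse of the estimated Jacobian (the damping term is a numerical safeguard added here, not part of Equation (43)):

```python
import numpy as np

def control_update(J_hat, e, gain=0.5, damping=1e-3):
    """Joint increment that drives the task error e toward zero using the
    estimated Jacobian J_hat of shape (n_e, n_q); gain is the positive control gain."""
    JtJ = J_hat.T @ J_hat
    return -gain * np.linalg.solve(JtJ + damping * np.eye(JtJ.shape[0]), J_hat.T @ e)
```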
Figure 6 illustrates the control flow diagram. The process begins with image acquisition and feature point detection. Matched feature pairs are used to estimate the projective homography matrix with the RANSAC algorithm. Based on this matrix, the task error vector and a set of virtual feature points are computed according to Equations (12)–(16).
Next, the estimated system state vector, which includes the task error and the joint positions, is input to the CMAC neural network. The network predicts the task error increment and estimates the Jacobian matrix. Importantly, the geometry constraint module is activated after this estimation phase: it incorporates the estimated Jacobian and the virtual feature points to compute constraint terms (see Equations (39) and (40)), which are then embedded into the neural network's weight update rule (Equation (42)) to preserve geometric consistency, such as point collinearity under projective transformations.
Finally, the joint controller uses the estimated Jacobian and task error to compute the control command, as described in Equation (43), which is then sent to the robot actuators. The entire procedure operates in a closed-loop fashion, with visual feedback continually guiding subsequent control iterations.
4. Stability Analysis
In this section, we establish the stability of the proposed geometry-constrained learning-based visual servoing controller.
4.1. Geometric Constraint Relationships
Let us define the weight estimation error as the difference between the estimated and ideal weights. According to Equations (35) and (37), we have:
Similarly, from Equations (36) and (38), we can derive:
For the actual error increment, the geometric constraints must be satisfied. We establish the relationship between these constraint equations and the estimation error of the task error increment:
Similarly, for the second constraint:
It should be noted that each such quantity is a two-dimensional vector containing the corresponding pair of elements of the task error increment. The two constraints can be reformulated as:
where the residual terms represent the combined approximation errors in the constraint equations, their first and second elements entering the two scalar constraint equations, respectively. The corresponding constraint vectors are defined as:
The geometric constraints contribute to the neural network weight update rule as shown in Equation (42). The second term in this equation can be reformulated as:
where the additional terms represent the combined approximation-error effect.
The weight error dynamics can thus be expressed as:
4.2. Lyapunov Stability Analysis
To analyze the stability of the learning process, we define a positive definite Lyapunov function candidate:
The change in this Lyapunov function between consecutive time steps is:
Substituting the weight error dynamics:
where both resulting matrices are positive semi-definite.
Equation (55) can be simplified as:
Defining the corresponding aggregate quantities, we then obtain:
Based on Assumption 1, and noting that both the joint increment and the receptive-field activations are bounded, the disturbance term is also bounded. Consequently, there exists a small positive constant such that:
To rigorously ensure stability, we assume the following Persistent Excitation (PE) condition:
Assumption 2 (Persistent Excitation). There exist a positive constant and a finite time interval such that, for all t, the excitation condition holds.
Under this assumption, the averaged regressor matrix is strictly positive definite. Selecting the learning rate β to be sufficiently small, the Lyapunov function increment becomes negative outside a small residual set. Therefore, the weight estimation error is uniformly ultimately bounded, and the system error converges to a small neighborhood of the origin, ensuring robust convergence and stability under the persistent excitation condition.
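For readability, a schematic version of this argument is summarized below; the symbols are introduced here only for illustration and do not reproduce the exact matrices defined in Section 4.2.
\[
V_i = \tilde{w}_i^{\top}\tilde{w}_i, \qquad
\Delta V_i \le -\beta\,\tilde{w}_i^{\top}\!\left(A_i + B_i\right)\tilde{w}_i + \mathcal{O}(\beta^{2}) + \mathcal{O}(\varepsilon),
\]
where \(\tilde{w}_i\) is the stacked weight-estimation error, \(A_i\) and \(B_i\) denote the positive semi-definite matrices built from the receptive-field activations and the constraint regressors, and \(\varepsilon\) bounds the approximation-error terms. Under persistent excitation, the time average of \(A_i\) is strictly positive definite, so a sufficiently small learning rate \(\beta\) makes \(\Delta V_i\) negative whenever \(\|\tilde{w}_i\|\) exceeds a residual threshold, which yields uniform ultimate boundedness of the weight estimation error.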
4.3. Convergence to Desired Pose
In practice, the estimated Jacobian differs from the true Jacobian due to learning or modeling errors. We define the Jacobian estimation error as:
Assume the control update is given by the control law above. Then, the error propagation becomes:
where the closed-loop transition matrix is defined accordingly.
Given that the neural network estimation error has been proven to converge, and with an appropriate network structure and sufficient pre-training data to obtain good initial weights, it is reasonable to assume that the Jacobian estimation error is bounded; that is, there exists a constant such that:
Define the Lyapunov function:
To ensure convergence, we require the Lyapunov function to decrease at every step. This holds for all error vectors if the associated matrix is positive definite. We now derive a conservative bound that guarantees this positive definiteness.
For any unit vector, we have:
Hence:
To ensure positive definiteness, we require:
When the Jacobian estimation error is relatively small, the system can maintain stability across a wide range of control gains. When the error becomes too large, a larger control gain is required to maintain stability, but this may lead to other issues such as excessive oscillation.
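A schematic form of this bound, with notation introduced here only for illustration, is:
\[
e_{i+1} = \bigl(I - \lambda\, J_i \hat{J}_i^{+}\bigr) e_i
        = \bigl(I - \lambda (I + E_i)\bigr) e_i, \qquad \lVert E_i \rVert \le \epsilon,
\]
so a sufficient condition for \(\lVert e_{i+1}\rVert < \lVert e_i\rVert\) is \(\lVert I - \lambda (I + E_i)\rVert < 1\), which holds, for example, when \(\epsilon < 1\) and \(0 < \lambda < 2/(1+\epsilon)\); small Jacobian estimation errors therefore permit a wide range of control gains, in line with the discussion above.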
Combined with Theorem 1, which shows that e = 0 if and only if R = I and t = 0, it follows that the camera pose converges to the desired pose. The proposed controller thus ensures robust convergence despite approximation errors in the neural network.
5. Simulation and Experiment Results
5.1. Pre-Training Process
To ensure efficient initialization and promote global convergence during the visual servoing task, a pre-training phase is conducted for the neural network. In this phase, the robot performs small randomized motions around its initial configuration. During these exploratory movements, the system collects training data consisting of system states, the corresponding joint displacements, and the incremental task errors.
Each collected system state sample may activate a distinct subset of local receptive fields within the neural network. The union of all receptive fields activated by the sample set is recorded in an activation vector whose length equals the total number of receptive fields; each entry indicates whether the corresponding receptive field is activated by any sample in the dataset.
The activation matrix is defined as:
where each row corresponds to a training sample, and each column corresponds to the influence of a specific receptive field on a particular joint.
The k-th element of the incremental task error is expressed as:
where the weight vector of the neural network associated with the k-th row of the estimated Jacobian matrix appears. The stacked vector contains the incremental task errors for all samples in the dataset and is defined as:
The weights of the neural network are initially estimated by solving the linear regression problem:
This pre-training strategy effectively initializes the network weights, thus increasing the chances of achieving global convergence.
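A minimal sketch of this initialization, assuming the activation matrix and the stacked increment vector have already been assembled as described (the small ridge term is added here only for conditioning):

```python
import numpy as np

def pretrain_weights(Phi, delta_e, ridge=1e-6):
    """Solve the linear regression Phi @ w ≈ delta_e for the initial CMAC weights.

    Phi:      (n_samples, n_weights) activation matrix from the pre-training data
    delta_e:  (n_samples,) incremental task errors for one error component
    The ridge term keeps the normal equations well-conditioned when some
    receptive fields are rarely activated.
    """
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ delta_e)
```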
5.2. Simulations
5.2.1. Performance Analysis
To validate our algorithm, we conducted simulations using a 6-DOF Universal Robot with a camera mounted on its end effector. Recognizing that unsuitable initial weights in the neural network could cause the control error to settle on a local minimum, we implemented pre-training on the neural network.
The initial and final camera positions, as well as the camera's target position, are shown in Figure 7a. It can be observed that the final camera position coincides with the reference position. Figure 7b displays the motion trajectories of 25 feature points, with "*" denoting their initial positions and hollow circles "O" indicating their target positions. Solid dots "." represent the trajectories of the feature points, illustrating their gradual convergence towards the reference positions. Furthermore, in Figure 8a, the defined error vector e, as described in Equation (16), also converges to zero. This convergence indicates that as e converges, the image feature error also converges, and the camera's pose aligns with the reference pose (R = I, t = 0).
The constraint-weight parameter controls the strength of the geometric constraints in our model. As shown in Figure 8a,c, selecting an appropriate value significantly influences both the convergence rate and the stability of the neural network training process. Figure 8b illustrates the trajectory of the feature point centers in image space, while Figure 8c shows the evolution of the feature point position errors over time. When the constraint weight is zero, the network exhibits oscillatory behavior with slower convergence, as evidenced by the fluctuations in the root mean square error (RMSE) shown in Figure 8c. Introducing moderate constraint penalties effectively dampens these oscillations and accelerates convergence, leading to smoother trajectories in image space, as depicted in Figure 8b. However, when the constraint weight becomes too large, the system exhibits different limitations: while the initial error reduction is rapid, the convergence rate decreases significantly once the feature point errors become small, and the system struggles to minimize the residual errors, as shown in the latter part of Figure 8c. This trade-off suggests that intermediate values provide the best balance between fast convergence, stability, and final accuracy.
5.2.2. Comparisons with Other Studies
Currently, model-free visual servoing systems can be categorized into two approaches. The first category encompasses neural network-based methods [17,18,19,21], which typically use feature point positions as inputs/outputs, resulting in high-dimensional networks when processing numerous feature points. The second category consists of numerical iteration-based methods, such as Broyden's method [15,16,20], which avoid explicit modeling by iteratively updating the Jacobian matrix. To validate the effectiveness of our proposed method, we conducted a series of simulations in MATLAB R2024a (MathWorks, Natick, MA, USA), comparing our geometry-constrained learning-based visual servoing framework with two representative model-free approaches: Broyden's update method [20] and a neural network-based visual servoing method [21]. To ensure a fair comparison, all methods were independently tuned to achieve their best performance under the same experimental conditions. In addition, for the neural network-based method [21], we implemented a structure similar to ours, using a CMAC network with four overlapping associative memory layers. Each input dimension was discretized into nine regions, with every layer internally organizing the inputs into three blocks per dimension.
In the simulation environment, we constructed scenarios with 100 feature points and evaluated performance under both noise-free and noisy image conditions. The evaluation metrics included the RMSE of feature positions and the convergence trajectory of the object center in image space.
As shown in Figure 9b, under noise-free conditions, all three methods eventually achieved convergence, but with significant differences in their convergence characteristics. Both our proposed method and Broyden's method [20] demonstrated superior dynamic performance, rapidly reducing the RMSE to near zero within approximately 6 s; among these, our method achieved the fastest convergence rate. In contrast, the neural network-based method [21], while ultimately converging, exhibited notable fluctuations and required approximately 35 s to fully stabilize.
When image noise was introduced, the differences in robustness among the methods became more distinct. In this test, we applied random noise with a maximum amplitude of 20 pixels to 10 feature points to simulate measurement uncertainties in real-world environments. Under these conditions, our method maintained its rapid convergence and minimal steady-state error, while the comparative methods showed significant performance degradation, as shown in Figure 9d. Specifically, the neural network-based method [21] eventually converged but required a longer convergence time, and Broyden's method [20] demonstrated noticeable performance degradation, with its RMSE stabilizing at approximately 100 pixels, indicating high sensitivity to noise. Figure 9a,c further visualize the object center trajectories in image space under noisy and noise-free conditions, respectively. The results clearly indicate that, in both scenarios, our method produces smoother trajectories and successfully converges to the target position even in the presence of noise.
Through these two simulation comparisons, we can conclude that our proposed algorithm outperforms existing methods in terms of convergence speed, stability, and noise robustness. This performance improvement can be attributed to the following key innovations:
- (1)
Using a fixed-dimension task space error function as the neural network input effectively reduces network complexity. This smaller input dimension results in fewer learnable parameters, making the learning process more efficient. In contrast, other neural networks use feature point positions as inputs, where the input dimension increases proportionally with the number of feature points. As a result, when the number of feature points is large, these methods require learning an excessive number of parameters, significantly increasing computational complexity.
- (2)
Incorporating geometric constraints of feature points to assist network learning, which not only accelerates the learning process but also ensures the physical feasibility of control outputs. In comparison, although Broyden's method [20] avoids the problem of high-dimensional neural networks, its stability and convergence performance when processing image noise remain significantly inadequate, limiting its effectiveness in practical applications.
- (3)
Leveraging the Projective Homography matrix to enhance noise robustness. Unlike methods that rely on individual feature points, homography-based estimation effectively filters out the influence of measurement noise and ensures a more stable and reliable Jacobian estimation.
5.2.3. Computational Complexity Analysis
To evaluate the computational efficiency of the proposed method, we compare it with two representative data-driven image-based visual servoing (IBVS) approaches: Broyden's update method [20] and a fuzzy CMAC-based neural network method [21].
The method in [21] builds a fuzzy CMAC controller for each joint and uses the image feature error vector, whose dimension grows with the number of feature points, as input. As a result, the complexity of both inference and training grows linearly with the number of joints and the number of feature points P, leading to a per-step complexity that scales with their product.
The method in [20] updates the Jacobian matrix and a dynamic projection matrix during each iteration. Although it avoids direct matrix inversion, the update process still involves matrix multiplications and control computation, resulting in a per-step complexity that increases significantly when P becomes large.
In contrast, the proposed method uses a compact input vector composed of the homography-based task error and joint states, with a fixed dimension that does not depend on the number of feature points. The inference complexity per step is therefore constant with respect to the number of feature points. This design reduces computational cost and improves real-time performance in visual tasks with many image features.
5.3. Experiments
As shown in Figure 10, the experimental setup consists of a 6-DOF UR5 collaborative robot with an Intel RealSense D435i RGB-D camera (30 FPS) rigidly mounted on its end-effector. The visual servoing system was implemented using the Robot Operating System (ROS). A visual processing node (20 Hz) performed feature extraction, matching, and control computation. A control interface node (125 Hz) converted the control outputs into joint velocity commands. Communication between nodes was handled via ROS topics over TCP/IP. Computation was performed on a workstation with an Intel i7-9700K CPU, 32 GB RAM, and an NVIDIA RTX 4080Ti GPU.
To validate the effectiveness of our algorithm, we conducted two comparative experiments using different geometric constraint parameter settings. Both experiments tested robustness by introducing occlusions of partial feature points during robot motion.
In both experiments, the robot was initially programmed to follow a random trajectory near its starting position to collect pre-training data. Feature matching and homography estimation were performed using the LightGlue framework combined with RANSAC, ensuring robust performance under varying lighting conditions and viewpoint changes while effectively filtering out outliers.
In the first experiment (without the geometric constraint), Figure 11a shows the progression of the images captured during online learning compared to the reference images. The sequence displays images at the initial moment and at 1.18 s, 7.08 s, and 28.05 s, respectively. Figure 11c tracks the evolution of the task error e, revealing an initial error increase during the first second due to insufficient neural network learning, followed by eventual convergence as learning progressed.
In the second experiment (with the geometric constraint enabled), the system exhibited enhanced performance, as illustrated in Figure 11b,d. These figures demonstrate accelerated convergence toward the reference image compared to the unconstrained approach.
Figure 12 presents the comparison of task error RMSE between both experimental conditions, revealing the advantages of incorporating geometric constraints. The constrained implementation (red line) achieves error reduction at a substantially higher rate.
Throughout both experiments, we deliberately occluded certain feature points at various times, yet the error consistently converged. This demonstrates our algorithm’s robustness in maintaining accuracy even under challenging conditions with varying numbers of visible feature points.
The experimental results confirm that incorporating geometric constraints through the proposed constraint-weight parameter significantly accelerates learning.
6. Conclusions
In this paper, we introduce a geometry-constrained learning-based controller for visual servoing systems based on projective homography. Our approach utilizes a neural network to learn the system's Jacobian matrix, thereby eliminating the need for precise camera calibration parameters and detailed robot kinematic models. Unlike other neural network approaches, our method utilizes a newly defined error vector related to projective homography as input, ensuring a constant input size irrespective of the number of image feature points. We also demonstrated in Appendix A that the defined error vector e = 0 is a necessary and sufficient condition for R = I and t = 0, which signifies the alignment of the camera with the target. Furthermore, we incorporated geometric constraints between feature points in the network update process. By ensuring that model predictions conform to the fundamental principles of projective geometry, we significantly improved learning efficiency. Through simulations and experiments, we validate that our approach achieves superior performance compared to other model-free visual servoing methods, exhibiting faster convergence rates and enhanced robustness to image noise and partial occlusions.
Looking forward, while the proposed framework is robust and calibration-free, it still assumes sufficient exploration and network convergence during deployment. In practice, especially in unfamiliar or dynamic environments, learning may be incomplete, potentially leading to unsafe control actions. As a future direction, we plan to embed safety constraints into the control law, such as limits on joint velocities, workspace boundaries, and proximity to humans or delicate objects. These constraint-aware mechanisms will help ensure safe robot behavior even under imperfect Jacobian estimation, improving system reliability and enabling deployment in safety-critical applications.