*2.2. Keyframe-Based SLAM Algorithm Fusing GNSS Positioning Information and Co-Location Information*

We deployed a SLAM framework modified and based on ORB-SLAM3 on each agent to run independently, to ensure each agent can complete independent navigation tasks in the situation where communication links are lost. The algorithm's flow is shown in Figure 3. ORB-SLAM3 was modified as follows:

**Figure 3.** The tracking thread processes sensor information in real time and initially estimates the pose of the current frame. In the process of tracking the local map, the track thread determines whether the current frame is used as a keyframe. Different from ORBSLAM3 s keyframe addition strategy, when valid GNSS positioning information completes timestamp alignment, frames aligned with GNSS information are also inserted as keyframes.


Each node represents a keyframe position and pose in the world frame. In Figure 4, the line between two nodes is called an "edge", which represents the constraint of the amount to be optimized in the optimization of the factor graph. The edge between two consecutive nodes is a local constraint, which comes from ORB-SLAM3 s pose estimation. The other edge is the constraint from the co-location results of the server and GNSS satellite positioning information. We used the VIO factor as the local constraints, and GNSS location information and co-location information as global constraints. An illustration of the secondstage global pose graph structure is shown in Figure 4.

**Figure 4.** Schematic diagram of the pose graph structure to be optimized in the second stage. The circle is the state quantity, and the yellow square is the constraint of local observation, that is, the relative pose transformation of VIO from ORBSLAM3. The other colored squares are constrained by global observations.

Construction of residuals with local constraints refer to ORBSLAM3 [1], supposing that there are two continuous keyframes *KFti* and *KFtj* , we define *q<sup>v</sup> <sup>t</sup>* as the attitude quaternion *KFt* under the VIO coordinate system, and *p<sup>v</sup> <sup>t</sup>* as the three-dimensional coordinate under the VIO coordinate system of *KFt*. Similarly, *q<sup>l</sup> <sup>t</sup>* and *p<sup>l</sup> <sup>t</sup>* are the quaternion and coordinates under the global coordinate system. The VIO local factor is derived as:

$$\mathbf{z}^{L} - \mathbf{s}^{L}(\boldsymbol{\chi}) = \mathbf{z}^{L} - \mathbf{s}^{L}(\mathbf{x}\_{t\_{j}}, \mathbf{x}\_{t\_{i}}) = \begin{bmatrix} q\_{t\_{j}}^{\upsilon} - \mathbf{1} \left( \mathbf{P}\_{t\_{i}}^{\upsilon} - \mathbf{P}\_{t\_{j}}^{\upsilon} \right) \\ q\_{t\_{j}}^{\upsilon} - \mathbf{1} q\_{t\_{i}}^{\upsilon} \end{bmatrix} \odot \begin{bmatrix} q\_{t\_{j}}^{l} - \mathbf{1} \left( \mathbf{P}\_{t\_{i}}^{l} - \mathbf{P}\_{t\_{j}}^{l} \right) \\ q\_{t\_{j}}^{l} - \mathbf{1} q\_{t\_{i}}^{l} \end{bmatrix} \tag{1}$$

where z is the measured value, provided by the results of the one-stage position estimation; *χ* is the state prediction; and *s* is the observation equation. *χ* is calculated from the pose transformation between the two moments. The specific method used in this paper uses the relative pose of the current moment and the previous moment obtained from the position estimation in the first stage to add to the position coordinate *xtj* at the previous moment to obtain the position coordinate *xti* at the current moment. is the minus operation on the error state of quaternion. We take the pose covariance matrix generated during SLAM as the covariance of local measurements. The essence of the local factor is the relative change in the pose in two keyframes.

Consider the following two situations where GNSS does not work: the GNSS signal is poor or has interference, resulting in a large error in the GNSS positioning information, resulting in a large drift in an agent's navigation trajectory; the GNSS signal is completely disabled, and there is no GNSS positioning information. Upon receipt of a valid GNSS location, the longitudinal dimension heights of the original measurements will be transformed into x, y, z coordinates in the local Cartesian coordinates system (ENU) during GNSS data preprocessing. *P<sup>G</sup> <sup>t</sup>* = [*x<sup>w</sup> <sup>t</sup>* , *y<sup>w</sup> <sup>t</sup>* , *z<sup>w</sup> t* ] *<sup>T</sup>* is the coordinate of the GNSS information in the transformed ENU coordinate system. *P<sup>G</sup> <sup>t</sup>* is the measured value of GNSS at that moment, *t*. The uncertainty of measurement is assumed to be a Gaussian distribution with mean and covariance. *Pt <sup>w</sup>* is the assumed GNSS estimate. The GNSS factor is derived as:

$$\mathbf{z}\_t^G - \mathbf{s}\_t^G(\chi) = \mathbf{z}\_t^G - \mathbf{s}\_t^G(\mathbf{x}\_t) = P\_t^G - P\_t^w \tag{2}$$

When each agent receives a new pose about a keyframe's collaborative positioning from the server, we introduce the result of collaborative positioning as a new measured value into the process of secondary optimization. We assume that the result obtained by *KFt* in the first-stage pose estimation at that moment in the VIO coordinate system is {*qv*, *pv*}. The pose obtained by *KFt* after the co-location is transformed into {*qc*, *pc*} on the local framework of the agent. The co-location factor is derived as:

$$\mathbf{z}\_t^{\mathbb{C}} - \mathbf{s}\_t^{\mathbb{C}}(\boldsymbol{\chi}) = \mathbf{z}\_t^{\mathbb{C}} - \mathbf{s}\_t^{\mathbb{C}}(\mathbf{x}\_t) = \begin{bmatrix} q\_\upsilon^{l-1} \left( P\_\varepsilon^l - P\_\upsilon^l \right) \\ q\_\upsilon^{l-1} q\_\varepsilon^l \end{bmatrix} \tiny{(\circ)}\begin{bmatrix} q\_\upsilon^{w-1} \left( P\_\varepsilon^w - P\_\upsilon^w \right) \\ q\_\upsilon^{w-1} q\_\varepsilon^w \end{bmatrix} \tag{3}$$

Lastly, in order to prevent VIO from drifting too much, the transformation of the VIO coordinate system to the global coordinate system should be updated after each optimization. The operation must make the position after fusion overlap with the results of GNSS positioning and cooperative positioning as much as possible, and the difference between the two frames should be as equal as possible with VIO. The data from VIO does not impose constraints on the absolute position after fusion, and only requires that the incremental error of the position after fusion and the incremental error of VIO should be as small as possible.

#### *2.3. Communication*

The communication modules on the agents and server was based on The ROS communication infrastructure [21]. It was used for message passing over a wireless network. The communication interface supports two-way communication between the agents and the server, we applied it to update the keyframe pose at a given moment for single-platform SLAM. The server uses it to receive map information from each agent. With this module, the objects waiting to be sent were first serialized/deserialized. Since ROS is not a real-time communication system, time stamps needed to be attached to packets.
