*3.2. Gradient Fusion Algorithm*

Ma et al. proposed a Kalman-filter-based fusion method for updating network parameters [6]. Building on this theory, we made several improvements when using the MSGD algorithm to calculate the update gradients. Once the sensor has been selected and the measurement position determined, noise in the acquired signal cannot be avoided, so we further analyze the gradients during the gradient update process. Signal noise is assumed to be hidden in the gradient information of each sample; the gradient information is therefore fused inside the minibatch using a Kalman filter.

When calculating the gradient inside a batch, we apply the Kalman filter to each of the per-sample gradients in turn, fusing their information.

Taking moments *k* and *k* − 1 as an example, the Kalman-filter-based gradient fusion process is described as follows.

$$w(k) \, = \, Fw(k-1) + \delta(k). \tag{7}$$

$$z(k) = H(k)w(k) + \gamma(k). \tag{8}$$

*w*(*k*) is the network parameter at moment *k*, *F* is the state transition matrix, *δ*(*k*) is the state error, and *γ*(*k*) is the measurement error; both *δ*(*k*) and *γ*(*k*) follow Gaussian distributions. Then, according to the Kalman filter formulas [34], the iterative process proceeds as follows.

First, an a priori estimate of the gradient is calculated.

$$
\hat{w}^-(k) = F\hat{w}(k-1). \tag{9}
$$

$\hat{w}^-(k)$ is the a priori estimate at moment *k*, and $\hat{w}(k-1)$ is the optimal estimate at moment *k* − 1. Next, we update the a priori estimated covariance.

$$P^{-}(k) \ = \ FP(k-1)F^{T} + Q(k). \tag{10}$$

$P^-(k)$ is the a priori estimated covariance, which will be used when calculating the Kalman gain. *P*(*k* − 1) is the posterior estimated covariance at moment *k* − 1, and *Q*(*k*) = *δ*(*k*)*δ<sup>T</sup>*(*k*) is the covariance of the state error. The measurement at moment *k* is obtained from Equation (8), after which the Kalman gain is updated [6].

$$K(k) = P^{-}(k)H(k)^{T}\left[H(k)P^{-}(k)H(k)^{T}+R(k)\right]^{-1}.\tag{11}$$

*R*(*k*) = *γ*(*k*)*γ<sup>T</sup>*(*k*) is the covariance of the measurement error. Then, the optimal estimate of the gradient at moment *k* can be found.

$$\begin{split} \hat{w}(k) &= \hat{w}^-(k) + K(k)[z(k) - H(k)\hat{w}^-(k)] \\ &= [I - K(k)H(k)]\hat{w}^-(k) + K(k)z(k). \end{split} \tag{12}$$

[*I* − *K*(*k*)*H*(*k*)] is the confidence level of the estimate, and *K*(*k*) is the confidence level of the measured value.

Finally, the posterior estimated covariance is updated.

$$P(k) \, = \, \left[ I - K(k)H(k) \right] P^{-}(k) . \tag{13}$$
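The full update cycle of Equations (9)–(13) can be sketched as a single function. This is a minimal illustration, not the paper's implementation: the name `kalman_step` and the explicit matrix arguments are our own, and the prediction step uses only the deterministic part of the model, since the noise realization *δ*(*k*) is unknown at prediction time.

```python
import numpy as np

def kalman_step(w_prev, P_prev, z, F, H, Q, R):
    """One Kalman iteration following Equations (9)-(13).

    w_prev, P_prev : previous optimal estimate and its covariance
    z              : measurement at step k (here, a per-sample gradient)
    F, H           : state-transition and measurement matrices
    Q, R           : state- and measurement-error covariances
    """
    # Eq. (9): a priori estimate of the state (deterministic part only)
    w_prior = F @ w_prev
    # Eq. (10): a priori estimated covariance
    P_prior = F @ P_prev @ F.T + Q
    # Eq. (11): Kalman gain
    K = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + R)
    # Eq. (12): optimal (a posteriori) estimate
    w_post = w_prior + K @ (z - H @ w_prior)
    # Eq. (13): a posteriori estimated covariance
    P_post = (np.eye(P_prior.shape[0]) - K @ H) @ P_prior
    return w_post, P_post
```

Each call consumes one measurement and returns the refined estimate together with its covariance, ready to be fed into the next iteration.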

In specific applications, let the batch size be *N<sub>B</sub>*, so *k* ∈ {1, 2, . . . , *N<sub>B</sub>*}.

*w*(*k*) is the updated gradient value corresponding to the *N<sub>B</sub>* samples. *F* is the state transition matrix; it is set to *I* in this paper.

*H*(*k*) is the measurement matrix, which maps the state values to the measurement values; it is also set to *I*. We then update the Kalman gain as in Equation (14).

$$\begin{split} K(k) &= P^-(k)[P^-(k) + R(k)]^{-1} \\ &= [P(k-1) + Q(k)][P(k-1) + Q(k) + R(k)]^{-1} .\end{split} \tag{14}$$
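A quick numerical check (with diagonal covariance values of our own choosing, not from the paper) confirms that the simplified gain of Equation (14) coincides with the general gain of Equation (11) once *F* = *H* = *I*:

```python
import numpy as np

# Illustrative diagonal covariances (made-up values for the check).
P_prev = np.diag([0.2, 0.5])
Q = np.diag([0.1, 0.1])
R = np.diag([0.3, 0.3])

P_prior = P_prev + Q                                          # Eq. (10) with F = I
K_general = P_prior @ np.linalg.inv(P_prior + R)              # Eq. (11) with H = I
K_simplified = (P_prev + Q) @ np.linalg.inv(P_prev + Q + R)   # Eq. (14)
assert np.allclose(K_general, K_simplified)
```

With diagonal covariances the gain is also diagonal, so each gradient component is effectively filtered independently.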

In practical terms, *R*(*k*) can be set to a small value when the measurement error of the sensor is small, which means the gradient of each sample is trusted as being close to the true value; conversely, *R*(*k*) can be increased appropriately. The value of *Q*(*k*) is adjusted according to the range of the gradient values during actual optimization.
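Putting the pieces together, the in-batch fusion can be sketched as below. This is our own simplified rendering under the paper's *F* = *H* = *I* setting, with scalar *Q* and *R* applied elementwise; the function name, initialization choices, and default values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def fuse_minibatch_gradients(grads, Q=1e-4, R=1e-2):
    """Fuse the N_B per-sample gradients of one minibatch with the
    simplified Kalman filter of Eq. (14) (F = H = I).

    grads : array of shape (N_B, n_params); each row z(k) is one
            sample's gradient, treated as a noisy measurement.
    Q, R  : scalar state/measurement error variances, applied
            elementwise (illustrative guesses, not tuned values).
    """
    w_hat = grads[0].copy()                  # initialize with the first sample
    P = np.full_like(w_hat, R)               # initial covariance (our choice)
    for z in grads[1:]:                      # k = 2, ..., N_B
        P_prior = P + Q                      # Eq. (10) with F = I
        K = P_prior / (P_prior + R)          # Eq. (14), elementwise gain
        w_hat = (1.0 - K) * w_hat + K * z    # Eq. (12) with H = I
        P = (1.0 - K) * P_prior              # Eq. (13)
    return w_hat                             # fused gradient for the update step
```

The fused gradient would then stand in for the plain minibatch gradient in the parameter update. The tuning guidance above shows up directly in the gain: a small *R* relative to *Q* drives *K*(*k*) toward 1, so the filter tracks individual sample gradients closely, while a large *R* pulls the estimate toward a smoothed running average.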
