*Article* **Optimization of the Compressive Measurement Matrix in a Massive MIMO System Exploiting LSTM Networks**

**Saidur R. Pavel and Yimin D. Zhang \***

Department of Electrical and Computer Engineering, Temple University, Philadelphia, PA 19122, USA **\*** Correspondence: ydzhang@temple.edu

**Abstract:** Massive multiple-input multiple-output (MIMO) technology, which is characterized by the use of a large number of antennas, is a key enabler for the next-generation wireless communication and beyond. Despite its potential for high performance, implementing a massive MIMO system presents numerous technical challenges, including the high hardware complexity, cost, and power consumption that result from the large number of antennas and the associated front-end circuits. One solution to these challenges is the use of hybrid beamforming, which divides the transceiving process into both analog and digital domains. To perform hybrid beamforming efficiently, it is necessary to optimize the analog beamformer, referred to as the compressive measurement matrix (CMM) here, that allows the projection of high-dimensional signals into a low-dimensional manifold. Classical approaches to optimizing the CMM, however, are computationally intensive and time consuming, limiting their usefulness for real-time processing. In this paper, we propose a deep learning based approach to optimizing the CMM using long short-term memory (LSTM) networks. This approach offers high accuracy with low complexity, making it a promising solution for the real-time implementation of massive MIMO systems.

**Keywords:** massive MIMO; hybrid beamforming; compressive measurement matrix; long short-term memory network

#### **1. Introduction**

In recent years, massive multiple-input multiple-output (MIMO) technology has emerged as a highly promising solution for modern wireless communication. With the growing demand for high-speed data transfer and low latency, the implementation of massive MIMO has become increasingly important, especially in millimeter wave (mmWave) communication, which is crucial for the future of 5G networks. The central idea behind massive MIMO is to equip base stations with a large number of antennas, which allows multiple users to be served at the same time in the same frequency band. This results in a significant increase in both capacity and spectral efficiency compared to traditional MIMO systems. The high number of antennas in a massive MIMO system enables it to provide much higher data rates than traditional MIMO systems [1–6]. As a result, the system is able to better utilize the available bandwidth and effectively mitigate the effects of fading and interference. In mmWave communication, massive MIMO systems address the problem of severe propagation attenuation and make efficient use of the signal bandwidth [7–9]. Additionally, massive MIMO is becoming increasingly popular in radar sensing due to its ability to enhance target detection and tracking accuracy, reduce false alarms, increase capacity, and improve coverage [10,11].

Despite the numerous benefits offered by massive MIMO systems, their practical implementation is a challenging endeavor. In a typical MIMO system, each antenna is equipped with its own radio frequency (RF) chain, composed of components, such as a band-pass filter, a low-noise amplifier, a mixer, a low-pass filter, and a high-resolution analog-to-digital converter (ADC). With the implementation of massive MIMO systems, the

**Citation:** Pavel, S.R.; Zhang, Y.D. Optimization of the Compressive Measurement Matrix in a Massive MIMO System Exploiting LSTM Networks. *Algorithms* **2023**, *16*, 261. https://doi.org/10.3390/a16060261

Academic Editors: Xiang Zhang and Xiaoxiao Li

Received: 31 March 2023 Revised: 15 May 2023 Accepted: 18 May 2023 Published: 23 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

number of antennas and RF chains required at each base station is significantly increased, thereby leading to an increase in cost, complexity, and power consumption. To make the implementation of a massive MIMO system practical, one approach is to adopt the hybrid beamforming technique. Hybrid beamforming addresses this limitation by reducing the number of RF chains required in a massive MIMO system. It accomplishes this by splitting the beamforming process into two parts: a digital part and an analog part. In the analog part, the signals from multiple antennas are combined before they are passed through a reduced number of RF chains. To achieve this effectively, a compressive measurement matrix (CMM) is employed, which projects the high-dimensional array received signal onto a low-dimensional signal by exploiting the sparse nature of the signals.

Numerous studies have investigated the design of beamformer and precoder matrices for MIMO systems [12–14]. The approach in [12] involves alternating optimization to optimize the transmit and receive beamformers using a minimum mean square error (MMSE) criterion between the received signal and the transmitted symbol vectors. Reference [13] considers the optimization of the precoder matrix based on the singular vectors of the channel matrix. In [14], a beamformer is optimized for MIMO-integrated sensing and communication (ISAC) scenarios, where the beamforming matrix is designed to achieve the desired radar beampattern, while maintaining a signal-to-interference-plus-noise ratio constraint for communication users. However, the aforementioned method requires knowledge of the signal directions of arrival (DOAs), which may not be available in many scenarios and is a parameter that needs to be estimated in our problem. Several papers have also explored compressive sampling-based DOA estimation techniques, such as [15,16]. In [15], a sparse localization framework for the MIMO radar is proposed by randomly placing transmitting and receiving antennas, and a random measurement matrix is used for target localization. Similarly, Ref. [16] develops a compressive sampling framework for 2D DOA and polarization estimation in mmWave polarized massive MIMO systems using a Gaussian measurement matrix. However, this type of random selection can lead to information loss and performance degradation as demonstrated in [17,18].

Information theory is another widely used framework for optimizing the CMM in massive MIMO systems. The principles of information theory provide a mathematical framework for quantifying the amount of information that can be transmitted over a communication channel. In [18,19], the CMM is optimized by maximizing the mutual information between the compressed measurement and the signal DOAs. This approach is based on considering the availability of a coarse prior distribution of the DOAs. By reducing the dimension of the received signal, the required number of front-end circuits is effectively reduced with minimal performance loss. Reference [20] extends this idea by developing a general compressive measurement scheme that combines the CMM and the sparse array. The framework can consider any arbitrary sparse array as the receive antennas and use the CMM to compress the dimension. As a result, it can effectively reduce both the number of physical antennas and the front-end circuits. They also optimize the CMM by maximizing the mutual information between the compressed measurements and the DOA distribution, while considering the availability of the prior distribution of DOAs. In many practical cases, however, the required a priori distribution may not be available. To address this issue, an iterative optimization approach is developed in [21]. Starting with no prior information on the DOA distribution, the CMM is optimized based on mutual information maximization and then used to estimate the DOA spectrum. The estimated normalized DOA spectrum is subsequently used as the prior information for the next iteration, thus iteratively improving the accuracy of the estimated DOA spectrum.

Optimizing the CMM in a sequential adaptive manner may lead to better performance compared to non-adaptive schemes [22,23]. However, using optimization techniques, such as projected gradient descent or simplified versions of projected coordinate descent, to obtain the desired CMM can be computationally expensive [24]. On the other hand, codebook-based methods, such as the hierarchical codebook developed in [23] and the hierarchical posterior matching (hiePM) strategy developed in [5], can reduce the computational burden. Nonetheless, the performance of codebook-based methods relies heavily on the quality of the codebooks and may be inferior to codebook-free approaches.

Recently, deep learning methods have emerged as a popular approach for effectively addressing complex optimization problems in various wireless communication and signal processing applications, including massive MIMO beamforming [25,26], intelligent reflecting surface [27,28], DOA estimation [29,30], and wireless channel estimation [31–33]. In a prior study [34], we developed a deep learning method for sequentially updating the CMM. Specifically, we trained a neural network without any prior information to obtain the optimized CMM, which was then used to update the posterior distribution of signal DOAs by leveraging the subsequent measurement. However, this approach faces two challenging issues. First, for each snapshot of the impinging signal, the CMM must be updated, the compressed measurement computed, and the posterior distribution updated. As such, it incurs high-computational costs, especially for updating the posterior distribution at each snapshot. Second, the posterior update relies on the accuracy of the estimated spatial spectrum, and any inaccuracies in this estimation can lead to performance degradation and slow convergence. Conversely, any inaccuracy or change in the posterior estimation will affect the spectrum estimation performance. In [35], LSTM neural networks are used in various communication system problems, including adaptive beamformer design for mmWave initial beam alignment applications. However, this study was limited to single-channel and single-user scenarios.

In this paper, we propose to exploit an LSTM network for sequentially designing the CMM. LSTMs are a class of recurrent neural networks (RNNs) that are well suited for handling time-series and other sequential data due to their inherent architecture [36–42]. Our previous work [34] used a fully connected deep neural network (FCDNN), where the received signal in each time snapshot was treated independently. However, in real-world scenarios, adjacent time samples of the signal are strongly correlated. Therefore, we use an LSTM network to sequentially process data by retaining temporal dependencies between the input data points. Preserving time-dependent information enables more effective optimization of the CMM in each time snapshot, leading to faster convergence and better DOA estimation performance.

Notations: We use bold lower-case and upper-case letters to represent vectors and matrices, respectively. In particular, we use $\mathbf{I}\_N$ to denote the $N \times N$ identity matrix. $(\cdot)^{\mathrm{T}}$ and $(\cdot)^{\mathrm{H}}$ respectively represent the transpose and Hermitian operations of a matrix or vector. The notations $\div$ and $(\cdot)^2$ are used to represent element-wise division and squaring, respectively. Additionally, $|\cdot|$ denotes the determinant operator. The operator $\mathbb{E}[\cdot]$ represents statistical expectation, whereas $\mathcal{R}(\cdot)$ and $\mathcal{I}(\cdot)$ respectively extract the real and imaginary parts of a complex entry. $\mathbb{C}^{M \times N}$ denotes the $M \times N$ complex space.

#### **2. Signal Model**

#### *2.1. Array Signal Model*

Consider $D$ uncorrelated signals that impinge on a massive MIMO system equipped with $N$ receive antennas from directions $\boldsymbol{\theta} = [\theta\_1, \theta\_2, \cdots, \theta\_D]^{\mathrm{T}}$. The analog RF array received signal at time $t$ is modeled as

$$\begin{split} \mathbf{x}^{\text{RF}}(t) &= \sum\_{d=1}^{D} \boldsymbol{a}(\theta\_{d}) s\_{d}(t) e^{j\omega\_{\text{c}}t} + \mathbf{n}(t) \\ &= \mathbf{A}(\boldsymbol{\theta}) \mathbf{s}(t) e^{j\omega\_{\text{c}}t} + \mathbf{n}(t), \end{split} \tag{1}$$

where $\mathbf{A}(\boldsymbol{\theta}) = [\boldsymbol{a}(\theta\_1), \boldsymbol{a}(\theta\_2), \cdots, \boldsymbol{a}(\theta\_D)] \in \mathbb{C}^{N \times D}$ denotes the array manifold matrix whose $d$th column $\boldsymbol{a}(\theta\_d) \in \mathbb{C}^{N}$ represents the steering vector of the $d$th user with DOA $\theta\_d$, $\mathbf{s}(t) = [s\_1(t), s\_2(t), \cdots, s\_D(t)]^{\mathrm{T}} \in \mathbb{C}^{D}$ represents the signal waveform vector, $\omega\_{\text{c}}$ denotes the angular frequency of the carrier, and $\mathbf{n}(t) \sim \mathcal{CN}(\mathbf{0}, \sigma\_{\mathrm{n}}^2 \mathbf{I}\_N)$ represents the zero-mean additive white Gaussian noise (AWGN) vector.
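As a concrete illustration, the baseband version of this model can be simulated as follows. This is a minimal sketch: the half-wavelength uniform linear array geometry, the example DOAs, and the unit-power circular Gaussian waveforms are assumptions made for the example, not specified by the model above.

```python
import numpy as np

def steering_vector(theta_deg, N):
    """Steering vector a(theta) of an N-element half-wavelength ULA."""
    n = np.arange(N)
    return np.exp(1j * np.pi * n * np.sin(np.deg2rad(theta_deg)))

def received_signal(thetas_deg, N, T, snr_db, seed=0):
    """Baseband snapshots x(t) = A(theta) s(t) + n(t) for t = 1, ..., T."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([steering_vector(th, N) for th in thetas_deg])  # N x D
    D = len(thetas_deg)
    # Unit-power uncorrelated source waveforms and circular AWGN.
    s = (rng.standard_normal((D, T)) + 1j * rng.standard_normal((D, T))) / np.sqrt(2)
    sigma_n = 10.0 ** (-snr_db / 20.0)
    n = sigma_n * (rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))) / np.sqrt(2)
    return A @ s + n  # N x T array received signal

X = received_signal([-20.0, 8.0, 31.0], N=50, T=100, snr_db=10.0)
```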

Figure 1 depicts the block diagram of the receiver antenna array of a massive MIMO system without performing compressed measurement. In this receiver array, each antenna is assigned a dedicated front-end circuit, which converts the received analog RF signal to the digital baseband by performing down conversion and analog-to-digital conversion. However, dedicating a separate front-end circuit to each antenna in a massive MIMO system may be impractical, considering the hardware cost, power consumption, and computational complexity.

**Figure 1.** Block diagram of an antenna array without performing compression [18].

#### *2.2. Compressive Array Signal Model*

The number of antennas in a massive MIMO system is typically much higher than the number of users or targets. Consequently, the impinging signals can be considered sparse in the spatial (angular) domain. This sparsity allows us to design an optimal CMM that projects the array receive signal to a lower-dimensional manifold with no or negligible information loss. In this manner, the array receive signal can be compressed significantly in the analog domain, as shown in Figure 2. As a result, the number of front-end circuits in the analog domain and the computation burden in the digital domain can be significantly reduced.

In this compressive sampling scheme, $M \ll N$ linear projections of the RF received signal $\mathbf{x}^{\text{RF}}(t)$ are taken along the measurement kernels represented as row vectors $\boldsymbol{\phi}\_m = [\phi\_{m,1}, \phi\_{m,2}, \cdots, \phi\_{m,N}] \in \mathbb{C}^{1 \times N}$ with $m = 1, \cdots, M$. The $m$th compressed measurement of the RF received signal, $y\_m^{\text{RF}}(t)$, is the linear projection of the RF received signal $\mathbf{x}^{\text{RF}}(t)$ onto the $m$th measurement kernel $\boldsymbol{\phi}\_m$, i.e.,

$$y\_{m}^{\text{RF}}(t) = \left\langle \boldsymbol{\phi}\_{m}, \mathbf{x}^{\text{RF}}(t) \right\rangle = \sum\_{n=1}^{N} \phi\_{m,n}\, x\_{n}^{\text{RF}}(t), \tag{2}$$

where $x\_n^{\text{RF}}(t)$ is the $n$th element of vector $\mathbf{x}^{\text{RF}}(t)$.

Stacking all $M$ measurement kernels forms the CMM $\boldsymbol{\Phi} = [\boldsymbol{\phi}\_1^{\mathrm{T}}, \boldsymbol{\phi}\_2^{\mathrm{T}}, \cdots, \boldsymbol{\phi}\_M^{\mathrm{T}}]^{\mathrm{T}}$. Matrix $\boldsymbol{\Phi}$ is designed to be row orthonormal, i.e., $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{H}} = \mathbf{I}\_M$, to keep the noise power unchanged after applying the compression.
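One simple way to obtain a row-orthonormal $\boldsymbol{\Phi}$ is to orthonormalize a random complex matrix with a QR factorization. This is only an illustrative construction (the paper optimizes $\boldsymbol{\Phi}$ rather than drawing it at random); the sizes $N = 50$ and $M = 10$ match the simulation section.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 10

# Economy QR of a random complex N x M matrix gives orthonormal columns;
# its conjugate transpose is an M x N CMM with orthonormal rows, so
# Phi Phi^H = I_M and the compressed noise power is unchanged.
G = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
Q, _ = np.linalg.qr(G)          # N x M, orthonormal columns
Phi = Q.conj().T                # M x N, orthonormal rows

# Compression: y(t) = Phi x(t) maps C^N to C^M.
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = Phi @ x
```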

Denote $\mathbf{x}(t)$ as the baseband signal corresponding to $\mathbf{x}^{\text{RF}}(t)$. Note that vector $\mathbf{x}(t)$ is not observed in the underlying system and is introduced solely for notational convenience. Then, the $M$ compressed measurements in baseband yield $\boldsymbol{y}(t) = [y\_1(t), y\_2(t), \cdots, y\_M(t)]^{\mathrm{T}} \in \mathbb{C}^{M}$, which is given as

$$\mathbf{y}(t) = \Phi \mathbf{x}(t) = \Phi A(\theta) \mathbf{s}(t) + \Phi \mathbf{n}(t), \tag{3}$$

where $\boldsymbol{\Phi}\mathbf{A}(\boldsymbol{\theta}) \in \mathbb{C}^{M \times D}$ represents the compressed array manifold with significantly reduced dimension compared to $\mathbf{A}(\boldsymbol{\theta})$.

#### *2.3. Probabilistic Signal Model*

Consider signal DOA *θ* as a random variable with a probability density function (PDF) *f*(*θ*). In [18,19], it is assumed that coarse knowledge of *f*(*θ*) is available. In this case, according to the law of the total probability, the PDF of the compressed measurement vector *y* is expressed as

$$f(\boldsymbol{y}) = \mathbb{E}\_{\theta} \{ f(\boldsymbol{y}|\theta) \} = \int\_{\theta \in \Theta} f(\boldsymbol{y}|\theta) f(\theta) d\theta,\tag{4}$$

where $\Theta$ is the angular region of the observations. We discretize the PDF $f(\theta)$ into $K$ angular bins with an equal width of $\Delta\bar{\theta}$ so that the probability of the $k$th angular bin is approximated by the probability mass function $p\_k \approx f(\bar{\theta}\_k)\Delta\bar{\theta}$ with $\sum\_{k \in \mathcal{K}} p\_k = 1$, where $\bar{\theta}\_k$ is the nominal angle of the $k$th angular bin and $\mathcal{K} = \{1, 2, \cdots, K\}$. As a result, the PDF of $\boldsymbol{y}$ can be reformulated as

$$f(\boldsymbol{y}) \approx \sum\_{k \in \mathcal{K}} p\_k f(\boldsymbol{y}|\bar{\theta}\_k) \,. \tag{5}$$

That is, the PDF of $\boldsymbol{y}$ is approximated as a Gaussian mixture distribution consisting of $K$ zero-mean Gaussian distributions $\boldsymbol{y}|\bar{\theta}\_k$. Considering a signal $s(t)$ impinging from the $k$th angular bin with a nominal DOA $\bar{\theta}\_k$, the compressed measurement vector is given as

$$\boldsymbol{y}|\_{\theta=\bar{\theta}\_k}(t) = \boldsymbol{\Phi}\left(\boldsymbol{a}(\bar{\theta}\_k)s(t) + \mathbf{n}(t)\right), \tag{6}$$

and the corresponding conditional PDF is

$$f(\boldsymbol{y}|\bar{\theta}\_k) = \frac{1}{\pi^M |\boldsymbol{C}\_{yy|\bar{\theta}\_k}|} e^{-\boldsymbol{y}^{\mathrm{H}} \boldsymbol{C}\_{yy|\bar{\theta}\_k}^{-1} \boldsymbol{y}}, \tag{7}$$

where

$$\boldsymbol{C}\_{yy|\bar{\theta}\_k} = \boldsymbol{\Phi}\left(\sigma\_s^2\, \boldsymbol{a}(\bar{\theta}\_k) \boldsymbol{a}^{\mathrm{H}}(\bar{\theta}\_k) + \sigma\_{\mathrm{n}}^2 \mathbf{I}\right) \boldsymbol{\Phi}^{\mathrm{H}} \tag{8}$$

is the covariance matrix of the compressed measurement vector $\boldsymbol{y}|\_{\theta=\bar{\theta}\_k}(t)$, and $\sigma\_s^2$ is the signal power. Additionally, define $\boldsymbol{C}\_{yy} = \boldsymbol{\Phi}\mathbf{A}(\boldsymbol{\theta})\mathbf{S}\mathbf{A}^{\mathrm{H}}(\boldsymbol{\theta})\boldsymbol{\Phi}^{\mathrm{H}}$ as the covariance matrix of the compressed measurement, where $\mathbf{S} = \mathrm{diag}([\sigma\_1^2, \cdots, \sigma\_D^2])$ is the source covariance matrix.
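Equations (7) and (8) can be evaluated numerically as sketched below; the mixture PDF of Equation (5) then follows by weighting `cond_pdf` over the $K$ bins with the prior masses $p_k$. The function names and the example values of $\boldsymbol{\Phi}$, $\sigma_s^2$, and $\sigma_{\mathrm{n}}^2$ are placeholder assumptions.

```python
import numpy as np

def cond_cov(Phi, a_k, sigma_s2, sigma_n2):
    """Covariance of y | theta_k, Equation (8)."""
    N = a_k.size
    R_x = sigma_s2 * np.outer(a_k, a_k.conj()) + sigma_n2 * np.eye(N)
    return Phi @ R_x @ Phi.conj().T

def cond_pdf(y, C):
    """Zero-mean circular complex Gaussian density f(y | theta_k), Equation (7)."""
    M = y.size
    quad = np.real(y.conj() @ np.linalg.solve(C, y))  # y^H C^{-1} y
    det = np.real(np.linalg.det(C))                   # C is Hermitian positive definite
    return np.exp(-quad) / (np.pi ** M * det)
```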

**Figure 2.** Block diagram of a compressive sampling antenna array [18].

#### **3. Motivation for Using LSTM Network to Design the CMM**

The objective of this paper is to design the beamforming matrix in a sequential manner. Specifically, we aim to optimize the CMM **Φ** at each time sample *t* = 1, 2, ··· , *T* in an adaptive manner such that the CMM **Φ** at time sample *t* + 1 can be regarded as a function of all prior observations, denoted by *y*(1 : *t*) and **Φ**(1 : *t*), i.e.,

$$\boldsymbol{\Phi}(t+1) = \mathcal{F}(\boldsymbol{y}(1:t), \boldsymbol{\Phi}(1:t)), \tag{9}$$

where $\mathcal{F}$ is a function that maps the past observations and past CMMs to the next CMM.

However, the dimension of the past observations increases as the time index increases, rendering it impractical to optimize the CMM **Φ** using all prior observations. Therefore, a significant challenge of this sequential optimization process is to summarize all of the historical observations.

In [34], instead of using all past observations, the posterior distribution of the signal DOAs at time $t$ is considered a sufficient statistic to design the CMM $\boldsymbol{\Phi}$ at time $t+1$. However, this approach may be prone to robustness issues. For instance, if the posterior $p(\bar{\theta}\_k)$ of an angular bin $\bar{\theta}\_k$ that contains a signal becomes small due to an estimation error in any time iteration, the error will propagate through subsequent iterations, resulting in inaccurate DOA estimation. Furthermore, at each time instant, this approach performs analog beamforming and spectrum estimation, which are computationally expensive, particularly for a large number of iterations. In addition, it uses a fully connected neural network, which does not well exploit the temporal correlation of the received data.

To address these issues, we propose an LSTM framework that provides a tractable solution. The LSTM is a type of recurrent neural network that can retain information over time in a variable known as the cell state. Moreover, to keep the summary of prior observations at a fixed size, the LSTM incorporates a gate mechanism that controls which information is discarded and which is incorporated into the cell state, retaining only the information from historical observations that is relevant to the given task.

Figure 3 illustrates a unit of the proposed LSTM framework at time $t$. At this time instant, the input to the deep learning unit comprises the current compressed measurement $\boldsymbol{y}(t)$ and the cell and hidden states from the previous time sample, denoted as $\boldsymbol{c}\_{t-1}$ and $\boldsymbol{h}\_{t-1}$, respectively. The LSTM unit has four gates, namely the forget gate ($f\_t$), input gate ($i\_t$), cell gate ($g\_t$), and output gate ($o\_t$), which respectively perform the following operations:

• Forget gate (*ft*): This gate combines the current input *y*(*t*) and the previous hidden state *h*(*t* − 1) to decide which information to forget and which to remember from previous cell state. The operation is given by

$$f\_t = \sigma \left( \mathbf{W}\_f \left[ \mathbf{y}^\mathrm{T}(t) \, \boldsymbol{h}^\mathrm{T}(t-1) \right]^\mathrm{T} \right), \tag{10}$$

where $\sigma(\cdot)$ denotes the sigmoid function, and $\mathbf{W}\_f$ is a weight matrix corresponding to the forget gate.

• Input gate (*it*): This gate combines the current input *y*(*t*) and the previous hidden state *h*(*t* − 1) to decide which information to store in the cell state. The operation is given by

$$i\_{t} = \sigma\left(\mathbf{W}\_{i} \left[\boldsymbol{y}^{\mathrm{T}}(t)\, \boldsymbol{h}^{\mathrm{T}}(t-1)\right]^{\mathrm{T}}\right), \tag{11}$$

where *W<sup>i</sup>* is a weight matrix corresponding to the input gate.

• Cell gate (*gt*): This gate combines the current input *y*(*t*) and the previous hidden state *h*(*t* − 1) to compute the actual representation that will go into the cell state. The operation is given by

$$g\_{t} = \tanh\left(\mathbf{W}\_{g} \left[\boldsymbol{y}^{\mathrm{T}}(t)\, \boldsymbol{h}^{\mathrm{T}}(t-1)\right]^{\mathrm{T}}\right), \tag{12}$$

where tanh(·) denotes the hyperbolic tangent function and *W<sup>g</sup>* is a weight matrix corresponding to the cell gate.

• Output gate (*ot*): This gate combines the current input *y*(*t*) and the previous hidden state *h*(*t* − 1) to decide how much to weight the cell state information to generate the output of the LSTM cell, which is also denoted as hidden state *ht*. The operation is given by

$$o\_{t} = \sigma\left(\mathbf{W}\_{o} \left[\boldsymbol{y}^{\mathrm{T}}(t)\, \boldsymbol{h}^{\mathrm{T}}(t-1)\right]^{\mathrm{T}}\right), \tag{13}$$

where *Wo* is a weight matrix corresponding to the output gate.

Finally, the cell state is updated according to

$$c\_{t} = f\_{t} c\_{t-1} + i\_{t} g\_{t}, \tag{14}$$

which combines the amount of information from the previous cell state regulated by the forget gate and the amount of updated information. The output of the LSTM cell, i.e., the hidden state for time *t*, will be the filtered version of the current cell state regulated by the output gate, i.e.,

$$h\_t = o\_t \tanh(c\_t) \tag{15}$$

The preservation of historical observations by the cell state *c<sup>t</sup>* over time is evident from Equation (14). Additionally, the cell state does not exhibit growth as the time index increases; rather, it adaptively updates its information content. We, therefore, use the cell state information as a mapping of historical observations. At each time sample, these historical observations are exploited to optimize CMM **Φ** using another DNN. At the end of all time iterations, the minimum variance distortionless response (MVDR) spatial spectrum estimation method is employed to estimate the signal DOAs.
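A single forward step of Equations (10)–(15) can be sketched as follows. This is a minimal NumPy illustration, not the trained network: the gate products are element-wise, biases are omitted to mirror the equations in the text, and the input and hidden sizes ($2M$ for the stacked real/imaginary input, $H = 16$ hidden units) are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(y_t, h_prev, c_prev, Wf, Wi, Wg, Wo):
    """One LSTM step implementing Equations (10)-(15)."""
    z = np.concatenate([y_t, h_prev])      # [y^T(t) h^T(t-1)]^T
    f_t = sigmoid(Wf @ z)                  # forget gate,  Eq. (10)
    i_t = sigmoid(Wi @ z)                  # input gate,   Eq. (11)
    g_t = np.tanh(Wg @ z)                  # cell gate,    Eq. (12)
    o_t = sigmoid(Wo @ z)                  # output gate,  Eq. (13)
    c_t = f_t * c_prev + i_t * g_t         # cell state,   Eq. (14), element-wise
    h_t = o_t * np.tanh(c_t)               # hidden state, Eq. (15), element-wise
    return h_t, c_t

# Hypothetical sizes: 2M-dim real input (stacked Re/Im parts), H hidden units.
M, H = 10, 16
rng = np.random.default_rng(0)
Wf, Wi, Wg, Wo = (rng.standard_normal((H, 2 * M + H)) * 0.1 for _ in range(4))
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_cell(rng.standard_normal(2 * M), h, c, Wf, Wi, Wg, Wo)
```

Because the cell state has a fixed dimension $H$ regardless of how many snapshots have been processed, it serves as the constant-size summary of the observation history discussed above.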

**Figure 3.** Proposed deep learning unit for time *t*.

#### **4. Proposed LSTM Based Optimization of the CMM Φ**

Figure 4 illustrates the end-to-end architecture of the proposed framework for realizing the equation presented in Equation (9). In the following subsections, we discuss the details of the proposed approach for the optimization of CMM **Φ**.

**Figure 4.** End-to-end deep learning framework for optimizing CMM **Φ**.

#### *4.1. Data Pre-Processing*

Using the array received signal vector $\mathbf{x}(t) \in \mathbb{C}^{N}$ of the massive MIMO system at time $t$, we form a tensor denoted by $\mathcal{X}(t) \in \mathbb{C}^{B \times N \times 1}$ by concatenating the array received signal vectors for $B$ different DOA scenarios. Collecting all time snapshots then produces the training tensor $\mathcal{X} \in \mathbb{C}^{B \times N \times T}$. At the beginning, with a randomly initialized CMM $\boldsymbol{\Phi}$, we perform analog beamforming to obtain the compressed measurement tensor $\mathcal{Y}(t) = \boldsymbol{\Phi}(t-1)\mathcal{X}(t)$ at time $t = 1$, where $\mathcal{Y}(t) \in \mathbb{C}^{B \times M \times 1}$. Separating the real and imaginary parts of $\mathcal{Y}(t)$, we concatenate them to form the input tensor $\hat{\mathcal{Y}}(t)$ for the LSTM unit as illustrated in Figure 3.
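The pre-processing step for one time snapshot can be sketched as below; the batch size $B = 4$ and the random $\boldsymbol{\Phi}$ are placeholders for illustration, and real/imaginary parts are stacked along the last axis to form the real-valued LSTM input.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, M = 4, 50, 10            # batch of DOA scenarios, antennas, measurements

X_t = rng.standard_normal((B, N)) + 1j * rng.standard_normal((B, N))  # X(t)
Phi = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))  # Phi(t-1)

Y_t = X_t @ Phi.T              # (B, M): each scenario compressed as Phi x_b
Y_hat = np.concatenate([Y_t.real, Y_t.imag], axis=-1)  # (B, 2M) real LSTM input
```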

#### *4.2. Implementation Details of the Deep Learning Framework*

The proposed deep learning framework comprises a series of LSTM units and FCDNNs. An LSTM unit summarizes the historical observations into a fixed-dimensional cell state vector *c*(*t* − 1), which serves as a sufficient statistic for optimizing the CMM in the subsequent time instance *t*. For a particular time snapshot *t*, the input tensor *Y*ˆ(*t*), along with the cell and hidden state tensors *C*(*t* − 1) and *H*(*t* − 1), serves as input to the LSTM unit. The tensors *C*(*t* − 1) and *H*(*t* − 1) are formed by concatenating the vectors *c*(*t* − 1) and *h*(*t* − 1) for all *B* scenarios and all layers of the LSTM network. Based on the gate mechanism described in Equations (10)–(15), the cell and hidden states are updated adaptively. We then employ an *L*-layer FCDNN to map the cell state information *C*(*t*) to design the CMM **Φ**(*t*) at time instant *t*. The DNN output at time *t* is expressed as

$$\tilde{\boldsymbol{\Phi}}(t) = \mathcal{A}\_L(\mathbf{W}\_L \mathcal{A}\_{L-1}(\cdots \mathcal{A}\_1(\mathbf{W}\_1 \boldsymbol{c}(t) + \mathbf{b}\_1) \cdots) + \mathbf{b}\_L), \tag{16}$$

where $\mathbf{W}\_l$, $\mathbf{b}\_l$, and $\mathcal{A}\_l$ are the weight, bias, and nonlinear activation function corresponding to the $l$th layer of the DNN, respectively. $\tilde{\boldsymbol{\Phi}}(t) = [\mathcal{R}(\boldsymbol{\Phi}(t))\; \mathcal{I}(\boldsymbol{\Phi}(t))]$ is the real-valued representation of the complex-valued CMM at time $t$.

#### *4.3. Post-Processing*

We first reconstruct the complex-valued $\boldsymbol{\Phi}(t)$ from its real representation, where the real and imaginary parts of $\boldsymbol{\Phi}(t)$ correspond to the left and right halves of $\tilde{\boldsymbol{\Phi}}(t)$, respectively. The measurement kernels $\boldsymbol{\phi}\_m$, $m = 1, 2, \cdots, M$, are generally implemented using a series of phase shifters. Therefore, it is desirable for the CMM to satisfy a constant modulus constraint. To enforce this constraint, we set the activation function of the final layer as

$$\mathcal{A}\_{L}(\tilde{\boldsymbol{\Phi}}(t)) = \left[ \mathcal{R}(\boldsymbol{\Phi}) \div \sqrt{\mathcal{R}(\boldsymbol{\Phi})^2 + \mathcal{I}(\boldsymbol{\Phi})^2} \;\;\; \mathcal{I}(\boldsymbol{\Phi}) \div \sqrt{\mathcal{R}(\boldsymbol{\Phi})^2 + \mathcal{I}(\boldsymbol{\Phi})^2} \right], \tag{17}$$

Subsequently, the obtained **Φ**(*t*), along with the updated *C*(*t*) and *H*(*t*), will be utilized to generate **Φ**(*t* + 1), and this process will continue until the time snapshot *t* = *T*.
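The constant-modulus post-processing of Equation (17), followed by the reconstruction of the complex-valued CMM, can be sketched as follows; the division by the element-wise modulus assumes no entry of the DNN output is exactly zero, and the input sizes are illustrative.

```python
import numpy as np

def constant_modulus(Phi_tilde):
    """Map the real-valued DNN output [Re(Phi) Im(Phi)] to a unit-modulus
    complex CMM, following Equation (17)."""
    _, two_N = Phi_tilde.shape
    N = two_N // 2
    re, im = Phi_tilde[:, :N], Phi_tilde[:, N:]
    mod = np.sqrt(re ** 2 + im ** 2)        # element-wise modulus
    return (re / mod) + 1j * (im / mod)     # every entry now has unit modulus

Phi = constant_modulus(np.random.default_rng(0).standard_normal((10, 100)))
```

Each entry of the resulting $\boldsymbol{\Phi}(t)$ lies on the unit circle, so it can be realized by a phase shifter.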

#### *4.4. Loss Function and Back Propagation*

In the underlying massive MIMO context, where the CMM $\boldsymbol{\Phi}$ is optimized to enhance the accuracy of the DOA estimation, it is crucial to specify a suitable loss function that enables a comparison between the true DOAs and those estimated using the optimized $\boldsymbol{\Phi}$. Once the sequential updating of the CMM $\boldsymbol{\Phi}$ is completed, the optimized $\boldsymbol{\Phi}$ is used to find the compressed measurements $\mathcal{Y} \in \mathbb{C}^{B \times M \times T}$ from the input tensor $\mathcal{X}$ as $\mathcal{Y} = \boldsymbol{\Phi}\mathcal{X}$. Using these compressed measurements, we use the MVDR spectrum estimator to obtain

the spatial spectrum. To do so, we first estimate the sample covariance matrix for the $b$th compressed measurement $\mathbf{Y}\_b \in \mathbb{C}^{M \times T}$ as

$$\hat{\mathbf{R}}\_{y\_b} = \frac{1}{T} \mathbf{Y}\_b \mathbf{Y}\_b^{\mathrm{H}} \tag{18}$$

for *b* = 1, 2, ··· , *B*. The MVDR spectrum is obtained as

$$\hat{p}\_{b}(\theta) = \frac{\boldsymbol{a}^{\mathrm{H}}(\theta)\boldsymbol{\Phi}^{\mathrm{H}}\boldsymbol{\Phi}\boldsymbol{a}(\theta)}{\boldsymbol{a}^{\mathrm{H}}(\theta)\boldsymbol{\Phi}^{\mathrm{H}}\hat{\mathbf{R}}\_{y\_b}^{-1}\boldsymbol{\Phi}\boldsymbol{a}(\theta)}. \tag{19}$$

We consider the DOA estimation problem as a multiclass classification problem, where in each angular bin, we make a binary decision on whether a signal is present in the bin. To do so, we employ the binary cross-entropy loss function between the estimated MVDR spectrum $\hat{p}\_b$ and the true DOA locations $p\_b$, expressed as

$$\text{Loss} = -\frac{1}{B} \sum\_{b=1}^{B} [p\_b \log \hat{p}\_b + (1 - p\_b) \log(1 - \hat{p}\_b)],\tag{20}$$

where *B* is the batch size of the training data.
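Equations (18)–(20) can be sketched for a single scenario as below. Two details are assumptions added for the example rather than steps stated in the text: the spectrum is normalized to $[0, 1]$ before the cross-entropy, and a small diagonal loading term is added as a numerical safeguard when inverting the sample covariance matrix.

```python
import numpy as np

def mvdr_spectrum(Y_b, Phi, thetas_deg):
    """Equations (18)-(19): MVDR spectrum from compressed data Y_b (M x T)."""
    M, T = Y_b.shape
    R_hat = Y_b @ Y_b.conj().T / T                    # sample covariance, Eq. (18)
    R_inv = np.linalg.inv(R_hat + 1e-6 * np.eye(M))   # light diagonal loading
    N = Phi.shape[1]
    p = np.empty(len(thetas_deg))
    for i, th in enumerate(thetas_deg):
        a = np.exp(1j * np.pi * np.arange(N) * np.sin(np.deg2rad(th)))
        b = Phi @ a                                   # compressed steering vector
        p[i] = np.real(b.conj() @ b) / np.real(b.conj() @ R_inv @ b)  # Eq. (19)
    return p / p.max()                                # normalize to [0, 1]

def bce_loss(p_true, p_hat, eps=1e-12):
    """Binary cross-entropy of Equation (20), averaged over the bins."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(p_true * np.log(p_hat) + (1.0 - p_true) * np.log(1.0 - p_hat))
```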

#### **5. Simulation Results**

We consider a massive MIMO system consisting of $N = 50$ receive antennas arranged in a uniform linear fashion and separated by a half wavelength. We choose the compression ratio to be $N/M = 5$, which yields the dimension of the compressed measurement $M = 10$. The number of impinging sources in the massive MIMO system is considered between 1 and 9. The sources impinge from angular bins discretized with a $\Delta\bar{\theta} = 0.1^{\circ}$ interval within an angular range between $-90^{\circ}$ and $90^{\circ}$. As a result, there are 1801 components in the Gaussian mixture model. The number of snapshots is assumed to be $T = 100$.

We consider a 4-layer LSTM unit with 200 nodes in each layer, and a DNN with 3 layers and 500 nodes. The selection of the number of layers and nodes for both models is made to achieve a good balance between the predictability and generalization capability of the networks. A training dataset is created with 10,000 scenarios, each containing 1 to 9 sources randomly sampled from a uniform distribution ranging between −90◦ and 90◦. The input signal-to-noise ratio (SNR) is randomly selected between 0 dB and 20 dB for each scenario. The test dataset consists of 1000 scenarios, which are generated using a similar approach.

We evaluate the performance of the proposed model against two related approaches as described in [21,34]. The non-neural network approach presented in [21] optimizes CMM **Φ** iteratively based on mutual information maximization, while the approach described in [34] uses an FCDNN to update the posterior distribution of the DOAs of the impinging signals. To compare these methods, we consider a test example with nine sources and their corresponding signal DOAs are −55◦, −48◦, −44◦, −20◦, 8◦, 20◦, 31◦, 41◦, and 45◦. Figure 5 shows the estimated spectra obtained from the methods where the input SNR is 5 dB. As demonstrated in this figure, the proposed method, depicted in (a), shows a cleaner spectrum compared to [21,34], as illustrated in (b) and (c), in a low SNR scenario. Figure 6 demonstrates the reduction in the loss function as the number of epochs increases. It is evident from the figure that the model converged well within the first 200 epochs.

**Figure 5.** Comparison of the estimated spatial spectra. (**a**) Proposed method. (**b**) Method in [21]. (**c**) Method in [34].

**Figure 6.** Loss vs. number of epochs.

To assess the methods' performance under different conditions, including varying input SNR levels, numbers of snapshots, and dimensions of the compressed measurement (i.e., numbers of front-end circuits), we compare them using the root mean squared error (RMSE), defined as

$$\text{RMSE} = \sqrt{\frac{1}{QD} \sum\_{q=1}^{Q} \sum\_{d=1}^{D} (\hat{\theta}\_{q,d} - \theta\_d)^2} \tag{21}$$

where *Q* is the number of trials and $\hat{\theta}\_{q,d}$ is the estimated DOA for the *d*th source in the *q*th Monte Carlo trial. In total, 1000 Monte Carlo trials are used to compute the RMSE values. Figure 7 presents the RMSE values with respect to the input SNR, the number of snapshots, and the dimension of the compressed measurement, and clearly shows that the proposed LSTM-based approach outperforms the other methods. Additionally, the Cramér–Rao bound (CRB) is included in Figure 7 for comparison.
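Equation (21) translates directly into a few lines of NumPy; the helper `rmse` and its toy inputs below are illustrative, not from the paper.

```python
import numpy as np

def rmse(theta_hat, theta_true):
    """RMSE over Q trials and D sources, per Equation (21).

    theta_hat: (Q, D) estimated DOAs; theta_true: (D,) true DOAs.
    """
    err = np.asarray(theta_hat, dtype=float) - np.asarray(theta_true, dtype=float)
    return np.sqrt(np.mean(err ** 2))   # averages over both trials and sources

# Toy check: every estimate off by exactly 0.5 degrees gives RMSE 0.5
est = np.array([[10.5, 20.5], [9.5, 19.5]])
true = np.array([10.0, 20.0])
print(rmse(est, true))  # 0.5
```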

To obtain the CRB, we first denote the unknown parameters, namely the signal DOAs and powers of the *D* sources, as $\boldsymbol{\theta} = [\theta\_1, \cdots, \theta\_D]^{\mathrm{T}}$ and $\boldsymbol{p} = [\sigma\_1^2, \cdots, \sigma\_D^2]^{\mathrm{T}}$, respectively. We also define the noise power as $\sigma\_n^2$ and $\boldsymbol{\omega} = [\omega\_1, \cdots, \omega\_D]^{\mathrm{T}}$ as the spatial frequencies, where $\omega\_d = \sin(\theta\_d)/2$ is the spatial frequency of the *d*th source. The unknown parameters are then grouped as $\boldsymbol{\psi} = [\boldsymbol{\omega}^{\mathrm{T}}\ \boldsymbol{p}^{\mathrm{T}}\ \sigma\_n^2]^{\mathrm{T}}$. Since we are interested in obtaining the CRB of the signal DOAs, we partition the unknown parameters as $\boldsymbol{\psi} = [\boldsymbol{\omega}^{\mathrm{T}} \,|\, \boldsymbol{p}^{\mathrm{T}}\ \sigma\_n^2]^{\mathrm{T}}$.

The CRB can be obtained as the inverse of the Fisher information matrix (FIM), which is defined as

$$[\mathbf{F}]\_{u,v} = -\mathbb{E}\left[\frac{\partial^2 \ln p(\mathbf{y}|\boldsymbol{\psi})}{\partial \psi\_u \, \partial \psi\_v}\right],\tag{22}$$

where $\psi\_u$ is the *u*th element of $\boldsymbol{\psi}$, with $u, v \in \{1, 2, \cdots, 2D + 1\}$.

The FIM can also be expressed as [43]

$$\frac{1}{T}\mathbf{F} = \begin{bmatrix} \mathbf{\Delta}\_{\omega} & \mathbf{\Delta}\_{o} \end{bmatrix}^{\mathrm{H}} \begin{bmatrix} \mathbf{\Delta}\_{\omega} & \mathbf{\Delta}\_{o} \end{bmatrix} = \begin{bmatrix} \mathbf{\Delta}\_{\omega}^{\mathrm{H}} \mathbf{\Delta}\_{\omega} & \mathbf{\Delta}\_{\omega}^{\mathrm{H}} \mathbf{\Delta}\_{o} \\ \mathbf{\Delta}\_{o}^{\mathrm{H}} \mathbf{\Delta}\_{\omega} & \mathbf{\Delta}\_{o}^{\mathrm{H}} \mathbf{\Delta}\_{o} \end{bmatrix} \tag{23}$$

where $\mathbf{\Delta}\_{\omega} = (\mathbf{C}\_{yy}^{\mathrm{T}} \otimes \mathbf{C}\_{yy})^{-\frac{1}{2}} \left[\frac{\partial \mathbf{r}}{\partial \omega\_1}, \cdots, \frac{\partial \mathbf{r}}{\partial \omega\_D}\right]$ and $\mathbf{\Delta}\_{o} = (\mathbf{C}\_{yy}^{\mathrm{T}} \otimes \mathbf{C}\_{yy})^{-\frac{1}{2}} \left[\frac{\partial \mathbf{r}}{\partial \sigma\_1^2}, \cdots, \frac{\partial \mathbf{r}}{\partial \sigma\_D^2}, \frac{\partial \mathbf{r}}{\partial \sigma\_n^2}\right]$, with $\mathbf{r} = \mathrm{vec}(\mathbf{C}\_{yy})$. Then, the CRB of the signal spatial frequencies is obtained as [43]

$$\text{CRB} = \frac{1}{T} \left(\mathbf{\Delta}\_{\omega}^{\mathrm{H}} \mathbf{\Pi}\_{o}^{\perp} \mathbf{\Delta}\_{\omega}\right)^{-1}, \tag{24}$$

where $\mathbf{\Pi}\_{o}^{\perp} = \mathbf{I} - \mathbf{\Delta}\_{o}(\mathbf{\Delta}\_{o}^{\mathrm{H}} \mathbf{\Delta}\_{o})^{-1} \mathbf{\Delta}\_{o}^{\mathrm{H}}$.
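Equations (22)–(24) can be evaluated numerically. The sketch below assumes the standard ULA covariance model $\mathbf{C}\_{yy} = \sum\_d \sigma\_d^2 \mathbf{a}(\omega\_d)\mathbf{a}^{\mathrm{H}}(\omega\_d) + \sigma\_n^2 \mathbf{I}$, approximates the derivatives of $\mathbf{r}$ by central finite differences, and uses a small *N* to keep the Kronecker products manageable; it is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

def steering(omega, N):
    # Half-wavelength ULA response at spatial frequency omega = sin(theta)/2
    return np.exp(2j * np.pi * omega * np.arange(N))

def covariance(omegas, powers, noise_var, N):
    C = noise_var * np.eye(N, dtype=complex)
    for w, p in zip(omegas, powers):
        a = steering(w, N)
        C += p * np.outer(a, a.conj())
    return C

def crb_spatial_freq(omegas, powers, noise_var, N, T, eps=1e-7):
    """CRB of the spatial frequencies via Equations (23) and (24)."""
    omegas = np.asarray(omegas, float)
    powers = np.asarray(powers, float)
    D = omegas.size
    C = covariance(omegas, powers, noise_var, N)
    # (C^T kron C)^(-1/2) via the eigendecomposition of a Hermitian PD matrix
    vals, vecs = np.linalg.eigh(np.kron(C.T, C))
    W = vecs @ np.diag(vals ** -0.5) @ vecs.conj().T

    def dvec(Cp, Cm):  # central-difference derivative of r = vec(C_yy)
        return (Cp.flatten(order='F') - Cm.flatten(order='F')) / (2 * eps)

    cols_w, cols_o = [], []
    for d in range(D):  # partial r / partial omega_d
        wp, wm = omegas.copy(), omegas.copy()
        wp[d] += eps
        wm[d] -= eps
        cols_w.append(dvec(covariance(wp, powers, noise_var, N),
                           covariance(wm, powers, noise_var, N)))
    for d in range(D):  # partial r / partial sigma_d^2
        pp, pm = powers.copy(), powers.copy()
        pp[d] += eps
        pm[d] -= eps
        cols_o.append(dvec(covariance(omegas, pp, noise_var, N),
                           covariance(omegas, pm, noise_var, N)))
    cols_o.append(dvec(covariance(omegas, powers, noise_var + eps, N),  # sigma_n^2
                       covariance(omegas, powers, noise_var - eps, N)))

    Dw = W @ np.stack(cols_w, axis=1)
    Do = W @ np.stack(cols_o, axis=1)
    # Projection onto the orthogonal complement of the nuisance subspace
    P = np.eye(N * N) - Do @ np.linalg.pinv(Do.conj().T @ Do) @ Do.conj().T
    return np.real(np.linalg.inv(Dw.conj().T @ P @ Dw)) / T

# Small illustrative case: N = 8 antennas, two sources, T = 100 snapshots
crb = crb_spatial_freq([-0.1, 0.2], [1.0, 1.0], 0.1, N=8, T=100)
print(np.diag(crb))
```

The diagonal of the returned matrix bounds the variance of any unbiased estimator of each spatial frequency, and scales as 1/*T* as expected.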

**Figure 7.** Performance comparison. (**a**) RMSE versus input SNR. (**b**) RMSE versus number of snapshots. (**c**) RMSE versus compressed dimension.

Next, we consider a scenario in which nine sources with initial positions of −20◦, −15◦, −10◦, −5◦, 0◦, 5◦, 10◦, 15◦, and 20◦ move in the positive direction at the same angular rate of 1 degree per 20 snapshots. The result of the proposed method is compared with that of the method described in [34] because both are sequential methods, i.e., **Φ** is sequentially updated. As shown in Figure 8, the performance of the method described in [34] degrades as the source positions change. This is because that method uses the posterior from the previous time instant as a sufficient statistic of all past observations; as each new measurement differs from the previous ones, the method cannot adapt well. In contrast, the proposed method, as depicted in Figure 9, does not have this limitation, resulting in improved DOA estimation performance as the sequential updating continues.


**Figure 9.** Estimated spectra for moving sources using the proposed method. (**a**) Initial position. (**b**) Next position from (**a**). (**c**) Next position from (**b**).

#### **6. Conclusions**

In this paper, we developed an LSTM-based framework to optimize the CMM in a massive MIMO setting. The inherent architecture of an LSTM network is well suited to preserving relevant historical observations, which is useful for designing the CMM in a sequential manner. The resulting optimized CMM can then be used to compress the high-dimensional received data, which effectively reduces the number of front-end circuits. As demonstrated by the simulation results, the proposed method exhibits superior DOA estimation performance compared to existing methods.

**Author Contributions:** Conceptualization, S.R.P. and Y.D.Z.; methodology, S.R.P.; validation, S.R.P.; writing—original draft preparation, S.R.P.; writing—review and editing, Y.D.Z.; supervision, Y.D.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** All data used to support the findings of this study are included within the article.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

**Shadab Anwar Shaikh 1,\*, Harish Cherukuri 1,\* and Taufiquar Khan <sup>2</sup>**


**Abstract:** In engineering design, oftentimes a system's dynamic response is known or can be measured, but the source generating these responses is not known. The mathematical problem where the focus is on inferring the source terms of the governing equations from the set of observations is known as an inverse source problem (ISP). ISPs are traditionally solved by optimization techniques with regularization, but in the past few years, there has been a lot of interest in approaching these problems from a deep-learning viewpoint. In this paper, we propose a deep learning approach, infused with physics information, to recover the forcing function (source term) of systems with one degree of freedom from the response data. We test our architecture first to recover smooth forcing functions, and then functions involving abruptly changing gradients and jump discontinuities, in the case of a linear system. Finally, we recover the harmonic, the sum of two harmonics, and the Gaussian function in the case of a non-linear system. The results obtained are promising and demonstrate the efficacy of this approach in recovering forcing functions from data.

**Keywords:** physics informed neural network; dynamic force identification; deep learning; Duffing's equation; spring mass damper system; non-linear oscillators

#### **1. Introduction**

Inverse problems are a special class of mathematical problems where the focus is on inferring causal relationships from a set of observations. These problems are often ill-posed and suffer from various numerical issues [1]; nevertheless, they are encountered extensively in different fields of science and engineering. In the past few decades, there has been a plethora of research on solving these problems [2–4].

A subclass of inverse problems in which the interest is in estimating the right-hand side, or "source term", of a governing equation is known as inverse source problems (ISPs). ISPs arise frequently in several domains of physics and engineering; a few noteworthy examples are the following: Optical Molecular Imaging (OMI), where the spatial distribution of bio-luminescent and fluorescent markers in human tissues is reconstructed from light-intensity measurements [5,6]; Radiative Heat Transfer, where the temperature distribution of a medium is reconstructed from radiation intensity measurements and medium properties [6]; Magnetoencephalography (MEG) and Electroencephalography (EEG), where surface electrical and magnetic current measurements on the head are used to determine the source of brain activity [7,8].

In this paper, we study one such ISP, known as the dynamic load identification problem. Here, we attempt to recover the 'forcing function' or 'excitation force' of linear and non-linear oscillators from dynamic response data. This problem can be solved in both the time and frequency domains; in this study, we adopt the time-domain approach owing to its simplicity.

**Citation:** Shaikh, S.A.; Cherukuri, H.; Khan, T. Recovering the Forcing Function in Systems with One Degree of Freedom Using ANN and Physics Information. *Algorithms* **2023**, *16*, 250. https://doi.org/10.3390/a16050250

Academic Editor: Frank Werner

Received: 10 March 2023 Revised: 3 May 2023 Accepted: 8 May 2023 Published: 12 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

In the past few decades, plenty of research has been published discussing various approaches to solving this problem, and it would be difficult to enumerate them all within the scope of this paper; nonetheless, a few notable mentions are as follows. Huang [9] used the conjugate gradient method to estimate the time-dependent forces in a non-linear oscillator. Ma et al. [10] developed a recursive estimator based on the Kalman filter to determine the impulsive load from measurement data for single and multi-degree-of-freedom systems. In another interesting work by Azam et al. [11], the authors proposed a dual Kalman filter for estimating the full states of a linear multi-degree-of-freedom system with unknown input from a limited number of noisy acceleration measurements and a known physical model. Ref. [12] formulated the force identification problem of the Duffing oscillator as a Volterra-type integral equation of the first kind and used a regularization technique to stabilize the solution. Feldman [13] proposed a method for predicting forces from response data alone, without the need for any parametric or governing-equation information, using the Hilbert transform. Ref. [14] solved the non-linear force identification problem in the frequency domain using ordinary least squares with Tikhonov regularization and its variants. Liu et al. [15] solved the non-linear vibration problem by transforming the non-linear ordinary differential equations into parabolic ordinary differential equations owing to their robustness against large noise. Recently, Rice et al. [16] proposed a calibration-based integral formulation for estimating the forcing function in the spring mass damper system from response data. For a detailed review of past and present literature on dynamic load identification techniques, interested readers are advised to refer to [17].

In recent years, there has been significant interest in applying machine learning and deep learning techniques to load identification. Pravin and Rao [18] proposed a technique for recovering input forces from acceleration time history using dynamic principal component analysis. Zhou et al. [19] used a deep recurrent neural network (RNN) technique with two variants of Long Short-Term Memory (LSTM) to recover the impact loads on non-linear structures from response data. They tested their architecture on a damped Duffing oscillator subjected to an impact load expressed by a normal distribution function and on a composite plate. Another work [20] proposed an RNN with a different architecture, but it was mainly focused on recovering the forces on beam structures excited by harmonic, impact, and random forces. In another work by Luca Rosafalco et al. [21], the authors implemented a deep learning based autoencoder for load identification, for structural health monitoring, from multivariate structural vibration data; they employed residual learning and inception modules in their autoencoder network. Ref. [22] proposed an ANN based on a Bayesian probability framework to estimate the forces from displacement responses.

In spite of the massive success of deep learning techniques in tackling a variety of problems, owing to their ability to explore vast design spaces and to manage ill-posed problems, deep learning predictions are oftentimes physically inconsistent and generalize poorly [23]. This behavior can be alleviated to some extent by embedding various biases; one way of achieving this is by infusing the governing equation into the loss function of a neural network, as proposed by [24], known as a "physics-informed neural network" (PINN). Recently, PINNs have been used to solve inverse source problems; one such account is the paper by He et al. [25], in which the authors utilized PINNs to predict a spatially and temporally varying heat source from simulated temperature data with good accuracy. In this work, we use the PINN approach to estimate the forcing function of a one-degree-of-freedom system.

Recently, two studies [26,27] have been published where the authors utilized machine learning and physics information to solve the vibration problem. The former used the Hilbert transform and a variant of the least-squares method to estimate the non-linear restoring force in a bi-stable structure, and the latter used PINNs to solve forced vibration and plate vibration problems.

Haghighat et al. [27] also used PINNs to solve forced spring mass damper systems similar to ours, but their work was mostly about predicting the displacements for a future time step and natural frequency, whereas our approach is more focused on estimating the excitation forces. We propose PINNs to estimate harmonic or non-harmonic and periodic or aperiodic forcing functions for systems with one degree of freedom subjected to various initial conditions.

Although in this work our attention is on oscillators as a mechanical system, to our understanding, this work also has the potential to be applied to any systems governed by linear or non-linear ordinary differential equations in different domains.

The remainder of the paper is organized as follows: In Section 2, we present the mathematical model based on Duffing's equation. In Section 3, we discuss the structure of our neural network and share details about the training process. In Section 4, we share our findings, and we conclude the paper with Sections 5 and 6, which contain the discussion and conclusions, respectively.

#### **2. Mathematical Model**

Duffing's equation is a non-linear ordinary differential equation used to model the approximate behavior of various physical systems, such as nano-mechanical resonators [28], ultrasonic cutting systems [29], piezo-ceramics under the influence of a magnetic field, and the flight motor of an insect [30], to name a few. One formulation of Duffing's equation is given by

$$
\ddot{x} + \delta \dot{x} + \alpha x + \beta x^3 = f(t). \tag{1}
$$

Here, $x(t)$ is the solution to the differential equation. The initial conditions are given by $x(0) = x\_0$ and $\dot{x}(0) = \dot{x}\_0$; $\delta$ is the amount of damping, $\alpha$ is the linear stiffness, $\beta$ is the amount of non-linearity, and $f(t)$ is the forcing function. By rearranging and fixing different values of the coefficients $\alpha$, $\delta$, and $\beta$ in Equation (1), the governing equations of various linear and non-linear oscillators can be derived. For a detailed mathematical treatment and understanding of Duffing's equation, interested readers are advised to refer to [31].

In this work, we recover $f(t)$ from simulated measurements of $x(t)$, its derivative $\dot{x}(t)$, and the initial conditions using an artificial neural network (ANN) and governing-equation information. This differs from solving the problem in the forward manner, where we solve the differential equation analytically or numerically to obtain the solution $x(t)$ given $f(t)$ and the initial conditions.

#### **3. Methodology**

In this section, we discuss the structure of the neural network (NN) that was used, followed by details on the loss function, and later, sum up the section by shedding some light on the training algorithm and process that was employed.

#### *3.1. Structure of NN*

The structure of the NN is shown in Figure 1 and is mathematically represented by

$$
\hat{f}, \hat{x}, \hat{\dot{x}} = \Phi^L(t; \mathbf{W}, \mathbf{b}) \tag{2}
$$

where the function $\Phi^L : \mathbb{R}^+ \to \mathbb{R}^3$ represents the neural network with $L$ layers; $t \in \mathbb{R}^+$ is the input; $\hat{x} \in \mathbb{R}$, $\hat{\dot{x}} \in \mathbb{R}$, and $\hat{f} \in \mathbb{R}$ are the outputs; and $\mathbf{W} \in \mathbb{R}^{n \times n}$ and $\mathbf{b} \in \mathbb{R}^n$ are the neural network parameters. The architecture was defined in this way because it makes the differentiation of the NN output with respect to the input more manageable. Differentiation was performed using automatic differentiation (AD) with the help of TensorFlow [32] library functions.

The NN is feed-forward in the sense that the output of the first layer is the input to the second, that of the second to the third, and so on until the last layer. This can be represented by the composite equation below,

$$\mathbf{x}^{j} = \sigma^{j}(\mathbf{W}^{j} \cdot \mathbf{x}^{j-1} + \mathbf{b}^{j}), \qquad j \in \{1, \dots, L\} \tag{3}$$

where $j$ is the layer number, $\sigma^j : \mathbb{R}^n \to \mathbb{R}^n$ is the activation function, which adds non-linearity to the NN, and $\mathbf{W}^j$ and $\mathbf{b}^j$ are the weights and biases of that layer.

For example, a 4-layer neural network, i.e., *L* = 4, can be represented by

$$\begin{aligned} \mathbf{x}^1 &= \sigma^1 \left( \mathbf{W}^1 \cdot \mathbf{x}^0 + \mathbf{b}^1 \right) \\ \mathbf{x}^2 &= \sigma^2 \left( \mathbf{W}^2 \cdot \mathbf{x}^1 + \mathbf{b}^2 \right) \\ \mathbf{x}^3 &= \sigma^3 \left( \mathbf{W}^3 \cdot \mathbf{x}^2 + \mathbf{b}^3 \right) \\ \mathbf{x}^4 &= \mathbf{W}^4 \cdot \mathbf{x}^3 + \mathbf{b}^4 \end{aligned} \tag{4}$$

where $\mathbf{x}^0 = t$ and $\mathbf{x}^4 = [\hat{f}, \hat{x}, \hat{\dot{x}}]$. The output $\hat{f}$ is constrained by the physical model, and $\hat{x}$ and $\hat{\dot{x}}$ are constrained by the displacement and velocity data, respectively. This is discussed in more detail in Section 3.2.
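The composite mapping in Equations (3) and (4) amounts to a few lines of NumPy; the layer widths and the tanh placeholder activation below are arbitrary illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [1, 16, 16, 16, 3]   # toy widths: scalar input t, three outputs [f_hat, x_hat, xdot_hat]
Ws = [rng.normal(size=(m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]

def forward(t):
    x = np.atleast_1d(t).reshape(1, -1)      # x^0 = t, one column per time sample
    for j in range(len(Ws) - 1):
        x = np.tanh(Ws[j] @ x + bs[j])       # x^j = sigma(W^j x^{j-1} + b^j), Equation (3)
    return Ws[-1] @ x + bs[-1]               # linear last layer, as in Equation (4)

out = forward(np.array([0.0, 0.5, 1.0]))
print(out.shape)  # (3, 3): three outputs for three time samples
```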

**Figure 1.** Proposed neural network architecture with input $t$ and outputs $\hat{x}$, $\hat{\dot{x}}$, and $\hat{f}$.

The proposed NN architecture was developed using the Keras [33] library with a TensorFlow backend. It consists of $L = 10$ layers with [1, 15, 30, 60, 120, 240, 120, 60, 30, 15, 3] units each. A batch normalization layer follows every dense layer, and each dense layer is passed through the ELU activation function, which adds non-linearity to the network.
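Under these stated choices, the architecture can be sketched in Keras as follows. The first and last entries of the unit list are taken as the input and output dimensions, and the ordering dense, then batch normalization, then ELU is our reading of the description:

```python
import tensorflow as tf
from tensorflow.keras import layers

hidden = [15, 30, 60, 120, 240, 120, 60, 30, 15]   # hidden widths from the text

inputs = tf.keras.Input(shape=(1,))                # scalar time input t
x = inputs
for u in hidden:
    x = layers.Dense(u)(x)
    x = layers.BatchNormalization()(x)             # batch norm after every dense layer
    x = layers.Activation('elu')(x)                # ELU adds the non-linearity
outputs = layers.Dense(3)(x)                       # [f_hat, x_hat, xdot_hat]
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 3)
```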

The optimal hyper-parameters were determined by performing systematic hyper-parameter tuning, which involved exploring different combinations of neural network architectures, initialization methods, activation functions, learning rates, and numbers of epochs. Initially, a shallow network with a smaller number of trainable parameters and the ReLU activation function was used, but this did not yield satisfactory results. Other activation functions were then experimented with, and the ELU activation function was found to produce better results. Finally, a deeper architecture with the ELU activation function was employed. Similar experiments were conducted to identify the other optimal hyper-parameter choices, which are discussed in Section 3.3.

#### *3.2. Loss Function*

The workhorse of our approach is the way the neural network loss function is defined. The total loss $\mathcal{L}\_{total}$ is composed of the data terms $\mathcal{L}\_{data}$ and $\mathcal{L}\_{IC}$ and the physics loss term $\mathcal{L}\_{physics}$:

$$
\mathcal{L}\_{\text{total}} = \mathcal{L}\_{\text{data}} + \mathcal{L}\_{\text{IC}} + \lambda \mathcal{L}\_{\text{physics}} \tag{5}
$$

such that

$$\mathcal{L}\_{data} = \frac{1}{N} \sum\_{i=1}^{N} (x\_i^\* - \hat{x}\_i)^2 + \frac{1}{N} \sum\_{i=1}^{N} (\dot{x}\_i^\* - \hat{\dot{x}}\_i)^2 \tag{6}$$

and

$$\mathcal{L}\_{IC} = (x\_0 - \hat{x}(0))^2 + (\dot{x}\_0 - \hat{\dot{x}}(0))^2. \tag{7}$$

Here, $x\_i^\*$ and $\dot{x}\_i^\*$ are the displacement and velocity from the data, $\hat{x}\_i$ and $\hat{\dot{x}}\_i$ are the displacement and velocity predictions from the neural network, $\lambda$ is the regularization term, and $x\_0$ and $\dot{x}\_0$ are the initial conditions. The task of $\mathcal{L}\_{data}$ and $\mathcal{L}\_{IC}$ is to constrain the neural network predictions using the data.

The physics loss L*physics* term is where the physics information is infused into the neural networks and is given by,

$$\mathcal{L}\_{physics} = \frac{1}{N} \sum\_{i=1}^{N} \left( \frac{D\hat{\dot{x}}\_i}{Dt\_i} + \delta \hat{\dot{x}}\_i + \alpha \hat{x}\_i + \beta \hat{x}\_i^3 - \hat{f}\_i \right)^2. \tag{8}$$

This equation is obtained by rearranging Equation (1) and replacing the velocities and displacements with their equivalent neural network predictions. To calculate the acceleration from the velocity prediction, we make use of automatic differentiation, which is represented by $\frac{D}{Dt\_i}$ in the above equation. The job of $\mathcal{L}\_{physics}$ is to force $\hat{f}\_i$ to take values that obey the governing equation.
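A sketch of Equation (8) in TensorFlow, with the acceleration obtained by automatic differentiation of the velocity output, might look as follows; the function signature, the toy demo model, and the coefficient handling are our assumptions for illustration.

```python
import tensorflow as tf

def physics_loss(model, t, delta, alpha, beta):
    """Mean squared residual of Equation (1), per Equation (8).

    The acceleration is obtained by automatic differentiation of the
    network's velocity output with respect to t."""
    t = tf.convert_to_tensor(t, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(t)
        f_hat, x_hat, v_hat = tf.unstack(model(tf.reshape(t, (-1, 1))), axis=1)
    a_hat = tape.gradient(v_hat, t)          # D v_hat / D t via AD
    residual = a_hat + delta * v_hat + alpha * x_hat + beta * x_hat ** 3 - f_hat
    return tf.reduce_mean(residual ** 2)

# Demo with an untrained toy model (coefficients here are arbitrary)
toy = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='tanh'),
                           tf.keras.layers.Dense(3)])
print(float(physics_loss(toy, [0.0, 0.5, 1.0], 0.2, 0.9, 0.5)))
```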

#### *3.3. Training*

The objective of the proposed NN architecture is to recover the forcing function, $f(t)$, from the displacement and velocity data. The training algorithm is shown below (refer to Algorithm 1). The inputs to the algorithm are $t$, $x^\*$, and $\dot{x}^\*$, i.e., the time, displacement, and velocity data. The NN takes in $t$ and outputs $\hat{f}$, $\hat{x}$, and $\hat{\dot{x}}$, i.e., the forcing function, displacement, and velocity predictions, respectively. The weights of the neural network are initialized using He-normal initialization.

The network was trained on 500 data points in batches of 250 points on an NVIDIA GTX 2060 GPU for 60,000 epochs. The training time for each training instance was approximately 3 to 3.5 h. The learning rate $\eta$ was chosen as 0.001, and the regularization term $\lambda$ was chosen as either 0.1 or 0.01, depending on which provided a better result.

At each epoch, $\mathcal{L}\_{total}$ is calculated from the data and the neural network predictions. The Adam optimizer [34] then takes in $\mathcal{L}\_{total}$, calculates its gradients with respect to the NN parameters, and propagates them through the network using the back-propagation algorithm, adjusting the weights and biases of the network at every epoch. A snapshot of the progression of $\mathcal{L}\_{total}$, $\mathcal{L}\_{data}$, and $\mathcal{L}\_{physics}$ with respect to epochs for one training instance is shown in Figure 2. Ideally, for the neural network to learn successfully, $\mathcal{L}\_{total} \to 0$, which can be observed in Figure 2.
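One epoch of this procedure reduces to a standard custom training step in TensorFlow. The sketch below takes an arbitrary loss closure standing in for $\mathcal{L}\_{total}$ and is illustrative rather than the authors' code.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)   # eta = 0.001, as in the text

def train_step(model, loss_fn):
    """One epoch of the loop in Algorithm 1: evaluate the total loss,
    back-propagate its gradients, and let Adam adjust W and b."""
    with tf.GradientTape() as tape:
        l_total = loss_fn(model)     # stands in for L_data + L_IC + lambda * L_physics
    grads = tape.gradient(l_total, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return l_total

# Demo: drive a toy model's output toward zero over a few "epochs"
toy = tf.keras.Sequential([tf.keras.layers.Dense(1)])
t = tf.ones((4, 1))
losses = [float(train_step(toy, lambda m: tf.reduce_mean(m(t) ** 2)))
          for _ in range(5)]
print(losses)  # loss trajectory over five updates
```

In the paper's setting, the loss closure would itself open a nested `GradientTape` to obtain the acceleration needed by $\mathcal{L}\_{physics}$; TensorFlow supports such higher-order differentiation transparently.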

#### **Algorithm 1** Training Algorithm

**Require:** $t$, $x^\*$, $\dot{x}^\*$
**Ensure:** $\mathcal{L}\_{total} \to 0$
$n \leftarrow$ number of epochs
$\eta \leftarrow$ learning rate
$N \leftarrow$ batch size
$\lambda \leftarrow$ regularization
**while** $n > 0$ **do**
  $\hat{f}, \hat{x}, \hat{\dot{x}} \leftarrow \Phi^L(t; \mathbf{W}, \mathbf{b})$
  $\mathcal{L}\_{data}, \mathcal{L}\_{IC}, \mathcal{L}\_{physics} \leftarrow$ calculated using (6)–(8)
  $\mathcal{L}\_{total} \leftarrow \mathcal{L}\_{data} + \mathcal{L}\_{IC} + \lambda \mathcal{L}\_{physics}$
  $\mathbf{W}^\*, \mathbf{b}^\* \leftarrow \mathrm{Adam}(\eta, \mathcal{L}\_{total})$
  $\mathbf{W}, \mathbf{b} \leftarrow \mathbf{W}^\*, \mathbf{b}^\*$
  $n \leftarrow n - 1$
**end while**

**Figure 2.** (**a**–**c**) show the total, data, and physics losses with respect to the training epochs for one training instance.

#### **4. Results**

In this section, we share the findings obtained by performing various numerical experiments with our proposed architecture. We start by discussing the results for spring mass damper systems excited by different types of forces and initial conditions, and later conclude the section with the results of our experiments with the non-linear oscillator.

The data for training the neural network were generated by simulating Equation (1) using the ode45 routine of MATLAB [35] for different coefficients $\alpha$, $\beta$, $\delta$ and initial conditions $x\_0$, $\dot{x}\_0$; for all the simulations, $t \in [0, 50~\text{s}]$. A snapshot of a data instance generated by simulating the ODE is shown in Figure 3.
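An equivalent data-generation step in Python, replacing MATLAB's ode45 with SciPy's RK45 integrator, could look like the following sketch; the sample count and the example forcing function are our choices for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

def simulate_duffing(alpha, beta, delta, x0, v0, forcing, t_end=50.0, n=500):
    """Integrate Equation (1) to produce (t, x, xdot) training data."""
    def rhs(t, y):
        x, v = y
        # Equation (1) rearranged as a first-order system
        return [v, forcing(t) - delta * v - alpha * x - beta * x ** 3]
    t_eval = np.linspace(0.0, t_end, n)
    sol = solve_ivp(rhs, (0.0, t_end), [x0, v0], t_eval=t_eval, method='RK45')
    return sol.t, sol.y[0], sol.y[1]

# Example: the linear case (beta = 0) driven by f(t) = 3 cos(0.4 t)
t, x, v = simulate_duffing(0.9, 0.0, 0.2, 4.9, -2.2, lambda t: 3.0 * np.cos(0.4 * t))
print(x.shape)  # (500,)
```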

**Figure 3.** Training sample obtained by simulating Equation (1) with $\alpha = 0.4$, $\beta = 0.9$, $\delta = 0.5$ and initial conditions $x\_0 = 2.4$, $\dot{x}\_0 = 0.7$, subjected to the forcing function given by Equation (14) with $\gamma\_1 = 5$, $\gamma\_2 = 4$, $\omega\_1 = 0.4$, $\omega\_2 = 0.7$.

#### *4.1. Linear Case*

We convert Equation (1) to a linear ODE by setting $\beta = 0$. By further setting $\alpha = k/m$, $\delta = c/m$, and scaling the forcing function by $1/m$, the equation reduces to an equivalent spring mass damper system with mass $m$, damping coefficient $c$, and spring constant $k$. For the remainder of this section, we consider $m = 1$, $c = 0.2$, and $k = 0.9$, or equivalently $\alpha = 0.9$, $\delta = 0.2$. Finally, data are generated by solving the linear ODE subjected to sinusoidal, piece-wise, and step forcing functions.

The neural network was trained on generated data instances and was tested to determine if it can recover the forcing functions from these data. The following sections provide more details of the results that were obtained after training.

#### 4.1.1. Sinusoidal Function

To test whether the neural network can recover a smooth periodic function, we train it on the data generated by subjecting the spring mass damper system to a harmonic excitation given by

$$f(t) = \gamma \cos \omega t \tag{9}$$

with $x\_0 = 4.9$, $\dot{x}\_0 = -2.2$, $\omega = 0.4$, and $\gamma = 3$. The result obtained is promising and is shown in Figure 4a. It can be observed that the NN prediction and the actual function are in excellent agreement.

**Figure 4.** Agreement between the neural network predictions and the actual forcing functions: (**a**) sinusoidal; (**b**) sinusoidal with increased non-linearity and frequency; (**c**) sum of two sinusoids; (**d**) impulse, for Duffing's equation.

#### 4.1.2. Piece-Wise Function

Here, we subject the spring mass damper system to piece-wise forcing functions represented by equations,

$$f(t) = \begin{cases} 0 & 0 \le t < 5\\ t - 5 & 5 \le t < 10\\ \frac{5}{20}(30 - t) & 10 \le t < 30\\ 0 & 30 \le t \le 50 \end{cases} \tag{10}$$

and,

$$f(t) = \begin{cases} t^2 & 0 \le t < 10 \\ -2t + 120 & 10 \le t < 30 \\ \frac{1}{200}t^3 - \frac{1}{2}t^2 + \frac{25}{2}t & 30 \le t \le 50 \end{cases} \tag{11}$$

the former represents a triangular function, consisting of linearly increasing and decreasing functions at consecutive intervals and the latter is a combination of parabolic, linear, and cubic functions. The functions are characterized by abrupt variations in gradient magnitudes.
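For reference, Equations (10) and (11) can be implemented with `np.piecewise`, which also makes it easy to verify their continuity at the interval boundaries:

```python
import numpy as np

def f_triangular(t):
    """Equation (10): triangular pulse with linear rise and fall."""
    t = np.asarray(t, dtype=float)
    return np.piecewise(t,
        [t < 5, (t >= 5) & (t < 10), (t >= 10) & (t < 30), t >= 30],
        [0.0, lambda t: t - 5.0, lambda t: (5.0 / 20.0) * (30.0 - t), 0.0])

def f_mixed(t):
    """Equation (11): parabolic, linear, then cubic segments."""
    t = np.asarray(t, dtype=float)
    return np.piecewise(t,
        [t < 10, (t >= 10) & (t < 30), t >= 30],
        [lambda t: t ** 2,
         lambda t: -2.0 * t + 120.0,
         lambda t: t ** 3 / 200.0 - t ** 2 / 2.0 + 25.0 * t / 2.0])

# Both functions are continuous at the segment boundaries:
print(f_triangular(10.0), f_mixed(10.0))  # 5.0 100.0
```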

Figure 5b,c demonstrate our findings, where the network was trained on data instances generated by simulating the spring mass damper system with the initial conditions $x\_0 = 4.4$, $\dot{x}\_0 = -4.4$ and $x\_0 = 2.2$, $\dot{x}\_0 = -3.8$, subjected to the forces represented by Equations (10) and (11), respectively.

**Figure 5.** Agreement between the neural network predictions and the actual forcing functions: (**a**) sinusoidal; (**b**) combination of parabolic, linear, and cubic; (**c**) triangular; (**d**) step, for the spring mass damper system.

As observed, the actual forcing function and neural network predictions match very well; however, the network experiences some challenges in predicting the values at the cusp of both functions. For the triangular function, it under-predicts, and for the combination of linear, parabolic, and cubic it over-predicts. In addition, for the interval with a zero value of the function, marking the start of the triangular function, some oscillations are observed in the network predictions.

#### 4.1.3. Step Function

Having tested our architecture's ability to recover piece-wise forcing functions in the previous section, we now attempt a more challenging problem: determining whether our architecture can recover functions involving discontinuities. To answer this question, we train the neural network on data generated by simulating the spring mass damper system with $x\_0 = -3.4$, $\dot{x}\_0 = -2.4$ and the step function below,

$$f(t) = \begin{cases} 0 & 0 \le t < 5 \\ 10 & 5 \le t < 10 \\ -10 & 10 \le t < 30 \\ 0 & 30 \le t \le 50 \end{cases} \tag{12}$$

A step function is marked by constant values on specific intervals followed by sudden jumps in values at the point of transition.

Figure 5d shows our findings. It can be observed that the NN was able to recover the majority of the function from the data. The constant values of the step function are accompanied by oscillatory predictions. These oscillations resemble the Gibbs phenomenon, which is observed when approximating functions with jump discontinuities using the Fourier series.

#### *4.2. Non-Linear Case*

After our success with the linear ODE in the previous sections, we now evaluate the effectiveness of our proposed architecture on the non-linear ODE. We subject Equation (1) to different smooth forcing functions, namely sinusoidal, sum-of-two-sinusoids, and impulse functions. We share the details and findings of our numerical experiments in the following sections.

#### 4.2.1. Sinusoidal Function

We solve the governing Equation (1) by subjecting it to the sinusoidal forcing function given by,

$$f(t) = \gamma \cos \omega t \tag{13}$$

fixing $\alpha = 1$, $\delta = 1$, $\beta = 0.5$, $\gamma = 3$, $\omega = 0.4$ and initial conditions $x\_0 = 0.6$, $\dot{x}\_0 = 1.4$ to generate the response data.

Figure 4a demonstrates the performance of our network after training it on the generated data. It can be observed that the neural network was successful in accurately recovering the sinusoidal function from response data without much difficulty; nonetheless, there exists some instability at the starting point.

We now increase the difficulty by setting $\alpha = 0.6$, $\delta = 0.3$, $\beta = 1.2$, $\gamma = 3$, $\omega = 0.7$ and initial conditions $x\_0 = 0.2$, $\dot{x}\_0 = 0.7$. Note that the non-linearity $\beta = 1.2$ and the frequency $\omega = 0.7$ are increased relative to the previous case.

The findings for this case are shown in Figure 4b. Although the network predictions are in good agreement with the actual function for the most part, the neural network over-predicts at the peaks and valleys, with some instabilities present at the start of the function.

#### 4.2.2. Sum of Two Sinusoidal Functions

In this section, we attempt to recover a forcing function expressed as the sum of two sinusoidal functions,

$$f(t) = \gamma_1 \cos \omega_1 t + \gamma_2 \cos \omega_2 t \tag{14}$$

The neural network was trained on data generated by fixing *α* = 0.4, *β* = 0.9, *δ* = 0.5 and solving the Duffing equation subjected to Equation (14) with *γ*<sub>1</sub> = 5, *γ*<sub>2</sub> = 4, *ω*<sub>1</sub> = 0.4, *ω*<sub>2</sub> = 0.7 and the initial conditions *x*<sub>0</sub> = 2.4, *x*˙<sub>0</sub> = 0.7. This case is somewhat more difficult than the previous one involving just one sinusoidal function. The results are shown in Figure 4c. It can be seen that the neural network was able to predict the forcing function from the response data with acceptable accuracy. Although the network under-predicts and over-predicts at certain peaks and valleys of the function, overall the actual function and the predictions are almost perfectly aligned.

#### 4.2.3. Impulse Function Idealized by Normal Distribution

Finally, we evaluate the efficacy of our network in predicting an impulsive function expressed by a normal distribution,

$$f(t) = \frac{e^{-(t-\mu)^2/(2\sigma^2)}}{\sigma\sqrt{2\pi}}.\tag{15}$$

We set *μ* = 25 and *σ* = 2 in Equation (15) and produce the data by simulating Equation (1) with *α* = 0.4, *β* = 1.4, *δ* = 0.6 and *x*<sub>0</sub> = 1, *x*˙<sub>0</sub> = −2.2. This case tests whether the network can predict an impact excitation force acting on a non-linear oscillator. For smoothness, the impact force was idealized by a normal distribution. As seen in Figure 4d, the network predictions start with numerical oscillations that later damp out and, for the most part, are in good alignment with the actual forcing function used for generating the data.
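The Gaussian bump of Equation (15) is straightforward to generate; the sketch below (the sampling grid and variable names are our own choices) confirms that it behaves like a smooth unit impulse centered at *t* = *μ*:

```python
import numpy as np

# Gaussian-bump idealization of an impact force, Equation (15), with mu = 25, sigma = 2.
mu, sigma = 25.0, 2.0

def impulse(t):
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

t = np.linspace(0.0, 50.0, 5001)
f = impulse(t)
area = f.sum() * (t[1] - t[0])  # Riemann sum: the bump integrates to ~1, like a unit impulse
```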

#### **5. Discussion**

In Section 4, we shared findings that demonstrate the effectiveness of our proposed neural network approach for solving the inverse source problem of dynamic load identification by incorporating physics information. The neural network structure and the best-performing hyperparameter choices were obtained through systematic hyperparameter tuning. The network was then trained on different data instances generated by simulating spring-mass-damper systems subjected to different types of forcing functions, including smooth functions and functions with abrupt changes in gradient or jump discontinuities.
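As a rough sketch of how such a physics-informed objective can be assembled for the linear spring-mass-damper case *mx*¨ + *cx*˙ + *kx* = *f*(*t*), the snippet below penalizes the ODE residual of a candidate force prediction. Finite differences stand in for automatic differentiation, and all names and weights are illustrative rather than the paper's actual implementation:

```python
import numpy as np

# Physics-informed residual for a spring-mass-damper system m*x'' + c*x' + k*x = f(t)
# (illustrative parameter values; not the paper's exact setup).
m, c, k = 1.0, 0.5, 4.0

def physics_residual(t, x, f_pred):
    """Residual of the governing ODE on a uniform grid, with derivatives
    of the measured response taken by central finite differences."""
    dt = t[1] - t[0]
    v = np.gradient(x, dt)  # first derivative of the response
    a = np.gradient(v, dt)  # second derivative of the response
    return m * a + c * v + k * x - f_pred

def loss(t, x, f_pred, f_data=None):
    """Mean-squared physics loss; a data term can be added when labeled forces exist."""
    physics = np.mean(physics_residual(t, x, f_pred) ** 2)
    data = 0.0 if f_data is None else np.mean((f_pred - f_data) ** 2)
    return physics + data
```

A force prediction consistent with the governing equation drives the physics term toward zero, while an inconsistent one is penalized, which is what steers the network toward the true forcing function even without force labels.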

The neural network predictions were excellent in all cases, with slight overshoot and undershoot at the cusps of both piece-wise functions and some small oscillations at the start of the triangular function. However, some numerical oscillations resembling the Gibbs phenomenon were observed in the step function predictions. These findings suggest that the proposed neural network approach is effective in predicting different types of forcing functions.

The study also tested the ability of the network to recover the forcing functions of non-linear oscillators from the data. The data were generated by solving Duffing equations subjected to various smooth forcing functions. The network was able to predict sinusoidal functions, including a case with increased non-linearity and frequency, with small amounts of instability and minor overshoot and undershoot at the peaks and troughs of the periodic functions.

Finally, the network was used to predict a function given by the sum of two sinusoidal functions and an impulse function, and in both cases the predictions were in close agreement with the actual functions. However, some under- and over-predictions with small oscillations were observed at the peaks and valleys in the case of the sum of two sinusoidal functions. In the case of the impulse forcing function, numerical oscillations were observed at the start that damped out in the later stages of the predictions.

Overall, the findings of this study demonstrate that the proposed technique works well in predicting a variety of forcing functions from response data, although only smooth functions were considered in the case of non-linear oscillators. Additionally, the analysis was based on noise-free simulated data for training the neural network. Future studies should investigate the effectiveness of this technique on real-world data with noise and compare its performance with other established techniques for dynamic load identification.

#### **6. Conclusions**

In this paper, we presented an approach for solving the dynamic load identification problem using neural networks and physics information. We began our analysis by testing the efficacy of our architecture in recovering the forcing functions of a spring-mass-damper system and then extended it to non-linear oscillators.

In our analysis of the spring-mass-damper system, we trained our neural network to recover different types of functions from the data and found that it was able to recover them without much difficulty. We then did the same for non-linear ODEs, focusing primarily on smooth functions, and observed that our method was able to recover almost all of the functions, albeit with minor numerical oscillations in places.

Though this work focused predominantly on predicting the source terms of ODEs with mechanical systems in mind, we believe the approach has the potential to be applied to any system where the interest lies in recovering the source term from response data.

In the future, this work can be extended by testing whether the architecture can recover discontinuous forcing functions of non-linear ODEs and whether similar predictions can be made from data corrupted by noise. A similar study could also be undertaken to recover both smooth and discontinuous forcing functions in multi-degree-of-freedom systems. Another possibility is to test whether our neural network architecture can predict the forcing function from a single set of data, i.e., displacement or velocity alone.

Finally, this work focused on recovering the forcing function of a specific ODE from its data instances. However, a surrogate model could also be developed that, after training on large data sets, predicts the forcing function for any instance of a linear or non-linear ODE from its response time histories and initial conditions.

**Author Contributions:** Methodology, implementation, and writing of manuscript—S.A.S.; supervision, review and editing—H.C. and T.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data that support the findings of this study are available from the corresponding author upon request.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
