1. Introduction
Structural health monitoring (SHM) has emerged as a valuable tool for assessing structural performance under ambient and forced vibrations in either laboratory or field environments, especially for monitoring civil infrastructure. Key civil infrastructure systems such as bridges are often equipped with SHM systems to monitor traffic, wind, and other environmental loading as well as natural hazard events such as earthquakes. In bridge monitoring, most SHM systems use wireless sensor networks or even wired mechanical sensors that may interfere with traffic and, in practice, provide data at only a limited number of locations. Their installation and implementation are also challenging, especially at locations such as under bridges, bridges over water, or highly elevated bridges. The high sampling rates typically used by these sensors also demand more data processing to capture the low vibration frequencies of flexible or long-span bridges. Moreover, when SHM is considered for reconnaissance efforts, a rapid evaluation must reveal whether a bridge can remain open or must be closed after an earthquake; this decision strongly depends on the bridge’s condition in the aftermath. Such quick assessments are challenging with current monitoring and inspection procedures, which cannot capture the global behavior of the bridge during earthquakes and are further incapable of tracking the progress of more permanent forms of damage such as plastic deformation, rotation, and drift.
Advanced SHM technologies using cameras, i.e., vision-based technology, offer advantages over their SHM counterparts [
1,
2,
3,
4,
5]. A larger field of view, which depends on the camera type and setting, enables the monitoring of a broader bridge area and provides options for assessing the deck and bent of the bridge in a single setting. Low camera sampling rates also allow for the dynamic characterization of a bridge under excitations ranging from low-amplitude ambient vibration to high-amplitude earthquakes. The global response of bridges recorded at the bent, deck, base, and abutment levels generates valuable data for estimating permanent damage to bridges. Moreover, an intelligent robotic system provided by an unmanned aerial vehicle (UAV) or drone, integrated with camera technology, is capable of recording bridges and other infrastructure during earthquakes or other vibration events; it does so without disturbances from ground movement due to its airborne operation [
6,
7,
8,
9,
10].
The vast development of computer vision (CV) algorithms greatly supports both laboratory and field implementations of vision-based SHM. Tracking algorithms using well-known template matching techniques such as digital image correlation (DIC) have proven to be among the most effective procedures for generating structural displacements, strain maps, and other structural dynamic characteristics [
3,
4,
11,
12]. As this technique requires marking the surface of the specimen with spray-painted speckle patterns or artificial targets, it is challenging to implement on real-life structures, especially for long-term monitoring. Therefore, advances in CV feature detection, extraction, and tracking algorithms offer advantages in object recognition using the natural features of these structures. Multiscale feature detection and description algorithms such as the scale-invariant feature transform (SIFT) [
13], speeded-up robust features (SURF) [
14], and KAZE [
15] algorithms are among the most popular, demonstrating high repeatability and distinctiveness under various forms of image transformation such as noise and blurring. Matching algorithms, such as the greedy nearest neighbor, optimal fair or full, or exact algorithm, are selected to match features between images depending on the matching goal. The returned matching features may not be exact; therefore, it is necessary to filter the outliers using an optimization algorithm such as random sample consensus (RANSAC) or its variants, namely, the M-estimator SAC (MSAC) [
16], progressive SAC (PROSAC) [
17], or maximum likelihood estimator SAC (MLESAC) [
18] algorithm, the selection of which is based on accuracy, speed, robustness, or optimality [
19].
Feature detection and description on digital images is performed primarily to extract specific features, depending on which descriptor is used. Among the available detectors, such as edge or corner detectors, the blob-based detector is the simplest method; it analyzes the shape features of objects in the image that contrast with their background in color, brightness, or other properties. The SURF and SIFT algorithms are among the frequently used operators that can extract blob-based features in an image and have been implemented in bridge SHM [
20,
21]. Meanwhile, the KAZE algorithm is not commonly selected in bridge monitoring; however, previous work has reported its implementation in wind turbine monitoring [
22]. Their matching accuracies, i.e., the number of tracked and matched features after the detection and extraction steps, have barely been reported in previous works. Therefore, the objectives of this study are to explore the effect of refined matching algorithms on blob-based features in improving their accuracies and to implement the proposed algorithms on large-scale bridges tested under seismic loads using vision-based SHM. The major contributions of this study are a detailed procedure that examines the impact of selecting detection and matching operators to improve blob-based feature performance and a CV-oriented procedure for seismic SHM using vision-based sensors. The remainder of this paper is organized as follows. In
Section 2, the selected blob-based feature detectors and the matching and refined matching algorithms are explained. The testing setup of the two-span accelerated bridge construction (ABC) bridge used to implement the proposed algorithms is given in
Section 3. Their results and important constraints are presented and their accuracies verified in
Section 4, and conclusions are drawn in
Section 5.
2. Methodology
Computer vision algorithms cover a wide range of operators that can detect, extract, and match features within image sequences recorded from tests. Their general procedures are adopted in this study, as shown in
Figure 1, together with the selected operators. They start by detecting blob-based features and continue with the matching procedure. Then, the proposed refined matching algorithm is applied to improve the number of correctly matched pairs. In
Figure 1, three blob-based feature detectors, i.e., SURF, SIFT, and KAZE, are used to detect blob features in the test images. These features are then matched with the nearest neighbor (NN) algorithm. Four operators are proposed to return more exact matches, i.e., the least median of squares (LMEDS), least trimmed square (LTS), random sample consensus (RANSAC), and M-estimator sample consensus (MSAC) algorithms. These algorithms have commonly been used in distance or matching problems in previous works [
23,
24,
25,
26,
27,
28]. The details of the proposed procedure are provided in the next subsections.
2.1. Blob-Based Feature Detection Algorithm
In CV image registration, there are five general stages of relating image-sequence characteristics so that the image datasets can be transformed into a single unified coordinate system: feature detection and description, matching, refined matching to filter outliers, image transformation, and reconstruction. Feature detectors are algorithms that detect features in an image in the form of edges, corners, lines, blobs, etc. The three blob feature detectors selected in this study are SURF, SIFT, and KAZE. The SURF [14] detector is based on the determinant of the Hessian matrix $\mathcal{H}(\mathbf{x}, \sigma)$, as shown in Equation (1), at point $\mathbf{x} = (x, y)$ and scale $\sigma$:

$$\mathcal{H}(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix} \quad (1)$$

It uses integral images to improve the algorithm speed, relying on Gaussian scale-space analysis, where $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the Gaussian second-order derivative $\frac{\partial^2}{\partial x^2} g(\sigma)$ with the image $I$ at point $\mathbf{x}$, and likewise for $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$. Specific bin dimensions are used to describe the detected blob features with Haar wavelet distributions within certain regions. The descriptor dimension can be extended to 64-D or even 128-D depending on the change of image perspective.
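The essence of the determinant-of-Hessian detection in Equation (1) can be reproduced in a few lines of NumPy/SciPy. This is an illustrative sketch of the blob response at a single scale, not the SURF box-filter approximation (which additionally uses integral images for speed):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_blob_response(image, sigma):
    """Determinant-of-Hessian blob response at a single scale sigma."""
    L = gaussian_filter(image.astype(float), sigma)
    # Second-order derivatives of the Gaussian-smoothed image
    Lx = np.gradient(L, axis=1)
    Ly = np.gradient(L, axis=0)
    Lxx = np.gradient(Lx, axis=1)
    Lyy = np.gradient(Ly, axis=0)
    Lxy = np.gradient(Lx, axis=0)
    # det(H) peaks at blob centers whose size matches sigma
    return Lxx * Lyy - Lxy ** 2
```

A full detector would evaluate this response over a pyramid of scales and keep the local maxima.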
The most well-known feature descriptor is SIFT [13], which is based on the difference of Gaussians (DoG) operator, i.e., an approximation of the Laplacian of Gaussian. Local maxima of the DoG response are searched across an image to detect feature points at various scales. SIFT is strongly invariant to image scale and rotation with moderate affine variations; however, it has a high computational cost, whereas SURF has a low computational cost compared to SIFT. Equation (2) describes the convolution of the difference $D(x, y, \sigma)$ between two Gaussians at different scales with the image $I(x, y)$:

$$D(x, y, \sigma) = \left( G(x, y, k\sigma) - G(x, y, \sigma) \right) * I(x, y) \quad (2)$$

in which $G(x, y, \sigma)$ corresponds to the Gaussian function.
The KAZE detector is also computed at multiple scale levels, based on the normalized determinant of the Hessian matrix. It uses nonlinear diffusion filtering, which benefits the processing of blurred images as it reduces noise while simultaneously preserving region boundaries in the subject images. A moving window is used to select the maxima of the detector response, and a dominant orientation is found in a circular region around each detected feature. Similar to SIFT and SURF, KAZE features are invariant to rotation and scale and are robust to limited affine transformations. The detector is more distinctive at varying scales, which, however, increases the computational time. Equation (3) shows the typical nonlinear diffusion formula with divergence operator $\mathrm{div}$, conductivity function $c(x, y, t)$, gradient operator $\nabla$, and image luminance $L$:

$$\frac{\partial L}{\partial t} = \mathrm{div}\left( c(x, y, t) \cdot \nabla L \right) \quad (3)$$
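A minimal explicit discretization of Equation (3) (a Perona-Malik-type scheme of the kind KAZE builds its nonlinear scale space on) illustrates the edge-preserving smoothing. The conductivity function, contrast parameter k, and step size are illustrative choices, not KAZE's exact settings:

```python
import numpy as np

def diffuse(L, k=10.0, dt=0.2, steps=20):
    """Explicit nonlinear diffusion: dL/dt = div(c(x, y, t) * grad L)."""
    L = L.astype(float).copy()
    for _ in range(steps):
        # Differences toward the four neighbors (periodic borders via np.roll, for brevity)
        dN = np.roll(L, -1, axis=0) - L
        dS = np.roll(L, 1, axis=0) - L
        dE = np.roll(L, -1, axis=1) - L
        dW = np.roll(L, 1, axis=1) - L
        # Conductivity c = exp(-(|grad L| / k)^2): small across strong edges,
        # so noise in flat regions is smoothed while boundaries are preserved
        cN = np.exp(-(dN / k) ** 2)
        cS = np.exp(-(dS / k) ** 2)
        cE = np.exp(-(dE / k) ** 2)
        cW = np.exp(-(dW / k) ** 2)
        L += dt * (cN * dN + cS * dS + cE * dE + cW * dW)
    return L
```

Running this on a noisy step image suppresses the noise while keeping the step contrast, which is exactly the property that benefits blurred test images.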
2.2. Matching Algorithm
Following blob-based feature detection, the next step is to search for the most similar matches of the blob-based features within the image sequences. This is the most computationally expensive segment of many computer vision algorithms, as it involves searching for and tracking the most similar matches among high-dimensional vectors. Therefore, a robust yet efficient algorithm is required to perform fast searching and tracking in such large datasets. Nearest neighbor (NN) is selected in this study as it provides computational speedups of several orders of magnitude. The NN problem consists of pre-processing a set of points $P$ such that the operation in Equation (4) can be performed efficiently. Equation (4) describes the NN search in a metric space. It is defined as follows: for a set of feature points $P = \{p_1, \ldots, p_n\}$ in a metric space $(X, d)$ with a query point $q \in X$, the element $\mathrm{NN}(q, P)$ is searched such that the match is the closest to $q$ with respect to the metric distance $d$:

$$\mathrm{NN}(q, P) = \underset{p \in P}{\arg\min}\ d(q, p) \quad (4)$$
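Equation (4) amounts to an argmin over descriptor distances. A brute-force NumPy version makes this concrete (practical matchers use approximate k-d tree or hierarchical search for speed; the function name here is an assumption for illustration):

```python
import numpy as np

def nearest_neighbors(query, points):
    """For each query descriptor q, return the index of its nearest neighbor
    in `points` under the Euclidean metric d, i.e., argmin_p d(q, p)."""
    # Pairwise distance matrix between query and candidate descriptors
    d = np.linalg.norm(query[:, None, :] - points[None, :, :], axis=2)
    return np.argmin(d, axis=1), np.min(d, axis=1)
```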
2.3. Proposed Refined Matching Algorithm
It is known that working with high-dimensional features often produces incorrect searches and matches. Many practical applications of the NN algorithm return approximate matches with some outliers, meaning that some results are estimates that are nevertheless close to the exact matches. Therefore, in image registration, NN is commonly just one part of a complete procedure combined with other CV algorithms that contain further approximations. The refined matching procedures selected in this study to filter the outliers are the least median of squares (LMEDS) [
29], least trimmed square (LTS) [
30], random sample consensus (RANSAC) [
31], and M-estimator sample consensus (MSAC) [
18] algorithms. Both the LMEDS and LTS estimators are common robust regression estimators. Considering a sample with $n$ observations consisting of inliers and outliers, LMEDS is the estimate that minimizes the median of the squared residuals, while LTS is the estimate that minimizes the trimmed (smallest) residual sum of squares. The LMEDS estimator obtains the parameters by solving a nonlinear minimization problem that reduces the median of the squared standardized residuals over the entire dataset. Meanwhile, LTS consists of finding a subset of cases whose deletion from the dataset would lead to the regression with the smallest residual sum of squares. It is regression-, scale-, and affine-equivariant and is used as a general-purpose high-breakdown method.
The RANSAC algorithm is widely used to estimate a unique transformation by random sampling. A candidate transformation is estimated, and its adequacy with respect to the rest of the data is then tested; the transformation that yields the largest consensus set is kept. RANSAC is a robust and fast algorithm, yet several important parameters need to be set in the analysis. The RANSAC step relates to the notion of a minimal sample set, in which the initial set is randomly selected from the input. It is followed by computing the model parameters, after which RANSAC requires a threshold that is used iteratively to check which observations of the entire dataset are consistent with the hypothetical model; it specifies the maximum distance from a point of interest to the hypothetical model. When a point meets this criterion, it is treated as a hypothetical inlier; otherwise, it is treated as an outlier. The estimated model is accepted when an adequate number of points classified as exact observations (inliers) is reached. MSAC is a generalization of the RANSAC estimator that is principally used to robustly estimate relations from point correspondences. MSAC implements the same sampling approach as RANSAC to generate candidate solutions; however, it scores each candidate by a bounded loss of the residuals rather than just the number of inliers.
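The difference between RANSAC's inlier counting and MSAC's bounded-loss scoring can be illustrated on a translation model, where a single correspondence forms the minimal sample set. This is a simplified sketch under assumed parameters (threshold, iteration count), not the estimator configuration used in the study, which applies these operators to the full image transformation:

```python
import numpy as np

def refine_matches(pts1, pts2, thresh=2.0, iters=200, msac=True, seed=0):
    """Robustly estimate a translation between matched points, flagging inliers."""
    rng = np.random.default_rng(seed)
    best_score, best_inliers = np.inf, None
    for _ in range(iters):
        # Minimal sample set: one correspondence defines a translation hypothesis
        i = rng.integers(len(pts1))
        t = pts2[i] - pts1[i]
        residual = np.linalg.norm(pts2 - (pts1 + t), axis=1)
        if msac:
            # MSAC: bounded loss -- inliers scored by residual, outliers by the threshold
            score = np.minimum(residual, thresh).sum()
        else:
            # RANSAC: score is the number of outliers (equivalently, maximize inliers)
            score = float((residual >= thresh).sum())
        if score < best_score:
            best_score, best_inliers = score, residual < thresh
    # Re-estimate the translation from the consensus set
    t = (pts2[best_inliers] - pts1[best_inliers]).mean(axis=0)
    return t, best_inliers
```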
3. Testing Setup
The proposed methods in
Section 2 are experimentally evaluated on a two-span accelerated bridge construction (ABC) bridge through a shake-table test. ABC is a construction approach that uses state-of-the-art design, materials, and construction methods to build bridges safely and cost-effectively. ABC not only improves site constructability and shortens project construction schedules but also reduces impacts on traffic and weather-related project delays. The tested ABC bridge shown in
Figure 2 is a one-third-scale, two-span reinforced concrete bridge with seismic connections. The length of each span (in the longitudinal direction) is 10.4 m, with a 3.4 m width in the transverse direction. It has a two-column pier at midspan, with the columns spaced at 1.8 m and a height of 2.1 m. The bridge sits on seat-type abutments at both ends. The bidirectional ground motion recorded during the 1994 Northridge earthquake was scaled according to a design-level (DE) seismic demand. Three tests with seismic intensity increasing from 20% up to 75% of the DE level are selected to investigate the effect of each applied algorithm on the feature matching as well as the seismic response of the ABC bridge system. More details about the design, construction, and studies of the tested ABC bridge can be found in [
32,
33]. The bridge is placed on three biaxial shake tables manufactured by MTS. Each table measures 4.3 × 4.5 m with a stroke of ±300 mm and can reach a peak velocity of 1000 mm/s and an acceleration of 1 g with a 50-ton payload (about 445 kN). All three tables are constrained to act together as a single large table, with the option to be operated individually with independent motions depending on the loading requirements. The specification of the SHM system is given in
Table 1. The camera lens is 35 mm with a CMOS sensor. The image is set to monochrome to accelerate the subsequent image processing using the proposed algorithms. The camera’s full ROI is used, with 2560 × 2048 image pixels and a record duration of 30 s at a 30 Hz sampling rate, i.e., 30 frames per second. The reference sensor,
as shown in
Figure 2c, is a string potentiometer sampled at 256 Hz; therefore, the displacement data must later be adjusted to enable the comparison.
Vision-based SHM of the ABC bridge has been previously reported in [
3]. The difference between this study and previous works lies in the methods, i.e., the CV algorithms applied to the image sequences, which in turn affect the extracted seismic data. Prior work required a stereophotogrammetry technique to generate the three-dimensional coordinates of the features shown in
Figure 2d. Then, these features were tracked using tracking algorithms to detect the change in each feature coordinate within the image sequences. Therefore, previous work required both a set of photogrammetry images and the test images, involving two steps in processing the vision-based data. In contrast, this work uses the test images directly, without the prerequisite of taking photogrammetric images. The proposed algorithms are applied directly to the test images, and the displacement data are generated using the scale factor method [
34].
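The scale factor method converts image-plane displacements from pixels to engineering units using a dimension of known physical length visible in the frame. A minimal sketch follows; the numerical values are illustrative, not the calibration values used in the test:

```python
def pixel_to_mm(pixel_displacement, known_length_mm, known_length_px):
    """Convert an image-plane displacement from pixels to millimetres using a
    planar scale factor (valid for in-plane motion at the target's depth)."""
    scale = known_length_mm / known_length_px   # mm per pixel
    return pixel_displacement * scale

# e.g., a 1800 mm column spacing spanning 600 px gives 3 mm/px,
# so a 12.5 px peak displacement corresponds to 37.5 mm
```

For the comparison against the 256 Hz string potentiometer, the converted 30 Hz displacement history must additionally be aligned in time with the reference record.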