2.1. Characterization of Laboratory Samples
Cracks are fractures or discontinuities that occur in building materials such as concrete, metals, rocks and other solids. They appear as a result of structural stresses, temperature changes, chemical reactions or mechanical damage. In size, cracks range from microcracks to large voids and can greatly reduce the strength and durability of a material. Cracks are mostly random and often unpredictable, and they are therefore difficult to model and evaluate. Typically, they develop as single lines or as networks of lines forming a grid, breaking the uniformity of the surface and internal structure of concrete. The appearance and development of cracks in the concrete of building structures can lead to surface defects, structural disturbances, water penetration, reduced thermal and acoustic performance and many other consequences, up to loss of bearing capacity and destruction. Therefore, in order to prevent and mitigate the detrimental effect of cracks on the characteristics of concrete and products made from it, it is necessary to be able to analyze and predict the causes and mechanisms of their formation [46].
The characteristics of the CEM I 52.5 N cement used in the study (Lipetskcement, Lipetsk, Russia) are presented in
Table 1.
The chemical composition of microsilica MS-85 [
47,
48,
49,
50] (Novolipetsk Iron and Steel Works, Lipetsk, Russia) used as a modifier for hardened cement paste samples is shown in
Table 2.
The largest proportion of microsilica was made up of particles ranging in size from 5 to 25 μm.
Glass fiber pretreated with a surfactant (Armplast, Nizhny Novgorod, Russia) was used as a reinforcing component.
Table 3 shows the characteristics of the fiber.
The characteristics of the components of the samples were provided by the manufacturers.
In the study, cube samples of hardened cement paste with an edge length of 100 mm were made in three modifications:
- (1) Control: cement and water (25% by weight of cement);
- (2) Control + GF: cement, water (26% by weight of cement) and glass fiber (GF) (3% by weight of cement);
- (3) Control + GF + MS: cement, water (28% by weight of cement), microsilica (10% by weight of cement) and glass fiber (3% by weight of cement).
These compositions were selected based on the normative proportions for the preparation of cement paste of standard consistency [51], as well as on previous studies [52] that examined the structure of hardened cement paste containing microsilica and glass fiber and established the optimal dosages of these components.
For each modification, six samples were made, for a total of 18 samples.
The samples of the control composition were produced as follows: all raw materials were dosed; cement was mixed with water in a mixer for 30 s; the mixer was stopped, the cement paste was scraped from the walls into the total mass, and mixing continued for another 90 s; the paste was placed into the mold and compacted on a vibrating platform for 60 s. The mold with the samples was kept for one day, after which the samples were demolded and stored in a water bath in the “forming surface up” position for 27 days, arranged so that they did not touch each other and with the water level at least 20 mm above the samples. The water temperature during storage was maintained at (20 ± 1) °C. For the control + GF and control + GF + MS compositions, the dosed components (cement, fiber and microsilica) were first mixed dry, then the mixing water was introduced and mixing proceeded in the same way. The molding and storage conditions for the control + GF and control + GF + MS formulations were the same as for the control formulation.
Three samples of each modification were subjected to static loading on a P-50 press (PKC ZIM, Armavir, Russia). A load of 10 tons was applied to each of these samples and held for 30 s, after which the load was reduced to zero and the samples were sawn into three equal parts. The load was set at about 50% of the expected failure load: the aim was to simulate the onset of crack formation and development in the structure of the samples. The magnitude of the applied load affects the number and proportion of defects in the samples but does not affect the results of detecting these cracks and calculating their areas. From the central third (middle) of each sample, a specimen was cut for microstructure examination using an electron microscope (Carl Zeiss Microscopy, Jena, Germany) at magnifications of 100–4000 times. The remaining three samples of each modification were tested on the press until failure to determine the compressive strength.
Figure 1 shows cracks on a sample of hardened cement paste that appeared after the 10-ton load was applied. Such defects can also occur when production technology is violated, transportation is improper or storage conditions are not observed. Identification of defects of this kind is one of the key criteria in studying the structure of materials, products and structures.
To automate the detection of cracks and the calculation of their areas during visual inspection, it is proposed to develop a computer vision algorithm that performs the functions of a process engineer.
2.2. Development of an Intelligent Algorithm Based on a Convolutional Neural Network
There are a number of image processing technologies besides CNNs, for example traditional techniques such as filters, transforms and classical computer vision algorithms. Such methods require specialized image processing skills and fine-tuning of parameters for each individual task. The CNN, in contrast, is currently the most versatile and fast way to process and analyze images.
U-Net is an architecture developed in 2015 to solve the problem of biomedical image segmentation (
Figure 2).
U-Net is a neural network architecture built from convolutional layers and poolings. Convolutional layers are non-linear transformations of input matrices, obtained by applying non-linear activation functions to linear combinations of elements of the original matrix selected in a specific order. Poolings, in turn, aggregate information from the original matrix, for example by taking the maximum over a subset of its elements. The input and output of both operations is a matrix, and both operations can either decrease or increase the dimensions of the matrix. The trained parameters of the neural network are the coefficients of the linear combinations of elements of the original matrix, which are contained in the so-called convolution kernels. U-Net is thus a sequence of transformations of the original matrix, which describes the image pixel by pixel, obtained by alternately applying convolutions and poolings; these transformations are usually considered in terms of an encoder and a decoder.
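For illustration, a minimal sketch of these two operations is given below. PyTorch is assumed here as the framework, since the text does not name one; the tensor sizes are chosen purely for the example.

```python
# A minimal sketch of the two building blocks described above,
# using PyTorch (the framework is our assumption; the paper does not name one).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)  # batch of one 3-channel 256x256 image

# Convolution: a linear combination of neighboring pixels (the convolution
# kernel holds the trained coefficients) followed by a non-linear activation.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
activation = nn.ReLU()

# Pooling: aggregation by taking the maximum over each 2x2 subset of elements.
pool = nn.MaxPool2d(kernel_size=2)

y = pool(activation(conv(x)))
print(y.shape)  # torch.Size([1, 64, 128, 128]) - spatial size halved
```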
The U-Net neural network consists of two consecutive blocks: an encoder and a decoder. The task of the encoder is to find a low-dimensional description of the original image in a latent space of high-level features. The encoder selects the most significant features of the image for the task under consideration and encodes them in a non-interpretable way. The task of the decoder is to decipher the features generated by the encoder and to determine, based on these latent high-level features, which areas of the original image should be selected as segmentation elements. The encoder of the U-Net model is a sequence of convolutional layers that progressively compress the original image by a factor of $2^n$, where $n$ is the so-called encoder depth. In our work, $n = 5$; this value is often used in practice, since it has been found empirically to be optimal for a wide range of problems. The decoder is a mirror image of the encoder: it consists of the same number of convolutional layers and poolings, applied in reverse order, so that an image of the original size is obtained at the output. The convolutional blocks of the decoder and encoder are connected by skip connections, which helps to deal with the vanishing gradient problem that affects many deep neural networks, especially in computer vision [53]. If a large dataset is available and there are no restrictions on computing resources, the encoder and decoder are trained jointly from scratch; that is, they are initialized with random weights. When training data are scarce, the transfer learning technique is used instead. The idea is to take an encoder from another neural network trained to solve a multi-class classification problem on a large dataset. This approach provides an encoder that can already extract high-quality high-level features, after which only the decoder needs to be trained to solve the segmentation problem based on these features. In our work, we used an encoder from the ResNet-50 neural network, pre-trained on the ImageNet dataset (
https://image-net.org/, accessed on 10 August 2023). The data are available to researchers free of charge for non-commercial use.
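A minimal sketch of this transfer-learning setup is given below. The segmentation_models_pytorch library is used here as an assumption for illustration; the paper does not state which implementation was actually used.

```python
# Sketch of a U-Net with a pre-trained ResNet-50 encoder.
# The segmentation_models_pytorch library is our assumption for illustration;
# the paper does not state which implementation was actually used.
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet50",      # encoder taken from ResNet-50
    encoder_weights="imagenet",   # pre-trained on the ImageNet dataset
    encoder_depth=5,              # n = 5, as in the paper
    in_channels=3,                # RGB input images
    classes=1,                    # one output channel: crack vs. background
)
```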
At the output of the U-Net CNN, we obtain an image of the original size in which each pixel is assigned a class. In our case, there are only two classes: defect (crack) and background.
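As a sketch of how the network output can be turned into a per-pixel class map and a crack-area estimate, consider the following; the 0.5 threshold is our assumption, `model` reuses the previous sketch, and `image_batch` denotes a hypothetical preprocessed input tensor.

```python
# Sketch: converting the network output into a binary crack mask and a
# crack-area estimate. The 0.5 threshold is our assumption; model reuses the
# previous sketch, and image_batch is a hypothetical preprocessed input tensor.
import torch

with torch.no_grad():
    logits = model(image_batch)          # shape: (1, 1, H, W)
    probs = torch.sigmoid(logits)        # per-pixel crack probability
    mask = (probs > 0.5).float()         # 1 = crack, 0 = background

crack_pixels = mask.sum().item()
total_pixels = mask.numel()
print(f"Crack area: {100 * crack_pixels / total_pixels:.2f}% of the image")
```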
In this study, two models based on the U-Net neural network architecture are considered:
- (1) Model 1 is a U-Net CNN with probabilistic augmentation; that is, the following transformations are applied to each batch sample during training: random cropping of images; rotation by 90°, vertical flip, horizontal flip or rotation by a random angle, each with a probability of 0.75; and addition of Gaussian noise sampled from a normal distribution with a probability of 0.7 (see the sketch after this list). No augmentation is applied to the validation and test sets; their preprocessing is reduced to applying paddings where necessary. This approach minimizes the negative effects of overfitting as well as the computational resources required for training the model.
- (2) Model 2 is a U-Net CNN whose input is a set of 1000 images, divided in the ratio 70/20/10 into training, validation and test sets, created using the authors’ augmentation code [45].
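The probabilistic augmentation pipeline for model 1, referenced above, could look as follows. The albumentations library and the 256 × 256 crop size are our assumptions for illustration; the probabilities follow the values given in the text.

```python
# Sketch of the probabilistic augmentation for model 1 described above.
# The albumentations library and the 256x256 crop size are our assumptions;
# the probabilities follow the values given in the text.
import albumentations as A

train_transform = A.Compose([
    A.RandomCrop(height=256, width=256),   # random cropping
    A.RandomRotate90(p=0.75),              # rotation by 90 degrees
    A.VerticalFlip(p=0.75),                # vertical flip
    A.HorizontalFlip(p=0.75),              # horizontal flip
    A.Rotate(limit=180, p=0.75),           # rotation by a random angle
    A.GaussNoise(p=0.7),                   # Gaussian noise
])

# Applied per sample inside each training batch:
# augmented = train_transform(image=image, mask=mask)
```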
When preparing data for input to both models, it is necessary to carry out image marking (image annotation), one of the key stages in creating an effective computer vision system. This process converts the information into a format that the image analysis algorithm can understand. During annotation, the original image is supplemented with metadata on the location of each defect identified by an expert technologist [
54,
55].
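A sketch of how such polygon annotations can be rasterized into binary training masks is given below. The JSON field names follow the VIA 2.x export format as we understand it and should be treated as an assumption to verify against the actual export; the file name and image dimensions are hypothetical.

```python
# Sketch: converting a VGG Image Annotator (VIA) polygon annotation into a
# binary mask for training. The JSON field names follow the VIA 2.x export
# format as we understand it; treat them as an assumption to verify.
import json
import numpy as np
from PIL import Image, ImageDraw

def via_regions_to_mask(record: dict, width: int, height: int) -> np.ndarray:
    """Rasterize all polygon regions of one VIA record into a 0/1 mask."""
    mask = Image.new("L", (width, height), 0)
    draw = ImageDraw.Draw(mask)
    for region in record["regions"]:
        shape = region["shape_attributes"]
        if shape["name"] == "polygon":
            points = list(zip(shape["all_points_x"], shape["all_points_y"]))
            draw.polygon(points, outline=1, fill=1)  # crack pixels -> 1
    return np.array(mask, dtype=np.uint8)

# Usage (file name and image size are hypothetical):
# annotations = json.load(open("via_annotations.json"))
# for record in annotations.values():
#     mask = via_regions_to_mask(record, width=1024, height=768)
```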
Figure 3 shows an image annotated with the VGG Image Annotator. It is worth noting that a number of images exhibit a “blur” effect (visible in the area of crack No. 4), which will make the developed algorithm robust to this factor when it is used in real conditions.
An important step in the implementation of a convolutional neural network of the U-Net architecture is the selection of parameters during its training. The main parameters for both models are presented in
Table 4.
The models were trained by optimizing the Dice Loss function. This loss function is based on a segmentation quality metric called the Dice coefficient (1): twice the number of pixels with a correctly identified segmentation mask, divided by the sum of the number of pixels identified as segmentation elements by the model and the number of pixels that actually belong to them:

$$\mathrm{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{1}$$

Based on this metric, the Dice Loss function (2) is built:

$$\mathrm{Dice\ Loss} = 1 - \frac{2\,|X \cap Y| + \mathrm{Smooth}}{|X| + |Y| + \mathrm{Smooth}} \tag{2}$$

where $X$ is the set of image pixels that belong to the crack area according to the markup, and $Y$ is the set of image pixels that belong to the crack area according to the segmentation of the model. The additional term $\mathrm{Smooth}$ is used to smooth the calculation result.
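A minimal sketch of this loss in code is given below; the value of the smoothing constant is our assumption, since the paper does not state it.

```python
# Sketch of the Dice Loss from Equation (2). The smoothing constant value
# (1.0) is our assumption; the paper does not state it.
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor,
              smooth: float = 1.0) -> torch.Tensor:
    """pred: sigmoid probabilities, target: 0/1 ground-truth mask."""
    pred = pred.view(-1)
    target = target.view(-1)
    intersection = (pred * target).sum()  # |X ∩ Y| (soft version)
    dice = (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)
    return 1.0 - dice
```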
To train the models, the Adam stochastic optimization method [
56] was used, which has demonstrated its effectiveness in many problems. The peculiarity of this method is that it combines the adaptation of the gradient descent step based on the accumulated gradients, as in Adagrad, Adadelta, RMSProp and similar methods, with the idea of momentum accumulation, as in Momentum and Nesterov Adaptive Gradient (NAG). The learning rate was set to $10^{-4}$. The training process took 300 epochs, was carried out on NVIDIA Tesla T4 accelerators and took 180 min for the first model and 220 min for the second. Each gradient descent step was performed on a batch of size 10; in the case of model 1, probabilistic augmentation was applied to each image within the batch.
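A sketch of this training configuration is given below; `model` and `dice_loss` reuse the previous sketches, and `train_dataset` is a hypothetical dataset of image-mask pairs.

```python
# Sketch of the training configuration described above: Adam optimizer with a
# learning rate of 1e-4, batches of 10 images, 300 epochs. model and dice_loss
# reuse the earlier sketches; train_dataset is a hypothetical dataset of
# image-mask pairs.
import torch
from torch.utils.data import DataLoader

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)

for epoch in range(300):
    model.train()
    for images, masks in train_loader:
        optimizer.zero_grad()
        preds = torch.sigmoid(model(images))  # per-pixel crack probabilities
        loss = dice_loss(preds, masks)        # Dice Loss from Equation (2)
        loss.backward()                       # backpropagate
        optimizer.step()                      # Adam update step
```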
Figure 4 and
Figure 5 show the training curves of the CNN for model 1 and model 2 on the training and validation sets, where the epoch number is plotted on the x-axis, and the value of the loss function on the training and validation sets, respectively, is plotted on the y-axis.
Analyzing the graphs, we can conclude that the optimization algorithm converged, as evidenced by the small changes in the loss function from epoch to epoch at the end of training on the training set. At the same time, there was no increase in the value of the loss function on the validation set, which indicates the absence of overfitting.
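For completeness, a sketch of how such training curves can be produced is given below; `train_losses` and `val_losses` are hypothetical per-epoch logs of the Dice Loss collected during the training loop sketched above.

```python
# Sketch: plotting training curves like those in Figures 4 and 5.
# train_losses and val_losses are hypothetical per-epoch Dice Loss logs
# collected during the training loop sketched above.
import matplotlib.pyplot as plt

def plot_training_curves(train_losses, val_losses):
    plt.plot(train_losses, label="training set")
    plt.plot(val_losses, label="validation set")
    plt.xlabel("epoch")        # epochs on the x-axis
    plt.ylabel("Dice Loss")    # loss value on the y-axis
    plt.legend()
    plt.show()
```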