3.1. Proposed Method
This paper proposes a neural network system for recognizing images corrupted by random-valued impulse noise. The system includes a preprocessing stage that cleans incoming images before they are passed to the recognizer.
Figure 1 shows the scheme of the proposed neural network system. The input, a color or grayscale image, entered the distorted pixel detector, which examined each pixel for the presence of an impulse. The locations of the noisy pixels were passed, in the form of a noise map G, to a cleaning system based on an adaptive median filter. A 3 × 3 window was used; if all the pixels in this small window were noisy, the filter was applied iteratively. The median of the neighbors of the distorted pixel was calculated and assigned to that pixel. The cleaned image then entered the neural network recognizer, and the result was displayed as a percentage distribution over 10 categories, since the network was trained and tested on CIFAR10.
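The cleaning stage described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `clean_with_noise_map`, the `max_passes` limit, the use of only unflagged neighbors for the median, and the choice of `median_low` (so that the result is always an existing pixel value) are assumptions that the text does not state.

```python
from statistics import median_low


def clean_with_noise_map(image, noise_map, max_passes=10):
    """Replace pixels flagged in noise_map (1 = noisy) by the median of the
    unflagged pixels in their 3x3 window; repeat while flagged pixels remain."""
    h, w = len(image), len(image[0])
    img = [row[:] for row in image]
    noisy = [row[:] for row in noise_map]
    for _ in range(max_passes):  # hypothetical safety limit on iterations
        if not any(any(row) for row in noisy):
            break
        new_img = [row[:] for row in img]
        new_noisy = [row[:] for row in noisy]
        for y in range(h):
            for x in range(w):
                if not noisy[y][x]:
                    continue
                # Assumption: only pixels not flagged as noisy contribute.
                clean = [img[j][i]
                         for j in range(max(0, y - 1), min(h, y + 2))
                         for i in range(max(0, x - 1), min(w, x + 2))
                         if not noisy[j][i]]
                if clean:
                    new_img[y][x] = median_low(clean)
                    new_noisy[y][x] = 0
                # Otherwise the whole window is noisy: wait for the next pass.
        img, noisy = new_img, new_noisy
    return img
```

A pixel whose entire window is flagged is left for a later pass, by which time some of its neighbors have been restored; this is one way to read the iterative application mentioned in the text.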
Detecting and removing random-valued impulse noise is a more difficult task than removing salt-and-pepper noise. A distinctive feature of the proposed scheme is the presence of a detector. The distorted pixel detector was specially designed, tested at various noise intensities, and showed high cleaning results.
The developed detector, which is part of the image recognition neural network complex, is a 3 × 3 filter that determines whether a pixel is noisy by comparing the brightness of the central pixel with the brightness of its neighbors. The original brightness values may be close to the impulse values, and distorted pixels may lie next to each other. Detection depends on the Euclidean distance between pixels and on their brightness difference, and calculates a weighted average value over a certain neighborhood. The similarity between any two pixels can be set as follows [45]:
where the two coefficients are the coefficient of influence of the geometric distance between pixels and the coefficient of influence of the brightness difference between pixels, the coordinate symbols denote the pixel coordinates in the local area Ω, and the remaining two parameters are the standard deviations of the coefficients. The sum θ is calculated as follows:
Assuming that T is a threshold value selected empirically and the array G is the map of noisy pixels, then from (8):
Thus, the value obtained from (8) is compared with the threshold T.
In known approaches, the threshold is usually selected experimentally, based on an empirical analysis of the detector's results. The proposed method follows the same approach. We recommend choosing the threshold according to Table 1. As the noise intensity in the image increases, the threshold value T decreases.
The proposed method used the distance between pixels and the difference in brightness values in the local window to determine the similarity between pixels. First, the pixel in question must be classified as noisy or clean. An estimate of the distance between pixels in the local window was calculated using (5). The figure represents the squared Euclidean distances between the central pixel and its neighbors located in the window. The set of pixels whose coordinates lie at a distance not greater than the specified R from the central pixel forms the local window Ω of the detector. The distance between two pixels in the Euclidean metric is determined according to:
Figure 2 shows, for a local window Ω with a radius of 4, the distance between the central and selected pixels.
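The construction of the local window can be illustrated with a short Python sketch. It assumes that R is compared with the ordinary Euclidean distance (so the squared distance is compared with R squared) and that the central pixel belongs to the window; the function name `window_offsets` is hypothetical.

```python
def window_offsets(radius):
    """Map each offset (dy, dx) inside the circular window to its squared
    Euclidean distance from the central pixel."""
    r2 = radius * radius
    return {(dy, dx): dy * dy + dx * dx
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= r2}


offsets = window_offsets(4)
print(len(offsets))      # number of pixels in the window for radius 4
print(offsets[(2, 3)])   # squared distance of the pixel at offset (2, 3)
```

Under these assumptions a window of radius 4 contains 49 pixels, and corner offsets such as (3, 3) are excluded because their squared distance, 18, exceeds 16.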
The Euclidean metric was used to make the result of cleaning the image from noise easier to perceive visually. This metric is well suited to calculating filter masks for cleaning images from impulse noise using adaptive median filtering [46]. By using pixel values from the previous and next frames, the same metric also made it possible to create a mask for cleaning video data from random-valued impulse noise [47].
The absolute difference between the processed pixel and the other pixels in the local window Ω is used to estimate the difference in brightness values between pixels:
where the normalizing parameter is the bit depth of the image pixels. The smaller the value of this function, the greater the brightness difference. Dividing by the bit depth keeps the function non-negative. The logarithmic function indicates the most significant bit of the difference between pixels, which suits the digital nature of the data and aligns with the visual perception of the human eye [48].
The second step is to sort the array of these values in ascending order and sum the leading elements of the sorted array, where the number of summed elements is set relative to the number of elements in the local window Ω:
The similarity score between pixels W is calculated from (5) and (12):
To determine the presence of pixel distortion, a threshold is introduced:
If the array G is a map of noisy pixels and an element of G is 1, the corresponding pixel in the image is noisy. The size of the local window Ω depends on the noise level. Experiments carried out in this work showed that it is most effective to increase the size of the cleaning mask as the intensity of the impulse noise increases.
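The sort, sum and threshold steps can be sketched as follows. This is not the detector defined in the paper: it swaps in the simpler rank-ordered absolute differences (ROAD) statistic, which uses plain absolute brightness differences in a 3 × 3 window instead of the logarithmic and distance-weighted terms. The function name `detect_noisy_pixels`, the number of summed elements `m`, the threshold value and the direction of the comparison are all assumptions chosen for illustration.

```python
def detect_noisy_pixels(image, m=4, threshold=80):
    """Return a noise map G (1 = noisy) from a rank-ordered statistic.

    For every pixel, the absolute differences to its 3x3 neighbours are
    sorted in ascending order and the m smallest are summed. A pixel is
    flagged when that sum exceeds the threshold (assumed direction)."""
    h, w = len(image), len(image[0])
    noise_map = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            diffs = sorted(
                abs(image[j][i] - image[y][x])
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
                if (j, i) != (y, x)
            )
            # Border pixels have fewer neighbours; the slice then sums them all.
            if sum(diffs[:m]) > threshold:
                noise_map[y][x] = 1
    return noise_map
```

Summing only the smallest differences is what makes the statistic robust: a clean pixel next to a single impulse still has several close neighbors and is not flagged, whereas an impulse differs strongly from almost all of its neighbors.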
For the recognition task, a neural network with the pretrained VGG16 architecture was used [49]. The 16-layer model, pretrained on the ImageNet data set, showed high recognition accuracy on CIFAR10, which has far less training data than ImageNet (60,000 images versus 14,000,000 images). The model improves on the earlier convolutional network AlexNet [50]: the large filters (sizes 11 and 5 in the first and second convolutional layers, respectively) are replaced by several 3 × 3 filters applied one after the other. Despite significant drawbacks, such as the heavy weight of the architecture and slow training, the network is easy to implement and surpasses many other models in accuracy.
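The effect of replacing one large filter by a stack of 3 × 3 filters can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions and, for the weight count, the same number of input and output channels in every layer; the function names and the channel count of 64 are illustrative only.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convolutions applied in sequence."""
    return num_layers * (kernel - 1) + 1


def conv_weights(kernel, channels):
    """Weights of one kernel x kernel convolution with `channels` inputs
    and `channels` outputs (biases ignored)."""
    return kernel * kernel * channels * channels


print(stacked_receptive_field(2))   # two 3x3 layers cover the area of a 5x5 filter
print(stacked_receptive_field(5))   # five 3x3 layers cover the area of an 11x11 filter
print(2 * conv_weights(3, 64), conv_weights(5, 64))
```

Under these assumptions the stack covers the same area as the large filter with fewer weights and with an extra nonlinearity after each layer.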
The following experiment evaluates how methods for cleaning images from random-valued impulse noise affect the accuracy with which a neural network recognizes objects in the image.
3.2. Experiment
Each RGB channel of the noise-exposed image was processed separately from the other two, at a fixed noise intensity of 1%, 10%, 25%, or 50%. These percentages give the share of pixels in each image that were corrupted, at randomly distributed positions. The experiment included recognition, using a pretrained network, of individual noisy and clean images as well as of the entire test database.
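The noise model used in the experiment can be illustrated with a small generator. This is a sketch under assumptions: the corrupted positions are drawn without replacement, each corrupted pixel receives a value drawn uniformly from the full 8-bit range, and the function name `add_random_valued_impulse_noise` and its parameters are hypothetical.

```python
import random


def add_random_valued_impulse_noise(channel, intensity, rng=None, max_value=255):
    """Return a copy of one image channel in which a share `intensity`
    (0.0 to 1.0) of the pixels is replaced by uniformly random values."""
    rng = rng or random.Random()
    h, w = len(channel), len(channel[0])
    positions = [(y, x) for y in range(h) for x in range(w)]
    count = round(intensity * h * w)
    noisy = [row[:] for row in channel]
    for y, x in rng.sample(positions, count):
        # The random value may coincide with the original one, so the number
        # of visibly changed pixels can be slightly below `count`.
        noisy[y][x] = rng.randint(0, max_value)
    return noisy
```

Calling the function once per R, G and B channel with the same generator corrupts a different random set of positions in each channel.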
Table 2 shows the accuracies of the pretrained architectures, which justify the choice of the basis for the image recognition program. Each architecture was trained for 20 epochs, and the highest accuracy value was chosen. The table shows that the VGG16 architecture achieved the best training result. After 13 training epochs, the network recognized 8799 out of 10,000 test images, reaching a maximum accuracy of 87.99%, and the error at this epoch was 0.5190 (Figure 3). The learning rate was 0.01, the training batch contained 64 images, and the test batch contained 1000 images. A 3 × 3 convolution filter was used.
To solve the recognition problem, a program written in the Jupyter Notebook environment with a Conda kernel was used. First, the necessary libraries and the databases on which the network is to be tested, noisy or cleaned, must be loaded. When the VGG16 architecture was loaded, the last linear layer was changed to match the 10 classification classes, and the weights of the network trained on CIFAR10 were imported. The network was trained on an HP Laptop 15s-fq1xxx with an Intel(R) Core(TM) i5-1035G1 CPU @ 1.00 GHz 1.20 GHz, 8.00 GB RAM and a 64-bit operating system.
Full-color 32 × 32 images are fed to the input of the first convolutional layer. The images are then processed by several convolutional layers with 3 × 3 receptive fields; this is the smallest filter size capable of capturing orientations in the image. The convolutional layers are followed by three fully connected layers: two with 4096 channels each and one with 10 channels, matching the number of output classes of the dataset. A softmax layer performs the classification. All hidden layers use ReLU. To reduce memory consumption and training time, the network contains no normalization layer. When testing on a custom-assembled image base, a file with class names must also be loaded so that names are displayed instead of class numbers. Testing is then run on the loaded image database, and the result is given as a percentage using the confusion matrix.
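How a 32 × 32 input shrinks on its way through the convolutional part can be traced with a few lines of Python. The sketch assumes the standard VGG16 layout (five blocks of 3 × 3 convolutions with padding 1, each followed by 2 × 2 max pooling with stride 2); the text does not state the padding or pooling settings, nor how the resulting features are connected to the 4096-unit layers, so the names `VGG16_BLOCKS` and `trace_feature_maps` and these settings are assumptions.

```python
# (number of 3x3 convolutional layers, output channels) per block,
# following the standard 16-layer VGG configuration.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_feature_maps(size=32):
    """Return the (channels, height, width) shape after each block."""
    shapes = []
    for _convs, channels in VGG16_BLOCKS:
        # 3x3 convolutions with padding 1 leave the spatial size unchanged.
        size //= 2  # 2x2 max pooling with stride 2 halves it
        shapes.append((channels, size, size))
    return shapes


for shape in trace_feature_maps(32):
    print(shape)

# 13 convolutional layers plus 3 fully connected layers give 16 weight layers.
print(sum(convs for convs, _ in VGG16_BLOCKS) + 3)
```

Under these assumptions a CIFAR10 image is reduced to a single spatial position with 512 channels before the fully connected layers.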
Table 3 shows the recognition results for the CIFAR10 test bases. The proposed method showed good results in image cleaning. It should be noted that at 1% noise, the uncleaned images are recognized with higher accuracy than images cleaned by the known methods, which is explained by the greater image blur that these methods introduce when cleaning noise.
Various criteria and metrics were used to assess the recognition accuracy of the image database. The F1 score is calculated as:

$$F1 = \frac{2\,TP}{2\,TP + FP + FN},$$

where TP is the true-positive result, FP is the false-positive result, and FN is the false-negative recognition result. F1 lies in the range [0, 1], where 1 corresponds to ideal classification.

The Matthews Correlation Coefficient (MCC) lies in the range [−1, 1] and has the form:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TN is the true-negative recognition result.
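Both metrics can be computed directly from the entries of a confusion matrix. The sketch below uses the standard definitions of the F1 score and the Matthews correlation coefficient and treats one class against all others; how the per-class values are aggregated over the 10 CIFAR10 classes is not stated in the text, so the helper `one_vs_rest_counts` and the small example matrix are illustrative assumptions.

```python
from math import sqrt


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def one_vs_rest_counts(confusion, k):
    """Counts for class k, where confusion[i][j] is the number of samples
    of true class i that were predicted as class j."""
    total = sum(map(sum, confusion))
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


confusion = [[8, 2], [1, 9]]  # made-up two-class example
tp, tn, fp, fn = one_vs_rest_counts(confusion, 0)
print(f1_score(tp, fp, fn), mcc(tp, tn, fp, fn))
```

Unlike F1, the MCC also takes the true negatives into account, which is why the two values can differ noticeably for the same matrix.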
Confusion matrices for the recognition of the CIFAR10 base corrupted by 25% random-valued impulse noise are shown in Figure 4. The matrix corresponding to the noisy image base shows that the neural network could not reliably recognize objects in the images. This confirms the importance of image preprocessing before the images are passed to the neural network classifier.
Table 4, Table 5, Table 6 and Table 7 present the values of the F1 score and the Matthews correlation coefficient obtained by cleaning images from random-valued impulse noise of different intensities. The results reflect the higher quality of image cleaning by the proposed method: the recognition accuracy of the cleaned base is higher by 56.94% compared to the noisy images, and higher by 6.38%, 1.26%, and 1.17% compared to the methods from [25,27,28], respectively. In some areas that use computer vision for image classification, accuracy gains of 1.26% and 1.17% can be of great importance.
We chose the original “Cat_Maine_Coon” image (Figure 5a), reduced it to 32 × 32, and corrupted it with random-valued impulse noise of various intensities. Figure 5, Figure 6, Figure 7 and Figure 8 show the recognition results for the images corrupted by impulse noise and cleaned by the known methods [25,27,28] and by the proposed one. As noted earlier, at 1% noise the recognition accuracy did not differ much, from which we conclude that it does not always make sense to remove noise of such intensity from the image. A noise intensity of 10% already spoiled the image significantly: the neural network showed the class “deer” for this image even after cleaning with the known methods. At 25% noise, the neural network showed similar results, except that for the proposed method the predicted class exceeded the next most likely category by only 17.1%. Noise with an intensity of 50% or more could not be removed completely enough for high-quality recognition, since none of the cleaning methods could cope with such an intensity; however, such dense noise very rarely appears when transmitting images. Overall, the neural network recognized the image cleaned by the proposed method better by 13.62% or more.