Texture identification in images of different materials is a key problem in computer vision. A texture can be defined as a basic (molecular) pattern arrangement that repeats itself in a structured way on object surfaces, creating recognizable patterns in images. Each material therefore produces inherent and specific texture patterns depending on its molecular structure, allowing the discrimination of different materials through the images they produce. Typical textured materials are, for example, wood fibers, granite, plastic or metal grids, textiles, etc.
In industrial production and manufacturing processes involving textured materials, defects in the final product become a serious handicap, generating significant economic losses due to the drop in product quality. In this regard, the design of mechanisms to address this problem, i.e., the identification of defects expressed as distortions, ruptures or even the disappearance of patterns, is critical. These defects can be captured in images, and automatic computer vision-based techniques appear as promising approaches. An important problem in defect detection based on computer vision is that defects can appear with arbitrary shapes, with irregular and unstructured patterns and without specific localizations. Therefore, there has recently been great interest in the development of algorithms that focus on the detection of “outliers” in the texture under inspection; in other words, one-class classification, in which any defect that does not fit the patterns learned for a specific texture is classified as not belonging to the class of the correct texture.
This research work proposes a solution for defect detection in textures that combines two machine learning-based approaches as classifiers: (a) convolutional autoencoders (CA) in deep learning, and (b) one-class support vector machines (SVM). Both methods are trained using only fault (defect) free textured images for each type of texture, allowing the automatic labeling of samples for the SVMs. With this proposal, only correct images are used for training both classifiers, achieving an important advantage over existing methods, as it avoids the need to select defective textures, which are always difficult to define due to the unpredictable appearance of defects.
The convolutional autoencoder is used for two fundamental operations: firstly, obtaining the reconstruction error of a given textured image, used to identify defects present on surfaces and embedded in the image; and secondly, compressing the texture information in its latent layer. This compressed information is later used to train the SVM.
In this regard, the main contribution of this research work is based on two image processing streams established through the CA: (1) the first processes the incoming image from input to output, producing a reconstructed image, from which a measurement of correct or defective image is obtained; (2) the second processes the same image through the encoder, up to the latent layer, where it is compressed and mapped to a latent vector, which is the input to the SVM to produce a classification measurement. The use of only the autoencoder is justified under the assumption that early convolutional layers extract general, low-level features such as the patterns that characterize the textures. Both measurements are conveniently combined, making an additional contribution, i.e., a hybrid approach. As explained above, only defect-free samples are used for training both algorithms, and hence all samples in the SVM training stage are automatically labeled as belonging to the defect-free class.
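The decision logic of the two streams can be sketched as follows. This is a minimal illustration in Python with NumPy, under loose assumptions: `reconstruct`, `encode`, `svm_decision` and the threshold `tau` are hypothetical placeholders standing in for the trained autoencoder, its encoder, the one-class SVM decision function and a threshold calibrated on defect-free samples; the combination rule shown (flag a defect if either stream does) is just one simple option, not the exact hybridization developed later in the paper.

```python
import numpy as np

def hybrid_decision(x, reconstruct, encode, svm_decision, tau):
    """Combine the two CA processing streams into one defect decision.

    reconstruct  : full pass through the autoencoder (input -> output).
    encode       : encoder only (input -> latent vector).
    svm_decision : one-class SVM decision function on latent vectors
                   (>= 0 taken as defect-free, < 0 as defective).
    tau          : reconstruction-error threshold calibrated on
                   defect-free samples only.
    """
    # Stream 1: reconstruction error of the full pass.
    err = np.mean((x - reconstruct(x)) ** 2)
    # Stream 2: latent vector scored by the one-class SVM.
    score = svm_decision(encode(x))
    # One simple combination rule: flag a defect if either stream does.
    return bool(err > tau or score < 0.0)

# Toy usage: an identity "autoencoder" reconstructs perfectly -> no defect.
x = np.ones((8, 8))
flagged = hybrid_decision(x, lambda im: im, lambda im: im.ravel(),
                          lambda z: 1.0, tau=0.1)
```

The point of the sketch is the split: the full pass yields a scalar error, while the encoder-only pass yields the vector the SVM operates on.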
1.1. Methods Guided by Structured Patterns
In the study of textures with the aim of detecting defects, it is important that the proposed solution is invariant to the orientation and scaling of the texture [3] to ensure the robustness of the results.
Within classical methods of texture analysis, three different categories can be distinguished depending on the approach used: statistical, structural, and model-based. Statistical methods were the first to be developed. In 1981, Davis [4] defined a tool he called a polarogram, which allows obtaining invariant characteristics for textures. The polarogram is defined as a graph in polar coordinates that represents statistics of a texture with respect to its orientation. Results for textures with 16 different orientations are reported.
Like Davis, Mayorga and Ludeman [5] also used polar frames, thus achieving rotation invariance, but unlike Davis, they used texture edge data obtained by derivation in certain directions. This method suffered from variability depending on the procedure used to obtain the edges of the texture, with a high dependency on the edges defining the texture.
In the study by Pietikainen [6], the texture image is represented based on center-symmetric autocorrelation, the local binary pattern and gray-level density. The characteristics obtained were mostly invariant to rotation. The main problem of this method is that it is only efficient for highly ordered textures, being ineffective for unstructured textures.
Local binary patterns (LBP) [7] is one of the most classical methods for texture classification in computer vision. A binary histogram is obtained by considering the neighborhood of the pixel under study, yielding a characteristic vector after normalization. This method is combined with the Histogram of Oriented Gradients (HOG) [8] to increase its performance. The invariant moments of an image also allow the recognition of its patterns [9]. For unstructured texture patterns these approaches are not appropriate, because no HOG patterns can be derived.
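As a concrete illustration of the LBP idea, the following NumPy sketch computes an 8-neighbour binary code per pixel and a normalized 256-bin histogram as the feature vector. It is a simplified version of the classical operator (no interpolation, no rotation-invariant mapping), not a replica of any cited implementation.

```python
import numpy as np

def lbp_histogram(img):
    """Simplified 8-neighbour local binary pattern feature vector."""
    c = img[1:-1, 1:-1]  # centre pixels (image borders are skipped)
    rows, cols = img.shape
    # 8 neighbour offsets, one bit each, clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(c.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the centre pixels.
        n = img[1 + dy:rows - 1 + dy, 1 + dx:cols - 1 + dx]
        codes += (n >= c).astype(np.int64) << bit  # set bit if neighbour >= centre
    # Normalised histogram of the 256 possible codes is the feature vector.
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Toy usage: a single bright centre pixel, so its code is 0 (all neighbours lower).
img = np.array([[1, 2, 1], [2, 5, 2], [1, 2, 1]], dtype=float)
h = lbp_histogram(img)
```

The resulting 256-bin histogram is the kind of characteristic vector that is then fed to a classifier, or concatenated with HOG features as mentioned above.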
Of all existing moments, Teh and Chin [10] show in their study that Zernike’s moments give the best results when applied to textures. The orthogonality of these moments provides invariance against rotation. Wang and Healey [11] used these moments in multispectral functions and established relationships to obtain invariance under different illumination conditions. However, this is not the specific problem addressed in defect detection in textures.
Another alternative method is harmonic expansion. This method obtains the characteristics of textures by decomposing their harmonic components into polar form and projecting them, obtaining rotation-invariant coefficients with which the texture pattern can be defined. The implementation carried out in 1985 by Alapati and Sanderson [12] includes these types of characteristics, but the method developed was only rotationally invariant, i.e., not suitable for defects, which do not exhibit rotational patterns.
In 1992, Tsatsanis and Giannakis [13] used high-order descriptors obtained through cumulants and multiple correlations. A major disadvantage of this method is its high computational cost, and the high-order descriptors may lose too much information for correct texture discrimination.
Another group of alternatives is model-based methods, where the texture image is modeled as a probabilistic distribution or a linear combination of a set of basic functions. The coefficients of these models are used to characterize the textured image. The key issue of these methods is how to estimate the coefficients of these models and how to choose the correct model for the selected texture.
Bovik et al. [14] used Gabor filters, obtaining results of 90% accuracy by combining these filter sets with elementary transformations of invariant textures. Studies on Gabor filters, such as the one carried out by Randen [15], indicate that this type of method outperforms others in terms of complexity and error rate. However, texture defects do not display the regular patterns that Gabor filters are designed to capture.
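To make the notion of a Gabor filter concrete, the following sketch builds the real part of a Gabor kernel directly from its standard definition: a Gaussian envelope modulating a cosine carrier oriented at angle theta. The parameter values (size, wavelength, sigma, number of orientations) are arbitrary choices for the illustration, not those used in the cited studies.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0):
    """Real part of a Gabor filter: a Gaussian envelope times an
    oriented cosine carrier (theta in radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the carrier runs along direction theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# A small filter bank with 4 orientations, as typically used to extract
# texture features by convolving each kernel with the image.
bank = [gabor_kernel(15, wavelength=6.0, theta=t, sigma=3.0)
        for t in np.linspace(0.0, np.pi, 4, endpoint=False)]
```

Texture descriptors are then usually derived from the responses of the image to each kernel in the bank, which is precisely what presupposes a regular, oriented pattern in the texture.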
Cohen [16] used Markov models to model textures. Other implementations with this model were carried out by Chen and Kundu [17], where texture descriptors were obtained by using multichannel sub-band decomposition and hidden Markov models.
Another model-based method widely used in the literature is the simultaneous autoregressive (SAR) model. This model is rotationally invariant, and Kashyap and Khotanzad [18] developed a circular autoregressive model (CSAR) considering circular neighborhoods. Mao and Jain [19] improved the previous approach with a rotation-invariant model (RISAR), in which the weighted gray values are separated into several circles, so that after applying a rotation they remain approximately equal. However, some problems arise, such as how to choose an appropriate neighborhood size or how to select a window size in which the texture is considered homogeneous.
Finally, in structural methods, the full texture pattern is divided into texture elements arranged according to placement rules, so that structural properties can be derived.
Goyal [20] used the perimeter and compactness of the basic structural elements of a given texture, assuming invariance. He converted the original histogram of the image into an invariant one, taking into account the number of structural elements existing in the case under study.
Another algorithm based on structural methods is morphological decomposition. Lam and Lin [21] use invariant iterative morphological decomposition (IMD) for classification. By means of this method, the texture is decomposed into a set of composite images, and some statistical characteristics (mean, variance, normalized variance and gradient) are obtained for each component.
Finally, Eichmann and Kasparis [22] base their analysis on the Hough transform, considering image rows as points in the transform plane. Since the Hough transform operates on binary images, they use the Radon transform to implement it for non-binary textures. This method can, therefore, be understood as a topological texture descriptor.
As mentioned in the title of this section, all the above methods are based on structured patterns defining the texture. Thus, when trying to identify defects, characterized by unstructured textured patterns without geometric relations, as expressed before, the above techniques are difficult to apply. They would require establishing the structural and geometric relationships between correct and defective patterns to study the differences, which is impractical. Consequently, the most desirable option is to apply global strategies without the need to define specific strategies for each type of textured pattern, and this is what can be achieved with our proposal based on autoencoders.
1.2. Deep Learning Algorithms for Anomalies Based on Texture Analysis
In order to evaluate the performance of deep learning-based approaches, several datasets are available, such as MNIST [23], ImageNet [24], COCO [25], PASCAL VOC [26] or CIFAR-10 [27]. When trying to detect anomalies, there are two common approaches using any of these datasets as a basis. The first consists of labeling some of the classes present in the dataset as anomalies, so that the proposed method is trained with the classes labeled as normal and its detection capacity is evaluated against the anomalies [28,29,30]. The second alternative is to extend a chosen dataset with images containing anomalies, and then train the method to detect them [31]. The main drawback of these approaches is the need to define and supply images with anomalies for training.
Within deep learning, different methods have been proposed for texture anomaly detection [32], but two of them are mainly used: generative adversarial networks (GAN) and convolutional autoencoders.
Schlegl [31] proposed to model the training data using a generative adversarial network (GAN) trained only with defect-free images. These networks work with a generator and a discriminator. A latent sample that reproduces the input image is sought, so as to deceive the discriminator. The anomaly present in the texture is obtained by comparing the input image and the generated image pixel by pixel. This approach is close to the one proposed in this paper with regard to image processing up to the reconstruction, but it does not consider the identification of anomalies under the assumption that defects produce high variability in image intensities in the early layers of the network, this effect being diluted across the model.
CA [32] are also used to detect anomalies. Generally, the anomaly is detected by considering the reconstruction error obtained by the network when it tries to reconstruct an anomaly not seen during training. To obtain information about the reconstruction error at pixel level, it is necessary to go through the image pixel by pixel. Bergmann [33] points out the disadvantages of this way of comparing the image with the original and proposes incorporating the spatial information of local regions using structural similarity [34], with the aim of improving the segmentation results. There are also several extensions of CA, such as variational autoencoders [35], which have been used by Baur et al. [36] for the segmentation of brain MRI anomalies. In that work, Baur et al. use different CA architectures to reconstruct the original brain images used in their study, concluding that architectures with dense bottlenecks cannot reconstruct these types of images with the required efficiency, making them of no practical use for anomaly detection. This leads us to expect the same in our case for defect detection.
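The pixel-wise reconstruction-error comparison described above can be sketched as follows; the images and the threshold below are illustrative placeholders, with the threshold normally calibrated on defect-free validation images.

```python
import numpy as np

def residual_map(original, reconstructed, threshold):
    """Per-pixel squared reconstruction error plus a binary anomaly mask.

    A CA trained only on defect-free textures reconstructs normal regions
    well, so large residuals mark candidate defect pixels.
    """
    residual = (np.asarray(original, float) - np.asarray(reconstructed, float)) ** 2
    return residual, residual > threshold

# Toy usage: a flat texture with one anomalous pixel the CA cannot reproduce.
original = np.full((4, 4), 0.5)
original[2, 2] = 1.0                  # simulated defect
reconstructed = np.full((4, 4), 0.5)  # CA reproduces only the learned texture
res, mask = residual_map(original, reconstructed, threshold=0.01)
```

Bergmann's observation is precisely that this per-pixel comparison ignores local structure, which is why SSIM-based comparisons over local regions improve segmentation.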
In [37], Minhas and Zelek also use autoencoders to create a semisupervised algorithm for anomaly detection, focusing the study on two different datasets, the first a synthetic one and the second centered on railway images. They conclude that, while the method performs well on the synthetic dataset, the real dataset presents noisy results. In addition, this approach has been applied to the detection of only one type of defect in the dataset, so it is not applicable to our case, which presents five different types of defects for five different textures.
Another example of the practical use of convolutional neural networks for anomaly detection can be found in the work of Staar et al. [38]. In this case, they use a specific network architecture, called a triplet network, to detect surface defects. The main drawback of this work is that it is a supervised method that needs precise labeling of defects in the dataset to provide effective results. This requirement is not easily fulfilled in industrial products, because the defects considered in our approach have unpredictable and irregular forms.
Finally, Maldonado et al. [39] focused their work on the capacity of transferring parameters trained in an unsupervised domain to other problems where the dataset is supervised. The considerations of this paper are valuable in fault detection processes, where in many cases the error cannot be categorized correctly due to the lack of available samples with defects.
In short, when compared to these previous works, our proposal uses a real industrial dataset, with up to five different textures coming from as many different products, which allows us to design a procedure applicable to a great variety of textures in inspection problems. Additionally, we do not need any type of labeled dataset, and by combining two different approaches for anomaly detection, namely CA and one-class SVM, the proposed solution outperforms the results obtained using CA only, which are aligned with the approaches mentioned above.
Erfani et al. [40] deal with anomaly detection by combining a deep belief network that extracts features, which are used as inputs for a one-class SVM classifier. The features are obtained in the compressed latent space. This is a combined approach where anomalies are detected by the SVM and the network provides high-dimensional feature vectors. Our proposal goes further: it also exploits this idea, but in addition uses the network, in this case the autoencoder, to classify textures as defective or defect-free, and then combines both results in a final decision, in what we have called hybridization, making a major difference from that work. Additionally, the experiments reported in that work indicate that nonlinear kernels in the SVM can be replaced by linear ones because of the high dimensionality of the generated vectors. We exploit this fact, verifying the same behavior in our experiments, which justifies the use of a linear kernel in the proposed SVM.
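The one-class SVM stage with a linear kernel can be sketched with scikit-learn; the latent vectors below are random stand-ins for the autoencoder's latent representations, and the value of `nu` is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-ins for latent vectors of defect-free textures: a tight cluster
# in a moderately high-dimensional space.
latent_ok = rng.normal(loc=5.0, scale=0.1, size=(200, 32))

# Train only on defect-free samples; they all implicitly belong to the
# single "normal" class, so no manual labelling is required.
clf = OneClassSVM(kernel="linear", nu=0.05).fit(latent_ok)

# A latent vector far from the training cluster is flagged as anomalous
# (predict returns -1 for outliers, +1 for inliers).
latent_defect = np.zeros((1, 32))
verdicts = clf.predict(latent_defect)
```

The `nu` parameter bounds the fraction of training samples allowed outside the learned region; since training data is defect-free by construction, it acts as a tolerance rather than a defect rate.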
Anomaly detection in images has also been considered by Beggel et al. [41]. They use the latent space of an autoencoder to model a likelihood for discriminating textures with and without anomalies, based on what they call density estimation, i.e., determining the degree of occurrence of data in the vector space. High concentrations indicate fault-free textures, and deviations from these concentrations imply anomalies. This work, together with the previous ones, confirms the importance of the latent space in the detection of anomalies, which is also exploited in our approach.
This paper is organized as follows: Section 2 presents in a concise manner the main contributions of this paper over the state of the art. Section 3 presents the theoretical design of the proposed solution, focusing on the design principles of the autoencoder and the SVMs, and establishing the hybridization method to optimally combine the complementary outputs of both approaches. Section 4 presents the public dataset used for performance evaluation of the algorithms and the results obtained, giving details about the different classes it contains, the types of errors and the number of samples of each class and error. This section also details the training procedures of both algorithms and the results obtained for the use case presented in this research work. Finally, Section 5 summarizes the conclusions extracted from this work, justifying the suitability of the proposed solution against other alternatives.